Data Orchestrator

Place the right data next to the right compute.

Limestone Data Orchestrator helps AI infrastructure teams move datasets, model weights, containers, artifacts, and logs across fragmented storage and GPU compute environments with high throughput, reliable execution, and clear monitoring.

Data Orchestrator jobs

Control plane view (status: healthy)

hydrate-model-weights: RUNNING, 82% complete, 14.2 GB/s
return-training-artifacts: RUNNING, 47% complete, 8.7 GB/s
cleanup-ephemeral-cache: PENDING, 0% complete, -

Totals: 12.8 PB found, 9.4 PB copied, 37 errors

Problem

Data and compute no longer live in the same place.

AI workloads run across hyperscalers, private clusters, regional capacity, and specialized GPU environments. The data they need is spread across object stores, file systems, databases, data lakes, warehouses, model registries, and local cluster storage.

Teams stitch this together with scripts, cloud-specific tools, and fragile pipelines. When transfers are slow or fail silently, accelerators sit idle and platform engineers end up debugging movement instead of improving infrastructure.

Solution

Durable movement jobs, executed close to the data path.

Limestone provides Data Orchestrator as an orchestration layer for defining, tracking, and operating data movement jobs. The control plane manages state, progress, and errors while deployable workers execute near storage and compute.

Data Orchestrator makes data movement fast, scalable, monitored, reliable, secure, and convenient across file and object storage, private clusters, and public cloud compute. Customers keep control over data residency and execution while reducing the idle time and operational friction around expensive compute.

Product

Data Orchestrator is built for speed, reliability, and monitoring from the first job.

Data Orchestrator makes data movement fast, reliable, observable, and easier to operate as AI infrastructure becomes more distributed.

It is designed for the workflows between storage and compute: hydrating clusters, returning workload outputs, placing model artifacts near inference capacity, and cleaning up ephemeral environments.

High-throughput data transfer

Targets 95%+ network saturation through control and data plane decoupling, distributed parallel execution, and efficient data handling.
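As a rough illustration of what a saturation target means in practice, the sketch below expresses an observed transfer rate as a fraction of link capacity. The function name and the 120 Gb/s aggregate link are assumptions chosen for the example. The 14.2 GB/s figure is the sample throughput shown on this page, not a measured benchmark.

```python
def link_saturation(throughput_gb_per_s: float, link_gbit_per_s: float) -> float:
    """Return observed throughput as a fraction of nominal link capacity.

    throughput_gb_per_s: payload rate in decimal gigabytes per second.
    link_gbit_per_s: link capacity in gigabits per second (an assumed input).
    """
    if link_gbit_per_s <= 0:
        raise ValueError("link capacity must be positive")
    # 1 byte = 8 bits, so GB/s * 8 gives Gb/s.
    return (throughput_gb_per_s * 8) / link_gbit_per_s


# Hypothetical example: 14.2 GB/s over an assumed 120 Gb/s aggregate link.
saturation = link_saturation(14.2, 120.0)
print(f"{saturation:.1%} of link capacity")
```

Protocol overhead is ignored here, so real payload saturation on a given link would come out somewhat lower than this simple ratio suggests.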

Durable job execution

Accepted jobs are persisted, decomposed into retry-safe work, and tracked through monotonic state transitions.
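One way to picture monotonic state transitions is a small state machine that only ever moves forward. The sketch below is illustrative and is not Limestone's implementation. PENDING and RUNNING appear on this page, while SUCCEEDED, FAILED, and the `advance` helper are hypothetical names.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"  # hypothetical terminal state
    FAILED = "FAILED"        # hypothetical terminal state


# Rank defines the only allowed direction of travel; terminal states share a rank.
_RANK = {
    JobState.PENDING: 0,
    JobState.RUNNING: 1,
    JobState.SUCCEEDED: 2,
    JobState.FAILED: 2,
}


def advance(current: JobState, new: JobState) -> JobState:
    """Apply a state update, allowing replays but never moving backwards."""
    if new is current:
        # A retried update that repeats the current state is a harmless no-op.
        return current
    if _RANK[new] <= _RANK[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new


state = JobState.PENDING
state = advance(state, JobState.RUNNING)
state = advance(state, JobState.RUNNING)  # duplicate delivery, ignored
print(state.value)
```

Treating a repeated update as a no-op is what makes retries safe in this sketch: a worker can resend the same report without corrupting job state.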

Visible operations

Progress counters, job state, manifests, and structured errors make long-running movement easy to monitor and reason about.
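To make the idea of progress counters concrete, here is a minimal sketch of a snapshot that derives percent complete and a naive time-to-finish estimate. The class and field names are hypothetical, a petabyte is assumed to be 10^15 bytes, and the figures reuse the sample totals shown on this page.

```python
from dataclasses import dataclass

PB = 10**15  # assumption: decimal petabytes
GB = 10**9   # assumption: decimal gigabytes


@dataclass(frozen=True)
class ProgressSnapshot:
    bytes_found: int
    bytes_copied: int
    errors: int

    @property
    def fraction_copied(self) -> float:
        """Share of discovered bytes already copied (0.0 when nothing is found)."""
        if self.bytes_found == 0:
            return 0.0
        return self.bytes_copied / self.bytes_found

    def eta_seconds(self, throughput_bytes_per_s: float) -> float:
        """Naive estimate that assumes throughput stays constant."""
        if throughput_bytes_per_s <= 0:
            raise ValueError("throughput must be positive")
        remaining = max(self.bytes_found - self.bytes_copied, 0)
        return remaining / throughput_bytes_per_s


# Sample totals from this page: 12.8 PB found, 9.4 PB copied, 37 errors.
snap = ProgressSnapshot(bytes_found=12_800 * 10**12,
                        bytes_copied=9_400 * 10**12,
                        errors=37)
print(f"{snap.fraction_copied:.1%} copied, {snap.errors} errors")
print(f"about {snap.eta_seconds(14.2 * GB) / 3600:.1f} hours left at 14.2 GB/s")
```

The estimate is only as good as its constant-throughput assumption, and the found total can still grow while a scan is in progress.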

A job model for infrastructure teams

Copy, scan, and delete workflows are submitted as durable jobs with explicit state, counters, manifests, and structured errors. CLI, Python SDK, and web console surfaces can share one operating model.

do job copy \
  --origin s3://training-data/frontier-v4 \
  --destination cluster://h100-pool-a/datasets \
  --follow-symlinks=false

job_id: job_01HX9B7M6R
state: RUNNING
throughput: 14.2 GB/s
files_copied: 18,402,117
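The same copy job can be pictured as plain data that any surface could render. The sketch below is not the Limestone Python SDK. `CopyJobSpec` and `to_cli_args` are hypothetical names, and the only flags used are the ones shown in the command above.

```python
import shlex
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CopyJobSpec:
    """Hypothetical, surface-neutral description of a copy job."""
    origin: str
    destination: str
    follow_symlinks: bool = False

    def to_cli_args(self) -> List[str]:
        # Mirrors the flags in the example command; no other flags are assumed.
        return [
            "do", "job", "copy",
            "--origin", self.origin,
            "--destination", self.destination,
            f"--follow-symlinks={str(self.follow_symlinks).lower()}",
        ]


spec = CopyJobSpec(
    origin="s3://training-data/frontier-v4",
    destination="cluster://h100-pool-a/datasets",
)
# shlex.join quotes arguments only where the shell would need it.
print(shlex.join(spec.to_cli_args()))
```

Keeping the spec immutable means a job's definition cannot change after it is built, which fits the idea of a durable, accepted job.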

Use cases

Data placement workflows for modern AI systems.

01

GPU cluster hydration and return path

Move datasets, model weights, containers, and artifacts into compute before training or inference, then return generated artifacts, logs, model outputs, and completed run data to durable storage.

02

Inference scale-up placement

Place model weights and supporting artifacts near newly available GPU capacity when serving demand increases.

03

Controlled cleanup workflows

Clean up staged data after ephemeral cluster use with durable job state, structured errors, and auditable outcomes.

04

Agent workflow artifact persistence

Stream or bulk-move files created by agentic workflows from local or ephemeral storage to durable storage.

Results

Faster workloads, clearer operations, less idle compute.

Reduced operational burden
Reduced end-to-end workload completion time
Reduced accelerator idle time
Improved AI researcher and engineer productivity

FAQ

Designed for infrastructure reality.

Where does Limestone run?

Limestone is designed around a control plane for defining and observing jobs, with data plane workers that execute near the relevant storage and compute environments.

Does Limestone sit in the data path?

The intended architecture keeps customer data movement close to customer infrastructure. Data plane workers access storage and credentials directly, so customers retain control over execution and security boundaries.

What storage systems are planned?

The product direction includes object storage such as S3, GCS, and Azure Blob; file systems such as NFS, Lustre, Weka, Vast, CephFS, and related backends; and future support for data lakes and data warehouses.

How is pricing structured?

Managed service pricing is listed at $0.20 per GB transferred plus $0.50 per job. Enterprise pricing is available for support, SLAs, private deployment, enhanced observability, and on-premise or self-managed data plane needs.
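As a worked example of the listed managed service rates, the sketch below estimates the cost of a transfer. It assumes decimal gigabytes, rounding to the cent, and no minimums, tiers, or discounts, none of which are stated on this page. The function name is hypothetical.

```python
from decimal import Decimal

PRICE_PER_GB = Decimal("0.20")   # listed rate per GB transferred
PRICE_PER_JOB = Decimal("0.50")  # listed rate per job


def managed_cost(gb_transferred: float, jobs: int) -> Decimal:
    """Estimate managed service cost in dollars under the assumptions above."""
    if gb_transferred < 0 or jobs < 0:
        raise ValueError("inputs must be non-negative")
    # Convert via str so a float such as 0.1 does not carry binary noise.
    total = PRICE_PER_GB * Decimal(str(gb_transferred)) + PRICE_PER_JOB * jobs
    return total.quantize(Decimal("0.01"))


# Hypothetical workload: 5,000 GB moved across 3 jobs.
print(managed_cost(5000, 3))  # 1001.50
```

For bulk transfers the per-GB charge dominates. In this example the three job fees add only $1.50 to a $1,000.00 transfer charge.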