Data Orchestrator

Move data to the compute

Limestone Data Orchestrator helps AI labs move datasets, model weights, containers, artifacts, and logs across fragmented storage and GPU compute environments with high throughput, reliable execution, and clear monitoring.


Product preview: Data Orchestrator jobs (control plane view, status: healthy)

hydrate-model-weights: RUNNING, 82% complete, 14.2 GB/s
return-training-artifacts: RUNNING, 47% complete, 8.7 GB/s
cleanup-ephemeral-cache: PENDING, 0% complete, -

Counters: found (12.8 PB), copied (9.4 PB), errors

Problem

Data and compute no longer live in the same place.

AI workloads run across hyperscalers, private clusters, regional capacity, and specialized GPU environments. The data they need is spread across object stores, file systems, databases, data lakes, warehouses, model registries, and local cluster storage.

Teams stitch this together with scripts, cloud-specific tools, and fragile pipelines. When transfers are slow or fail silently, accelerators sit idle and platform engineers end up debugging movement instead of improving infrastructure.

Solution

Data movement jobs for AI workloads.

Limestone provides Data Orchestrator as a vendor-neutral orchestration layer for defining, tracking, and operating data movement jobs across object stores, file systems, data lakes, warehouses, model registries, and cluster storage. The control plane manages state, progress, and errors while the data plane executes near storage and compute.

Data Orchestrator makes petabyte-scale data movement fast, scalable, monitored, reliable, secure, and convenient across file and object storage, private clusters, and public cloud compute. Customers keep control over data while Limestone helps move data to the compute before expensive accelerators sit idle.

Product

Data Orchestrator is built for speed, reliability, and monitoring from the first job.

Data Orchestrator makes data movement fast, reliable, observable, and easier to operate as AI infrastructure becomes more distributed.

It is designed for the workflows between storage and compute: hydrating clusters before training or inference, returning post-training outputs, placing model artifacts near inference capacity, and cleaning up ephemeral environments.

High-throughput data transfer

Data Orchestrator achieves 95%+ network saturation and up to 10× higher throughput than existing tooling by scaling the data transfer horizontally.

Technical detail

Performance scales horizontally by decoupling the control and data planes, using stateless workers deployed close to storage and compute clusters. Distributed, pipelined, and parallelized fan-out execution eliminates noisy neighbor effects and coordination bottlenecks. Streaming I/O with backpressure handles both small and large files efficiently, and zero-copy checksums provide end-to-end data integrity.
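
The streaming and integrity ideas above can be illustrated with a minimal Python sketch. It is not Limestone's implementation: the function name stream_copy, the chunk size, and the queue depth are all assumptions made for illustration, and it does not attempt true zero-copy. It only shows a bounded queue applying backpressure between a reader and a writer, with the checksum computed in the same pass as the write rather than in a second read.

```python
import hashlib
import io
import queue
import threading

_DONE = object()  # sentinel marking end of stream


def stream_copy(src, dst, chunk_size=4, max_inflight=2):
    """Copy src to dst through a bounded queue; return (bytes_copied, sha256_hex).

    Hypothetical helper: chunk_size is tiny so the demo moves several chunks.
    """
    chunks = queue.Queue(maxsize=max_inflight)  # bounded queue = backpressure

    def reader():
        try:
            while True:
                chunk = src.read(chunk_size)
                if not chunk:
                    break
                chunks.put(chunk)  # blocks while the writer is behind
            chunks.put(_DONE)
        except Exception as exc:  # hand read failures to the writer side
            chunks.put(exc)

    worker = threading.Thread(target=reader)
    worker.start()

    digest = hashlib.sha256()
    copied = 0
    while True:
        item = chunks.get()
        if item is _DONE:
            break
        if isinstance(item, Exception):
            worker.join()
            raise item
        digest.update(item)  # hash the same bytes that get written
        dst.write(item)
        copied += len(item)
    worker.join()
    return copied, digest.hexdigest()


payload = b"model-weights-shard-0001"
out = io.BytesIO()
copied, checksum = stream_copy(io.BytesIO(payload), out)
print(copied, checksum == hashlib.sha256(payload).hexdigest())
```

Because the queue holds at most max_inflight chunks, a slow destination slows the reader instead of letting memory grow, which is the property that lets one code path handle both small and large files.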

Durable job execution

Accepted jobs are persisted, decomposed into retry-safe work, and tracked through monotonic state transitions.
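
One way to picture retry-safe work and monotonic state transitions is the Python sketch below. It is an illustration under stated assumptions, not the product's job engine: PENDING and RUNNING appear on this page, while SUCCEEDED, FAILED, the helper names advance and run_units, and the unit identifiers are hypothetical.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"  # assumed terminal state
    FAILED = "FAILED"        # assumed terminal state


# Rank defines the only allowed direction of travel; terminal states share a rank.
_RANK = {
    JobState.PENDING: 0,
    JobState.RUNNING: 1,
    JobState.SUCCEEDED: 2,
    JobState.FAILED: 2,
}


def advance(current, new):
    """Return the new state, rejecting moves backwards or out of a terminal state."""
    if new == current:
        return current  # a replayed update is a no-op
    if _RANK[new] <= _RANK[current]:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new


def run_units(unit_ids, completed, do_copy):
    """Run each work unit at most once; `completed` survives across retries."""
    for uid in unit_ids:
        if uid in completed:
            continue  # already done on an earlier attempt
        do_copy(uid)
        completed.add(uid)


calls = []


def flaky_copy(uid):
    """Stand-in for a transfer that fails once on part-2."""
    calls.append(uid)
    if uid == "part-2" and calls.count("part-2") == 1:
        raise OSError("transient failure")


units = ["part-1", "part-2", "part-3"]
done = set()
state = advance(JobState.PENDING, JobState.RUNNING)
for attempt in range(2):
    try:
        run_units(units, done, flaky_copy)
        state = advance(state, JobState.SUCCEEDED)
        break
    except OSError:
        continue  # retry; finished units are skipped

print(state.name, calls)
```

The retry repeats only the unit that failed, and the state can never slide back from RUNNING to PENDING, so a crash or replay cannot make a job appear less complete than it is.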

Visible operations

Progress counters, job state, manifests, and structured errors make long-running movement easy to monitor and reason about.
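
As a rough sketch of what progress counters and structured errors could look like to a monitoring script, consider the Python below. The field names (bytes_found, bytes_copied, path, code, retryable), the error code, and the sample values are assumptions chosen for illustration; the page does not specify the actual schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TransferError:
    # Hypothetical structured error record.
    path: str
    code: str
    retryable: bool


@dataclass
class JobProgress:
    # Hypothetical counters, loosely mirroring "found" and "copied".
    bytes_found: int = 0
    bytes_copied: int = 0
    errors: list = field(default_factory=list)

    def percent_complete(self):
        """Share of discovered bytes already copied, as a percentage."""
        if self.bytes_found == 0:
            return 0.0  # nothing discovered yet, e.g. a PENDING job
        return round(100 * self.bytes_copied / self.bytes_found, 1)


progress = JobProgress(bytes_found=1_000, bytes_copied=470)
progress.errors.append(
    TransferError(path="datasets/shard-17", code="CHECKSUM_MISMATCH", retryable=True)
)

# asdict() converts nested dataclasses too, so the report is plain JSON.
report = {**asdict(progress), "percent_complete": progress.percent_complete()}
print(json.dumps(report, indent=2))
```

Keeping errors as records with a machine-readable code and a retryable flag, rather than free-form log lines, is what lets tooling decide automatically whether to retry or alert.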

A job model for infrastructure teams

Copy, scan, and delete workflows are submitted as durable jobs with explicit state, counters, manifests, and structured errors. CLI, Python SDK, and web console surfaces can share one operating model.

do job copy \
  --origin s3://training-data/frontier-v4 \
  --destination cluster://h100-pool-a/datasets \
  --follow-symlinks=false

job_id: job_01HX9B7M6R
state: RUNNING
throughput: 14.2 GB/s
files_copied: 18,402,117
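
For scripting around the status output shown above, a small Python sketch can turn the key: value lines into typed fields. This assumes the text format is exactly as displayed; parse_status is a hypothetical helper, not part of the Python SDK, and a real deployment may expose a structured output that makes this parsing unnecessary.

```python
STATUS_TEXT = """\
job_id: job_01HX9B7M6R
state: RUNNING
throughput: 14.2 GB/s
files_copied: 18,402,117
"""


def parse_status(text):
    """Parse 'key: value' lines into a dict, converting the numeric fields."""
    status = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or unrecognised lines
        key, value = line.split(":", 1)
        status[key.strip()] = value.strip()

    if "files_copied" in status:
        # Drop thousands separators before converting to an integer.
        status["files_copied"] = int(status["files_copied"].replace(",", ""))
    if "throughput" in status:
        number, unit = status["throughput"].split()
        status["throughput"] = (float(number), unit)
    return status


status = parse_status(STATUS_TEXT)
print(status["state"], status["files_copied"], status["throughput"])
```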

Use cases

Data placement workflows for modern AI systems.

Pre-training and post-training

Move datasets, model weights, containers, and artifacts into compute before training or inference, then return artifacts, logs, model outputs, and completed run data to durable storage. Clean up staged data after cluster use.

Inference

Place model weights and supporting artifacts near newly available GPU capacity when serving demand increases, without rewriting workflows for each cloud, cluster, or storage backend.

Agentic workloads

Bulk-move files created by agentic workflows between ephemeral and durable storage.

Results

Faster workloads, clearer operations, less idle compute.

Reduced operational burden

Reduced end-to-end workload completion time

Reduced accelerator idle time

Improved AI researcher and engineer productivity

FAQ

Designed for infrastructure reality.

Who is Data Orchestrator for?

Data Orchestrator is designed for AI labs that use a mix of clouds, private clusters, regional GPU capacity, and diverse storage systems. It makes workloads portable across those environments so that compute isn't left waiting on data.

How does this help AI researchers and engineers?

Researchers get rapid access to the datasets, checkpoints, model weights, containers, and artifacts their workloads need, enabling them to run experiments faster. Infrastructure engineers get durable jobs with fine-grained observability and a consistent way to move data before and after training runs, inference spikes, and agentic workflows.

Does Data Orchestrator access my data?

Data Orchestrator is available as both a managed and an on-premises offering. Both options allow users to connect existing storage systems while retaining control over when and where data moves. In the managed offering, users provide storage credentials that are securely stored, encrypted, and used only for authorized data movement operations. The on-premises offering provides full control over execution, data residency, and security boundaries, with all data movement occurring within the customer's environment using Kubernetes, Slurm, SkyPilot, or SUNK runtimes.

What storage systems are supported?

Users can leverage their existing data across object storage (S3, GCS, Azure Blob) and file systems (NFS, Lustre, Weka, VAST, CephFS, and related backends), with planned support for data lakes, data warehouses, and container and model registries.