Run:ai Supported Software
Run:ai is a cloud-native, Kubernetes-based orchestration tool that enables data scientists to accelerate, innovate, and complete AI initiatives faster. Reach peak utilization and efficiency by incorporating an AI-dedicated, high-performance super-scheduler tailored for managing NVIDIA CUDA-enabled GPU resources. The platform provides IT and MLOps with visibility and control over job scheduling and dynamic GPU resource provisioning.
Faster Time to Innovation
With automatic resource pooling, queuing, and job prioritization, researchers can focus more on data science projects and develop ground-breaking innovations.
The Run:ai platform uses a fairness algorithm and configurable parameters to guarantee every user in the cluster a fair share of the resources. Get jobs done faster at the highest utilization.
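Run:ai does not publish the internals of its fairness algorithm, but the general idea of fair-share scheduling can be illustrated with a minimal max-min fairness sketch: each user is offered an equal slice of the pool, and capacity a user does not need is redistributed to users who want more. The function below is purely illustrative and is not Run:ai's actual scheduler.

```python
def fair_share(total_gpus, demands):
    """Max-min fair split of a GPU pool.

    demands: mapping of user -> requested GPUs.
    Returns a mapping of user -> granted GPUs where no user gets more
    than requested, and unused slack is redistributed to unsatisfied
    users until the pool or the demand is exhausted.
    """
    shares = {u: 0.0 for u in demands}
    remaining = float(total_gpus)
    unsatisfied = {u: d for u, d in demands.items() if d > 0}
    while remaining > 1e-9 and unsatisfied:
        per_user = remaining / len(unsatisfied)
        leftover = 0.0
        for user in list(unsatisfied):
            grant = min(per_user, unsatisfied[user])
            shares[user] += grant
            unsatisfied[user] -= grant
            leftover += per_user - grant  # slack this user did not need
            if unsatisfied[user] <= 1e-9:
                del unsatisfied[user]
        remaining = leftover
    return shares
```

For example, with 8 GPUs and demands of 2, 6, and 6, the light user keeps its 2 while the two heavy users split the remaining 6 evenly.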
Improved GPU Utilization
An integrated, automatic super-scheduler allows users to easily make use of fractional or multiple GPUs for all kinds of workloads. For the highest optimization, the scheduler aims to keep every GPU allocated at all times.
Capabilities and Features
The Run:ai platform allows departments, teams, and jobs to share GPU resources automatically. Its GPU quota system guarantees a set amount of GPU resources to a specific job. If resources are available elsewhere in the cluster, a job automatically receives over-quota GPUs to accelerate its task. When other jobs require their GPU quota again, Run:ai automatically preempts and reallocates the over-quota resources.
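The quota-and-preemption behavior described above can be sketched as a toy model: a project can borrow idle GPUs beyond its quota, but loses them when an owning project reclaims its guaranteed share. This is a conceptual sketch under assumed semantics, not Run:ai's implementation; the class and method names are hypothetical.

```python
class GpuPool:
    """Toy model of guaranteed quotas with over-quota lending."""

    def __init__(self, quotas):
        self.quotas = dict(quotas)          # guaranteed GPUs per project
        self.used = {p: 0 for p in quotas}  # GPUs currently held
        self.total = sum(quotas.values())

    def free(self):
        return self.total - sum(self.used.values())

    def request(self, project, n):
        """Grant up to n GPUs: idle capacity first, then reclaim the
        project's guaranteed quota by preempting over-quota borrowers."""
        granted = min(n, self.free())
        shortfall = n - granted
        while shortfall > 0 and self.used[project] + granted < self.quotas[project]:
            borrower = next((p for p in self.used
                             if p != project and self.used[p] > self.quotas[p]),
                            None)
            if borrower is None:
                break
            self.used[borrower] -= 1  # preempt one over-quota GPU
            granted += 1
            shortfall -= 1
        self.used[project] += granted
        return granted
```

With two projects guaranteed 2 GPUs each, project "b" can temporarily take all 4 idle GPUs; when project "a" later requests its quota, two of b's over-quota GPUs are preempted and handed back.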
Optimize workloads that require significantly less GPU utilization, like building models or running inference. Instead of allocating 10 GPUs to 10 data scientists (each utilizing about 1/10th of the compute), creating fractional GPU instances places all 10 data scientists onto a single GPU. The other 9 GPUs can then be used for a more demanding training task that requires massive GPU resources.
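The arithmetic behind fractional GPUs is essentially bin packing: many small fractional requests fit onto one physical GPU, freeing the rest of the fleet. The first-fit sketch below is illustrative only; it does not reflect how Run:ai actually places fractional workloads.

```python
def pack_fractional(requests, gpu_capacity=1.0):
    """First-fit packing of fractional GPU requests onto whole GPUs.

    requests: list of fractions of a single GPU (e.g. 0.1 = 1/10th).
    Returns (placement, gpu_count) where placement[i] is the GPU index
    assigned to request i.
    """
    gpus = []        # fraction already allocated on each GPU
    placement = []
    for frac in requests:
        for i, load in enumerate(gpus):
            if load + frac <= gpu_capacity + 1e-9:
                gpus[i] += frac
                placement.append(i)
                break
        else:
            gpus.append(frac)        # open a new GPU for this request
            placement.append(len(gpus) - 1)
    return placement, len(gpus)
```

Ten requests of 0.1 GPU each land on a single device, matching the 10-data-scientists example above.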
Leverage multi-GPU utilization to tackle large AI training workloads. Distributed training is executed automatically, with little to no intervention from the data scientist. There is no need to write code to enable distributed training, since it is built into the Run:ai platform.
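To make the idea of distributed training concrete, the sketch below shows one data-parallel step in plain Python: each "GPU" computes gradients on its own data shard, the gradients are averaged (an all-reduce), and every replica applies the same update. This is a conceptual illustration of the pattern Run:ai automates, not Run:ai code; `grad_fn` and the list-based parameters are assumptions for the example.

```python
def data_parallel_step(params, batches, grad_fn, lr=0.1):
    """One data-parallel update.

    params: list of parameter values (one replica's copy).
    batches: one data shard per simulated GPU.
    grad_fn(params, batch) -> per-parameter gradient list.
    """
    grads = [grad_fn(params, batch) for batch in batches]   # per-GPU grads
    avg = [sum(g[i] for g in grads) / len(grads)            # all-reduce mean
           for i in range(len(params))]
    return [p - lr * g for p, g in zip(params, avg)]        # identical update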
Workloads, resource allocation, and utilization can be viewed through the Run:ai platform's user interface. Create departments, assign teams, and allocate resources to specific projects accordingly. Monitor usage by cluster, node, project, or a single job or user. This visibility into usage can help justify additional GPU nodes.
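The kind of roll-up such a dashboard displays is a simple group-by over utilization samples. The helper below is a generic sketch of that aggregation, assuming hypothetical job records with `project`, `node`, and `gpu_util` fields; it is not a Run:ai API.

```python
from collections import defaultdict

def utilization_by(jobs, key):
    """Average GPU utilization grouped by an arbitrary dimension
    (e.g. "project" or "node"), the way a monitoring view rolls up
    per-job samples."""
    totals = defaultdict(lambda: [0.0, 0])
    for job in jobs:
        group = job[key]
        totals[group][0] += job["gpu_util"]
        totals[group][1] += 1
    return {g: s / n for g, (s, n) in totals.items()}
```

The same records can be regrouped by any field, which mirrors drilling down from cluster to node to project in the UI.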
Utilization Only Goes Up
The GPU quota system and fair scheduling work in tandem to dynamically allocate resources, enabling peak GPU utilization. When resources from one job are not in use, another job can pick up those unused resources to accelerate its task, automatically.
- Almost always, your training tasks will receive over-quota resources from tasks with lower utilization
- Worst case, tasks will continue to execute jobs within your requested GPU quota, the same way they would run without Run:ai
- Effectively, all GPUs in your cluster are allocated to jobs and utilized to their fullest potential. Every minute a GPU sits idle is an expensive minute.