RISEML Clusters

Introducing Exxact RiseML Clusters

RiseML simplifies running, monitoring, and scaling deep learning experiments on Kubernetes. Kubernetes is an open-source system, originally developed at Google, for orchestrating and scheduling containerized workloads. RiseML supports all major deep learning frameworks, including Caffe, TensorFlow, Theano, and Torch.

Using RiseML, we can combine individual GPU servers into a single compute cluster whose resources your team can share for training machine learning models. This lets your team focus on what really matters, deep learning research, rather than on repetitive manual work, hardware setup headaches, or software configuration issues. With RiseML Clusters, you can:

  • Effortlessly run machine learning experiments: No more SSH. Use the RiseML command-line interface to prepare, run, monitor, and describe reproducible experiments and their results directly from your workstation.
  • Partition and manage resources for multiple users: RiseML maximizes your cluster’s resource utilization by running experiments in parallel in isolated container environments.
  • Keep your data and code private: No more worrying about cloud storage security. With your own in-house GPU cluster, you keep full control of your data and can ensure that it’s protected.

Exxact RiseML Cluster Options

RiseML Cluster

  • Fully Turn Key
  • GbE Network Only
  • 1x Headnode with 40TB Raw Storage
  • 4x Compute Nodes (4x Double Wide Cards, 96GB DDR4 Memory, 4TB HDD)

RiseML Cluster

  • Fully Turn Key
  • GbE Network + 10GbE
  • 1x Headnode with 80TB Raw Storage
  • 8x Compute Nodes (4x Double Wide Cards, 96GB DDR4 Memory, 4TB HDD, 2x 10GBASE-T)

RiseML Cluster

  • Fully Turn Key
  • GbE Network + 10GbE
  • 1x Headnode with 80TB Raw Storage
  • 20x Compute Nodes (4x Double Wide Cards, 192GB DDR4 Memory, 4TB HDD, 2x 10GBASE-T)
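To compare the three options at a glance, the per-node figures above can be multiplied out. This is a minimal sketch that only does the arithmetic on the listed specifications. The names `CLUSTER_OPTIONS` and `summarize` are illustrative, and it assumes every compute node is fully populated with four cards.

```python
# Per-node figures copied from the three cluster options listed above.
CLUSTER_OPTIONS = [
    {"nodes": 4, "cards_per_node": 4, "memory_gb_per_node": 96},
    {"nodes": 8, "cards_per_node": 4, "memory_gb_per_node": 96},
    {"nodes": 20, "cards_per_node": 4, "memory_gb_per_node": 192},
]


def summarize(option):
    """Return total card slots and total compute-node memory for one option."""
    return {
        "total_cards": option["nodes"] * option["cards_per_node"],
        "total_memory_gb": option["nodes"] * option["memory_gb_per_node"],
    }


for option in CLUSTER_OPTIONS:
    print(option["nodes"], "nodes:", summarize(option))
```

Under that assumption, the options provide 16, 32, and 80 card slots, with 384 GB, 768 GB, and 3,840 GB of compute-node memory respectively.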

RiseML Key Features

Powered by Kubernetes

The most advanced container orchestrator

Data Privacy

Keep your data and code private

Effortless Automation

Hyperparameter tuning and distributed training

Multi User

Share your cluster with your team

Framework Support

Choose your favorite machine learning framework

Monitor Experiments

Track progress and analyze results

NVIDIA Elite Partner

As a leading NVIDIA Elite Partner, Exxact Corporation works closely with the NVIDIA Tesla GPU team to ensure seamless factory development and support. We pride ourselves on providing value-added service standards unmatched by our competitors.

Understanding RiseML

RiseML provides a simple yet powerful abstraction for machine learning engineers working on GPU infrastructure, both in the cloud and on bare metal. ML engineers access RiseML via a command-line interface (CLI) that hides the details of the underlying cluster hardware and introduces a new layer built around machine learning concepts. Researchers using RiseML think in terms of experiments that train a model. Under the hood, RiseML takes care of executing these experiments robustly on the infrastructure, including deciding which nodes run which parts of an experiment.

An Exxact RiseML Cluster consists of a hardware layer with a number of nodes and GPUs, a Kubernetes layer that orchestrates machine learning jobs, and a RiseML layer that manages experiments and turns them into Kubernetes jobs. Each Exxact RiseML Cluster contains storage systems configured to meet the specific requirements and challenges of the various types of training and model data.
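To make the idea of turning an experiment into a Kubernetes job more concrete, here is a minimal sketch of what such a translation could look like. It is not RiseML's actual implementation. The function `experiment_to_job`, its parameters, and the example image and command are assumptions made for illustration. The manifest fields follow the standard Kubernetes `batch/v1` Job schema, and `nvidia.com/gpu` is the resource name exposed by the NVIDIA device plugin.

```python
from typing import Dict, List


def experiment_to_job(name: str, image: str, command: List[str], gpus: int) -> Dict:
    """Build a Kubernetes batch/v1 Job manifest for one training run.

    Hypothetical helper: it only illustrates the mapping from an
    experiment description to a Job, not how RiseML does it internally.
    """
    if gpus < 0:
        raise ValueError("gpus must be non-negative")

    container = {"name": "train", "image": image, "command": command}
    if gpus:
        # GPUs are requested as an extended resource under "limits".
        container["resources"] = {"limits": {"nvidia.com/gpu": gpus}}

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed training run
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [container],
                }
            },
        },
    }


# Example values are placeholders, not taken from a real project.
job = experiment_to_job(
    name="mnist-run-1",
    image="tensorflow/tensorflow:latest-gpu",
    command=["python", "train.py"],
    gpus=2,
)
```

The resulting dictionary could be serialized to YAML or JSON and submitted to the cluster. Kubernetes then decides which node has enough free GPUs to run it, which is the scheduling step that the researcher never has to think about.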

The RiseML layer consists of multiple components, which run on top of Kubernetes alongside the machine learning jobs. For example, RiseML takes care of versioning, gathering logs, and tracking the state of each experiment; this is its core function. On top of that, RiseML provides a REST API that can be accessed either programmatically or via the RiseML client.
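The bookkeeping described above (a code version, collected logs, and a tracked state per experiment) can be pictured with a small sketch. Everything here is hypothetical: the state names, the allowed transitions, and the `Experiment` class are assumptions chosen for illustration and do not reflect RiseML's internal data model or its REST API.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    FINISHED = "finished"
    FAILED = "failed"


# Hypothetical lifecycle: which states may follow which.
ALLOWED = {
    State.PENDING: {State.RUNNING, State.FAILED},
    State.RUNNING: {State.FINISHED, State.FAILED},
    State.FINISHED: set(),
    State.FAILED: set(),
}


class Experiment:
    """Minimal record of one experiment: code version, logs, and state."""

    def __init__(self, name, code_version):
        self.name = name
        self.code_version = code_version  # e.g. an identifier of the code snapshot
        self.state = State.PENDING
        self.logs = []

    def transition(self, new_state):
        """Move to new_state, rejecting transitions the lifecycle forbids."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot go from {self.state.value} to {new_state.value}")
        self.state = new_state

    def log(self, line):
        """Append one line of output to the experiment's log."""
        self.logs.append(line)


exp = Experiment("mnist-run-1", code_version="v1")
exp.transition(State.RUNNING)
exp.log("epoch 1 done")
exp.transition(State.FINISHED)
```

Keeping the code version next to the logs and final state is what makes a run reproducible: anyone on the team can see which code produced which result.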

All experiments on the RiseML cluster run in Linux containers, which behave like lightweight "virtual machines". This lets each project install its own dependencies without interfering with others. For example, it is possible to run different Linux distributions on the same cluster, or even on the same node, at the same time. Each container is started from an image. The image contains the container's filesystem, including all system libraries, your machine learning code, and any other dependencies you need to run your code.