TensorRT Inference Server (TRTIS)
What is TensorRT Inference Server? Let's say you've created a TensorFlow or ONNX model that meets your specifications. What you need now is a deployable inference solution, one that optimizes the available GPUs for maximum performance. Other requirements must be met as well, such as the ability to A/B test models, or to support servers with multiple homogeneous or heterogeneous GPU configurations.
The solution you seek is the NVIDIA TensorRT inference server. The TensorRT inference server is a platform that broadens the reach of your models and frameworks while improving utilization of both GPUs and CPUs. Exxact Deep Learning Inference Solutions ship with the TensorRT inference server, which encapsulates everything you need to deploy a high-performance inference server.
The NVIDIA TensorRT Inference Server provides a solution optimized for GPU inference. Effectively, the server acts as a service via an HTTP or gRPC endpoint, allowing remote clients to request inference for any model it manages.
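As a rough sketch of what a remote client does over the HTTP endpoint, the snippet below composes an inference request. The endpoint path, port 8000, the NV-InferRequest header, and the model name are assumptions for illustration; consult the server's API reference for the exact protocol.

```python
import json

# Hypothetical sketch: compose an HTTP inference request for a model hosted
# by the inference server. Paths, port, and header names are illustrative.
def build_infer_request(host, model_name, version, payload):
    url = f"http://{host}:8000/api/infer/{model_name}/{version}"
    # Request metadata (e.g. batch size) travels in a header, while the raw
    # input tensor bytes travel in the request body.
    headers = {"NV-InferRequest": json.dumps({"batch_size": 1})}
    return url, headers, payload

url, headers, body = build_infer_request("localhost", "resnet50_trt", 1, b"\x00" * 4)
print(url)  # http://localhost:8000/api/infer/resnet50_trt/1
```

Sending the request (with, say, `urllib.request` or a gRPC stub) then returns the model's output tensors for the submitted batch.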
TensorRT Inference Server Features
Multiple framework support
The inference server can manage any number and mix of models (subject to system disk and memory limits). It supports TensorRT, TensorFlow GraphDef, TensorFlow SavedModel, and Caffe2 NetDef model formats, as well as TensorFlow-TensorRT integrated models.
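To make the mixed-framework idea concrete, a model repository holding one TensorRT model and one TensorFlow GraphDef model side by side might be laid out as below. The model and file names are hypothetical; each model gets its own directory, a configuration file, and numbered version subdirectories.

```
model_repository/
├── resnet50_trt/
│   ├── config.pbtxt            # model configuration
│   └── 1/
│       └── model.plan          # serialized TensorRT engine, version 1
└── inception_graphdef/
    ├── config.pbtxt
    └── 1/
        └── model.graphdef      # TensorFlow GraphDef
```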
Custom backend support
The inference server allows individual models to be implemented with custom backends instead of by a deep-learning framework. With a custom backend a model can implement any logic desired, while still benefiting from the GPU support, concurrent execution, dynamic batching and other features provided by the server.
The inference server monitors the model repository
The server detects any change and dynamically reloads the affected model(s) as necessary, without requiring a restart. Models and model versions can be added and removed, and model configurations can be changed, all while the server is running.
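A minimal sketch of what "adding a model version while the server runs" looks like on disk: dropping a new version directory into the repository, which the server picks up on its next repository scan. The repository path and file names are illustrative, and the placeholder bytes stand in for a real serialized engine.

```python
from pathlib import Path

# Hypothetical sketch: publish version 2 of a model into a live repository.
# The running server notices the new version directory and loads it;
# no restart is needed.
repo = Path("model_repository") / "mymodel"
new_version = repo / "2"                      # next version number
new_version.mkdir(parents=True, exist_ok=True)
# Placeholder bytes stand in for a real serialized TensorRT engine.
(new_version / "model.plan").write_bytes(b"engine-bytes")
print((new_version / "model.plan").exists())  # True
```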
The server can efficiently distribute inference tasks across all available GPUs.
Concurrent model execution support
Multiple models (or multiple instances of the same model) can run simultaneously on the same GPU.
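The degree of concurrency is typically declared in the model's configuration file. The snippet below (with illustrative values) sketches how an `instance_group` setting could request two instances of a model on GPU 0, letting two requests for that model execute simultaneously:

```
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```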
For models that support batching, the server accepts requests for a batch of inputs and responds with the corresponding batch of outputs. The inference server also supports dynamic batching where individual inference requests are dynamically combined together to improve inference throughput.
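Dynamic batching is likewise configured per model. As a hedged example (field values are illustrative), the snippet below asks the server to coalesce individual requests into batches of a preferred size, waiting at most a short queue delay before dispatching whatever has accumulated:

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```

The trade-off is the usual one: a longer queue delay yields larger batches and higher throughput at the cost of slightly higher per-request latency.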
Model repositories may reside on a locally accessible file system (e.g. NFS) or in cloud storage.
Readiness and liveness
Health endpoints suitable for any orchestration or deployment framework, such as Kubernetes.
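A sketch of how an orchestrator might poll those endpoints. The paths (`/api/health/live`, `/api/health/ready`) and the port are assumptions for illustration; verify them against the server's API reference.

```python
import urllib.request

# Hypothetical sketch: liveness/readiness probes against the server's
# health endpoints. Endpoint paths and port are illustrative.
def is_healthy(host, port, probe):
    url = f"http://{host}:{port}/api/health/{probe}"
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        # An unreachable server is treated as unhealthy.
        return False

# A Kubernetes livenessProbe/readinessProbe would hit these same URLs.
print(is_healthy("localhost", 8000, "live"))
```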
Track metrics indicating GPU utilization, server throughput, and server latency.
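These metrics are exposed in Prometheus text format on a dedicated metrics endpoint, so an existing monitoring stack can scrape them directly. The series below are illustrative examples of the kind of data reported, not captured output:

```
nv_gpu_utilization{gpu_uuid="GPU-..."} 0.85
nv_inference_request_success{model="resnet50_trt",version="1"} 104
nv_inference_request_duration_us{model="resnet50_trt",version="1"} 523
```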
The stable release of the TensorRT Inference Server ships with Exxact Deep Learning Inference Solutions.