NVIDIA TensorRT™ is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications. TensorRT-based applications perform up to 40x faster than CPU-only platforms during inference. With TensorRT, you can optimize neural network models trained in all major frameworks, calibrate for lower precision with high accuracy, and finally deploy to hyperscale data centers, embedded, or automotive product platforms.
TensorRT is built on CUDA, NVIDIA’s parallel programming model, and enables you to optimize inference for all deep learning frameworks by leveraging libraries, development tools, and technologies in CUDA-X AI for artificial intelligence, autonomous machines, high-performance computing, and graphics.
TensorRT provides INT8 and FP16 optimizations for production deployments of deep learning inference applications such as video streaming, speech recognition, recommendation, and natural language processing. Reduced-precision inference significantly reduces application latency, which is a requirement for many real-time services and for automotive and embedded applications.
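To make the idea of reduced precision concrete, here is a minimal sketch of symmetric INT8 quantization in plain Python. It is an illustration only and does not use the TensorRT API: the function names (`calibrate_scale`, `quantize`, `dequantize`) and the sample values are hypothetical, and the max-absolute-value rule is a deliberate simplification, not a description of how TensorRT's own calibrators choose scales.

```python
def calibrate_scale(values, num_bits=8):
    """Pick a symmetric scale so the largest |value| maps to the largest integer."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Map floats to integers in [-qmax, qmax], clamping anything out of range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(qvalues, scale):
    """Map integers back to approximate float values."""
    return [q * scale for q in qvalues]


# Hypothetical activation values observed on calibration data.
activations = [0.02, -1.27, 0.64, 0.9, -0.31]

scale = calibrate_scale(activations)
q = quantize(activations, scale)
restored = dequantize(q, scale)

# Worst-case rounding error for in-range values is half a quantization step.
max_error = max(abs(a - r) for a, r in zip(activations, restored))
print("scale:", scale)
print("int8 values:", q)
print("max round-trip error:", max_error)
```

The trade-off this shows is the one calibration has to manage: a larger scale covers a wider range of values but makes each integer step coarser, so the scale is chosen from representative data.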
You can import trained models from every deep learning framework into TensorRT. After applying optimizations, TensorRT selects platform-specific kernels to maximize performance on Tesla GPUs in the data center, Jetson embedded platforms, and NVIDIA DRIVE autonomous driving platforms.
For running AI models in data center production, the TensorRT Inference Server is a containerized microservice that maximizes GPU utilization and runs multiple models from different frameworks concurrently on a node. It leverages Docker and Kubernetes to integrate seamlessly into DevOps architectures.
With TensorRT, developers can focus on creating novel AI-powered applications rather than performance tuning for inference deployment.
TensorRT Optimizations and Performance
Weight & Activation Precision Calibration
Maximizes throughput by quantizing models to INT8 while preserving accuracy
Layer & Tensor Fusion
Optimizes use of GPU memory and bandwidth by fusing nodes in a kernel
Kernel Auto-Tuning
Selects best data layers and algorithms based on target GPU platform
Dynamic Tensor Memory
Minimizes memory footprint and re-uses memory for tensors efficiently
Multi-Stream Execution
Scalable design to process multiple input streams in parallel
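As an illustration of the layer and tensor fusion idea listed above, the following plain-Python sketch folds a per-output scale-and-shift step (as batch normalization behaves at inference time) into the preceding linear layer, then applies ReLU in the same pass. It is a conceptual sketch under stated assumptions, not TensorRT code: all function names and the tiny example layer are hypothetical.

```python
def linear(x, weights, bias):
    """y = W x + b for a single input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def scale_shift(y, gamma, beta):
    """Per-output scale and shift."""
    return [g * v + b for g, v, b in zip(gamma, y, beta)]


def relu(y):
    return [max(0.0, v) for v in y]


def unfused(x, weights, bias, gamma, beta):
    # Three separate passes, each producing an intermediate list.
    return relu(scale_shift(linear(x, weights, bias), gamma, beta))


def fold_scale_shift(weights, bias, gamma, beta):
    """Fold the scale and shift into the linear layer's weights and bias once, ahead of time."""
    fused_w = [[g * w for w in row] for row, g in zip(weights, gamma)]
    fused_b = [g * b + be for g, b, be in zip(gamma, bias, beta)]
    return fused_w, fused_b


def fused(x, fused_w, fused_b):
    # One pass: multiply-accumulate, bias, and activation, with no intermediates.
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(fused_w, fused_b)]


# Hypothetical two-output layer.
x = [1.0, 2.0]
weights = [[0.5, -1.0], [2.0, 0.25]]
bias = [0.1, -0.2]
gamma = [2.0, 0.5]
beta = [0.0, 1.0]

fused_w, fused_b = fold_scale_shift(weights, bias, gamma, beta)
reference = unfused(x, weights, bias, gamma, beta)
result = fused(x, fused_w, fused_b)
print(reference, result)
```

Both paths compute the same values. The fused path reads the input once and writes the output once, which is where the memory and bandwidth savings come from.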
TensorRT dramatically accelerates deep learning inference performance on NVIDIA GPUs. See how it can power your inference needs across multiple networks with high throughput and ultra-low latency.
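The memory re-use described under Dynamic Tensor Memory can be pictured with a small planning sketch: a tensor only needs a buffer while it is live, so tensors whose lifetimes do not overlap can share one. The code below is a simplified greedy planner in plain Python, written for illustration only. The `plan_buffers` function, the tensor tuples, and the greedy policy are assumptions and do not describe TensorRT's actual allocator.

```python
def plan_buffers(tensors):
    """Assign tensors to reusable buffers.

    tensors: list of (name, size, first_step, last_step), where the tensor is
    live from first_step through last_step inclusive.
    Returns (assignment, buffer_sizes).
    """
    buffers = []      # each entry: {"size": int, "free_after": last live step}
    assignment = {}   # tensor name -> buffer index
    for name, size, first, last in sorted(tensors, key=lambda t: t[2]):
        # A buffer is reusable if its previous tensor died before this one is born.
        candidates = [i for i, b in enumerate(buffers) if b["free_after"] < first]
        if candidates:
            # Prefer the largest free buffer so it rarely needs to grow.
            idx = max(candidates, key=lambda i: buffers[i]["size"])
            buffers[idx]["size"] = max(buffers[idx]["size"], size)
        else:
            buffers.append({"size": size, "free_after": -1})
            idx = len(buffers) - 1
        buffers[idx]["free_after"] = last
        assignment[name] = idx
    return assignment, [b["size"] for b in buffers]


# Hypothetical chain of layers: each tensor is produced at one step and
# consumed at the next.
tensors = [
    ("t0", 100, 0, 1),
    ("t1", 200, 1, 2),
    ("t2", 100, 2, 3),
    ("t3", 50, 3, 4),
]

assignment, buffer_sizes = plan_buffers(tensors)
naive_total = sum(size for _, size, _, _ in tensors)
planned_total = sum(buffer_sizes)
print(assignment, buffer_sizes, naive_total, planned_total)
```

In this example four tensors fit in two buffers, because at any step only a layer's input and output are live at the same time.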
Integrated with All Major Frameworks
NVIDIA works closely with deep learning framework developers to achieve optimized performance for inference on AI platforms using TensorRT. If your trained models are in the ONNX format or come from other popular frameworks such as TensorFlow and MATLAB, there are easy ways for you to import them into TensorRT for inference. Below are a few integrations with information on how to get started.
TensorRT and TensorFlow are tightly integrated so you get the flexibility of TensorFlow with the powerful optimizations of TensorRT. Learn more in the TensorRT integrated with TensorFlow blog post.
TensorRT provides an ONNX parser so you can easily import ONNX models from frameworks such as Caffe2, Chainer, Microsoft Cognitive Toolkit, MXNet, and PyTorch into TensorRT. Learn more about ONNX support in TensorRT here.
TensorRT is also integrated with ONNX Runtime, providing an easy way to achieve high-performance inference for machine learning models in the ONNX format. Learn more about ONNX Runtime - TensorRT integration here.
“In our evaluation of TensorRT running our deep learning-based recommendation application on NVIDIA Tesla V100 GPUs, we experienced a 45x increase in inference speed and throughput compared with a CPU-based platform. We believe TensorRT could dramatically improve productivity for our enterprise customers.”
— Markus Noga, Head of Machine Learning at SAP
“By using tensor cores on the V100, the most recently optimized CUDA libraries and the TF-TRT backend we were able to speed up our already fast DL network by a factor of 4x.”
— Kris Bhaskar, KLA Senior Fellow, VP AI initiatives, KLA
“Criteo uses Nvidia's TensorRT over T4 cards to optimize its deep-learning models for faster inference on GPUs. Now, removing inappropriate images over billions of them is 4 times faster. It also consumes half less energy.”
— Suju Rajan, SVP Research, Criteo
NVIDIA TensorRT Inference Server
NVIDIA TensorRT Inference Server simplifies deploying AI inference in data center production. It is an inference microservice that maximizes GPU utilization and integrates seamlessly into DevOps deployments through Docker and Kubernetes.
TensorRT Inference Server:
- Maximizes utilization by enabling inference for multiple models on one or more GPUs
- Supports all popular AI frameworks
- Supports audio streaming inputs
- Dynamically batches requests to increase throughput
- Provides latency and health metrics for auto scaling and load balancing
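To show what dynamically batching requests means in practice, here is a small offline simulation in plain Python. It groups requests that arrive close together into one batch, and dispatches a batch either when it is full or when the oldest request has waited long enough. This is a sketch for illustration, not the inference server's scheduler: `form_batches`, `max_batch_size`, `max_delay`, and the request data are all hypothetical.

```python
from collections import deque


def form_batches(requests, max_batch_size, max_delay):
    """Group requests into batches.

    requests: list of (arrival_time, request_id), sorted by arrival_time.
    Returns a list of (dispatch_time, [request_ids]).
    """
    queue = deque(requests)
    batches = []
    while queue:
        first_arrival, first_id = queue.popleft()
        batch = [first_id]
        deadline = first_arrival + max_delay  # longest the first request may wait
        last_arrival = first_arrival
        # Keep adding requests that arrive before the deadline, up to the size cap.
        while queue and len(batch) < max_batch_size and queue[0][0] <= deadline:
            last_arrival, request_id = queue.popleft()
            batch.append(request_id)
        # A full batch leaves as soon as its last member arrives;
        # a partial batch leaves when the deadline expires.
        is_full = len(batch) == max_batch_size
        dispatch_time = last_arrival if is_full else deadline
        batches.append((dispatch_time, batch))
    return batches


# Hypothetical request trace: (arrival time in ms, request id).
requests = [(0.0, "a"), (1.0, "b"), (2.0, "c"), (9.0, "d"), (20.0, "e")]
batches = form_batches(requests, max_batch_size=3, max_delay=5.0)
for dispatch_time, ids in batches:
    print(dispatch_time, ids)
```

The two parameters express the usual trade-off: a larger batch size or longer delay tends to raise throughput, while a shorter delay bounds the extra latency any single request can see.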
TensorRT Inference Server is available in a ready-to-deploy container from the NGC container registry, making it simple to use in production environments. It is also available as open source, allowing developers to customize and extend functionality of the software to fit their specific data center workflows.
With TensorRT Inference Server, there’s now a common solution for AI inference, allowing researchers to focus on creating high-quality trained models, DevOps engineers to focus on deployment, and developers to focus on their applications, without needing to reinvent the plumbing for each AI-powered application.
Learn more in the NVIDIA TensorRT Inference Server developer blog.
What's New in TensorRT 5.1 and TensorRT Inference Server
TensorRT 5.1 includes support for 20+ new TensorFlow and ONNX operations, the ability to update model weights in engines quickly, and a new padding mode to match native framework formats for higher performance. With this new version, applications perform up to 40x faster during inference using mixed precision on Turing GPUs for image/video, translation, and speech applications.
- Optimize models such as DenseNet and TinyYOLO with support for 20+ new layers, activations, and operations in TensorFlow and ONNX
- Update model weights in an existing engine without rebuilding it
- Deploy applications in INT8 precision on Xavier-based NVIDIA AGX platforms using the NVIDIA DLA accelerator
While TensorRT supports every framework, it is also included in TensorFlow 2.0, providing an easy path for TensorFlow users to get powerful TensorRT optimizations. TensorRT is also integrated with ONNX Runtime, enabling high-performance inference for a wide set of machine learning models in the ONNX format. In addition, TensorRT 5.1 includes new samples, new debugging capabilities through support for the NVTX format, and bug fixes.
TensorRT 5.1 GA is available for download now to members of the NVIDIA Developer Program.
NVIDIA TensorRT Inference Server 1.0 includes an audio streaming API, bug fixes, and enhancements. All future versions will be backward compatible with this version. It is available as a ready-to-deploy container from the NGC container registry and as an open source project from GitHub.
Get Started With Hands-On Training
The NVIDIA Deep Learning Institute (DLI) offers hands-on training for developers, data scientists, and researchers in AI and accelerated computing. Get hands-on experience with TensorRT today in the self-paced electives Optimization and Deployment of TensorFlow Models with TensorRT and Deployment for Intelligent Video Analytics using TensorRT.
TensorRT is freely available to members of the NVIDIA Developer Program from the TensorRT product page for development and deployment.
Developers can also get TensorRT in the TensorRT Container from the NGC container registry.
TensorRT is included in: