Review the latest GPU acceleration factors of popular HPC applications.


NVIDIA’s complete solution stack, from GPUs and libraries to containers on NVIDIA GPU Cloud (NGC), allows data scientists to quickly get up and running with deep learning. NVIDIA® Tesla® V100 Tensor Core GPUs leverage mixed precision to accelerate deep learning training throughput across every framework and every type of neural network. NVIDIA breaks performance records on MLPerf, AI’s first industry-wide benchmark, a testament to our GPU-accelerated platform approach.
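
All of the training results on this page are measured with mixed precision in NGC framework containers. As a rough illustration of what mixed-precision training involves (a minimal sketch using PyTorch's native automatic mixed precision API, not the actual benchmark scripts), a training step looks roughly like this:

```python
# Minimal mixed-precision training sketch (illustrative only; the results on this
# page come from NVIDIA's NGC containers, not this script). Assumes a recent
# PyTorch with torch.cuda.amp and a CUDA GPU with Tensor Cores.
import torch
import torchvision

model = torchvision.models.resnet50().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid FP16 underflow

images = torch.randn(64, 3, 224, 224, device="cuda")   # stand-in for an ImageNet batch
labels = torch.randint(0, 1000, (64,), device="cuda")

for step in range(10):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():        # forward pass runs in mixed precision
        loss = criterion(model(images), labels)
    scaler.scale(loss).backward()          # backward on the scaled loss
    scaler.step(optimizer)                 # unscales gradients, then optimizer step
    scaler.update()
```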

NVIDIA Performance on MLPerf 0.6 AI Benchmarks

ResNet-50 v1.5 Time to Solution on V100

MXNet | Batch Size: refer to the V100 Training Performance table below | Precision: Mixed | Dataset: ImageNet2012 | Convergence criteria: refer to MLPerf requirements

Training Image Classification on CNNs

ResNet-50 V1.5 Throughput on V100

DGX-1: 8x Tesla V100-SXM2-16GB, E5-2698 v4 2.2 GHz | Batch Size: 208 for MXNet, 256 for PyTorch and TensorFlow | Container: 19.09-py3 for MXNet and TensorFlow, 19.10-py3 for PyTorch | Precision: Mixed | Dataset: ImageNet2012

ResNet-50 V1.5 Throughput on T4

Supermicro SYS-4029GP-TRT T4: 8x Tesla T4, Gold 6140 2.3 GHz for MXNet and TensorFlow, Gold 6240 2.6 GHz for PyTorch | Batch Size: 208 for MXNet, 256 for PyTorch and TensorFlow | Container: 19.05-py3 for MXNet, 19.09-py3 for PyTorch, 19.06-py3 for TensorFlow | Precision: Mixed | Dataset: ImageNet2012

Training Performance

NVIDIA Performance on MLPerf 0.6 AI Benchmarks

Framework | Network | Network Type | Time to Solution | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
MXNet | ResNet-50 v1.5 | CNN | 115.22 minutes | 8x V100 | DGX-1 | 0.6-8 | Mixed | ImageNet2012 | V100-SXM2-16GB
MXNet | ResNet-50 v1.5 | CNN | 57.87 minutes | 16x V100 | DGX-2 | 0.6-17 | Mixed | ImageNet2012 | V100-SXM3-32GB
MXNet | ResNet-50 v1.5 | CNN | 52.74 minutes | 16x V100 | DGX-2H | 0.6-19 | Mixed | ImageNet2012 | V100-SXM3-32GB-H
MXNet | ResNet-50 v1.5 | CNN | 2.59 minutes | 512x V100 | DGX-2H | 0.6-29 | Mixed | ImageNet2012 | V100-SXM3-32GB-H
MXNet | ResNet-50 v1.5 | CNN | 1.69 minutes | 1040x V100 | DGX-1 | 0.6-16 | Mixed | ImageNet2012 | V100-SXM2-16GB
MXNet | ResNet-50 v1.5 | CNN | 1.33 minutes | 1536x V100 | DGX-2H | 0.6-30 | Mixed | ImageNet2012 | V100-SXM3-32GB-H
PyTorch | SSD-ResNet-34 | CNN | 22.36 minutes | 8x V100 | DGX-1 | 0.6-9 | Mixed | COCO2017 | V100-SXM2-16GB
PyTorch | SSD-ResNet-34 | CNN | 12.21 minutes | 16x V100 | DGX-2 | 0.6-18 | Mixed | COCO2017 | V100-SXM3-32GB
PyTorch | SSD-ResNet-34 | CNN | 11.41 minutes | 16x V100 | DGX-2H | 0.6-20 | Mixed | COCO2017 | V100-SXM3-32GB-H
PyTorch | SSD-ResNet-34 | CNN | 4.78 minutes | 64x V100 | DGX-2H | 0.6-21 | Mixed | COCO2017 | V100-SXM3-32GB-H
PyTorch | SSD-ResNet-34 | CNN | 2.67 minutes | 240x V100 | DGX-1 | 0.6-13 | Mixed | COCO2017 | V100-SXM2-16GB
PyTorch | SSD-ResNet-34 | CNN | 2.56 minutes | 240x V100 | DGX-2H | 0.6-24 | Mixed | COCO2017 | V100-SXM3-32GB-H
PyTorch | SSD-ResNet-34 | CNN | 2.23 minutes | 240x V100 | DGX-2H | 0.6-27 | Mixed | COCO2017 | V100-SXM3-32GB-H
PyTorch | Mask R-CNN | CNN | 207.48 minutes | 8x V100 | DGX-1 | 0.6-9 | Mixed | COCO2017 | V100-SXM2-16GB
PyTorch | Mask R-CNN | CNN | 101 minutes | 16x V100 | DGX-2 | 0.6-18 | Mixed | COCO2017 | V100-SXM3-32GB
PyTorch | Mask R-CNN | CNN | 95.2 minutes | 16x V100 | DGX-2H | 0.6-20 | Mixed | COCO2017 | V100-SXM3-32GB-H
PyTorch | Mask R-CNN | CNN | 32.72 minutes | 64x V100 | DGX-2H | 0.6-21 | Mixed | COCO2017 | V100-SXM3-32GB-H
PyTorch | Mask R-CNN | CNN | 22.03 minutes | 192x V100 | DGX-1 | 0.6-12 | Mixed | COCO2017 | V100-SXM2-16GB
PyTorch | Mask R-CNN | CNN | 18.47 minutes | 192x V100 | DGX-2H | 0.6-23 | Mixed | COCO2017 | V100-SXM3-32GB-H
PyTorch | GNMT | RNN | 20.55 minutes | 8x V100 | DGX-1 | 0.6-9 | Mixed | WMT16 English-German | V100-SXM2-16GB
PyTorch | GNMT | RNN | 10.94 minutes | 16x V100 | DGX-2 | 0.6-18 | Mixed | WMT16 English-German | V100-SXM3-32GB
PyTorch | GNMT | RNN | 9.87 minutes | 16x V100 | DGX-2H | 0.6-20 | Mixed | WMT16 English-German | V100-SXM3-32GB-H
PyTorch | GNMT | RNN | 2.12 minutes | 256x V100 | DGX-2H | 0.6-25 | Mixed | WMT16 English-German | V100-SXM3-32GB-H
PyTorch | GNMT | RNN | 1.99 minutes | 384x V100 | DGX-1 | 0.6-14 | Mixed | WMT16 English-German | V100-SXM2-16GB
PyTorch | GNMT | RNN | 1.8 minutes | 384x V100 | DGX-2H | 0.6-26 | Mixed | WMT16 English-German | V100-SXM3-32GB-H
PyTorch | Transformer | Attention | 20.34 minutes | 8x V100 | DGX-1 | 0.6-9 | Mixed | WMT17 English-German | V100-SXM2-16GB
PyTorch | Transformer | Attention | 11.04 minutes | 16x V100 | DGX-2 | 0.6-18 | Mixed | WMT17 English-German | V100-SXM3-32GB
PyTorch | Transformer | Attention | 9.8 minutes | 16x V100 | DGX-2H | 0.6-20 | Mixed | WMT17 English-German | V100-SXM3-32GB-H
PyTorch | Transformer | Attention | 2.41 minutes | 160x V100 | DGX-2H | 0.6-22 | Mixed | WMT17 English-German | V100-SXM3-32GB-H
PyTorch | Transformer | Attention | 2.05 minutes | 480x V100 | DGX-1 | 0.6-15 | Mixed | WMT17 English-German | V100-SXM2-16GB
PyTorch | Transformer | Attention | 1.59 minutes | 480x V100 | DGX-2H | 0.6-28 | Mixed | WMT17 English-German | V100-SXM3-32GB-H
TensorFlow | MiniGo | Reinforcement Learning | 27.39 minutes | 8x V100 | DGX-1 | 0.6-10 | Mixed | N/A | V100-SXM2-16GB
TensorFlow | MiniGo | Reinforcement Learning | 13.57 minutes | 24x V100 | DGX-1 | 0.6-11 | Mixed | N/A | V100-SXM2-16GB

V100 Training Performance

Framework | Network | Network Type | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | Inception V3 | CNN | 552 images/sec | 1x V100 | DGX-1 | 19.09-py3 | Mixed | 208 | ImageNet2012 | V100-SXM2-16GB
MXNet | Inception V3 | CNN | 563 images/sec | 1x V100 | DGX-2 | 19.09-py3 | Mixed | 384 | ImageNet2012 | V100-SXM3-32GB
MXNet | Inception V3 | CNN | 4253 images/sec | 8x V100 | DGX-1 | 19.09-py3 | Mixed | 208 | ImageNet2012 | V100-SXM2-16GB
MXNet | Inception V3 | CNN | 4345 images/sec | 8x V100 | DGX-2 | 19.09-py3 | Mixed | 384 | ImageNet2012 | V100-SXM3-32GB
MXNet | ResNet-50 | CNN | 1409 images/sec | 1x V100 | DGX-1 | 19.02-py3 | Mixed | 128 | ImageNet2012 | V100-SXM2-16GB
MXNet | ResNet-50 | CNN | 1442 images/sec | 1x V100 | DGX-2 | 19.02-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
MXNet | ResNet-50 | CNN | 10380 images/sec | 8x V100 | DGX-1 | 19.02-py3 | Mixed | 128 | ImageNet2012 | V100-SXM2-16GB
MXNet | ResNet-50 | CNN | 10530 images/sec | 8x V100 | DGX-2 | 19.02-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
MXNet | ResNet-50 v1.5 | CNN | 1461 images/sec | 1x V100 | DGX-1 | 19.10-py3 | Mixed | 208 | ImageNet2012 | V100-SXM2-16GB
MXNet | ResNet-50 v1.5 | CNN | 1660 images/sec | 1x V100 | DGX-2H | 19.10-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB-H
MXNet | ResNet-50 v1.5 | CNN | 10824 images/sec | 8x V100 | DGX-1 | 19.09-py3 | Mixed | 208 | ImageNet2012 | V100-SXM2-16GB
MXNet | ResNet-50 v1.5 | CNN | 11479 images/sec | 8x V100 | DGX-2 | 19.10-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
MXNet | ResNet-50 v1.5 | CNN | 12241 images/sec | 8x V100 | DGX-2H | 19.10-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB-H
PyTorch | Inception V3 | CNN | 571 images/sec | 1x V100 | DGX-1 | 19.09-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB
PyTorch | Inception V3 | CNN | 582 images/sec | 1x V100 | DGX-2 | 19.09-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB
PyTorch | Inception V3 | CNN | 4466 images/sec | 8x V100 | DGX-1 | 19.09-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB
PyTorch | Mask R-CNN | CNN | 15 images/sec | 1x V100 | DGX-1 | 19.10-py3 | Mixed | 16 | COCO2014 | V100-SXM2-32GB
PyTorch | Mask R-CNN | CNN | 17 images/sec | 1x V100 | DGX-2 | 19.09-py3 | Mixed | 16 | COCO2014 | V100-SXM3-32GB
PyTorch | Mask R-CNN | CNN | 95 images/sec | 8x V100 | DGX-1 | 19.10-py3 | Mixed | 16 | COCO2014 | V100-SXM2-32GB
PyTorch | ResNet-50 | CNN | 819 images/sec | 1x V100 | DGX-1 | 19.02-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB
PyTorch | ResNet-50 | CNN | 820 images/sec | 1x V100 | DGX-2 | 19.02-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
PyTorch | ResNet-50 | CNN | 6218 images/sec | 8x V100 | DGX-1 | 19.02-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB
PyTorch | ResNet-50 v1.5 | CNN | 931 images/sec | 1x V100 | DGX-1 | 19.10-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB
PyTorch | ResNet-50 v1.5 | CNN | 1046 images/sec | 1x V100 | DGX-2H | 19.10-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB-H
PyTorch | ResNet-50 v1.5 | CNN | 7308 images/sec | 8x V100 | DGX-1 | 19.10-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB
PyTorch | SSD v1.1 | CNN | 237 images/sec | 1x V100 | DGX-1 | 19.09-py3 | Mixed | 64 | COCO 2017 | V100-SXM2-16GB
PyTorch | SSD v1.1 | CNN | 276 images/sec | 1x V100 | DGX-2 | 19.06-py3 | Mixed | 64 | COCO 2017 | V100-SXM3-32GB
PyTorch | SSD v1.1 | CNN | 2132 images/sec | 8x V100 | DGX-1 | 19.06-py3 | Mixed | 64 | COCO 2017 | V100-SXM2-16GB
PyTorch | Tacotron2 | CNN | 19988 total output mels/sec | 1x V100 | DGX-1 | 19.09-py3 | Mixed | 128 | LJ Speech 1.1 | V100-SXM2-32GB
PyTorch | Tacotron2 | CNN | 23105 total output mels/sec | 1x V100 | DGX-2 | 19.09-py3 | Mixed | 128 | LJ Speech 1.1 | V100-SXM3-32GB
PyTorch | Tacotron2 | CNN | 120654 total output mels/sec | 8x V100 | DGX-1 | 19.10-py3 | Mixed | 128 | LJ Speech 1.1 | V100-SXM2-32GB
PyTorch | Tacotron2 | CNN | 140622 total output mels/sec | 8x V100 | DGX-2 | 19.10-py3 | Mixed | 128 | LJ Speech 1.1 | V100-SXM3-32GB
PyTorch | WaveGlow | CNN | 82692 output samples/sec | 1x V100 | DGX-1 | 19.08-py3 | Mixed | 10 | LJ Speech 1.1 | V100-SXM2-16GB
PyTorch | WaveGlow | CNN | 90150 output samples/sec | 1x V100 | DGX-2 | 19.08-py3 | Mixed | 10 | LJ Speech 1.1 | V100-SXM3-32GB
PyTorch | WaveGlow | CNN | 594479 output samples/sec | 8x V100 | DGX-1 | 19.08-py3 | Mixed | 10 | LJ Speech 1.1 | V100-SXM2-16GB
TensorFlow | Inception V3 | CNN | 544 images/sec | 1x V100 | DGX-1 | 19.09-py3 | Mixed | 384 | ImageNet2012 | V100-SXM2-32GB
TensorFlow | Inception V3 | CNN | 576 images/sec | 1x V100 | DGX-2 | 19.09-py3 | Mixed | 384 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | Inception V3 | CNN | 4154 images/sec | 8x V100 | DGX-1 | 19.09-py3 | Mixed | 384 | ImageNet2012 | V100-SXM2-32GB
TensorFlow | ResNet-50 v1.5 | CNN | 845 images/sec | 1x V100 | DGX-1 | 19.10-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB
TensorFlow | ResNet-50 v1.5 | CNN | 967 images/sec | 1x V100 | DGX-2H | 19.10-py3 | Mixed | 512 | ImageNet2012 | V100-SXM3-32GB-H
TensorFlow | ResNet-50 v1.5 | CNN | 6484 images/sec | 8x V100 | DGX-1 | 19.09-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB
TensorFlow | SSD v1.1 | CNN | 117 images/sec | 1x V100 | DGX-1 | 19.10-py3 | Mixed | 32 | COCO 2017 | V100-SXM2-16GB
TensorFlow | SSD v1.1 | CNN | 127 images/sec | 1x V100 | DGX-2 | 19.10-py3 | Mixed | 32 | COCO 2017 | V100-SXM3-32GB
TensorFlow | SSD v1.1 | CNN | 672 images/sec | 8x V100 | DGX-1 | 19.06-py3 | Mixed | 32 | COCO 2017 | V100-SXM2-16GB
TensorFlow | SSD v1.1 | CNN | 768 images/sec | 8x V100 | DGX-2 | 19.06-py3 | Mixed | 32 | COCO 2017 | V100-SXM3-32GB
TensorFlow | U-Net Industrial | CNN | 104 images/sec | 1x V100 | DGX-1 | 19.10-py3 | Mixed | 16 | DAGM2007 | V100-SXM2-16GB
TensorFlow | U-Net Industrial | CNN | 106 images/sec | 1x V100 | DGX-2 | 19.10-py3 | Mixed | 16 | DAGM2007 | V100-SXM3-32GB
TensorFlow | U-Net Industrial | CNN | 516 images/sec | 8x V100 | DGX-1 | 19.10-py3 | Mixed | 2 | DAGM2007 | V100-SXM2-16GB
TensorFlow | U-Net Industrial | CNN | 543 images/sec | 8x V100 | DGX-2 | 19.10-py3 | Mixed | 2 | DAGM2007 | V100-SXM3-32GB
PyTorch | GNMT V2 | RNN | 76722 total tokens/sec | 1x V100 | DGX-1 | 19.08-py3 | Mixed | 512 | WMT16 English-German | V100-SXM2-32GB
PyTorch | GNMT V2 | RNN | 83948 total tokens/sec | 1x V100 | DGX-2 | 19.08-py3 | Mixed | 512 | WMT16 English-German | V100-SXM3-32GB
PyTorch | GNMT V2 | RNN | 585249 total tokens/sec | 8x V100 | DGX-1 | 19.08-py3 | Mixed | 512 | WMT16 English-German | V100-SXM2-32GB
PyTorch | GNMT V2 | RNN | 606769 total tokens/sec | 8x V100 | DGX-2 | 19.08-py3 | Mixed | 512 | WMT16 English-German | V100-SXM3-32GB
TensorFlow | GNMT V2 | RNN | 26263 total tokens/sec | 1x V100 | DGX-1 | 19.09-py3 | Mixed | 192 | WMT16 English-German | V100-SXM2-32GB
TensorFlow | GNMT V2 | RNN | 29384 total tokens/sec | 1x V100 | DGX-2 | 19.09-py3 | Mixed | 192 | WMT16 English-German | V100-SXM3-32GB
TensorFlow | GNMT V2 | RNN | 174733 total tokens/sec | 8x V100 | DGX-1 | 19.09-py3 | Mixed | 192 | WMT16 English-German | V100-SXM2-32GB
PyTorch | NCF | Recommender | 21922278 samples/sec | 1x V100 | DGX-1 | 19.08-py3 | Mixed | 1048576 | MovieLens 20 Million | V100-SXM2-16GB
PyTorch | NCF | Recommender | 22009822 samples/sec | 1x V100 | DGX-2 | 19.08-py3 | Mixed | 1048576 | MovieLens 20 Million | V100-SXM3-32GB
PyTorch | NCF | Recommender | 104122673 samples/sec | 8x V100 | DGX-1 | 19.07-py3 | Mixed | 1048576 | MovieLens 20 Million | V100-SXM2-16GB
PyTorch | NCF | Recommender | 109969915 samples/sec | 8x V100 | DGX-2H | 19.07-py3 | Mixed | 1048576 | MovieLens 20 Million | V100-SXM3-32GB-H
TensorFlow | NCF | Recommender | 26071208 samples/sec | 1x V100 | DGX-1 | 19.08-py3 | Mixed | 1048576 | MovieLens 20 Million | V100-SXM2-16GB
TensorFlow | NCF | Recommender | 57083723 samples/sec | 8x V100 | DGX-1 | 19.10-py3 | Mixed | 1048576 | MovieLens 20 Million | V100-SXM2-16GB
PyTorch | BERT-LARGE | Attention | 50 sentences/sec | 1x V100 | DGX-1 | 19.09-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM2-32GB
PyTorch | BERT-LARGE | Attention | 372 sentences/sec | 8x V100 | DGX-2 | 19.10-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB
TensorFlow | BERT-LARGE | Attention | 34 sentences/sec | 1x V100 | DGX-1 | 19.10-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM2-32GB
TensorFlow | BERT-LARGE | Attention | 37 sentences/sec | 1x V100 | DGX-2 | 19.10-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB
TensorFlow | BERT-LARGE | Attention | 182 sentences/sec | 8x V100 | DGX-1 | 19.10-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM2-32GB
TensorFlow | BERT-LARGE | Attention | 195 sentences/sec | 8x V100 | DGX-2 | 19.10-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB

T4 Training Performance

Framework | Network | Network Type | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | Inception V3 | CNN | 180 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 208 | ImageNet2012 | Tesla T4
MXNet | Inception V3 | CNN | 1379 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | Mixed | 208 | ImageNet2012 | Tesla T4
MXNet | ResNet-50 v1.5 | CNN | 480 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | Mixed | 208 | ImageNet2012 | Tesla T4
MXNet | ResNet-50 v1.5 | CNN | 4116 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.05-py3 | Mixed | 208 | ImageNet2012 | Tesla T4
PyTorch | Inception V3 | CNN | 185 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
PyTorch | Inception V3 | CNN | 1372 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
PyTorch | Mask R-CNN | CNN | 7 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 4 | COCO2014 | Tesla T4
PyTorch | Mask R-CNN | CNN | 40 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 4 | COCO2014 | Tesla T4
PyTorch | ResNet-50 v1.5 | CNN | 288 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
PyTorch | ResNet-50 v1.5 | CNN | 2307 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
PyTorch | SSD v1.1 | CNN | 85 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.06-py3 | Mixed | 64 | COCO 2017 | Tesla T4
PyTorch | SSD v1.1 | CNN | 691 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.06-py3 | Mixed | 64 | COCO 2017 | Tesla T4
PyTorch | Tacotron2 | CNN | 14538 total output mels/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | Mixed | 128 | LJ Speech 1.1 | Tesla T4
PyTorch | Tacotron2 | CNN | 103679 total output mels/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | Mixed | 128 | LJ Speech 1.1 | Tesla T4
PyTorch | WaveGlow | CNN | 33271 output samples/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | Mixed | 10 | LJ Speech 1.1 | Tesla T4
PyTorch | WaveGlow | CNN | 247439 output samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | Mixed | 10 | LJ Speech 1.1 | Tesla T4
TensorFlow | Inception V3 | CNN | 177 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | Mixed | 192 | ImageNet2012 | Tesla T4
TensorFlow | Inception V3 | CNN | 1334 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | Mixed | 192 | ImageNet2012 | Tesla T4
TensorFlow | ResNet-50 v1.5 | CNN | 265 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
TensorFlow | ResNet-50 v1.5 | CNN | 2121 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.06-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
TensorFlow | SSD v1.1 | CNN | 51 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | Mixed | 32 | COCO 2017 | Tesla T4
TensorFlow | SSD v1.1 | CNN | 279 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | Mixed | 32 | COCO 2017 | Tesla T4
TensorFlow | U-Net Industrial | CNN | 28 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | Mixed | 16 | DAGM2007 | Tesla T4
TensorFlow | U-Net Industrial | CNN | 191 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | Mixed | 2 | DAGM2007 | Tesla T4
PyTorch | GNMT V2 | RNN | 26083 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 256 | WMT16 English-German | Tesla T4
PyTorch | GNMT V2 | RNN | 181049 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 256 | WMT16 English-German | Tesla T4
TensorFlow | GNMT V2 | RNN | 11771 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | Mixed | 192 | WMT16 English-German | Tesla T4
TensorFlow | GNMT V2 | RNN | 58338 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | Mixed | 128 | WMT16 English-German | Tesla T4
PyTorch | NCF | Recommender | 7502907 samples/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 1048576 | MovieLens 20 Million | Tesla T4
PyTorch | NCF | Recommender | 26544719 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 1048576 | MovieLens 20 Million | Tesla T4
TensorFlow | NCF | Recommender | 10025545 samples/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 1048576 | MovieLens 20 Million | Tesla T4
TensorFlow | NCF | Recommender | 19050484 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.06-py3 | Mixed | 1048576 | MovieLens 20 Million | Tesla T4
TensorFlow | BERT | Attention | 9 sentences/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | Mixed | 3 | SQuAD v1.1 | Tesla T4
TensorFlow | BERT | Attention | 39 sentences/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.10-py3 | Mixed | 3 | SQuAD v1.1 | Tesla T4

 

NVIDIA® TensorRT™ running on NVIDIA GPUs enables the most efficient deep learning inference performance across multiple application areas and models. This versatility gives data scientists wide latitude to create the optimal low-latency solution. Visit NVIDIA GPU Cloud (NGC) to download any of these containers.
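
As a rough sketch of the TensorRT workflow (illustrative only: the ONNX file name, workspace size, and settings here are hypothetical, this uses the TensorRT 6-era Python API, and it is not the exact configuration behind the measurements below), building an FP16 engine from an exported model looks roughly like this:

```python
# Illustrative TensorRT engine build (TensorRT 6-era Python API). The model file
# and settings are hypothetical, not the setup used for the results on this page.
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
# Explicit-batch network definition, as expected by the ONNX parser.
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("resnet50_v1_5.onnx", "rb") as f:   # hypothetical exported model
    if not parser.parse(f.read()):
        raise RuntimeError("Failed to parse the ONNX model")

builder.max_workspace_size = 1 << 30          # 1 GiB of scratch space for kernel selection
builder.fp16_mode = True                      # allow FP16 Tensor Core kernels

engine = builder.build_cuda_engine(network)   # build, then serialize for deployment
with open("resnet50_v1_5.plan", "wb") as f:
    f.write(engine.serialize())
```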

NVIDIA Tesla® V100 Tensor Core GPUs leverage mixed precision to combine high throughput with low latency across every type of neural network. The Tesla P4 is an inference GPU designed for optimal power consumption and latency in ultra-efficient scale-out servers. Read the inference whitepaper to learn more about NVIDIA’s inference platform.

Measuring inference performance involves balancing many variables. PLASTER is an acronym that describes the key elements for measuring deep learning performance. Each letter identifies a factor (Programmability, Latency, Accuracy, Size of Model, Throughput, Energy Efficiency, Rate of Learning) that must be considered to arrive at the right set of tradeoffs and produce a successful deep learning implementation. Refer to NVIDIA’s PLASTER whitepaper for more details.
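
As a quick sanity check when reading the inference tables below (a back-of-the-envelope relationship, not NVIDIA's measurement methodology): batch size, throughput, and latency are tied together, and the Efficiency column is throughput divided by board power. For example:

```python
# Back-of-the-envelope relations for reading the inference tables below.
# Illustrative only: the exact measurement methodology is NVIDIA's, and the
# board power used here (70 W) is simply the Tesla T4's rated TDP.
def approx_throughput(batch_size: int, latency_ms: float) -> float:
    """Images (or samples) per second implied by one batch per latency window."""
    return batch_size / (latency_ms / 1000.0)

def efficiency(throughput_per_sec: float, board_power_watts: float) -> float:
    """Throughput per watt, as reported in the Efficiency column."""
    return throughput_per_sec / board_power_watts

# Example: the T4 ResNet-50 entry at batch size 128 with ~23 ms latency
print(approx_throughput(128, 23))   # ~5565 images/sec, close to the table's 5681
print(efficiency(5681, 70))         # ~81 images/sec/watt, matching the table
```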

Inference Image Classification on CNNs with TensorRT

ResNet-50 v1.5 Throughput

DGX-1: 1x Tesla V100-SXM2-16GB, E5-2698 v4 2.2 GHz | TensorRT 6.0 | Batch Size = 128 | 19.09-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-4029GP-TRT T4: 1x Tesla T4, Gold 6240 2.6 GHz | TensorRT 6.0 | Batch Size = 128 | 19.09-py3 | Precision: INT8 | Dataset: Synthetic

 
 

ResNet-50 v1.5 Latency

DGX-1: 1x Tesla V100-SXM2-16GB, E5-2698 v4 2.2 GHz | TensorRT 6.0 | Batch Size = 1 | 19.09-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-4029GP-TRT T4: 1x Tesla T4, Gold 6240 2.6 GHz | TensorRT 6.0 | Batch Size = 1 | 19.09-py3 | Precision: INT8 | Dataset: Synthetic

 
 

ResNet-50 v1.5 Power Efficiency

DGX-1: 1x Tesla V100-SXM2-16GB, E5-2698 v4 2.2 GHz | TensorRT 6.0 | Batch Size = 128 | 19.09-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-4029GP-TRT T4: 1x Tesla T4, Gold 6240 2.6 GHz | TensorRT 6.0 | Batch Size = 128 | 19.09-py3 | Precision: INT8 | Dataset: Synthetic

 

Inference Performance

V100 Inference Performance

Network | Network Type | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | GPU Version
GoogleNet | CNN | 1 | 1610 images/sec | 15 images/sec/watt | 0.62 | 1x V100 | DGX-1 | 19.08-py3 | INT8 | Synthetic | V100-SXM2-16GB
GoogleNet | CNN | 2 | 2162 images/sec | 18 images/sec/watt | 0.93 | 1x V100 | DGX-1 | 19.07-py3 | INT8 | Synthetic | V100-SXM2-16GB
GoogleNet | CNN | 8 | 5368 images/sec | 35 images/sec/watt | 1.5 | 1x V100 | DGX-2 | 19.09-py3 | INT8 | Synthetic | V100-SXM3-32GB
GoogleNet | CNN | 82 | 11869 images/sec | 45 images/sec/watt | 6.9 | 1x V100 | DGX-1 | 19.07-py3 | INT8 | Synthetic | V100-SXM2-16GB
GoogleNet | CNN | 128 | 12697 images/sec | 47 images/sec/watt | 10 | 1x V100 | DGX-2 | 19.09-py3 | INT8 | Synthetic | V100-SXM3-32GB
MobileNet V1 | CNN | 1 | 4543 images/sec | 30 images/sec/watt | 0.22 | 1x V100 | DGX-2 | 19.09-py3 | INT8 | Synthetic | V100-SXM3-32GB
MobileNet V1 | CNN | 2 | 6426 images/sec | 47 images/sec/watt | 0.31 | 1x V100 | DGX-1 | 19.09-py3 | INT8 | Synthetic | V100-SXM2-16GB
MobileNet V1 | CNN | 8 | 14788 images/sec | 96 images/sec/watt | 0.54 | 1x V100 | DGX-1 | 19.07-py3 | INT8 | Synthetic | V100-SXM2-32GB
MobileNet V1 | CNN | 128 | 29914 images/sec | 104 images/sec/watt | 4.3 | 1x V100 | DGX-1 | 19.06-py3 | INT8 | Synthetic | V100-SXM2-16GB
ResNet-50 | CNN | 1 | 1156 images/sec | 8.7 images/sec/watt | 0.87 | 1x V100 | DGX-1 | 19.07-py3 | INT8 | Synthetic | V100-SXM2-16GB
ResNet-50 | CNN | 2 | 1612 images/sec | 10 images/sec/watt | 1.2 | 1x V100 | DGX-2 | 19.08-py3 | INT8 | Synthetic | V100-SXM3-32GB
ResNet-50 | CNN | 8 | 3315 images/sec | 21 images/sec/watt | 2.4 | 1x V100 | DGX-1 | 19.07-py3 | Mixed | Synthetic | V100-SXM2-16GB
ResNet-50 | CNN | 128 | 7720 images/sec | 27 images/sec/watt | 17 | 1x V100 | DGX-1 | 19.06-py3 | Mixed | Synthetic | V100-SXM2-16GB
ResNet-50 | CNN | 128 | 7907 images/sec | 23 images/sec/watt | 16 | 1x V100 | DGX-2 | 19.09-py3 | Mixed | Synthetic | V100-SXM3-32GB
ResNet-50 v1.5 | CNN | 1 | 955 images/sec | 7.6 images/sec/watt | 1.1 | 1x V100 | DGX-1 | 19.09-py3 | INT8 | Synthetic | V100-SXM2-16GB
ResNet-50 v1.5 | CNN | 2 | 1407 images/sec | 9.8 images/sec/watt | 1.4 | 1x V100 | DGX-1 | 19.06-py3 | INT8 | Synthetic | V100-SXM2-16GB
ResNet-50 v1.5 | CNN | 8 | 3226 images/sec | 20 images/sec/watt | 2.5 | 1x V100 | DGX-1 | 19.08-py3 | Mixed | Synthetic | V100-SXM2-16GB
ResNet-50 v1.5 | CNN | 128 | 7226 images/sec | 25 images/sec/watt | 18 | 1x V100 | DGX-1 | 19.09-py3 | Mixed | Synthetic | V100-SXM2-16GB
ResNet-50 v1.5 | CNN | 128 | 7517 images/sec | 22 images/sec/watt | 17 | 1x V100 | DGX-2 | 19.09-py3 | Mixed | Synthetic | V100-SXM3-32GB
VGG16 | CNN | 1 | 821 images/sec | 4 images/sec/watt | 1.2 | 1x V100 | DGX-1 | 19.07-py3 | INT8 | Synthetic | V100-SXM2-16GB
VGG16 | CNN | 2 | 1145 images/sec | 5.5 images/sec/watt | 1.8 | 1x V100 | DGX-1 | 19.06-py3 | Mixed | Synthetic | V100-SXM2-16GB
VGG16 | CNN | 8 | 2067 images/sec | 8.2 images/sec/watt | 3.9 | 1x V100 | DGX-1 | 19.06-py3 | Mixed | Synthetic | V100-SXM2-16GB
VGG16 | CNN | 128 | 2845 images/sec | 9.7 images/sec/watt | 45 | 1x V100 | DGX-1 | 19.07-py3 | Mixed | Synthetic | V100-SXM2-16GB
NMT | RNN | 1 | 4013 total tokens/sec | - | 13 | 1x V100 | DGX-1 | - | Mixed | WMT16 English-German | V100-SXM2-32GB
NMT | RNN | 2 | 6290 total tokens/sec | - | 16 | 1x V100 | DGX-1 | - | Mixed | WMT16 English-German | V100-SXM2-32GB
NMT | RNN | 64 | 56531 total tokens/sec | - | 58 | 1x V100 | DGX-1 | - | Mixed | WMT16 English-German | V100-SXM2-32GB
NMT | RNN | 128 | 73375 total tokens/sec | - | 89 | 1x V100 | DGX-1 | - | Mixed | WMT16 English-German | V100-SXM2-32GB
NCF | Recommender | 1048576 | 61130538 samples/sec | - | - | 1x V100 | DGX-1 | 19.08-py3 | Mixed | MovieLens 20 Million | V100-SXM2-16GB
BERT-BASE | Attention | 1 | 557 sentences/sec | 10.3 sentences/sec/watt | 1.8 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | V100-PCIE-16GB
BERT-BASE | Attention | 2 | 978 sentences/sec | 18.8 sentences/sec/watt | 2 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | V100-PCIE-16GB
BERT-BASE | Attention | 8 | 1847 sentences/sec | 34.1 sentences/sec/watt | 4.3 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | V100-PCIE-16GB
BERT-BASE | Attention | 24 | 2419 sentences/sec | 43.7 sentences/sec/watt | 9.9 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | V100-PCIE-16GB
BERT-BASE | Attention | 128 | 2645 sentences/sec | 46 sentences/sec/watt | 48.4 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | V100-PCIE-16GB
BERT-LARGE | Attention | 1 | 239 sentences/sec | 4.3 sentences/sec/watt | 4.2 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | V100-PCIE-16GB
BERT-LARGE | Attention | 2 | 407 sentences/sec | 7.5 sentences/sec/watt | 4.9 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | V100-PCIE-16GB
BERT-LARGE | Attention | 4 | 562 sentences/sec | 10.6 sentences/sec/watt | 7.1 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | V100-PCIE-16GB
BERT-LARGE | Attention | 8 | 636 sentences/sec | 11.8 sentences/sec/watt | 12.6 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | V100-PCIE-16GB
BERT-LARGE | Attention | 128 | 823 sentences/sec | 13.6 sentences/sec/watt | 155.5 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | V100-PCIE-16GB

TensorRT 6.0 and sequence length=128 for BERT-BASE and BERT-LARGE | PyTorch for NCF | TensorRT 5.1 and TensorRT 6.0 for all other models | Efficiency based on board power

 

T4 Inference Performance

Network | Network Type | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | GPU Version
GoogleNet | CNN | 1 | 1745 images/sec | 28 images/sec/watt | 0.58 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | Tesla T4
GoogleNet | CNN | 2 | 2409 images/sec | 38 images/sec/watt | 0.83 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | Tesla T4
GoogleNet | CNN | 8 | 6282 images/sec | 91 images/sec/watt | 1.3 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | Tesla T4
GoogleNet | CNN | 52 | 7580 images/sec | 109 images/sec/watt | 6.9 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.07-py3 | INT8 | Synthetic | Tesla T4
GoogleNet | CNN | 128 | 8968 images/sec | 128 images/sec/watt | 14 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | Tesla T4
MobileNet V1 | CNN | 1 | 4442 images/sec | 80 images/sec/watt | 0.23 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | Tesla T4
MobileNet V1 | CNN | 2 | 8057 images/sec | 129 images/sec/watt | 0.25 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | Tesla T4
MobileNet V1 | CNN | 8 | 14660 images/sec | 210 images/sec/watt | 0.55 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | Tesla T4
MobileNet V1 | CNN | 128 | 17768 images/sec | 254 images/sec/watt | 7.2 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 | CNN | 1 | 1162 images/sec | 17 images/sec/watt | 0.86 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 | CNN | 2 | 1804 images/sec | 26 images/sec/watt | 1.1 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 | CNN | 8 | 4060 images/sec | 58 images/sec/watt | 2 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 | CNN | 128 | 5681 images/sec | 81 images/sec/watt | 23 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 v1.5 | CNN | 1 | 1109 images/sec | 16 images/sec/watt | 0.9 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 v1.5 | CNN | 2 | 1730 images/sec | 26 images/sec/watt | 1.2 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 v1.5 | CNN | 8 | 3922 images/sec | 56 images/sec/watt | 2 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 v1.5 | CNN | 128 | 5344 images/sec | 77 images/sec/watt | 24 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | Tesla T4
VGG16 | CNN | 1 | 792 images/sec | 11 images/sec/watt | 1.3 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | Tesla T4
VGG16 | CNN | 2 | 1083 images/sec | 16 images/sec/watt | 1.9 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | Tesla T4
VGG16 | CNN | 8 | 1670 images/sec | 24 images/sec/watt | 4.8 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.07-py3 | INT8 | Synthetic | Tesla T4
VGG16 | CNN | 128 | 1956 images/sec | 28 images/sec/watt | 65 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.07-py3 | INT8 | Synthetic | Tesla T4
NCF | Recommender | 1 | 12394 samples/sec | 442 samples/sec/watt | 0.08 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | Mixed | Synthetic | Tesla T4
NCF | Recommender | 64 | 697677 samples/sec | 23933 samples/sec/watt | 0.09 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | Mixed | Synthetic | Tesla T4
NCF | Recommender | 25000 | 51102744 samples/sec | 731187 samples/sec/watt | 0.49 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | Tesla T4
NCF | Recommender | 100000 | 55301530 samples/sec | 795155 samples/sec/watt | 1.8 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | INT8 | Synthetic | Tesla T4
BERT-BASE | Attention | 1 | 484 sentences/sec | 11 sentences/sec/watt | 2.07 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | Tesla T4
BERT-BASE | Attention | 2 | 754 sentences/sec | 17 sentences/sec/watt | 2.65 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | Tesla T4
BERT-BASE | Attention | 8 | 827 sentences/sec | 20 sentences/sec/watt | 9.67 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | Tesla T4
BERT-BASE | Attention | 128 | 800 sentences/sec | 16 sentences/sec/watt | 160.02 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | Tesla T4
BERT-LARGE | Attention | 1 | 171 sentences/sec | 4 sentences/sec/watt | 5.84 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | Tesla T4
BERT-LARGE | Attention | 2 | 168 sentences/sec | 4 sentences/sec/watt | 11.88 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | Tesla T4
BERT-LARGE | Attention | 8 | 244 sentences/sec | 6 sentences/sec/watt | 32.74 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | Tesla T4
BERT-LARGE | Attention | 128 | 254 sentences/sec | 5 sentences/sec/watt | 504.33 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | Tesla T4

TensorRT 6.0 and sequence length=128 for BERT-BASE and BERT-LARGE | TensorRT 5.1 and TensorRT 6.0 for all other models | Efficiency based on board power | NCF uses the Facebook dataset

 

Last updated: Nov 6th, 2019