

NVIDIA’s complete solution stack, from GPUs to libraries to containers on NVIDIA GPU Cloud (NGC), allows data scientists to get up and running with deep learning quickly. NVIDIA® Tesla® V100 Tensor Core GPUs leverage mixed precision to accelerate deep learning training throughput across every framework and every type of neural network. NVIDIA breaks performance records on MLPerf, AI’s first industry-wide benchmark, a testament to our GPU-accelerated platform approach.
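Mixed precision here means FP16 math on Tensor Cores with FP32 master weights and loss scaling, handled by each framework's automatic mixed precision (AMP) support. As a minimal sketch, assuming a recent PyTorch with torch.cuda.amp (the NGC containers behind these results may use NVIDIA Apex instead), a mixed-precision training step looks roughly like this:

    import torch
    from torch import nn

    # Toy model and synthetic data stand in for the real networks benchmarked below.
    model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    scaler = torch.cuda.amp.GradScaler()   # dynamic loss scaling keeps small FP16 gradients from underflowing

    for step in range(100):
        inputs = torch.randn(256, 1024, device="cuda")
        targets = torch.randint(0, 10, (256,), device="cuda")
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():    # forward pass runs in mixed precision on Tensor Cores
            loss = nn.functional.cross_entropy(model(inputs), targets)
        scaler.scale(loss).backward()      # backward pass on the scaled loss
        scaler.step(optimizer)             # unscales gradients, then takes the optimizer step
        scaler.update()                    # adjusts the loss scale for the next iteration

The same pattern applies to every network in the tables; the speedups come from Tensor Cores executing the FP16 matrix math.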

NVIDIA Performance on MLPerf 0.6 AI Benchmarks

ResNet-50 v1.5 Time to Solution on V100

MXNet | Batch size: refer to the CNN V100 Training table below | Precision: Mixed | Dataset: ImageNet2012 | Convergence criteria: refer to MLPerf requirements

Training Image Classification on CNNs

ResNet-50 V1.5 Throughput on V100

DGX-1: 8x Tesla V100 (V100-SXM2-32GB for MXNet; V100-SXM2-16GB for PyTorch and TensorFlow), E5-2698 v4 2.2 GHz | Batch Size = 256 | MXNet = 19.06-py3, PyTorch and TensorFlow = 19.07-py3 | Precision: Mixed | Dataset: ImageNet2012

ResNet-50 V1.5 Throughput on T4

Supermicro SYS-4029GP-TRT T4: 8x Tesla T4, Gold 6140 2.3 GHz | Batch Size = 208 for MXNet, 256 for PyTorch and TensorFlow | MXNet = 19.05-py3, PyTorch = 19.07-py3, TensorFlow = 19.06-py3 | Precision: Mixed | Dataset: ImageNet2012

Training Performance

NVIDIA Performance on MLPerf 0.6 AI Benchmarks

Framework | Network | Network Type | Time to Solution | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
MXNet | ResNet-50 v1.5 | CNN | 115.22 minutes | 8x V100 | DGX-1 | 0.6-8 | Mixed | ImageNet2012 | V100-SXM2-16GB
MXNet | ResNet-50 v1.5 | CNN | 57.87 minutes | 16x V100 | DGX-2 | 0.6-17 | Mixed | ImageNet2012 | V100-SXM3-32GB
MXNet | ResNet-50 v1.5 | CNN | 52.74 minutes | 16x V100 | DGX-2H | 0.6-19 | Mixed | ImageNet2012 | V100-SXM3-32GB-H
MXNet | ResNet-50 v1.5 | CNN | 2.59 minutes | 512x V100 | DGX-2H | 0.6-29 | Mixed | ImageNet2012 | V100-SXM3-32GB-H
MXNet | ResNet-50 v1.5 | CNN | 1.69 minutes | 1040x V100 | DGX-1 | 0.6-16 | Mixed | ImageNet2012 | V100-SXM2-16GB
MXNet | ResNet-50 v1.5 | CNN | 1.33 minutes | 1536x V100 | DGX-2H | 0.6-30 | Mixed | ImageNet2012 | V100-SXM3-32GB-H
PyTorch | SSD-ResNet-34 | CNN | 22.36 minutes | 8x V100 | DGX-1 | 0.6-9 | Mixed | COCO2017 | V100-SXM2-16GB
PyTorch | SSD-ResNet-34 | CNN | 12.21 minutes | 16x V100 | DGX-2 | 0.6-18 | Mixed | COCO2017 | V100-SXM3-32GB
PyTorch | SSD-ResNet-34 | CNN | 11.41 minutes | 16x V100 | DGX-2H | 0.6-20 | Mixed | COCO2017 | V100-SXM3-32GB-H
PyTorch | SSD-ResNet-34 | CNN | 4.78 minutes | 64x V100 | DGX-2H | 0.6-21 | Mixed | COCO2017 | V100-SXM3-32GB-H
PyTorch | SSD-ResNet-34 | CNN | 2.67 minutes | 240x V100 | DGX-1 | 0.6-13 | Mixed | COCO2017 | V100-SXM2-16GB
PyTorch | SSD-ResNet-34 | CNN | 2.56 minutes | 240x V100 | DGX-2H | 0.6-24 | Mixed | COCO2017 | V100-SXM3-32GB-H
PyTorch | SSD-ResNet-34 | CNN | 2.23 minutes | 240x V100 | DGX-2H | 0.6-27 | Mixed | COCO2017 | V100-SXM3-32GB-H
PyTorch | Mask R-CNN | CNN | 207.48 minutes | 8x V100 | DGX-1 | 0.6-9 | Mixed | COCO2017 | V100-SXM2-16GB
PyTorch | Mask R-CNN | CNN | 101 minutes | 16x V100 | DGX-2 | 0.6-18 | Mixed | COCO2017 | V100-SXM3-32GB
PyTorch | Mask R-CNN | CNN | 95.2 minutes | 16x V100 | DGX-2H | 0.6-20 | Mixed | COCO2017 | V100-SXM3-32GB-H
PyTorch | Mask R-CNN | CNN | 32.72 minutes | 64x V100 | DGX-2H | 0.6-21 | Mixed | COCO2017 | V100-SXM3-32GB-H
PyTorch | Mask R-CNN | CNN | 22.03 minutes | 192x V100 | DGX-1 | 0.6-12 | Mixed | COCO2017 | V100-SXM2-16GB
PyTorch | Mask R-CNN | CNN | 18.47 minutes | 192x V100 | DGX-2H | 0.6-23 | Mixed | COCO2017 | V100-SXM3-32GB-H
PyTorch | GNMT | RNN | 20.55 minutes | 8x V100 | DGX-1 | 0.6-9 | Mixed | WMT16 English-German | V100-SXM2-16GB
PyTorch | GNMT | RNN | 10.94 minutes | 16x V100 | DGX-2 | 0.6-18 | Mixed | WMT16 English-German | V100-SXM3-32GB
PyTorch | GNMT | RNN | 9.87 minutes | 16x V100 | DGX-2H | 0.6-20 | Mixed | WMT16 English-German | V100-SXM3-32GB-H
PyTorch | GNMT | RNN | 2.12 minutes | 256x V100 | DGX-2H | 0.6-25 | Mixed | WMT16 English-German | V100-SXM3-32GB-H
PyTorch | GNMT | RNN | 1.99 minutes | 384x V100 | DGX-1 | 0.6-14 | Mixed | WMT16 English-German | V100-SXM2-16GB
PyTorch | GNMT | RNN | 1.8 minutes | 384x V100 | DGX-2H | 0.6-26 | Mixed | WMT16 English-German | V100-SXM3-32GB-H
PyTorch | Transformer | Attention | 20.34 minutes | 8x V100 | DGX-1 | 0.6-9 | Mixed | WMT17 English-German | V100-SXM2-16GB
PyTorch | Transformer | Attention | 11.04 minutes | 16x V100 | DGX-2 | 0.6-18 | Mixed | WMT17 English-German | V100-SXM3-32GB
PyTorch | Transformer | Attention | 9.8 minutes | 16x V100 | DGX-2H | 0.6-20 | Mixed | WMT17 English-German | V100-SXM3-32GB-H
PyTorch | Transformer | Attention | 2.41 minutes | 160x V100 | DGX-2H | 0.6-22 | Mixed | WMT17 English-German | V100-SXM3-32GB-H
PyTorch | Transformer | Attention | 2.05 minutes | 480x V100 | DGX-1 | 0.6-15 | Mixed | WMT17 English-German | V100-SXM2-16GB
PyTorch | Transformer | Attention | 1.59 minutes | 480x V100 | DGX-2H | 0.6-28 | Mixed | WMT17 English-German | V100-SXM3-32GB-H
TensorFlow | MiniGo | Reinforcement Learning | 27.39 minutes | 8x V100 | DGX-1 | 0.6-10 | Mixed | N/A | V100-SXM2-16GB
TensorFlow | MiniGo | Reinforcement Learning | 13.57 minutes | 24x V100 | DGX-1 | 0.6-11 | Mixed | N/A | V100-SXM2-16GB

V100 Training Performance

Framework | Network | Network Type | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | Inception V3 | CNN | 547 images/sec | 1x V100 | DGX-1 | 19.08-py3 | Mixed | 128 | ImageNet2012 | V100-SXM2-16GB
MXNet | Inception V3 | CNN | 607 images/sec | 1x V100 | DGX-2H | 19.08-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB-H
MXNet | Inception V3 | CNN | 4216 images/sec | 8x V100 | DGX-1 | 19.08-py3 | Mixed | 128 | ImageNet2012 | V100-SXM2-16GB
MXNet | Inception V3 | CNN | 4652 images/sec | 8x V100 | DGX-2H | 19.08-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB-H
MXNet | ResNet-50 | CNN | 1409 images/sec | 1x V100 | DGX-1 | 19.02-py3 | Mixed | 128 | ImageNet2012 | V100-SXM2-16GB
MXNet | ResNet-50 | CNN | 1442 images/sec | 1x V100 | DGX-2 | 19.02-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
MXNet | ResNet-50 | CNN | 10380 images/sec | 8x V100 | DGX-1 | 19.02-py3 | Mixed | 128 | ImageNet2012 | V100-SXM2-16GB
MXNet | ResNet-50 | CNN | 10530 images/sec | 8x V100 | DGX-2 | 19.02-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
MXNet | ResNet-50 v1.5 | CNN | 1437 images/sec | 1x V100 | DGX-1 | 19.08-py3 | Mixed | 208 | ImageNet2012 | V100-SXM2-16GB
MXNet | ResNet-50 v1.5 | CNN | 1604 images/sec | 1x V100 | DGX-2H | 19.08-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB-H
MXNet | ResNet-50 v1.5 | CNN | 9566 images/sec | 8x V100 | DGX-1 | 19.06-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-32GB
MXNet | ResNet-50 v1.5 | CNN | 11056 images/sec | 8x V100 | DGX-2 | 19.05-py3 | Mixed | 128 | ImageNet2012 | V100-SXM3-32GB
MXNet | ResNet-50 v1.5 | CNN | 11507 images/sec | 8x V100 | DGX-2H | 19.05-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB-H
PyTorch | Inception V3 | CNN | 569 images/sec | 1x V100 | DGX-1 | 19.08-py3 | Mixed | 128 | ImageNet2012 | V100-SXM2-32GB
PyTorch | Inception V3 | CNN | 632 images/sec | 1x V100 | DGX-2H | 19.08-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB-H
PyTorch | Inception V3 | CNN | 4269 images/sec | 8x V100 | DGX-1 | 19.08-py3 | Mixed | 128 | ImageNet2012 | V100-SXM2-16GB
PyTorch | Mask R-CNN | CNN | 14 images/sec | 1x V100 | DGX-1 | 19.07-py3 | Mixed | 4 | COCO2014 | V100-SXM2-32GB
PyTorch | Mask R-CNN | CNN | 17 images/sec | 1x V100 | DGX-2H | 19.05-py3 | Mixed | 16 | COCO2014 | V100-SXM3-32GB-H
PyTorch | Mask R-CNN | CNN | 93 images/sec | 8x V100 | DGX-1 | 19.08-py3 | Mixed | 16 | COCO2014 | V100-SXM2-32GB
PyTorch | ResNet-50 | CNN | 819 images/sec | 1x V100 | DGX-1 | 19.02-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB
PyTorch | ResNet-50 | CNN | 820 images/sec | 1x V100 | DGX-2 | 19.02-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
PyTorch | ResNet-50 | CNN | 6218 images/sec | 8x V100 | DGX-1 | 19.02-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB
PyTorch | ResNet-50 v1.5 | CNN | 928 images/sec | 1x V100 | DGX-1 | 19.07-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB
PyTorch | ResNet-50 v1.5 | CNN | 1036 images/sec | 1x V100 | DGX-2H | 19.07-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB-H
PyTorch | ResNet-50 v1.5 | CNN | 7288 images/sec | 8x V100 | DGX-1 | 19.07-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB
PyTorch | SSD v1.1 | CNN | 237 images/sec | 1x V100 | DGX-1 | 19.08-py3 | Mixed | 64 | COCO 2017 | V100-SXM2-16GB
PyTorch | SSD v1.1 | CNN | 299 images/sec | 1x V100 | DGX-2H | 19.05-py3 | Mixed | 64 | COCO 2017 | V100-SXM3-32GB-H
PyTorch | SSD v1.1 | CNN | 2132 images/sec | 8x V100 | DGX-1 | 19.06-py3 | Mixed | 64 | COCO 2017 | V100-SXM2-16GB
PyTorch | Tacotron2 | CNN | 19931 total input tokens/sec | 1x V100 | DGX-1 | 19.08-py3 | Mixed | 128 | LJ Speech 1.1 | V100-SXM2-32GB
PyTorch | Tacotron2 | CNN | 24590 total input tokens/sec | 1x V100 | DGX-2H | 19.08-py3 | Mixed | 128 | LJ Speech 1.1 | V100-SXM3-32GB-H
PyTorch | Tacotron2 | CNN | 110370 total input tokens/sec | 8x V100 | DGX-1 | 19.08-py3 | Mixed | 128 | LJ Speech 1.1 | V100-SXM2-32GB
PyTorch | Tacotron2 | CNN | 144330 total input tokens/sec | 8x V100 | DGX-2H | 19.08-py3 | Mixed | 128 | LJ Speech 1.1 | V100-SXM3-32GB-H
PyTorch | WaveGlow | CNN | 82692 output samples/sec | 1x V100 | DGX-1 | 19.08-py3 | Mixed | 10 | LJ Speech 1.1 | V100-SXM2-16GB
PyTorch | WaveGlow | CNN | 96370 output samples/sec | 1x V100 | DGX-2H | 19.08-py3 | Mixed | 10 | LJ Speech 1.1 | V100-SXM3-32GB-H
PyTorch | WaveGlow | CNN | 594479 output samples/sec | 8x V100 | DGX-1 | 19.08-py3 | Mixed | 10 | LJ Speech 1.1 | V100-SXM2-16GB
TensorFlow | Inception V3 | CNN | 541 images/sec | 1x V100 | DGX-1 | 19.08-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-32GB
TensorFlow | Inception V3 | CNN | 627 images/sec | 1x V100 | DGX-2H | 19.08-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB-H
TensorFlow | Inception V3 | CNN | 4041 images/sec | 8x V100 | DGX-1 | 19.08-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-32GB
TensorFlow | ResNet-50 V1.5 | CNN | 841 images/sec | 1x V100 | DGX-1 | 19.08-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB
TensorFlow | ResNet-50 V1.5 | CNN | 964 images/sec | 1x V100 | DGX-2H | 19.08-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB-H
TensorFlow | ResNet-50 V1.5 | CNN | 6474 images/sec | 8x V100 | DGX-1 | 19.07-py3 | Mixed | 256 | ImageNet2012 | V100-SXM2-16GB
TensorFlow | SSD v1.1 | CNN | 115 images/sec | 1x V100 | DGX-1 | 19.08-py3 | Mixed | 32 | COCO 2017 | V100-SXM2-32GB
TensorFlow | SSD v1.1 | CNN | 124 images/sec | 1x V100 | DGX-2 | 19.08-py3 | Mixed | 32 | COCO 2017 | V100-SXM3-32GB
TensorFlow | SSD v1.1 | CNN | 672 images/sec | 8x V100 | DGX-1 | 19.06-py3 | Mixed | 32 | COCO 2017 | V100-SXM2-16GB
TensorFlow | SSD v1.1 | CNN | 770 images/sec | 8x V100 | DGX-2 | 19.05-py3 | Mixed | 32 | COCO 2017 | V100-SXM3-32GB
TensorFlow | U-Net Industrial | CNN | 104 images/sec | 1x V100 | DGX-1 | 19.06-py3 | Mixed | 16 | DAGM2007 | V100-SXM2-16GB
TensorFlow | U-Net Industrial | CNN | 116 images/sec | 1x V100 | DGX-2H | 19.06-py3 | Mixed | 16 | DAGM2007 | V100-SXM3-32GB-H
TensorFlow | U-Net Industrial | CNN | 503 images/sec | 8x V100 | DGX-1 | 19.07-py3 | Mixed | 2 | DAGM2007 | V100-SXM2-16GB
TensorFlow | U-Net Industrial | CNN | 544 images/sec | 8x V100 | DGX-2H | 19.08-py3 | Mixed | 2 | DAGM2007 | V100-SXM3-32GB-H
PyTorch | GNMT V2 | RNN | 76722 total tokens/sec | 1x V100 | DGX-1 | 19.08-py3 | Mixed | 128 | WMT16 English-German | V100-SXM2-32GB
PyTorch | GNMT V2 | RNN | 92364 total tokens/sec | 1x V100 | DGX-2H | 19.08-py3 | Mixed | 128 | WMT16 English-German | V100-SXM3-32GB-H
PyTorch | GNMT V2 | RNN | 585249 total tokens/sec | 8x V100 | DGX-1 | 19.08-py3 | Mixed | 128 | WMT16 English-German | V100-SXM2-32GB
PyTorch | GNMT V2 | RNN | 663657 total tokens/sec | 8x V100 | DGX-2H | 19.08-py3 | Mixed | 128 | WMT16 English-German | V100-SXM3-32GB-H
TensorFlow | GNMT V2 | RNN | 22471 total tokens/sec | 1x V100 | DGX-1 | 19.07-py3 | Mixed | 192 | WMT16 English-German | V100-SXM2-16GB
TensorFlow | GNMT V2 | RNN | 26039 total tokens/sec | 1x V100 | DGX-2H | 19.07-py3 | Mixed | 192 | WMT16 English-German | V100-SXM3-32GB-H
TensorFlow | GNMT V2 | RNN | 149008 total tokens/sec | 8x V100 | DGX-1 | 19.07-py3 | Mixed | 192 | WMT16 English-German | V100-SXM2-16GB
PyTorch | NCF | Recommender | 22093850 samples/sec | 1x V100 | DGX-1 | 19.07-py3 | Mixed | 1048576 | MovieLens 20 Million | V100-SXM2-16GB
PyTorch | NCF | Recommender | 24473776 samples/sec | 1x V100 | DGX-2H | 19.07-py3 | Mixed | 1048576 | MovieLens 20 Million | V100-SXM3-32GB-H
PyTorch | NCF | Recommender | 104122673 samples/sec | 8x V100 | DGX-1 | 19.07-py3 | Mixed | 1048576 | MovieLens 20 Million | V100-SXM2-16GB
PyTorch | NCF | Recommender | 109969915 samples/sec | 8x V100 | DGX-2H | 19.07-py3 | Mixed | 1048576 | MovieLens 20 Million | V100-SXM3-32GB-H
TensorFlow | NCF | Recommender | 26415693 samples/sec | 1x V100 | DGX-1 | 19.07-py3 | Mixed | 1048576 | MovieLens 20 Million | V100-SXM2-16GB
TensorFlow | NCF | Recommender | 56991307 samples/sec | 8x V100 | DGX-1 | 19.08-py3 | Mixed | 1048576 | MovieLens 20 Million | V100-SXM2-32GB
PyTorch | BERT-LARGE | Attention | 50 sentences/sec | 1x V100 | DGX-1 | 19.09-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM2-32GB
PyTorch | BERT-LARGE | Attention | 353 sentences/sec | 8x V100 | DGX-2 | 19.09-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM2-32GB
TensorFlow | BERT-LARGE | Attention | 32 sentences/sec | 1x V100 | DGX-1 | 19.07-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM2-32GB
TensorFlow | BERT-LARGE | Attention | 37 sentences/sec | 1x V100 | DGX-2H | 19.08-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB-H
TensorFlow | BERT-LARGE | Attention | 158 sentences/sec | 8x V100 | DGX-1 | 19.08-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM2-32GB
TensorFlow | BERT-LARGE | Attention | 189 sentences/sec | 8x V100 | DGX-2H | 19.07-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB-H

T4 Training Performance

Framework | Network | Network Type | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | Inception V3 | CNN | 180 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 128 | ImageNet2012 | Tesla T4
MXNet | Inception V3 | CNN | 1391 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.06-py3 | Mixed | 128 | ImageNet2012 | Tesla T4
MXNet | ResNet-50 v1.5 | CNN | 446 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.07-py3 | Mixed | 208 | ImageNet2012 | Tesla T4
MXNet | ResNet-50 v1.5 | CNN | 4116 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.05-py3 | Mixed | 208 | ImageNet2012 | Tesla T4
PyTorch | Inception V3 | CNN | 185 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 128 | ImageNet2012 | Tesla T4
PyTorch | Inception V3 | CNN | 1386 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 128 | ImageNet2012 | Tesla T4
PyTorch | Mask R-CNN | CNN | 7 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 4 | COCO2014 | Tesla T4
PyTorch | Mask R-CNN | CNN | 40 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 4 | COCO2014 | Tesla T4
PyTorch | ResNet-50 v1.5 | CNN | 287 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
PyTorch | ResNet-50 v1.5 | CNN | 2307 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
PyTorch | SSD v1.1 | CNN | 85 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.06-py3 | Mixed | 64 | COCO 2017 | Tesla T4
PyTorch | SSD v1.1 | CNN | 691 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.06-py3 | Mixed | 64 | COCO 2017 | Tesla T4
PyTorch | Tacotron2 | CNN | 14864 total input tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | Mixed | 128 | LJ Speech 1.1 | Tesla T4
PyTorch | Tacotron2 | CNN | 103679 total input tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.09-py3 | Mixed | 128 | LJ Speech 1.1 | Tesla T4
PyTorch | WaveGlow | CNN | 34577 output samples/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 10 | LJ Speech 1.1 | Tesla T4
PyTorch | WaveGlow | CNN | 244716 output samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 10 | LJ Speech 1.1 | Tesla T4
TensorFlow | Inception V3 | CNN | 182 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 128 | ImageNet2012 | Tesla T4
TensorFlow | Inception V3 | CNN | 1336 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 128 | ImageNet2012 | Tesla T4
TensorFlow | ResNet-50 V1.5 | CNN | 274 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
TensorFlow | ResNet-50 V1.5 | CNN | 2121 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.06-py3 | Mixed | 256 | ImageNet2012 | Tesla T4
TensorFlow | SSD v1.1 | CNN | 52 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 32 | COCO 2017 | Tesla T4
TensorFlow | SSD v1.1 | CNN | 280 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 32 | COCO 2017 | Tesla T4
TensorFlow | U-Net Industrial | CNN | 29 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 16 | DAGM2007 | Tesla T4
TensorFlow | U-Net Industrial | CNN | 191 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 2 | DAGM2007 | Tesla T4
PyTorch | GNMT V2 | RNN | 26083 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 128 | WMT16 English-German | Tesla T4
PyTorch | GNMT V2 | RNN | 183225 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.06-py3 | Mixed | 128 | WMT16 English-German | Tesla T4
TensorFlow | GNMT V2 | RNN | 9862 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 128 | WMT16 English-German | Tesla T4
TensorFlow | GNMT V2 | RNN | 58118 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 128 | WMT16 English-German | Tesla T4
PyTorch | NCF | Recommender | 7584587 samples/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.07-py3 | Mixed | 1048576 | MovieLens 20 Million | Tesla T4
PyTorch | NCF | Recommender | 27011297 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.07-py3 | Mixed | 1048576 | MovieLens 20 Million | Tesla T4
TensorFlow | NCF | Recommender | 10297010 samples/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.07-py3 | Mixed | 1048576 | MovieLens 20 Million | Tesla T4
TensorFlow | NCF | Recommender | 19050484 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.06-py3 | Mixed | 1048576 | MovieLens 20 Million | Tesla T4
TensorFlow | BERT | Attention | 9 sentences/sec | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 3 | SQuAD v1.1 | Tesla T4
TensorFlow | BERT | Attention | 32 sentences/sec | 8x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | Mixed | 3 | SQuAD v1.1 | Tesla T4

 

NVIDIA® TensorRT™ running on NVIDIA GPUs enables the most efficient deep learning inference performance across multiple application areas and models. This versatility gives data scientists wide latitude to create the optimal low-latency solution. Visit NVIDIA GPU Cloud (NGC) to download any of these containers.
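As a rough sketch of the workflow behind these numbers, a trained model is parsed into a TensorRT network, built into an optimized engine at a chosen precision, and then deployed for inference. The example below assumes the TensorRT 7.x-era Python API and a hypothetical resnet50.onnx export; the measurements in the tables were produced with the TensorRT 5.1 and 6.0 containers, whose builder API differs in detail.

    import tensorrt as trt

    TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

    # Parse an ONNX export of the trained model into a TensorRT network definition.
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)
    with open("resnet50.onnx", "rb") as f:       # hypothetical ONNX file
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    # Build an optimized engine; enabling FP16 (or INT8 with a calibrator) is what
    # produces the Mixed and INT8 precision results in the tables below.
    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30          # 1 GiB of scratch space for tactic selection
    config.set_flag(trt.BuilderFlag.FP16)
    engine = builder.build_engine(network, config)
    with open("resnet50.plan", "wb") as f:
        f.write(engine.serialize())

The serialized plan file is then loaded by the application (or an inference server) and executed with the batch size that best fits its latency budget.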

NVIDIA® Tesla® V100 Tensor Core GPUs leverage mixed precision to combine high throughput with low latencies across every type of neural network. The Tesla P4 is an inference GPU designed for optimal power consumption and latency in ultra-efficient scale-out servers. Read the inference whitepaper to learn more about NVIDIA’s inference platform.

Measuring inference performance involves balancing many variables. PLASTER is an acronym that describes the key elements for measuring deep learning performance. Each letter identifies a factor (Programmability, Latency, Accuracy, Size of Model, Throughput, Energy Efficiency, Rate of Learning) that must be considered to arrive at the right set of trade-offs and to produce a successful deep learning implementation. Refer to NVIDIA’s PLASTER whitepaper for more details.
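Latency and throughput in particular pull against each other: larger batches keep the GPU busier and raise throughput, but each request waits longer for its batch to finish. A minimal sketch of measuring that trade-off, assuming PyTorch with a torchvision ResNet-50 as a stand-in (the published numbers below were measured with TensorRT, not eager-mode PyTorch):

    import time
    import torch
    from torchvision.models import resnet50

    def measure(model, batch_size, iters=50, warmup=10):
        """Return (latency in ms per batch, throughput in images/sec)."""
        x = torch.randn(batch_size, 3, 224, 224, device="cuda", dtype=torch.half)
        with torch.no_grad():
            for _ in range(warmup):
                model(x)
            torch.cuda.synchronize()             # make sure warmup work has finished
            start = time.perf_counter()
            for _ in range(iters):
                model(x)
            torch.cuda.synchronize()             # wait for all queued GPU work before stopping the clock
            elapsed = time.perf_counter() - start
        return elapsed / iters * 1e3, batch_size * iters / elapsed

    model = resnet50().cuda().half().eval()      # weights don't matter for a timing run
    for bs in (1, 2, 8, 128):
        latency_ms, images_per_sec = measure(model, bs)
        print(f"batch {bs:>3}: {latency_ms:7.2f} ms/batch, {images_per_sec:8.0f} images/sec")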

Inference Image Classification on CNNs with TensorRT

ResNet-50 v1.5 Throughput

DGX-1: 1x Tesla V100-SXM2-16GB, E5-2698 v4 2.2 GHz | TensorRT 5.1 | Batch Size = 128 | 19.07-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-4029GP-TRT T4: 1x Tesla T4, Gold 6140 2.3 GHz | TensorRT 5.1 | Batch Size = 128 | 19.07-py3 | Precision: INT8 | Dataset: Synthetic

 
 

ResNet-50 v1.5 Latency

DGX-1: 1x Tesla V100-SXM2-16GB, E5-2698 v4 2.2 GHz | TensorRT 5.1 | Batch Size = 1 | 19.07-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-4029GP-TRT T4: 1x Tesla T4, Gold 6140 2.3 GHz | TensorRT 5.1 | Batch Size = 1 | 19.07-py3 | Precision: INT8 | Dataset: Synthetic

 
 

ResNet-50 v1.5 Power Efficiency

DGX-1: 1x Tesla V100-SXM2-16GB, E5-2698 v4 2.2 GHz | TensorRT 5.1 | Batch Size = 128 | 19.07-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-4029GP-TRT T4: 1x Tesla T4, Gold 6140 2.3 GHz | TensorRT 5.1 | Batch Size = 128 | 19.07-py3 | Precision: INT8 | Dataset: Synthetic

 

Inference Performance

V100 Inference Performance

Network | Network Type | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | GPU Version
GoogleNet | CNN | 1 | 1610 images/sec | 15 images/sec/watt | 0.62 | 1x V100 | DGX-1 | 19.08-py3 | INT8 | Synthetic | V100-SXM2-16GB
GoogleNet | CNN | 2 | 2162 images/sec | 18 images/sec/watt | 0.93 | 1x V100 | DGX-1 | 19.07-py3 | INT8 | Synthetic | V100-SXM2-16GB
GoogleNet | CNN | 8 | 5228 images/sec | 35 images/sec/watt | 1.5 | 1x V100 | DGX-1 | 19.06-py3 | INT8 | Synthetic | V100-SXM2-16GB
GoogleNet | CNN | 82 | 11869 images/sec | 45 images/sec/watt | 6.9 | 1x V100 | DGX-1 | 19.07-py3 | INT8 | Synthetic | V100-SXM2-16GB
GoogleNet | CNN | 128 | 12400 images/sec | 44 images/sec/watt | 10 | 1x V100 | DGX-1 | 19.07-py3 | INT8 | Synthetic | V100-SXM2-16GB
MobileNet V1 | CNN | 1 | 3814 images/sec | 29 images/sec/watt | 0.26 | 1x V100 | DGX-1 | 19.07-py3 | INT8 | Synthetic | V100-SXM2-32GB
MobileNet V1 | CNN | 2 | 5594 images/sec | 45 images/sec/watt | 0.36 | 1x V100 | DGX-1 | 19.07-py3 | INT8 | Synthetic | V100-SXM2-16GB
MobileNet V1 | CNN | 8 | 14788 images/sec | 96 images/sec/watt | 0.54 | 1x V100 | DGX-1 | 19.07-py3 | INT8 | Synthetic | V100-SXM2-32GB
MobileNet V1 | CNN | 128 | 29914 images/sec | 104 images/sec/watt | 4.3 | 1x V100 | DGX-1 | 19.06-py3 | INT8 | Synthetic | V100-SXM2-16GB
ResNet-50 | CNN | 1 | 1156 images/sec | 8.7 images/sec/watt | 0.87 | 1x V100 | DGX-1 | 19.07-py3 | INT8 | Synthetic | V100-SXM2-16GB
ResNet-50 | CNN | 2 | 1580 images/sec | 10 images/sec/watt | 1.3 | 1x V100 | DGX-1 | 19.07-py3 | INT8 | Synthetic | V100-SXM2-16GB
ResNet-50 | CNN | 8 | 3315 images/sec | 21 images/sec/watt | 2.4 | 1x V100 | DGX-1 | 19.07-py3 | Mixed | Synthetic | V100-SXM2-16GB
ResNet-50 | CNN | 128 | 7720 images/sec | 27 images/sec/watt | 17 | 1x V100 | DGX-1 | 19.06-py3 | Mixed | Synthetic | V100-SXM2-16GB
ResNet-50 | CNN | 128 | 7830 images/sec | 23 images/sec/watt | 16 | 1x V100 | DGX-2 | 19.06-py3 | Mixed | Synthetic | V100-SXM3-32GB
ResNet-50 v1.5 | CNN | 1 | 949 images/sec | 7.1 images/sec/watt | 1.1 | 1x V100 | DGX-1 | 19.08-py3 | INT8 | Synthetic | V100-SXM2-16GB
ResNet-50 v1.5 | CNN | 2 | 1407 images/sec | 9.8 images/sec/watt | 1.4 | 1x V100 | DGX-1 | 19.06-py3 | INT8 | Synthetic | V100-SXM2-16GB
ResNet-50 v1.5 | CNN | 8 | 3226 images/sec | 20 images/sec/watt | 2.5 | 1x V100 | DGX-1 | 19.08-py3 | Mixed | Synthetic | V100-SXM2-16GB
ResNet-50 v1.5 | CNN | 128 | 7223 images/sec | 25 images/sec/watt | 18 | 1x V100 | DGX-1 | 19.07-py3 | Mixed | Synthetic | V100-SXM2-16GB
ResNet-50 v1.5 | CNN | 128 | 7454 images/sec | 22 images/sec/watt | 17 | 1x V100 | DGX-2 | 19.08-py3 | Mixed | Synthetic | V100-SXM3-32GB
VGG16 | CNN | 1 | 821 images/sec | 4 images/sec/watt | 1.2 | 1x V100 | DGX-1 | 19.07-py3 | INT8 | Synthetic | V100-SXM2-16GB
VGG16 | CNN | 2 | 1145 images/sec | 5.5 images/sec/watt | 1.8 | 1x V100 | DGX-1 | 19.06-py3 | Mixed | Synthetic | V100-SXM2-16GB
VGG16 | CNN | 8 | 2067 images/sec | 8.2 images/sec/watt | 3.9 | 1x V100 | DGX-1 | 19.06-py3 | Mixed | Synthetic | V100-SXM2-16GB
VGG16 | CNN | 128 | 2845 images/sec | 9.7 images/sec/watt | 45 | 1x V100 | DGX-1 | 19.07-py3 | Mixed | Synthetic | V100-SXM2-16GB
NMT | RNN | 1 | 4013 total tokens/sec | - | 13 | 1x V100 | DGX-1 | - | Mixed | WMT16 English-German | V100-SXM2-32GB
NMT | RNN | 2 | 6290 total tokens/sec | - | 16 | 1x V100 | DGX-1 | - | Mixed | WMT16 English-German | V100-SXM2-32GB
NMT | RNN | 64 | 56531 total tokens/sec | - | 58 | 1x V100 | DGX-1 | - | Mixed | WMT16 English-German | V100-SXM2-32GB
NMT | RNN | 128 | 73375 total tokens/sec | - | 89 | 1x V100 | DGX-1 | - | Mixed | WMT16 English-German | V100-SXM2-32GB
NCF | Recommender | 1048576 | 61130538 samples/sec | - | - | 1x V100 | DGX-1 | 19.08-py3 | Mixed | MovieLens 20 Million | V100-SXM2-16GB
BERT-BASE | Attention | 1 | 557 sentences/sec | 10.3 sentences/sec/watt | 1.8 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | V100-PCIE-16GB
BERT-BASE | Attention | 2 | 978 sentences/sec | 18.8 sentences/sec/watt | 2 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | V100-PCIE-16GB
BERT-BASE | Attention | 8 | 1847 sentences/sec | 34.1 sentences/sec/watt | 4.3 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | V100-PCIE-16GB
BERT-BASE | Attention | 24 | 2419 sentences/sec | 43.7 sentences/sec/watt | 9.9 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | V100-PCIE-16GB
BERT-BASE | Attention | 128 | 2645 sentences/sec | 46 sentences/sec/watt | 48.4 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | V100-PCIE-16GB
BERT-LARGE | Attention | 1 | 239 sentences/sec | 4.3 sentences/sec/watt | 4.2 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | V100-PCIE-16GB
BERT-LARGE | Attention | 2 | 407 sentences/sec | 7.5 sentences/sec/watt | 4.9 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | V100-PCIE-16GB
BERT-LARGE | Attention | 4 | 562 sentences/sec | 10.6 sentences/sec/watt | 7.1 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | V100-PCIE-16GB
BERT-LARGE | Attention | 8 | 636 sentences/sec | 11.8 sentences/sec/watt | 12.6 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | V100-PCIE-16GB
BERT-LARGE | Attention | 128 | 823 sentences/sec | 13.6 sentences/sec/watt | 155.5 | 1x V100 | Supermicro SYS-4029GP-TRT | - | Mixed | SQuAD v1.1 | V100-PCIE-16GB

TensorRT 6.0 and sequence length=128 for BERT-BASE and BERT-LARGE | PyTorch for NCF | TensorRT 5.1 for all other models | Efficiency based on board power
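Board power is what the GPU itself reports. A minimal sketch of computing efficiency on that basis, assuming the pynvml NVML bindings and a hypothetical measured throughput value (in practice, power should be sampled repeatedly during the run and averaged):

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)                   # first GPU in the system
    board_watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0   # NVML reports milliwatts

    measured_images_per_sec = 5024.0    # placeholder: substitute a measured throughput
    print(f"{measured_images_per_sec / board_watts:.1f} images/sec/watt "
          f"at {board_watts:.0f} W board power")

    pynvml.nvmlShutdown()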

 

T4 Inference Performance

Network | Network Type | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | GPU Version
GoogleNet | CNN | 1 | 1703 images/sec | 26 images/sec/watt | 0.59 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | INT8 | Synthetic | Tesla T4
GoogleNet | CNN | 2 | 2356 images/sec | 35 images/sec/watt | 0.85 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.06-py3 | INT8 | Synthetic | Tesla T4
GoogleNet | CNN | 8 | 5816 images/sec | 84 images/sec/watt | 1.4 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | INT8 | Synthetic | Tesla T4
GoogleNet | CNN | 52 | 7580 images/sec | 109 images/sec/watt | 6.9 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.07-py3 | INT8 | Synthetic | Tesla T4
GoogleNet | CNN | 128 | 7745 images/sec | 111 images/sec/watt | 17 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | INT8 | Synthetic | Tesla T4
MobileNet V1 | CNN | 1 | 3685 images/sec | 70 images/sec/watt | 0.27 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.07-py3 | INT8 | Synthetic | Tesla T4
MobileNet V1 | CNN | 2 | 6006 images/sec | 97 images/sec/watt | 0.33 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.07-py3 | INT8 | Synthetic | Tesla T4
MobileNet V1 | CNN | 8 | 13859 images/sec | 199 images/sec/watt | 0.58 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.07-py3 | INT8 | Synthetic | Tesla T4
MobileNet V1 | CNN | 128 | 17508 images/sec | 251 images/sec/watt | 7.3 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 | CNN | 1 | 1080 images/sec | 16 images/sec/watt | 0.93 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 | CNN | 2 | 1731 images/sec | 26 images/sec/watt | 1.2 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 | CNN | 8 | 3907 images/sec | 56 images/sec/watt | 2.1 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 | CNN | 128 | 5402 images/sec | 78 images/sec/watt | 24 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 v1.5 | CNN | 1 | 1032 images/sec | 15 images/sec/watt | 0.97 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 v1.5 | CNN | 2 | 1730 images/sec | 26 images/sec/watt | 1.2 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 v1.5 | CNN | 8 | 3730 images/sec | 53 images/sec/watt | 2.2 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.07-py3 | INT8 | Synthetic | Tesla T4
ResNet-50 v1.5 | CNN | 128 | 5024 images/sec | 72 images/sec/watt | 25 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | INT8 | Synthetic | Tesla T4
VGG16 | CNN | 1 | 726 images/sec | 10 images/sec/watt | 1.4 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.07-py3 | INT8 | Synthetic | Tesla T4
VGG16 | CNN | 2 | 1064 images/sec | 15 images/sec/watt | 1.9 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.07-py3 | INT8 | Synthetic | Tesla T4
VGG16 | CNN | 8 | 1670 images/sec | 24 images/sec/watt | 4.8 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.07-py3 | INT8 | Synthetic | Tesla T4
VGG16 | CNN | 128 | 1956 images/sec | 28 images/sec/watt | 65 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.07-py3 | INT8 | Synthetic | Tesla T4
NCF | Recommender | 1 | 7716 samples/sec | 281 samples/sec/watt | 0.14 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.07-py3 | INT8 | Synthetic | Tesla T4
NCF | Recommender | 64 | 491050 samples/sec | 16957 samples/sec/watt | 0.14 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.07-py3 | INT8 | Synthetic | Tesla T4
NCF | Recommender | 25000 | 50828702 samples/sec | 728654 samples/sec/watt | 1.9 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.07-py3 | INT8 | Synthetic | Tesla T4
NCF | Recommender | 100000 | 54034709 samples/sec | 777433 samples/sec/watt | 1.9 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | 19.08-py3 | INT8 | Synthetic | Tesla T4
BERT-BASE | Attention | 1 | 484 sentences/sec | 11 sentences/sec/watt | 2.07 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | Tesla T4
BERT-BASE | Attention | 2 | 754 sentences/sec | 17 sentences/sec/watt | 2.65 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | Tesla T4
BERT-BASE | Attention | 8 | 827 sentences/sec | 20 sentences/sec/watt | 9.67 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | Tesla T4
BERT-BASE | Attention | 128 | 800 sentences/sec | 16 sentences/sec/watt | 160.02 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | Tesla T4
BERT-LARGE | Attention | 1 | 171 sentences/sec | 4 sentences/sec/watt | 5.84 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | Tesla T4
BERT-LARGE | Attention | 2 | 168 sentences/sec | 4 sentences/sec/watt | 11.88 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | Tesla T4
BERT-LARGE | Attention | 8 | 244 sentences/sec | 6 sentences/sec/watt | 32.74 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | Tesla T4
BERT-LARGE | Attention | 128 | 254 sentences/sec | 5 sentences/sec/watt | 504.33 | 1x T4 | Supermicro SYS-4029GP-TRT T4 | - | Mixed | SQuAD v1.1 | Tesla T4

TensorRT 6.0 and sequence length=128 for BERT-BASE and BERT-LARGE | TensorRT 5.1 for all other models | Efficiency based on board power | NCF uses facebook dataset

 

Last updated: Sept 13th, 2019