AI Training

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. Time to convergence at a quality target is therefore the most meaningful test of whether an AI system is ready to be deployed in the field and deliver useful results.
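As a minimal sketch of this methodology (the `train_epoch` and `evaluate` callables are hypothetical placeholders, not NVIDIA's benchmark harness), "time to train" is simply the wall-clock time until the evaluation metric first reaches the target:

```python
import time

def train_to_target(train_epoch, evaluate, target):
    """Run training epochs until the evaluation metric reaches the
    quality target; return (epochs, minutes). This elapsed-minutes
    figure is the "Time to Train" reported in the tables below."""
    start = time.perf_counter()
    epochs, metric = 0, float("-inf")
    while metric < target:
        train_epoch()        # one pass over the training set
        metric = evaluate()  # e.g. Top-1, mAP, BLEU, DICE, AUC...
        epochs += 1
    return epochs, (time.perf_counter() - start) / 60.0
```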




NVIDIA Performance on MLPerf 3.1 Training Benchmarks


NVIDIA Performance on MLPerf 3.1’s AI Benchmarks: Single Node, Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| NeMo | Stable Diffusion | 46.8 | FID <= 90 and CLIP >= 0.15 | 8x H100 | XE9680x8H100-SXM-80GB | 3.1-2019 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| MXNet | ResNet-50 v1.5 | 13.4 | 75.90% classification | 8x H100 | ESC-N8-E11 | 3.1-2011 | Mixed | ImageNet | H100-SXM5-80GB |
| MXNet | 3D U-Net | 13.1 | 0.908 Mean DICE score | 8x H100 | AS-8125GS-TNHR | 3.1-2068 | Mixed | KiTS19 | H100-SXM5-80GB |
| PyTorch | BERT | 5.4 | 0.72 Mask-LM accuracy | 8x H100 | ESC-N8-E11 | 3.1-2011 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | Mask R-CNN | 19.2 | 0.377 Box min AP and 0.339 Mask min AP | 8x H100 | Eos_n1 | 3.1-2048 | Mixed | COCO2017 | H100-SXM5-80GB |
| PyTorch | RNN-T | 16.2 | 0.058 Word Error Rate | 8x H100 | GIGABYTE G593-ZD2 | 3.1-2028 | Mixed | LibriSpeech | H100-SXM5-80GB |
| PyTorch | RetinaNet | 36.0 | 34.0% mAP | 8x H100 | ESC-N8-E11 | 3.1-2011 | Mixed | A subset of OpenImages | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 3.9 | 0.80275 AUC | 8x H100 | Eos_n1 | 3.1-2047 | Mixed | Criteo 4TB | H100-SXM5-80GB |

NVIDIA Performance on MLPerf 3.1’s AI Benchmarks: Multi Node, Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| NVIDIA NeMo | GPT3 | 58.3 | 2.69 log perplexity | 512x H100 | Eos_n64 | 3.1-2057 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 40.6 | 2.69 log perplexity | 768x H100 | Eos_n96 | 3.1-2065 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 8.6 | 2.69 log perplexity | 4,096x H100 | Eos-dfw_n512 | 3.1-2008 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 6.0 | 2.69 log perplexity | 6,144x H100 | Eos-dfw_n768 | 3.1-2009 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 4.9 | 2.69 log perplexity | 8,192x H100 | Eos-dfw_n1024 | 3.1-2005 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 4.1 | 2.69 log perplexity | 10,240x H100 | Eos-dfw_n1280 | 3.1-2006 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 3.9 | 2.69 log perplexity | 10,752x H100 | Eos-dfw_n1344 | 3.1-2007 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | Stable Diffusion | 10.0 | FID <= 90 and CLIP >= 0.15 | 64x H100 | Eos_n8 | 3.1-2060 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| NVIDIA NeMo | Stable Diffusion | 2.9 | FID <= 90 and CLIP >= 0.15 | 512x H100 | Eos_n64 | 3.1-2055 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| NVIDIA NeMo | Stable Diffusion | 2.5 | FID <= 90 and CLIP >= 0.15 | 1,024x H100 | Eos_n128 | 3.1-2050 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| MXNet | ResNet-50 v1.5 | 2.5 | 75.90% classification | 64x H100 | Eos_n8 | 3.1-2058 | Mixed | ImageNet | H100-SXM5-80GB |
| MXNet | ResNet-50 v1.5 | 0.2 | 75.90% classification | 3,584x H100 | coreweave_hgxh100_n448_ngc23.04_mxnet | 3.1-2010 | Mixed | ImageNet | H100-SXM5-80GB |
| MXNet | 3D U-Net | 1.9 | 0.908 Mean DICE score | 72x H100 | Eos_n9 | 3.1-2063 | Mixed | KiTS19 | H100-SXM5-80GB |
| MXNet | 3D U-Net | 0.8 | 0.908 Mean DICE score | 768x H100 | Eos_n96 | 3.1-2064 | Mixed | KiTS19 | H100-SXM5-80GB |
| PyTorch | BERT | 0.9 | 0.72 Mask-LM accuracy | 64x H100 | Eos_n8 | 3.1-2061 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | BERT | 0.1 | 0.72 Mask-LM accuracy | 3,472x H100 | Eos_n434 | 3.1-2053 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | Mask R-CNN | 4.3 | 0.377 Box min AP and 0.339 Mask min AP | 64x H100 | Eos_n8 | 3.1-2061 | Mixed | COCO2017 | H100-SXM5-80GB |
| PyTorch | Mask R-CNN | 1.5 | 0.377 Box min AP and 0.339 Mask min AP | 384x H100 | Eos_n48 | 3.1-2054 | Mixed | COCO2017 | H100-SXM5-80GB |
| PyTorch | RNN-T | 4.2 | 0.058 Word Error Rate | 64x H100 | Eos_n8 | 3.1-2061 | Mixed | LibriSpeech | H100-SXM5-80GB |
| PyTorch | RNN-T | 1.7 | 0.058 Word Error Rate | 512x H100 | Eos_n64 | 3.1-2056 | Mixed | LibriSpeech | H100-SXM5-80GB |
| PyTorch | RetinaNet | 6.1 | 34.0% mAP | 64x H100 | Eos_n8 | 3.1-2062 | Mixed | A subset of OpenImages | H100-SXM5-80GB |
| PyTorch | RetinaNet | 0.9 | 34.0% mAP | 2,048x H100 | Eos_n256 | 3.1-2052 | Mixed | A subset of OpenImages | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 1.4 | 0.80275 AUC | 64x H100 | Eos_n8 | 3.1-2059 | Mixed | Criteo 4TB | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 1.0 | 0.80275 AUC | 128x H100 | Eos_n16 | 3.1-2051 | Mixed | Criteo 4TB | H100-SXM5-80GB |

MLPerf™ v3.1 Training Closed: 3.1-2005, 3.1-2006, 3.1-2007, 3.1-2008, 3.1-2009, 3.1-2010, 3.1-2011, 3.1-2019, 3.1-2028, 3.1-2047, 3.1-2048, 3.1-2050, 3.1-2051, 3.1-2052, 3.1-2053, 3.1-2054, 3.1-2055, 3.1-2056, 3.1-2057, 3.1-2058, 3.1-2059, 3.1-2060, 3.1-2061, 3.1-2062, 3.1-2063, 3.1-2064, 3.1-2065, 3.1-2068 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For Training rules and guidelines, click here


NVIDIA Performance on MLPerf 3.0’s Training HPC Benchmarks: Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| PyTorch | CosmoFlow | 2.1 | Mean average error 0.124 | 512x H100 | eos | 3.0-8006 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | H100-SXM5-80GB |
| PyTorch | DeepCAM | 0.8 | IOU 0.82 | 2,048x H100 | eos | 3.0-8007 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | H100-SXM5-80GB |
| PyTorch | OpenCatalyst | 10.7 | Forces mean absolute error 0.036 | 640x H100 | eos | 3.0-8008 | Mixed | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | H100-SXM5-80GB |
| PyTorch | OpenFold | 7.5 | Local Distance Difference Test (lDDT-Cα) >= 0.8 | 2,080x H100 | eos | 3.0-8009 | Mixed | OpenProteinSet and Protein Data Bank | H100-SXM5-80GB |

MLPerf™ v3.0 Training HPC Closed: 3.0-8006, 3.0-8007, 3.0-8008, 3.0-8009 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For MLPerf™ v3.0 Training HPC rules and guidelines, click here



LLM Training Performance on NVIDIA Data Center Products


H100 Training Performance



| Framework | Framework Version | Network | Time to Train (days) | Throughput per GPU | GPU | Server | Container | Sequence Length | TP | PP | Precision | Global Batch Size | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NeMo | 1.23 | GPT3 5B | 0.5 | 23,574 tokens/sec | 64x H100 | Eos | nemo:24.03 | 2,048 | 1 | 1 | FP8 | 2,048 | H100 SXM5 80GB |
| NeMo | 1.23 | GPT3 20B | 2 | 5,528 tokens/sec | 64x H100 | Eos | nemo:24.03 | 2,048 | 2 | 1 | FP8 | 256 | H100 SXM5 80GB |
| NeMo | 1.23 | Llama2 7B | 0.7 | 16,290 tokens/sec | 8x H100 | Eos | nemo:24.03 | 4,096 | 1 | 1 | FP8 | 128 | H100 SXM5 80GB |
| NeMo | 1.23 | Llama2 13B | 1.4 | 8,317 tokens/sec | 16x H100 | Eos | nemo:24.03 | 4,096 | 1 | 4 | FP8 | 128 | H100 SXM5 80GB |
| NeMo | 1.23 | Llama2 70B | 6.6 | 1,725 tokens/sec | 64x H100 | Eos | nemo:24.03 | 4,096 | 4 | 4 | FP8 | 128 | H100 SXM5 80GB |

TP: Tensor Parallelism
PP: Pipeline Parallelism
Time to Train is the estimated time to train on 1T tokens with 1K GPUs.
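The estimate is straightforward arithmetic: divide 1T tokens by the aggregate throughput (the per-GPU tokens/sec from the table, scaled to 1,000 GPUs). A minimal sketch, using figures from the table above:

```python
def estimated_days(tokens_per_sec_per_gpu, total_tokens=1e12, num_gpus=1000):
    """Estimated days to process `total_tokens`, assuming the measured
    per-GPU throughput scales linearly to `num_gpus` GPUs."""
    seconds = total_tokens / (tokens_per_sec_per_gpu * num_gpus)
    return seconds / (24 * 3600)

print(round(estimated_days(23_574), 1))  # GPT3 5B  -> 0.5 days
print(round(estimated_days(16_290), 1))  # Llama2 7B -> 0.7 days
```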


Converged Training Performance on NVIDIA Data Center GPUs


H100 Training Performance



| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.3.0a0 | Tacotron2 | 67 | 0.56 Training Loss | 469,109 total output mels/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 128 | LJSpeech 1.1 | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | WaveGlow | 119 | -5.8 Training Loss | 3,645,916 output samples/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 10 | LJSpeech 1.1 | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | GNMT v2 | 9 | 24.15 BLEU Score | 1,699,570 total tokens/sec | 8x H100 | DGX H100 | 23.12-py3 | Mixed | 128 | wmt16-en-de | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | NCF | 0.27 | 0.96 Hit Rate at 10 | 218,094,053 samples/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 131072 | MovieLens 20M | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | FastPitch | 75 | 0.17 Training Loss | 1,331,733 frames/sec | 8x H100 | DGX H100 | 24.02-py3 | TF32 | 32 | LJSpeech 1.1 | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | Transformer XL Large | 318 | 17.83 Perplexity | 262,462 total tokens/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 16 | WikiText-103 | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | Transformer XL Base | 141 | 21.61 Perplexity | 952,253 total tokens/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 128 | WikiText-103 | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | EfficientNet-B4 | 1,667 | 82.02 Top 1 | 5,231 images/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 128 | Imagenet2012 | H100 SXM5 80GB |
| PyTorch | 2.1.0a0 | EfficientDet-D0 | 325 | 0.33 BBOX mAP | 2,658 images/sec | 8x H100 | DGX H100 | 23.10-py3 | Mixed | 150 | COCO 2017 | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | EfficientNet-WideSE-B4 | 1,673 | 82.01 Top 1 | 5,218 images/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 128 | Imagenet2012 | H100 SXM5 80GB |
| PyTorch | 2.2.0a0 | TFT-Electricity | 2 | 0.03 Test P90 | 145,082 items/sec | 8x H100 | DGX H100 | 23.12-py3 | Mixed | 1024 | Electricity | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | HiFiGAN | 948 | 9.42 Training Loss | 115,461 total output mels/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 16 | LJSpeech-1.1 | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | GPUNet-0 | 1,052 | 78.91 Top 1 | 9,950 images/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 192 | Imagenet2012 | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | GPUNet-1 | 960 | 80.45 Top 1 | 10,946 images/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 192 | Imagenet2012 | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | MoFlow | 35 | 89.67 NUV | 46,451 molecules/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 512 | ZINC | H100 SXM5 80GB |
| TensorFlow | 2.13.0 | U-Net Medical | 1 | 0.89 DICE Score | 2,139 images/sec | 8x H100 | DGX H100 | 23.10-py3 | Mixed | 8 | EM segmentation challenge | H100 SXM5 80GB |
| TensorFlow | 2.15.0 | Electra Fine Tuning | 2 | 92.59 F1 | 5,062 sequences/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 32 | SQuAD v1.1 | H100 SXM5 80GB |
| TensorFlow | 2.13.0 | Wide and Deep | 4 | 0.66 MAP at 12 | 12,217,033 samples/sec | 8x H100 | DGX H100 | 23.10-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | H100 SXM5 80GB |
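"Mixed" in the Precision column refers to automatic mixed precision (AMP). As a rough illustration only (a generic PyTorch AMP loop with synthetic data and hypothetical shapes, not the benchmark scripts that produced these results; requires a CUDA device):

```python
import torch

model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # scales loss to avoid FP16 gradient underflow

for _ in range(10):
    x = torch.randn(32, 512, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # runs eligible ops in reduced precision
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()     # backward pass on the scaled loss
    scaler.step(optimizer)            # unscales gradients, then optimizer step
    scaler.update()                   # adjusts the loss scale for the next step
```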

A30 Training Performance


| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.3.0a0 | Tacotron2 | 131 | 0.52 Training Loss | 232,954 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 104 | LJSpeech 1.1 | A30 |
| PyTorch | 2.3.0a0 | WaveGlow | 403 | Training Loss | 1,042,579 output samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 10 | LJSpeech 1.1 | A30 |
| PyTorch | 2.3.0a0 | GNMT v2 | 49 | 24.21 BLEU Score | 309,310 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | wmt16-en-de | A30 |
| PyTorch | 2.3.0a0 | NCF | 1 | 0.96 Hit Rate at 10 | 41,848,626 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 131072 | MovieLens 20M | A30 |
| PyTorch | 2.3.0a0 | FastPitch | 156 | 0.17 Training Loss | 545,724 frames/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 16 | LJSpeech 1.1 | A30 |
| PyTorch | 2.3.0a0 | Transformer XL Base | 198 | 22.87 Perplexity | 168,704 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 32 | WikiText-103 | A30 |
| PyTorch | 2.3.0a0 | EfficientNet-B0 | 793 | 77.13 Top 1 | 11,235 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | Imagenet2012 | A30 |
| PyTorch | 2.3.0a0 | EfficientNet-WideSE-B0 | 820 | 77.21 Top 1 | 10,863 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | Imagenet2012 | A30 |
| PyTorch | 2.2.0a0 | MoFlow | 100 | 87.86 NUV | 12,351 molecules/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.12-py3 | Mixed | 512 | ZINC | A30 |
| TensorFlow | 2.13.0 | U-Net Medical | 4 | 0.89 DICE Score | 460 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.10-py3 | Mixed | 8 | EM segmentation challenge | A30 |
| TensorFlow | 2.15.0 | Electra Fine Tuning | 5 | 92.63 F1 | 1,024 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 16 | SQuAD v1.1 | A30 |
| TensorFlow | 2.14.0 | SIM | 1 | 0.81 AUC | 2,481,945 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.12-py3 | Mixed | 16384 | Amazon Reviews | A30 |

A10 Training Performance


| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.3.0a0 | Tacotron2 | 144 | 0.53 Training Loss | 214,246 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 104 | LJSpeech 1.1 | A10 |
| PyTorch | 2.3.0a0 | WaveGlow | 541 | -5.73 Training Loss | 776,764 output samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 10 | LJSpeech 1.1 | A10 |
| PyTorch | 2.3.0a0 | GNMT v2 | 53 | 24.2 BLEU Score | 282,447 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | wmt16-en-de | A10 |
| PyTorch | 2.3.0a0 | NCF | 2 | 0.96 Hit Rate at 10 | 32,920,397 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | TF32 | 131072 | MovieLens 20M | A10 |
| PyTorch | 2.3.0a0 | FastPitch | 180 | 0.17 Training Loss | 460,415 frames/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 16 | LJSpeech 1.1 | A10 |
| PyTorch | 2.3.0a0 | EfficientNet-B0 | 1,045 | 77.11 Top 1 | 8,625 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | Imagenet2012 | A10 |
| PyTorch | 2.3.0a0 | EfficientNet-WideSE-B0 | 1,076 | 77.31 Top 1 | 8,487 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | Imagenet2012 | A10 |
| PyTorch | 2.2.0a0 | MoFlow | 93 | 86.86 NUV | 13,184 molecules/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.12-py3 | Mixed | 512 | ZINC | A10 |
| TensorFlow | 2.13.0 | U-Net Medical | 4 | 0.89 DICE Score | 352 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.10-py3 | Mixed | 8 | EM segmentation challenge | A10 |
| TensorFlow | 2.15.0 | Electra Fine Tuning | 5 | 92.52 F1 | 826 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 16 | SQuAD v1.1 | A10 |
| TensorFlow | 2.14.0 | SIM | 1 | 0.8 AUC | 2,346,013 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.12-py3 | Mixed | 16384 | Amazon Reviews | A10 |



View More Performance Data

AI Inference

Real-world inference demands high throughput and low latency with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into production with the highest performance from data center to edge.

Learn More

AI Pipeline

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.

Learn More