AI Training

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. Time to convergence at a quality target is therefore the most meaningful test of whether an AI system is ready to be deployed in the field and deliver useful results.
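As a minimal sketch of this methodology (the `train_epoch` and `evaluate` callables are hypothetical placeholders, not NVIDIA's benchmark harness), "time to train" is simply the wall-clock time until the evaluation metric first reaches the target:

```python
import time

def train_to_target(train_epoch, evaluate, target):
    """Run training epochs until the evaluation metric reaches the
    quality target; return (epochs, minutes). This elapsed-minutes
    figure is the "Time to Train" reported in the tables below."""
    start = time.perf_counter()
    epochs, metric = 0, float("-inf")
    while metric < target:
        train_epoch()        # one pass over the training set
        metric = evaluate()  # e.g. Top-1, mAP, BLEU, DICE, AUC...
        epochs += 1
    return epochs, (time.perf_counter() - start) / 60.0
```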




NVIDIA Performance on MLPerf 3.1 Training Benchmarks


NVIDIA Performance on MLPerf 3.1’s AI Benchmarks: Single Node, Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| NeMo | Stable Diffusion | 46.8 | FID <= 90 and CLIP >= 0.15 | 8x H100 | XE9680x8H100-SXM-80GB | 3.1-2019 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| MXNet | ResNet-50 v1.5 | 13.4 | 75.90% classification | 8x H100 | ESC-N8-E11 | 3.1-2011 | Mixed | ImageNet | H100-SXM5-80GB |
| MXNet | 3D U-Net | 13.1 | 0.908 Mean DICE score | 8x H100 | AS-8125GS-TNHR | 3.1-2068 | Mixed | KiTS19 | H100-SXM5-80GB |
| PyTorch | BERT | 5.4 | 0.72 Mask-LM accuracy | 8x H100 | ESC-N8-E11 | 3.1-2011 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | Mask R-CNN | 19.2 | 0.377 Box min AP and 0.339 Mask min AP | 8x H100 | Eos_n1 | 3.1-2048 | Mixed | COCO2017 | H100-SXM5-80GB |
| PyTorch | RNN-T | 16.2 | 0.058 Word Error Rate | 8x H100 | GIGABYTE G593-ZD2 | 3.1-2028 | Mixed | LibriSpeech | H100-SXM5-80GB |
| PyTorch | RetinaNet | 36.0 | 34.0% mAP | 8x H100 | ESC-N8-E11 | 3.1-2011 | Mixed | A subset of OpenImages | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 3.9 | 0.80275 AUC | 8x H100 | Eos_n1 | 3.1-2047 | Mixed | Criteo 4TB | H100-SXM5-80GB |

NVIDIA Performance on MLPerf 3.1’s AI Benchmarks: Multi Node, Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| NVIDIA NeMo | GPT3 | 58.3 | 2.69 log perplexity | 512x H100 | Eos_n64 | 3.1-2057 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 40.6 | 2.69 log perplexity | 768x H100 | Eos_n96 | 3.1-2065 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 8.6 | 2.69 log perplexity | 4,096x H100 | Eos-dfw_n512 | 3.1-2008 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 6.0 | 2.69 log perplexity | 6,144x H100 | Eos-dfw_n768 | 3.1-2009 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 4.9 | 2.69 log perplexity | 8,192x H100 | Eos-dfw_n1024 | 3.1-2005 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 4.1 | 2.69 log perplexity | 10,240x H100 | Eos-dfw_n1280 | 3.1-2006 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | GPT3 | 3.9 | 2.69 log perplexity | 10,752x H100 | Eos-dfw_n1344 | 3.1-2007 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB |
| NVIDIA NeMo | Stable Diffusion | 10.0 | FID <= 90 and CLIP >= 0.15 | 64x H100 | Eos_n8 | 3.1-2060 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| NVIDIA NeMo | Stable Diffusion | 2.9 | FID <= 90 and CLIP >= 0.15 | 512x H100 | Eos_n64 | 3.1-2055 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| NVIDIA NeMo | Stable Diffusion | 2.5 | FID <= 90 and CLIP >= 0.15 | 1,024x H100 | Eos_n128 | 3.1-2050 | Mixed | LAION-400M-filtered | H100-SXM5-80GB |
| MXNet | ResNet-50 v1.5 | 2.5 | 75.90% classification | 64x H100 | Eos_n8 | 3.1-2058 | Mixed | ImageNet | H100-SXM5-80GB |
| MXNet | ResNet-50 v1.5 | 0.2 | 75.90% classification | 3,584x H100 | coreweave_hgxh100_n448_ngc23.04_mxnet | 3.1-2010 | Mixed | ImageNet | H100-SXM5-80GB |
| MXNet | 3D U-Net | 1.9 | 0.908 Mean DICE score | 72x H100 | Eos_n9 | 3.1-2063 | Mixed | KiTS19 | H100-SXM5-80GB |
| MXNet | 3D U-Net | 0.8 | 0.908 Mean DICE score | 768x H100 | Eos_n96 | 3.1-2064 | Mixed | KiTS19 | H100-SXM5-80GB |
| PyTorch | BERT | 0.9 | 0.72 Mask-LM accuracy | 64x H100 | Eos_n8 | 3.1-2061 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | BERT | 0.1 | 0.72 Mask-LM accuracy | 3,472x H100 | Eos_n434 | 3.1-2053 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | Mask R-CNN | 4.3 | 0.377 Box min AP and 0.339 Mask min AP | 64x H100 | Eos_n8 | 3.1-2061 | Mixed | COCO2017 | H100-SXM5-80GB |
| PyTorch | Mask R-CNN | 1.5 | 0.377 Box min AP and 0.339 Mask min AP | 384x H100 | Eos_n48 | 3.1-2054 | Mixed | COCO2017 | H100-SXM5-80GB |
| PyTorch | RNN-T | 4.2 | 0.058 Word Error Rate | 64x H100 | Eos_n8 | 3.1-2061 | Mixed | LibriSpeech | H100-SXM5-80GB |
| PyTorch | RNN-T | 1.7 | 0.058 Word Error Rate | 512x H100 | Eos_n64 | 3.1-2056 | Mixed | LibriSpeech | H100-SXM5-80GB |
| PyTorch | RetinaNet | 6.1 | 34.0% mAP | 64x H100 | Eos_n8 | 3.1-2062 | Mixed | A subset of OpenImages | H100-SXM5-80GB |
| PyTorch | RetinaNet | 0.9 | 34.0% mAP | 2,048x H100 | Eos_n256 | 3.1-2052 | Mixed | A subset of OpenImages | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 1.4 | 0.80275 AUC | 64x H100 | Eos_n8 | 3.1-2059 | Mixed | Criteo 4TB | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 1.0 | 0.80275 AUC | 128x H100 | Eos_n16 | 3.1-2051 | Mixed | Criteo 4TB | H100-SXM5-80GB |

MLPerf™ v3.1 Training Closed: 3.1-2005, 3.1-2006, 3.1-2007, 3.1-2008, 3.1-2009, 3.1-2010, 3.1-2011, 3.1-2019, 3.1-2028, 3.1-2047, 3.1-2048, 3.1-2050, 3.1-2051, 3.1-2052, 3.1-2053, 3.1-2054, 3.1-2055, 3.1-2056, 3.1-2057, 3.1-2058, 3.1-2059, 3.1-2060, 3.1-2061, 3.1-2062, 3.1-2063, 3.1-2064, 3.1-2065, 3.1-2068 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For Training rules and guidelines, click here


NVIDIA Performance on MLPerf 3.0’s Training HPC Benchmarks: Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| PyTorch | CosmoFlow | 2.1 | Mean average error 0.124 | 512x H100 | eos | 3.0-8006 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | H100-SXM5-80GB |
| PyTorch | DeepCAM | 0.8 | IOU 0.82 | 2,048x H100 | eos | 3.0-8007 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | H100-SXM5-80GB |
| PyTorch | OpenCatalyst | 10.7 | Forces mean absolute error 0.036 | 640x H100 | eos | 3.0-8008 | Mixed | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | H100-SXM5-80GB |
| PyTorch | OpenFold | 7.5 | Local Distance Difference Test (lDDT-Cα) >= 0.8 | 2,080x H100 | eos | 3.0-8009 | Mixed | OpenProteinSet and Protein Data Bank | H100-SXM5-80GB |

MLPerf™ v3.0 Training HPC Closed: 3.0-8006, 3.0-8007, 3.0-8008, 3.0-8009 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For MLPerf™ v3.0 Training HPC rules and guidelines, click here



LLM Training Performance on NVIDIA Data Center Products


H100 Training Performance



| Framework | Framework Version | Network | Time to Train (days) | Throughput per GPU | GPU | Server | Container | Sequence Length | TP | PP | Precision | Global Batch Size | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NeMo | 1.23 | GPT3 5B | 0.5 | 23,574 tokens/sec | 64x H100 | Eos | nemo:24.03 | 2,048 | 1 | 1 | FP8 | 2,048 | H100 SXM5 80GB |
| NeMo | 1.23 | GPT3 20B | 2 | 5,528 tokens/sec | 64x H100 | Eos | nemo:24.03 | 2,048 | 2 | 1 | FP8 | 256 | H100 SXM5 80GB |
| NeMo | 1.23 | Llama2 7B | 0.7 | 16,290 tokens/sec | 8x H100 | Eos | nemo:24.03 | 4,096 | 1 | 1 | FP8 | 128 | H100 SXM5 80GB |
| NeMo | 1.23 | Llama2 13B | 1.4 | 8,317 tokens/sec | 16x H100 | Eos | nemo:24.03 | 4,096 | 1 | 4 | FP8 | 128 | H100 SXM5 80GB |
| NeMo | 1.23 | Llama2 70B | 6.6 | 1,725 tokens/sec | 64x H100 | Eos | nemo:24.03 | 4,096 | 4 | 4 | FP8 | 128 | H100 SXM5 80GB |

TP: Tensor Parallelism
PP: Pipeline Parallelism
Time to Train is the estimated time to train on 1T tokens with 1K GPUs.
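The estimate is straightforward arithmetic: divide 1T tokens by the aggregate throughput (the per-GPU tokens/sec from the table, scaled to 1,000 GPUs). A minimal sketch, using figures from the table above:

```python
def estimated_days(tokens_per_sec_per_gpu, total_tokens=1e12, num_gpus=1000):
    """Estimated days to process `total_tokens`, assuming the measured
    per-GPU throughput scales linearly to `num_gpus` GPUs."""
    seconds = total_tokens / (tokens_per_sec_per_gpu * num_gpus)
    return seconds / (24 * 3600)

print(round(estimated_days(23_574), 1))  # GPT3 5B  -> 0.5 days
print(round(estimated_days(16_290), 1))  # Llama2 7B -> 0.7 days
```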


Converged Training Performance on NVIDIA Data Center GPUs


H100 Training Performance



| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.3.0a0 | Tacotron2 | 67 | 0.56 Training Loss | 469,109 total output mels/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 128 | LJSpeech 1.1 | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | WaveGlow | 119 | -5.8 Training Loss | 3,645,916 output samples/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 10 | LJSpeech 1.1 | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | GNMT v2 | 9 | 24.15 BLEU Score | 1,699,570 total tokens/sec | 8x H100 | DGX H100 | 23.12-py3 | Mixed | 128 | wmt16-en-de | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | NCF | 0.27 | 0.96 Hit Rate at 10 | 218,094,053 samples/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 131072 | MovieLens 20M | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | FastPitch | 75 | 0.17 Training Loss | 1,331,733 frames/sec | 8x H100 | DGX H100 | 24.02-py3 | TF32 | 32 | LJSpeech 1.1 | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | Transformer XL Large | 318 | 17.83 Perplexity | 262,462 total tokens/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 16 | WikiText-103 | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | Transformer XL Base | 141 | 21.61 Perplexity | 952,253 total tokens/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 128 | WikiText-103 | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | EfficientNet-B4 | 1,667 | 82.02 Top 1 | 5,231 images/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 128 | Imagenet2012 | H100 SXM5 80GB |
| PyTorch | 2.1.0a0 | EfficientDet-D0 | 325 | 0.33 BBOX mAP | 2,658 images/sec | 8x H100 | DGX H100 | 23.10-py3 | Mixed | 150 | COCO 2017 | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | EfficientNet-WideSE-B4 | 1,673 | 82.01 Top 1 | 5,218 images/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 128 | Imagenet2012 | H100 SXM5 80GB |
| PyTorch | 2.2.0a0 | TFT-Electricity | 2 | 0.03 Test P90 | 145,082 items/sec | 8x H100 | DGX H100 | 23.12-py3 | Mixed | 1024 | Electricity | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | HiFiGAN | 948 | 9.42 Training Loss | 115,461 total output mels/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 16 | LJSpeech-1.1 | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | GPUNet-0 | 1,052 | 78.91 Top 1 | 9,950 images/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 192 | Imagenet2012 | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | GPUNet-1 | 960 | 80.45 Top 1 | 10,946 images/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 192 | Imagenet2012 | H100 SXM5 80GB |
| PyTorch | 2.3.0a0 | MoFlow | 35 | 89.67 NUV | 46,451 molecules/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 512 | ZINC | H100 SXM5 80GB |
| TensorFlow | 2.13.0 | U-Net Medical | 1 | 0.89 DICE Score | 2,139 images/sec | 8x H100 | DGX H100 | 23.10-py3 | Mixed | 8 | EM segmentation challenge | H100 SXM5 80GB |
| TensorFlow | 2.15.0 | Electra Fine Tuning | 2 | 92.59 F1 | 5,062 sequences/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 32 | SQuAD v1.1 | H100 SXM5 80GB |
| TensorFlow | 2.13.0 | Wide and Deep | 4 | 0.66 MAP at 12 | 12,217,033 samples/sec | 8x H100 | DGX H100 | 23.10-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | H100 SXM5 80GB |
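"Mixed" in the Precision column refers to automatic mixed precision (AMP). As a rough illustration only (a generic PyTorch AMP loop with synthetic data and hypothetical shapes, not the benchmark scripts that produced these results; requires a CUDA device):

```python
import torch

model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # scales loss to avoid FP16 gradient underflow

for _ in range(10):
    x = torch.randn(32, 512, device="cuda")
    y = torch.randint(0, 10, (32,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # runs eligible ops in reduced precision
        loss = torch.nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()     # backward pass on the scaled loss
    scaler.step(optimizer)            # unscales gradients, then optimizer step
    scaler.update()                   # adjusts the loss scale for the next step
```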

A30 Training Performance


| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.3.0a0 | Tacotron2 | 131 | 0.52 Training Loss | 232,954 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 104 | LJSpeech 1.1 | A30 |
| PyTorch | 2.3.0a0 | WaveGlow | 403 | Training Loss | 1,042,579 output samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 10 | LJSpeech 1.1 | A30 |
| PyTorch | 2.3.0a0 | GNMT v2 | 49 | 24.21 BLEU Score | 309,310 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | wmt16-en-de | A30 |
| PyTorch | 2.3.0a0 | NCF | 1 | 0.96 Hit Rate at 10 | 41,848,626 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 131072 | MovieLens 20M | A30 |
| PyTorch | 2.3.0a0 | FastPitch | 156 | 0.17 Training Loss | 545,724 frames/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 16 | LJSpeech 1.1 | A30 |
| PyTorch | 2.3.0a0 | Transformer XL Base | 198 | 22.87 Perplexity | 168,704 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 32 | WikiText-103 | A30 |
| PyTorch | 2.3.0a0 | EfficientNet-B0 | 793 | 77.13 Top 1 | 11,235 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | Imagenet2012 | A30 |
| PyTorch | 2.3.0a0 | EfficientNet-WideSE-B0 | 820 | 77.21 Top 1 | 10,863 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | Imagenet2012 | A30 |
| PyTorch | 2.2.0a0 | MoFlow | 100 | 87.86 NUV | 12,351 molecules/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.12-py3 | Mixed | 512 | ZINC | A30 |
| TensorFlow | 2.13.0 | U-Net Medical | 4 | 0.89 DICE Score | 460 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.10-py3 | Mixed | 8 | EM segmentation challenge | A30 |
| TensorFlow | 2.15.0 | Electra Fine Tuning | 5 | 92.63 F1 | 1,024 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 16 | SQuAD v1.1 | A30 |
| TensorFlow | 2.14.0 | SIM | 1 | 0.81 AUC | 2,481,945 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.12-py3 | Mixed | 16384 | Amazon Reviews | A30 |

A10 Training Performance


| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.3.0a0 | Tacotron2 | 144 | 0.53 Training Loss | 214,246 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 104 | LJSpeech 1.1 | A10 |
| PyTorch | 2.3.0a0 | WaveGlow | 541 | -5.73 Training Loss | 776,764 output samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 10 | LJSpeech 1.1 | A10 |
| PyTorch | 2.3.0a0 | GNMT v2 | 53 | 24.2 BLEU Score | 282,447 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | wmt16-en-de | A10 |
| PyTorch | 2.3.0a0 | NCF | 2 | 0.96 Hit Rate at 10 | 32,920,397 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | TF32 | 131072 | MovieLens 20M | A10 |
| PyTorch | 2.3.0a0 | FastPitch | 180 | 0.17 Training Loss | 460,415 frames/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 16 | LJSpeech 1.1 | A10 |
| PyTorch | 2.3.0a0 | EfficientNet-B0 | 1,045 | 77.11 Top 1 | 8,625 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | Imagenet2012 | A10 |
| PyTorch | 2.3.0a0 | EfficientNet-WideSE-B0 | 1,076 | 77.31 Top 1 | 8,487 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | Imagenet2012 | A10 |
| PyTorch | 2.2.0a0 | MoFlow | 93 | 86.86 NUV | 13,184 molecules/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.12-py3 | Mixed | 512 | ZINC | A10 |
| TensorFlow | 2.13.0 | U-Net Medical | 4 | 0.89 DICE Score | 352 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.10-py3 | Mixed | 8 | EM segmentation challenge | A10 |
| TensorFlow | 2.15.0 | Electra Fine Tuning | 5 | 92.52 F1 | 826 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 16 | SQuAD v1.1 | A10 |
| TensorFlow | 2.14.0 | SIM | 1 | 0.8 AUC | 2,346,013 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.12-py3 | Mixed | 16384 | Amazon Reviews | A10 |



View More Performance Data

AI Inference

Real-world inference demands high throughput and low latency with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into production with the highest performance from data center to edge.

Learn More

AI Pipeline

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.

Learn More