GPU Performance Deep Learning Benchmarks - BERT, GNMT, & More

August 8, 2025
12 min read

When selecting hardware for AI and machine learning workloads, comprehensive performance data across real-world scenarios is essential for making informed decisions. In this post, we benchmark a variety of NVIDIA GPUs across a suite of deep learning workloads. Here is our system configuration:

CPU: AMD EPYC 9654 96-Core Processor
Memory: 755GB DDR5
NVIDIA Driver: 570.133.07
CUDA Version: 12.6.77
cuDNN Version: 12.8
OS: Ubuntu 22.04.5 LTS
PyTorch: 2.5.0 (NVIDIA optimized)
While these workloads may not map one-to-one to your particular workload, they give a good picture of relative performance across multiple GPUs.

This blog will continue to grow as we test more GPUs. Current GPUs tested:

  • NVIDIA RTX PRO 6000 Blackwell Max-Q
  • NVIDIA RTX 6000 Ada
  • NVIDIA L40S 48GB

Testing Environment and Methodology

Our testing utilized a deep learning benchmark suite representing real-world AI workloads:

  • BERT Base/Large: Natural language processing and transformer models
  • GNMT (Google Neural Machine Translation): Sequence-to-sequence translation
  • ResNet50: Image classification model for computer vision applications
  • TransformerXL Base/Large: Advanced natural language processing for sequence modeling and context understanding in text
  • Tacotron2: Text-to-speech synthesis model
  • WaveGlow: High-quality audio generation
  • SSD (Single Shot MultiBox Detector): Object detection for computer vision
  • Neural Collaborative Filtering (NCF): Recommendation system

These performance numbers are not to be taken at face value. Instead, treat them as a reference point for comparing performance on similar workloads. We selected a diverse set of workloads spanning vision, language, speech, and recommendation models.
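
Every result below is a throughput figure (sequences, images, tokens, or samples per second). To illustrate how a number like images/sec is measured, here is a minimal PyTorch timing loop for a ResNet50-style workload. This is a simplified sketch that assumes torchvision is installed; it is not the NVIDIA-optimized benchmark harness that produced the tables below:

```python
import time
import torch
import torchvision

device = torch.device("cuda")
model = torchvision.models.resnet50().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

# Synthetic batch; the real benchmarks stream an actual dataset.
batch_size, steps = 64, 50
images = torch.randn(batch_size, 3, 224, 224, device=device)
labels = torch.randint(0, 1000, (batch_size,), device=device)

def train_step():
    optimizer.zero_grad()
    criterion(model(images), labels).backward()
    optimizer.step()

for _ in range(5):          # warmup iterations, excluded from timing
    train_step()
torch.cuda.synchronize()    # finish queued GPU work before starting the clock

start = time.perf_counter()
for _ in range(steps):
    train_step()
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"{batch_size * steps / elapsed:.0f} images/sec")
```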

Benchmark Notes

  • GPU scalability is relatively strong, with only modest overhead: we see a near 1:1 ratio of speedup to GPU count. However, in workloads like Tacotron2, TransformerXL Large, and NCF, scalability suffers, likely because these workloads run at smaller batch sizes and must pull from memory more often. (Training benchmarks of this kind typically scale via data parallelism; see the sketch after these notes.)
  • NVIDIA Blackwell performs well beyond NVIDIA Ada, in some cases delivering close to double the performance (ResNet50, WaveGlow, SSD, and NCF). With its larger memory, models and data can stay resident on the GPU, which reduces memory-bandwidth bottlenecks.
  • The RTX 6000 Ada and L40S show very similar performance since they share the same GPU architecture and die. We saw a few anomalies between these two GPUs, with the two cards trading positions in some workloads.
  • Tacotron2 shows a DNR (did not run) for the multi-GPU NVIDIA RTX PRO 6000 Blackwell runs due to an error on our end; we will rerun those configurations in due time. This is not a performance issue.
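
On the first note above: near-linear multi-GPU scaling in training benchmarks typically comes from data parallelism, where each GPU processes its own slice of the batch and gradients are all-reduced every step. The skeleton below is a minimal PyTorch DistributedDataParallel sketch under the assumption of a torchrun launch; it is illustrative only, not the benchmark scripts used for these results:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with: torchrun --nproc_per_node=8 train.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).to(local_rank)  # stand-in for BERT, ResNet50, etc.
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

data = torch.randn(64, 1024, device=f"cuda:{local_rank}")
for _ in range(100):
    optimizer.zero_grad()
    loss = model(data).pow(2).mean()
    loss.backward()   # DDP all-reduces gradients across GPUs here
    optimizer.step()

dist.destroy_process_group()
```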

Fueling Innovation with an Exxact Multi-GPU Server

Training AI models on massive datasets can be accelerated exponentially with the right system. It's not just a high-performance computer, but a tool to propel and accelerate your research.

Configure Now

Deep Learning GPU Benchmarks on FP32

Configuration | BERT Base (seq/sec) | BERT Large (seq/sec) | GNMT (seq/sec) | ResNet50 (images/sec) | TransformerXL Base (tokens/sec) | TransformerXL Large (tokens/sec) | Tacotron2 (samples/sec) | WaveGlow (samples/sec) | SSD (samples/sec) | NCF (samples/sec)
1x RTX PRO 6000 Blackwell Max-Q | 228 | 74 | 140,714 | 1,051 | 35,370 | 14,349 | 69,519 | 280,272 | 749 | 30,324,158
2x RTX PRO 6000 Blackwell Max-Q | 458 | 146 | 291,772 | 2,045 | 67,808 | 28,374 | DNR | 528,102 | 1,480 | 58,579,611
4x RTX PRO 6000 Blackwell Max-Q | 910 | 293 | 570,864 | 4,101 | 127,604 | 54,626 | DNR | 1,042,007 | 2,980 | 105,955,663
8x RTX PRO 6000 Blackwell Max-Q | 1,801 | 576 | 1,116,521 | 8,078 | 202,477 | 89,139 | DNR | 2,053,141 | 5,831 | 183,317,525
1x L40S 48GB | 130 | 44 | 92,008 | 554 | 20,284 | 8,262 | 28,856 | 148,217 | 183 | 17,262,116
2x L40S 48GB | 257 | 86 | 168,734 | 1,095 | 40,440 | 16,497 | 59,311 | 281,196 | 346 | 50,191,320
4x L40S 48GB | 508 | 169 | 341,691 | 2,189 | 80,939 | 33,036 | 72,772 | 487,629 | 692 | 102,390,292
1x RTX 6000 Ada | 149 | 49 | 94,317 | 668 | 22,684 | 9,362 | 39,670 | 179,979 | 369 | 20,544,461
2x RTX 6000 Ada | 292 | 94 | 120,641 | 1,338 | 44,828 | 18,282 | 76,071 | 350,283 | 729 | 37,312,167
4x RTX 6000 Ada | 581 | 187 | 354,711 | 2,679 | 89,906 | 36,572 | 130,244 | 682,168 | 1,454 | 80,183,880
8x RTX 6000 Ada | 1,156 | 365 | 713,068 | 5,356 | 177,943 | 73,263 | 186,411 | 1,312,460 | 2,895 | 146,599,996

DNR = did not run (reported as 0 in the raw data; see Benchmark Notes).

Deep Learning GPU Benchmarks on FP16

Configuration | BERT Base (seq/sec) | BERT Large (seq/sec) | GNMT (seq/sec) | ResNet50 (images/sec) | TransformerXL Base (tokens/sec) | TransformerXL Large (tokens/sec) | Tacotron2 (samples/sec) | SSD (samples/sec) | WaveGlow (samples/sec) | NCF (samples/sec)
1x RTX PRO 6000 Blackwell Max-Q | 406 | 133 | 239,435 | 1,741 | 50,968 | 21,329 | 68,912 | 733 | 269,597 | 30,136,004
2x RTX PRO 6000 Blackwell Max-Q | 811 | 266 | 492,598 | 3,446 | 102,205 | 41,592 | DNR | 1,449 | 522,326 | 55,119,072
4x RTX PRO 6000 Blackwell Max-Q | 1,606 | 533 | 980,809 | 6,859 | 202,013 | 81,345 | DNR | 2,911 | 1,037,704 | 106,410,444
8x RTX PRO 6000 Blackwell Max-Q | 5,876 | 1,041 | 1,835,891 | 13,498 | 339,800 | 139,475 | DNR | 5,695 | 2,036,042 | 171,572,342
1x L40S 48GB | 264 | 90 | 162,228 | 1,028 | 38,357 | 15,553 | 34,816 | 350 | 150,572 | 25,112,903
2x L40S 48GB | 522 | 176 | 307,875 | 1,978 | 76,492 | 31,044 | 67,438 | 620 | 280,836 | 55,054,943
4x L40S 48GB | 1,037 | 353 | 617,952 | 3,961 | 129,674 | 62,242 | 106,819 | 1,239 | 491,005 | 95,436,610
1x RTX 6000 Ada | 267 | 89 | 157,119 | 1,185 | 40,155 | 16,334 | 41,051 | 437 | 185,530 | 33,029,658
2x RTX 6000 Ada | 532 | 174 | 308,098 | 2,345 | 80,511 | 32,257 | 77,495 | 876 | 351,433 | 72,748,463
4x RTX 6000 Ada | 1,052 | 345 | 617,944 | 4,701 | 159,669 | 64,687 | 132,918 | 1,745 | 683,535 | 133,228,628
8x RTX 6000 Ada | 2,111 | 690 | 1,222,839 | 9,344 | 345,991 | 128,939 | 145,731 | 3,070 | 1,315,155 | 222,566,063

DNR = did not run (reported as 0 in the raw data; see Benchmark Notes).

BERT Base GPU Benchmark

BERT Base GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

BERT Large GPU Benchmark

BERT Large GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

ResNet50 GPU Benchmark

ResNet50 GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

GNMT GPU Benchmark

GNMT GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

TransformerXL Base GPU Benchmark

TransformerXL Base GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

TransformerXL Large GPU Benchmark

TransformerXL Large GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

Tacotron2 GPU Benchmark

Tacotron2 GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

Single Shot Multibox Detector GPU Benchmark

Single Shot Multibox Detector GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

WaveGlow GPU Benchmark

WaveGlow GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

Neural Collaborative Filtering GPU Benchmark

Neural Collaborative Filtering GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

Multi-GPU Scaling Analysis - Scaling Efficiency (vs Single GPU)

Next, we look at scaling efficiency relative to single-GPU performance. For each GPU count, we averaged the speedup across the GPU models tested (all three GPUs at 2x and 4x; the RTX PRO 6000 Blackwell and RTX 6000 Ada at 8x), which tells us which workloads benefit most from multiple GPUs.

Configuration | BERT Base | BERT Large | GNMT | ResNet50 | TransformerXL Base | TransformerXL Large | Tacotron2 | SSD | WaveGlow | NCF
2x GPUs (FP16) | 1.99 | 1.97 | 1.97 | 1.96 | 2.00 | 1.97 | 1.27 | 1.92 | 1.90 | 2.07
4x GPUs (FP16) | 3.94 | 3.94 | 3.95 | 3.92 | 3.77 | 3.93 | 3.15 | 3.83 | 3.60 | 3.79
8x GPUs (FP16) | 11.19 | 7.79 | 7.73 | 7.82 | 7.64 | 7.22 | 3.55 | 7.40 | 7.32 | 6.22
2x GPUs (FP32) | 1.98 | 1.95 | 1.73 | 1.98 | 1.96 | 1.98 | 1.32 | 1.95 | 1.91 | 2.22
4x GPUs (FP32) | 3.93 | 3.87 | 3.84 | 3.95 | 3.85 | 3.90 | 2.90 | 3.90 | 3.60 | 4.44
8x GPUs (FP32) | 7.83 | 7.62 | 7.75 | 7.85 | 6.78 | 7.02 | 4.70 | 7.82 | 7.31 | 6.59
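
To make the arithmetic concrete: at FP32, 2x RTX PRO 6000 Blackwell runs BERT Base at 458 seq/sec versus 228 seq/sec on one GPU (2.01x), the L40S goes from 130 to 257 seq/sec (1.98x), and the RTX 6000 Ada from 149 to 292 seq/sec (1.96x); the average of the three is the 1.98 reported above. The same calculation in code, using values from the FP32 table:

```python
# 2-GPU scaling for BERT Base at FP32, averaged across GPU models.
bert_base_fp32 = {
    "RTX PRO 6000 Blackwell": {1: 228, 2: 458},
    "L40S 48GB":              {1: 130, 2: 257},
    "RTX 6000 Ada":           {1: 149, 2: 292},
}

speedups = [t[2] / t[1] for t in bert_base_fp32.values()]
print(f"average 2-GPU scaling: {sum(speedups) / len(speedups):.2f}")  # -> 1.98
```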

FP16 Speedup vs FP32 Precision Performance Comparison

Here we check which workloads benefit from the FP16 speedup. For each configuration, we divided the FP16 throughput by the FP32 throughput and averaged across GPU models. By this measure, WaveGlow, Tacotron2, SSD, and NCF do not see meaningful benefits.

Configuration | SSD | BERT Base | BERT Large | GNMT | ResNet50 | Tacotron2 | WaveGlow | TransformerXL Base | TransformerXL Large | NCF
1x GPU | 1.36 | 1.87 | 1.89 | 1.71 | 1.76 | 1.08 | 1.00 | 1.70 | 1.70 | 1.35
2x GPUs | 1.32 | 1.87 | 1.91 | 2.02 | 1.75 | 1.08 | 1.00 | 1.73 | 1.70 | 1.33
4x GPUs | 1.32 | 1.87 | 1.92 | 1.76 | 1.75 | 1.24 | 1.00 | 1.65 | 1.71 | 1.20
8x GPUs | 1.02 | 2.54 | 1.85 | 1.68 | 1.71 | 0.78 | 1.00 | 1.81 | 1.66 | 1.23
Average Speedup | 1.26 | 2.04 | 1.89 | 1.79 | 1.74 | 1.05 | 1.00 | 1.72 | 1.70 | 1.28
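
For the workloads that do benefit, FP16 numbers in PyTorch typically come from automatic mixed precision rather than a hand-converted model. Below is a minimal sketch of the standard autocast/GradScaler training-loop pattern with a stand-in model; it illustrates the technique, not the exact benchmark code:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()   # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.amp.GradScaler("cuda")

data = torch.randn(64, 1024, device="cuda")
for _ in range(100):
    optimizer.zero_grad()
    with torch.amp.autocast("cuda"):          # eligible ops run in FP16
        loss = model(data).float().pow(2).mean()
    scaler.scale(loss).backward()             # scale loss to avoid FP16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
```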

Performance Insights from Our Benchmark Results

Development Teams

  • Mixed Precision Acceleration: Up to 2.04x speedup with FP16 for BERT Base, allowing twice as many experiments in the same timeframe
  • Strong Multi-GPU Scaling: Additional GPUs introduce some communication overhead, but most benchmarks show near-linear scaling in multi-GPU configurations
  • Memory Efficiency: 96GB VRAM on the NVIDIA RTX PRO 6000 Blackwell supports larger batch sizes in BERT and TransformerXL Large models

Production Deployments

  • Workload Versatility: Strong performance across all 10 benchmarked models from NLP to computer vision
  • Linear Scaling: 3.94x scaling with 4x GPUs for BERT Base demonstrates excellent performance-to-cost ratio
  • Cost-Optimized Performance: All GPUs tested deliver strong scaling efficiency for production AI workloads

Research Organizations

  • Domain-Specific Excellence: Top performance in vision (ResNet50, SSD), language (BERT, TransformerXL), and audio (Tacotron2, WaveGlow) models
  • Precision Flexibility: Choose between FP16 (1.70-2.04x speedup) and FP32 based on accuracy requirements. Check our average speedup to see if your type of workload would scale effectively.
  • Hardware Optimization: These benchmarks can help guide you to choose the right GPU architecture based on your deep learning workload needs, optimizing ROI for your AI infrastructure investment

 

Conclusion

Our extensive benchmarking analysis reveals the exceptional performance of modern GPUs across a diverse range of deep learning workloads. The data demonstrates near-linear scaling for multi-GPU configurations, with 2-GPU setups achieving up to 1.99x performance and 4-GPU systems delivering 3.94x speedups.

FP16 precision offers excellent acceleration in most workloads, especially transformer-based models (BERT and TransformerXL). Those same transformer models also show excellent multi-GPU efficiency, while Tacotron2 and NCF scale less efficiently.

These insights provide valuable guidance for organizations designing AI infrastructure to match their specific workload requirements, whether for research experimentation, model development, or production deployment.

Need guidance with configuration and integration for your computing infrastructure? Exxact is a leading solution integrator helping to deliver custom-configurable workstations, servers, and full rack clusters. Contact us for more information!

 

Fueling Innovation with an Exxact Designed Computing Cluster

Deploying full-scale AI models can be accelerated exponentially with the right computing infrastructure. Storage, head node, networking, compute - all components of your next Exxact cluster are configurable to your workload to drive and accelerate research and innovation.

Get a Quote Today