
GPU Performance Deep Learning Benchmarks - BERT, GNMT, & More

August 29, 2025
13 min read
Update 8/28/25: Added NVIDIA RTX PRO 6000 Blackwell Server Edition

Introduction: GPU Deep Learning Benchmarks

When selecting hardware for AI and machine learning workloads, comprehensive performance data across real-world scenarios is essential for making informed decisions. In this post, we benchmark a variety of NVIDIA GPUs on a suite of deep learning workloads. Here is our system configuration:

CPU: AMD EPYC 9654 96-Core Processor
Memory: 755GB DDR5 ECC
NVIDIA Driver: 570.133.07
CUDA Version: 12.6.77
cuDNN Version: 12.8
OS: Ubuntu 22.04.5 LTS
PyTorch: 2.5.0 (NVIDIA optimized)
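
If you want to record a comparable snapshot of your own stack, the standard PyTorch introspection calls below print the same fields (this is a minimal sketch that assumes a CUDA-enabled PyTorch build):

```python
# Capture a software/hardware snapshot like the configuration table above.
import torch

print(f"PyTorch: {torch.__version__}")
print(f"CUDA (build): {torch.version.cuda}")
print(f"cuDNN: {torch.backends.cudnn.version()}")  # integer-encoded, e.g. 90100
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    props = torch.cuda.get_device_properties(0)
    print(f"VRAM: {props.total_memory / 1024**3:.0f} GB")
```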

While these benchmarks will not map one-to-one onto your particular workload, they give a good sense of relative performance across multiple GPUs.

This blog will continue to grow as we test more and more GPUs. Current GPUs tested:

  • NVIDIA RTX PRO 6000 Blackwell Server Edition
  • NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  • NVIDIA RTX 6000 Ada
  • NVIDIA L40S 48GB

Testing Environment and Methodology

Our testing utilized a deep learning benchmark suite representing real-world AI workloads:

  • BERT Base/Large: Natural language processing and transformer models
  • GNMT (Google Neural Machine Translation): Sequence-to-sequence translation
  • ResNet50: Image classification model for computer vision applications
  • TransformerXL Base/Large: Advanced natural language processing for sequence modeling and context understanding in text
  • Tacotron2: Text-to-speech synthesis model
  • WaveGlow: High-quality audio generation
  • SSD (Single Shot MultiBox Detector): Object detection for computer vision
  • Neural Collaborative Filtering (NCF): Recommendation system

These performance numbers should not be taken at face value; treat them as reference points for comparing relative performance on similar workloads. We selected a diverse set of workloads spanning vision, language, speech, and recommendation models.
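
To make the metrics concrete, here is a minimal sketch of how a training-throughput figure like samples/sec can be measured in PyTorch. The toy model and synthetic batch are stand-ins; the suite itself uses NVIDIA's reference implementations of the models above, so treat this as an illustration of the measurement, not our harness:

```python
import time
import torch

def measure_throughput(model, batch, target, loss_fn, steps=50, warmup=10):
    """Return training samples/sec, excluding warmup iterations."""
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    for step in range(warmup + steps):
        if step == warmup:
            torch.cuda.synchronize()          # start timing only after warmup
            start = time.perf_counter()
        opt.zero_grad()
        loss = loss_fn(model(batch), target)
        loss.backward()
        opt.step()
    torch.cuda.synchronize()                  # wait for queued GPU kernels
    return steps * batch.shape[0] / (time.perf_counter() - start)

device = "cuda"
model = torch.nn.Sequential(                  # toy stand-in for BERT/ResNet/etc.
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
).to(device)
x = torch.randn(64, 1024, device=device)      # synthetic batch
y = torch.randn(64, 1024, device=device)
print(f"{measure_throughput(model, x, y, torch.nn.MSELoss()):.0f} samples/sec")
```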

Benchmark Notes

  • GPU scalability is relatively strong, with some performance overhead: we see a near 1:1 ratio of speedup to GPU count. However, in workloads like Tacotron2, TransformerXL Large, and NCF, scalability is impacted, likely because these workloads run at smaller effective batch sizes and pull from memory more often (see the multi-GPU sketch after this list).
  • NVIDIA Blackwell performs well beyond NVIDIA Ada, in some cases delivering double the performance (ResNet50, WaveGlow, SSD, and NCF). With more onboard memory, models and data can stay resident on the GPU, which reduces memory-bandwidth bottlenecks.
  • The RTX 6000 Ada and L40S show very similar performance since they share the same GPU architecture and die. We saw a few anomalies between these two GPUs, with the two cards trading positions in some workloads.
  • The NVIDIA RTX PRO 6000 Blackwell Server Edition performs slightly better than the NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition due to its higher power limit.
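
For reference, multi-GPU numbers like these come from data-parallel training: one process per GPU, with gradients synchronized over NCCL. Below is a minimal, hypothetical PyTorch DistributedDataParallel sketch (the model, sizes, and filename are placeholders; the actual suite uses NVIDIA's reference launch scripts):

```python
# Launch with: torchrun --nproc_per_node=4 ddp_sketch.py   (filename is hypothetical)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                   # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])        # set by torchrun
torch.cuda.set_device(local_rank)

# Toy stand-in for BERT/ResNet/etc.; DDP all-reduces gradients across GPUs.
model = DDP(torch.nn.Linear(1024, 1024).cuda(), device_ids=[local_rank])
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

for _ in range(100):
    x = torch.randn(64, 1024, device=local_rank)  # per-GPU batch; global batch grows with GPU count
    loss = model(x).square().mean()               # dummy loss for illustration
    opt.zero_grad()
    loss.backward()                               # gradient all-reduce overlaps with backward here
    opt.step()

dist.destroy_process_group()
```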

Fueling Innovation with an Exxact Multi-GPU Server

Training AI models on massive datasets can be accelerated exponentially with the right system. It's not just a high-performance computer, but a tool to propel and accelerate your research.

Configure Now

Deep Learning GPU Benchmarks on FP32

| Configuration | BERT Base (seq/sec) | BERT Large (seq/sec) | GNMT (seq/sec) | ResNet50 (images/sec) | TransformerXL Base (tokens/sec) | TransformerXL Large (tokens/sec) | Tacotron2 (samples/sec) | SSD (samples/sec) | WaveGlow (samples/sec) | NCF (samples/sec) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1x RTX PRO 6000 Blackwell Server | 268 | 88 | 187,357 | 1,141 | 40,250 | 17,630 | 74,815 | 876 | 286,560 | 30,537,703 |
| 2x RTX PRO 6000 Blackwell Server | 533 | 175 | 367,076 | 2,272 | 81,675 | 35,540 | 126,570 | 1,748 | 569,794 | 69,955,830 |
| 4x RTX PRO 6000 Blackwell Server | 1,062 | 350 | 728,504 | 4,539 | 163,373 | 68,083 | 243,242 | 3,494 | 1,136,257 | 111,107,835 |
| 8x RTX PRO 6000 Blackwell Server | 2,129 | 683 | 1,374,085 | 9,066 | 326,318 | 136,074 | 470,607 | 6,916 | 2,274,969 | 188,810,349 |
| 1x RTX PRO 6000 Blackwell Max-Q | 228 | 74 | 140,714 | 1,051 | 35,370 | 14,349 | 69,519 | 749 | 280,272 | 30,324,158 |
| 2x RTX PRO 6000 Blackwell Max-Q | 458 | 146 | 291,772 | 2,045 | 67,808 | 28,374 | 141,681 | 1,480 | 528,102 | 58,579,611 |
| 4x RTX PRO 6000 Blackwell Max-Q | 866 | 293 | 570,864 | 4,101 | 127,604 | 54,626 | 277,923 | 2,980 | 1,042,007 | 105,955,663 |
| 8x RTX PRO 6000 Blackwell Max-Q | 1,722 | 576 | 1,116,521 | 8,078 | 202,477 | 89,139 | 485,055 | 5,831 | 2,053,141 | 183,317,525 |
| 1x L40S 48GB | 130 | 44 | 92,008 | 554 | 20,284 | 8,262 | 28,856 | 183 | 148,217 | 18,486,005 |
| 2x L40S 48GB | 257 | 86 | 168,734 | 1,095 | 40,440 | 16,497 | 59,311 | 346 | 281,196 | 54,722,411 |
| 4x L40S 48GB | 508 | 169 | 341,691 | 2,189 | 80,939 | 33,036 | 72,772 | 692 | 487,629 | 102,390,292 |
| 1x RTX 6000 Ada | 149 | 49 | 94,317 | 668 | 22,684 | 9,362 | 39,670 | 369 | 179,979 | 20,544,461 |
| 2x RTX 6000 Ada | 292 | 94 | 120,641 | 1,338 | 44,828 | 18,282 | 76,071 | 729 | 350,283 | 37,312,167 |
| 4x RTX 6000 Ada | 581 | 187 | 354,711 | 2,679 | 89,906 | 36,572 | 130,244 | 1,454 | 682,168 | 80,183,880 |
| 8x RTX 6000 Ada | 1,156 | 365 | 713,068 | 5,356 | 177,943 | 73,263 | 186,411 | 2,895 | 1,312,460 | 146,599,996 |

 

Deep Learning GPU Benchmarks on FP16

| Configuration | BERT Base (seq/sec) | BERT Large (seq/sec) | GNMT (seq/sec) | ResNet50 (images/sec) | TransformerXL Base (tokens/sec) | TransformerXL Large (tokens/sec) | Tacotron2 (samples/sec) | SSD (samples/sec) | WaveGlow (samples/sec) | NCF (samples/sec) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1x RTX PRO 6000 Blackwell Server | 511 | 174 | 312,180 | 1,918 | 62,700 | 28,059 | 74,883 | 861 | 290,592 | 30,620,969 |
| 2x RTX PRO 6000 Blackwell Server | 1,004 | 344 | 618,905 | 3,819 | 127,538 | 55,760 | 145,355 | 1,720 | 578,740 | 69,955,831 |
| 4x RTX PRO 6000 Blackwell Server | 2,047 | 689 | 1,230,190 | 7,636 | 255,038 | 108,545 | 227,987 | 3,437 | 1,156,381 | 129,237,765 |
| 8x RTX PRO 6000 Blackwell Server | 6,772 | 1,369 | 2,065,528 | 15,252 | 509,370 | 216,779 | 430,069 | 6,816 | 2,308,974 | 217,999,600 |
| 1x RTX PRO 6000 Blackwell Max-Q | 406 | 133 | 239,435 | 1,741 | 50,968 | 21,329 | 68,912 | 733 | 269,597 | 30,136,004 |
| 2x RTX PRO 6000 Blackwell Max-Q | 811 | 266 | 492,598 | 3,446 | 102,205 | 41,592 | 141,746 | 1,449 | 522,326 | 55,119,072 |
| 4x RTX PRO 6000 Blackwell Max-Q | 1,606 | 533 | 980,809 | 6,859 | 202,013 | 81,345 | 247,240 | 2,911 | 1,037,704 | 106,410,444 |
| 8x RTX PRO 6000 Blackwell Max-Q | 4,079 | 1,041 | 1,835,891 | 13,498 | 339,800 | 139,475 | 454,854 | 5,695 | 2,036,042 | 171,572,342 |
| 1x L40S 48GB | 264 | 90 | 162,228 | 1,028 | 38,357 | 15,553 | 34,816 | 350 | 150,572 | 25,112,903 |
| 2x L40S 48GB | 522 | 176 | 307,875 | 1,978 | 76,492 | 31,044 | 67,438 | 620 | 280,836 | 55,054,943 |
| 4x L40S 48GB | 1,037 | 353 | 617,952 | 3,961 | 129,674 | 62,242 | 106,819 | 1,239 | 491,005 | 95,436,610 |
| 1x RTX 6000 Ada | 267 | 89 | 157,119 | 1,185 | 40,155 | 16,334 | 41,051 | 437 | 185,530 | 33,029,658 |
| 2x RTX 6000 Ada | 532 | 174 | 308,098 | 2,345 | 80,511 | 32,257 | 77,495 | 876 | 351,433 | 72,748,463 |
| 4x RTX 6000 Ada | 1,052 | 345 | 617,944 | 4,701 | 159,669 | 64,687 | 132,918 | 1,745 | 683,535 | 133,228,628 |
| 8x RTX 6000 Ada | 2,111 | 690 | 1,222,839 | 9,344 | 345,991 | 128,939 | 145,731 | 3,070 | 1,315,155 | 222,566,063 |

 

BERT Base GPU Benchmark

BERT Base GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

BERT Large GPU Benchmark

BERT Large GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

ResNet50 GPU Benchmark

ResNet50 GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

GNMT GPU Benchmark

GNMT GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

TransformerXL Base GPU Benchmark

TransformerXL Base GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

TransformerXL Large GPU Benchmark

TransformerXL Large GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

Tacotron2 GPU Benchmark

Tacotron2 GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

Single Shot MultiBox Detector GPU Benchmark

Single Shot MultiBox Detector GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

WaveGlow GPU Benchmark

WaveGlow GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

Neural Collaborative Filtering (NCF) GPU Benchmark

Neural Collaborative Filtering GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

Multi-GPU Scaling Analysis - Scaling Efficiency (vs Single GPU)

Next, we look at scaling efficiency relative to single-GPU performance. For each GPU count, we averaged the speedup across the GPUs tested at that scale, which shows which workloads benefit most from multiple GPUs.

| Configuration | BERT Base | BERT Large | GNMT | ResNet50 | TransformerXL Base | TransformerXL Large | Tacotron2 | SSD | WaveGlow | NCF |
|---|---|---|---|---|---|---|---|---|---|---|
| 2x GPUs (FP16) | 1.98 | 1.98 | 1.98 | 1.97 | 2.01 | 1.98 | 1.97 | 1.96 | 1.93 | 2.13 |
| 4x GPUs (FP16) | 3.97 | 3.95 | 3.96 | 3.94 | 3.88 | 3.90 | 3.25 | 3.92 | 3.76 | 3.91 |
| 8x GPUs (FP16) | 11.94 | 8.50 | 7.84 | 8.65 | 8.29 | 7.96 | 6.26 | 8.73 | 8.42 | 6.86 |
| 2x GPUs (FP32) | 1.99 | 1.96 | 1.84 | 1.98 | 1.98 | 1.99 | 1.90 | 1.98 | 1.93 | 2.21 |
| 4x GPUs (FP32) | 3.89 | 3.92 | 3.88 | 3.96 | 3.89 | 3.88 | 3.40 | 3.96 | 3.74 | 4.00 |
| 8x GPUs (FP32) | 8.61 | 8.49 | 8.30 | 8.79 | 7.95 | 8.02 | 7.15 | 9.58 | 8.40 | 6.92 |
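
Scaling efficiency is simply the measured speedup divided by the GPU count. For example, taking the FP16 BERT Base column from the table above:

```python
# Scaling efficiency = measured speedup / GPU count (FP16 BERT Base column).
speedups = {2: 1.98, 4: 3.97, 8: 11.94}
for n, s in speedups.items():
    print(f"{n}x GPUs: {s / n:.0%} efficiency")
# -> 2x: 99%, 4x: 99%, 8x: 149% (superlinear; an anomaly in this run)
```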

 

FP16 Speedup vs FP32 Precision Performance Comparison

Here we check which workloads benefit from FP16. For each workload, we divided the FP16 result by the corresponding FP32 result and averaged across GPUs. By this measure, WaveGlow, Tacotron2, SSD, and NCF do not see meaningful benefits.

| Configuration | BERT Base | BERT Large | GNMT | ResNet50 | TransformerXL Base | TransformerXL Large | Tacotron2 | SSD | WaveGlow | NCF |
|---|---|---|---|---|---|---|---|---|---|---|
| 1x GPU | 1.87 | 1.91 | 1.69 | 1.72 | 1.62 | 1.64 | 1.03 | 1.09 | 1.00 | 1.19 |
| 2x GPUs | 1.86 | 1.92 | 1.82 | 1.72 | 1.65 | 1.63 | 1.07 | 1.08 | 1.00 | 1.15 |
| 4x GPUs | 1.90 | 1.92 | 1.73 | 1.71 | 1.62 | 1.65 | 0.99 | 1.08 | 1.01 | 1.16 |
| 8x GPUs | 2.59 | 1.91 | 1.60 | 1.69 | 1.69 | 1.63 | 0.90 | 1.00 | 1.00 | 1.18 |
| Average Speedup | 2.06 | 1.91 | 1.71 | 1.71 | 1.64 | 1.63 | 1.00 | 1.06 | 1.00 | 1.17 |
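
A comparison like this is produced by timing the same training step once in FP32 and once under automatic mixed precision, then dividing the throughputs. Below is a minimal sketch of the mixed-precision step in PyTorch 2.x (illustrative only, not the exact harness used here):

```python
import torch

scaler = torch.amp.GradScaler("cuda")       # rescales the loss so FP16 gradients don't underflow

def train_step_amp(model, opt, x, y, loss_fn):
    opt.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = loss_fn(model(x), y)         # matmuls/convs execute on FP16 tensor cores
    scaler.scale(loss).backward()
    scaler.step(opt)                        # unscales gradients; skips the step on inf/NaN
    scaler.update()

# speedup = samples_per_sec_with_amp / samples_per_sec_fp32
```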

 

Performance Insights from Our Benchmark Results

Development Teams

  • Mixed Precision Acceleration: Roughly 2x FP16 speedup for BERT Base (2.06x average in our runs), allowing about twice as many experiments in the same timeframe
  • Strong Multi-GPU Scaling: Additional GPUs carry some overhead, but most benchmarks scale near-linearly in multi-GPU configurations; Tacotron2 and NCF are the main exceptions
  • Memory Efficiency: 96GB of VRAM on the NVIDIA RTX PRO 6000 Blackwell supports larger batch sizes for BERT and TransformerXL Large models

Production Deployments

  • Workload Versatility: Strong performance across all 10 benchmarked models, from NLP to computer vision
  • Near-Linear Scaling: 3.97x FP16 scaling with 4 GPUs on BERT Base demonstrates an excellent performance-to-cost ratio
  • Cost-Optimized Performance: All of the GPUs tested deliver strong scaling efficiency for production AI workloads

Research Organizations

  • Domain-Specific Excellence: Top performance in vision (ResNet50, SSD), language (BERT, TransformerXL), and audio (Tacotron2, WaveGlow) models
  • Precision Flexibility: Choose between FP16 (roughly 1.6-2.1x average speedup on transformer and vision models) and FP32 based on accuracy requirements; check the average speedup row above to see whether your workload type benefits
  • Hardware Optimization: These benchmarks can guide you to the right GPU architecture for your deep learning workloads, optimizing the ROI of your AI infrastructure investment

 

Conclusion

Our extensive benchmarking reveals strong performance from modern GPUs across a diverse range of deep learning workloads. The data demonstrates near-linear scaling for multi-GPU configurations, with 2-GPU setups achieving roughly 2x performance and 4-GPU systems delivering close to 4x speedups on most workloads.

FP16 precision offers excellent acceleration in most workloads, especially transformer-based models (BERT and TransformerXL). For multi-GPU scalability, transformer-based models like BERT and TransformerXL show excellent multi-GPU efficiency, while models like WaveGlow, NCF, Tacotron2, and SSD scale less efficiently.

These insights provide valuable guidance for organizations designing AI infrastructure to match their specific workload requirements, whether for research experimentation, model development, or production deployment.

Need guidance with configuration and integration for your computing infrastructure? Exxact is a leading solution integrator helping to deliver custom-configurable workstations, servers, and full rack clusters. Contact us for more information!

 

Fueling Innovation with an Exxact Designed Computing Cluster

Deploying full-scale AI models can be accelerated exponentially with the right computing infrastructure. Storage, head node, networking, compute - all components of your next Exxact cluster are configurable to your workload to drive and accelerate research and innovation.

Get a Quote Today