GPU Performance Deep Learning Benchmarks - BERT, GNMT, & More

August 8, 2025
12 min read

When selecting hardware for AI and machine learning workloads, comprehensive performance data across real-world scenarios is essential for making informed decisions. In this post, we benchmark a variety of NVIDIA GPUs across a suite of deep learning workloads. Here is our system configuration:

CPU: AMD EPYC 9654 96-Core Processor
Memory: 755GB DDR5
NVIDIA Driver: 570.133.07
CUDA Version: 12.6.77
cuDNN Version: 12.8
OS: Ubuntu 22.04.5 LTS
PyTorch: 2.5.0 (NVIDIA optimized)
While these workloads may not map one-to-one to your particular workload, they give a good picture of relative performance across multiple GPUs.

This blog will continue to grow as we test more GPUs. Current GPUs tested:

  • NVIDIA RTX PRO 6000 Blackwell Max-Q
  • NVIDIA RTX 6000 Ada
  • NVIDIA L40S 48GB

Testing Environment and Methodology

Our testing utilized a deep learning benchmark suite representing real-world AI workloads:

  • BERT Base/Large: Natural language processing and transformer models
  • GNMT (Google Neural Machine Translation): Sequence-to-sequence translation
  • ResNet50: Image classification model for computer vision applications
  • TransformerXL Base/Large: Advanced natural language processing for sequence modeling and context understanding in text
  • Tacotron2: Text-to-speech synthesis model
  • WaveGlow: High-quality audio generation
  • SSD (Single Shot MultiBox Detector): Object detection for computer vision
  • Neural Collaborative Filtering (NCF): Recommendation system

These performance numbers are not to be taken at face value. Instead, treat them as a reference point for comparing performance on similar workloads. We selected a diverse set of workloads spanning vision, language, speech, and recommendation models.
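
Every result below is a throughput figure (sequences, images, tokens, or samples per second). To illustrate how a number like images/sec is measured, here is a minimal PyTorch timing loop for a ResNet50-style workload. This is a simplified sketch that assumes torchvision is installed; it is not the NVIDIA-optimized benchmark harness that produced the tables below:

```python
import time
import torch
import torchvision

device = torch.device("cuda")
model = torchvision.models.resnet50().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

# Synthetic batch; the real benchmarks stream an actual dataset.
batch_size, steps = 64, 50
images = torch.randn(batch_size, 3, 224, 224, device=device)
labels = torch.randint(0, 1000, (batch_size,), device=device)

def train_step():
    optimizer.zero_grad()
    criterion(model(images), labels).backward()
    optimizer.step()

for _ in range(5):          # warmup iterations, excluded from timing
    train_step()
torch.cuda.synchronize()    # finish queued GPU work before starting the clock

start = time.perf_counter()
for _ in range(steps):
    train_step()
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"{batch_size * steps / elapsed:.0f} images/sec")
```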

Benchmark Notes

  • GPU scalability is relatively strong, with only modest overhead: we see a near 1:1 ratio of speedup to GPU count. However, in workloads like Tacotron2, TransformerXL Large, and NCF, scalability suffers, likely because these workloads run at smaller batch sizes and must pull from memory more often. (Training benchmarks of this kind typically scale via data parallelism; see the sketch after these notes.)
  • NVIDIA Blackwell performs well beyond NVIDIA Ada, in some cases delivering close to double the performance (ResNet50, WaveGlow, SSD, and NCF). With its larger memory, models and data can stay resident on the GPU, which reduces memory-bandwidth bottlenecks.
  • The RTX 6000 Ada and L40S show very similar performance since they share the same GPU architecture and die. We saw a few anomalies between these two GPUs, with the two cards trading positions in some workloads.
  • Tacotron2 shows a DNR (did not run) for the multi-GPU NVIDIA RTX PRO 6000 Blackwell runs due to an error on our end; we will rerun those configurations in due time. This is not a performance issue.
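
On the first note above: near-linear multi-GPU scaling in training benchmarks typically comes from data parallelism, where each GPU processes its own slice of the batch and gradients are all-reduced every step. The skeleton below is a minimal PyTorch DistributedDataParallel sketch under the assumption of a torchrun launch; it is illustrative only, not the benchmark scripts used for these results:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with: torchrun --nproc_per_node=8 train.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).to(local_rank)  # stand-in for BERT, ResNet50, etc.
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

data = torch.randn(64, 1024, device=f"cuda:{local_rank}")
for _ in range(100):
    optimizer.zero_grad()
    loss = model(data).pow(2).mean()
    loss.backward()   # DDP all-reduces gradients across GPUs here
    optimizer.step()

dist.destroy_process_group()
```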

Fueling Innovation with an Exxact Multi-GPU Server

Training AI models on massive datasets can be accelerated exponentially with the right system. It's not just a high-performance computer, but a tool to propel and accelerate your research.

Configure Now

Deep Learning GPU Benchmarks on FP32

Configuration | BERT Base (seq/sec) | BERT Large (seq/sec) | GNMT (seq/sec) | ResNet50 (images/sec) | TransformerXL Base (tokens/sec) | TransformerXL Large (tokens/sec) | Tacotron2 (samples/sec) | WaveGlow (samples/sec) | SSD (samples/sec) | NCF (samples/sec)
1x RTX PRO 6000 Blackwell Max-Q | 228 | 74 | 140,714 | 1,051 | 35,370 | 14,349 | 69,519 | 280,272 | 749 | 30,324,158
2x RTX PRO 6000 Blackwell Max-Q | 458 | 146 | 291,772 | 2,045 | 67,808 | 28,374 | DNR | 528,102 | 1,480 | 58,579,611
4x RTX PRO 6000 Blackwell Max-Q | 910 | 293 | 570,864 | 4,101 | 127,604 | 54,626 | DNR | 1,042,007 | 2,980 | 105,955,663
8x RTX PRO 6000 Blackwell Max-Q | 1,801 | 576 | 1,116,521 | 8,078 | 202,477 | 89,139 | DNR | 2,053,141 | 5,831 | 183,317,525
1x L40S 48GB | 130 | 44 | 92,008 | 554 | 20,284 | 8,262 | 28,856 | 148,217 | 183 | 17,262,116
2x L40S 48GB | 257 | 86 | 168,734 | 1,095 | 40,440 | 16,497 | 59,311 | 281,196 | 346 | 50,191,320
4x L40S 48GB | 508 | 169 | 341,691 | 2,189 | 80,939 | 33,036 | 72,772 | 487,629 | 692 | 102,390,292
1x RTX 6000 Ada | 149 | 49 | 94,317 | 668 | 22,684 | 9,362 | 39,670 | 179,979 | 369 | 20,544,461
2x RTX 6000 Ada | 292 | 94 | 120,641 | 1,338 | 44,828 | 18,282 | 76,071 | 350,283 | 729 | 37,312,167
4x RTX 6000 Ada | 581 | 187 | 354,711 | 2,679 | 89,906 | 36,572 | 130,244 | 682,168 | 1,454 | 80,183,880
8x RTX 6000 Ada | 1,156 | 365 | 713,068 | 5,356 | 177,943 | 73,263 | 186,411 | 1,312,460 | 2,895 | 146,599,996

DNR = did not run (reported as 0 in the raw data; see Benchmark Notes).

Deep Learning GPU Benchmarks on FP16

Configuration | BERT Base (seq/sec) | BERT Large (seq/sec) | GNMT (seq/sec) | ResNet50 (images/sec) | TransformerXL Base (tokens/sec) | TransformerXL Large (tokens/sec) | Tacotron2 (samples/sec) | SSD (samples/sec) | WaveGlow (samples/sec) | NCF (samples/sec)
1x RTX PRO 6000 Blackwell Max-Q | 406 | 133 | 239,435 | 1,741 | 50,968 | 21,329 | 68,912 | 733 | 269,597 | 30,136,004
2x RTX PRO 6000 Blackwell Max-Q | 811 | 266 | 492,598 | 3,446 | 102,205 | 41,592 | DNR | 1,449 | 522,326 | 55,119,072
4x RTX PRO 6000 Blackwell Max-Q | 1,606 | 533 | 980,809 | 6,859 | 202,013 | 81,345 | DNR | 2,911 | 1,037,704 | 106,410,444
8x RTX PRO 6000 Blackwell Max-Q | 5,876 | 1,041 | 1,835,891 | 13,498 | 339,800 | 139,475 | DNR | 5,695 | 2,036,042 | 171,572,342
1x L40S 48GB | 264 | 90 | 162,228 | 1,028 | 38,357 | 15,553 | 34,816 | 350 | 150,572 | 25,112,903
2x L40S 48GB | 522 | 176 | 307,875 | 1,978 | 76,492 | 31,044 | 67,438 | 620 | 280,836 | 55,054,943
4x L40S 48GB | 1,037 | 353 | 617,952 | 3,961 | 129,674 | 62,242 | 106,819 | 1,239 | 491,005 | 95,436,610
1x RTX 6000 Ada | 267 | 89 | 157,119 | 1,185 | 40,155 | 16,334 | 41,051 | 437 | 185,530 | 33,029,658
2x RTX 6000 Ada | 532 | 174 | 308,098 | 2,345 | 80,511 | 32,257 | 77,495 | 876 | 351,433 | 72,748,463
4x RTX 6000 Ada | 1,052 | 345 | 617,944 | 4,701 | 159,669 | 64,687 | 132,918 | 1,745 | 683,535 | 133,228,628
8x RTX 6000 Ada | 2,111 | 690 | 1,222,839 | 9,344 | 345,991 | 128,939 | 145,731 | 3,070 | 1,315,155 | 222,566,063

DNR = did not run (reported as 0 in the raw data; see Benchmark Notes).

BERT Base GPU Benchmark

BERT Base GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

BERT Large GPU Benchmark

BERT Large GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

ResNet50 GPU Benchmark

ResNet50 GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

GNMT GPU Benchmark

GNMT GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

TransformerXL Base GPU Benchmark

TransformerXL Base GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

TransformerXL Large GPU Benchmark

TransformerXL Large GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

Tacotron2 GPU Benchmark

Tacotron2 GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

Single Shot Multibox Detector GPU Benchmark

Single Shot Multibox Detector GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

WaveGlow GPU Benchmark

WaveGlow GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

Neural Collaborative Filtering GPU Benchmark

Neural Collaborative Filtering GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S

Multi-GPU Scaling Analysis - Scaling Efficiency (vs Single GPU)

Next, we look at scaling efficiency relative to single-GPU performance. For each GPU count, we averaged the speedup across the GPU models tested (all three GPUs at 2x and 4x; the RTX PRO 6000 Blackwell and RTX 6000 Ada at 8x), which tells us which workloads benefit most from multiple GPUs.

Configuration | BERT Base | BERT Large | GNMT | ResNet50 | TransformerXL Base | TransformerXL Large | Tacotron2 | SSD | WaveGlow | NCF
2x GPUs (FP16) | 1.99 | 1.97 | 1.97 | 1.96 | 2.00 | 1.97 | 1.27 | 1.92 | 1.90 | 2.07
4x GPUs (FP16) | 3.94 | 3.94 | 3.95 | 3.92 | 3.77 | 3.93 | 3.15 | 3.83 | 3.60 | 3.79
8x GPUs (FP16) | 11.19 | 7.79 | 7.73 | 7.82 | 7.64 | 7.22 | 3.55 | 7.40 | 7.32 | 6.22
2x GPUs (FP32) | 1.98 | 1.95 | 1.73 | 1.98 | 1.96 | 1.98 | 1.32 | 1.95 | 1.91 | 2.22
4x GPUs (FP32) | 3.93 | 3.87 | 3.84 | 3.95 | 3.85 | 3.90 | 2.90 | 3.90 | 3.60 | 4.44
8x GPUs (FP32) | 7.83 | 7.62 | 7.75 | 7.85 | 6.78 | 7.02 | 4.70 | 7.82 | 7.31 | 6.59
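
To make the arithmetic concrete: at FP32, 2x RTX PRO 6000 Blackwell runs BERT Base at 458 seq/sec versus 228 seq/sec on one GPU (2.01x), the L40S goes from 130 to 257 seq/sec (1.98x), and the RTX 6000 Ada from 149 to 292 seq/sec (1.96x); the average of the three is the 1.98 reported above. The same calculation in code, using values from the FP32 table:

```python
# 2-GPU scaling for BERT Base at FP32, averaged across GPU models.
bert_base_fp32 = {
    "RTX PRO 6000 Blackwell": {1: 228, 2: 458},
    "L40S 48GB":              {1: 130, 2: 257},
    "RTX 6000 Ada":           {1: 149, 2: 292},
}

speedups = [t[2] / t[1] for t in bert_base_fp32.values()]
print(f"average 2-GPU scaling: {sum(speedups) / len(speedups):.2f}")  # -> 1.98
```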

FP16 Speedup vs FP32 Precision Performance Comparison

Here we check which workloads benefit from the FP16 speedup. For each configuration, we divided the FP16 throughput by the FP32 throughput and averaged across GPU models. By this measure, WaveGlow, Tacotron2, SSD, and NCF do not see meaningful benefits.

Configuration | SSD | BERT Base | BERT Large | GNMT | ResNet50 | Tacotron2 | WaveGlow | TransformerXL Base | TransformerXL Large | NCF
1x GPU | 1.36 | 1.87 | 1.89 | 1.71 | 1.76 | 1.08 | 1.00 | 1.70 | 1.70 | 1.35
2x GPUs | 1.32 | 1.87 | 1.91 | 2.02 | 1.75 | 1.08 | 1.00 | 1.73 | 1.70 | 1.33
4x GPUs | 1.32 | 1.87 | 1.92 | 1.76 | 1.75 | 1.24 | 1.00 | 1.65 | 1.71 | 1.20
8x GPUs | 1.02 | 2.54 | 1.85 | 1.68 | 1.71 | 0.78 | 1.00 | 1.81 | 1.66 | 1.23
Average Speedup | 1.26 | 2.04 | 1.89 | 1.79 | 1.74 | 1.05 | 1.00 | 1.72 | 1.70 | 1.28
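
For the workloads that do benefit, FP16 numbers in PyTorch typically come from automatic mixed precision rather than a hand-converted model. Below is a minimal sketch of the standard autocast/GradScaler training-loop pattern with a stand-in model; it illustrates the technique, not the exact benchmark code:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()   # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.amp.GradScaler("cuda")

data = torch.randn(64, 1024, device="cuda")
for _ in range(100):
    optimizer.zero_grad()
    with torch.amp.autocast("cuda"):          # eligible ops run in FP16
        loss = model(data).float().pow(2).mean()
    scaler.scale(loss).backward()             # scale loss to avoid FP16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
```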

Performance Insights from Our Benchmark Results

Development Teams

  • Mixed Precision Acceleration: Up to 2.04x speedup with FP16 for BERT Base, allowing twice as many experiments in the same timeframe
  • Strong Multi-GPU Scaling: Additional GPUs introduce some communication overhead, but most benchmarks show near-linear scaling in multi-GPU configurations
  • Memory Efficiency: 96GB VRAM on the NVIDIA RTX PRO 6000 Blackwell supports larger batch sizes in BERT and TransformerXL Large models

Production Deployments

  • Workload Versatility: Strong performance across all 10 benchmarked models from NLP to computer vision
  • Linear Scaling: 3.94x scaling with 4x GPUs for BERT Base demonstrates excellent performance-to-cost ratio
  • Cost-Optimized Performance: All GPUs tested deliver strong scaling efficiency for production AI workloads

Research Organizations

  • Domain-Specific Excellence: Top performance in vision (ResNet50, SSD), language (BERT, TransformerXL), and audio (Tacotron2, WaveGlow) models
  • Precision Flexibility: Choose between FP16 (1.70-2.04x speedup) and FP32 based on accuracy requirements. Check our average speedup to see if your type of workload would scale effectively.
  • Hardware Optimization: These benchmarks can help guide you to choose the right GPU architecture based on your deep learning workload needs, optimizing ROI for your AI infrastructure investment

 

Conclusion

Our extensive benchmarking analysis reveals the exceptional performance of modern GPUs across a diverse range of deep learning workloads. The data demonstrates near-linear scaling for multi-GPU configurations, with 2-GPU setups achieving up to 1.99x performance and 4-GPU systems delivering 3.94x speedups.

FP16 precision offers excellent acceleration in most workloads, especially transformer-based models (BERT and TransformerXL). Those same transformer models also show excellent multi-GPU efficiency, while Tacotron2 and NCF scale less efficiently.

These insights provide valuable guidance for organizations designing AI infrastructure to match their specific workload requirements, whether for research experimentation, model development, or production deployment.

Need guidance with configuration and integration for your computing infrastructure? Exxact is a leading solution integrator helping to deliver custom-configurable workstations, servers, and full rack clusters. Contact us for more information!

 

Fueling Innovation with an Exxact Designed Computing Cluster

Deploying full-scale AI models can be accelerated exponentially with the right computing infrastructure. Storage, head node, networking, compute - all components of your next Exxact cluster are configurable to your workload to drive and accelerate research and innovation.

Get a Quote Today