Benchmarks

GPU Performance Deep Learning Benchmarks - BERT, GNMT, & More

April 3, 2026
9 min read
Update 8/28/25: Added NVIDIA RTX PRO 6000 Blackwell Server Edition
Update 4/2/26: Reformatted Tables and Added Additional Context

GPU Performance Deep Learning Benchmarks: Comprehensive Performance Analysis

When selecting hardware for AI and machine learning workloads, comprehensive performance data across real-world scenarios is essential for making informed decisions. We will benchmark a variety of NVIDIA GPUs using deep learning benchmarks. Here is our system configuration:

| Component | Configuration |
|---|---|
| CPU | AMD EPYC 9654 96-Core Processor |
| Memory | 755GB DDR5 ECC |
| NVIDIA Driver | 570.133.07 |
| CUDA Version | 12.6.77 |
| cuDNN Version | 12.8 |
| OS | Ubuntu 22.04.5 LTS |
| PyTorch | 2.5.0 (NVIDIA optimized) |
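If you want to record the same details on your own system, a quick check from inside PyTorch looks like this (a minimal sketch for reproducibility, not our exact test harness):

```python
import torch

# Print the software and GPU environment used for a benchmark run.
print(f"PyTorch:      {torch.__version__}")
print(f"CUDA (build): {torch.version.cuda}")
print(f"cuDNN:        {torch.backends.cudnn.version()}")
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GB")
```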


While these benchmarks are not one-to-one with your particular workload, they give a good sense of relative performance across multiple GPUs.

This blog will continue to grow as we test more and more GPUs. Current GPUs tested:

| GPU | Architecture | VRAM | TDP | Form Factor |
|---|---|---|---|---|
| NVIDIA RTX PRO™ 6000 Blackwell Server Edition | Blackwell | 96GB GDDR7 | 600W | Server |
| NVIDIA RTX PRO™ 6000 Blackwell Max-Q Workstation Edition | Blackwell | 96GB GDDR7 | 300W | Workstation |
| NVIDIA RTX™ 6000 Ada | Ada Lovelace | 48GB GDDR6 | 300W | Workstation |
| NVIDIA L40S | Ada Lovelace | 48GB GDDR6 | 350W | Server |


Key Findings:

  • Blackwell Delivers Up to 2x the Performance of Ada. Across our benchmark suite, the NVIDIA RTX PRO 6000 Blackwell Server Edition delivers an average of 1.89x the throughput of the L40S on a single GPU — with some workloads like SSD object detection reaching 2.46x.
  • Near-Linear Multi-GPU Scaling. Adding GPUs pays off: 4-GPU configurations deliver an average 3.9x speedup, and 8-GPU systems achieve approximately 8x, making multi-GPU servers a strong investment for production and research workloads.
  • Mixed Precision (FP16) Doubles Throughput for Transformer Models. BERT Base sees up to 2.06x acceleration with FP16, while audio workloads like WaveGlow and Tacotron2 see minimal FP16 benefit, an important consideration when planning your pipeline (see the AMP sketch after this list).
  • 96GB VRAM Changes the Game. The Blackwell GPUs' 96GB of GDDR7 memory allows larger batch sizes and lets entire models reside on the GPU, easing the capacity constraints and host-to-device transfers that limit 48GB cards.
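To make the FP16 finding concrete, here is a minimal PyTorch automatic mixed precision (AMP) training step. This is a sketch only: our results come from NVIDIA's reference benchmark implementations, and the model and batch below are placeholders.

```python
import torch
from torch import nn

# Placeholder model and optimizer; not the benchmark models.
model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.amp.GradScaler("cuda")       # scales loss to avoid FP16 underflow

x = torch.randn(64, 1024, device="cuda")    # placeholder batch
target = torch.randn(64, 1024, device="cuda")

# Forward pass runs eligible ops (matmuls, convolutions) in FP16.
with torch.amp.autocast("cuda", dtype=torch.float16):
    loss = nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()               # backward on the scaled loss
scaler.step(optimizer)                      # unscales grads, then steps
scaler.update()                             # adjusts the scale factor
optimizer.zero_grad(set_to_none=True)
```

The pattern explains the split in our results: transformer workloads are dominated by large matrix multiplies that map directly to FP16 Tensor Cores, while Tacotron2 and WaveGlow spend more time in operations that see little autocast benefit.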

Testing Environment and Methodology

Our testing utilized a deep learning benchmark suite representing real-world AI workloads:

WorkloadCategoryUse Case
BERT Base / LargeNLP — TransformersThe backbone of modern language understanding. Used in search, chatbots, and document analysis.
GNMTNLP — Seq2SeqGoogle's neural machine translation model. Represents sequence-to-sequence workloads common in production translation services.
ResNet50Computer VisionThe standard benchmark for image classification. Widely used in medical imaging, manufacturing inspection, and autonomous vehicles.
TransformerXL Base / LargeNLP — TransformersAdvanced sequence modeling with long-range dependencies. A proxy for modern LLM training workloads.
Tacotron2Audio — TTSText-to-speech synthesis. Critical for voice assistants, accessibility tools, and media production.
WaveGlowAudio — GenerationHigh-quality audio waveform generation. Represents compute-intensive audio synthesis pipelines.
SSDComputer VisionSingle Shot MultiBox Detector for real-time object detection. Used in surveillance, robotics, and autonomous systems.
NCFRecommendationNeural Collaborative Filtering. Powers recommendation engines at companies like Netflix, Amazon, and Spotify.

These benchmarks are not one-to-one representations of your specific workload. They serve as reference points for understanding relative performance across GPU configurations in similar classes of work.

Benchmark Notes

  • GPU scalability is strong, with some performance overhead. We see a near 1:1 ratio of speedup to GPU count. However, workloads like Tacotron2, TransformerXL Large, and NCF scale less efficiently, likely because they run smaller batches and fetch from memory more often (a minimal multi-GPU sketch follows these notes).
  • NVIDIA Blackwell performs well beyond NVIDIA Ada, in some cases approaching or exceeding double the performance (ResNet50, WaveGlow, Tacotron2, and SSD). With the increased memory, models and data can reside entirely on the GPU, which reduces memory traffic bottlenecks.
  • The RTX 6000 Ada and L40S show very similar performance since they share the same GPU architecture and die. We saw a few anomalies between these two GPUs, with the two trading positions in some workloads.
  • The NVIDIA RTX PRO 6000 Blackwell Server Edition performs slightly better than the NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition due to its higher power limit (600W vs. 300W).
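For reference, the multi-GPU data parallelism behind these scaling numbers follows the standard PyTorch DistributedDataParallel pattern. The sketch below is illustrative (placeholder model and loop), not our benchmark code; launch it with `torchrun --nproc_per_node=4 train.py`.

```python
import os
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun starts one process per GPU and sets LOCAL_RANK.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(1024, 1024).cuda()         # placeholder model
    model = DDP(model, device_ids=[local_rank])  # gradients all-reduced across GPUs

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(10):                          # placeholder training loop
        x = torch.randn(64, 1024, device="cuda")
        loss = model(x).square().mean()
        loss.backward()                          # overlaps grad sync with compute
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```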

Generational Leap to Blackwell

| Workload | L40S 48GB (Baseline) | RTX 6000 Ada 48GB | RTX PRO 6000 Blackwell Max-Q 96GB | RTX PRO 6000 Blackwell Server 96GB |
|---|---|---|---|---|
| BERT Base | 1.00x | 1.01x | 1.54x | 1.94x |
| BERT Large | 1.00x | 0.99x | 1.48x | 1.93x |
| GNMT | 1.00x | 0.97x | 1.48x | 1.92x |
| ResNet50 | 1.00x | 1.15x | 1.69x | 1.87x |
| TransformerXL Base | 1.00x | 1.05x | 1.33x | 1.63x |
| TransformerXL Large | 1.00x | 1.05x | 1.37x | 1.80x |
| Tacotron2 | 1.00x | 1.18x | 1.98x | 2.15x |
| SSD | 1.00x | 1.25x | 2.09x | 2.46x |
| WaveGlow | 1.00x | 1.23x | 1.79x | 1.93x |
| NCF | 1.00x | 1.32x | 1.20x | 1.22x |
| Average | 1.00x | 1.12x | 1.59x | 1.89x |

On average, a single RTX PRO 6000 Blackwell Server Edition delivers nearly twice the throughput of an L40S. For workloads like SSD object detection, a single Blackwell card does what previously required two Ada-generation GPUs. The RTX 6000 Ada and L40S perform within ~12% of each other, sharing the same Ada Lovelace architecture; your choice should come down to price, availability, and form factor.

A note on NCF: Neural Collaborative Filtering is an interesting outlier where the RTX 6000 Ada outperforms both Blackwell GPUs at 1.32x vs. ~1.21x. NCF is a memory-bound, small-batch workload: the RTX 6000 Ada runs at higher clock speeds, and our test comes nowhere near filling the Blackwell GPUs' 96GB. If recommendation systems are your primary workload, factor this into your decision.
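If you are reproducing these tables from raw throughput logs, the relative numbers are simply each GPU's throughput divided by the baseline. A sketch with hypothetical placeholder numbers, not our raw data:

```python
# Normalize raw throughput (samples/sec) to a baseline GPU.
# These values are hypothetical placeholders, not our measurements.
throughput = {
    "L40S": 1000.0,
    "RTX 6000 Ada": 1010.0,
    "RTX PRO 6000 Blackwell Server": 1940.0,
}

baseline = throughput["L40S"]
for gpu, tput in throughput.items():
    print(f"{gpu}: {tput / baseline:.2f}x")
```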

Fueling Innovation with an Exxact Multi-GPU Server

Training AI models on massive datasets can be accelerated exponentially with the right system. It's not just a high-performance computer, but a tool to propel and accelerate your research.

Configure Now

Deep Learning Benchmarks on NVIDIA GPUs

BERT Base GPU Benchmark

[Chart: BERT Base GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S]

BERT Large GPU Benchmark

[Chart: BERT Large GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S]

ResNet50 GPU Benchmark

[Chart: ResNet50 GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S]

GNMT GPU Benchmark

[Chart: GNMT GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S]

TransformerXL Base GPU Benchmark

[Chart: TransformerXL Base GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S]

TransformerXL Large GPU Benchmark

[Chart: TransformerXL Large GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S]

Tacotron2 GPU Benchmark

[Chart: Tacotron2 GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S]

Single Shot MultiBox Detector (SSD) GPU Benchmark

[Chart: Single Shot MultiBox Detector (SSD) GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S]

WaveGlow GPU Benchmark

[Chart: WaveGlow GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S]

Neural Collaborative Filtering (NCF) GPU Benchmark

[Chart: Neural Collaborative Filtering (NCF) GPU Benchmark - NVIDIA RTX PRO 6000 Blackwell, RTX 6000 Ada, L40S]

Multi-GPU Scaling Analysis - Scaling Efficiency (vs Single GPU)

Next, we look at scaling efficiency relative to single-GPU performance. Does adding GPUs deliver proportional gains? If 4 GPUs cost 4x as much, do you get 4x the throughput? We averaged the results of the three GPUs to calculate scalability, which tells us which workloads benefit most from multiple GPUs. A short sketch of the efficiency calculation follows the chart below.

[Chart: Multi-GPU Scaling Analysis - Scaling Efficiency vs. Single GPU]
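The efficiency metric itself is simple: measured speedup divided by GPU count. A sketch with hypothetical throughput numbers, not our measurements:

```python
# Scaling efficiency = multi-GPU speedup / GPU count.
# Throughputs below are hypothetical placeholders.
single_gpu = 1000.0                         # samples/sec on 1 GPU
multi = {2: 1990.0, 4: 3940.0, 8: 7950.0}   # samples/sec on N GPUs

for n, tput in multi.items():
    speedup = tput / single_gpu
    print(f"{n} GPUs: {speedup:.2f}x speedup, {speedup / n:.0%} efficiency")
```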

Performance Insights from Our Benchmark Results

  • LLM Training & Fine-tuning
    • Recommended GPU: RTX PRO 6000 Blackwell Server
    • 96GB VRAM fits larger models in memory; 1.9x throughput vs. Ada
  • NLP / Transformer Research
    • Recommended GPU: RTX PRO 6000 Blackwell (Server or Max-Q)
    • BERT and TransformerXL see the largest generational gains; FP16 provides up to 2x additional acceleration
  • Computer Vision (Training)
    • Recommended GPU: RTX PRO 6000 Blackwell Server
    • 1.87x on ResNet50 and 2.46x on SSD vs. L40S; excellent multi-GPU scaling for large datasets
  • Audio / Speech Synthesis
    • Recommended GPU: RTX PRO 6000 Blackwell Max-Q
    • Strong Tacotron2/WaveGlow performance; FP16 provides minimal benefit here, so FP32 accuracy comes at no cost; Max-Q saves power in workstation setups
  • Recommendation Systems
    • Recommended GPU: RTX 6000 Ada or L40S
    • NCF shows Ada outperforming Blackwell; memory-bound workloads don't fully leverage Blackwell's compute advantage
  • Budget-Conscious Research
    • Recommended GPU: L40S 48GB
    • Within ~12% of RTX 6000 Ada; strong multi-GPU scaling; good performance-per-dollar entry point
  • Production Inference at Scale
    • Recommended GPU: RTX PRO 6000 Blackwell Server
    • Maximum single-GPU throughput reduces the number of cards needed; 96GB VRAM supports larger batch sizes for higher utilization

Conclusion

Our extensive benchmarking analysis reveals the exceptional performance of modern GPUs across a diverse range of deep learning workloads. The data demonstrates near-linear scaling for multi-GPU configurations, with 2-GPU setups achieving up to 1.99x performance and 4-GPU systems delivering 3.94x speedups.

FP16 precision offers excellent acceleration in most workloads, especially transformer-based models (BERT and TransformerXL). For multi-GPU scalability, transformer-based models show excellent multi-GPU efficiency, while models like WaveGlow, NCF, Tacotron2, and SSD scale less efficiently.

These insights provide valuable guidance for organizations designing AI infrastructure to match their specific workload requirements, whether for research experimentation, model development, or production deployment.

Need guidance with configuration and integration for your computing infrastructure? Exxact is a leading solution integrator helping to deliver custom-configurable workstations, servers, and full rack clusters. Contact us for more information!

Fueling Innovation with an Exxact Designed Computing Cluster

Deploying full-scale AI models can be accelerated exponentially with the right computing infrastructure. Storage, head node, networking, compute - all components of your next Exxact cluster are configurable to your workload to drive and accelerate research and innovation.

Get a Quote Today