NVIDIA RTX A5500 Benchmark - BERT Large Fine Tuning in TensorFlow 2

June 21, 2022
12 min read

Fine Tuning BERT Large on a GPU Workstation

For this post, we measured fine-tuning performance (training and inference) for the TensorFlow 2 BERT implementation on NVIDIA RTX A5500 GPUs. For testing, we used an Exxact Valence Workstation fitted with 8x RTX A5500 GPUs, each with 24 GB of GPU memory.

For evaluation we used the finetune_train_benchmark.sh and finetune_inference_benchmark.sh scripts from NVIDIA's NGC BERT for TensorFlow repository, with slight modifications to the training benchmark script to obtain the larger batch-size numbers.

The scripts run multiple tests on the SQuAD v1.1 dataset using batch sizes of 1, 2, 4, 8, 16, and 32. Inference tests were conducted in a single-GPU configuration on BERT Large. We ran the benchmarks both with and without TensorFlow's XLA compilation enabled. Other training settings can be viewed in the Appendix section at the end of this post.
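For readers who want to reproduce these runs, a launch looks roughly like the sketch below. The script name comes from the NGC repository mentioned above, but the argument order and values shown here are illustrative assumptions; check the repository's README for the exact signature of your version.

```shell
# Illustrative only: build the command line for the NGC BERT
# fine-tuning benchmark. The script name is from the NGC "BERT for
# TensorFlow" repo; the argument order is an assumption and may
# differ between repository versions.
MODEL=large      # bert_model: base or large
USE_XLA=true     # enable TensorFlow XLA JIT compilation
NUM_GPU=4        # number of GPUs to use
CMD="bash scripts/finetune_train_benchmark.sh $MODEL $USE_XLA $NUM_GPU squad"
echo "$CMD"      # print the command here instead of executing it
```

The inference benchmark (finetune_inference_benchmark.sh) is launched the same way, minus the GPU-count argument, since inference runs on a single GPU.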

Key Points and Observations

  • Single-GPU throughput figures, while not typical of real-world training, are shown in the tables below for reference as an indication of the platform's single-chip throughput.
  • In our performance comparisons, the RTX A5500 delivered slightly better performance than the RTX A5000.
  • For those interested in training BERT Large, a 4x RTX A5500 system is a great starting point, with the option to add cards as budget and scaling needs grow.
  • NOTE: To run these benchmarks, or to fine-tune BERT Large with 4x GPUs, you'll need a system with at least 64 GB of RAM.

Exxact Workstation System Specs: 

| Spec | Value |
|------|-------|
| Nodes | 1 |
| Processor | 2x AMD EPYC 7552 |
| Total Logical Cores | 48 |
| Memory | 512 GB DDR4 |
| Storage | 3.84 TB NVMe |
| OS | Ubuntu 18.04 |
| CUDA Version | 11.4 |
| BERT Dataset | SQuAD v1.1 |

GPU Benchmark Overview

FP = Floating Point Precision, Seq = Sequence Length, BS = Batch Size

1x RTX A5500 BERT Inference Benchmark (Base and Large)

| Model | Seq Len | Batch Size | Precision | Total Inference Time (s) | Throughput Avg (sent/sec) | Latency Avg (ms) | Latency 50% (ms) | Latency 90% (ms) | Latency 95% (ms) | Latency 99% (ms) | Latency 100% (ms) |
|-------|---------|------------|-----------|--------------------------|---------------------------|------------------|------------------|------------------|------------------|------------------|-------------------|
| base  | 384 | 1 | fp16 | 14.31 | 183.03 | 5.46  | 5.2   | 6.54  | 6.68  | 6.98  | 7.74  |
| base  | 384 | 2 | fp16 | 16.04 | 320.54 | 6.24  | 6.23  | 6.62  | 6.7   | 6.95  | 7.49  |
| base  | 384 | 4 | fp16 | 19.62 | 407.11 | 9.83  | 9.76  | 10.17 | 10.25 | 10.41 | 10.84 |
| base  | 384 | 8 | fp16 | 26.36 | 482.06 | 16.6  | 16.58 | 16.84 | 16.99 | 17.25 | 17.9  |
| base  | 384 | 1 | fp32 | 10.89 | 171.98 | 5.81  | 5.71  | 6.67  | 6.8   | 7.11  | 9.5   |
| base  | 384 | 2 | fp32 | 14.32 | 224.81 | 8.9   | 8.79  | 9.28  | 9.37  | 9.59  | 10.3  |
| base  | 384 | 4 | fp32 | 19.99 | 274.5  | 14.57 | 14.58 | 14.88 | 14.96 | 15.31 | 19.02 |
| base  | 384 | 8 | fp32 | 32.76 | 292.2  | 27.38 | 27.52 | 27.76 | 27.81 | 27.96 | 28.27 |
| large | 384 | 1 | fp16 | 25.92 | 84.54  | 11.83 | 11.88 | 12.58 | 12.7  | 13.08 | 14.56 |
| large | 384 | 2 | fp16 | 32.47 | 114.08 | 17.53 | 17.55 | 18.3  | 18.42 | 18.66 | 21.21 |
| large | 384 | 4 | fp16 | 42.87 | 142.13 | 28.14 | 27.89 | 29.06 | 29.19 | 29.5  | 29.99 |
| large | 384 | 8 | fp16 | 62.71 | 167.39 | 47.79 | 47.58 | 48.59 | 48.7  | 48.91 | 49.9  |
| large | 384 | 1 | fp32 | 22.36 | 70.45  | 14.19 | 14.3  | 14.97 | 15.1  | 15.43 | 16.18 |
| large | 384 | 2 | fp32 | 31.69 | 85.03  | 23.52 | 23.65 | 24.27 | 24.42 | 24.67 | 25.63 |
| large | 384 | 4 | fp32 | 54.64 | 86.12  | 46.45 | 46.28 | 47.33 | 47.43 | 47.57 | 48.53 |
| large | 384 | 8 | fp32 | 91.18 | 96.6   | 82.82 | 82.77 | 83.66 | 83.91 | 84.13 | 85.04 |
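As a quick sanity check on the inference numbers, average throughput should roughly equal batch size divided by average latency. The small script below verifies this for the BERT Large fp16 batch-size-8 row; the values are taken directly from the table above.

```python
# Sanity check: throughput ~= batch_size / average_latency.
# Values are from the BERT Large / fp16 / batch size 8 row above.
batch_size = 8
latency_avg_ms = 47.79          # Latency Avg (ms) from the table
reported_throughput = 167.39    # Throughput Avg (sent/sec) from the table

derived_throughput = batch_size / (latency_avg_ms / 1000.0)
print(f"derived throughput: {derived_throughput:.2f} sent/sec")  # ~167.4

# The derived and reported numbers agree to well under one percent.
relative_error = abs(derived_throughput - reported_throughput) / reported_throughput
print(f"relative error: {relative_error:.4%}")
```

The same relation holds (to within rounding) for every row of the table, which is a useful consistency check when comparing results from different benchmark runs.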

Data Chart

FP = Floating Point Precision, Seq = Sequence Length

Batch Size for all runs below = 8

2–8x RTX A5500 BERT Training Benchmark

| GPUs | Model | Precision | XLA | Batch Size | Training Time (s) | Throughput (sent/sec) |
|------|-------|-----------|-----|------------|-------------------|-----------------------|
| 2 | base  | fp16 | TRUE  | 8 | 172.39  | 174.57  |
| 4 | base  | fp16 | TRUE  | 8 | 177.22  | 340.87  |
| 6 | base  | fp16 | TRUE  | 8 | 187.03  | 454.5   |
| 8 | base  | fp16 | TRUE  | 8 | 189.97  | 599.73  |
| 2 | base  | fp32 | TRUE  | 8 | 161.84  | 123.95  |
| 4 | base  | fp32 | TRUE  | 8 | 170.89  | 231.76  |
| 6 | base  | fp32 | TRUE  | 8 | 186.79  | 307.47  |
| 8 | base  | fp32 | TRUE  | 8 | 190.02  | 404.42  |
| 2 | base  | fp16 | FALSE | 8 | 156.57  | 104.45  |
| 4 | base  | fp16 | FALSE | 8 | 161.1   | 202.13  |
| 6 | base  | fp16 | FALSE | 8 | 168.62  | 285.75  |
| 8 | base  | fp16 | FALSE | 8 | 169.75  | 378.49  |
| 2 | base  | fp32 | FALSE | 8 | 179.42  | 83.01   |
| 4 | base  | fp32 | FALSE | 8 | 186.52  | 159.24  |
| 6 | base  | fp32 | FALSE | 8 | 201.07  | 219.01  |
| 8 | base  | fp32 | FALSE | 8 | 204.11  | 287.95  |
| 2 | large | fp16 | TRUE  | 8 | 398.34  | 63.24   |
| 4 | large | fp16 | TRUE  | 8 | 410.66  | 121.13  |
| 6 | large | fp16 | TRUE  | 8 | 433.85  | 164.81  |
| 8 | large | fp16 | TRUE  | 8 | 438.53  | 216.53  |
| 2 | large | fp32 | TRUE  | 8 | 413.88  | 42.29   |
| 4 | large | fp32 | TRUE  | 8 | 437.6   | 79      |
| 6 | large | fp32 | TRUE  | 8 | 480.33  | 104.99  |
| 8 | large | fp32 | TRUE  | 8 | No Data | No Data |
| 2 | large | fp16 | FALSE | 8 | 382.36  | 40.47   |
| 4 | large | fp16 | FALSE | 8 | 385.07  | 81.78   |
| 6 | large | fp16 | FALSE | 8 | 411.98  | 111.27  |
| 8 | large | fp16 | FALSE | 8 | 414.62  | 147.47  |
| 2 | large | fp32 | FALSE | 8 | 471.13  | 30.38   |
| 4 | large | fp32 | FALSE | 8 | 488.54  | 58.42   |
| 6 | large | fp32 | FALSE | 8 | 534.49  | 79.32   |
| 8 | large | fp32 | FALSE | 8 | 538.51  | 104.93  |
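The training table lets us estimate multi-GPU scaling efficiency. Treating the performance column as throughput in sentences per second (its values grow with GPU count, while training time stays roughly flat), the sketch below computes how close the BERT Large fp16 + XLA runs come to linear scaling from the 2-GPU baseline (there is no 1-GPU training row):

```python
# Multi-GPU scaling efficiency for the BERT Large fp16 + XLA rows
# of the training table above (performance in sent/sec per GPU count).
throughput = {2: 63.24, 4: 121.13, 6: 164.81, 8: 216.53}

base_gpus = 2
for gpus, perf in throughput.items():
    # Ideal linear scaling from the 2-GPU baseline.
    ideal = throughput[base_gpus] * (gpus / base_gpus)
    efficiency = perf / ideal
    print(f"{gpus} GPUs: {perf:7.2f} sent/sec, scaling efficiency {efficiency:.0%}")
```

Scaling efficiency stays around 86–96% through 8 GPUs, which is why adding cards to a smaller system (as suggested in the Key Points above) remains worthwhile.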

Data Chart


NVIDIA RTX A Series GPU Specs

| Spec | NVIDIA RTX A4000 | NVIDIA RTX A4500 | NVIDIA RTX A5000 | NVIDIA RTX A5500 | NVIDIA RTX A6000 |
|------|------------------|------------------|------------------|------------------|------------------|
| Architecture | Ampere | Ampere | Ampere | Ampere | Ampere |
| GPU Memory | 16 GB GDDR6 | 20 GB GDDR6 | 24 GB GDDR6 | 24 GB GDDR6 | 48 GB GDDR6 |
| ECC Memory | Yes | Yes | Yes | Yes | Yes |
| CUDA Cores | 6,144 | 7,168 | 8,192 | 10,240 | 10,752 |
| Tensor Cores | 192 | 224 | 256 | 320 | 336 |
| RT Cores | 48 | 56 | 64 | 80 | 84 |
| SP Perf | 19.2 TFLOPS | 23.7 TFLOPS | 27.8 TFLOPS | 34.1 TFLOPS | 38.7 TFLOPS |
| RT Core Perf | 37.4 TFLOPS | 46.2 TFLOPS | 54.2 TFLOPS | 66.6 TFLOPS | 75.6 TFLOPS |
| Tensor Perf | 153.4 TFLOPS | 189.2 TFLOPS | 222.2 TFLOPS | 272.8 TFLOPS | 309.7 TFLOPS |
| Max Power | 140 W | 200 W | 230 W | 230 W | 300 W |
| Graphics Bus | PCIe 4.0 x16 | PCIe 4.0 x16 | PCIe 4.0 x16 | PCIe 4.0 x16 | PCIe 4.0 x16 |
| Connectors | DP 1.4 (4) | DP 1.4 (4) | DP 1.4 (4) | DP 1.4 (4) | DP 1.4 (4) |
| Form Factor | Single Slot | Dual Slot | Dual Slot | Dual Slot | Dual Slot |
| vGPU Software | No | No | NVIDIA RTX vWS | NVIDIA RTX vWS | NVIDIA RTX vWS |
| NVLink | N/A | 2x RTX A4500 | 2x RTX A5000 | 2x RTX A5500 | 2x RTX A6000 |
| Power Connector | 1x 6-pin PCIe | 1x 8-pin PCIe | 1x 8-pin PCIe | 1x 8-pin PCIe | 1x 8-pin PCIe |
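On paper, the A5500's edge over the A5000 comes from roughly 25% more CUDA and Tensor cores at the same 230 W power budget, which is consistent with the "slightly better performance" observed in the benchmarks above (real workloads rarely scale linearly with core count). The figures below are taken from the spec table:

```python
# Relative spec ratios of the RTX A5500 vs. the RTX A5000,
# using the figures from the spec table above.
a5000 = {"cuda_cores": 8192, "tensor_cores": 256, "fp32_tflops": 27.8}
a5500 = {"cuda_cores": 10240, "tensor_cores": 320, "fp32_tflops": 34.1}

for key in a5000:
    ratio = a5500[key] / a5000[key]
    print(f"{key}: {ratio:.2f}x")
```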

More About NVIDIA A5500's Features

  • NVIDIA Ampere Architecture-Based CUDA Cores: Accelerate graphics workflows with the latest CUDA® cores for up to 3X single-precision floating-point (FP32) performance compared to the previous generation.
  • Second-Generation RT Cores: Produce more visually accurate renders faster with hardware-accelerated ray tracing and motion blur, with up to 2X faster performance than the previous generation.
  • Third-Generation Tensor Cores: Boost AI and data science model training with up to 12X faster training performance compared to the previous generation, with hardware support for structural sparsity.
  • 24GB of GPU Memory: Tackle memory-intensive workloads, from virtual production to engineering simulation, with 24GB of GDDR6 memory with ECC.
  • Third-Generation NVIDIA NVLink: Scale memory and performance across multiple GPUs with NVIDIA® NVLink™ to tackle larger datasets, models, and scenes.
  • PCI Express Gen 4: Improve data-transfer speeds from CPU memory for data-intensive tasks with support for PCIe Gen 4.
  • Power Efficiency: Leverage a dual-slot design that’s 3X more power efficient than the previous generation and is crafted to fit a wide range of workstations.

Have any questions about NVIDIA GPUs or AI workstations and servers?
Contact Exxact Today

Free Resources

Browse our whitepapers, e-books, case studies, and reference architecture.

Explore