Benchmarks

NVIDIA RTX A5500 Benchmark - BERT Large Fine Tuning in TensorFlow 2

June 21, 2022
11 min read

Fine Tuning BERT Large on a GPU Workstation

For this post, we measured fine-tuning performance (training and inference) for the BERT implementation in TensorFlow 2 on NVIDIA RTX A5500 GPUs. For testing, we used an Exxact Valence Workstation fitted with 8x RTX A5500 GPUs, each with 24 GB of GPU memory.

The benchmark scripts we used for evaluation were finetune_train_benchmark.sh and finetune_inference_benchmark.sh from the NVIDIA NGC repository BERT for TensorFlow. We made slight modifications to the training benchmark script to obtain numbers for the larger batch sizes.

The script runs multiple tests on the SQuAD v1.1 dataset using batch sizes of 1, 2, 4, 8, 16, and 32. Inference tests were conducted in a single-GPU configuration on BERT Large. In addition, we ran all benchmarks with TensorFlow's XLA enabled across the board. Other training settings can be viewed at the end of this blog in the Appendix section.
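As a rough sketch of how these benchmarks are launched (the repository path below is based on NVIDIA's DeepLearningExamples layout, and the scripts' arguments vary by repository version, so treat this as illustrative rather than exact):

```shell
# Hypothetical invocation sketch -- verify paths and script arguments
# against the version of the repository you clone.
git clone https://github.com/NVIDIA/DeepLearningExamples.git
cd DeepLearningExamples/TensorFlow2/LanguageModeling/BERT

# Fine-tuning (training) benchmark on SQuAD v1.1; we edited this script
# to sweep batch sizes 1-32 and toggle XLA, as described above.
bash scripts/finetune_train_benchmark.sh

# Single-GPU inference benchmark.
bash scripts/finetune_inference_benchmark.sh
```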

Key Points and Observations

  • Scenarios not typical of real-world training, such as single-GPU throughput, are shown in the tables below and provided for reference as an indication of the platform's single-chip throughput.
  • In performance comparisons, the RTX A5500 delivered slightly better performance than the RTX A5000.
  • For those interested in training BERT Large, a 4x RTX A5500 system may be a great choice to start with, giving you the opportunity to add cards as budget and scaling needs grow.
  • NOTE: To run these benchmarks or fine-tune BERT Large on 4x GPUs, you'll need a system with at least 64 GB of RAM.

Exxact Workstation System Specs:

Nodes: 1
Processors: 2x AMD EPYC 7552
Total Logical Cores: 48
Memory: 512 GB DDR4
Storage: 3.84 TB NVMe
OS: Ubuntu 18.04
CUDA Version: 11.4
BERT Dataset: SQuAD v1.1

GPU Benchmark Overview

FP = Floating Point Precision, Seq = Sequence Length, BS = Batch Size

1x RTX A5500 BERT Inference Benchmark (Base and Large)

Model Sequence-Length Batch-Size Precision Total-Inference-Time Throughput-Average(sent/sec) Latency-Average(ms) Latency-50%(ms) Latency-90%(ms) Latency-95%(ms) Latency-99%(ms) Latency-100%(ms)
base 384 1 fp16 14.31 183.03 5.46 5.2 6.54 6.68 6.98 7.74
base 384 2 fp16 16.04 320.54 6.24 6.23 6.62 6.7 6.95 7.49
base 384 4 fp16 19.62 407.11 9.83 9.76 10.17 10.25 10.41 10.84
base 384 8 fp16 26.36 482.06 16.6 16.58 16.84 16.99 17.25 17.9
base 384 1 fp32 10.89 171.98 5.81 5.71 6.67 6.8 7.11 9.5
base 384 2 fp32 14.32 224.81 8.9 8.79 9.28 9.37 9.59 10.3
base 384 4 fp32 19.99 274.5 14.57 14.58 14.88 14.96 15.31 19.02
base 384 8 fp32 32.76 292.2 27.38 27.52 27.76 27.81 27.96 28.27
large 384 1 fp16 25.92 84.54 11.83 11.88 12.58 12.7 13.08 14.56
large 384 2 fp16 32.47 114.08 17.53 17.55 18.3 18.42 18.66 21.21
large 384 4 fp16 42.87 142.13 28.14 27.89 29.06 29.19 29.5 29.99
large 384 8 fp16 62.71 167.39 47.79 47.58 48.59 48.7 48.91 49.9
large 384 1 fp32 22.36 70.45 14.19 14.3 14.97 15.1 15.43 16.18
large 384 2 fp32 31.69 85.03 23.52 23.65 24.27 24.42 24.67 25.63
large 384 4 fp32 54.64 86.12 46.45 46.28 47.33 47.43 47.57 48.53
large 384 8 fp32 91.18 96.6 82.82 82.77 83.66 83.91 84.13 85.04

Data Chart
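As a sanity check on the inference table, the reported throughput should be close to batch size divided by average per-batch latency. A minimal sketch (the helper function name is our own):

```python
# Throughput implied by average per-batch latency.
def throughput_from_latency(batch_size, avg_latency_ms):
    """Sentences per second implied by the average per-batch latency."""
    return batch_size / (avg_latency_ms / 1000.0)

# BERT Large, fp16, batch size 8: 47.79 ms average latency (from the table).
print(round(throughput_from_latency(8, 47.79), 1))  # -> 167.4
```

This matches the table's 167.39 sent/sec for that row to within rounding, confirming the two columns are internally consistent.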

FP = Floating Point Precision, Seq = Sequence Length

Batch Size for all runs below = 8

Multi-GPU RTX A5500 BERT Training Benchmark (Base and Large)

Number of GPUs Model Precision XLA Batch Size Training Time (sec) Performance (sent/sec)
2 base fp16 TRUE 8 172.39 174.57
4 base fp16 TRUE 8 177.22 340.87
6 base fp16 TRUE 8 187.03 454.5
8 base fp16 TRUE 8 189.97 599.73
2 base fp32 TRUE 8 161.84 123.95
4 base fp32 TRUE 8 170.89 231.76
6 base fp32 TRUE 8 186.79 307.47
8 base fp32 TRUE 8 190.02 404.42
2 base fp16 FALSE 8 156.57 104.45
4 base fp16 FALSE 8 161.1 202.13
6 base fp16 FALSE 8 168.62 285.75
8 base fp16 FALSE 8 169.75 378.49
2 base fp32 FALSE 8 179.42 83.01
4 base fp32 FALSE 8 186.52 159.24
6 base fp32 FALSE 8 201.07 219.01
8 base fp32 FALSE 8 204.11 287.95
2 large fp16 TRUE 8 398.34 63.24
4 large fp16 TRUE 8 410.66 121.13
6 large fp16 TRUE 8 433.85 164.81
8 large fp16 TRUE 8 438.53 216.53
2 large fp32 TRUE 8 413.88 42.29
4 large fp32 TRUE 8 437.6 79
6 large fp32 TRUE 8 480.33 104.99
8 large fp32 TRUE 8 No Data No Data
2 large fp16 FALSE 8 382.36 40.47
4 large fp16 FALSE 8 385.07 81.78
6 large fp16 FALSE 8 411.98 111.27
8 large fp16 FALSE 8 414.62 147.47
2 large fp32 FALSE 8 471.13 30.38
4 large fp32 FALSE 8 488.54 58.42
6 large fp32 FALSE 8 534.49 79.32
8 large fp32 FALSE 8 538.51 104.93

Data Chart
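The multi-GPU numbers above can be condensed into a scaling-efficiency figure. A small sketch, assuming the table's second metric column is throughput in sentences/sec (it grows roughly linearly with GPU count, which a time metric would not):

```python
# Multi-GPU scaling efficiency: measured throughput vs. ideal linear
# scaling extrapolated from the smallest measured configuration.
def scaling_efficiency(base_gpus, base_tput, n_gpus, n_tput):
    ideal = base_tput * (n_gpus / base_gpus)
    return n_tput / ideal

# BERT base, fp16, XLA on: 174.57 sent/sec on 2 GPUs, 599.73 sent/sec on 8 GPUs.
eff = scaling_efficiency(2, 174.57, 8, 599.73)
print(f"{eff:.1%}")  # -> 85.9%
```

Roughly 86% efficiency going from 2 to 8 GPUs is a reasonable result for data-parallel fine-tuning at this small per-GPU batch size.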


NVIDIA RTX A Series GPU Specs

NVIDIA RTX A4000 NVIDIA RTX A4500 NVIDIA RTX A5000 NVIDIA RTX A5500 NVIDIA RTX A6000
Architecture Ampere Ampere Ampere Ampere Ampere
GPU Memory 16 GB GDDR6 20 GB GDDR6 24 GB GDDR6 24 GB GDDR6 48 GB GDDR6
ECC Memory Yes Yes Yes Yes Yes
CUDA Cores 6,144 7,168 8,192 10,240 10,752
Tensor Cores 192 224 256 320 336
RT Cores 48 56 64 80 84
Single-Precision Perf 19.2 TFLOPS 23.7 TFLOPS 27.8 TFLOPS 34.1 TFLOPS 38.7 TFLOPS
RT Core Perf 37.4 TFLOPS 46.2 TFLOPS 54.2 TFLOPS 66.6 TFLOPS 75.6 TFLOPS
Tensor Perf 153.4 TFLOPS 189.2 TFLOPS 222.2 TFLOPS 272.8 TFLOPS 309.7 TFLOPS
Max Power 140W 200W 230W 230W 300W
Graphics Bus PCI-E 4.0 x16 PCI-E 4.0 x16 PCI-E 4.0 x16 PCI-E 4.0 x16 PCI-E 4.0 x16
Connectors DP 1.4 (4) DP 1.4 (4) DP 1.4 (4) DP 1.4 (4) DP 1.4 (4)
Form Factor Single Slot Dual Slot Dual Slot Dual Slot Dual Slot
vGPU Software No No NVIDIA RTX vWS NVIDIA RTX vWS NVIDIA RTX vWS
NVLink N/A 2x RTX A4500 2x RTX A5000 2x RTX A5500 2x RTX A6000
Power Connector 1x 6-pin PCIe 1x 8-pin PCIe 1x 8-pin PCIe 1x 8-pin PCIe 1x 8-pin PCIe
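The spec table gives a sense of how much headroom the A5500 has over the A5000 on paper. A quick calculation of the theoretical peak ratios (measured fine-tuning gains are typically smaller than these, as the benchmarks above show):

```python
# Theoretical peak ratios, RTX A5500 vs RTX A5000, from the spec table above.
a5000 = {"cuda_cores": 8192, "fp32_tflops": 27.8, "tensor_tflops": 222.2}
a5500 = {"cuda_cores": 10240, "fp32_tflops": 34.1, "tensor_tflops": 272.8}

for key in a5000:
    print(f"{key}: {a5500[key] / a5000[key]:.2f}x")
```

All three ratios land at roughly 1.23-1.25x, consistent with the "slightly better performance than the RTX A5000" observation above.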
