Benchmarks

NVIDIA RTX A5000 BERT Large Fine Tuning Benchmarks in TensorFlow

July 14, 2021
14 min read

Fine Tuning BERT Large on a GPU Workstation

For this post, we measured BERT fine-tuning performance (training and inference) in TensorFlow on NVIDIA RTX A5000 GPUs. For testing, we used an Exxact Valence Workstation fitted with 4x RTX A5000 GPUs, each with 24GB of GPU memory.

The benchmark scripts we used for evaluation came from the NVIDIA NGC repository for BERT in TensorFlow: finetune_train_benchmark.sh and finetune_inference_benchmark.sh.

We made slight modifications to the training benchmark script to reach the larger batch sizes.

The scripts run multiple tests on the SQuAD v1.1 dataset, using batch sizes 1, 2, 4, 8, 16, and 32 for training, and 1, 2, 4, and 8 for inference. We ran training tests on BERT Large using 1, 2, and 4 GPU configurations, and used a single GPU for the inference benchmark. All benchmarks were run with TensorFlow's XLA compilation enabled.
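As a rough sketch, the benchmarks can be launched inside the NGC container along the following lines. The container tag matches the one we used; the script arguments shown (model size, XLA flag, GPU count, task) are an assumption based on the repository's README at the time of writing, so verify the exact argument order against your checkout.

```shell
# Start the NGC TensorFlow container, mounting a checkout of NVIDIA's
# BERT-for-TensorFlow scripts and data (paths here are placeholders).
docker run --gpus all -it --rm \
  -v $PWD/bert:/workspace/bert \
  nvcr.io/nvidia/tensorflow:21.06-tf1-py3

# Inside the container -- argument order is an assumption; check the
# usage notes at the top of each script before running.
cd /workspace/bert
bash scripts/finetune_train_benchmark.sh large true 4 squad
bash scripts/finetune_inference_benchmark.sh large squad
```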

Key Points and Observations

  • The RTX A5000 performed well, and significantly better than the RTX 6000.
  • In training throughput, the 4x configurations really started to shine, scaling well as the batch size increased.
  • For those interested in training BERT Large, a 2x RTX A5000 system is a great starting point, leaving room to add cards as budget and scaling needs grow.
  • NOTE: To run these benchmarks, or to fine-tune BERT Large on 4x GPUs, you'll need a system with at least 64GB of RAM.

Interested in getting faster results?
Learn more about Exxact AI workstations starting at $3,700


| Component | Specification |
|---|---|
| Make / Model | Supermicro AS-4124GS-TN |
| Nodes | 1 |
| Processor / Count | 2x AMD EPYC 7552 |
| Total Logical Cores | 48 |
| Memory | 512 GB DDR4 |
| Storage | 3.84 TB NVMe |
| OS | CentOS 7 |
| CUDA Version | 11.2 |
| BERT Dataset | SQuAD v1.1 |

GPU Benchmark Overview

FP = Floating Point Precision, Seq = Sequence Length, BS = Batch Size

Our results were obtained by running the scripts/finetune_inference_benchmark.sh script in the TensorFlow 21.06-tf1-py3 NGC container on a single RTX A5000 (24GB). Performance numbers (throughput in sentences per second and latency in milliseconds) were averaged over 1024 iterations. Latency is computed as the time taken to process one batch, with batches fed into the model one after another (i.e., no pipelining).
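The metrics in the table below can be reproduced from raw per-batch timings with a short script like this one. This is a minimal sketch, not the NGC benchmark's actual reporting code; `run_model` is a hypothetical stand-in for a call that runs one batch through the fine-tuned model.

```python
import time
import statistics

def summarize(batch_times_s, batch_size):
    """Turn per-batch wall-clock times (seconds) into the table's metrics."""
    lat_ms = sorted(t * 1000.0 for t in batch_times_s)
    # Simple nearest-rank percentile over the sorted latencies.
    pct = lambda p: lat_ms[min(len(lat_ms) - 1, int(p / 100.0 * len(lat_ms)))]
    return {
        "throughput_sent_per_sec": batch_size * len(lat_ms) / sum(batch_times_s),
        "latency_avg_ms": statistics.mean(lat_ms),
        "latency_50_ms": pct(50),
        "latency_90_ms": pct(90),
        "latency_95_ms": pct(95),
        "latency_99_ms": pct(99),
        "latency_100_ms": lat_ms[-1],  # worst single batch
    }

def benchmark(run_model, batch, iterations=1024):
    """Feed batches one after another (no pipelining), timing each one."""
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_model(batch)
        times.append(time.perf_counter() - start)
    return summarize(times, batch_size=len(batch))
```

Note that because batches are timed back-to-back, throughput and average latency are two views of the same measurement: throughput equals batch size divided by the mean batch time.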

1x RTX A5000 BERT LARGE Inference Benchmark

| Model | Seq Length | Batch Size | Precision | Total Inference Time (s) | Throughput Avg (sent/sec) | Latency Avg (ms) | Latency 50% (ms) | Latency 90% (ms) | Latency 95% (ms) | Latency 99% (ms) | Latency 100% (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| base | 128 | 1 | fp16 | 23.01 | 169.35 | 9.11 | 6.09 | 6.61 | 6.8 | 7.35 | 5711.58 |
| base | 128 | 1 | fp32 | 21.15 | 179.34 | 8.69 | 5.78 | 6.26 | 6.39 | 6.73 | 5567.95 |
| base | 128 | 2 | fp16 | 20.87 | 351.11 | 9.13 | 5.84 | 6.4 | 6.56 | 6.98 | 5805.16 |
| base | 128 | 2 | fp32 | 21.02 | 349.67 | 9.14 | 5.86 | 6.39 | 6.54 | 7.01 | 5785.09 |
| base | 128 | 4 | fp16 | 26.31 | 680.97 | 11.61 | 5.9 | 6.32 | 6.48 | 7.13 | 5749.98 |
| base | 128 | 4 | fp32 | 26.61 | 672.74 | 11.74 | 6.02 | 6.5 | 6.69 | 7.58 | 5745.76 |
| base | 128 | 8 | fp16 | 32.91 | 1039.56 | 13.01 | 7.7 | 8.09 | 8.27 | 8.88 | 5855.3 |
| base | 128 | 8 | fp32 | 32.9 | 1037.42 | 13.01 | 7.7 | 8.12 | 8.32 | 8.91 | 5855.67 |
| base | 384 | 1 | fp16 | 17.5 | 175.96 | 11.7 | 5.77 | 6.28 | 6.44 | 7.72 | 6263.5 |
| base | 384 | 1 | fp32 | 17.17 | 176.4 | 11.52 | 5.71 | 6.35 | 6.55 | 7.16 | 6087.21 |
| base | 384 | 2 | fp16 | 25.29 | 258.87 | 18.88 | 7.76 | 8.06 | 8.26 | 9.36 | 6676.43 |
| base | 384 | 2 | fp32 | 25.02 | 256.01 | 18.82 | 7.79 | 8.22 | 8.37 | 9.72 | 6473.25 |
| base | 384 | 4 | fp16 | 29.94 | 338.67 | 23.03 | 11.78 | 12.16 | 12.38 | 13.82 | 6623.03 |
| base | 384 | 4 | fp32 | 30.43 | 337.72 | 23.14 | 11.84 | 12.27 | 12.42 | 13.57 | 6802.34 |
| base | 384 | 8 | fp16 | 41.07 | 402.89 | 31.8 | 19.83 | 20.29 | 20.54 | 21.64 | 6988.95 |
| base | 384 | 8 | fp32 | 41.33 | 401.45 | 31.99 | 19.9 | 20.35 | 20.63 | 21.98 | 7063.42 |
| large | 128 | 1 | fp16 | 40.89 | 104.15 | 15.79 | 10.17 | 10.92 | 11.07 | 11.44 | 11112.47 |
| large | 128 | 1 | fp32 | 37.26 | 103.92 | 15.83 | 10.25 | 11.09 | 11.26 | 11.79 | 11137.36 |
| large | 128 | 2 | fp16 | 37.81 | 193.01 | 16.95 | 10.58 | 11.04 | 11.2 | 12.06 | 11149.47 |
| large | 128 | 2 | fp32 | 37.5 | 194.01 | 16.91 | 10.49 | 11 | 11.18 | 12.26 | 11147.67 |
| large | 128 | 4 | fp16 | 52.43 | 318.74 | 24.24 | 12.71 | 13.16 | 13.31 | 13.8 | 11354.08 |
| large | 128 | 4 | fp32 | 52.81 | 318.7 | 24.25 | 12.74 | 13.19 | 13.38 | 14.04 | 11304.21 |
| large | 128 | 8 | fp16 | 68.8 | 432.67 | 29.34 | 18.61 | 19.13 | 19.32 | 19.88 | 11602.32 |
| large | 128 | 8 | fp32 | 68.41 | 435.77 | 29.16 | 18.46 | 19.01 | 19.16 | 19.89 | 11517.24 |
| large | 384 | 1 | fp16 | 33.8 | 81.64 | 24.05 | 12.45 | 12.81 | 12.96 | 13.95 | 12308.35 |
| large | 384 | 1 | fp32 | 33.75 | 81.83 | 24.07 | 12.42 | 12.86 | 13.03 | 13.55 | 12362.84 |
| large | 384 | 2 | fp16 | 51.77 | 108.18 | 41.33 | 18.6 | 19.17 | 19.37 | 20.66 | 13082.53 |
| large | 384 | 2 | fp32 | 51.5 | 108.37 | 41.29 | 18.62 | 19.13 | 19.27 | 20.16 | 13072.42 |
| large | 384 | 4 | fp16 | 66.02 | 127.29 | 54.69 | 31.5 | 32.08 | 32.28 | 33.48 | 13438.17 |
| large | 384 | 4 | fp32 | 66.1 | 126.8 | 54.82 | 31.58 | 32.13 | 32.34 | 33.75 | 13485.86 |
| large | 384 | 8 | fp16 | 93.68 | 148.62 | 78.67 | 53.86 | 54.45 | 54.6 | 56 | 14180.87 |
| large | 384 | 8 | fp32 | 93.9 | 148.74 | 78.75 | 53.77 | 54.52 | 54.66 | 55.87 | 14261.35 |


1x RTX A5000 BERT LARGE Training Benchmark

| GPUs | Precision | Seq Length | Total Training Time (s) | Batch Size | Performance (sent/sec) |
|---|---|---|---|---|---|
| 1 | fp16 | 128 | 12991.12 | 1 | 14.31 |
| 1 | fp32 | 128 | 12932.53 | 1 | 14.38 |
| 1 | fp16 | 128 | 128 | 2 | 28.21 |
| 1 | fp32 | 128 | 6875.02 | 2 | 28.11 |
| 1 | fp16 | 128 | 3992.3 | 4 | 51.17 |
| 1 | fp32 | 128 | 3969.82 | 4 | 51.43 |
| 1 | fp16 | 128 | 2689.2 | 8 | 82.47 |
| 1 | fp32 | 128 | 2718.08 | 8 | 80.71 |
| 1 | fp16 | 128 | 2052.13 | 16 | 116.46 |
| 1 | fp32 | 128 | 2034.82 | 16 | 117.67 |
| 1 | fp16 | 128 | 1732.66 | 32 | 146.93 |
| 1 | fp32 | 128 | 1722.01 | 32 | 147.85 |
| 1 | fp16 | 384 | 14256.77 | 1 | 12.98 |
| 1 | fp32 | 384 | 14279.8 | 1 | 12.97 |
| 1 | fp16 | 384 | 9047.55 | 2 | 20.82 |
| 1 | fp32 | 384 | 9044.82 | 2 | 20.83 |
| 1 | fp16 | 384 | 6365.77 | 4 | 30.42 |
| 1 | fp32 | 384 | 6380.79 | 4 | 30.35 |
| 1 | fp16 | 384 | 4939.75 | 8 | 40.14 |
| 1 | fp32 | 384 | 4900.55 | 8 | 40.44 |
| 1 | fp16 | 384 | 4310.77 | 16 | 47.08 |
| 1 | fp32 | 384 | 4299.06 | 16 | 47.19 |
| 1 | fp16 | 384 | Did not run | 32 | Did not run |
| 1 | fp32 | 384 | Did not run | 32 | Did not run |

[Figure: BERT_A5000.png]


2x RTX A5000 BERT LARGE Training Benchmark

| GPUs | Precision | Seq Length | Total Training Time (s) | Batch Size | Performance (sent/sec) |
|---|---|---|---|---|---|
| 2 | fp16 | 128 | 11437.17 | 1 | 16.22 |
| 2 | fp32 | 128 | 11376.72 | 1 | 16.31 |
| 2 | fp16 | 128 | 6073.78 | 2 | 31.75 |
| 2 | fp32 | 128 | 6090.61 | 2 | 31.79 |
| 2 | fp16 | 128 | 3453.41 | 4 | 59.98 |
| 2 | fp32 | 128 | 3424.23 | 4 | 58.25 |
| 2 | fp16 | 128 | 2193.27 | 8 | 105.09 |
| 2 | fp32 | 128 | 2187.78 | 8 | 105.41 |
| 2 | fp16 | 128 | 1580.9 | 16 | 166.9 |
| 2 | fp32 | 128 | 1580.07 | 16 | 167.45 |
| 2 | fp16 | 128 | 1160.76 | 32 | 233.9 |
| 2 | fp32 | 128 | 1158.72 | 32 | 233.28 |
| 2 | fp16 | 384 | 12180.88 | 1 | 15.17 |
| 2 | fp32 | 384 | 12294.78 | 1 | 15.05 |
| 2 | fp16 | 384 | 7199.95 | 2 | 26.46 |
| 2 | fp32 | 384 | 7211.68 | 2 | 26.49 |
| 2 | fp16 | 384 | 4630.27 | 4 | 43.03 |
| 2 | fp32 | 384 | 4626.98 | 4 | 43.06 |
| 2 | fp16 | 384 | 3340.13 | 8 | 63.02 |
| 2 | fp32 | 384 | 3336.61 | 8 | 63.02 |

[Figure: BERT_A5000_2.png]


4x RTX A5000 BERT LARGE Training Benchmark

| GPUs | Precision | Seq Length | Total Training Time (s) | Batch Size | Performance (sent/sec) |
|---|---|---|---|---|---|
| 4 | fp16 | 128 | 6721.29 | 1 | 28.53 |
| 4 | fp32 | 128 | 6720.39 | 1 | 28.53 |
| 4 | fp16 | 128 | 3657.87 | 2 | 56.11 |
| 4 | fp32 | 128 | 3643.79 | 2 | 56.34 |
| 4 | fp16 | 128 | 2161.74 | 4 | 106.77 |
| 4 | fp32 | 128 | 2163.32 | 4 | 106.72 |
| 4 | fp16 | 128 | 1448.3 | 8 | 189.41 |
| 4 | fp32 | 128 | 1447.99 | 8 | 189.19 |
| 4 | fp16 | 128 | 983.24 | 16 | 306.3 |
| 4 | fp32 | 128 | 980.91 | 16 | 306.11 |
| 4 | fp16 | 128 | 815.48 | 32 | 437.08 |
| 4 | fp32 | 128 | 820.45 | 32 | 436.6 |
| 4 | fp16 | 384 | 7088 | 1 | 26.99 |
| 4 | fp32 | 384 | 7109.11 | 1 | 26.91 |
| 4 | fp16 | 384 | 4220.48 | 2 | 47.77 |
| 4 | fp32 | 384 | 4227.16 | 2 | 47.69 |
| 4 | fp16 | 384 | 2777.77 | 4 | 78.55 |
| 4 | fp32 | 384 | 2775.14 | 4 | 78.72 |
| 4 | fp16 | 384 | 2047.37 | 8 | 117.43 |
| 4 | fp32 | 384 | 2039.43 | 8 | 118.1 |


RTX A5000 BERT LARGE Training Benchmark Comparison

[Figure: bert_a5000_3.png]
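Multi-GPU scaling can be quantified directly from the training results. Using the sequence-length-128, batch-size-32, FP16 throughput figures reported above (146.93 sent/sec on 1 GPU, 233.9 on 2, 437.08 on 4), a quick illustrative calculation, not part of the benchmark scripts, gives:

```python
# Throughput (sentences/sec) at sequence length 128, batch size 32, FP16,
# taken from the 1x, 2x, and 4x training results above.
throughput = {1: 146.93, 2: 233.9, 4: 437.08}

def scaling_efficiency(gpus):
    """Measured speedup over 1 GPU divided by the ideal (linear) speedup."""
    return (throughput[gpus] / throughput[1]) / gpus

for n in (2, 4):
    print(f"{n}x RTX A5000: {throughput[n] / throughput[1]:.2f}x speedup, "
          f"{scaling_efficiency(n):.0%} scaling efficiency")
# -> roughly 1.59x / 80% for 2 GPUs and 2.97x / 74% for 4 GPUs
```

In other words, each doubling of GPU count at this batch size retains most, though not all, of the ideal linear speedup, which is consistent with the observation that the 4x configuration shines at larger batch sizes.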


Have any questions?
Contact Exxact Today

