Benchmarks

NVIDIA RTX A4000 BERT Large Fine Tuning Benchmarks in TensorFlow

August 20, 2021

NVIDIA RTX A4000: BERT Inferencing and Training Benchmarks in TensorFlow

For this post, we measured fine-tuning performance (training and inference) of the TensorFlow implementation of BERT on NVIDIA RTX A4000 GPUs. Testing was done on an Exxact Valence Workstation fitted with 4x RTX A4000 GPUs, each with 16GB of GPU memory.

Benchmark scripts we used for evaluation, both from the NVIDIA NGC repository BERT for TensorFlow:

finetune_train_benchmark.sh
finetune_inference_benchmark.sh

We made slight modifications to the training benchmark script to get the larger batch size numbers.

The scripts run multiple tests on the SQuAD v1.1 dataset using batch sizes 1, 2, 4, and 8. Inference tests were conducted on a single-GPU configuration for both BERT Base and BERT Large. In addition, we ran the training benchmarks both with and without TensorFlow's XLA compiler. Other training settings can be viewed at the end of this blog.
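To make the size of the inference sweep concrete, here is a minimal sketch that enumerates the configuration grid implied by the raw inference data in this post (two model sizes, two sequence lengths, four batch sizes, two precisions). The names `MODELS`, `SEQ_LENGTHS`, and `inference_grid` are our own illustration; this is not how the NGC scripts are implemented.

```python
from itertools import product

# Values taken from the columns of the raw inference table in this post.
MODELS = ("base", "large")
SEQ_LENGTHS = (128, 384)
BATCH_SIZES = (1, 2, 4, 8)
PRECISIONS = ("fp16", "fp32")


def inference_grid():
    """Return every (model, seq_len, batch_size, precision) combination.

    Hypothetical helper for illustration only; the real benchmark script
    may iterate over its settings differently.
    """
    return list(product(MODELS, SEQ_LENGTHS, BATCH_SIZES, PRECISIONS))


grid = inference_grid()
print(f"{len(grid)} inference configurations")
for model, seq_len, batch_size, precision in grid[:4]:
    print(model, seq_len, batch_size, precision)
```

The grid has one entry per row of the raw inference table, in the same order.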

Key Points and Observations

Scenarios that are not typically used in real-world training, such as single-GPU throughput, are shown in the tables below and provided for reference as an indication of the single-chip throughput of the platform.
For those interested in training BERT Large, a 2x RTX A4000 system may be a great choice to start with, leaving room to add cards as budget and scaling needs increase.
NOTE: To run these benchmarks, or to fine-tune BERT Large with 4x GPUs, you'll need a system with at least 64GB of RAM.

Exxact Workstation System Specs:

Nodes	1
Processor / Count	2x AMD EPYC 7552
Total Logical Cores	192
Memory	DDR4 512GB
Storage	NVMe 3.84TB
OS	Ubuntu 18.04
CUDA Version	11.2
BERT Dataset	SQuAD v1.1
TensorFlow	2.4.0

GPU Benchmark Overview

FP = Floating Point Precision, Seq = Sequence Length, BS = Batch Size

1x RTX A4000 BERT LARGE Inference Benchmark

Raw Data

Model	Sequence-Length	Batch-size	Precision	Total-Inference-Time	Throughput-Average(sent/sec)	Latency-Average(ms)	Latency-50%(ms)	Latency-90%(ms)	Latency-95%(ms)	Latency-99%(ms)	Latency-100%(ms)
base	128	1	fp16	13.59	155.38	6.44	6.43	6.83	6.93	7.17	7.81
base	128	1	fp32	12.73	128.8	7.76	7.73	8.1	8.21	8.49	11.4
base	128	2	fp16	18.25	220.84	9.06	8.93	9.42	9.5	9.71	10.52
base	128	2	fp32	18.09	156.25	12.8	12.72	13.14	13.2	13.46	16.49
base	128	4	fp16	24.14	268.32	14.91	14.87	15.24	15.32	15.65	29.78
base	128	4	fp32	29.36	165.49	24.17	24.21	24.54	24.62	24.74	24.93
base	128	8	fp16	35.6	303.29	26.38	26.38	26.73	26.85	26.95	27.05
base	128	8	fp32	49.5	181.21	44.15	44.23	44.62	44.68	44.8	44.91
base	384	1	fp16	13.38	160.37	6.24	6.11	6.65	6.7	6.97	7.38
base	384	1	fp32	12.63	130.45	7.67	7.55	8.08	8.18	8.52	12.04
base	384	2	fp16	18.28	221.3	9.04	8.93	9.42	9.49	9.64	10.47
base	384	2	fp32	18.28	155.08	12.9	12.88	13.2	13.3	13.45	16.3
base	384	4	fp16	24.12	267.27	14.97	14.99	15.26	15.32	15.54	16.06
base	384	4	fp32	29.43	165.07	24.23	24.25	24.58	24.68	24.8	26.3
base	384	8	fp16	35.74	304.75	26.25	26.28	26.73	26.83	26.98	27.19
base	384	8	fp32	49.53	181.19	44.15	44.2	44.64	44.7	44.81	45.62
large	128	1	fp16	29.75	62.83	15.92	15.97	16.69	16.85	17.12	17.84
large	128	1	fp32	26.85	53.15	18.82	18.73	19.71	19.86	20.27	24.23
large	128	2	fp16	39.01	82.16	24.34	24.18	25.24	25.42	25.84	27.49
large	128	2	fp32	42.69	57.8	34.6	34.51	35.57	35.78	36.15	38.11
large	128	4	fp16	57.45	94.04	42.54	42.46	43.59	43.79	44.15	45.14
large	128	4	fp32	74.18	60.61	66	66.07	67.2	67.46	67.68	67.93
large	128	8	fp16	90.61	105.54	75.8	75.85	76.86	77.1	77.4	78.4
large	128	8	fp32	139.8	60.89	131.37	131.73	132.79	133.01	133.37	133.67
large	384	1	fp16	29.74	62.56	15.98	16.06	16.77	16.91	17.16	17.99
large	384	1	fp32	27.12	52.4	19.09	19.03	20.04	20.22	20.57	22.28
large	384	2	fp16	38.91	82.38	24.28	24.07	25.17	25.3	25.52	26.05
large	384	2	fp32	42.77	58.01	34.48	34.37	35.47	35.66	36.05	36.45
large	384	4	fp16	57.33	93.92	42.59	42.55	43.53	43.72	44.1	44.65
large	384	4	fp32	74.38	60.44	66.19	66.23	67.18	67.43	67.64	67.93
large	384	8	fp16	90.59	105.62	75.74	75.81	76.67	76.97	77.32	78.46
large	384	8	fp32	139.75	60.93	131.29	131.52	132.72	133.05	133.57	134.62
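The throughput and average-latency columns above appear to be related by a simple identity: if one batch of BS sentences completes every latency interval, throughput is roughly BS x 1000 / latency(ms). The sketch below checks that assumption against two rows of the table (base, sequence length 128, fp16). The function name is ours, and the identity is our reading of the data rather than something stated by the benchmark script.

```python
def throughput_from_latency(batch_size, latency_ms):
    """Estimate sentences/sec, assuming one batch finishes every `latency_ms`."""
    return batch_size * 1000.0 / latency_ms


# (batch size, average latency in ms, reported throughput in sent/sec)
# Rows copied from the table above: base, sequence length 128, fp16.
rows = [
    (1, 6.44, 155.38),
    (8, 26.38, 303.29),
]

for batch_size, latency_ms, reported in rows:
    estimate = throughput_from_latency(batch_size, latency_ms)
    print(f"BS{batch_size}: estimated {estimate:.2f} sent/sec, reported {reported}")
```

The estimates land within a fraction of a sentence per second of the reported values, which is a useful sanity check when reading the table: larger batches raise throughput but also raise per-request latency.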

1x RTX A4000 Benchmarks: BERT for TensorFlow 2 Fine-Tuning Training

Raw Data

Configuration	Training Time (hours)	Throughput (sentences/sec)
Base FP32, BS1	1.07	15.42
Base FP16, BS1	1.23	14.34
Base FP32, BS2	1.45	21.34
Base FP16, BS2	1.5	22.3
Base FP16 XLA, BS1	1.86	21.22
Base FP16, BS4	2.05	30.59
Base FP16 XLA, BS2	2.08	35.3
Base FP32, BS4	2.3	25.44
Base FP16 XLA, BS4	2.4	53.32
Large FP32, BS1	2.7	5.82
Base FP16 XLA, BS8	2.85	70.82
Large FP16, BS1	2.89	5.85
Base FP16, BS8	3.13	37.81
Large FP16, BS2	3.72	8.61
Base FP32, BS8	3.85	29.18
Large FP32, BS2	3.87	7.71
Large FP16 XLA, BS1	4.23	8.32
Large FP16 XLA, BS2	4.77	13.5
Large FP16, BS4	5.23	11.57
Large FP16 XLA, BS4	5.58	19.73
Large FP32, BS4	6.1	9.37
Large FP16 XLA, BS8	7.13	25.09
Large FP16, BS8	8.25	13.99
Base FP32 XLA, BS1	42.4	0.75
Base FP32 XLA, BS2	75	1.04
Base FP32 XLA, BS4	93.35	1.07
Base FP32 XLA, BS8	117.2	1.37
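One way to read the table above is to compare each FP16 run with its XLA counterpart at the same batch size. The short sketch below does this for the Base FP16 rows on a single GPU; the `speedup` helper is our own and simply divides the two reported throughputs.

```python
def speedup(baseline, optimized):
    """Ratio of optimized throughput to baseline throughput."""
    return optimized / baseline


# Batch size -> (FP16 throughput, FP16 XLA throughput) in sentences/sec,
# copied from the 1x RTX A4000 Base rows above.
base_fp16 = {
    1: (14.34, 21.22),
    2: (22.30, 35.30),
    4: (30.59, 53.32),
    8: (37.81, 70.82),
}

for batch_size, (plain, xla) in base_fp16.items():
    print(f"Base FP16, BS{batch_size}: XLA speedup {speedup(plain, xla):.2f}x")
```

On these numbers the benefit from XLA grows with batch size, from roughly 1.5x at BS1 to roughly 1.9x at BS8.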

2x RTX A4000 Benchmarks: BERT for TensorFlow 2 Fine-Tuning Training

Raw Data

Configuration	Training Time (hours)	Throughput (sentences/sec)
Base FP32, BS1	1.64	21.82
Base FP16, BS1	1.7	23.39
Base FP32 XLA, BS1	1.85	26.72
Base FP32, BS2	2.01	33.36
Base FP16, BS2	2.04	35.99
Base FP32 XLA, BS2	2.12	44.35
Base FP16 XLA, BS1	2.4	29.73
Base FP16, BS4	2.57	53.15
Base FP16 XLA, BS2	2.6	51.83
Base FP32 XLA, BS4	2.62	63.18
Base FP32, BS4	2.84	43.9
Base FP16 XLA, BS4	2.92	83
Base FP16 XLA, BS8	3.38	120.19
Base FP32 XLA, BS8	3.52	82.78
Base FP16, BS8	3.64	69.01
Large FP16, BS1	4.07	8.94
Large FP32, BS1	4.2	7.93
Base FP32, BS8	4.4	53.68
Large FP16, BS2	4.85	14.21
Large FP32, BS2	5.3	11.98
Large FP16 XLA, BS1	5.46	11.14
Large FP16 XLA, BS2	6	19.13
Large FP16, BS4	6.34	20.31
Large FP16 XLA, BS4	6.81	30.4
Large FP32, BS4	7.48	15.97
Large FP16 XLA, BS8	8.29	42.3
Large FP16, BS8	9.31	25.94

4x RTX A4000 Benchmarks: BERT for TensorFlow 2 Fine-Tuning Training

Raw Data

Configuration	Training Time (hours)	Throughput (sentences/sec)
Base FP16, BS1	1.79	43.52
Base FP32 XLA, BS1	1.78	39.02
Base FP32, BS1	1.79	39.02
Base FP16, BS2	2.09	69.84
Base FP32, BS2	2.15	61.86
Base FP32 XLA, BS2	2.15	61.82
Base FP16 XLA, BS1	2.49	53.84
Base FP16, BS4	2.63	102.85
Base FP16 XLA, BS2	2.7	96.01
Base FP32 XLA, BS4	2.96	83.8
Base FP32, BS4	2.96	83.73
Base FP16 XLA, BS4	3	158.56
Base FP16 XLA, BS8	3.45	233.43
Base FP16, BS8	3.67	137.78
Large FP16, BS1	4.29	16.68
Base FP32, BS8	4.48	105.07
Base FP32 XLA, BS8	4.49	104.6
Large FP32, BS1	4.61	14.12
Large FP16, BS2	5.06	26.85
Large FP32, BS2	5.64	22.15
Large FP16 XLA, BS1	5.66	20.56
Large FP16 XLA, BS2	6.21	35.69
Large FP16, BS4	6.52	39.21
Large FP16 XLA, BS4	7.01	57.43
Large FP32, BS4	7.81	30.51
Large FP16 XLA, BS8	8.43	82.55
Large FP16, BS8	9.34	52.13
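With 1x, 2x, and 4x results in hand, multi-GPU scaling can be summarized as scaling efficiency: measured throughput divided by the ideal of N times the single-GPU throughput. The sketch below applies that to the Large FP16 XLA, BS8 rows (25.09, 42.3, and 82.55 sentences/sec on 1, 2, and 4 GPUs). The helper name is our own, and the metric is a common convention rather than something the benchmark scripts report.

```python
def scaling_efficiency(throughput_1gpu, throughput_ngpu, n_gpus):
    """Measured throughput as a fraction of ideal linear scaling."""
    return throughput_ngpu / (n_gpus * throughput_1gpu)


# GPU count -> throughput (sentences/sec) for Large FP16 XLA, BS8,
# copied from the 1x, 2x, and 4x training tables in this post.
large_fp16_xla_bs8 = {1: 25.09, 2: 42.30, 4: 82.55}

single = large_fp16_xla_bs8[1]
for n_gpus, throughput in large_fp16_xla_bs8.items():
    eff = scaling_efficiency(single, throughput, n_gpus)
    print(f"{n_gpus}x GPU: {throughput} sent/sec, efficiency {eff:.1%}")
```

For this configuration the efficiency stays above 80% at both 2 and 4 GPUs. Other rows can be checked the same way by swapping in their throughput values.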

NVIDIA RTX A4000 Series GPUs

GPU Features	NVIDIA RTX A4000
GPU Memory	16GB GDDR6 with error-correction code (ECC)
Display Ports	4x DisplayPort 1.4
Max Power Consumption	140 W
Graphics Bus	PCI Express Gen 4 x 16
Form Factor	4.4” (H) x 9.5” (L) Single Slot
Thermal	Active
VR Ready	Yes