NVIDIA RTX 3080 Ti BERT Large Fine Tuning Benchmarks in TensorFlow

September 23, 2021
11 min read

Fine-tuning BERT Large on a GPU Workstation

For this post, we measured fine-tuning performance (training and inference) for TensorFlow's BERT implementation on NVIDIA GeForce RTX 3080 Ti GPUs. For testing, we used an Exxact Valence Workstation fitted with 4x RTX 3080 Ti GPUs, each with 12GB of GPU memory.

Benchmark scripts we used for evaluation:

finetune_train_benchmark.sh

and 

finetune_inference_benchmark.sh

from the NVIDIA NGC repository BERT for TensorFlow. We made slight modifications to the training benchmark script to enable the larger batch sizes.

The scripts run multiple tests on the SQuAD v1.1 dataset using batch sizes 1, 2, 4, 8, 16, and 32. Inference tests were conducted on BERT Large using a single-GPU configuration. All benchmarks were run with TensorFlow's XLA enabled.
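Both scripts are driven from the repo root. Here is a dry-run sketch of the invocations; the argument order (model size, XLA flag, GPU count, task) is our assumption, so check each script's header in the BERT for TensorFlow repo before running:

```shell
# Dry-run sketch of driving the two NGC benchmark scripts. The argument
# order is assumed, not confirmed -- consult the script headers first.
bert_model="large"   # "base" or "large"
use_xla="true"       # we enabled XLA for every run in this post
num_gpu=1            # the inference numbers below are single-GPU
task="squad"

# echo instead of exec, so the sketch runs even without GPUs attached
echo bash scripts/finetune_train_benchmark.sh "$bert_model" "$use_xla" "$num_gpu" "$task"
echo bash scripts/finetune_inference_benchmark.sh "$bert_model" "$task"
```

Dropping the `echo` runs the real benchmark, which writes per-batch-size throughput and latency logs like those tabulated below.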

Key Points and Observations

  • Single-GPU throughput is not typical of real-world training, but it is shown in the table below as a reference point for the per-chip throughput of the platform.
  • In our performance comparisons, the RTX 3080 Ti delivered 3.3% better performance than the RTX A5000.
  • For those interested in training BERT Large, a 2x RTX 3080 Ti system may be a great choice to start with, giving the opportunity to add additional cards as budget/scaling needs increase.
  • NOTE: To run these benchmarks, or to fine-tune BERT Large with 4x GPUs, you'll need a system with at least 64GB of RAM.
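The 64GB requirement is easy to verify up front. A minimal preflight sketch of our own (not from the NGC repo), assuming a Linux host with /proc/meminfo:

```shell
# Hypothetical preflight check: confirm the host has the 64GB of RAM
# a 4x-GPU BERT Large fine-tuning run needs before launching it.
required_gb=64
# MemTotal in /proc/meminfo is reported in kB; convert to whole GB
total_gb=$(awk '/MemTotal/ { printf "%d", $2 / 1024 / 1024 }' /proc/meminfo)
if [ "$total_gb" -ge "$required_gb" ]; then
    echo "RAM OK: ${total_gb}GB available"
else
    echo "RAM too low: ${total_gb}GB (need at least ${required_gb}GB)"
fi
```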


Exxact Workstation System Specs:

| Component | Spec |
|---|---|
| Nodes | 1 |
| Processor | 2x AMD EPYC 7552 |
| Total Logical Cores | 48 |
| Memory | 512GB DDR4 |
| Storage | 3.84TB NVMe |
| OS | Ubuntu 18.04 |
| CUDA Version | 11.2 |
| BERT Dataset | SQuAD v1.1 |

GPU Benchmark Overview

FP = Floating Point Precision, Seq = Sequence Length, BS = Batch Size

1x NVIDIA GeForce RTX 3080 Ti BERT Large Inference Benchmark

| Model | Seq Len | Batch Size | Precision | Total Inference Time (s) | Throughput Avg (sent/sec) | Latency Avg (ms) | Latency 50% (ms) | Latency 90% (ms) | Latency 95% (ms) | Latency 99% (ms) | Latency 100% (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| base | 128 | 1 | fp16 | 23.09 | 182.3 | 8.8 | 5.69 | 6.21 | 6.37 | 6.78 | 5934.1 |
| base | 128 | 1 | fp32 | 21.01 | 179.67 | 8.89 | 5.81 | 6.38 | 6.55 | 6.98 | 5965.55 |
| base | 128 | 2 | fp16 | 20.07 | 381.07 | 8.8 | 5.47 | 6.03 | 6.16 | 6.51 | 6016.77 |
| base | 128 | 2 | fp32 | 20.18 | 379.85 | 8.86 | 5.4 | 6.14 | 6.28 | 6.71 | 6093.34 |
| base | 128 | 4 | fp16 | 27.36 | 669.13 | 12.29 | 5.99 | 6.41 | 6.58 | 6.99 | 6188.94 |
| base | 128 | 4 | fp32 | 27.6 | 671.57 | 12.26 | 5.96 | 6.36 | 6.53 | 7.01 | 6213.83 |
| base | 128 | 8 | fp16 | 35.36 | 949.71 | 14.2 | 8.49 | 9.2 | 9.79 | — | 6357 |
| base | 128 | 8 | fp32 | 35.36 | 955.41 | 14.1 | 8.35 | 8.88 | 9.12 | 9.61 | 6340.38 |
| base | 384 | 1 | fp16 | 17.03 | 181.83 | 11.7 | 5.65 | 6.29 | 6.45 | 6.77 | 6461.01 |
| base | 384 | 1 | fp32 | 17.2 | 180.91 | 11.75 | 5.69 | 6.28 | 6.41 | 6.76 | 6484.46 |
| base | 384 | 2 | fp16 | 25.2 | 283.1 | 18.92 | 7.03 | 7.52 | 7.7 | 8.45 | 6918.94 |
| base | 384 | 2 | fp32 | 24.95 | 282.86 | 18.88 | 7.06 | 7.47 | 7.71 | 8.59 | 6907.74 |
| base | 384 | 4 | fp16 | 30.09 | 357.03 | 23.21 | 11.19 | 11.69 | 12 | 12.99 | 7059.7 |
| base | 384 | 4 | fp32 | 30.03 | 358.97 | 23.15 | 11.1 | 11.65 | 11.96 | 13.11 | 7061.75 |
| base | 384 | 8 | fp16 | 41.64 | 411.64 | 32.27 | 19.4 | 19.96 | 20.21 | 21.48 | 7440.31 |
| base | 384 | 8 | fp32 | 41.71 | 411.85 | 32.36 | 19.4 | 19.91 | 20.26 | 21.55 | 7406.49 |
| large | 128 | 1 | fp16 | 41.8 | 108.36 | 15 | 9.61 | 10.66 | 10.89 | 11.4 | 10322.8 |
| large | 128 | 1 | fp32 | 36.63 | 100.6 | 15.74 | 10.36 | 11.26 | 11.56 | 12.16 | 10407.55 |
| large | 128 | 2 | fp16 | 34.84 | 214.7 | 15.53 | 9.12 | 10.09 | 10.27 | 10.89 | 10541.14 |
| large | 128 | 2 | fp32 | 34.93 | 214 | 15.58 | 9.22 | 10.14 | 10.3 | 11.23 | 10575.28 |
| large | 128 | 4 | fp16 | 52.47 | 301.82 | 24.34 | 13.36 | 13.86 | 14.05 | 14.66 | 10740.68 |
| large | 128 | 4 | fp32 | 52.85 | 298.83 | 24.54 | 13.47 | 13.88 | 14.05 | 15.27 | 10685.7 |
| large | 128 | 8 | fp16 | 71.15 | 391.03 | 30.58 | 20.47 | 21.23 | 21.46 | 22.23 | 10741.95 |
| large | 128 | 8 | fp32 | 71.79 | 389 | 30.88 | 20.66 | 21.33 | 21.5 | 22.04 | 10941.51 |
| large | 384 | 1 | fp16 | 33.8 | 75.96 | 24.42 | 13.25 | 13.97 | 14.1 | 15.07 | 11737.02 |
| large | 384 | 1 | fp32 | 33.1 | 78.03 | 23.82 | 12.54 | 13.7 | 13.83 | 14.75 | 11466.12 |
| large | 384 | 2 | fp16 | 52.28 | 97.8 | 42.13 | 20.31 | 21.26 | 21.38 | 22.33 | 12373.92 |
| large | 384 | 2 | fp32 | 52.71 | 97.65 | 42.54 | 20.6 | 21.25 | 21.37 | 22.99 | 12629.24 |
| large | 384 | 4 | fp16 | 65.04 | 124.69 | 53.88 | 32.17 | 32.77 | 32.99 | 34.29 | 12558.83 |
| large | 384 | 4 | fp32 | 65.62 | 124.66 | 54.41 | 32.18 | 32.85 | 33.03 | 33.97 | 12861.03 |
| large | 384 | 8 | fp16 | 96.27 | 139.09 | 81.01 | 57.54 | 58.45 | 58.82 | 60.57 | 13293.83 |
| large | 384 | 8 | fp32 | 96.13 | 139.42 | 80.85 | 57.37 | 58.21 | 58.55 | 60.44 | 13304.43 |
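As a rough cross-check of the table above, throughput and median latency should approximately satisfy throughput ≈ batch size × 1000 / median latency (ms). This is our own arithmetic, not output from the NGC scripts:

```shell
# Sanity-check one row: large / seq 128 / BS 8 / fp16,
# which reports a 20.47 ms median (50%) latency.
implied=$(awk 'BEGIN { bs = 8; lat50_ms = 20.47; printf "%.1f", bs * 1000 / lat50_ms }')
echo "implied throughput: ${implied} sent/sec"
```

This gives about 390.8 sent/sec, close to the 391.03 sent/sec average throughput reported for that row.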

NVIDIA RTX-30 Series GPUs

| Spec | GeForce RTX 3060 | GeForce RTX 3060 Ti | GeForce RTX 3070 | GeForce RTX 3080 | GeForce RTX 3090 |
|---|---|---|---|---|---|
| NVIDIA CUDA Cores | 3,584 | 4,864 | 5,888 | 8,704 | 10,496 |
| Boost Clock (GHz) | 1.78 | 1.67 | 1.73 | 1.71 | 1.70 |
| Memory Size | 12GB | 8GB | 8GB | 10GB | 24GB |
| Memory Type | GDDR6 | GDDR6 | GDDR6 | GDDR6X | GDDR6X |
| Dimensions | 9.5 x 4.4 in | 9.5 x 4.4 in | 9.5 x 4.4 in | 11.2 x 4.4 in | 12.3 x 5.4 in |
| Power Draw | 170W | 200W | 220W | 320W | 350W |
