NVIDIA Quadro RTX 8000 Benchmarks for Deep Learning in TensorFlow 2019

March 29, 2019

NVIDIA Quadro RTX 8000 Benchmarks

Updated 6/11/2019 with XLA FP32 and XLA FP16 metrics.

For this post, we conducted deep learning performance benchmarks for TensorFlow using the new NVIDIA Quadro RTX 8000 GPUs. Our Exxact Valence Workstation was equipped with 4x Quadro RTX 8000s, giving the system a total of 192 GB of GPU memory. We ran the standard tf_cnn_benchmarks.py benchmark script (found in the official TensorFlow GitHub repository) against the following networks: ResNet-50, ResNet-152, Inception v3, Inception v4, VGG-16, AlexNet, and NASNet. We compared FP16 to FP32 performance using 'typical' batch sizes (64 in most cases), then incrementally doubled the batch size until we hit an out-of-memory error. All tests ran on 1, 2, and 4 GPU configurations.
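
If you want to reproduce these runs, the script lives in the tensorflow/benchmarks repository under scripts/tf_cnn_benchmarks. A minimal setup sketch follows; the branch choice and working directory are assumptions, so adjust them to match your TensorFlow version:

# Fetch the benchmark suite (tf_cnn_benchmarks.py lives under scripts/tf_cnn_benchmarks)
git clone https://github.com/tensorflow/benchmarks.git
cd benchmarks/scripts/tf_cnn_benchmarks
# Optionally check out a branch matching your TensorFlow release, e.g. cnn_tf_v1.13_compatible

# Quick single-GPU sanity run on synthetic ImageNet data
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --variable_update=parameter_server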

Key Points and Observations

  • In most scenarios, large-batch training delivered noticeably higher images/sec than smaller batch sizes, especially when scaling to the 4 GPU configuration.
    • AlexNet and VGG16 performed better with a smaller batch size on a single GPU, but the larger batch size won out on these models when scaling up to 4 GPUs.
  • ResNet-50 and ResNet-152 showed massive scaling when going from 1 to 2 to 4 GPUs: a mind-blowing 4193.48 images/sec for ResNet-50 and 1621.96 images/sec for ResNet-152 with FP16 and XLA!
  • FP16 showed impressive gains in images/sec across most models when using 4 GPUs (AlexNet being the exception).
  • The Quadro RTX 8000, with 48 GB of RAM per card, is ideal for training networks that require large batch sizes and would otherwise be limited on lower-end GPUs.
  • The Quadro RTX 8000 is an ideal choice for deep learning if you're restricted to a workstation or single-server form factor and want maximum GPU memory.
  • Our workstations with Quadro RTX 8000s can also train state-of-the-art NLP Transformer networks that require large batch sizes for best performance, a popular application in the fast-growing data science market.
  • XLA significantly increases images/sec across most models, with the most dramatic gains appearing in FP16.

Quadro RTX 8000 Deep Learning Benchmark Snapshot (FP16, FP32, XLA on/off)

[Chart: images/sec for each network at FP16 and FP32, with XLA on and off, on 1, 2, and 4 GPUs]



Quadro RTX 8000 Deep Learning Benchmarks: FP16, XLA


Model         1 GPU img/sec   2 GPU img/sec   4 GPU img/sec   Batch Size
InceptionV4   314.95          468.11          808.72          512
NASNET        406.77          787.47          1557.53         512
ResNet152     429.1           835.26          1621.96         512
VGG16         530.31          1028.79         1982.34         512
InceptionV3   577.05          1039.15         2025.35         512
ResNet50      1096.32         2158.67         4193.48         1024


Run these benchmarks

Set --num_gpus to the number of GPUs you want to test and --model to the desired architecture.

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=512 --num_batches=100 --model=inception4 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10
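
To generate all three columns of the table above, you can loop over the GPU counts; a rough sketch is below (the tee log names are just a convention we are assuming here, not part of the benchmark script):

for n in 1 2 4; do
  # Same flags as above; only --num_gpus changes between runs
  python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=512 --num_batches=100 --model=inception4 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=$n --display_every=10 2>&1 | tee fp16_xla_inception4_${n}gpu.log
done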

Quadro RTX 8000 Deep Learning Benchmarks: FP32, XLA


Model         1 GPU img/sec   2 GPU img/sec   4 GPU img/sec   Batch Size
InceptionV4   113.86          218.12          424.77          256
ResNet152     150.04          287.6           549.79          256
VGG16         163.43          319.69          604.44          512
InceptionV3   236.74          459.86          886.57          256
ResNet50      372.39          719.11          1391.74         512
NASNET        407.48          788.33          1562.55         512


Run these benchmarks

Set --num_gpus to the number of GPUs you want to test and --model to the desired architecture.

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=resnet50 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10
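
To cover several architectures in one pass you can also loop over the --model flag; the sketch below sticks to the models we ran at batch size 256, using the tf_cnn_benchmarks model names:

for m in inception4 resnet152 inception3; do
  # These three ran at batch size 256 in the table above; VGG16, ResNet50 and NASNET used 512
  python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=$m --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10
done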

Quadro RTX 8000 Deep Learning Benchmarks: FP32, Batch Size 64


Model         1 GPU img/sec   2 GPU img/sec   4 GPU img/sec   Batch Size
ResNet50      314.87          590.3           952.8           64
ResNet152     127.71          232.42          418.44          64
InceptionV3   207.53          386.86          655.45          64
InceptionV4   102.41          191.4           337.44          64
VGG16         188.91          337.38          536.95          64
NASNET        160.42          280.07          510.15          64


Run these benchmarks

Set --num_gpus to the number of GPUs you want to test and --model to the desired architecture.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --variable_update=parameter_server

Quadro RTX 8000 Deep Learning Benchmarks: FP32, Large Batch Size


Model         1 GPU img/sec   2 GPU img/sec   4 GPU img/sec   Batch Size
ResNet50      322.66          622.41          1213.3          512
ResNet152     137.12          249.58          452.77          256
InceptionV3   216.27          412.75          716.47          256
InceptionV4   105.2           201.49          345.79          256
VGG16         166.55          316.46          617             512
NASNET        187.69          348.71          614             512


Run these benchmarks

Set --num_gpus to the number of GPUs you want to test, --model to the desired architecture, and --batch_size to the desired mini-batch size.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=512 --model=resnet50 --variable_update=parameter_server
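
The "large batch" figures come from doubling the batch size until a run fails with an out-of-memory error. A crude version of that search looks like the sketch below; the starting size and the reliance on a non-zero exit code are our assumptions:

bs=64
# Keep doubling the batch size until tf_cnn_benchmarks exits with an error (typically OOM)
while python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=$bs --model=resnet50 --variable_update=parameter_server; do
  bs=$((bs * 2))
done
echo "First failing batch size: $bs"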

Quadro RTX 8000 Deep Learning Benchmarks: FP16, Batch Size 64


Model         1 GPU img/sec   2 GPU img/sec   4 GPU img/sec   Batch Size
ResNet50      544.16          972.89          1565.18         64
ResNet152     246.56          412.25          672.87          64
InceptionV3   334.28          596.65          1029.24         64
InceptionV4   178.41          327.89          540.52          64
VGG16         347.01          570.53          637.97          64
NASNET        155.44          282.78          517.06          64


Run these benchmarks

Set --num_gpus to the number of GPUs you want to test and --model to the desired architecture.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --variable_update=parameter_server --use_fp16=True

Quadro RTX 8000 Deep Learning Benchmarks: FP16, Large Batch Size



Model         1 GPU img/sec   2 GPU img/sec   4 GPU img/sec   Batch Size
ResNet50      604.76          1184.52         2338.84         1024
ResNet152     285.85          529.05          1062.13         512
InceptionV3   391.3           754.94          1471.66         512
InceptionV4   203.67          384.29          762.32          512
VGG16         276.16          528.88          983.85          512
NASNET        196.52          367.6           726.85          512

Run these benchmarks

Set --num_gpus to the number of GPUs you want to test, --model to the desired architecture, and --batch_size to the desired mini-batch size.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=1024 --model=resnet50 --variable_update=parameter_server --use_fp16=True

Quadro RTX 8000 Deep Learning Benchmarks: AlexNet (FP32, FP16, XLA on, off)


Configuration                  1 GPU img/sec   2 GPU img/sec   4 GPU img/sec   Batch Size
AlexNet FP16 (Large Batch)     5911.6          11456.11        21828.99        8192
AlexNet FP16 (Regular Batch)   6013.64         11275.54        14960.97        512
AlexNet FP32 (Large Batch)     2825.61         4421.97         8482.39         8192
AlexNet FP32 (Regular Batch)   4103.27         7814.04         10491.22        512
AlexNet FP16 XLA               6787.5          13101.07        25035.27        8192
AlexNet FP32 XLA               2173.97         4144.43         8007.66         8192

Run these deep learning benchmarks

Set --num_gpus to the number of GPUs you want to test and --batch_size to the desired mini-batch size; omit the --use_fp16 flag to run in FP32.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=8192 --model=alexnet --variable_update=parameter_server --use_fp16=True



System Specifications

System: Exxact Valence Workstation
GPU: 4 x NVIDIA Quadro RTX 8000
CPU: Intel Core i7-7820X 3.6 GHz
RAM: 32 GB DDR4
SSD: 480 GB SSD
HDD (data): 10 TB HDD
OS: Ubuntu 18.04
NVIDIA Driver: 410.79
CUDA Version: 10
Python: 2.7
TensorFlow: 1.14
Docker Image: tensorflow/tensorflow:nightly-gpu
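
For reference, the benchmarks were run inside the tensorflow/tensorflow:nightly-gpu container listed above. A launch command along the following lines should reproduce the environment; the --runtime=nvidia flag and the mount path are assumptions for illustration, so adapt them to your Docker setup:

# Start the TensorFlow GPU container with the benchmark checkout mounted at /benchmarks
docker run --runtime=nvidia -it --rm -v $PWD/benchmarks:/benchmarks tensorflow/tensorflow:nightly-gpu bash
# Inside the container:
cd /benchmarks/scripts/tf_cnn_benchmarks
python tf_cnn_benchmarks.py --num_gpus=4 --batch_size=64 --model=resnet50 --variable_update=parameter_server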


Training Parameters (non XLA)

Dataset: Imagenet (synthetic)
Mode: training
SingleSess: False
Batch Size: Varied
Num Batches: 100
Num Epochs: 0.08
Devices: ['/gpu:0'] ... (varied)
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: parameter_server


Training Parameters (XLA)

Dataset: Imagenet (synthetic)
Mode: training
SingleSess: False
Batch Size: Varied
Num Batches: 100
Num Epochs: 0.08
Devices: ['/gpu:0'] ... (varied)
NUMA bind: False
Data format: NCHW
Optimizer: momentum
Variables: replicated
AllReduce: nccl


More Deep Learning Benchmarks

That's it for now! Have any questions? Let us know on social media.

https://www.facebook.com/exxactcorp/

https://twitter.com/Exxactcorp
