NVIDIA Quadro RTX 6000 GPU Performance Benchmarks for TensorFlow

April 26, 2019

NVIDIA Quadro RTX 6000 Benchmarks

For this post, we ran deep learning performance benchmarks for TensorFlow on the new NVIDIA Quadro RTX 6000 GPUs. Our Exxact Valence Workstation was fitted with 4x Quadro RTX 6000s, giving the system 96 GB of GPU memory in total (24 GB per card).

We ran the standard "tf_cnn_benchmarks.py" benchmark script (available from the official TensorFlow GitHub) on the following networks: ResNet-50, ResNet-152, Inception v3, Inception v4, VGG-16, AlexNet, and NASNet.

We also compared FP16 to FP32 performance. We started from 'typical' batch sizes (64 in most cases), then doubled the batch size until the run failed with an out-of-memory error. We ran the same tests in 1, 2, and 4 GPU configurations. Finally, we ran the benchmarks with XLA enabled and saw substantial improvements.
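
The batch-size doubling described above can be sketched as a small search loop. This is only an illustration, not the script we used: `largest_batch_size`, `fits`, and `fake_run` are hypothetical names, and `fits` stands in for whatever launches a benchmark run and reports whether it completed without a memory error.

```python
from typing import Callable


def largest_batch_size(fits: Callable[[int], bool], start: int = 64,
                       limit: int = 1 << 16) -> int:
    """Double the batch size from `start` until `fits` reports a failure.

    Returns the last batch size that fit, or 0 if even `start` failed.
    """
    best = 0
    batch = start
    while batch <= limit and fits(batch):
        best = batch
        batch *= 2
    return best


# Stand-in for a real benchmark run (assumption): pretend that any
# batch size above 512 runs out of GPU memory.
def fake_run(batch_size: int) -> bool:
    return batch_size <= 512


print(largest_batch_size(fake_run))  # 512
```

In a real sweep, `fits` would start tf_cnn_benchmarks.py with the given batch size and return False when the process exits with an error.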

Key Points and Observations

  • In raw img/sec, the RTX 6000 performs on par with the RTX 8000. The two cards use the same Turing GPU but differ in memory size.
  • In terms of batch size, however, the RTX 6000 cannot fit the large batch sizes that the RTX 8000 can.
  • XLA significantly increases img/sec for most models. This holds for both FP16 and FP32, though the most dramatic gains were in FP16.
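
To put a number on the XLA observation, the sketch below divides XLA-on throughput by XLA-off throughput. The values are the 1 GPU img/sec figures for three models as read from the large-batch tables in this post; only models whose batch size is the same with and without XLA are included. The helper name `xla_speedup` is ours, for illustration.

```python
# 1 GPU img/sec at large batch sizes, read from the tables in this post.
xla_off = {
    "fp16": {"ResNet50": 637.56, "ResNet152": 281.94, "NASNET": 185.68},
    "fp32": {"ResNet50": 327.73, "ResNet152": 136.5, "NASNET": 181.28},
}
xla_on = {
    "fp16": {"ResNet50": 1081.7, "ResNet152": 419.07, "NASNET": 389.39},
    "fp32": {"ResNet50": 371.39, "ResNet152": 148.4, "NASNET": 393.08},
}


def xla_speedup(precision: str, model: str) -> float:
    """Ratio of XLA-on to XLA-off throughput for one model and precision."""
    return xla_on[precision][model] / xla_off[precision][model]


for precision in ("fp16", "fp32"):
    for model in xla_off[precision]:
        print(f"{precision} {model}: {xla_speedup(precision, model):.2f}x")
```

For ResNet50 this works out to roughly 1.7x in FP16 against about 1.13x in FP32. NASNet is the outlier, roughly doubling in both precisions.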

Quadro RTX 6000 Benchmark Snapshot, XLA on/off, FP32, FP16

[Chart: img/sec by model with XLA on and off, FP32 and FP16]

Quadro RTX 6000 Deep Learning Benchmarks: FP16, Large Batch Size (XLA on)

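| Model | 1 GPU img/sec | 2 GPU img/sec | 4 GPU img/sec | Batch Size |
|---|---|---|---|---|
| ResNet50 | 1081.7 | 2129.51 | 4132.64 | 512 |
| ResNet152 | 419.07 | 816.22 | 1547.55 | 256 |
| InceptionV3 | 566.32 | 1018.8 | 1962.5 | 256 |
| InceptionV4 | 313.07 | 454.6 | 792.61 | 256 |
| VGG16 | 535.07 | 1051.86 | 1996.54 | 512 |
| NASNET | 389.39 | 761.65 | 1521 | 256 |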
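
One way to read the multi-GPU columns is as scaling efficiency: measured throughput divided by what perfect linear scaling from the 1 GPU result would give. The sketch below does that arithmetic for three rows as read from the table above. `scaling_efficiency` is an illustrative helper, not part of the benchmark script.

```python
# img/sec for 1, 2 and 4 GPUs, read from the FP16 large-batch XLA table.
results = {
    "ResNet50": (1081.7, 2129.51, 4132.64),
    "ResNet152": (419.07, 816.22, 1547.55),
    "InceptionV4": (313.07, 454.6, 792.61),
}
GPU_COUNTS = (1, 2, 4)


def scaling_efficiency(throughputs):
    """Return throughput relative to perfect linear scaling from 1 GPU."""
    single = throughputs[0]
    return [t / (n * single) for n, t in zip(GPU_COUNTS, throughputs)]


for model, row in results.items():
    eff = scaling_efficiency(row)
    print(model, ", ".join(f"{n} GPU: {e:.0%}" for n, e in zip(GPU_COUNTS, eff)))
```

By this measure ResNet50 stays above 95% at 4 GPUs, while InceptionV4 drops to roughly 63%.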


Run these benchmarks

Set --num_gpus to the number of GPUs you want to test, and change --model to the desired architecture, adjusting --batch_size to match the table.

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=inception4 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --use_fp16=True --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10
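
When sweeping over models and GPU counts, it can be easier to generate the command lines than to edit them by hand. The sketch below is a hypothetical helper (`build_command` is our name) that assembles only the core flags from the commands in this post. It deliberately leaves out the remaining tuning flags of the full XLA command, so treat it as a starting point and not an exact reproduction of our runs.

```python
import shlex


def build_command(model, batch_size, num_gpus, use_fp16=True, xla=True):
    """Assemble a tf_cnn_benchmarks.py command line as a list of arguments."""
    args = [
        "python", "tf_cnn_benchmarks.py",
        f"--model={model}",
        f"--batch_size={batch_size}",
        f"--num_gpus={num_gpus}",
    ]
    if use_fp16:
        args.append("--use_fp16=True")
    if xla:
        # Variable update settings used for the XLA runs in this post.
        args += ["--xla_compile=True", "--variable_update=replicated",
                 "--all_reduce_spec=nccl"]
    else:
        args.append("--variable_update=parameter_server")
    return args


# Print one command per GPU configuration for Inception v4 at batch size 256.
for gpus in (1, 2, 4):
    print(shlex.join(build_command("inception4", 256, gpus)))
```

The returned list can be passed directly to `subprocess.run(args, check=True)` from the directory that contains the script.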

Quadro RTX 6000 Deep Learning Benchmarks: FP16 Batch Size 64 (XLA off)

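| Model | 1 GPU img/sec | 2 GPU img/sec | 4 GPU img/sec | Batch Size |
|---|---|---|---|---|
| ResNet50 | 555.66 | 989.02 | 1764.57 | 64 |
| ResNet152 | 251.12 | 404.76 | 687.17 | 64 |
| InceptionV3 | 338.07 | 584.42 | 1064.14 | 64 |
| InceptionV4 | 180.15 | 327.88 | 606.47 | 64 |
| VGG16 | 354.14 | 589.33 | 776.88 | 64 |
| NASNET | 163.96 | 283.55 | 518.32 | 64 |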


Run these benchmarks

Set --num_gpus to the number of GPUs you want to test, and change --model to the desired architecture.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --variable_update=parameter_server --use_fp16=True
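
To collect results from many runs, the final throughput figure can be pulled out of the captured output. The sketch below assumes the script ends its log with a line of the form "total images/sec: <value>". That format and the sample log are assumptions written by hand for illustration, so check them against your own output before relying on the pattern.

```python
import re

# Assumed summary line format: "total images/sec: 555.66"
TOTAL_RE = re.compile(r"^total images/sec:\s*([0-9]+(?:\.[0-9]+)?)", re.MULTILINE)


def total_images_per_sec(log_text):
    """Return the last 'total images/sec' value in the log, or None if absent."""
    matches = TOTAL_RE.findall(log_text)
    return float(matches[-1]) if matches else None


# Hand-written sample in the assumed format, not captured output.
sample_log = """\
----------------------------------------------------------------
total images/sec: 555.66
----------------------------------------------------------------
"""
print(total_images_per_sec(sample_log))  # 555.66
```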

Quadro RTX 6000 Deep Learning Benchmarks: FP16 Large Batch Size (XLA off)

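| Model | 1 GPU img/sec | 2 GPU img/sec | 4 GPU img/sec | Batch Size |
|---|---|---|---|---|
| ResNet50 | 637.56 | 1248.54 | 2450.09 | 512 |
| ResNet152 | 281.94 | 519.76 | 1019.9 | 256 |
| InceptionV3 | 391.08 | 737.77 | 1441.87 | 256 |
| InceptionV4 | 203.9 | 392.14 | 763.02 | 256 |
| VGG16 | 277.53 | 534.93 | 997.29 | 512 |
| NASNET | 185.68 | 354.48 | 684.18 | 256 |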



Run these benchmarks

Set --num_gpus to the number of GPUs you want to test, and change --model to the desired architecture, adjusting --batch_size to match the table.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=512 --model=resnet50 --variable_update=parameter_server --use_fp16=True

Quadro RTX 6000 Deep Learning Benchmarks: FP32, Large Batch Size (XLA on)

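| Model | 1 GPU img/sec | 2 GPU img/sec | 4 GPU img/sec | Batch Size |
|---|---|---|---|---|
| NASNET | 393.08 | 765.76 | 1526.41 | 256 |
| ResNet50 | 371.39 | 730.33 | 1391.21 | 256 |
| InceptionV3 | 235.37 | 455.26 | 872.55 | 128 |
| VGG16 | 212.84 | 421.94 | 746.47 | 256 |
| ResNet152 | 148.4 | 287.98 | 531.83 | 128 |
| InceptionV4 | 113.85 | 220.51 | 423.12 | 128 |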


Run these benchmarks

Set --num_gpus to the number of GPUs you want to test, and change --model to the desired architecture, adjusting --batch_size to match the table.

python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 --num_batches=100 --model=resnet50 --optimizer=momentum --variable_update=replicated --all_reduce_spec=nccl --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True --per_gpu_thread_count=2 --loss_type_to_report=base_loss --compute_lr_on_cpu=True --single_l2_loss_op=True --xla_compile=True --local_parameter_device=gpu --num_gpus=1 --display_every=10


Quadro RTX 6000 Deep Learning Benchmarks: FP32, Batch Size 64

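| Model | 1 GPU img/sec | 2 GPU img/sec | 4 GPU img/sec | Batch Size |
|---|---|---|---|---|
| ResNet50 | 314.08 | 589.16 | 1055.65 | 64 |
| ResNet152 | 128.32 | 235.46 | 430.65 | 64 |
| InceptionV3 | 206.29 | 388.36 | 726.53 | 64 |
| InceptionV4 | 101.99 | 192.91 | 370.47 | 64 |
| VGG16 | 192.97 | 347.27 | 549.97 | 64 |
| NASNET | 159.2 | 275.47 | 517.83 | 64 |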


Run these benchmarks

Set --num_gpus to the number of GPUs you want to test, and change --model to the desired architecture.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=64 --model=resnet50 --variable_update=parameter_server

Quadro RTX 6000 Deep Learning Benchmarks: FP32 Large Batch Size

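| Model | 1 GPU img/sec | 2 GPU img/sec | 4 GPU img/sec | Batch Size |
|---|---|---|---|---|
| ResNet50 | 327.73 | 642.62 | 1247.82 | 256 |
| ResNet152 | 136.5 | 255.47 | 496.64 | 128 |
| InceptionV3* | 215.95 | 266.22 | 429.75 | 256 (*128 when using 1 GPU) |
| InceptionV4 | 106.34 | 204.33 | 397.83 | 128 |
| VGG16 | 186.79 | 365.06 | 695.92 | 512 |
| NASNET | 181.28 | 343.81 | 669.67 | 256 |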


Run these benchmarks

Set --num_gpus to the number of GPUs you want to test, and change --model to the desired architecture, adjusting --batch_size to match the table.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server

Quadro RTX 6000 Deep Learning Benchmarks: AlexNet (FP32 and FP16, XLA on and off)

| Configuration | 1 GPU img/sec | 2 GPU img/sec | 4 GPU img/sec | Batch Size |
|---|---|---|---|---|
| AlexNet FP16 (Large Batch) | 5893.11 | 11471.56 | 22062.74 | 8192 |
| AlexNet FP16 (Normal Batch) | 6033.55 | 11361.15 | 14933.42 | 512 |
| AlexNet FP32 (Large Batch) | 2869.05 | 5508.14 | 10677.19 | 4096 |
| AlexNet FP32 (Normal Batch) | 4143.62 | 7979.58 | 10632.83 | 512 |
| AlexNet XLA FP32 | 3049.84 | 5895.65 | 11129.32 | 4096 |
| AlexNet XLA FP16 | 6733.82 | 13160.13 | 25290.29 | 8192 |


Run these deep learning benchmarks

Set --num_gpus to the number of GPUs you want to test. To run in FP32, omit the --use_fp16 flag and use the FP32 batch size listed in the table.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=8192 --model=alexnet --variable_update=parameter_server --use_fp16=True


System Specifications:

System: Exxact Valence Workstation
GPU: 4 x NVIDIA Quadro RTX 6000
CPU: Intel Core i7-7820X 3.6 GHz
RAM: 32 GB DDR4
SSD: 480 GB SSD
HDD (data): 10 TB HDD
OS: Ubuntu 18.04
NVIDIA Driver: 418.43
CUDA Version: 10.1
Python: 2.7
TensorFlow: 1.14
Docker Image: tensorflow/tensorflow:nightly-gpu


Training Parameters (non XLA)

Dataset: ImageNet (synthetic)
Mode: training
SingleSess: False
Batch Size: Varied
Num Batches: 100
Num Epochs: 0.08
Devices: ['/gpu:0']...(varied)
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: parameter_server


Training Parameters (XLA)

Dataset: ImageNet (synthetic)
Mode: training
SingleSess: False
Batch Size: Varied
Num Batches: 100
Num Epochs: 0.08
Devices: ['/gpu:0']...(varied)
NUMA bind: False
Data format: NCHW
Optimizer: momentum
Variables: replicated
AllReduce: nccl

