
HGX-2 Benchmarks for Deep Learning in TensorFlow: A 16x V100 SXM3 NVSwitch GPU Server

August 16, 2019
5 min read

Exxact HGX-2 TensorEX Server Smashes Deep Learning Benchmarks

For this post, we show deep learning benchmarks for TensorFlow on an Exxact TensorEX HGX-2 Server. This behemoth of a Deep Learning Server has 16 NVIDIA Tesla V100 GPUs.

We ran the standard "tf_cnn_benchmarks.py" benchmark script from TensorFlow's GitHub repository. Tests were run on the following networks: ResNet-50, ResNet-152, Inception V3, Inception V4, and GoogLeNet. We also compared FP16 to FP32 performance, using a batch size of 256 (except for ResNet-152 FP32, where the batch size was 64). The same tests were run using 1, 2, 4, 8, and 16 GPU configurations. All benchmarks used 'vanilla' TensorFlow settings for both FP16 and FP32.
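
To reproduce a sweep like this, a short wrapper can loop over models and GPU counts and invoke the benchmark script once per combination. Below is a minimal sketch, assuming tf_cnn_benchmarks.py sits in the working directory; the model identifiers are, to our knowledge, the names the script accepts for the networks tested here.

import subprocess

# Models and GPU counts covered in this post.
MODELS = ["resnet50", "resnet152", "inception3", "inception4", "googlenet"]
GPU_COUNTS = [1, 2, 4, 8, 16]

def run_benchmark(model, num_gpus, use_fp16):
    """Run tf_cnn_benchmarks.py once and return its console output."""
    cmd = [
        "python", "tf_cnn_benchmarks.py",
        "--num_gpus=%d" % num_gpus,
        "--batch_size=256",
        "--model=%s" % model,
        "--variable_update=parameter_server",
    ]
    if use_fp16:
        cmd.append("--use_fp16=True")
    return subprocess.check_output(cmd)

for use_fp16 in (True, False):
    for model in MODELS:
        for num_gpus in GPU_COUNTS:
            print(run_benchmark(model, num_gpus, use_fp16))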

Notable HGX-2 Server Features

  • 16x NVIDIA Tesla V100 SXM3
  • 81,920 NVIDIA CUDA Cores
  • 10,240 NVIDIA Tensor Cores
  • 0.5 TB Total GPU Memory
  • NVSwitch powered by NVLink, 2.4 TB/sec aggregate bandwidth
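
The aggregate figures above follow directly from the per-GPU Tesla V100 SXM3 specs (5,120 CUDA cores, 640 Tensor Cores, and 32 GB of HBM2 per GPU), as this quick sanity check shows:

# Per-GPU specs for Tesla V100 SXM3.
NUM_GPUS = 16
CUDA_CORES_PER_GPU = 5120
TENSOR_CORES_PER_GPU = 640
MEMORY_GB_PER_GPU = 32

print(NUM_GPUS * CUDA_CORES_PER_GPU)    # 81920 CUDA cores
print(NUM_GPUS * TENSOR_CORES_PER_GPU)  # 10240 Tensor Cores
print(NUM_GPUS * MEMORY_GB_PER_GPU)     # 512 GB = 0.5 TB total GPU memory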


Exxact TensorEX HGX-2 Deep Learning Benchmarks: FP16



Model          2 GPU img/sec   4 GPU img/sec   Batch Size
ResNet-50      1735.56         3218            128
ResNet-152     760.57          1415.56         128
Inception V3   1134.88         2161.02         128
Inception V4   602.36          1205.97         128
GoogLeNet      2820.47         5265.14         128


Run these FP16 benchmarks

Set --num_gpus to the number of GPUs you want to test, and set --model to the desired architecture. Note that in tf_cnn_benchmarks, --batch_size is the batch size per device.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server --use_fp16=True

Exxact TensorEX HGX-2 Deep Learning Benchmarks: FP32



Model          2 GPU img/sec   4 GPU img/sec   Batch Size
ResNet-50      762.21          1432.69         128
ResNet-152     278.17          577.26          128
Inception V3   495.51          926.93          128
Inception V4   227.05          455.65          128
GoogLeNet      1692.94         3393.91         128


Run these FP32 benchmarks

To run FP32, remove the --use_fp16 flag, set --num_gpus to the number of GPUs you want to test, and set --model to the desired architecture.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server
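
Putting the two tables side by side, FP16 delivers roughly 1.6x (GoogLeNet) to 2.6x (Inception V4) the FP32 throughput at 4 GPUs. A small script to compute the speedups from the table values above:

# 4-GPU img/sec figures copied from the FP16 and FP32 tables above.
FP16 = {"ResNet-50": 3218.00, "ResNet-152": 1415.56, "Inception V3": 2161.02,
        "Inception V4": 1205.97, "GoogLeNet": 5265.14}
FP32 = {"ResNet-50": 1432.69, "ResNet-152": 577.26, "Inception V3": 926.93,
        "Inception V4": 455.65, "GoogLeNet": 3393.91}

for model in sorted(FP16):
    print("%s: %.2fx FP16 speedup" % (model, FP16[model] / FP32[model]))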

Other Notes and Future Plans for HGX-2

The HGX-2 GPU server is an absolute monster for deep learning and other GPU-powered HPC tasks. In the future, we would like to run further benchmarks on more models, as well as on acceleration methods such as XLA for TensorFlow, where we would expect significant performance gains. Training models at even larger batch sizes is another area we may explore.
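
For reference, XLA auto-JIT can be enabled in TensorFlow 1.14 through the session configuration (tf_cnn_benchmarks also exposes an --xla flag for the same purpose). A minimal sketch of the session-level approach:

import tensorflow as tf

# Enable XLA auto-JIT clustering for this session (TF 1.x API).
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = (
    tf.OptimizerOptions.ON_1)

with tf.Session(config=config) as sess:
    # Build and run the model as usual; XLA compiles eligible op
    # clusters automatically.
    pass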

System Specifications:

System: Exxact TensorEX HGX-2
GPU: 16x NVIDIA Tesla V100 32 GB SXM3
CPU: 2x Intel Xeon Platinum 8168
RAM: 1.5 TB DDR4
SSD (OS): 2x 1 TB NVMe (RAID 1)
SSD (Data): 32 TB NVMe
OS: Ubuntu 16.04
NVIDIA Driver: 418.67
CUDA Version: 10.1
Python: 2.7
TensorFlow: 1.14
Docker Image: tensorflow/tensorflow:nightly-gpu


Training Parameters

Dataset: ImageNet (synthetic)
Mode: training
SingleSess: False
Batch Size: 256 per device*
Num Batches: 100
Num Epochs: 0.08
Devices: ['/gpu:0']... (varied)
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: parameter_server
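
These parameters map directly onto tf_cnn_benchmarks flags (data_format, optimizer, num_batches, and variable_update are all command-line options, and most of the values above are the script's defaults), so a 16-GPU FP16 run matching these settings would look roughly like this:

python tf_cnn_benchmarks.py --num_gpus=16 --batch_size=256 --num_batches=100 --model=resnet50 --optimizer=sgd --data_format=NCHW --variable_update=parameter_server --use_fp16=True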

Interested in More Deep Learning Benchmarks?
