
HGX-2 Benchmarks for Deep Learning in TensorFlow: A 16x V100 SXM3 NVSwitch GPU Server

August 16, 2019
5 min read

Exxact HGX-2 TensorEX Server Smashes Deep Learning Benchmarks

For this post, we show deep learning benchmarks for TensorFlow on an Exxact TensorEX HGX-2 Server. This behemoth of a Deep Learning Server has 16 NVIDIA Tesla V100 GPUs.

We ran the standard "tf_cnn_benchmarks.py" benchmark script from TensorFlow's GitHub repository. Tests were run on the following networks: ResNet-50, ResNet-152, Inception V3, Inception V4, and GoogLeNet. We also compared FP16 to FP32 performance, using a batch size of 256 (except for ResNet-152 FP32, where the batch size was 64). The same tests were run using 1, 2, 4, 8, and 16 GPU configurations. All benchmarks used 'vanilla' TensorFlow settings for both FP16 and FP32.
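
To reproduce a sweep like this, a short wrapper can loop over models and GPU counts and invoke the benchmark script once per combination. Below is a minimal sketch, assuming tf_cnn_benchmarks.py sits in the working directory; the model identifiers are, to our knowledge, the names the script accepts for the networks tested here.

import subprocess

# Models and GPU counts covered in this post.
MODELS = ["resnet50", "resnet152", "inception3", "inception4", "googlenet"]
GPU_COUNTS = [1, 2, 4, 8, 16]

def run_benchmark(model, num_gpus, use_fp16):
    """Run tf_cnn_benchmarks.py once and return its console output."""
    cmd = [
        "python", "tf_cnn_benchmarks.py",
        "--num_gpus=%d" % num_gpus,
        "--batch_size=256",
        "--model=%s" % model,
        "--variable_update=parameter_server",
    ]
    if use_fp16:
        cmd.append("--use_fp16=True")
    return subprocess.check_output(cmd)

for use_fp16 in (True, False):
    for model in MODELS:
        for num_gpus in GPU_COUNTS:
            print(run_benchmark(model, num_gpus, use_fp16))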

Notable HGX-2 Server Features

  • 16x NVIDIA Tesla V100 SXM3
  • 81,920 NVIDIA CUDA Cores
  • 10,240 NVIDIA Tensor Cores
  • 0.5 TB Total GPU Memory
  • NVSwitch powered by NVLink, 2.4 TB/sec aggregate bandwidth
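
The aggregate figures above follow directly from the per-GPU Tesla V100 SXM3 specs (5,120 CUDA cores, 640 Tensor Cores, and 32 GB of HBM2 per GPU), as this quick sanity check shows:

# Per-GPU specs for Tesla V100 SXM3.
NUM_GPUS = 16
CUDA_CORES_PER_GPU = 5120
TENSOR_CORES_PER_GPU = 640
MEMORY_GB_PER_GPU = 32

print(NUM_GPUS * CUDA_CORES_PER_GPU)    # 81920 CUDA cores
print(NUM_GPUS * TENSOR_CORES_PER_GPU)  # 10240 Tensor Cores
print(NUM_GPUS * MEMORY_GB_PER_GPU)     # 512 GB = 0.5 TB total GPU memory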


Exxact TensorEX HGX-2 Deep Learning Benchmarks: FP16



Model          2 GPU img/sec   4 GPU img/sec   Batch Size
ResNet-50      1735.56         3218            128
ResNet-152     760.57          1415.56         128
Inception V3   1134.88         2161.02         128
Inception V4   602.36          1205.97         128
GoogLeNet      2820.47         5265.14         128


Run these FP16 benchmarks

Set --num_gpus to the number of GPUs you want to test, and set --model to the desired architecture. Note that in tf_cnn_benchmarks, --batch_size is the batch size per device.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server --use_fp16=True

Exxact TensorEX HGX-2 Deep Learning Benchmarks: FP32



Model          2 GPU img/sec   4 GPU img/sec   Batch Size
ResNet-50      762.21          1432.69         128
ResNet-152     278.17          577.26          128
Inception V3   495.51          926.93          128
Inception V4   227.05          455.65          128
GoogLeNet      1692.94         3393.91         128


Run these FP32 benchmarks

To run FP32, remove the --use_fp16 flag, set --num_gpus to the number of GPUs you want to test, and set --model to the desired architecture.

python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=256 --model=resnet50 --variable_update=parameter_server
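
Putting the two tables side by side, FP16 delivers roughly 1.6x (GoogLeNet) to 2.6x (Inception V4) the FP32 throughput at 4 GPUs. A small script to compute the speedups from the table values above:

# 4-GPU img/sec figures copied from the FP16 and FP32 tables above.
FP16 = {"ResNet-50": 3218.00, "ResNet-152": 1415.56, "Inception V3": 2161.02,
        "Inception V4": 1205.97, "GoogLeNet": 5265.14}
FP32 = {"ResNet-50": 1432.69, "ResNet-152": 577.26, "Inception V3": 926.93,
        "Inception V4": 455.65, "GoogLeNet": 3393.91}

for model in sorted(FP16):
    print("%s: %.2fx FP16 speedup" % (model, FP16[model] / FP32[model]))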

Other Notes and Future Plans for HGX-2

The HGX-2 GPU server is an absolute monster for deep learning and other GPU-powered HPC tasks. In the future, we would like to run further benchmarks on more models, as well as on acceleration methods such as XLA for TensorFlow, where we would expect significant performance gains. Training models at even larger batch sizes is another area we may explore.
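
For reference, XLA auto-JIT can be enabled in TensorFlow 1.14 through the session configuration (tf_cnn_benchmarks also exposes an --xla flag for the same purpose). A minimal sketch of the session-level approach:

import tensorflow as tf

# Enable XLA auto-JIT clustering for this session (TF 1.x API).
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = (
    tf.OptimizerOptions.ON_1)

with tf.Session(config=config) as sess:
    # Build and run the model as usual; XLA compiles eligible op
    # clusters automatically.
    pass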

System Specifications:

System: Exxact TensorEX HGX-2
GPU: 16x NVIDIA Tesla V100 32 GB SXM3
CPU: 2x Intel Xeon Platinum 8168
RAM: 1.5 TB DDR4
SSD (OS): 2x 1 TB NVMe (RAID 1)
SSD (Data): 32 TB NVMe
OS: Ubuntu 16.04
NVIDIA Driver: 418.67
CUDA Version: 10.1
Python: 2.7
TensorFlow: 1.14
Docker Image: tensorflow/tensorflow:nightly-gpu


Training Parameters

Dataset: ImageNet (synthetic)
Mode: training
SingleSess: False
Batch Size: 256 per device*
Num Batches: 100
Num Epochs: 0.08
Devices: ['/gpu:0']... (varied)
NUMA bind: False
Data format: NCHW
Optimizer: sgd
Variables: parameter_server
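
These parameters map directly onto tf_cnn_benchmarks flags (data_format, optimizer, num_batches, and variable_update are all command-line options, and most of the values above are the script's defaults), so a 16-GPU FP16 run matching these settings would look roughly like this:

python tf_cnn_benchmarks.py --num_gpus=16 --batch_size=256 --num_batches=100 --model=resnet50 --optimizer=sgd --data_format=NCHW --variable_update=parameter_server --use_fp16=True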

Interested in More Deep Learning Benchmarks?
