NVIDIA A100 Ampere Solutions
NVIDIA Ampere: Unprecedented Acceleration at Every Scale
The NVIDIA A100 Tensor Core GPU is based on the new NVIDIA Ampere GPU architecture and builds upon the capabilities of the prior NVIDIA V100 GPU. It adds many new features and delivers significantly faster performance for HPC, AI, and data analytics workloads. The A100 provides strong scaling for GPU compute and DL applications running in single- and multi-GPU workstations, servers, clusters, cloud data centers, systems at the edge, and supercomputers. The A100 GPU enables building elastic, versatile, and high-throughput data centers.
Let's Start Building
As an NVIDIA Elite Partner, Exxact Corporation works closely with the NVIDIA team to ensure seamless factory development and support. We pride ourselves on providing value-added service standards unmatched by our competitors.
Find the Right Fit for Your Needs
The Most Powerful End-to-End AI and HPC Data Center Platforms from Exxact
Compare Ampere’s Powerful Features
Fabricated on the TSMC 7nm N7 manufacturing process, the NVIDIA Ampere architecture-based GA100 GPU that powers the A100 includes 54.2 billion transistors with a die size of 826 mm².
| GPU | Memory | Memory Bandwidth | CUDA Cores | Tensor Cores | FP32 TFLOPS | FP64 TFLOPS | TF32 Tensor Core TFLOPS |
|---|---|---|---|---|---|---|---|
| Ampere A100 SXM4 | 40GB HBM2e | 1.555 TB/s | 6912 | 432 | 19.5 | 9.7 | 156/312* |
| Ampere A100 PCIe | 40GB HBM2e | 1.555 TB/s | 6912 | 432 | 19.5 | 9.7 | 156/312* |

*Higher figure is effective throughput with the Sparsity feature.
Faster Deep Learning with Sparsity Support
New Sparsity support in A100 Tensor Cores can exploit fine-grained structured sparsity in DL networks to double the throughput of Tensor Core operations. Sparsity features are described in detail in the fine-grained structured sparsity section later in this post.
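The pattern the hardware accelerates is 2:4 sparsity: in every group of four consecutive weights, two must be zero. A minimal NumPy sketch of this pruning pattern (an illustration, not NVIDIA's actual pruning tooling) looks like:

```python
import numpy as np

def prune_2_of_4(weights):
    """Apply 2:4 fine-grained structured sparsity: in every group of 4
    consecutive weights, keep the 2 with the largest magnitude and zero
    the rest -- the pattern A100 Sparse Tensor Cores can skip over."""
    w = weights.reshape(-1, 4).copy()
    # Indices of the two smallest-magnitude entries in each group of 4
    drop = np.argsort(np.abs(w), axis=1)[:, :2]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16)).astype(np.float32)
sparse_w = prune_2_of_4(w)
# Exactly half the weights remain, which is what lets the Tensor Cores
# double effective math throughput on the compressed matrix.
assert np.count_nonzero(sparse_w) == w.size // 2
```

In practice a network pruned this way is typically fine-tuned afterward to recover accuracy before the sparse weights are deployed.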
The larger and faster L1 cache and shared memory unit in A100 provides 1.5x the aggregate capacity per SM compared to V100 (192 KB vs. 128 KB per SM) to deliver additional acceleration for many HPC and AI workloads.
Several other new SM features improve efficiency and programmability and reduce software complexity.
High-Performance Computing with NVIDIA A100
To unlock next-generation discoveries, scientists look to simulations to better understand complex molecules for drug discovery, physics for potential new sources of energy, and atmospheric data to better predict and prepare for extreme weather patterns.
A100 introduces double-precision Tensor Cores, providing the biggest milestone since the introduction of double-precision computing in GPUs for HPC. This enables researchers to reduce a 10-hour, double-precision simulation running on NVIDIA V100 Tensor Core GPUs to just four hours on A100. HPC applications can also leverage TF32 precision in A100’s Tensor Cores to achieve up to 10x higher throughput for single-precision dense matrix multiply operations.
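TF32 achieves this speedup by keeping FP32's 8-bit exponent (so the dynamic range is unchanged) while computing with only 10 explicit mantissa bits. A rough NumPy sketch of the precision trade-off (using truncation as a simplification; the hardware rounds to nearest):

```python
import numpy as np

def to_tf32(x):
    """Approximate TF32 storage: keep FP32's sign bit and 8-bit exponent,
    but only the top 10 of the 23 mantissa bits. Real Tensor Cores round
    to nearest; truncation here is a simplification for illustration."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

x = np.array([1.2345678], dtype=np.float32)
y = to_tf32(x)
# Full FP32 exponent range is preserved; relative error stays below 2**-10
assert abs(float(y[0]) - float(x[0])) < 2**-10 * float(x[0])
```

Because inputs and outputs remain ordinary FP32 tensors, frameworks can route existing single-precision matrix math through TF32 without code changes.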
Geometric mean of application speedups vs. P100: benchmark applications: Amber [PME-Cellulose_NVE], Chroma [szscl21_24_128], GROMACS [ADH Dodec], MILC [Apex Medium], NAMD [stmv_nve_cuda], PyTorch [BERT Large Fine Tuner], Quantum Espresso [AUSURF112-jR], Random Forest FP32 [make_blobs (160000 x 64 : 10)], TensorFlow [ResNet-50], VASP 6 [Si Huge] | GPU node with dual-socket CPUs with 4x NVIDIA P100, V100, or A100 GPUs.
NVIDIA A100 Accelerates Deep Learning Training and Inference
BERT pre-training throughput using PyTorch, including (2/3) Phase 1 and (1/3) Phase 2 | Phase 1 Seq Len = 128, Phase 2 Seq Len = 512; V100: NVIDIA DGX-1™ server with 8x V100 using FP32 precision; A100: DGX A100 server with 8x A100 using TF32 precision.
BERT Large Inference | NVIDIA T4 Tensor Core GPU: NVIDIA TensorRT (TRT) 7.1, precision = INT8, batch size = 256 | V100: TRT 7.1, precision = FP16, batch size = 256 | A100 with 7 MIG instances of 1g.5gb: pre-production TRT, batch size = 94, precision = INT8 with sparsity.
A100 GPU Streaming Multiprocessor
The new streaming multiprocessor (SM) in the NVIDIA Ampere architecture-based A100 Tensor Core GPU significantly increases performance, builds upon features introduced in both the Volta and Turing SM architectures, and adds many new capabilities.
The A100's third-generation Tensor Cores enhance operand sharing, improve efficiency, and add powerful new data types, including the following:
- TF32 Tensor Core instructions that accelerate processing of FP32 data
- IEEE-compliant FP64 Tensor Core instructions for HPC
- BF16 Tensor Core instructions at the same throughput as FP16
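BF16 pairs FP32's 8-bit exponent with just 7 explicit mantissa bits, so it trades precision for FP32-like dynamic range; an encoding is simply the top 16 bits of the corresponding FP32 value. A NumPy sketch (truncation, not the hardware's round-to-nearest):

```python
import numpy as np

def to_bf16(x):
    """Approximate BF16 by truncation: BF16 is the top 16 bits of an
    FP32 encoding (sign, 8-bit exponent, 7 explicit mantissa bits)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

big = np.array([3.0e38], dtype=np.float32)  # far beyond FP16's max (~6.5e4)
assert np.isfinite(to_bf16(big)).all()      # but fits BF16's FP32-like range
```

That range headroom is why BF16 training usually needs no loss scaling, unlike FP16.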
| Precision | Peak Rate¹ | With Sparsity² |
|---|---|---|
| Peak FP64 | 9.7 TFLOPS | – |
| Peak FP64 Tensor Core | 19.5 TFLOPS | – |
| Peak FP32 | 19.5 TFLOPS | – |
| Peak FP16 | 78 TFLOPS | – |
| Peak BF16 | 39 TFLOPS | – |
| Peak TF32 Tensor Core | 156 TFLOPS | 312 TFLOPS |
| Peak FP16 Tensor Core | 312 TFLOPS | 624 TFLOPS |
| Peak BF16 Tensor Core | 312 TFLOPS | 624 TFLOPS |
| Peak INT8 Tensor Core | 624 TOPS | 1,248 TOPS |
| Peak INT4 Tensor Core | 1,248 TOPS | 2,496 TOPS |

Table 1. A100 Tensor Core GPU performance specs.
1) Peak rates are based on the GPU boost clock.
2) Effective TFLOPS/TOPS using the new Sparsity feature.
Exxact Systems Featuring NVIDIA Ampere GPUs Provide State of The Art Performance
3 Year Warranty
Exxact provides a 3-year warranty on all our systems. Have peace of mind and focus on what matters most, knowing you're taken care of.
Planning & Integration
Exxact works closely with you to build and spec a system that meets your high-performance computing and AI infrastructure needs.
System Testing & Validation
Each NVIDIA A100 GPU system is thoroughly tested and validated to ensure reliability and that real-world performance matches benchmark results.
- Rack Height: 2U
- Processor: 2x AMD EPYC 7002-Series
- Drive Bays: 4x 2.5" SATA/NVMe Hot-Swap
- 4x NVIDIA A100 SXM4 GPUs
- Rack Height: 4U
- Processor: 2x Intel Xeon Scalable family
- Drive Bays: 6x 2.5" SATA/NVMe Hot-Swap
- 8x NVIDIA A100 SXM4 GPUs
- Rack Height: 4U
- Processors Supported: 2x AMD EPYC 7001/7002 Series
- Drive Bays: 24x 2.5" Hot-Swap
- Supports up to 8x PCIe 4.0 x16 Double-Wide cards