NVIDIA RTX PRO 4500 Blackwell Server Edition — LLM Inference Benchmark

May 20, 2026

Introduction

AI inference is no longer confined to the cloud. Data-privacy requirements are getting stricter and latency targets tighter, while capable open‑weight models are increasingly available, so more organizations can run LLMs on hardware they own and control. This demand for privacy, performance, flexibility, and operational control is accelerating the shift toward on‑premises and edge inference deployments.

We are testing NVIDIA’s newest GPU, the NVIDIA RTX PRO™ 4500 Blackwell Server Edition, to determine which model sizes it supports, how fast it runs them, how it scales across multiple GPUs, and where it fits in an LLM deployment.

The RTX PRO 4500 Blackwell Server Edition is a single-slot, passively cooled GPU featuring:

  • 10,496 CUDA cores
  • 32GB of GDDR7 memory
  • 256-bit memory bus with 800 GB/s of bandwidth
  • 165W TDP

It is designed for dense server and rack deployments where slot count, airflow, and power per GPU are constraints, especially on edge deployments.

For more information about the NVIDIA RTX PRO 4500 Server Edition, read our release blog here.

Why VRAM and Memory Bandwidth Are the Bottleneck

To understand why the RTX PRO 4500 Server Edition's 32GB matters, it helps to understand what actually limits LLM inference performance. The answer is rarely compute; it is memory. If the model does not fit in VRAM at all, it falls back to system RAM or storage, which is 10 to 50 times slower.

Model size determines what fits, but quantization determines what's possible. Half precision (FP16) stores each model weight in 16 bits. Quantization compresses those weights to fewer bits, trading a small amount of accuracy for a much smaller memory footprint. The practical effect:

  • Llama 3.1 8B at FP16: ~16GB, fits on any modern GPU with 16GB or more
  • Qwen2.5 32B at Q4_K_M (4-bit): ~19GB, requires at least 20GB VRAM; this is where 16GB cards are eliminated
  • Llama 3.3 70B at Q3_K_M (3-bit): ~28–30GB, theoretically fits in 32GB VRAM, but runtime overhead leaves little headroom and can affect performance. This is where the RTX PRO 4500 Server Edition's 32GB is the differentiator

Quality loss at Q4 is minimal for most tasks. At Q3, degradation is measurable but acceptable for research, coding assistance, and document workflows.
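
As a rough sanity check on the sizes above, the weight footprint can be estimated as parameter count times bits per weight. The sketch below is illustrative only: it uses nominal bit widths (16, 4, 3), so it gives a lower bound. The ~19GB and ~28–30GB figures quoted above are higher, likely because K-quant formats keep some tensors at higher precision and store per-block scales, and because the runtime also needs room for the KV cache. The function name and the model list are assumptions made for this example.

```python
def nominal_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Nominal size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 weights * bits / 8 bytes, divided by 1e9 bytes per GB
    return params_billion * bits_per_weight / 8


VRAM_GB = 32  # capacity of one RTX PRO 4500 Blackwell Server Edition

# (label, parameters in billions, nominal bits per weight)
models = [
    ("Llama 3.1 8B FP16", 8, 16),
    ("Qwen2.5 32B 4-bit", 32, 4),
    ("Llama 3.3 70B 3-bit", 70, 3),
    ("Llama 3.3 70B FP16", 70, 16),
]

for name, params_b, bits in models:
    size = nominal_weight_gb(params_b, bits)
    verdict = "fits" if size < VRAM_GB else "does not fit"
    print(f"{name}: ~{size:.1f} GB nominal, {verdict} in {VRAM_GB} GB")
```

The last entry shows why quantization matters here: at FP16 a 70B model needs about 140 GB for weights alone, far beyond a single 32GB card.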

NVIDIA RTX PRO 4500 Benchmark Results for LLM

The two primary metrics are token generation speed (tg tok/s), which is how fast the model produces output, and prompt processing speed (pp tok/s), which is how fast the model digests the input. Token generation is the primary user-experience metric. At 10 tok/s, the output is very slow but readable; 30 tok/s is the baseline for usable; above 50 tok/s is the gold standard; and 100+ tok/s is near instant.
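
To make the tok/s tiers concrete, the sketch below converts them into wait time for one response. It is a simplified model: prompt processing time plus generation time, ignoring network and scheduling overhead. The 300-token answer length is an arbitrary assumption for illustration, and the function name is hypothetical.

```python
from typing import Optional


def response_seconds(output_tokens: int, tg_tok_s: float,
                     prompt_tokens: int = 0,
                     pp_tok_s: Optional[float] = None) -> float:
    """Approximate time to finish one response: prefill time plus decode time."""
    prefill = prompt_tokens / pp_tok_s if pp_tok_s else 0.0
    return prefill + output_tokens / tg_tok_s


ANSWER_TOKENS = 300  # assumed answer length, not a measured value

for tg in (10, 30, 50, 100):
    seconds = response_seconds(ANSWER_TOKENS, tg)
    print(f"{tg:>3} tok/s -> {seconds:.0f} s for a {ANSWER_TOKENS}-token answer")
```

Under these assumptions, the same answer takes 30 seconds at 10 tok/s and 3 seconds at 100 tok/s, which is why token generation speed dominates perceived responsiveness.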

The benchmark covers three models tested on 1, 2, and 4 GPU configurations:

  • Llama 3.1 8B FP16: the baseline, full precision, fits any GPU with 16GB or more
  • Qwen2.5 32B Q4_K_M: a mid-tier coding and reasoning model at 4-bit quantization, requiring ~19GB VRAM
  • Llama 3.3 70B Q3_K_M: flagship-class reasoning at 3-bit quantization, requiring ~28–30GB VRAM; a 32GB card is the minimum

Figure: NVIDIA RTX PRO 4500 Blackwell LLM inference benchmark (Ollama) on Llama 3.1, Qwen2.5, and Llama 3.3

  • Scaling is limited on PCIe: For 8B and 32B models, going from 1 → 4 GPUs delivers ~0% throughput gain while increasing power draw ~38–42%, indicating the workload is already saturated on a single card (compute-bound at 8B; memory-bandwidth-bound at 32B).
  • Small and mid-size models are already “fast enough” on 1 GPU
    • Llama 3.1 8B delivers ~135–149 tg tok/s with extremely high prompt processing (~6,257 tok/s).
    • Qwen2.5 32B at Q4 delivers ~36–38 tg tok/s (real-time interactive) with prompt processing ~1,600–1,760 tok/s.
  • 70B is the inflection point where multi-GPU or a higher-VRAM GPU becomes necessary
    • Single-GPU performance is constrained (llama-bench runs out of memory; Ollama reaches ~8.6 tok/s)
    • 2 GPUs is the practical minimum for usability (~20.4 tok/s), but GPUs beyond 2 show minimal benefit due to inter-GPU sync and PCIe bandwidth limits. Because this does not scale well, we recommend choosing a GPU with more VRAM instead.
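
Two of the findings above can be checked with back-of-the-envelope arithmetic. The sketch below assumes a common rule of thumb for bandwidth-bound decoding: each generated token reads all weights once, so single-stream tok/s is capped at memory bandwidth divided by weight size. It applies that to the 32B result using the 800 GB/s and ~19GB figures from this article, then computes what flat throughput at +38–42% power means for tokens per watt. Function names are hypothetical.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream tok/s if every token reads all weights once."""
    return bandwidth_gb_s / weights_gb


def relative_tokens_per_watt(throughput_gain: float, power_gain: float) -> float:
    """Tokens per watt relative to the 1-GPU baseline; gains are fractions (0.38 = +38%)."""
    return (1 + throughput_gain) / (1 + power_gain)


# Qwen2.5 32B at Q4_K_M: ~19 GB of weights on an 800 GB/s card
ceiling = decode_ceiling_tok_s(bandwidth_gb_s=800, weights_gb=19)
print(f"32B Q4 bandwidth ceiling: ~{ceiling:.0f} tok/s")
for measured in (36, 38):
    print(f"  measured {measured} tok/s = {measured / ceiling:.0%} of ceiling")

# 1 -> 4 GPUs on 8B/32B: ~0% throughput gain, +38% to +42% power draw
for power_gain in (0.38, 0.42):
    eff = relative_tokens_per_watt(0.0, power_gain)
    print(f"4 GPUs at +{power_gain:.0%} power: {eff:.0%} of 1-GPU tokens per watt")
```

The measured 36–38 tok/s sits close to the estimated ceiling of about 42 tok/s, which is consistent with the 32B model being memory-bandwidth-bound on a single card.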

The FP4 Advantage and vLLM with Native NVFP4

The NVIDIA RTX PRO Blackwell lineup of GPUs, including the RTX PRO 4500 Server Edition, natively supports NVFP4, a hardware-accelerated 4-bit floating-point format that executes at 4-bit precision without a dequantization step. We ran Llama 3.1 8B on vLLM because it supports NVFP4 and Ollama does not.

Figure: NVIDIA RTX PRO 4500 NVFP4 vs BF16 LLM inference benchmark on Llama 3.1 8B

With NVFP4, a single NVIDIA RTX PRO 4500 Blackwell Server Edition delivers 4,870 tok/s aggregate and a 13ms time-to-first-token, 2.4x the throughput of BF16 at the same power draw. Two cards under NVFP4 reach 6,810 tok/s in batch, but scaling to 4 GPUs hits a wall because of the PCIe bottleneck.

These numbers, however, are not directly comparable to the Ollama results above since Ollama measures single-user, single-request, whereas vLLM batches requests across many concurrent users. Here's how to interpret tok/s across stacks:

  • Ollama tok/s is best read as per-user interactive streaming speed for a single in-flight request.
  • vLLM tok/s is best read as aggregate server throughput under batching (how many total tokens the GPU can produce across many concurrent requests).
  • The large gap is meaningful because it shows the headroom unlocked by a production serving stack (and NVFP4): the same GPU can look “normal” in single-user mode, but deliver dramatically higher total capacity when fully utilized for multi-user inference.

This batching is what makes vLLM beneficial for multi-user serving: its tok/s figure is the aggregate output across all concurrent requests simultaneously. The right tool depends on the deployment pattern. You can read more about Ollama vs vLLM in our previous post.
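
To relate the two kinds of tok/s, an aggregate figure can be divided by the number of requests in flight to get an approximate per-user speed. The sketch below is illustrative only. The concurrency levels are hypothetical, since the article does not state how many concurrent requests produced the 4,870 tok/s figure, and it assumes the aggregate holds at each level, which is only plausible when the server is saturated.

```python
def per_user_tok_s(aggregate_tok_s: float, concurrent_requests: int) -> float:
    """Average streaming speed each request sees when sharing the aggregate evenly."""
    if concurrent_requests < 1:
        raise ValueError("concurrent_requests must be at least 1")
    return aggregate_tok_s / concurrent_requests


AGGREGATE_TOK_S = 4870  # single card, vLLM with NVFP4, Llama 3.1 8B (from this article)

# Hypothetical concurrency levels, not values from the benchmark
for concurrent in (32, 64, 128):
    speed = per_user_tok_s(AGGREGATE_TOK_S, concurrent)
    print(f"{concurrent:>3} concurrent requests -> ~{speed:.0f} tok/s per user")
```

Read this way, a high aggregate number does not mean any single user sees thousands of tokens per second; it means many users can each be served at an interactive speed at once.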

When Multi-GPU Helps and When It Doesn't

The benchmark data tells a clear story on multi-GPU scaling: for models that fit on a single card, adding GPUs does almost nothing for a single user's throughput.

  • The PCIe constraint: Multi-GPU communication runs over PCIe 5.0 at roughly 64 GB/s of inter-GPU bandwidth. When multiple GPUs run a single model in tensor-parallel mode, they must synchronize. Over PCIe, that synchronization overhead accumulates quickly and becomes the bottleneck before compute or bandwidth ever do.
  • Fleet Mode vs Tensor-Parallel
    • Tensor-parallel: split one request across GPUs; PCIe sync becomes the limiter.
    • Fleet mode: 1 GPU per user/instance; scales linearly for multi-user throughput.
    • If model fits in 32GB: projected 4x fleet ~19,480 tok/s vs 4x tensor-parallel 5,446 tok/s (NVFP4).

|  | Does help | Doesn’t help |
| --- | --- | --- |
| Parameter Count | 70B on 2 GPUs (tensor-parallel): Single GPU is limited (Ollama ~8.6 tok/s / llama-bench OOM). 2 GPUs get ~20.4 tok/s (usable). | 8B / 32B (single user): Extra GPUs don’t help: 8B is compute-bound, 32B is bandwidth-bound; PCIe can’t add per-request speed. |
| Throughput | Concurrent users (fleet mode): Run 1 instance per GPU. Example: 4 GPUs = 4 users at full single-GPU speed (linear aggregate throughput). | Single request on 3–4 GPUs: PCIe sync (~64 GB/s) bottlenecks; beyond 2 GPUs adds power, not speed (more idle time). |
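
The fleet versus tensor-parallel comparison comes down to simple arithmetic on the NVFP4 numbers reported in this article (4,870 tok/s on 1 GPU, 6,810 on 2, 5,446 on 4). The sketch below is a minimal illustration with hypothetical function names. The fleet figure is a projection that assumes perfectly linear scaling across independent instances, as the article does.

```python
def fleet_projection(single_gpu_tok_s: float, gpus: int) -> float:
    """Projected aggregate tok/s with one independent instance per GPU (linear scaling assumed)."""
    return single_gpu_tok_s * gpus


def scaling_efficiency(multi_gpu_tok_s: float, single_gpu_tok_s: float, gpus: int) -> float:
    """Fraction of ideal linear scaling actually achieved by a multi-GPU run."""
    return multi_gpu_tok_s / (single_gpu_tok_s * gpus)


SINGLE = 4870                      # 1 GPU, vLLM NVFP4, Llama 3.1 8B
TENSOR_PARALLEL = {2: 6810, 4: 5446}  # measured aggregate tok/s by GPU count

for gpus, measured in TENSOR_PARALLEL.items():
    eff = scaling_efficiency(measured, SINGLE, gpus)
    print(f"{gpus} GPUs tensor-parallel: {measured} tok/s, "
          f"{measured / SINGLE:.2f}x of 1 GPU, {eff:.0%} of linear")

fleet = fleet_projection(SINGLE, 4)
print(f"4 GPUs fleet mode (projected): {fleet} tok/s, "
      f"{fleet / TENSOR_PARALLEL[4]:.1f}x the 4-GPU tensor-parallel result")
```

On these numbers, 2-GPU tensor parallelism reaches about 70% of linear scaling and 4-GPU about 28%, while the fleet projection is roughly 3.6x the 4-GPU tensor-parallel throughput.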

When to Choose the RTX PRO 4500 Server Edition for Local LLMs

The RTX PRO 4500 Server Edition is not a fit for every AI workload. It is purpose-built for a specific set of requirements, and understanding where it fits well is more useful than a broad recommendation.

It is a strong fit for:

  • AI research and data science teams running open-weight models. A single card covers the full range from 8B to 32B parameter models comfortably and eliminates LLM API costs.
  • Organizations with regulated data environments. HIPAA-covered healthcare organizations running clinical NLP or document processing, and academic institutions with FERPA obligations, need inference to stay on-premises. The 32GB VRAM capacity makes the full useful model tier accessible.
  • Dense rack deployments with power and slot constraints. Single-slot passive design at 165W means four cards fit where two active dual-slot cards would otherwise go. For organizations building inference capacity within existing rack infrastructure, that density arithmetic matters.
  • Multi-user inference nodes for small to mid-size teams. Four cards in a 2U server, each running an independent inference instance, serve four concurrent users at full single-GPU throughput. That configuration covers most small team deployments.

It is not the right fit for:

  • Teams wanting to run 70B models. The Q3_K_M quantization barely fits, using 31GB of VRAM, and a multi-GPU setup tops out at about 20 tok/s. A higher-tier GPU would increase throughput without being limited by GPU-to-GPU PCIe communication.
  • Large-scale batch inference serving hundreds of concurrent users. At that scale, purpose-built data center accelerators with HBM memory are the correct infrastructure.

Conclusion

The RTX PRO 4500 Blackwell Server Edition occupies a specific and practical position in the on-premises AI inference landscape. Its 32GB of GDDR7 memory on a single-slot passive 165W card is what makes it relevant: that combination of capacity and form factor is not common at this price and power tier.

| Use case | Recommendation | Deployment |
| --- | --- | --- |
| Single user, 8B or 32B model | 1x RTX PRO 4500 Server Edition | Ollama |
| 70B at usable interactive speed (>15 tok/s) | 2x RTX PRO 4500 Server Edition | vLLM |
| Multiple concurrent users, independent sessions | 4x RTX PRO 4500 Server Edition, 1 GPU per user | Ollama or vLLM |
| Models larger than 32GB (70B Q4, 105B+) | 2 to 4x as needed for capacity but slow. Recommend upgrading or choosing a stronger GPU. | vLLM |

  • A single card runs 8B models at 134 tok/s, 32B models at 36 tok/s, and can load a 70B model at reduced throughput. Two cards bring 70B inference to a usable 20 tok/s, but the setup does not scale beyond that.
  • For multi-user deployments, four independent instances across four cards scale linearly to 458 tok/s aggregate.
  • With NVFP4 enabled in vLLM, the Blackwell Tensor Cores deliver 4,870 tok/s on the 8B model, 2.4x the throughput of BF16 at the same power draw.
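
The recommendation table above can also be read as a small decision rule. The sketch below is one possible encoding, not an official sizing tool. The 28 GB threshold (standing in for the 70B Q3_K_M case), the cap of four cards per node, and all names are assumptions introduced for this example.

```python
from dataclasses import dataclass

CARD_VRAM_GB = 32        # one RTX PRO 4500 Blackwell Server Edition
NEAR_FULL_VRAM_GB = 28   # assumption: models this large behave like 70B Q3_K_M
MAX_CARDS_PER_NODE = 4   # assumption: largest configuration tested in this article


@dataclass
class Recommendation:
    cards: str
    stack: str
    note: str = ""


def recommend(model_vram_gb: float, concurrent_users: int = 1) -> Recommendation:
    """Map a model footprint and user count to a row of the recommendation table."""
    if model_vram_gb > CARD_VRAM_GB:
        return Recommendation("2-4", "vLLM",
                              "for capacity only; slow, consider a stronger GPU")
    if model_vram_gb >= NEAR_FULL_VRAM_GB:
        return Recommendation("2", "vLLM", "tensor-parallel, about 20 tok/s")
    if concurrent_users > 1:
        cards = min(concurrent_users, MAX_CARDS_PER_NODE)
        return Recommendation(str(cards), "Ollama or vLLM", "1 GPU per user")
    return Recommendation("1", "Ollama")


for vram_gb, users in [(16, 1), (19, 1), (30, 1), (16, 4), (40, 1)]:
    print(f"{vram_gb} GB model, {users} user(s): {recommend(vram_gb, users)}")
```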

For AI teams that need capable on-premises inference within the constraints of standard rack infrastructure, regulated data environments, or limited power budgets, the RTX PRO 4500 Server Edition is a well-matched option. The hardware is available now. Configurations built around it are straightforward to deploy and operate within existing server infrastructure without specialized cooling or power provisioning.

Fueling Innovation with an Exxact Multi-GPU Server

Run your own open-weight LLM and skip the snowballing API costs. Built for individuals and teams alike. It's not just a piece of hardware, but the tool that propels, accelerates, and enables your research.
