Benchmarking Local AI Agents on NVIDIA DGX Spark (Ollama + OpenClaw)

May 14, 2026

Introduction

Running an open LLM locally now has an extremely low barrier to entry with Ollama. The real question is whether a local model can behave like a reliable agent: call tools correctly, pass valid arguments, handle messy outputs, and finish multi-step tasks without drifting.

To measure that, we built a reproducible benchmark suite around Ollama + OpenClaw and ran it on NVIDIA DGX Spark. This post covers what we tested, how we scored agent behavior, and which models held up best in multi-step tool workflows.

TL;DR (What this post answers)

- This benchmark measures agent behavior (tool calling, argument discipline, injection resistance, and multi-step chaining), not just tokens/sec.
- We tested three models each from the Qwen, Gemma, and Nemotron families.
- In our DGX Spark runs, larger “agent-oriented” models tended to be more reliable over multi-step chains.
- One mid-size model, gemma4:26b, stood out for speed plus clean agent behavior, making it a practical local-agent default if you don’t want 100B-class latency.
- If you’re choosing a local agent model, reliability and hop depth matter more than raw throughput.

Why DGX Spark Fits This Workload

DGX Spark is a compact desktop system built on the NVIDIA GB10 Grace Blackwell Superchip with 128GB of unified memory, and that memory architecture is a big reason it’s compelling for agentic workflows. In a unified-memory design, the CPU and GPU share a single, coherent memory pool instead of juggling separate “system RAM” and “GPU VRAM” with copies in between.

That is an advantage over traditional workstations because real agents are not single-prompt demos. They accumulate large system prompts and retrieved documents, evaluate tool schemas and outputs, form intermediate plans, and solve multi-step problems. With a single pool, the agent can keep more of its working set in one place and pay fewer penalties for moving data around.

With 128GB of unified memory, Spark makes it easier to run larger local models and sustain longer multi-step chains without constantly trimming context or hitting memory cliffs. That helps keep the action loop intact: consistent tool calls, stable context retention, and reliable follow-through over many hops. More on hops later.

- Unified memory pool: CPU + GPU share one coherent pool (less “RAM vs VRAM juggling”).
- Bigger working set: more room for model weights, KV cache, system prompts, retrieved docs, and tool outputs.
- Fewer memory cliffs: avoids “barely fits” behavior that can spike latency mid-chain.
- More reliable multi-step runs: helps sustain longer tool-heavy workflows without constantly trimming context.
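
To make the working-set point concrete, here is a rough back-of-the-envelope sketch of how model weights and KV cache add up. Every number in it (bytes per weight, layer count, KV heads, head dimension, context length) is a hypothetical placeholder rather than a measured property of any model in this post, and the KV formula assumes a standard attention cache, which hybrid architectures may not follow.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelShape:
    """Hypothetical model dimensions, not taken from any real model card."""
    params_billion: float
    bytes_per_weight: float  # assumed: 0.5 roughly corresponds to 4-bit weights
    layers: int
    kv_heads: int
    head_dim: int
    kv_bytes: int = 2        # assumed: 16-bit KV cache entries


def weights_gib(m: ModelShape) -> float:
    # Counts every parameter: MoE models typically keep all experts resident,
    # even though only a subset is active per token.
    return m.params_billion * 1e9 * m.bytes_per_weight / GIB


def kv_cache_gib(m: ModelShape, context_tokens: int) -> float:
    # One key and one value vector (factor of 2) per layer, per KV head, per token.
    bytes_per_token = 2 * m.layers * m.kv_heads * m.head_dim * m.kv_bytes
    return bytes_per_token * context_tokens / GIB


def working_set_gib(m: ModelShape, context_tokens: int) -> float:
    return weights_gib(m) + kv_cache_gib(m, context_tokens)


example = ModelShape(params_billion=120, bytes_per_weight=0.5,
                     layers=80, kv_heads=8, head_dim=128)
for tokens in (8_192, 131_072):
    print(f"{tokens:>7} tokens: {working_set_gib(example, tokens):6.1f} GiB")
```

The useful takeaway is the shape of the growth: weights are a fixed cost, while the KV cache scales linearly with context length, which is where long tool-heavy chains consume headroom.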

OpenClaw Benchmark Setup (and Why It’s Different From Production)

Most local LLM benchmarks measure generation quality or speed, but they don’t tell you whether a model can behave like a reliable agent. Agent workflows involve repeated tool calls. To measure those behaviors, we built a reproducible benchmark harness around Ollama + OpenClaw that runs structured agent tests and records pass/fail behavior, latency, throughput, and multi-hop chain depth. You can view our benchmark here: https://github.com/Exxact-Software/local-agent-benchmark
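
As an illustration of what "records pass/fail behavior, latency, throughput, and multi-hop chain depth" could look like in practice, here is a minimal sketch of a per-model result record. The class and field names are assumptions made for this example and are not the schema used in the benchmark repository.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CaseResult:
    test_id: str      # e.g. "T13"
    passed: bool
    latency_s: float


@dataclass
class ModelRun:
    model: str
    hop_depth: int
    avg_tok_s: float
    cases: List[CaseResult] = field(default_factory=list)

    def score(self) -> str:
        # Formats the structured-suite result the way the tables do, e.g. "16/17".
        passed = sum(1 for c in self.cases if c.passed)
        return f"{passed}/{len(self.cases)}"

    def failed_ids(self) -> List[str]:
        return [c.test_id for c in self.cases if not c.passed]


# Made-up run: 17 cases with a single failure on T13.
run = ModelRun(
    model="example-model",
    hop_depth=4,
    avg_tok_s=50.0,
    cases=[CaseResult(f"T{i}", passed=(i != 13), latency_s=1.0) for i in range(1, 18)],
)
print(run.score(), run.failed_ids())
```

Keeping the failed test IDs next to the score is what lets a note like "flaky on T13" be traced back to a specific case.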

The best benchmark model is not automatically the best OpenClaw runtime model. The benchmark talks directly to Ollama for controlled measurement, while OpenClaw is the actual agent runtime and carries a larger system prompt and more runtime context. That changes the practical model choice: a 120B-class model can look strongest in isolation but still be a poor fit once limited context headroom and higher latency are taken into account.

Our OpenClaw testing confirms this: nemotron-3-super:120b-a12b was the strongest benchmark profile, but it wasn’t the practical OpenClaw choice. nemotron-3-nano:30b fit the runtime better by leaving more usable context and responding fast enough for interactive agent work.
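
The headroom argument comes down to simple arithmetic, sketched below. The context window and token counts are hypothetical and chosen only to show the effect; they are not measurements of OpenClaw or of the benchmark prompts.

```python
def usable_context(context_window: int, system_prompt: int,
                   tool_schemas: int, reserved_output: int) -> int:
    """Tokens left for conversation history, retrieved docs, and tool results."""
    overhead = system_prompt + tool_schemas + reserved_output
    return max(context_window - overhead, 0)


# Hypothetical token counts: a lean benchmark prompt vs. a heavier agent runtime.
WINDOW = 32_768
direct = usable_context(WINDOW, system_prompt=500, tool_schemas=1_500, reserved_output=2_048)
runtime = usable_context(WINDOW, system_prompt=12_000, tool_schemas=6_000, reserved_output=2_048)
print(f"direct benchmark: {direct} tokens free, agent runtime: {runtime} tokens free")
```

With the same model and window, the runtime case has far less room left for tool results, which is why a model that looks strongest when called directly can still be the wrong runtime choice.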

Models We Tested

For the DGX Spark reference run, we tested a set of open models chosen to cover a useful range. We included large “Spark-native” tests (e.g., nemotron-3-super:120b-a12b), fast MoE baselines (e.g., qwen3.5:35b-a3b), dense reference points (e.g., qwen3.5:27b), a memory-stress case (qwen3.5:122b-a10b), and smaller/faster models (Nemotron + Gemma variants) to see how often speed trades off against tool discipline.

Architecture matters because parameter count alone is incomplete: dense models use the full network each token (straightforward but cost scales directly), while MoE models route each token through a smaller active subset (often better capability/speed tradeoffs). The benchmark keeps both so we can compare the real agent tradeoffs: speed, memory pressure, context headroom, and reliability.
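
Several tags in this set encode both sizes, as in 120b-a12b. The sketch below assumes the "-aNb" suffix denotes active parameters per token, which is a common naming convention rather than something the tags guarantee. A tag without the suffix is not necessarily dense, so treat a missing value as "unknown", not "dense".

```python
import re
from typing import Optional, Tuple

# Matches ":<total>b" optionally followed by "-a<active>b" at the end of a tag.
_TAG = re.compile(r":(\d+(?:\.\d+)?)b(?:-a(\d+(?:\.\d+)?)b)?$")


def parse_sizes(tag: str) -> Optional[Tuple[float, Optional[float]]]:
    """Return (total_billion, active_billion or None), or None if the tag has no plain size."""
    m = _TAG.search(tag)
    if m is None:
        return None
    total = float(m.group(1))
    active = float(m.group(2)) if m.group(2) else None
    return total, active


def active_fraction(tag: str) -> Optional[float]:
    sizes = parse_sizes(tag)
    if sizes is None or sizes[1] is None:
        return None
    return sizes[1] / sizes[0]


for tag in ("nemotron-3-super:120b-a12b", "qwen3.5:35b-a3b", "qwen3.5:27b", "gemma4:e4b"):
    print(tag, parse_sizes(tag), active_fraction(tag))
```

The active fraction is a rough proxy for per-token compute, while the total still determines how much memory the weights occupy.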

Model	Parameters	Type	Why It’s In The Set
nemotron-3-nano:4b	4 billion	Dense	Small baseline, fast sanity-check model
gemma4:e4b	4 billion	Dense	Small/fast Gemma baseline
gemma4:26b	26 billion	MoE	Practical mid-size local-agent candidate
qwen3.5:27b	27 billion	Dense	Larger dense comparison point
gemma4:31b	31 billion	Dense	Larger Gemma comparison point
nemotron-3-nano:30b	30 billion	MoE	Mid-sized speed-oriented model with agent potential
qwen3.5:35b-a3b	35 billion	MoE	Strong local MoE baseline
nemotron-3-super:120b-a12b	120 billion	MoE	Large flagship model and the most interesting Spark-native candidate
qwen3.5:122b-a10b	122 billion	MoE	Large high-capability model that stresses memory assumptions

DGX Spark OpenClaw Benchmark Results

Model	Type	T1–T17	Hop Depth	Avg tok/s	Notes
Nemotron 3
nemotron-3-nano:4b	Dense	16/17	4	64.2	Fast, but one noisy full-run miss
nemotron-3-nano:30b	MoE	15/17	5	64.7	Fast, but flaky on T13
nemotron-3-super:120b-a12b	MoE	17/17	6	16.4	Strongest overall agent profile in this Spark run

Qwen 3.5
qwen3.5:27b	Dense	17/17	5	10.4	Clean and reliable but slow
qwen3.5:35b-a3b	MoE	17/17	4	48.2	Clean, reliable, fast
qwen3.5:122b-a10b	MoE	17/17	3	20.1	Clean after T11 timeout fix

Gemma 4
gemma4:e4b	Dense	17/17	2	52.6	Fast and clean, but shallow multi-hop behavior
gemma4:26b	MoE	17/17	4	52.7	Best Gemma result; strong speed/reliability balance
gemma4:31b	Dense	17/17	2	9.7	Clean, but slow and shallow compared with the others

How To Read This Benchmark

The result tables in this post are not meant to be read like a normal model leaderboard. A single score does not tell the whole story.

T1–T17 is the structured agent test suite. It checks whether the model can call the right tools, pass the right arguments, handle malformed outputs, follow formatting rules, resist prompt injection, and track simple state changes. A clean 17/17 is the best score.
- T1–T4 Basic tool calling: Correct tool choice; correct arguments; doesn’t hallucinate tools
- T5–T7 Parallel tool calls: Multiple calls in one turn; result attribution; conflict handling
- T8–T11 Stress inputs: 404s; malformed/partial tool outputs; timeouts; very large payloads
- T12–T15 Instruction adherence: One-call constraints; JSON-only discipline; prompt injection resistance; instruction conflict handling
- T16–T17 Edge cases: Knowing when not to call a tool; state mutation across operations
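
The basic tool-calling checks (right tool, right arguments, no hallucinated tools) can be sketched as a small validator. The tool names come from the hop-chain example in this post, but the argument names and the schema layout are assumptions for illustration and are not the benchmark's actual scoring code.

```python
from typing import Dict, List

# Hypothetical tool registry: required and optional arguments with expected types.
TOOLS: Dict[str, dict] = {
    "search_places": {"required": {"query": str}, "optional": {"limit": int}},
    "get_place_details": {"required": {"place_id": str}, "optional": {}},
}


def validate_tool_call(call: dict, tools: Dict[str, dict] = TOOLS) -> List[str]:
    """Return a list of problems; an empty list means the call respects the contract."""
    name = call.get("name")
    if name not in tools:
        return [f"unknown tool: {name!r}"]  # hallucinated tool
    args = call.get("arguments")
    if not isinstance(args, dict):
        return ["arguments must be an object"]

    spec = tools[name]
    allowed = {**spec["required"], **spec["optional"]}
    errors = [f"missing required argument: {key}"
              for key in spec["required"] if key not in args]
    for key, value in args.items():
        if key not in allowed:
            errors.append(f"unexpected argument: {key}")  # invented argument
        elif not isinstance(value, allowed[key]):
            errors.append(f"wrong type for {key}: expected {allowed[key].__name__}")
    return errors


print(validate_tool_call({"name": "search_places", "arguments": {"query": "coffee"}}))
print(validate_tool_call({"name": "book_table", "arguments": {}}))
print(validate_tool_call({"name": "search_places", "arguments": {"limit": "5", "radius": 2}}))
```

A strict validator like this is also a reasonable guardrail in production, since it turns a silent contract violation into an explicit error the runtime can retry on.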
Hop Depth is the escalating chain test. Each tool result becomes input to the next step, so the model has to preserve context and keep the task moving. This is one of the most important numbers for agent work because real agents rarely stop after one clean tool call. A chain with a hop depth of 4 might look like:
- search_places → get_place_details → get_directions → search_parking
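
One way to think about measuring hop depth is to count how many consecutive steps of an expected chain the model gets right before it drifts. The sketch below uses stub tools and a scripted stand-in for the model. The scoring rule is an assumption for illustration and may differ from how the benchmark escalates and scores its chains.

```python
from typing import Callable, Dict, List, Optional

EXPECTED_CHAIN = ["search_places", "get_place_details", "get_directions", "search_parking"]

# Stub tools: each consumes a field from the previous result, so a hop only
# works if the earlier output was carried forward.
FAKE_TOOLS: Dict[str, Callable[[Optional[dict]], dict]] = {
    "search_places": lambda prev: {"place_id": "p1"},
    "get_place_details": lambda prev: {"address": f"address-of-{prev['place_id']}"},
    "get_directions": lambda prev: {"route": f"route-to-{prev['address']}"},
    "search_parking": lambda prev: {"lot": f"lot-near-{prev['route']}"},
}


def measure_hop_depth(choose_tool: Callable[[Optional[dict]], Optional[str]],
                      chain: List[str] = EXPECTED_CHAIN,
                      tools: Dict[str, Callable] = FAKE_TOOLS) -> int:
    context = None
    depth = 0
    for expected in chain:
        if choose_tool(context) != expected:
            break  # the model drifted; the chain stops here
        context = tools[expected](context)
        depth += 1
    return depth


def scripted(plan: List[str]) -> Callable[[Optional[dict]], Optional[str]]:
    """Stand-in for a model: replays a fixed list of tool choices."""
    calls = iter(plan)
    return lambda context: next(calls, None)


print(measure_hop_depth(scripted(EXPECTED_CHAIN)))
print(measure_hop_depth(scripted(["search_places", "get_place_details", "search_places"])))
```

In a real harness, `choose_tool` would wrap a model call that sees the previous tool result, so the measured depth reflects how long the model keeps the thread.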
Avg tok/s is generation speed. Ideally we want more than 20 tok/s, and 40+ tok/s feels the most responsive. Speed matters, especially for an always-on local agent, but it is not the same thing as reliability. A fast model that breaks the tool contract can still be the wrong choice for an agent.
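
Ollama's generate and chat responses report `eval_count` (generated tokens) and `eval_duration` (nanoseconds), so tokens per second can be derived from them. The sketch below works on response dictionaries you have already collected. The token-weighted averaging and the band labels are assumptions for illustration, not necessarily how the benchmark computes its Avg tok/s column.

```python
from typing import Iterable


def tokens_per_second(response: dict) -> float:
    """Generation speed for one response, from Ollama's eval_count / eval_duration (ns)."""
    duration_ns = response["eval_duration"]
    if duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return response["eval_count"] / (duration_ns / 1e9)


def average_tok_s(responses: Iterable[dict]) -> float:
    # Token-weighted average: total tokens over total generation time.
    responses = list(responses)
    total_tokens = sum(r["eval_count"] for r in responses)
    total_ns = sum(r["eval_duration"] for r in responses)
    if total_ns <= 0:
        raise ValueError("no generation time recorded")
    return total_tokens / (total_ns / 1e9)


def responsiveness(tok_s: float) -> str:
    # Bands follow the 20 and 40 tok/s thresholds in this post; the labels are ours.
    if tok_s >= 40:
        return "responsive"
    if tok_s > 20:
        return "usable"
    return "slow"


sample = [
    {"eval_count": 256, "eval_duration": 4_000_000_000},
    {"eval_count": 100, "eval_duration": 1_000_000_000},
]
print(tokens_per_second(sample[0]), average_tok_s(sample), responsiveness(average_tok_s(sample)))
```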

The best result is the model that balances all three: clean structured-test behavior, deep multi-hop stability, and usable speed.

- The strongest Spark-side model in this run was nemotron-3-super:120b-a12b, with a clean 17/17, the best hop depth, and the strongest agentic profile across the suite.
- The Qwen family also held up very well, especially after the benchmark prompt/scoring track stabilized and the T11 timeout path was handled fairly.
- All three Gemma models reached a clean 17/17 on T1–T17 and handled T11 (timeouts) well.
- gemma4:26b was the strongest Gemma result, with a hop depth of 4 at 52.7 tok/s.
- gemma4:e4b was almost as fast (52.6 tok/s), but stopped after 2 hops.
- gemma4:31b was clean but much slower (9.7 tok/s), and also stopped after 2 hops.

That makes gemma4:26b the surprise practical result. It did not beat nemotron-3-super:120b-a12b on agent depth, but it matched qwen3.5:35b-a3b on hop depth (4) with higher throughput (52.7 vs 48.2 tok/s).

Speed vs Reliability

One of the most important conclusions is that the fastest models were not automatically the best agent models. For example, nemotron-3-nano:4b and nemotron-3-nano:30b were very fast, but the remaining benchmark issues clustered in the smaller Nemotron models. gemma4:e4b and gemma4:26b were also fast, but only gemma4:26b combined that speed with a stronger multi-hop result.

That creates a real tradeoff:

- Smaller models can be operationally attractive for speed.
- Larger models can be much more trustworthy when the task becomes agentic.
- Mid-size models can be the sweet spot when they keep the tool contract clean without giving up too much latency.

For an always-on agent, reliability is often more valuable than raw tokens per second.

That is one of the core points of the benchmark.

The Gemma results sharpen that point. gemma4:31b is larger than gemma4:26b, but it was clearly not better in this benchmark: it was slower (9.7 vs 52.7 tok/s) and stopped earlier in hop depth (2 vs 4). The best Gemma result came from the model with the best balance, not from the largest model that fits in DGX Spark memory.

Why This Matters for Local Agents

An always-on agent needs to do boring things reliably. It needs to call the right tool, pass the right arguments, wait for the result, read that result correctly, and decide what to do next. It needs to keep doing that even when the tool returns partial data, a timeout, a giant payload, or text that tries to hijack the instruction hierarchy.

That is why this benchmark rewards consistency. A model that is fast but casually breaks JSON, invents tool arguments, or loses the thread after two steps is harder to trust in a real agent runtime. It may still be useful for narrow tasks, but it needs guardrails, retries, or a smaller responsibility surface.
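
As an example of the guardrails and retries mentioned above, here is a minimal sketch of a JSON-only retry wrapper. The `generate` callable is a placeholder for whatever function sends a prompt to your model and returns its text. It is not an Ollama or OpenClaw API, and the corrective wording appended to the prompt is an assumption.

```python
import json
from typing import Callable, List, Tuple


def call_with_json_retry(generate: Callable[[str], str], prompt: str,
                         max_attempts: int = 3) -> Tuple[dict, int]:
    """Return (parsed_object, attempts_used), or raise if no valid JSON object is produced."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        raw = generate(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc
            # Feed the parse error back so the next attempt can correct itself.
            prompt = (f"{prompt}\n\nYour previous reply was not valid JSON "
                      f"({exc.msg}). Reply with a JSON object only.")
            continue
        if isinstance(parsed, dict):
            return parsed, attempt
        last_error = ValueError("expected a JSON object")
    raise RuntimeError(f"no valid JSON object after {max_attempts} attempts") from last_error


def scripted(replies: List[str]) -> Callable[[str], str]:
    """Stand-in for a model: returns canned replies in order."""
    it = iter(replies)
    return lambda prompt: next(it)


flaky = scripted(["Sure! Here you go: {", '{"tool": "search_places"}'])
print(call_with_json_retry(flaky, "Pick a tool and reply in JSON."))
```

Tracking the attempt count matters: a model that needs retries on every call may pass eventually, but it costs latency and is a signal to narrow its responsibilities.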

The strongest models in this run point to different local-agent profiles:

- nemotron-3-super:120b-a12b is the strongest deeper-agent candidate. It was slower than the small models, but it had the best hop depth and a clean structured-test run.
- gemma4:26b is the practical surprise. It was fast, clean, and reached the same hop depth as qwen3.5:35b-a3b.
- qwen3.5:35b-a3b remains a strong fast MoE baseline, with clean structured behavior and solid hop depth.

That is the real Spark story. The hardware opens the door to larger local models, but the benchmark helps decide which of those models are actually useful inside an agent. The goal is not to run the biggest model for its own sake. The goal is to run the model that can keep the action loop intact.

Choosing a Model for a Local Agent

This section connects the benchmark back to OpenClaw and real deployments.

In practice, pick the most reliable model you can run comfortably, and optimize for tool correctness + multi-hop stability before raw tokens/sec. Spark-class memory helps when your agent needs long context, many tool calls, and consistent follow-through.
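
The "reliability first, then speed" advice can be expressed as a simple shortlist rule over the results table. The numbers below are copied from the DGX Spark results in this post. The rule itself (require a clean 17/17, apply a minimum tok/s, then rank by hop depth and speed) is one possible policy, and it deliberately ignores runtime context headroom, which is why it will not reproduce the OpenClaw runtime pick.

```python
from typing import List, Tuple

# (model, tests passed out of 17, hop depth, avg tok/s) from the results table.
RESULTS: List[Tuple[str, int, int, float]] = [
    ("nemotron-3-nano:4b", 16, 4, 64.2),
    ("nemotron-3-nano:30b", 15, 5, 64.7),
    ("nemotron-3-super:120b-a12b", 17, 6, 16.4),
    ("qwen3.5:27b", 17, 5, 10.4),
    ("qwen3.5:35b-a3b", 17, 4, 48.2),
    ("qwen3.5:122b-a10b", 17, 3, 20.1),
    ("gemma4:e4b", 17, 2, 52.6),
    ("gemma4:26b", 17, 4, 52.7),
    ("gemma4:31b", 17, 2, 9.7),
]


def shortlist(results, min_tok_s: float = 20.0, min_passed: int = 17):
    """Keep clean, fast-enough models; rank by hop depth, then by speed."""
    eligible = [r for r in results if r[1] >= min_passed and r[3] >= min_tok_s]
    return sorted(eligible, key=lambda r: (r[2], r[3]), reverse=True)


print([name for name, *_ in shortlist(RESULTS)])
print(shortlist(RESULTS, min_tok_s=0.0)[0][0])
```

With the speed floor in place, gemma4:26b ranks first. Dropping the floor puts nemotron-3-super:120b-a12b on top, which mirrors the split between the practical pick and the strongest overall profile.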

Use case	Best fit from these runs	LLM Type
Strongest overall agent profile	nemotron-3-super:120b-a12b	MoE
Best OpenClaw runtime fit	nemotron-3-nano:30b	MoE
Best speed/reliability surprise	gemma4:26b	MoE
Strong fast MoE baseline	qwen3.5:35b-a3b	MoE
Fast experiments with more caution	nemotron-3-nano:4b, nemotron-3-nano:30b, gemma4:e4b	Dense (4b, e4b), MoE (30b)

That distinction is important. nemotron-3-super:120b-a12b is the strongest model in the benchmark table, but nemotron-3-nano:30b is the better OpenClaw fit because the runtime itself consumes a large amount of context. In real agent deployments, the usable context window matters as much as the model’s isolated benchmark score.

This is also the bridge into the next stage of the project:

- OpenClaw as the active runtime
- NemoClaw and related hardening work for production readiness
- A cleaned public benchmark repository so developers can reproduce and extend the test suite

Conclusion

DGX Spark makes local agent benchmarking practical by providing enough memory headroom for larger models and longer tool-heavy contexts.

In this run, nemotron-3-super:120b-a12b delivered the strongest overall agent profile, while gemma4:26b stood out as the best speed/reliability balance. The takeaway is simple: for local agents, choose models based on tool reliability and multi-hop stability, not just tokens per second.

Exxact is a solution integration partner with NVIDIA and offers a wide range of solutions featuring NVIDIA products. Get a quote on DGX Spark, DGX Station, or configure an Exxact workstation/server featuring NVIDIA GPUs today. Unsure of what you need? Talk to our engineers for hardware recommendations!
