
Introduction
Running an open LLM locally now has an extremely low barrier to entry with Ollama. The real question is whether a local model can behave like a reliable agent: call tools correctly, pass valid arguments, handle messy outputs, and finish multi-step tasks without drifting.
To measure that, we built a reproducible benchmark suite around Ollama + OpenClaw and ran it on NVIDIA DGX Spark. This post covers what we tested, how we scored agent behavior, and which models held up best in multi-step tool workflows.
TL;DR (What this post answers)
- This benchmark measures agent behavior (tool calling, argument discipline, injection resistance, and multi-step chaining), not just tokens/sec.
- Tested 3 models each from the Qwen, Gemma, and Nemotron model families.
- In our DGX Spark runs, larger “agent-oriented” models tended to be more reliable over multi-step chains.
- One mid-size model stood out for speed + clean agent behavior, making it a practical local-agent default if you don’t want 100B-class latency.
- If you’re choosing a local agent model, reliability and hop depth matter more than raw throughput.
Why DGX Spark Fits This Workload
DGX Spark is a compact desktop system built on the Grace Blackwell GB10 with 128GB of unified memory—and that memory architecture is a big reason it’s compelling for agentic workflows. In a unified-memory design, the CPU and GPU share a single, coherent memory pool instead of juggling separate “system RAM” and “GPU VRAM” with copies in between.
This is the advantage over traditional workstations because real agents are not single-prompt demos. They accumulate large system prompts, retrieved documents, evaluate tool schemas and outputs, determine intermediate plans, and solve multi-step problems. That means your agent can keep more of its working set in one place and access it with fewer “move data around” penalties.
With 128GB of unified memory headroom, Spark makes it easier to run larger local models and sustain longer multi-step chains without constantly trimming context or hitting memory cliffs, ideal for keeping the action loop intact with consistent tool calls, stable context retention, and reliable follow-through over many hops. More on hops later.
- Unified memory pool: CPU + GPU share one coherent pool (less “RAM vs VRAM juggling”).
- Bigger working set: more room for model weights, KV cache, system prompts, retrieved docs, and tool outputs.
- Fewer memory cliffs: avoids “barely fits” behavior that can spike latency mid-chain.
- More reliable multi-step runs: helps sustain longer tool-heavy workflows without constantly trimming context.
OpenClaw Benchmark Setup (and Why It’s Different From Production)
Most local LLM benchmarks measure generation quality or speed, but they don’t tell you whether a model can behave like a reliable agent. Agent workflows involve repeated tool calls. To measure those behaviors, we built a reproducible benchmark harness around Ollama + OpenClaw that runs structured agent tests and records pass/fail behavior, latency, throughput, and multi-hop chain depth. You can view our benchmark here: https://github.com/Exxact-Software/local-agent-benchmark
The best benchmark model is not automatically the best OpenClaw runtime model. The benchmark talks directly to Ollama for controlled measurement, while OpenClaw is the actual agent runtime and carries a larger system prompt and more runtime context. That changes the practical model choice: a 120B-class model can look strongest in isolation but still be a poor fit due to usable headroom and latency.
Our OpenClaw testing confirms this: nemotron-3-super:120b-a12b was the strongest benchmark profile, but it wasn’t the practical OpenClaw choice. nemotron-3-nano:30b fit the runtime better by leaving more usable context and responding fast enough for interactive agent work.
Models We Tested
For the DGX Spark reference run, we tested a set of open models chosen to cover a useful range. We included large “Spark-native” tests (e.g., nemotron-3-super:120b-a12b), fast MoE baselines (e.g., qwen3.5:35b-a3b), dense reference points (e.g., qwen3.5:27b), a memory-stress case (qwen3.5:122b-a10b), and smaller/faster models (Nemotron + Gemma variants) to see how often speed trades off against tool discipline.
Architecture matters because parameter count alone is incomplete: dense models use the full network each token (straightforward but cost scales directly), while MoE models route each token through a smaller active subset (often better capability/speed tradeoffs). The benchmark keeps both so we can compare the real agent tradeoffs: speed, memory pressure, context headroom, and reliability.
| Model | Parameters | Type | Why It’s In The Set |
|---|---|---|---|
| nemotron-3-nano:4b | 4 billion | Dense | Small baseline, fast sanity-check model |
| gemma4:e4b | 4 billion | Dense | Small/fast Gemma baseline |
| gemma4:26b | 26 billion | Dense | Practical mid-size local-agent candidate |
| qwen3.5:27b | 27 billion | Dense | Larger dense comparison point |
| gemma4:31b | 31 billion | Dense | Larger Gemma comparison point |
| nemotron-3-nano:30b | 30 billion | MoE | Mid-sized speed-oriented model with agent potential |
| qwen3.5:35b-a3b | 35 billion | MoE | Strong local MoE baseline |
| nemotron-3-super:120b-a12b | 120 billion | MoE | Large flagship model and the most interesting Spark-native candidate |
| qwen3.5:122b-a10b | 122 billion | MoE | Large high-capability model that stresses memory assumptions |

Accelerate AI Training an NVIDIA DGX Spark
Take enterprise AI compute anywhere. NVIDIA DGX Spark delivers up to 1 petaFLOP in a portable 6" x 6" x 2" form factor. Harness datacenter power in your backpack. Available now through Exxact Corporation.
Get a Quote TodayDGX Spark OpenClaw Benchmark Results
Model | Type | T1-T17 | Hop Depth | Avg tok/s | Notes | ||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NemoTron 3 Nano | |||||||||||||||||||||||||||||
| nemotron-3-nano:4b | Dense | 16/17 | 4 | 64.2 | Fast, but one noisy full-run miss | ||||||||||||||||||||||||
| nemotron-3-nano:30b | Dense | 15/17 | 5 | 64.7 | Fast, but flaky on T13 | ||||||||||||||||||||||||
| nemotron-3-super:120b-a12b | MoE | 17/17 | 6 | 16.4 | Strongest overall agent profile in this Spark run | ||||||||||||||||||||||||
| Qwen 3.5 | |||||||||||||||||||||||||||||
| qwen3.5:27b | Dense | 17/17 | 5 | 10.4 | Clean and reliable but slow | ||||||||||||||||||||||||
| qwen3.5:35b-a3b | Dense | 17/17 | 4 | 48.2 | Clean, reliable, fast | ||||||||||||||||||||||||
| qwen3.5:122b-a10b | Dense | 17/17 | 3 | 20.1 | Clean after T11 timeout fix | ||||||||||||||||||||||||
| Gemma4 | |||||||||||||||||||||||||||||
| gemma4:e4b | Dense | 17/17 | 2 | 52.6 | Fast and clean, but shallow multi-hop behavior | ||||||||||||||||||||||||
| gemma4:26b | MoE | 17/17 | 4 | 52.7 | Best Gemma result; strong speed/reliability balance | ||||||||||||||||||||||||
| gemma4:31b | MoE | 17/17 | 2 | 9.7 | Clean, but slow and shallow compared with the others | ||||||||||||||||||||||||
How To Read This Benchmark
The result tables in this post are not meant to be read like a normal model leaderboard. A single score does not tell the whole story.
- T1–T17 is the structured agent test suite. It checks whether the model can call the right tools, pass the right arguments, handle malformed outputs, follow formatting rules, resist prompt injection, and track simple state changes. A clean 17/17 is the best score.
- T1–T4 Basic tool calling: Correct tool choice; correct arguments; doesn’t hallucinate tools
- T5–T7 Parallel tool calls: Multiple calls in one turn; result attribution; conflict handling
- T8–T11 Stress inputs: 404s; malformed/partial tool outputs; timeouts; very large payloads
- T12–T15 Instruction adherence: One-call constraints; JSON-only discipline; prompt injection resistance; instruction conflict handling
- T16–T17 Edge cases: Knowing when not to call a tool; state mutation across operations
- Hop Depth is the escalating chain test. Each tool result becomes input to the next step, so the model has to preserve context and keep the task moving. This is one of the most important numbers for agent work because real agents rarely stop after one clean tool call. A hop-depth of 4 chain might look like:
- search_places → get_place_details → get_directions → search_parking
- Avg Tok/s is speed. We would ideally like to see faster than 20 Tok/s with 40+ Tok/s being the most responsive. It matters, especially for an always-on local agent, but it is not the same thing as reliability. A fast model that breaks the tool contract can still be the wrong choice for an agent.
The best result is the model that balances all three: an Agentic AI that has clean structured test behavior, deep multi-hop stability, and highly usable speed.
- The strongest Spark-side model in this run was nemotron-3-super:120b-a12b with a clean 17/17, best hop depth, and strongest agentic profile across the suite.
- The Qwen family also held up very well, especially after the benchmark prompt/scoring track stabilized and the T11 timeout path was handled fairly.
- All three Gemma models reached clean 17/17 t-score and handled T-11 (timeouts) well.
- gemma4:26b was the strongest Gemma result, with hop-depth of 4 at 52.7 tok/s
- gemma4:e4b was almost as fast, but stopped after 2 hops
- gemma4:31b was clean but much too slow, and also stopped after 2 hops
That makes gemma4:26b the surprise practical result. It did not beat nemotron-3-super:120b-a12b on agent depth, but it matched qwen3.5:35b-a3b on hop-depth while having higher tok/s speed.
Speed vs Reliability
One of the most important conclusions is that the fastest models were not automatically the best agent models. For example, nemotron-3-nano:4b and nemotron-3-nano:30b were very fast, but the remaining benchmark issues clustered in the smaller Nemotron models. gemma4:e4b and gemma4:26b were also fast, but only gemma4:26b combined that speed with a stronger multi-hop result.
That creates a real tradeoff:
- smaller models can be operationally attractive for speed
- larger models can be much more trustworthy when the task becomes agentic
- mid-size models can be the sweet spot when they keep the tool contract clean without giving up too much latency
For an always-on agent, reliability is often more valuable than raw tokens per second.
That is one of the core points of the benchmark.
The Gemma results sharpen that point. gemma4:31b was larger than gemma4:26b, but it was, by far, not better in this benchmark: it was slower and stopped earlier in hop depth. The best Gemma result came from the model with the best balance, not the largest parameter count within the DGX Spark memory limit.
Why This Matters for Local Agents
An always-on agent needs to do boring things reliably. It needs to call the right tool, pass the right arguments, wait for the result, read that result correctly, and decide what to do next. It needs to keep doing that even when the tool returns partial data, a timeout, a giant payload, or text that tries to hijack the instruction hierarchy.
That is why this benchmark rewards consistency. A model that is fast but casually breaks JSON, invents tool arguments, or loses the thread after two steps is harder to trust in a real agent runtime. It may still be useful for narrow tasks, but it needs guardrails, retries, or a smaller responsibility surface.
The strongest models in this run point to different local-agent profiles:
- nemotron-3-super:120b-a12b is the strongest deeper-agent candidate. It was slower than the small models, but it had the best hop depth and a clean structured-test run.
- gemma4:26b is the practical surprise. It was fast, clean, and reached the same hop-depth as qwen3.5:35b-a3b.
- qwen3.5:35b-a3b remains a strong fast MoE baseline, with clean structured behavior and solid hop depth.
That is the real Spark story. The hardware opens the door to larger local models, but the benchmark helps decide which of those models are actually useful inside an agent. The goal is not to run the biggest model for its own sake. The goal is to run the model that can keep the action loop intact.
Choosing a Model for a Local Agent
This section connects the benchmark back to OpenClaw and real deployments.
In practice, pick the most reliable model you can run comfortably, and optimize for tool correctness + multi-hop stability before raw tokens/sec. Spark-class memory helps when your agent needs long context, many tool calls, and consistent follow-through.
| Use case | Best fit from these runs | LLM Type |
|---|---|---|
| Strongest overall agent profile | nemotron-3-super:120b-a12b | MoE |
| Best OpenClaw runtime fit | nemotron-3-nano:30b | Dense |
| Best speed/reliability surprise | gemma4:26b | MoE |
| Strong fast Dense model | qwen3.5:35b-a3b | Dense |
| Fast experiments with more caution | nemotron-3-nano:4b, nemotron-3-nano:30b, gemma4:e4b | Dense |
That distinction is important. nemotron-3-super:120b-a12b is the strongest model in the benchmark table, but nemotron-3-nano:30b is the better OpenClaw fit because the runtime itself consumes a large amount of context. In real agent deployments, the usable context window matters as much as the model’s isolated benchmark score.
This is also the bridge into the next stage of the project:
- OpenClaw as the active runtime
- NemoClaw and related hardening work for production readiness
- a cleaned public benchmark repository so developers can reproduce and extend the test suite
Conclusion
DGX Spark makes local agent benchmarking practical at scale by providing enough memory headroom for larger models and longer tool-heavy contexts.
In this run, nemotron-3-super:120b-a12b delivered the strongest overall agent profile, while gemma4:26b stood out as the best speed/reliability balance. The takeaway is simple: for local agents, choose models based on tool reliability and multi-hop stability, not just tokens per second.
Exxact is a solution integration partner with NVIDIA and offer a wide range of solutions featuring NVIDIA products. Get a quote on DGX Spark, DGX Station, or configure an Exxact workstation/server featuring NVIDIA GPUs today. Unsure on what you need? Talk to our engineers for hardware recommendations!

Powering Workloads with Configurable Confidence
Power your workloads with confidence and configure an Exxact 2U TensorEX Server equipped with multiple GPUs, storage, networking, and more to build your ideal computing infrastructure.
Configure Now
Benchmarking Local AI Agents on NVIDIA DGX Spark (Ollama + OpenClaw)
Introduction
Running an open LLM locally now has an extremely low barrier to entry with Ollama. The real question is whether a local model can behave like a reliable agent: call tools correctly, pass valid arguments, handle messy outputs, and finish multi-step tasks without drifting.
To measure that, we built a reproducible benchmark suite around Ollama + OpenClaw and ran it on NVIDIA DGX Spark. This post covers what we tested, how we scored agent behavior, and which models held up best in multi-step tool workflows.
TL;DR (What this post answers)
- This benchmark measures agent behavior (tool calling, argument discipline, injection resistance, and multi-step chaining), not just tokens/sec.
- Tested 3 models each from the Qwen, Gemma, and Nemotron model families.
- In our DGX Spark runs, larger “agent-oriented” models tended to be more reliable over multi-step chains.
- One mid-size model stood out for speed + clean agent behavior, making it a practical local-agent default if you don’t want 100B-class latency.
- If you’re choosing a local agent model, reliability and hop depth matter more than raw throughput.
Why DGX Spark Fits This Workload
DGX Spark is a compact desktop system built on the Grace Blackwell GB10 with 128GB of unified memory—and that memory architecture is a big reason it’s compelling for agentic workflows. In a unified-memory design, the CPU and GPU share a single, coherent memory pool instead of juggling separate “system RAM” and “GPU VRAM” with copies in between.
This is the advantage over traditional workstations because real agents are not single-prompt demos. They accumulate large system prompts, retrieved documents, evaluate tool schemas and outputs, determine intermediate plans, and solve multi-step problems. That means your agent can keep more of its working set in one place and access it with fewer “move data around” penalties.
With 128GB of unified memory headroom, Spark makes it easier to run larger local models and sustain longer multi-step chains without constantly trimming context or hitting memory cliffs, ideal for keeping the action loop intact with consistent tool calls, stable context retention, and reliable follow-through over many hops. More on hops later.
- Unified memory pool: CPU + GPU share one coherent pool (less “RAM vs VRAM juggling”).
- Bigger working set: more room for model weights, KV cache, system prompts, retrieved docs, and tool outputs.
- Fewer memory cliffs: avoids “barely fits” behavior that can spike latency mid-chain.
- More reliable multi-step runs: helps sustain longer tool-heavy workflows without constantly trimming context.
OpenClaw Benchmark Setup (and Why It’s Different From Production)
Most local LLM benchmarks measure generation quality or speed, but they don’t tell you whether a model can behave like a reliable agent. Agent workflows involve repeated tool calls. To measure those behaviors, we built a reproducible benchmark harness around Ollama + OpenClaw that runs structured agent tests and records pass/fail behavior, latency, throughput, and multi-hop chain depth. You can view our benchmark here: https://github.com/Exxact-Software/local-agent-benchmark
The best benchmark model is not automatically the best OpenClaw runtime model. The benchmark talks directly to Ollama for controlled measurement, while OpenClaw is the actual agent runtime and carries a larger system prompt and more runtime context. That changes the practical model choice: a 120B-class model can look strongest in isolation but still be a poor fit due to usable headroom and latency.
Our OpenClaw testing confirms this: nemotron-3-super:120b-a12b was the strongest benchmark profile, but it wasn’t the practical OpenClaw choice. nemotron-3-nano:30b fit the runtime better by leaving more usable context and responding fast enough for interactive agent work.
Models We Tested
For the DGX Spark reference run, we tested a set of open models chosen to cover a useful range. We included large “Spark-native” tests (e.g., nemotron-3-super:120b-a12b), fast MoE baselines (e.g., qwen3.5:35b-a3b), dense reference points (e.g., qwen3.5:27b), a memory-stress case (qwen3.5:122b-a10b), and smaller/faster models (Nemotron + Gemma variants) to see how often speed trades off against tool discipline.
Architecture matters because parameter count alone is incomplete: dense models use the full network each token (straightforward but cost scales directly), while MoE models route each token through a smaller active subset (often better capability/speed tradeoffs). The benchmark keeps both so we can compare the real agent tradeoffs: speed, memory pressure, context headroom, and reliability.
| Model | Parameters | Type | Why It’s In The Set |
|---|---|---|---|
| nemotron-3-nano:4b | 4 billion | Dense | Small baseline, fast sanity-check model |
| gemma4:e4b | 4 billion | Dense | Small/fast Gemma baseline |
| gemma4:26b | 26 billion | Dense | Practical mid-size local-agent candidate |
| qwen3.5:27b | 27 billion | Dense | Larger dense comparison point |
| gemma4:31b | 31 billion | Dense | Larger Gemma comparison point |
| nemotron-3-nano:30b | 30 billion | MoE | Mid-sized speed-oriented model with agent potential |
| qwen3.5:35b-a3b | 35 billion | MoE | Strong local MoE baseline |
| nemotron-3-super:120b-a12b | 120 billion | MoE | Large flagship model and the most interesting Spark-native candidate |
| qwen3.5:122b-a10b | 122 billion | MoE | Large high-capability model that stresses memory assumptions |

Accelerate AI Training an NVIDIA DGX Spark
Take enterprise AI compute anywhere. NVIDIA DGX Spark delivers up to 1 petaFLOP in a portable 6" x 6" x 2" form factor. Harness datacenter power in your backpack. Available now through Exxact Corporation.
Get a Quote TodayDGX Spark OpenClaw Benchmark Results
Model | Type | T1-T17 | Hop Depth | Avg tok/s | Notes | ||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NemoTron 3 Nano | |||||||||||||||||||||||||||||
| nemotron-3-nano:4b | Dense | 16/17 | 4 | 64.2 | Fast, but one noisy full-run miss | ||||||||||||||||||||||||
| nemotron-3-nano:30b | Dense | 15/17 | 5 | 64.7 | Fast, but flaky on T13 | ||||||||||||||||||||||||
| nemotron-3-super:120b-a12b | MoE | 17/17 | 6 | 16.4 | Strongest overall agent profile in this Spark run | ||||||||||||||||||||||||
| Qwen 3.5 | |||||||||||||||||||||||||||||
| qwen3.5:27b | Dense | 17/17 | 5 | 10.4 | Clean and reliable but slow | ||||||||||||||||||||||||
| qwen3.5:35b-a3b | Dense | 17/17 | 4 | 48.2 | Clean, reliable, fast | ||||||||||||||||||||||||
| qwen3.5:122b-a10b | Dense | 17/17 | 3 | 20.1 | Clean after T11 timeout fix | ||||||||||||||||||||||||
| Gemma4 | |||||||||||||||||||||||||||||
| gemma4:e4b | Dense | 17/17 | 2 | 52.6 | Fast and clean, but shallow multi-hop behavior | ||||||||||||||||||||||||
| gemma4:26b | MoE | 17/17 | 4 | 52.7 | Best Gemma result; strong speed/reliability balance | ||||||||||||||||||||||||
| gemma4:31b | MoE | 17/17 | 2 | 9.7 | Clean, but slow and shallow compared with the others | ||||||||||||||||||||||||
How To Read This Benchmark
The result tables in this post are not meant to be read like a normal model leaderboard. A single score does not tell the whole story.
- T1–T17 is the structured agent test suite. It checks whether the model can call the right tools, pass the right arguments, handle malformed outputs, follow formatting rules, resist prompt injection, and track simple state changes. A clean 17/17 is the best score.
- T1–T4 Basic tool calling: Correct tool choice; correct arguments; doesn’t hallucinate tools
- T5–T7 Parallel tool calls: Multiple calls in one turn; result attribution; conflict handling
- T8–T11 Stress inputs: 404s; malformed/partial tool outputs; timeouts; very large payloads
- T12–T15 Instruction adherence: One-call constraints; JSON-only discipline; prompt injection resistance; instruction conflict handling
- T16–T17 Edge cases: Knowing when not to call a tool; state mutation across operations
- Hop Depth is the escalating chain test. Each tool result becomes input to the next step, so the model has to preserve context and keep the task moving. This is one of the most important numbers for agent work because real agents rarely stop after one clean tool call. A hop-depth of 4 chain might look like:
- search_places → get_place_details → get_directions → search_parking
- Avg Tok/s is speed. We would ideally like to see faster than 20 Tok/s with 40+ Tok/s being the most responsive. It matters, especially for an always-on local agent, but it is not the same thing as reliability. A fast model that breaks the tool contract can still be the wrong choice for an agent.
The best result is the model that balances all three: an Agentic AI that has clean structured test behavior, deep multi-hop stability, and highly usable speed.
- The strongest Spark-side model in this run was nemotron-3-super:120b-a12b with a clean 17/17, best hop depth, and strongest agentic profile across the suite.
- The Qwen family also held up very well, especially after the benchmark prompt/scoring track stabilized and the T11 timeout path was handled fairly.
- All three Gemma models reached clean 17/17 t-score and handled T-11 (timeouts) well.
- gemma4:26b was the strongest Gemma result, with hop-depth of 4 at 52.7 tok/s
- gemma4:e4b was almost as fast, but stopped after 2 hops
- gemma4:31b was clean but much too slow, and also stopped after 2 hops
That makes gemma4:26b the surprise practical result. It did not beat nemotron-3-super:120b-a12b on agent depth, but it matched qwen3.5:35b-a3b on hop-depth while having higher tok/s speed.
Speed vs Reliability
One of the most important conclusions is that the fastest models were not automatically the best agent models. For example, nemotron-3-nano:4b and nemotron-3-nano:30b were very fast, but the remaining benchmark issues clustered in the smaller Nemotron models. gemma4:e4b and gemma4:26b were also fast, but only gemma4:26b combined that speed with a stronger multi-hop result.
That creates a real tradeoff:
- smaller models can be operationally attractive for speed
- larger models can be much more trustworthy when the task becomes agentic
- mid-size models can be the sweet spot when they keep the tool contract clean without giving up too much latency
For an always-on agent, reliability is often more valuable than raw tokens per second.
That is one of the core points of the benchmark.
The Gemma results sharpen that point. gemma4:31b was larger than gemma4:26b, but it was, by far, not better in this benchmark: it was slower and stopped earlier in hop depth. The best Gemma result came from the model with the best balance, not the largest parameter count within the DGX Spark memory limit.
Why This Matters for Local Agents
An always-on agent needs to do boring things reliably. It needs to call the right tool, pass the right arguments, wait for the result, read that result correctly, and decide what to do next. It needs to keep doing that even when the tool returns partial data, a timeout, a giant payload, or text that tries to hijack the instruction hierarchy.
That is why this benchmark rewards consistency. A model that is fast but casually breaks JSON, invents tool arguments, or loses the thread after two steps is harder to trust in a real agent runtime. It may still be useful for narrow tasks, but it needs guardrails, retries, or a smaller responsibility surface.
The strongest models in this run point to different local-agent profiles:
- nemotron-3-super:120b-a12b is the strongest deeper-agent candidate. It was slower than the small models, but it had the best hop depth and a clean structured-test run.
- gemma4:26b is the practical surprise. It was fast, clean, and reached the same hop-depth as qwen3.5:35b-a3b.
- qwen3.5:35b-a3b remains a strong fast MoE baseline, with clean structured behavior and solid hop depth.
That is the real Spark story. The hardware opens the door to larger local models, but the benchmark helps decide which of those models are actually useful inside an agent. The goal is not to run the biggest model for its own sake. The goal is to run the model that can keep the action loop intact.
Choosing a Model for a Local Agent
This section connects the benchmark back to OpenClaw and real deployments.
In practice, pick the most reliable model you can run comfortably, and optimize for tool correctness + multi-hop stability before raw tokens/sec. Spark-class memory helps when your agent needs long context, many tool calls, and consistent follow-through.
| Use case | Best fit from these runs | LLM Type |
|---|---|---|
| Strongest overall agent profile | nemotron-3-super:120b-a12b | MoE |
| Best OpenClaw runtime fit | nemotron-3-nano:30b | Dense |
| Best speed/reliability surprise | gemma4:26b | MoE |
| Strong fast Dense model | qwen3.5:35b-a3b | Dense |
| Fast experiments with more caution | nemotron-3-nano:4b, nemotron-3-nano:30b, gemma4:e4b | Dense |
That distinction is important. nemotron-3-super:120b-a12b is the strongest model in the benchmark table, but nemotron-3-nano:30b is the better OpenClaw fit because the runtime itself consumes a large amount of context. In real agent deployments, the usable context window matters as much as the model’s isolated benchmark score.
This is also the bridge into the next stage of the project:
- OpenClaw as the active runtime
- NemoClaw and related hardening work for production readiness
- a cleaned public benchmark repository so developers can reproduce and extend the test suite
Conclusion
DGX Spark makes local agent benchmarking practical at scale by providing enough memory headroom for larger models and longer tool-heavy contexts.
In this run, nemotron-3-super:120b-a12b delivered the strongest overall agent profile, while gemma4:26b stood out as the best speed/reliability balance. The takeaway is simple: for local agents, choose models based on tool reliability and multi-hop stability, not just tokens per second.
Exxact is a solution integration partner with NVIDIA and offer a wide range of solutions featuring NVIDIA products. Get a quote on DGX Spark, DGX Station, or configure an Exxact workstation/server featuring NVIDIA GPUs today. Unsure on what you need? Talk to our engineers for hardware recommendations!

Powering Workloads with Configurable Confidence
Power your workloads with confidence and configure an Exxact 2U TensorEX Server equipped with multiple GPUs, storage, networking, and more to build your ideal computing infrastructure.
Configure Now