NVIDIA Nemotron Super — The Open-Weight Model That Actually Cares About Throughput
NVIDIA Nemotron Super is a 120B MoE model with 12B active parameters hitting 478 tokens/sec on NIM/TensorRT — a strategic full-stack play, not a benchmark release.
NVIDIA is building the full-stack position for agents: hardware, inference infrastructure, and now the model that ties both together. With Nemotron Super, released March 11 at GTC 2026, NVIDIA isn’t shipping a research artifact; it’s completing a vertical integration play. The hardware is NVIDIA GPUs, the inference stack is NIM/TensorRT, and now there’s an open-weight model explicitly engineered to run fastest on that exact stack. For developers building self-hosted agent infrastructure, this shifts the build-vs-buy calculation in a concrete way: if you’re already on NVIDIA infra, this is the first open-weight model where the throughput argument is clearly compelling at production scale.
TL;DR
- What: NVIDIA released Nemotron Super — 120B total parameters, 12B active (MoE), open-weight under Apache 2.0 — optimized for agent-loop throughput, not benchmark headlines
- Speed: NVIDIA reports 478 tokens/sec on NIM/TensorRT (Artificial Analysis), roughly 1.8× GPT-OSS-120B at 264 tokens/sec
- Signal: First open-weight model explicitly engineered for agent-loop throughput on a vertically integrated inference stack
- Action: If you’re running self-hosted agent loops on NVIDIA infrastructure, benchmark this against your current setup before your next model decision
NVIDIA Nemotron Super — What Happened
NVIDIA released Nemotron Super on March 11 at GTC 2026: a single model with 120B total parameters and 12B active parameters per forward pass, using a hybrid Latent Mixture-of-Experts architecture. NVIDIA reports 478 tokens/sec on NIM/TensorRT (Artificial Analysis benchmarks, H100 hardware). The model ships open-weight under Apache 2.0 and is available on Hugging Face.
Why This Matters
The throughput number is the story — but only if you understand why throughput matters differently for agents than for chat.
In a standard chat application, a user waits for one response. Latency matters, throughput is secondary. In an agent loop, a single task might require 50–200 model calls: planning, tool selection, reflection, correction, output synthesis. At 264 tokens/sec (GPT-OSS-120B baseline), a complex research task that generates 100,000 tokens costs you roughly 6 minutes of pure inference time. At 478 tokens/sec, the same task takes 3.5 minutes. Multiply that across concurrent agents running in parallel, and the operational math changes significantly.
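The back-of-envelope math above is easy to reproduce. A minimal sketch, using the throughput figures from the article and a pure-decode-time assumption (no network, queueing, or tool-execution overhead):

```python
def inference_minutes(total_tokens: int, tokens_per_sec: float) -> float:
    """Pure decode time for a workload; ignores network, queueing, and tool latency."""
    return total_tokens / tokens_per_sec / 60

# A complex research task generating ~100,000 tokens across the whole agent loop
task_tokens = 100_000

baseline = inference_minutes(task_tokens, 264)  # GPT-OSS-120B baseline
nemotron = inference_minutes(task_tokens, 478)  # reported NIM/TensorRT figure

print(f"baseline: {baseline:.1f} min, Nemotron Super: {nemotron:.1f} min")
# baseline: 6.3 min, Nemotron Super: 3.5 min
```

The absolute savings per task look modest; the argument only compounds once you multiply by dozens of concurrent agents, each making 50–200 calls.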
The architecture explains the speed. Nemotron Super isn’t a standard Transformer stack. NVIDIA used a hybrid design combining State-Space Models (SSMs), Transformers, and a latent MoE routing layer. SSMs handle long-context sequence processing more efficiently than pure attention at scale; the latent MoE layer keeps active parameters low — only 12B of the 120B total are active per forward pass — which means less compute per token. The result is a model that’s architecturally unusual — and that unusual architecture is precisely why it can hit these throughput numbers on the same hardware where a denser model would be slower.
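To make the "less compute per token" claim concrete, here is a toy top-k routing sketch. The expert count, expert size, and k are illustrative placeholders (NVIDIA hasn't published the routing details in these terms); the only point is that a router touching a fixed subset of experts keeps per-token compute proportional to active parameters, not total parameters:

```python
import heapq

def route_topk(logits: list[float], k: int) -> list[int]:
    """Toy top-k router: pick the k highest-scoring experts for one token."""
    return heapq.nlargest(k, range(len(logits)), key=lambda i: logits[i])

# Hypothetical shapes chosen so the totals echo the article's 120B/12B split
n_experts = 10
params_per_expert = 12e9   # 10 experts x 12B = 120B total
active_k = 1               # 1 expert active = 12B active per token

scores = [0.1, 0.9, 0.3, 0.2, 0.5, 0.4, 0.0, 0.7, 0.6, 0.8]
active = route_topk(scores, active_k)
active_params = len(active) * params_per_expert

print(f"active fraction: {active_params / (n_experts * params_per_expert):.0%}")
# active fraction: 10%
```

A 10% active fraction means roughly 10% of the dense model's per-token FLOPs, which is where the throughput headroom comes from; the SSM layers help separately, on long-context attention cost.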
The NIM/TensorRT coupling is intentional, not incidental. Nemotron Super runs fastest on NVIDIA’s own inference stack, and that’s not an accident. NVIDIA has spent years optimizing TensorRT for exactly this model class — MoE with sparse activation patterns. You can run Nemotron Super on other inference runtimes, but you won’t see the same numbers. For developers already running NVIDIA infrastructure, that’s a feature. For teams on other stacks, it’s a constraint worth acknowledging before you benchmark.
The honest benchmark picture is more nuanced. SWE-Bench Verified: 60.47% (NVIDIA-reported). Overall intelligence index: 36 points, against a frontier-model average of roughly 57. Qwen 3.5 122B outperforms Nemotron Super on BrowseComp and TAU2-Bench. This is not the most capable open-weight model available. The gap to frontier models like GPT-5.4 or Claude 3.7 Sonnet on complex multi-step reasoning tasks is real and shouldn’t be papered over. Nemotron Super wins on throughput and self-hosted deployment economics, not raw intelligence.
Nemotron Super has no vision or multimodal support. If your agent architecture requires image understanding, document parsing, or screenshot-based tool use, this model doesn’t cover that workload. It’s a text-in, text-out system.
The strategic picture. This release makes most sense when you look at NVIDIA’s full stack: H100/H200 GPUs → NIM/TensorRT → Nemotron Super. Each layer reinforces the others. If agentic workloads become the dominant AI compute category — and there’s good reason to think they will, as covered in the Agentic Infrastructure Stack 2026 — NVIDIA doesn’t just want to sell the chips running those workloads. They want the default model running on those chips to be theirs. That’s a different business than pure hardware. It’s closer to the AWS/Azure playbook: own the infrastructure, provide the optimized services on top, make switching incrementally more expensive.
For developers, the practical question is when this changes your architecture decisions. The open source vs. proprietary model tradeoff has always included deployment cost, data control, and customization, but throughput at this scale is a newer variable. API-first deployments (Anthropic, OpenAI) remain the right call when you need frontier-level reasoning or multimodal support, or when you don’t want to manage infrastructure. Nemotron Super is compelling when you’re running high-volume agent loops, already have NVIDIA hardware, and need predictable per-token economics without vendor API pricing exposure.
Nemotron Super’s 120B parameter footprint requires significant VRAM. NIM/TensorRT with FP8 quantization is the recommended deployment path and the one where the 478 tokens/sec figure was measured. Other runtimes and quantization levels will yield different (typically lower) throughput numbers — benchmark your specific setup before committing infrastructure.
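How much VRAM "significant" means follows from arithmetic: parameter count times bytes per parameter. A minimal weight-only estimate (it deliberately excludes KV cache, activations, and runtime overhead, so treat the numbers as a floor, not a sizing guide):

```python
def weight_vram_gb(total_params_billions: float, bytes_per_param: float) -> float:
    """Weight-only VRAM floor: billions of params x bytes/param ~= GB.
    Excludes KV cache, activations, and runtime overhead."""
    return total_params_billions * bytes_per_param

for name, bpp in [("FP8", 1.0), ("FP16/BF16", 2.0)]:
    print(f"{name}: ~{weight_vram_gb(120, bpp):.0f} GB of weights")
# FP8: ~120 GB of weights
# FP16/BF16: ~240 GB of weights
```

Note the MoE catch: only 12B parameters are active per token, but all 120B must be resident in GPU memory, so sparse activation buys you throughput, not a smaller memory footprint.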
The Take
NVIDIA didn’t release a model. They released the third layer of a vertical stack. That’s worth taking seriously.
The throughput case for Nemotron Super is real and specific: if you’re running self-hosted agent loops on NVIDIA infrastructure, this is the most compelling open-weight option we’ve seen for that exact workload. Not because it’s the smartest model — it isn’t — but because it’s the first open-weight model where architectural choices, inference stack optimization, and hardware alignment were all pointing at the same target from the start.
The risk for developers is the lock-in gradient. The model is open-weight; the performance advantage is not. Run Nemotron Super on non-NVIDIA inference infrastructure and the throughput advantage largely evaporates. That’s a subtle but important distinction: you have model portability, but not full infrastructure portability if you want the headline numbers.
Open weights don’t mean open stack. Know which one you’re actually adopting.
For teams deciding right now: benchmark it against your current self-hosted setup on actual agent workloads — not synthetic benchmarks. The 478 tokens/sec figure is on H100 hardware with NIM/TensorRT. Your numbers will vary. But if the production throughput holds in your environment, the economics case builds quickly.
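When you run that benchmark, aggregate throughput across whole agent runs rather than single completions. A minimal sketch of the aggregation step; the token counts and durations below are placeholder measurements you would collect by timing calls to your own inference endpoint:

```python
import statistics

def throughput_tps(token_counts: list[int], durations_s: list[float]) -> float:
    """Aggregate decode throughput across runs: total tokens / total seconds.
    Aggregating (rather than averaging per-run rates) weights long runs correctly."""
    return sum(token_counts) / sum(durations_s)

# Placeholder measurements from three hypothetical agent-loop runs
tokens = [8_400, 12_100, 9_900]
secs = [19.2, 27.5, 22.4]

print(f"{throughput_tps(tokens, secs):.0f} tok/s aggregate, "
      f"median run: {statistics.median(secs):.1f} s")
```

Measure with your real prompts, tool schemas, and concurrency level; synthetic single-prompt numbers routinely overstate what an agent loop will see in production.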
Related
- Open Source vs. Proprietary AI Models — The tradeoff framework for deciding when self-hosted open models beat API-first deployments
- The Agentic Infrastructure Stack 2026 — How the full agent infrastructure layer is consolidating, and where model choice fits in that picture