Tracing a NIM Request with Nsight Systems — What the 24.8 tok/s Number Hides

What this article will answer

Headline throughput numbers are a consequence, not a cause. This piece opens the hood on the 8B NIM at inference time and asks, for a single representative request, which kernels own the latency budget and which are rounding error.
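
To make "latency budget" concrete, a rough conversion helps. The sketch below assumes the 24.8 tok/s headline is a single-stream decode rate. That is an assumption: the figure could just as well be an end-to-end average that includes prefill. Under it, each generated token has roughly 40 ms to spend across every kernel and copy in the trace.

```python
# Assumption: 24.8 tok/s is single-stream decode throughput, not an
# end-to-end average that also amortizes prefill.
HEADLINE_TOK_PER_S = 24.8


def per_token_budget_ms(tok_per_s: float) -> float:
    """Convert a tokens-per-second rate into a per-token time budget in ms."""
    if tok_per_s <= 0:
        raise ValueError("throughput must be positive")
    return 1000.0 / tok_per_s


budget_ms = per_token_budget_ms(HEADLINE_TOK_PER_S)
print(f"{budget_ms:.1f} ms per decoded token")
```

Any kernel that accounts for a fraction of a millisecond per token is rounding error against that budget; the trace is there to find the ones that are not.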

NVIDIA technologies to be covered

Nsight Systems — launching a trace against a running NIM container; NVTX ranges; filtering the timeline to the request of interest.
CUDA Toolkit — minimum version requirements; nsys CLI on the Spark host vs inside a container.
Kernel trace interpretation — attention kernels, GEMM tiles, the sampling loop, memcpy H2D/D2H; what’s slow because of the model and what’s slow because of the plumbing.
Nsight Compute — when timeline sampling isn’t enough and you need per-kernel occupancy and achieved memory throughput.
Editor integration — launching captures from VS Code / Cursor without breaking the container’s lifecycle.
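
As a starting point for the capture step, here is a minimal sketch that assembles the nsys command lines without running them. It assumes nsys is available wherever the server process starts and that the server can be launched under `nsys profile`; the outline does not say whether the NIM image allows either. The launch command `python3 serve.py`, the helper names, and the output name `nim_request` are hypothetical placeholders.

```python
import shlex


def build_nsys_profile_cmd(target_cmd, output="nim_request", duration_s=30):
    """Return an argv list that wraps `target_cmd` with `nsys profile`."""
    return [
        "nsys", "profile",
        "--trace=cuda,nvtx",         # CUDA API calls, kernels, memcpys, NVTX ranges
        f"--duration={duration_s}",  # stop collecting after this many seconds
        f"--output={output}",        # nsys writes <output>.nsys-rep
        "--force-overwrite=true",    # replace a report left by an earlier run
        *target_cmd,
    ]


def build_nsys_stats_cmd(report="nim_request.nsys-rep"):
    """Return an argv list that prints a per-kernel GPU time summary."""
    # The report name below is the one used by recent nsys releases;
    # older releases named it differently, so check `nsys stats --help`.
    return ["nsys", "stats", "--report", "cuda_gpu_kern_sum", report]


# Hypothetical launch command; substitute the container's real entrypoint.
server_cmd = ["python3", "serve.py"]
profile_cmd = build_nsys_profile_cmd(server_cmd)

print(shlex.join(profile_cmd))
print(shlex.join(build_nsys_stats_cmd()))
```

Building the argv list separately from executing it keeps the sketch easy to adapt: the same list can be passed to `subprocess.run` on the host or pasted into a `docker exec` invocation, whichever turns out to work with the container.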

What I expect to find

Paged-KV attention will dominate the decode phase. Prefill will be a GEMM wall. The memcpy traffic between host-side request handling (parsing and tokenization) and the GPU will be smaller than feared thanks to unified memory. The piece closes by feeding the trace back into the TRT-LLM article as a prioritization tool: here’s what to try to fix first.
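
One way to check these expectations against a trace is to bucket kernel time by category and compare shares. The sketch below is a hedged illustration, not the article's tooling: the substring rules are guesses at how kernel names might look, and the example rows are invented durations that exist only to exercise the code. Real input would come from a per-kernel summary exported from the trace.

```python
from collections import defaultdict

# (substring, bucket) rules, checked in order. These patterns are
# illustrative guesses; real kernel names in a NIM trace may differ.
BUCKET_RULES = [
    ("memcpy", "memcpy"),
    ("attention", "attention"),
    ("fmha", "attention"),
    ("gemm", "gemm"),
    ("sampl", "sampling"),
    ("topk", "sampling"),
]


def bucket_for(kernel_name):
    """Map a kernel name to a coarse category via substring matching."""
    lowered = kernel_name.lower()
    for needle, bucket in BUCKET_RULES:
        if needle in lowered:
            return bucket
    return "other"


def share_by_bucket(rows):
    """Return {bucket: fraction of total time}, largest share first."""
    totals = defaultdict(int)
    for name, duration_ns in rows:
        totals[bucket_for(name)] += duration_ns
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return {bucket: total / grand_total for bucket, total in ranked}


# Invented (name, duration_ns) rows, NOT measurements from any trace.
example_rows = [
    ("example_paged_attention_kernel", 600),
    ("example_gemm_tile_128x64", 250),
    ("example_topk_sampling_kernel", 100),
    ("[CUDA memcpy HtoD]", 50),
]

shares = share_by_bucket(example_rows)
for bucket, share in shares.items():
    print(f"{bucket:10s} {share:6.1%}")
```

Running this separately on the prefill and decode windows of the timeline would show whether the attention-versus-GEMM split matches the prediction, and whether the memcpy bucket really is as small as hoped.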

Where it sits in the arc

Cross-cutting. Its primary category is dev-tools, but it pairs with the TRT-LLM deployment article: a trace in hand is the right prerequisite for deciding which compile-time knob to turn.