← ai-field-notes

Upcoming — this is a proposed abstract, not a published piece.

Watching the GPU — DCGM, Prometheus, and a Local Grafana for the Spark

A planned setup of DCGM Exporter → Prometheus → Grafana entirely on the Spark itself. The goal is a single dashboard that tells the truth about GPU memory, SM occupancy, and per-container utilization for a rig that's running NIMs, pgvector, and an occasional training job at the same time.

What this article will answer

The Spark’s unified memory is shared CPU+GPU, which means nvidia-smi alone doesn’t tell the whole story. This piece wires up a proper telemetry stack so the next OOM landmine gets caught by a dashboard, not a hard-hung box.

NVIDIA technologies to be covered

  • DCGM Exporter — the NVIDIA-maintained Prometheus exporter; which metrics are worth scraping on GB10, which are noise.
  • nvidia-smi dmon — fast sanity-check loop when Grafana is down.
  • Prometheus — local scrape config, retention sized for a personal rig.
  • Grafana dashboards — one pane for memory (unified budget + per-container attribution), one for SM occupancy, one for power and thermals.
  • Kernel-level hooks — where nvidia-smi stops being enough and Nsight Systems takes over (forward reference to the Nsight article).

What I expect to find

The interesting metric is not peak GPU utilization — it’s the shape of memory over time as NIMs warm up and pgvector indexes churn. The article closes with a screenshot of the dashboard the day the Spark hung on 2026-04-22, so the visual becomes a diagnostic template for next time.

Where it sits in the arc

Cross-cutting. Useful for all three application tracks (Second Brain, LLM Wiki, Autoresearch) — whichever is running at the moment, the dashboard is the same.