The DGX Spark rig on the author's desk — the subject of these field notes.

Field notes on the DGX Spark.

One builder maximising the DGX Spark as a personal AI power user and edge AI rig. Every article is a session transcript turned into a deep-dive essay.

At a glance

14 (+4 upcoming)
Articles
36,561
Words
7,467
Lines of code
6
Models
11
NVIDIA Products
Stages
Foundations 3 · Training 0 (+1 upcoming) · Fine-tuning 3 (+1 upcoming) · Inference 8 · Deployment 1 · Agentic 1 · Observability 1 (+1 upcoming) · Dev-tools 1 (+1 upcoming)
Products & frameworks
DGX Spark 14 · NVIDIA NIM 14 · NeMo Framework 9 · pgvector 9 · NeMo Retriever 8 · TensorRT-LLM 7 · NemoClaw 6 · Triton Inference Server 6 · NeMo Guardrails 3 · Ollama 3 · OpenClaw 2
Models deployed
Llama 3.1 8B Instruct 11 · Nemotron Reranker 1B 5 · Nemotron Super 49B 4 · Llama 3.3 70B Instruct 3 · Nemotron Embed 1B v2 3 · Qwen2.5 3B Instruct 1
Measured on this box
Latency: streams the first token in 80 ms
Throughput: the fp8 engine runs at roughly 25 tok/s
Accuracy: recall@5 = 1.0 on queries where retrieval was perfect
Upcoming training NeMo Framework + Llama 3.1 8B planned ~2 days of wall-clock, one long weekend

Continued Pre-training on a DGX Spark — NeMo Framework Without a Cluster

When does it make sense to continue pre-training on a single GB10 box, and when is it a category error? A planned run that pushes NeMo Framework, Megatron-LM parallelism, and BF16 mixed precision against the 128 GB unified-memory wall with a small domain corpus.

Upcoming fine-tuning NeMo Customizer + Nemotron Nano 9B v2 planned ~4 hours per sweep

LoRA on Nemotron Nano — Fine-tuning a 9B Without Blowing Unified Memory

A planned walk through LoRA fine-tuning on Nemotron Nano 9B with NeMo Customizer: rank and alpha sweeps, a tiny domain corpus, and the memory accounting that keeps a PEFT run from tripping the Spark's 128 GB unified-memory wall.

Upcoming observability NVIDIA DCGM + Prometheus + Grafana planned ~3 hours, mostly dashboard tuning

Watching the GPU — DCGM, Prometheus, and a Local Grafana for the Spark

A planned setup of DCGM Exporter → Prometheus → Grafana entirely on the Spark itself. The goal is a single dashboard that tells the truth about GPU memory, SM occupancy, and per-container utilization for a rig that's running NIMs, pgvector, and an occasional training job at the same time.
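The wiring described above can be sketched as a minimal Prometheus scrape config. This assumes DCGM Exporter is running on the Spark and listening on its default port, 9400; the job name and scrape interval are illustrative choices, not the article's final setup.

```yaml
# prometheus.yml — minimal sketch: Prometheus scraping DCGM Exporter
# on the Spark itself. dcgm-exporter's default listen port is 9400.
global:
  scrape_interval: 5s
scrape_configs:
  - job_name: dcgm
    static_configs:
      - targets: ["localhost:9400"]
```

Grafana then points at this Prometheus instance as its data source; the per-container utilization view is what the dashboard-tuning hours go into.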

Upcoming dev-tools NVIDIA Nsight Systems + CUDA Toolkit planned ~4 hours including trace analysis

Tracing a NIM Request with Nsight Systems — What the 24.8 tok/s Number Hides

A planned kernel-level trace of a single NIM inference request on GB10. Where does the wall-clock time actually go — tokenization, KV-cache attention, the sampling loop, memcpy? The article turns 24.8 tokens per second into a timeline you can point at and say 'that line is the bottleneck'.

Article №14 foundations Foundation ~25 minute read

Looking Beyond Spark — Fine-Tuning a 100B Nemotron

A working answer to: how many GPUs to fine-tune a 100B Nemotron? Three methods, three memory footprints — full FT ≈ 1.6 TB needs 24× H100; LoRA ≈ 250 GB fits 8× H100; QLoRA ≈ 65 GB fits 1× H200. The Spark's 3B LoRA teaches the math.
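The three footprints come from standard bytes-per-parameter accounting. A back-of-envelope sketch, with the usual rules of thumb, not measured numbers: full fine-tuning carries bf16 weights, bf16 gradients, and fp32 Adam moments (~16 B/param); LoRA freezes the base in bf16 (~2 B/param) plus an assumed adapter/activation slice; QLoRA packs the base into 4-bit (~0.5 B/param) plus assumed overhead. The overhead constants below are picked to match the article's round figures.

```python
# Back-of-envelope memory accounting for fine-tuning a 100B-parameter model.
# Bytes-per-param figures are rough rules of thumb: full FT ~16 B/param
# (bf16 weights + bf16 grads + fp32 Adam moments), LoRA ~2 B/param
# (frozen bf16 base) + adapter/activation slice, QLoRA ~0.5 B/param
# (4-bit base) + dequant/adapter overhead. Overheads here are assumptions.
PARAMS = 100e9
GB = 1e9

def footprint(bytes_per_param: float, overhead_gb: float = 0.0) -> float:
    """Approximate total footprint in GB."""
    return PARAMS * bytes_per_param / GB + overhead_gb

full_ft = footprint(16)        # ≈ 1600 GB ≈ 1.6 TB → a 24× H100 pod
lora    = footprint(2, 50)     # ≈ 250 GB → fits 8× H100 (80 GB each)
qlora   = footprint(0.5, 15)   # ≈ 65 GB  → fits one H200 (141 GB)

for name, gb in [("full FT", full_ft), ("LoRA", lora), ("QLoRA", qlora)]:
    print(f"{name:7s} ≈ {gb:6.0f} GB")
```

The same arithmetic at 3B parameters is why the Spark's LoRA run fits comfortably inside 128 GB of unified memory.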

Article №13 observability NeMo Evaluator ~60 minutes end-to-end — 40 s to ingest the blog into pgvector, 2 min for retrieval, 4 min for generation across three 8B variants, 90 s for the LoRA variant, 9 min for grading

Ragas, Reranked — What 44 Held-Out Questions Say About the Second Brain Stack

A Ragas-style harness written in 200 lines of stdlib Python, run locally on the DGX Spark, against four variants of the Second Brain RAG chain. Naive RAG scores 3.30 / 5. Rerank RAG scores 4.27. LoRA+RAG is a surprise — it does not beat naive. Retrieval is where the points come from.

Article №12 fine-tuning Hugging Face PEFT + Qwen2.5-3B-Instruct ~45 minutes end-to-end — 5 min corpus via NIM 8B, 69 s training, 3 min benchmark, plus a 6 GB base-model download

LoRA on Your Own Q&A — What 231 Pairs Actually Teach a 3B Model

231 own-voice Q&A pairs, a rank-16 LoRA, 69 s of training on a GB10 Spark. The adapter won't memorize your exact numbers, but it will take a model that refuses 61% of questions about your work and turn it into one that answers all of them in your voice. For facts you still need RAG.
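Why a rank-16 LoRA trains in 69 s: the adapter is tiny relative to the frozen base. For each adapted linear layer of shape (d_out, d_in), LoRA adds two low-rank factors, A of shape (r, d_in) and B of shape (d_out, r), i.e. r × (d_in + d_out) extra parameters. A sketch of the count, with illustrative layer shapes rather than Qwen2.5-3B's real config:

```python
# Trainable-parameter count for a LoRA adapter. Each adapted linear layer
# (d_out, d_in) contributes r * (d_in + d_out) parameters via its two
# low-rank factors A: (r, d_in) and B: (d_out, r).

def lora_param_count(layers: list[tuple[int, int]], r: int) -> int:
    """Total trainable parameters for LoRA factors over (d_out, d_in) layers."""
    return sum(r * (d_in + d_out) for d_out, d_in in layers)

# Hypothetical example: q/k/v/o projections of a 2048-wide block, 36 blocks.
blocks = [(2048, 2048)] * 4 * 36
print(lora_param_count(blocks, r=16))  # ≈ 9.4M trainable vs ~3B frozen
```

A few million trainable parameters against three billion frozen ones is also why the adapter learns voice and refusal behavior but not your exact numbers: there isn't enough trained capacity to memorize facts, which is the article's case for keeping RAG.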

Article №11 deployment TensorRT-LLM + Triton Inference Server ~4 hours including two container pulls and three engine builds

TensorRT-LLM on the Spark — FP8 Isn't the Reason to Drop NIM. NVFP4 Is.

Dropping below NIM to raw TensorRT-LLM on a GB10 Spark. FP8 beats NIM's vLLM by 10-15% — barely worth the rebuild. NVFP4 beats it by 76% on decode, 43% on TTFT, and ships a 34%-smaller engine. The reason to drop NIM is the Blackwell-native 4-bit kernel, not FP8.

Article №10 foundations Foundation 10-minute read; no hands-on

One Substrate, Three Apps — Where the Foundation Forks

Seven articles installed one stack on the Spark — NIM, Embed, pgvector, RAG glue, reranker, generator A/B, Guardrails. This bridge retells that install as three different answers to one question — corpus plus 128 GB — and walks readers to the top of three tracks.

Article №09 inference NeMo Guardrails ~90 minutes on top of the article #7/#8 chain

One Rail, Three Policies — NeMo Guardrails on the Retrieval Path

NeMo Guardrails drops a policy gate between retrieval and generation. One install, three per-arc configs — PII for Second Brain, style for LLM Wiki, code-safety for Autoresearch — and a 15-query benchmark: 100% block recall, 100% clean pass. Rails are scaffolding; detectors are the content.
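The "rails are scaffolding; detectors are the content" split can be sketched in a few lines of plain Python. This is a schematic of the pattern, not the NeMo Guardrails API: the gate is the fixed checkpoint between retrieval and generation, and the per-arc detector (here a toy PII regex) is the swappable policy.

```python
# Schematic of a policy gate on the retrieval path — NOT the NeMo Guardrails
# API, just the shape of the pattern: a fixed gate, a swappable detector.
import re
from typing import Callable

def pii_detector(text: str) -> bool:
    """Toy PII check: flags email addresses and long digit runs."""
    return bool(re.search(r"[\w.]+@[\w.]+|\d{9,}", text))

def gate(chunks: list[str], detector: Callable[[str], bool]) -> list[str]:
    """Drop retrieved chunks the active policy flags, before generation sees them."""
    return [c for c in chunks if not detector(c)]

chunks = ["The Spark has 128 GB unified memory.",
          "Contact me at alice@example.com."]
print(gate(chunks, pii_detector))  # only the first chunk survives
```

Swapping `pii_detector` for a style or code-safety check changes the policy without touching the gate, which is the one-install, three-configs structure the article benchmarks.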

Article №08 inference Llama 3.3 70B + Nemotron-Super-49B + Llama 3.1 8B NIM ~30 minutes on top of the article #7 chain

Bigger Generator, Same Grounding — 8B vs 49B vs 70B on One Retrieval Chain

Article #7 bet that a bigger generator would heal the 8B Google-IPO refusal. Ran the A/B across three sizes on one retrieval chain. Bet lost: Nemotron-Super-49B over-refuses relative to the 8B baseline; Llama 3.3 70B narrows the gap but does not close it. The refusal was the scaffold working.

Article №07 inference Nemotron Reranker + pgvector full-text + Llama 3.1 8B NIM ~45 minutes on top of the article #6 chain

Hybrid Retrieval on the Spark — BM25, Dense, Fusion, Rerank

Four retrieval modes on one corpus — naive dense, BM25, Reciprocal Rank Fusion, Nemotron rerank. Dense is already 92% recall@5; rerank adds a point at K=10 and reorders the top. The 8B generator still refuses where retrieval is perfect — grounding, not retrieval, is the new bottleneck.
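Of the four modes, Reciprocal Rank Fusion is the one that fits in a few lines. A minimal stdlib sketch: each ranked list contributes 1 / (k + rank) per document, and the fused order sorts by the summed score; k = 60 is the constant from the original RRF paper.

```python
# Reciprocal Rank Fusion: fuse several ranked result lists into one.
# Each list contributes 1 / (k + rank) per document; k = 60 is the
# conventional constant from the RRF paper.
from collections import defaultdict

def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Return doc ids best-first after fusing all input rankings."""
    scores: dict[str, float] = defaultdict(float)
    for results in ranked_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d1", "d2", "d3"]   # e.g. pgvector dense top-k
bm25  = ["d3", "d1", "d4"]   # e.g. Postgres full-text top-k
print(rrf([dense, bm25]))    # docs found by both lists float to the top
```

Documents that appear in both lists accumulate two reciprocal-rank terms, which is how fusion rewards agreement between dense and BM25 without needing comparable raw scores.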

Article №06 inference Llama 3.1 8B NIM + Nemotron Retriever + pgvector ~30 minutes if the three endpoints are already warm

Three Endpoints, One Answer — Naive RAG on a DGX Spark

Three endpoints in one curl chain — a query embeds through Nemotron, pgvector returns top-5 chunks in under 80 ms, and a Llama 3.1 8B NIM stuffs them into a strict-context prompt. The chain works; the 8B generator still refuses on questions its own context answers.
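The last hop of that chain, sketched in pure Python: stuff the top-k retrieved chunks into a strict-context prompt before handing it to the generator. The wording of the instruction is illustrative, not the article's exact prompt.

```python
# Strict-context prompt stuffing — the step between pgvector's top-5 and
# the Llama 3.1 8B NIM. The instruction wording here is illustrative.
def build_prompt(question: str, chunks: list[str]) -> str:
    """Number the retrieved chunks and pin the model to them."""
    context = "\n\n".join(f"[{i+1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer ONLY from the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt("How much memory does the Spark have?",
                      ["The DGX Spark exposes 128 GB of unified memory."])
print(prompt)
```

The "say so" escape hatch is exactly where the 8B's over-refusals show up: the chunk containing the answer is in context, and the model declines anyway.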

Article №05 inference pgvector ~15 minutes first install, re-runs in seconds

Where Your Vectors Live — pgvector on a DGX Spark

The substrate between the embed call and the retrieve call — pgvector 0.8.2 running as a Postgres 16 container on GB10, with 1000 Nemotron vectors, HNSW and ivfflat both indexed, and a planner that prefers seq scan until you tell it otherwise.

Article №04 inference NeMo ~30 minutes first install, ~1 minute every restart after

Your Own Semantic Space — a Nemotron Embedding NIM on a DGX Spark

The embedding endpoint that every downstream RAG, wiki, and agent piece will reuse — a 2048-dim Nemotron Retriever NIM running locally on GB10, ready 52 seconds after docker run and holding 28 docs/s under batched load.

Article №03 inference NIM ~2 hours first install, ~2 minutes every restart after

Your First NIM on a DGX Spark — What 24.8 Tokens Per Second Doesn't Tell You

First-contact notes on NVIDIA's DGX-Spark-specific Llama 3.1 8B NIM. 9.4 GB image, ~108 s warm-cache cold-start, 24.8 tok/s steady, OpenAI-compatible on :8000 — and a confidently wrong Python one-liner that clarifies what small-model FP8 buys and what it costs.

Article №02 agentic NemoClaw ~2 hours after prerequisites

The Sandbox Tax That Wasn't — NemoClaw vs OpenClaw on One DGX Spark

I ran NemoClaw's sandboxed agent stack and the host Ollama-OpenClaw CLI side by side on one DGX Spark with the same 123B Nemotron model. The sandbox overhead I went looking for is real but modest (~2× raw inference); the real tax is onboarding, and NemoClaw paid it at install time.

Article №01 foundations Foundation ~6 hours spread across a week

Access First, Models Second — How I Set Up My DGX Spark for Solo AI Work

Most DGX Spark walkthroughs open with CUDA and tokens/sec. This one opens with streaming, AI-pair-programming, sandboxed agents, and browser automation — the access layer. For a solo edge builder, that interaction stack is more load-bearing than the model stack.

More articles in preparation End of Vol. 01