LoRA on Nemotron Nano — Fine-tuning a 9B Without Blowing Unified Memory

What this article will answer

When is LoRA the right tool on a personal rig, and what settings keep it clean? This piece will run a small fine-tune of Nemotron Nano 9B v2 on a domain corpus, sweeping rank ∈ {8, 16, 32} and alpha across a couple of values to see what actually moves the needle.

NVIDIA technologies to be covered

NeMo Customizer — the .yaml-driven workflow for LoRA/QLoRA on NeMo checkpoints.
Nemotron Nano 9B v2 — the base model; its FP8 quantization story; how PEFT plays with quantized bases.
PEFT adapters — saving, merging, and serving adapters without baking them into the base.
NIM-compatible deploy path — loading the fine-tuned model behind a NIM microservice so the rest of the RAG stack keeps working.

What I expect to find

LoRA at rank 16 on 9B should fit comfortably in unified memory with batch size 4 and sequence length 2048. The real question is not “can it fit” but “does a few hours of LoRA on a hundred-document corpus produce a model that’s materially better for LLM-Wiki’s compile-time summarization loop than the base + good prompting”. The article closes on that comparison.

Where it sits in the arc

Part of the LLM Wiki track (W-series). Builds on the seven foundation articles and complements the bigger-generator article’s result that strict-context prompting has a ceiling.