Stage

Deployment

From experiment to something that runs reliably for you. Containers, services, updates, and graceful degradation on one machine.

Article №11 · Deployment · TensorRT-LLM + Triton Inference Server · ~4 hours, including two container pulls and three engine builds

TensorRT-LLM on the Spark — FP8 Isn't the Reason to Drop NIM. NVFP4 Is.

Dropping below NIM to raw TensorRT-LLM on a GB10 Spark. FP8 beats NIM's vLLM backend by only 10-15%, barely worth the rebuild. NVFP4 beats it by 76% on decode, 43% on time to first token (TTFT), and ships a 34%-smaller engine. The reason to drop NIM is the Blackwell-native 4-bit kernel, not FP8.
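
For a flavor of what "raw TensorRT-LLM" means once you drop below NIM, here is a minimal sketch using the library's high-level LLM API; the checkpoint path is a placeholder, not a model the article names, and the article itself covers the engine builds and Triton wiring:

```python
# Minimal sketch: serving a quantized checkpoint with TensorRT-LLM's
# high-level LLM API. The path below is a placeholder; point it at an
# NVFP4-quantized checkpoint your TensorRT-LLM version supports.
from tensorrt_llm import LLM, SamplingParams

def main() -> None:
    llm = LLM(model="./checkpoints/llama-nvfp4")  # placeholder path
    params = SamplingParams(temperature=0.8, top_p=0.95)
    outputs = llm.generate(["Explain NVFP4 in one sentence."], params)
    for out in outputs:
        print(out.outputs[0].text)

if __name__ == "__main__":
    main()
```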