When your organization grows to 20+ data science teams and 500+ ML engineers, vanilla Airflow quickly becomes a bottleneck rather than a solution. In this talk, I’ll share how we transformed Airflow into a scalable, secure, and user-friendly MLOps platform — reducing time to market and making data scientists actually enjoy working with orchestration.
I’ll cover why we chose Airflow as the foundation, and how we designed an architecture that launches pipelines across multiple Kubernetes clusters from a single UI, using KubernetesPodOperator and per-cluster worker deployments. You’ll hear how we built a custom Vault integration that keeps secrets out of Connections and Variables, enabled real-time logging that persists to S3 while a task is still running, and created a custom SparkSubmitOperator capable of running jobs on any Spark or Hadoop cluster inside K8s with Kerberos authentication. On top of that, we designed a streamlined developer experience: our users can generate a ready-to-use GitLab repository from a template and deploy a versioned, tag-based pipeline into production, all in under five minutes. The short sketches below give a flavor of a few of these building blocks.
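For the multi-cluster setup, here is a minimal sketch, not our production code, of how the stock KubernetesPodOperator (import path as in recent versions of the cncf.kubernetes provider) can target different clusters by giving each task its own Kubernetes connection; the DAG ID, connection IDs, namespace, and image are all hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

# Hypothetical Airflow connection IDs, one per target Kubernetes cluster.
CLUSTERS = ["k8s_cluster_a", "k8s_cluster_b"]

with DAG(
    dag_id="multi_cluster_demo",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    for conn_id in CLUSTERS:
        KubernetesPodOperator(
            task_id=f"train_on_{conn_id}",
            kubernetes_conn_id=conn_id,  # selects which cluster the pod runs in
            namespace="ml-jobs",
            image="python:3.11-slim",
            cmds=["python", "-c", "print('hello from the cluster')"],
            get_logs=True,
        )
```

Routing each task through a named connection keeps cluster credentials out of DAG code, so the same pipeline definition can fan out to new clusters by configuration alone.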
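Our Vault integration is custom, but the core idea, resolving secrets at runtime rather than storing them in the metadata database, can be approximated with Airflow’s built-in HashiCorp Vault secrets backend; the mount point, paths, URL, and secret names below are assumptions for illustration.

```python
# With the HashiCorp Vault secrets backend enabled, e.g. via environment
# variables (mount point, paths, and URL are assumptions for this sketch):
#   AIRFLOW__SECRETS__BACKEND=airflow.providers.hashicorp.secrets.vault.VaultBackend
#   AIRFLOW__SECRETS__BACKEND_KWARGS={"connections_path": "connections",
#       "variables_path": "variables", "mount_point": "airflow",
#       "url": "https://vault.example.com"}
#
# DAG code stays unchanged: lookups check the secrets backend first, so nothing
# sensitive has to live in Airflow Connections or Variables.
from airflow.hooks.base import BaseHook
from airflow.models import Variable

conn = BaseHook.get_connection("my_postgres")  # resolved from airflow/connections/my_postgres
token = Variable.get("ml_api_token")           # resolved from airflow/variables/ml_api_token
```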
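Likewise, the Kerberos side of our custom SparkSubmitOperator is easiest to picture through the stock operator’s principal/keytab parameters; the connection ID, application path, principal, and keytab location here are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="spark_kerberos_demo",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    SparkSubmitOperator(
        task_id="score_model",
        conn_id="spark_k8s",                     # hypothetical connection holding the Spark master URL
        application="/opt/jobs/score_model.py",  # hypothetical PySpark application
        principal="svc-airflow@EXAMPLE.COM",     # Kerberos principal (assumption)
        keytab="/etc/security/keytabs/svc-airflow.keytab",
        conf={"spark.executor.instances": "4"},
    )
```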
Whether you’re scaling Airflow or building an MLOps platform for a growing data science community, this talk offers practical takeaways, lessons learned, and architecture patterns you can apply right away.