Building a Hybrid LLM Platform on EKS, Part 5: Serving Local Models with vLLM and KEDA
Part 5 of our hands-on EKS series. We deploy vLLM model servers on the GPU pool from Part 4, load Qwen2.5-7B model weights from Amazon S3 via an init container, and wire up KEDA autoscaling that scales replicas on live queue depth and drives GPU nodes to zero overnight.
Building a Hybrid LLM Platform on EKS, Part 3: Node Groups, GPU AMIs, and the NVIDIA Device Plugin
Part 3 of our hands-on EKS series. We add worker nodes to the empty cluster from Part 2: a CPU system pool for add-ons and the hybrid router, a GPU pool for vLLM model servers, the NVIDIA device plugin DaemonSet, and the taints and labels that make scheduling predictable.
Self-Hosting LLMs on Kubernetes: A Practical Guide
How to deploy, serve, and autoscale open-source large language models on Kubernetes with vLLM, from GPU node pools and deployment manifests to KEDA-based autoscaling and production guardrails.