Building a Hybrid LLM Platform on EKS
Across this blog we keep referring to a hybrid LLM platform — frontier models for the hard reasoning, self-hosted open-source models for the high-volume work, all on Kubernetes. This series builds it from an empty AWS account to a working inference service, one layer at a time, as reproducible AWS CDK infrastructure you can deploy and tear down yourself.
This is a learning series, not a production blueprint.
The infrastructure built here is designed to teach how the pieces fit together. Before running anything like this with real workloads, proprietary data, or user traffic, you should conduct a thorough security review, understand the attack surface of each layer, and assess compliance requirements for your specific context. GPU clusters, public-facing routers, and LLM endpoints each introduce risks that require deliberate hardening beyond what a tutorial covers.
The Target Architecture
                     ┌─────────────────────────┐
client requests ───► │  ALB (public subnets)   │
                     └────────────┬────────────┘
                                  │
                     ┌────────────▼────────────┐
                     │ hybrid router / gateway │ ← cloud vs. local
                     │     (CPU node pool)     │
                     └──────┬────────────┬─────┘
                            │            │
               frontier API │            │ local inference
           (egress via NAT) │            │
                            ▼            ▼
                     ┌──────────────┐  ┌──────────────────────┐
                     │ Claude / GPT │  │  vLLM model servers  │
                     └──────────────┘  │   (GPU node pool)    │
                                       └──────────────────────┘
all of it on EKS, in private subnets, observed + autoscaled

The Parts
Each part deploys cleanly on its own, with downloadable source. Published parts link to the full walkthrough; the rest are on the way.
Prefer the high-level version? The companion Hybrid AI Playbook and Self-Hosting LLMs on Kubernetes cover the why behind this build.
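To make the "cloud vs. local" decision in the target architecture a little more concrete, here is a minimal TypeScript sketch of the kind of rule a hybrid router might apply. It is not the router built in this series: the type names, the needsDeepReasoning hint, and the maxLocalPromptChars threshold are all hypothetical, chosen only to illustrate sending hard reasoning to a frontier API and high-volume work to the self-hosted models.

```typescript
// Illustrative only: every name below is an assumption, not part of the series' code.
type Backend = "frontier" | "local";

interface RoutingRequest {
  prompt: string;
  // Assumed hint supplied by the caller; a real router might infer this instead.
  needsDeepReasoning: boolean;
}

interface RoutingPolicy {
  // Assumed cutoff: prompts longer than this go to the frontier API.
  maxLocalPromptChars: number;
}

// Decide which side of the platform should serve a request.
function chooseBackend(req: RoutingRequest, policy: RoutingPolicy): Backend {
  // Hard reasoning goes out through NAT to the frontier API.
  if (req.needsDeepReasoning) {
    return "frontier";
  }
  // Very long prompts are assumed to exceed what the local models handle well.
  if (req.prompt.length > policy.maxLocalPromptChars) {
    return "frontier";
  }
  // Everything else is high-volume work for the vLLM servers on the GPU pool.
  return "local";
}

const policy: RoutingPolicy = { maxLocalPromptChars: 4000 };
const backend = chooseBackend(
  { prompt: "Summarize this support ticket.", needsDeepReasoning: false },
  policy,
);
console.log(backend);
```

Keeping the decision in a pure function like this makes the policy easy to unit test and to change without touching the gateway's networking code. A production router would likely also weigh cost, latency, and backend health.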
Want This Built for Your Team?
We build hybrid LLM platforms like this one for clients — reproducible, cost-aware, and documented so your team can own it. Book a free call and we'll map the fastest path.