Building a Hybrid LLM Platform on EKS

Across this blog we keep referring to a hybrid LLM platform — frontier models for the hard reasoning, self-hosted open-source models for the high-volume work, all on Kubernetes. This series builds it from an empty AWS account to a working inference service, one layer at a time, as reproducible AWS CDK infrastructure you can deploy and tear down yourself.

Start with Part 1 →

8 of 8 parts published

This is a learning series, not a production blueprint.

The infrastructure built here is designed to teach how the pieces fit together. Before running anything like this with real workloads, proprietary data, or user traffic, you should conduct a thorough security review, understand the attack surface of each layer, and assess compliance requirements for your specific context. GPU clusters, public-facing routers, and LLM endpoints each introduce risks that require deliberate hardening beyond what a tutorial covers.

The Target Architecture

                          ┌─────────────────────────┐
   client requests  ───►  │   ALB (public subnets)  │
                          └────────────┬────────────┘
                                       │
                          ┌────────────▼────────────┐
                          │  hybrid router / gateway │   ← cloud vs. local
                          │     (CPU node pool)      │
                          └──────┬─────────────┬─────┘
                                 │             │
                   frontier API  │             │  local inference
                   (egress via   │             ▼
                    NAT)         │   ┌──────────────────────┐
                                 ▼   │  vLLM model servers   │
                          ┌──────────┤   (GPU node pool)     │
                          │ Claude / │└──────────────────────┘
                          │   GPT    │
                          └──────────┘
        all of it on EKS, in private subnets, observed + autoscaled
EKS · AWS CDK · vLLM · GPU · Karpenter · OpenTelemetry · Claude · Llama

The Parts

Each part deploys cleanly on its own and comes with downloadable source. All eight parts are published and link to the full walkthrough.

01

Architecture & the Network Foundation

Published

The full platform architecture and the CDK network stack it all lives in — VPC, public/private subnets, NAT, EKS discovery tags, and VPC endpoints sized for a GPU cluster.

AWS CDK · VPC · TypeScript
02

The EKS Control Plane

Published

Dropping the cluster into the VPC: the EKS control plane, the OIDC provider, IAM roles, and IRSA — ending with a working kubectl connection.

EKS · IAM · IRSA · OIDC
03

Node Groups: CPU System Pool & GPU Pool

Published

Managed node groups for the system workloads and a GPU pool for inference — GPU AMIs, the NVIDIA device plugin, and the taints and labels that keep model servers on the right nodes.

GPU · NVIDIA · Node Groups
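To make the taint half of that placement concrete, here is a minimal sketch of how Kubernetes decides whether a pod may land on a tainted node. It is a plain TypeScript model of the matching rule, not code from Part 3. The taint key `nvidia.com/gpu` and the function names `tolerates` and `canSchedule` are assumptions chosen for illustration.

```typescript
// Simplified model of Kubernetes taint/toleration matching.
type Effect = "NoSchedule" | "PreferNoSchedule" | "NoExecute";

interface Taint {
  key: string;
  value?: string;
  effect: Effect;
}

interface Toleration {
  key?: string;
  operator?: "Equal" | "Exists"; // Kubernetes defaults to "Equal"
  value?: string;
  effect?: Effect; // omitted means "any effect"
}

// True when a single toleration matches a single taint.
function tolerates(t: Toleration, taint: Taint): boolean {
  if (t.effect !== undefined && t.effect !== taint.effect) return false;
  const op = t.operator ?? "Equal";
  // An empty key with operator Exists matches every taint key.
  if (t.key === undefined) return op === "Exists";
  if (t.key !== taint.key) return false;
  return op === "Exists" || (t.value ?? "") === (taint.value ?? "");
}

// A pod can be scheduled when every hard taint on the node is tolerated.
// PreferNoSchedule is only a soft preference, so it never blocks scheduling.
function canSchedule(taints: Taint[], tolerations: Toleration[]): boolean {
  return taints
    .filter((taint) => taint.effect !== "PreferNoSchedule")
    .every((taint) => tolerations.some((t) => tolerates(t, taint)));
}

// Hypothetical GPU node taint and model-server toleration.
const gpuNodeTaints: Taint[] = [
  { key: "nvidia.com/gpu", value: "true", effect: "NoSchedule" },
];
const modelServerTolerations: Toleration[] = [
  { key: "nvidia.com/gpu", operator: "Exists", effect: "NoSchedule" },
];

console.log(canSchedule(gpuNodeTaints, modelServerTolerations)); // true
console.log(canSchedule(gpuNodeTaints, [])); // false: system pods stay off GPU nodes
```

The taint only keeps other workloads off the GPU nodes. Pulling the model servers onto them is the job of the labels, through a node selector or node affinity on the pod.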
04

Platform Add-ons

Published

The cluster services everything else depends on: the AWS Load Balancer Controller, ingress, and Karpenter for fast, cost-aware autoscaling of GPU capacity.

Karpenter · ALB Controller · Ingress
05

Serving Local Models with vLLM

Published

Deploying the self-hosted inference layer — vLLM model servers, loading weights, and request-based autoscaling so GPU capacity follows demand.

vLLM · KEDA · Helm
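The arithmetic behind request-based autoscaling can be sketched in a few lines. This is a rough model of the average-value rule the Kubernetes Horizontal Pod Autoscaler applies, not the configuration used in Part 5. The names `ScalingPolicy` and `desiredReplicas` and the numbers in the example are assumptions for illustration.

```typescript
// Request-based scaling: size the fleet so each replica handles
// roughly `targetPerReplica` in-flight requests.
interface ScalingPolicy {
  targetPerReplica: number; // desired in-flight requests per model server
  minReplicas: number;
  maxReplicas: number;
}

function desiredReplicas(totalInFlight: number, policy: ScalingPolicy): number {
  if (policy.targetPerReplica <= 0) {
    throw new Error("targetPerReplica must be positive");
  }
  // Round up so the per-replica load never exceeds the target.
  const raw = Math.ceil(totalInFlight / policy.targetPerReplica);
  // Clamp to the configured floor and ceiling.
  return Math.min(policy.maxReplicas, Math.max(policy.minReplicas, raw));
}

// Hypothetical policy: 8 concurrent requests per replica, 1 to 4 replicas.
const policy: ScalingPolicy = { targetPerReplica: 8, minReplicas: 1, maxReplicas: 4 };

console.log(desiredReplicas(0, policy));   // 1 (floor)
console.log(desiredReplicas(20, policy));  // 3 (ceil(20 / 8))
console.log(desiredReplicas(100, policy)); // 4 (ceiling)
```

The ceiling matters more on a GPU pool than on a CPU pool, because each extra replica usually means another GPU node, so `maxReplicas` doubles as a cost cap.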
06

The Hybrid Router

Published

The gateway that makes it hybrid: routing each request to a frontier model for hard reasoning or to a local model for high-volume execution work.

Hono · Claude · Routing
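A routing decision of this kind can be as small as one pure function. The sketch below is illustrative only: the `needsReasoning` hint, the `maxLocalPromptChars` threshold, and the function name `chooseTarget` are assumptions, and the policy the series actually uses is the subject of Part 6.

```typescript
// Where a request should be served.
type Target = "frontier" | "local";

interface RouteRequest {
  prompt: string;
  needsReasoning?: boolean; // hypothetical caller-supplied hint
}

interface RouterConfig {
  maxLocalPromptChars: number; // hypothetical cutoff for the local model
}

// Decide per request: hard reasoning goes to the frontier API,
// high-volume routine work stays on the self-hosted model.
function chooseTarget(req: RouteRequest, cfg: RouterConfig): Target {
  if (req.needsReasoning === true) return "frontier";
  // Very long prompts are sent to the frontier model in this sketch,
  // on the assumption that the local model has a smaller context window.
  if (req.prompt.length > cfg.maxLocalPromptChars) return "frontier";
  return "local";
}

const cfg: RouterConfig = { maxLocalPromptChars: 8000 };

console.log(chooseTarget({ prompt: "Classify this support ticket." }, cfg)); // local
console.log(chooseTarget({ prompt: "Plan the migration.", needsReasoning: true }, cfg)); // frontier
```

Keeping the decision in a pure function makes it easy to unit test and to replay against logged traffic when tuning thresholds.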
07

Observability & Cost Telemetry

Published

Wiring observability into the platform — OpenTelemetry traces through the router, Prometheus and Grafana for GPU and vLLM metrics, and Langfuse for per-request token and cost telemetry, so you can see cloud-vs-local spend and tune the routing.

OpenTelemetry · Prometheus · Grafana · Langfuse
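Comparing cloud and local spend comes down to pricing the two paths in different units: tokens for the frontier API, GPU time for the self-hosted model. The sketch below shows that arithmetic under stated assumptions. The type and function names are hypothetical, and every price is a placeholder rather than a real rate.

```typescript
// One served request, as a telemetry record might describe it.
interface Usage {
  target: "frontier" | "local";
  inputTokens: number;
  outputTokens: number;
  gpuSeconds?: number; // GPU time attributed to a local request
}

interface Pricing {
  frontierInputPerMTok: number;  // USD per million input tokens (placeholder)
  frontierOutputPerMTok: number; // USD per million output tokens (placeholder)
  gpuHourly: number;             // USD per GPU-hour (placeholder)
}

// Frontier requests are billed per token; local requests by GPU time.
function requestCostUsd(u: Usage, p: Pricing): number {
  if (u.target === "frontier") {
    const tokenCost =
      u.inputTokens * p.frontierInputPerMTok +
      u.outputTokens * p.frontierOutputPerMTok;
    return tokenCost / 1e6;
  }
  return ((u.gpuSeconds ?? 0) / 3600) * p.gpuHourly;
}

// Aggregate spend per target so the two paths can be compared.
function spendByTarget(usages: Usage[], p: Pricing): Record<"frontier" | "local", number> {
  const totals = { frontier: 0, local: 0 };
  for (const u of usages) {
    totals[u.target] += requestCostUsd(u, p);
  }
  return totals;
}

// Placeholder prices, for illustration only.
const pricing: Pricing = { frontierInputPerMTok: 3, frontierOutputPerMTok: 15, gpuHourly: 1.2 };

const spend = spendByTarget(
  [
    { target: "frontier", inputTokens: 1000, outputTokens: 500 },
    { target: "local", inputTokens: 1000, outputTokens: 500, gpuSeconds: 3 },
  ],
  pricing,
);
console.log(spend);
```

Attributing GPU seconds to a single request is itself an approximation, since a model server batches many requests on one GPU and idle capacity still costs money. Treat the local figure as a lower bound unless idle time is folded in.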
08

Testing, Load & Examples

Published

Validating the platform end-to-end — load testing the inference layer, sample workloads, and proving the routing economics under real traffic.

Vitest · k6 · OpenAI SDK

Prefer the high-level version? The companion Hybrid AI Playbook and Self-Hosting LLMs on Kubernetes cover the why behind this build.

Want This Built for Your Team?

We build hybrid LLM platforms like this one for clients — reproducible, cost-aware, and documented so your team can own it. Book a free call and we'll map the fastest path.
