Building a Hybrid LLM Platform on EKS, Part 6: The Hybrid Router
In Part 5 we deployed vLLM and wired KEDA to scale replicas with queue depth. The model server is running and answering requests — but every caller reaches it at a ClusterIP address with no routing in front of it. There is nothing deciding whether a given request belongs on the local GPU or on a cloud API, and there is nothing handling the gap when vLLM is warming up from zero.
Part 6 builds the hybrid router: a lightweight FastAPI service that sits in front of both backends, accepts the same OpenAI-compatible request format the rest of the platform uses, and routes each request to the right backend at runtime. The router uses model-name overrides for explicit control and prompt-complexity heuristics for automatic decisions. It checks vLLM health before every local route and transparently falls back to Claude when the local model is still loading. The whole platform is exposed behind an internet-facing ALB via the load balancer controller from Part 4.
When this part is done: a single HTTPS endpoint accepts inference requests, routes them to the GPU model server or to the Anthropic API depending on the request, and the caller sees a consistent OpenAI-formatted response regardless of which backend answered it.
What the Router Does
The router is the only component in the platform that knows both backends exist. Everything above it — clients, integration tests, the observability stack in Part 7 — speaks the OpenAI wire format and does not know whether the response came from vLLM or Claude.
The routing decision is made per request on three criteria, evaluated in order:
- Explicit model name. If the request specifies "model": "qwen2.5-7b" or "model": "local", it goes to vLLM. If it specifies a Claude model name or "model": "cloud", it goes to the Anthropic API. An explicit name overrides the heuristics.
- Heuristics for "model": "auto". Estimate the prompt token count (character count ÷ 4) and check max_tokens. Short prompts with bounded output (classification, extraction, summarization, code generation on a known schema) go local. Long prompts or large requested outputs (complex reasoning, multi-step planning, long-form generation) go to cloud.
- Health-based fallback. Before forwarding to vLLM, the router checks /health with a short timeout. If vLLM is starting up or scaled to zero, the check fails and the request routes to cloud transparently. The caller gets a response, not a 503.
The router adds an X-Router-Backend header to every response — local or cloud — so the observability layer and clients can see which path was taken without parsing the response body.
The Router Application
router/
├── main.py
├── requirements.txt
└── Dockerfile
main.py
# router/main.py
from __future__ import annotations

import asyncio
import json
import os
from typing import AsyncIterator

import anthropic as ant
import httpx
from fastapi import FastAPI, Request, Response
from fastapi.responses import StreamingResponse

app = FastAPI(title="Hybrid LLM Router")

VLLM_BASE_URL = os.getenv("VLLM_BASE_URL", "http://vllm.inference.svc.cluster.local")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
VLLM_HEALTH_TIMEOUT = float(os.getenv("VLLM_HEALTH_TIMEOUT", "2.0"))
LOCAL_TOKEN_THRESHOLD = int(os.getenv("LOCAL_TOKEN_THRESHOLD", "1500"))
LOCAL_MAX_TOKENS_THRESHOLD = int(os.getenv("LOCAL_MAX_TOKENS_THRESHOLD", "1024"))

# Models that always route to vLLM
_LOCAL_MODELS = {"qwen2.5-7b", "local"}

# Models that always route to Anthropic; values are the canonical Claude model ID
_CLOUD_MODELS: dict[str, str] = {
    "cloud": "claude-opus-4-8",
    "claude": "claude-opus-4-8",
    "claude-opus-4-8": "claude-opus-4-8",
    "claude-sonnet-4-6": "claude-sonnet-4-6",
    "claude-haiku-4-5-20251001": "claude-haiku-4-5-20251001",
}

_anthropic_client: ant.AsyncAnthropic | None = None


def get_anthropic_client() -> ant.AsyncAnthropic:
    global _anthropic_client
    if _anthropic_client is None:
        _anthropic_client = ant.AsyncAnthropic(api_key=ANTHROPIC_API_KEY)
    return _anthropic_client


async def vllm_is_ready() -> bool:
    try:
        async with httpx.AsyncClient(timeout=VLLM_HEALTH_TIMEOUT) as client:
            r = await client.get(f"{VLLM_BASE_URL}/health")
            return r.status_code == 200
    except Exception:
        return False


def _estimate_tokens(messages: list[dict]) -> int:
    return sum(len(m.get("content", "")) for m in messages) // 4


def _route_to_local(model: str, messages: list[dict], max_tokens: int) -> bool:
    if model in _LOCAL_MODELS:
        return True
    if model in _CLOUD_MODELS:
        return False
    # "auto" or any unrecognised name: apply heuristics.
    prompt_tokens = _estimate_tokens(messages)
    return prompt_tokens < LOCAL_TOKEN_THRESHOLD and max_tokens <= LOCAL_MAX_TOKENS_THRESHOLD


@app.get("/health")
async def health() -> dict:
    return {"status": "ok"}


@app.post("/v1/chat/completions")
async def chat_completions(request: Request) -> Response:
    body = await request.json()
    model: str = body.get("model", "auto")
    messages: list[dict] = body.get("messages", [])
    max_tokens: int = body.get("max_tokens", 512)
    stream: bool = body.get("stream", False)

    use_local = _route_to_local(model, messages, max_tokens)
    if use_local and not await vllm_is_ready():
        # vLLM is cold-starting or scaled to zero: route to cloud silently.
        use_local = False
        body = {**body, "model": "claude-sonnet-4-6"}

    if use_local:
        resp = await _forward_vllm(body, stream)
        resp.headers["X-Router-Backend"] = "local"
    else:
        resp = await _forward_anthropic(body, stream)
        resp.headers["X-Router-Backend"] = "cloud"
    return resp


async def _forward_vllm(body: dict, stream: bool) -> Response:
    # Normalise the model name to what vLLM expects. Router-level aliases such
    # as "local" and "auto" are rewritten too; unless vLLM was started with
    # them as extra served model names, it would reject them.
    if body.get("model") != "qwen2.5-7b":
        body = {**body, "model": "qwen2.5-7b"}

    if stream:
        async def _gen() -> AsyncIterator[bytes]:
            async with httpx.AsyncClient(timeout=120.0) as client:
                async with client.stream(
                    "POST",
                    f"{VLLM_BASE_URL}/v1/chat/completions",
                    json=body,
                ) as r:
                    async for chunk in r.aiter_bytes():
                        yield chunk

        return StreamingResponse(_gen(), media_type="text/event-stream")

    async with httpx.AsyncClient(timeout=120.0) as client:
        r = await client.post(f"{VLLM_BASE_URL}/v1/chat/completions", json=body)
    return Response(content=r.content, status_code=r.status_code,
                    media_type="application/json")


async def _forward_anthropic(body: dict, stream: bool) -> Response:
    client = get_anthropic_client()
    requested_model = body.get("model", "claude-sonnet-4-6")
    claude_model = _CLOUD_MODELS.get(requested_model, "claude-sonnet-4-6")
    messages: list[dict] = body.get("messages", [])
    max_tokens: int = body.get("max_tokens", 1024)

    # Anthropic takes the system prompt as a separate argument, not as a message.
    system = next((m["content"] for m in messages if m["role"] == "system"), None)
    turns = [m for m in messages if m["role"] != "system"]
    kwargs: dict = dict(model=claude_model, messages=turns, max_tokens=max_tokens)
    if system:
        kwargs["system"] = system

    if stream:
        async def _gen() -> AsyncIterator[bytes]:
            async with client.messages.stream(**kwargs) as s:
                async for text in s.text_stream:
                    chunk = {
                        "choices": [{"delta": {"content": text}, "finish_reason": None}]
                    }
                    yield f"data: {json.dumps(chunk)}\n\n".encode()
            yield b"data: [DONE]\n\n"

        return StreamingResponse(_gen(), media_type="text/event-stream")

    resp = await client.messages.create(**kwargs)
    openai_body = {
        "id": resp.id,
        "object": "chat.completion",
        "model": resp.model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": resp.content[0].text},
            "finish_reason": resp.stop_reason,
        }],
        "usage": {
            "prompt_tokens": resp.usage.input_tokens,
            "completion_tokens": resp.usage.output_tokens,
            "total_tokens": resp.usage.input_tokens + resp.usage.output_tokens,
        },
    }
    return Response(content=json.dumps(openai_body), media_type="application/json")
requirements.txt
fastapi==0.115.0
uvicorn[standard]==0.32.0
httpx==0.28.0
anthropic==0.40.0
Dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY main.py .
EXPOSE 8080
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080", "--workers", "4"]
A Sixth CDK Stack: Router
The router stack builds the container image from the router/ directory, pushes it to the CDK-managed ECR repository, creates the router namespace, and deploys the workload manifests. The ALB Ingress at the end exposes the router to the public internet.
// lib/router-stack.ts
import * as path from "path";
import * as cdk from "aws-cdk-lib";
import { Construct } from "constructs";
import * as eks from "aws-cdk-lib/aws-eks";
import * as iam from "aws-cdk-lib/aws-iam";
import * as assets from "aws-cdk-lib/aws-ecr-assets";
import { config } from "./config";

interface RouterStackProps extends cdk.StackProps {
  cluster: eks.Cluster;
  oidcProvider: iam.OpenIdConnectProvider;
}

export class RouterStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props: RouterStackProps) {
    super(scope, id, props);
    const image = this.buildRouterImage();
    this.deployRouter(props.cluster, image);
  }

  private buildRouterImage(): assets.DockerImageAsset {
    // CDK builds the image from router/ and pushes it to the CDK-managed ECR
    // repository in the bootstrap account. The imageUri is passed to the
    // Deployment manifest so CDK re-deploys when the image content changes.
    return new assets.DockerImageAsset(this, "RouterImage", {
      directory: path.join(__dirname, "../router"),
    });
  }

  private deployRouter(
    cluster: eks.Cluster,
    image: assets.DockerImageAsset,
  ): void {
    const ns = cluster.addManifest("RouterNamespace", {
      apiVersion: "v1",
      kind: "Namespace",
      metadata: { name: "router" },
    });

    // The Anthropic API key lives in a Kubernetes Secret created manually
    // with kubectl (see "Deploy" section below). It never enters CDK or
    // CloudFormation state.
    const deployment = cluster.addManifest("RouterDeployment", {
      apiVersion: "apps/v1",
      kind: "Deployment",
      metadata: { name: "router", namespace: "router", labels: { app: "router" } },
      spec: {
        replicas: 2,
        selector: { matchLabels: { app: "router" } },
        template: {
          metadata: { labels: { app: "router" } },
          spec: {
            // The router is a CPU workload: no GPU affinity or toleration needed.
            // It lands on the system node pool from Part 3.
            affinity: {
              nodeAffinity: {
                requiredDuringSchedulingIgnoredDuringExecution: {
                  nodeSelectorTerms: [{
                    matchExpressions: [{
                      key: "node.kubernetes.io/purpose",
                      operator: "In",
                      values: ["system"],
                    }],
                  }],
                },
              },
            },
            containers: [{
              name: "router",
              image: image.imageUri,
              ports: [{ containerPort: 8080, name: "http" }],
              env: [
                {
                  name: "VLLM_BASE_URL",
                  value: "http://vllm.inference.svc.cluster.local",
                },
                {
                  name: "ANTHROPIC_API_KEY",
                  valueFrom: {
                    secretKeyRef: {
                      name: "router-api-keys",
                      key: "ANTHROPIC_API_KEY",
                    },
                  },
                },
                { name: "VLLM_HEALTH_TIMEOUT", value: "2.0" },
                { name: "LOCAL_TOKEN_THRESHOLD", value: "1500" },
                { name: "LOCAL_MAX_TOKENS_THRESHOLD", value: "1024" },
              ],
              resources: {
                requests: { cpu: "250m", memory: "256Mi" },
                limits: { cpu: "1", memory: "512Mi" },
              },
              readinessProbe: {
                httpGet: { path: "/health", port: 8080 },
                initialDelaySeconds: 5,
                periodSeconds: 10,
              },
              livenessProbe: {
                httpGet: { path: "/health", port: 8080 },
                initialDelaySeconds: 10,
                periodSeconds: 30,
              },
            }],
          },
        },
      },
    });
    deployment.node.addDependency(ns);

    cluster.addManifest("RouterService", {
      apiVersion: "v1",
      kind: "Service",
      metadata: { name: "router", namespace: "router" },
      spec: {
        selector: { app: "router" },
        ports: [{ name: "http", port: 80, targetPort: 8080 }],
        type: "ClusterIP",
      },
    }).node.addDependency(ns);

    cluster.addManifest("RouterPdb", {
      apiVersion: "policy/v1",
      kind: "PodDisruptionBudget",
      metadata: { name: "router", namespace: "router" },
      spec: {
        minAvailable: 1,
        selector: { matchLabels: { app: "router" } },
      },
    }).node.addDependency(ns);

    // The ALB Ingress provisions a real internet-facing Application Load Balancer
    // using the controller installed in Part 4.
    cluster.addManifest("RouterIngress", {
      apiVersion: "networking.k8s.io/v1",
      kind: "Ingress",
      metadata: {
        name: "router",
        namespace: "router",
        annotations: {
          "kubernetes.io/ingress.class": "alb",
          "alb.ingress.kubernetes.io/scheme": "internet-facing",
          "alb.ingress.kubernetes.io/target-type": "ip",
          "alb.ingress.kubernetes.io/healthcheck-path": "/health",
          "alb.ingress.kubernetes.io/listen-ports": '[{"HTTPS":443},{"HTTP":80}]',
          "alb.ingress.kubernetes.io/ssl-redirect": "443",
        },
      },
      spec: {
        rules: [{
          http: {
            paths: [{
              path: "/",
              pathType: "Prefix",
              backend: {
                service: { name: "router", port: { number: 80 } },
              },
            }],
          },
        }],
      },
    }).node.addDependency(ns);
  }
}
Update bin/app.ts to add the sixth stack:
// bin/app.ts
import * as cdk from "aws-cdk-lib";
import { NetworkStack } from "../lib/network-stack";
import { ClusterStack } from "../lib/cluster-stack";
import { NodeGroupStack } from "../lib/node-group-stack";
import { AddonsStack } from "../lib/addons-stack";
import { InferenceStack } from "../lib/inference-stack";
import { RouterStack } from "../lib/router-stack";
import { config } from "../lib/config";

const app = new cdk.App();
const env = { region: config.region };

const network = new NetworkStack(app, "HybridLlmNetwork", { env });

const cluster = new ClusterStack(app, "HybridLlmCluster", {
  env,
  vpc: network.vpc,
});

new NodeGroupStack(app, "HybridLlmNodeGroups", {
  env,
  cluster: cluster.cluster,
  nodeRole: cluster.nodeRole,
});

new AddonsStack(app, "HybridLlmAddons", {
  env,
  cluster: cluster.cluster,
  oidcProvider: cluster.oidcProvider,
});

new InferenceStack(app, "HybridLlmInference", {
  env,
  cluster: cluster.cluster,
  oidcProvider: cluster.oidcProvider,
});

new RouterStack(app, "HybridLlmRouter", {
  env,
  cluster: cluster.cluster,
  oidcProvider: cluster.oidcProvider,
});
Walking Through the Decisions
The OpenAI wire format as the lingua franca
Every component in the platform — the router, the vLLM server, the observability stack in Part 7, any application that calls the platform — speaks the OpenAI /v1/chat/completions format. This is not an accident. The OpenAI format has become the de facto standard for LLM APIs: virtually every SDK, integration, and framework has a client that speaks it. Standardising on it here means swapping backends (upgrading from Qwen2.5-7B to a newer model, adding a second cloud provider) never touches the callers. The router is the translation layer; everything above and below it sees the same interface.
The Anthropic SDK returns a different response shape (content[].text rather than choices[].message.content). The _forward_anthropic function translates it into OpenAI format before returning, so callers need no awareness that the response came from Claude. The X-Router-Backend header is how they can tell, if they want to — but they never have to.
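The translation can be sketched as a pure function over plain dictionaries. This is a minimal illustration rather than the router's actual code: to_openai_response and _FINISH_REASONS are hypothetical names, the input is assumed to be the JSON form of an Anthropic Messages response, and the stop-reason mapping is one plausible choice. Unlike _forward_anthropic above, it joins every text block and maps stop_reason onto OpenAI's finish_reason vocabulary, which stricter OpenAI clients may expect.

```python
from typing import Any

# Assumed mapping from Anthropic stop reasons to OpenAI finish reasons.
_FINISH_REASONS = {
    "end_turn": "stop",
    "stop_sequence": "stop",
    "max_tokens": "length",
    "tool_use": "tool_calls",
}


def to_openai_response(anthropic_msg: dict[str, Any]) -> dict[str, Any]:
    """Convert an Anthropic Messages response (as a dict) to OpenAI chat format."""
    # Concatenate every text block; non-text blocks (e.g. tool_use) are skipped.
    text = "".join(
        block.get("text", "")
        for block in anthropic_msg.get("content", [])
        if block.get("type") == "text"
    )
    usage = anthropic_msg.get("usage", {})
    prompt = usage.get("input_tokens", 0)
    completion = usage.get("output_tokens", 0)
    stop_reason = anthropic_msg.get("stop_reason")
    return {
        "id": anthropic_msg.get("id"),
        "object": "chat.completion",
        "model": anthropic_msg.get("model"),
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            # Unknown reasons pass through unchanged rather than being dropped.
            "finish_reason": _FINISH_REASONS.get(stop_reason, stop_reason),
        }],
        "usage": {
            "prompt_tokens": prompt,
            "completion_tokens": completion,
            "total_tokens": prompt + completion,
        },
    }
```

Keeping the translation free of network calls makes it easy to unit-test separately from the SDK.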
Routing heuristics: why token count
The heuristic of routing by estimated prompt token count and requested output tokens captures a real distinction in how these models are used on this platform.
Short prompt, bounded output — extracting a JSON field from a document, classifying a customer intent into one of ten categories, summarizing a paragraph in two sentences — is exactly what a 7B instruction-tuned model excels at. The task is pattern-matching and reformatting, not reasoning. Local inference is fast, cheap, and good enough.
Long prompt or large requested output — synthesizing information from multiple documents, multi-step code generation for an unfamiliar library, explaining a subtle bug across several files — benefits from the stronger reasoning and larger context window of a frontier model. Routing these to Claude means higher per-token cost but substantially better output quality for tasks that need it.
The thresholds (1500 estimated prompt tokens, 1024 max output tokens) are starting points, not absolutes. They are exposed as the environment variables LOCAL_TOKEN_THRESHOLD and LOCAL_MAX_TOKENS_THRESHOLD on the router Deployment, so you can tune them without rebuilding the image; changing a value only rolls the router pods. In Part 7, tracing each request with its routing decision lets you observe whether the heuristics are correctly splitting work across the two backends.
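To make the thresholds concrete, here is a small worked sketch of the heuristic. It mirrors the character-count ÷ 4 estimate and the two default thresholds from main.py; estimate_tokens and goes_local are illustrative names, and the handling of list-form content (OpenAI's multi-part message format) is an assumption added here, since _estimate_tokens in main.py expects plain strings.

```python
LOCAL_TOKEN_THRESHOLD = 1500       # default from main.py
LOCAL_MAX_TOKENS_THRESHOLD = 1024  # default from main.py


def estimate_tokens(messages: list[dict]) -> int:
    """Rough token estimate: total characters of text content divided by 4."""
    chars = 0
    for message in messages:
        content = message.get("content") or ""
        if isinstance(content, str):
            chars += len(content)
        else:
            # Assumed multi-part content: count only the text parts.
            chars += sum(len(part.get("text", "")) for part in content
                         if isinstance(part, dict))
    return chars // 4


def goes_local(messages: list[dict], max_tokens: int) -> bool:
    """True when both the prompt estimate and the output budget are small."""
    return (estimate_tokens(messages) < LOCAL_TOKEN_THRESHOLD
            and max_tokens <= LOCAL_MAX_TOKENS_THRESHOLD)


short = [{"role": "user", "content": "Classify the sentiment: Great product!"}]
long_doc = [{"role": "user", "content": "x" * 8000}]

print(estimate_tokens(short), goes_local(short, 10))        # 9 True
print(estimate_tokens(long_doc), goes_local(long_doc, 10))  # 2000 False
print(goes_local(short, 4096))                              # False
```

Note that either condition alone is enough to send a request to cloud: the short classification prompt still leaves the local path once it asks for 4096 output tokens.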
The health check and cold-start fallback
The most important design decision in the router is the 2-second vLLM health check. When vLLM is scaling from zero — Karpenter is provisioning the GPU node, the init container is downloading weights, or the model is loading into GPU memory — the platform is still available to callers because the router falls back to cloud.
The fallback is silent: the caller gets a response, not a 503. The only visible difference is the X-Router-Backend: cloud header where they might have expected local. For most callers in a hybrid platform, this is acceptable. The expectation is best-effort local inference, cloud as the reliable backstop.
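The interaction between the routing verdict and the health check fits in a few lines. The sketch below is a simplified stand-in for the logic in chat_completions, with the health result passed in as a boolean so it can be reasoned about without a running vLLM; choose_backend and the abbreviated model sets are illustrative assumptions.

```python
LOCAL_MODELS = {"qwen2.5-7b", "local"}
CLOUD_MODELS = {"cloud", "claude"}  # abbreviated for illustration


def choose_backend(model: str, prefers_local: bool, local_ready: bool) -> str:
    """Return "local" or "cloud" for one request.

    prefers_local is the heuristic verdict used for "auto"; local_ready is
    the result of the vLLM health check.
    """
    if model in CLOUD_MODELS:
        return "cloud"
    wants_local = model in LOCAL_MODELS or prefers_local
    # The health check is the last gate: an unready vLLM sends even an
    # explicit "local" request to the cloud instead of failing it.
    return "local" if wants_local and local_ready else "cloud"


print(choose_backend("local", prefers_local=False, local_ready=True))   # local
print(choose_backend("local", prefers_local=False, local_ready=False))  # cloud
print(choose_backend("auto", prefers_local=True, local_ready=False))    # cloud
```

If some callers must never leave the cluster (for data-residency reasons, say), this is the place where a stricter policy would return an error instead of falling back.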
The 2-second timeout is deliberately short. A vLLM instance that is healthy responds to /health in under 100 ms. A vLLM instance that is starting up or has crashed will either refuse the connection immediately (returning in milliseconds) or hang. Two seconds is long enough to distinguish "healthy" from "not healthy" while capping the extra latency a request pays at two seconds in the worst case, when the check hangs. If your network has higher latency between router and vLLM (cross-AZ, for example), tune VLLM_HEALTH_TIMEOUT upward.
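The two failure modes, fast refusal and slow hang, can be reproduced with a dependency-free probe. This sketch uses the standard library's asyncio streams instead of httpx so it runs anywhere; probe_health is a hypothetical helper, not the router's vllm_is_ready, and it speaks just enough HTTP to read the status line.

```python
import asyncio


async def probe_health(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True only if GET /health answers 200 within `timeout` seconds."""

    async def _probe() -> bool:
        reader, writer = await asyncio.open_connection(host, port)
        try:
            request = (
                f"GET /health HTTP/1.1\r\nHost: {host}\r\n"
                "Connection: close\r\n\r\n"
            )
            writer.write(request.encode())
            await writer.drain()
            status_line = await reader.readline()  # e.g. b"HTTP/1.1 200 OK\r\n"
            parts = status_line.split()
            return len(parts) >= 2 and parts[1] == b"200"
        finally:
            writer.close()

    try:
        # One deadline covers connect, send, and read, so a hung server
        # costs at most `timeout` seconds.
        return await asyncio.wait_for(_probe(), timeout)
    except (OSError, asyncio.TimeoutError):
        # A refused connection fails fast; a hang ends at the deadline.
        return False
```

Calling asyncio.run(probe_health("127.0.0.1", 9)) against a port with no listener typically returns False almost immediately, while a server that accepts the connection but never answers returns False only once the timeout expires.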
CDK DockerImageAsset for the router image
DockerImageAsset builds the Docker image from the specified directory during cdk deploy and pushes it to the CDK-managed ECR repository created during cdk bootstrap. The resulting image.imageUri is tagged with a hash of the asset's build inputs (<ecr-repo>:<asset-hash>), not a floating tag such as latest. When anything in router/ changes, the hash changes, and with it the image URI in the Deployment manifest, so cdk deploy rolls out new pods. You never accidentally run stale code because of a floating latest tag.
The tradeoff is that cdk deploy HybridLlmRouter builds the Docker image on the machine where CDK runs. In CI/CD, the build agent needs Docker and credentials to push to ECR. For a team deploying from multiple machines, a separate image build pipeline that writes a digest to a config value CDK reads is more scalable. For this tutorial, the bundled build is the simplest correct setup.
CPU pool placement
The router runs on the system node pool (the on-demand m7i.xlarge instances from Part 3), not on the GPU pool. The routing logic is pure network I/O — it reads a request, makes a fast decision, and proxies the bytes. It has no GPU requirement and would waste expensive GPU instances. The nodeAffinity rule on node.kubernetes.io/purpose: system keeps the router pods on system nodes alongside other CPU workloads like CoreDNS and the load balancer controller.
Two replicas and a PodDisruptionBudget
The router is in the critical request path: if all router pods are down, the entire platform is unavailable. Two replicas is the minimum for HA. Note that the manifest above does not by itself force the two pods into different availability zones; add a topology spread constraint or pod anti-affinity if you need that guarantee. The PodDisruptionBudget with minAvailable: 1 means a voluntary disruption such as a node upgrade or Karpenter consolidation cannot evict both router pods at once: an eviction is refused if it would leave fewer than one healthy pod.
The router is stateless (no in-flight request state beyond the HTTP connection), so rolling updates are safe: old and new versions can serve requests simultaneously during a deploy. Rolling updates are governed by the Deployment's update strategy rather than by the PodDisruptionBudget, which applies only to evictions.
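The PodDisruptionBudget arithmetic is simple enough to check by hand. The sketch below only illustrates how minAvailable bounds voluntary evictions; allowed_disruptions is a hypothetical helper, not a Kubernetes API.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Number of pods that may be evicted right now without violating the PDB."""
    return max(0, healthy_pods - min_available)


# Two healthy router pods, minAvailable: 1 -> one eviction may proceed.
print(allowed_disruptions(healthy_pods=2, min_available=1))  # 1
# While the replacement pod is still starting, only one is healthy -> none.
print(allowed_disruptions(healthy_pods=1, min_available=1))  # 0
```

With a single replica the budget would never allow an eviction, which is why two replicas is the practical floor for this setup.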
API key management: kubectl secret over CDK
The Anthropic API key is created as a Kubernetes Secret with kubectl, not via CDK or CloudFormation. This is a deliberate choice. CDK stacks are stored as CloudFormation templates in S3; any value passed to addManifest or a Helm values map becomes part of that template in plaintext (before CloudFormation encryption). An API key in a CDK manifest value ends up in CloudFormation template storage where it is harder to rotate, audit, and control access to.
Creating the secret manually and referencing it by name keeps the key out of IaC state. The Deployment spec references router-api-keys by name; CDK does not know its contents.
For production, use AWS Secrets Manager: store the key there, grant an IRSA role secretsmanager:GetSecretValue on that secret ARN, and use the External Secrets Operator to sync it into a Kubernetes Secret automatically. That adds one more component but gives you rotation, audit logging, and no manual kubectl steps during deploys.
The ALB Ingress and HTTPS
The RouterIngress annotation alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443},{"HTTP":80}]' combined with ssl-redirect: "443" configures the ALB to:
- Accept HTTPS on 443 (TLS terminated at the ALB)
- Accept HTTP on 80 and redirect it to 443
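- Forward decrypted traffic to the router pods on port 8080 (with target-type ip, the ALB targets the pod IPs at the Service's targetPort, not the Service port 80)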
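For HTTPS to work, the ALB needs a TLS certificate. The alb.ingress.kubernetes.io/certificate-arn annotation (not shown above) points to an ACM certificate ARN. Without a certificate the controller has nothing to attach to the 443 listener, so HTTPS will not work; depending on the controller version, the Ingress may fail to reconcile until you either add a certificate or remove the HTTPS entry from listen-ports along with ssl-redirect. An HTTP-only ALB is acceptable for internal development, not for production. Add the annotation once you have an ACM certificate for your domain: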
alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:<ACCOUNT>:certificate/<UUID>
Deploy the Router Stack
Step 1: Create the API key Secret
Create the Kubernetes Secret manually before running cdk deploy. This keeps the key out of CDK/CloudFormation state:
# Replace with your actual Anthropic API key.
kubectl create namespace router 2>/dev/null || true
kubectl -n router create secret generic router-api-keys \
--from-literal=ANTHROPIC_API_KEY=sk-ant-...
The namespace must exist before creating the Secret. The CDK deploy will create the namespace via addManifest, but you need it now for the kubectl create secret command. Creating it manually first is fine — addManifest will no-op when the namespace already exists.
Step 2: Deploy the stack
cdk deploy HybridLlmRouter
CDK builds the router Docker image, pushes it to ECR, and applies the Kubernetes manifests. The deploy takes 2–3 minutes. You can watch the router pods start:
kubectl -n router get pods -w
Once both pods are Running, check the ALB address:
kubectl -n router get ingress router
# ADDRESS column shows the ALB DNS name once provisioned (~60s after deploy).
Verify End-to-End
ALB=$(kubectl -n router get ingress router -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
# Route to local vLLM (explicit model name). The backend is reported in the
# X-Router-Backend response header, not in the JSON body, so dump the headers
# to stderr with -D and let jq read only the body.
curl -s -D /dev/stderr http://${ALB}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"messages": [{"role": "user", "content": "What is the capital of France?"}],
"max_tokens": 50
}' | jq '.choices[0].message.content'
# Route to Claude (explicit model name).
curl -s http://${ALB}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "claude",
"messages": [{"role": "user", "content": "Explain the CAP theorem in depth."}],
"max_tokens": 500
}' | jq '.choices[0].message.content'
# Auto-routing: short prompt → local; long prompt → cloud.
curl -s http://${ALB}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [{"role": "user", "content": "Classify the sentiment: Great product!"}],
"max_tokens": 10
}' -v 2>&1 | grep "x-router-backend"
# Expected: x-router-backend: local
Verify cold-start fallback
Scale vLLM to zero and confirm the router falls back to Claude:
kubectl -n inference scale deployment/vllm --replicas=0
curl -s http://${ALB}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"local","messages":[{"role":"user","content":"Hello."}],"max_tokens":20}' \
-v 2>&1 | grep "x-router-backend"
# Expected: x-router-backend: cloud (vLLM was not ready)
# Restore the replica.
kubectl -n inference scale deployment/vllm --replicas=1
The fallback is silent from the caller's perspective — the response arrives with content, not a 503. The x-router-backend: cloud header is the only signal that the local model was bypassed.
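Application code can read the same signal programmatically. The following is a minimal client sketch using only the standard library; ask is a hypothetical helper, and the base URL you pass it is whatever address your ALB or domain resolves to.

```python
import json
import urllib.request
from typing import Optional


def ask(base_url: str, payload: dict, timeout: float = 60.0) -> tuple[Optional[str], str]:
    """POST a chat completion and return (backend header, assistant text)."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Header lookup is case-insensitive; the body never carries this value.
        backend = response.headers.get("X-Router-Backend")
        body = json.load(response)
    return backend, body["choices"][0]["message"]["content"]


# Example call (assumes the router is reachable at this hypothetical address):
# backend, answer = ask("http://my-router.example.com", {
#     "model": "local",
#     "messages": [{"role": "user", "content": "Hello."}],
#     "max_tokens": 20,
# })
# A backend of "cloud" for a "local" request means the fallback fired.
```

Logging the returned backend value alongside each request is a cheap way to spot how often the fallback fires before the full observability stack is in place.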
Tearing Down
# Delete the Ingress before destroying the stack — the ALB controller created
# a real ALB that CDK does not own and will not delete automatically.
kubectl -n router delete ingress router
cdk destroy HybridLlmRouter
cdk destroy HybridLlmInference
cdk destroy HybridLlmAddons
cdk destroy HybridLlmNodeGroups
cdk destroy HybridLlmCluster
cdk destroy HybridLlmNetwork
What's Next
The platform now has all three layers working together: GPU inference via vLLM, KEDA-driven autoscaling that drives GPU nodes to zero overnight, and a hybrid router that splits traffic between the local model and Claude based on request characteristics and backend health.
What the platform lacks is visibility into how that split is actually working in production. You can see the X-Router-Backend header on individual responses, but you cannot answer: What fraction of requests went local vs. cloud this week? What is the p95 latency for local inference? What does GPU utilization look like across the Spot fleet? How much did the Anthropic API cost this month relative to the vLLM GPU spend?
In Part 7 we wire up the observability stack: OpenTelemetry traces through the router, Prometheus and Grafana for GPU and vLLM metrics, and Langfuse for per-request token and cost telemetry — connecting back to the patterns in LLM Observability on Kubernetes. With that stack in place, the routing heuristics become tunable by data rather than guesswork.