In Part 5 we deployed vLLM and wired KEDA to scale replicas with queue depth. The model server is running and answering requests — but every caller reaches it at a ClusterIP address with no routing in front of it. There is nothing deciding whether a given request belongs on the local GPU or on a cloud API, and there is nothing handling the gap when vLLM is warming up from zero.

Part 6 builds the hybrid router: a lightweight TypeScript service built on Hono that sits in front of both backends, accepts the same OpenAI-compatible request format the rest of the platform uses, and routes each request to the right backend at runtime. The router uses model-name overrides for explicit control and prompt-complexity heuristics for automatic decisions. It checks vLLM health before every local route and transparently falls back to Claude when the local model is still loading. The whole platform is then exposed behind an internet-facing ALB via the load balancer controller from Part 4.

When this part is done: a single HTTPS endpoint accepts inference requests, routes them to the GPU model server or to the Anthropic API depending on the request, and the caller sees a consistent OpenAI-formatted response regardless of which backend answered it.

What the Router Does

The router is the only component in the platform that knows both backends exist. Everything above it — clients, integration tests, the observability stack in Part 7 — speaks the OpenAI wire format and does not know whether the response came from vLLM or Claude.

The routing decision is made per request on three criteria, evaluated in order:

Explicit model name. If the request specifies "model": "qwen2.5-7b" or "model": "local", it goes to vLLM. If it specifies a Claude model name or "model": "cloud", it goes to the Anthropic API. Explicit trumps everything else.
Heuristics for "model": "auto". Estimate prompt token count (character count ÷ 4) and check max_tokens. Short prompts with bounded output — classification, extraction, summarization, code generation on a known schema — go local. Long prompts or large requested outputs — complex reasoning, multi-step planning, long-form generation — go to cloud.
Health-based fallback. Before forwarding to vLLM, the router checks /health with a short timeout. If vLLM is starting up or scaled to zero, the check fails and the request routes to cloud transparently. The caller gets a response, not a 503.

The router adds an X-Router-Backend header to every response — local or cloud — so the observability layer and clients can see which path was taken without parsing the response body.
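A caller that wants to record the routing decision can read the header off the response. The following is a minimal client-side sketch, not part of the router itself; the helper name backendOf and the HeaderReader type are hypothetical, introduced here only for illustration.

```typescript
// Minimal structural type for anything with a Headers-style get(), such as
// the `headers` property of a fetch Response.
type HeaderReader = { get(name: string): string | null };

type Backend = "local" | "cloud" | "unknown";

// Hypothetical helper: reads X-Router-Backend and normalises the value.
// Lookups through the Fetch Headers API are case-insensitive.
function backendOf(headers: HeaderReader): Backend {
  const value = headers.get("x-router-backend")?.trim().toLowerCase();
  if (value === "local") return "local";
  if (value === "cloud") return "cloud";
  return "unknown"; // header missing or carrying an unexpected value
}
```

With a fetch response in hand, backendOf(res.headers) returns the path taken without touching the response body.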

The Router Application

router/
├── src/
│   └── index.ts
├── package.json
├── package-lock.json
├── tsconfig.json
└── Dockerfile

src/index.ts

import { Hono } from "hono";
import { serve } from "@hono/node-server";
import Anthropic from "@anthropic-ai/sdk";

const VLLM_BASE_URL =
  process.env.VLLM_BASE_URL ?? "http://vllm.inference.svc.cluster.local";
const ANTHROPIC_API_KEY = process.env.ANTHROPIC_API_KEY ?? "";
const VLLM_HEALTH_TIMEOUT_MS = Number(process.env.VLLM_HEALTH_TIMEOUT ?? "2000");
const LOCAL_TOKEN_THRESHOLD = Number(process.env.LOCAL_TOKEN_THRESHOLD ?? "1500");
const LOCAL_MAX_TOKENS_THRESHOLD = Number(
  process.env.LOCAL_MAX_TOKENS_THRESHOLD ?? "1024",
);

const LOCAL_MODELS = new Set(["qwen2.5-7b", "local"]);
const CLOUD_MODELS: Record<string, string> = {
  cloud:                       "claude-opus-4-8",
  claude:                      "claude-opus-4-8",
  "claude-opus-4-8":           "claude-opus-4-8",
  "claude-sonnet-4-6":         "claude-sonnet-4-6",
  "claude-haiku-4-5-20251001": "claude-haiku-4-5-20251001",
};

const anthropic = new Anthropic({ apiKey: ANTHROPIC_API_KEY });

interface Message {
  role: string;
  content: string;
}

interface ChatBody {
  model?: string;
  messages: Message[];
  max_tokens?: number;
  stream?: boolean;
}

async function vllmIsReady(): Promise<boolean> {
  const ctrl = new AbortController();
  const timer = setTimeout(() => ctrl.abort(), VLLM_HEALTH_TIMEOUT_MS);
  try {
    const res = await fetch(`${VLLM_BASE_URL}/health`, { signal: ctrl.signal });
    return res.ok;
  } catch {
    return false;
  } finally {
    clearTimeout(timer);
  }
}

function estimateTokens(messages: Message[]): number {
  return messages.reduce((sum, m) => sum + (m.content?.length ?? 0), 0) / 4;
}

function routeToLocal(
  model: string,
  messages: Message[],
  maxTokens: number,
): boolean {
  if (LOCAL_MODELS.has(model)) return true;
  if (model in CLOUD_MODELS) return false;
  // "auto" or any unrecognised name — apply heuristics.
  return (
    estimateTokens(messages) < LOCAL_TOKEN_THRESHOLD &&
    maxTokens <= LOCAL_MAX_TOKENS_THRESHOLD
  );
}

async function forwardToVllm(body: ChatBody, isStream: boolean): Promise<Response> {
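  // Assumes Part 5 serves the model under the name "qwen2.5-7b". Aliases such
  // as "local" or "auto" are rewritten to it, because vLLM rejects model names
  // it does not serve.
  if (body.model !== "qwen2.5-7b") {
    body = { ...body, model: "qwen2.5-7b" };
  }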
  const upstream = await fetch(`${VLLM_BASE_URL}/v1/chat/completions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  // Pipe the upstream body directly — works for both streaming and non-streaming.
  return new Response(upstream.body, {
    status: upstream.status,
    headers: {
      "Content-Type": isStream ? "text/event-stream" : "application/json",
      "X-Router-Backend": "local",
    },
  });
}

async function forwardToAnthropic(
  body: ChatBody,
  isStream: boolean,
): Promise<Response> {
  const claudeModel =
    CLOUD_MODELS[body.model ?? "claude"] ?? "claude-sonnet-4-6";
  const messages = body.messages ?? [];
  const maxTokens = body.max_tokens ?? 1024;

  const system = messages.find((m) => m.role === "system")?.content;
  const turns = messages
    .filter((m) => m.role !== "system")
    .map((m) => ({
      role: m.role as "user" | "assistant",
      content: m.content,
    }));

  const params: Anthropic.MessageCreateParamsNonStreaming = {
    model: claudeModel,
    messages: turns,
    max_tokens: maxTokens,
    ...(system ? { system } : {}),
  };

  if (isStream) {
    const enc = new TextEncoder();
    const readable = new ReadableStream({
      async start(ctrl) {
        const s = anthropic.messages.stream(params);
        for await (const event of s) {
          if (
            event.type === "content_block_delta" &&
            event.delta.type === "text_delta"
          ) {
            const chunk = JSON.stringify({
              choices: [
                { delta: { content: event.delta.text }, finish_reason: null },
              ],
            });
            ctrl.enqueue(enc.encode(`data: ${chunk}\n\n`));
          }
        }
        ctrl.enqueue(enc.encode("data: [DONE]\n\n"));
        ctrl.close();
      },
    });
    return new Response(readable, {
      headers: {
        "Content-Type": "text/event-stream",
        "X-Router-Backend": "cloud",
      },
    });
  }

  const resp = await anthropic.messages.create(params);
  const openAiResp = {
    id: resp.id,
    object: "chat.completion",
    model: resp.model,
    choices: [
      {
        index: 0,
        message: {
          role: "assistant",
          content:
            resp.content[0].type === "text" ? resp.content[0].text : "",
        },
        finish_reason: resp.stop_reason,
      },
    ],
    usage: {
      prompt_tokens: resp.usage.input_tokens,
      completion_tokens: resp.usage.output_tokens,
      total_tokens: resp.usage.input_tokens + resp.usage.output_tokens,
    },
  };
  return new Response(JSON.stringify(openAiResp), {
    headers: {
      "Content-Type": "application/json",
      "X-Router-Backend": "cloud",
    },
  });
}

const app = new Hono();

app.get("/health", (c) => c.json({ status: "ok" }));

app.post("/v1/chat/completions", async (c) => {
  const body = await c.req.json<ChatBody>();
  const model = body.model ?? "auto";
  const messages = body.messages ?? [];
  const maxTokens = body.max_tokens ?? 512;
  const isStream = body.stream ?? false;

  let useLocal = routeToLocal(model, messages, maxTokens);

  if (useLocal && !(await vllmIsReady())) {
    // vLLM is cold-starting or scaled to zero — route to cloud silently.
    useLocal = false;
    body.model = "claude-sonnet-4-6";
  }

  return useLocal
    ? forwardToVllm(body, isStream)
    : forwardToAnthropic(body, isStream);
});

serve(
  { fetch: app.fetch, port: Number(process.env.PORT ?? 8080) },
  (info) => console.log(`Router listening on :${info.port}`),
);

package.json

{
  "name": "hybrid-llm-router",
  "version": "1.0.0",
  "scripts": {
    "dev": "tsx watch src/index.ts",
    "build": "tsc",
    "start": "node dist/index.js"
  },
  "dependencies": {
    "@anthropic-ai/sdk": "^0.40.0",
    "@hono/node-server": "^1.13.0",
    "hono": "^4.6.0"
  },
  "devDependencies": {
    "@types/node": "^22.0.0",
    "tsx": "^4.19.0",
    "typescript": "^5.6.0"
  }
}

tsconfig.json

{
  "compilerOptions": {
    "target": "ES2022",
    "module": "NodeNext",
    "moduleResolution": "NodeNext",
    "outDir": "dist",
    "rootDir": "src",
    "strict": true,
    "esModuleInterop": true,
    "skipLibCheck": true
  },
  "include": ["src"]
}

Dockerfile

FROM node:22-alpine AS builder
WORKDIR /app
COPY package*.json tsconfig.json ./
RUN npm ci
COPY src/ ./src/
RUN npm run build

FROM node:22-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY package*.json ./
RUN npm ci --omit=dev
EXPOSE 8080
CMD ["node", "dist/index.js"]

A Sixth CDK Stack: Router

The router stack builds the container image from the router/ directory, pushes it to the ECR repository that cdk bootstrap created, creates the router namespace, and deploys the workload manifests. The ALB Ingress at the end exposes the router to the public internet.

// lib/router-stack.ts
import * as path from "path";
import * as cdk from "aws-cdk-lib";
import { Construct } from "constructs";
import * as eks from "aws-cdk-lib/aws-eks";
import * as iam from "aws-cdk-lib/aws-iam";
import * as assets from "aws-cdk-lib/aws-ecr-assets";
import { config } from "./config";

interface RouterStackProps extends cdk.StackProps {
  cluster: eks.Cluster;
  oidcProvider: iam.OpenIdConnectProvider;
}

export class RouterStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props: RouterStackProps) {
    super(scope, id, props);

    const image = this.buildRouterImage();
    this.deployRouter(props.cluster, image);
  }

  private buildRouterImage(): assets.DockerImageAsset {
    // CDK builds the image from router/ and pushes it to the CDK-managed ECR
    // repository created by cdk bootstrap. The imageUri is passed to the
    // Deployment manifest so CDK re-deploys when the image's asset hash changes.
    return new assets.DockerImageAsset(this, "RouterImage", {
      directory: path.join(__dirname, "../router"),
    });
  }

  private deployRouter(
    cluster: eks.Cluster,
    image: assets.DockerImageAsset,
  ): void {
    const ns = cluster.addManifest("RouterNamespace", {
      apiVersion: "v1",
      kind: "Namespace",
      metadata: { name: "router" },
    });

    // The Anthropic API key lives in a Kubernetes Secret created manually
    // with kubectl (see "Deploy" section below) — it never enters CDK or
    // CloudFormation state.
    const deployment = cluster.addManifest("RouterDeployment", {
      apiVersion: "apps/v1",
      kind: "Deployment",
      metadata: { name: "router", namespace: "router", labels: { app: "router" } },
      spec: {
        replicas: 2,
        selector: { matchLabels: { app: "router" } },
        template: {
          metadata: { labels: { app: "router" } },
          spec: {
            // The router is a CPU workload — no GPU affinity or toleration needed.
            // It lands on the system node pool from Part 3.
            affinity: {
              nodeAffinity: {
                requiredDuringSchedulingIgnoredDuringExecution: {
                  nodeSelectorTerms: [{
                    matchExpressions: [{
                      key: "node.kubernetes.io/purpose",
                      operator: "In",
                      values: ["system"],
                    }],
                  }],
                },
              },
            },
            containers: [{
              name: "router",
              image: image.imageUri,
              ports: [{ containerPort: 8080, name: "http" }],
              env: [
                {
                  name: "VLLM_BASE_URL",
                  value: "http://vllm.inference.svc.cluster.local",
                },
                {
                  name: "ANTHROPIC_API_KEY",
                  valueFrom: {
                    secretKeyRef: {
                      name: "router-api-keys",
                      key: "ANTHROPIC_API_KEY",
                    },
                  },
                },
                { name: "VLLM_HEALTH_TIMEOUT", value: "2000" },
                { name: "LOCAL_TOKEN_THRESHOLD", value: "1500" },
                { name: "LOCAL_MAX_TOKENS_THRESHOLD", value: "1024" },
              ],
              resources: {
                requests: { cpu: "250m", memory: "256Mi" },
                limits:   { cpu: "1",    memory: "512Mi" },
              },
              readinessProbe: {
                httpGet: { path: "/health", port: 8080 },
                initialDelaySeconds: 5,
                periodSeconds: 10,
              },
              livenessProbe: {
                httpGet: { path: "/health", port: 8080 },
                initialDelaySeconds: 10,
                periodSeconds: 30,
              },
            }],
          },
        },
      },
    });
    deployment.node.addDependency(ns);

    cluster.addManifest("RouterService", {
      apiVersion: "v1",
      kind: "Service",
      metadata: { name: "router", namespace: "router" },
      spec: {
        selector: { app: "router" },
        ports: [{ name: "http", port: 80, targetPort: 8080 }],
        type: "ClusterIP",
      },
    }).node.addDependency(ns);

    cluster.addManifest("RouterPdb", {
      apiVersion: "policy/v1",
      kind: "PodDisruptionBudget",
      metadata: { name: "router", namespace: "router" },
      spec: {
        minAvailable: 1,
        selector: { matchLabels: { app: "router" } },
      },
    }).node.addDependency(ns);

    // The ALB Ingress provisions a real internet-facing Application Load Balancer
    // using the controller installed in Part 4.
    cluster.addManifest("RouterIngress", {
      apiVersion: "networking.k8s.io/v1",
      kind: "Ingress",
      metadata: {
        name: "router",
        namespace: "router",
        annotations: {
          "kubernetes.io/ingress.class": "alb",
          "alb.ingress.kubernetes.io/scheme": "internet-facing",
          "alb.ingress.kubernetes.io/target-type": "ip",
          "alb.ingress.kubernetes.io/healthcheck-path": "/health",
          "alb.ingress.kubernetes.io/listen-ports": '[{"HTTPS":443},{"HTTP":80}]',
          "alb.ingress.kubernetes.io/ssl-redirect": "443",
        },
      },
      spec: {
        rules: [{
          http: {
            paths: [{
              path: "/",
              pathType: "Prefix",
              backend: {
                service: { name: "router", port: { number: 80 } },
              },
            }],
          },
        }],
      },
    }).node.addDependency(ns);
  }
}

Update bin/app.ts to add the sixth stack:

// bin/app.ts
import * as cdk from "aws-cdk-lib";
import { NetworkStack } from "../lib/network-stack";
import { ClusterStack } from "../lib/cluster-stack";
import { NodeGroupStack } from "../lib/node-group-stack";
import { AddonsStack } from "../lib/addons-stack";
import { InferenceStack } from "../lib/inference-stack";
import { RouterStack } from "../lib/router-stack";
import { config } from "../lib/config";

const app = new cdk.App();
const env = { region: config.region };

const network = new NetworkStack(app, "HybridLlmNetwork", { env });

const cluster = new ClusterStack(app, "HybridLlmCluster", {
  env,
  vpc: network.vpc,
});

new NodeGroupStack(app, "HybridLlmNodeGroups", {
  env,
  cluster: cluster.cluster,
  nodeRole: cluster.nodeRole,
});

new AddonsStack(app, "HybridLlmAddons", {
  env,
  cluster: cluster.cluster,
  oidcProvider: cluster.oidcProvider,
});

new InferenceStack(app, "HybridLlmInference", {
  env,
  cluster: cluster.cluster,
  oidcProvider: cluster.oidcProvider,
});

new RouterStack(app, "HybridLlmRouter", {
  env,
  cluster: cluster.cluster,
  oidcProvider: cluster.oidcProvider,
});

Walking Through the Decisions

Hono for a Node.js routing service

Hono is a TypeScript-first web framework that runs on Node.js, Bun, Deno, and Cloudflare Workers using the same code. For a routing service like this one — receive a request, make a fast decision, proxy the bytes — it is the right fit. The API is small and the handler signature ((c: Context) => Response | Promise<Response>) works directly with the web-standard Request/Response types that Node 18+ ships natively.

That last point matters for streaming. Both forwardToVllm and forwardToAnthropic return a Response object whose body is a ReadableStream. Hono passes this through to the HTTP client without buffering. There is no async generator boilerplate, no custom streaming adapter — just a standard Response with a streaming body, which the Node.js HTTP layer already knows how to flush chunk by chunk.
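To make the streaming shape concrete, the sketch below frames text deltas as server-sent events the way an OpenAI-compatible stream does: one "data: <json>" line per chunk, a blank line after each event, and a final "data: [DONE]". The helper names (toChunk, sseFrame, frameDeltas) and the fuller chunk shape with id, object, and model are assumptions for illustration; the router code above emits a smaller chunk that carries only choices.

```typescript
// Shape of an OpenAI-style streaming chunk. The router above only sends
// `choices`; stricter clients may also expect id/object/model (assumption).
interface ChatChunk {
  id: string;
  object: "chat.completion.chunk";
  model: string;
  choices: {
    index: number;
    delta: { content?: string };
    finish_reason: string | null;
  }[];
}

function toChunk(id: string, model: string, text: string): ChatChunk {
  return {
    id,
    object: "chat.completion.chunk",
    model,
    choices: [{ index: 0, delta: { content: text }, finish_reason: null }],
  };
}

// One SSE event: "data: <payload>" followed by a blank line.
function sseFrame(payload: string): string {
  return `data: ${payload}\n\n`;
}

const SSE_DONE = sseFrame("[DONE]");

// Frames every delta as its own event and appends the terminator.
function frameDeltas(id: string, model: string, deltas: string[]): string {
  const events = deltas.map((d) => sseFrame(JSON.stringify(toChunk(id, model, d))));
  return events.join("") + SSE_DONE;
}
```

In the router, each framed event would be encoded with TextEncoder and enqueued on the ReadableStream rather than joined into one string; the joined form here only keeps the example easy to inspect.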

The multi-stage Dockerfile compiles TypeScript to dist/ during build and runs the output with plain node. No runtime TypeScript execution (ts-node, tsx) in production — the container runs compiled JavaScript.

The OpenAI wire format as the lingua franca

Every component in the platform — the router, the vLLM server, the observability stack in Part 7, any application that calls the platform — speaks the OpenAI /v1/chat/completions format. This is not an accident. The OpenAI format has become the de facto standard for LLM APIs: virtually every SDK, integration, and framework has a client that speaks it. Standardising on it here means swapping backends (upgrading from Qwen2.5-7B to a newer model, adding a second cloud provider) never touches the callers. The router is the translation layer; everything above and below it sees the same interface.

The Anthropic SDK returns a different response shape (content[].text rather than choices[].message.content). The forwardToAnthropic function translates it into OpenAI format before returning, so callers need no awareness that the response came from Claude. The X-Router-Backend header is how they can tell, if they want to — but they never have to.
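One field the translation above passes through unchanged is the stop reason: Anthropic reports values such as end_turn and max_tokens, while OpenAI clients generally expect stop and length. The following is a hedged sketch of a mapping. toFinishReason is a hypothetical helper that is not part of the router code, and treating unknown values as "stop" is an assumption of this sketch.

```typescript
type OpenAiFinishReason = "stop" | "length" | "tool_calls";

// Maps an Anthropic stop_reason onto the closest OpenAI finish_reason.
// Unknown or null values fall back to "stop" (an assumption of this sketch).
function toFinishReason(stopReason: string | null): OpenAiFinishReason {
  switch (stopReason) {
    case "max_tokens":
      return "length"; // output was cut off by the token limit
    case "tool_use":
      return "tool_calls";
    case "end_turn":
    case "stop_sequence":
    default:
      return "stop";
  }
}
```

In forwardToAnthropic this would replace the pass-through with finish_reason: toFinishReason(resp.stop_reason), so callers that branch on "length" behave the same for both backends.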

Routing heuristics: why token count

The heuristic of routing by estimated prompt token count and requested output tokens captures a real distinction in how these models are used on this platform.

Short prompt, bounded output — extracting a JSON field from a document, classifying a customer intent into one of ten categories, summarizing a paragraph in two sentences — is exactly what a 7B instruction-tuned model excels at. The task is pattern-matching and reformatting, not reasoning. Local inference is fast, cheap, and good enough.

Long prompt or large requested output — synthesizing information from multiple documents, multi-step code generation for an unfamiliar library, explaining a subtle bug across several files — benefits from the stronger reasoning and larger context window of a frontier model. Routing these to Claude means higher per-token cost but substantially better output quality for tasks that need it.

The thresholds (1500 estimated prompt tokens, 1024 max output tokens) are starting points, not absolutes. The environment variables LOCAL_TOKEN_THRESHOLD and LOCAL_MAX_TOKENS_THRESHOLD in the Deployment let you tune them without touching the router code or rebuilding the image; changing a value still rolls the router pods so they pick it up. In Part 7, tracing each request with its routing decision lets you observe whether the heuristics are correctly splitting work across the two backends.
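Because the thresholds arrive as strings, a malformed value can quietly change routing. Number("") evaluates to 0 and Number("abc") to NaN, and the ?? fallback in the router only applies when the variable is unset, so either value makes both comparisons in routeToLocal false and sends all "auto" traffic to cloud. The sketch below shows one way to guard against that; parseThreshold is a hypothetical helper, not something the router above defines.

```typescript
// Hypothetical helper: parses a numeric threshold from an environment-style
// string and falls back to a default when the value is missing, empty,
// non-numeric, or not strictly positive.
function parseThreshold(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  return Number.isFinite(n) && n > 0 ? n : fallback;
}
```

A call such as parseThreshold(process.env.LOCAL_TOKEN_THRESHOLD, 1500) would then stand in for the bare Number(...) conversion at the top of index.ts.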

The health check and cold-start fallback

The most important design decision in the router is the 2-second vLLM health check. When vLLM is scaling from zero — Karpenter is provisioning the GPU node, the init container is downloading weights, or the model is loading into GPU memory — the platform is still available to callers because the router falls back to cloud.

The fallback is silent: the caller gets a response, not a 503. The only visible difference is the X-Router-Backend: cloud header where they might have expected local. For most callers in a hybrid platform, this is acceptable. The expectation is best-effort local inference, cloud as the reliable backstop.

The 2-second timeout is deliberately short. A vLLM instance that is healthy responds to /health in under 100 ms. A vLLM instance that is starting up or has crashed will either refuse the connection immediately (returning in milliseconds) or hang. Two seconds is long enough to distinguish "healthy" from "not healthy" while capping the added latency at 2 seconds in the worst case, when the probe hangs before the request falls back to cloud. If your network has higher latency between router and vLLM (cross-AZ, for example), tune VLLM_HEALTH_TIMEOUT upward.
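Because the probe runs before every local-bound request, it adds one round trip per request. One option, not used by the router above, is to cache the result for a short time. The sketch below assumes a hypothetical HealthCache class and cachedReady wrapper. The tradeoff is that a cached "healthy" can be stale for up to the TTL, so a request may still be forwarded to a vLLM instance that has just gone away.

```typescript
// Remembers the last probe result for ttlMs milliseconds.
class HealthCache {
  private readonly ttlMs: number;
  private value: boolean | undefined = undefined;
  private checkedAt = 0;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Returns the cached result, or undefined if there is none or it expired.
  get(now: number): boolean | undefined {
    if (this.value === undefined) return undefined;
    return now - this.checkedAt < this.ttlMs ? this.value : undefined;
  }

  set(value: boolean, now: number): void {
    this.value = value;
    this.checkedAt = now;
  }
}

// Wraps any probe (for example the router's vllmIsReady) with the cache.
async function cachedReady(
  cache: HealthCache,
  probe: () => Promise<boolean>,
  now: () => number = Date.now,
): Promise<boolean> {
  const hit = cache.get(now());
  if (hit !== undefined) return hit;
  const fresh = await probe();
  cache.set(fresh, now());
  return fresh;
}
```

With a cache created once at startup, for instance new HealthCache(1000) as an arbitrary illustrative TTL, the handler would call cachedReady(cache, vllmIsReady) in place of vllmIsReady().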

CDK DockerImageAsset for the router image

DockerImageAsset builds the Docker image from the specified directory during cdk deploy and pushes it to the CDK-managed ECR repository created during cdk bootstrap. The resulting image.imageUri is tagged with the asset hash, a content hash of the build context (<ecr-repo>:<asset-hash>), not a floating tag. Any change under router/ produces a new hash and therefore a new image URI in the Deployment manifest, so cdk deploy rolls out a new Deployment exactly when the source changed. You never accidentally run stale code because of a floating latest tag.

The tradeoff is that cdk deploy HybridLlmRouter builds the Docker image on the machine where CDK runs. In CI/CD, the build agent needs Docker and credentials to push to ECR. For a team deploying from multiple machines, a separate image build pipeline that writes a digest to a config value CDK reads is more scalable. For this tutorial, the bundled build is the simplest correct setup.

CPU pool placement

The router runs on the system node pool (the on-demand m7i.xlarge instances from Part 3), not on the GPU pool. The routing logic is pure network I/O — it reads a request, makes a fast decision, and proxies the bytes. It has no GPU requirement and would waste expensive GPU instances. The nodeAffinity rule on node.kubernetes.io/purpose: system keeps the router pods on system nodes alongside other CPU workloads like CoreDNS and the load balancer controller.

Two replicas and a PodDisruptionBudget

The router is in the critical request path: if all router pods are down, the entire platform is unavailable. Two replicas is the minimum for HA. The manifest above does not pin them to different nodes or availability zones, so spreading relies on the scheduler's best-effort defaults. The PodDisruptionBudget with minAvailable: 1 means a node upgrade or Karpenter consolidation that drains a system node cannot evict both router pods at once: an eviction is refused unless at least one router pod stays available.

The router is stateless (no in-flight request state beyond the HTTP connection), so rolling updates are safe: old and new versions can serve requests simultaneously during a deploy. The PodDisruptionBudget does not govern rollouts. Those follow the Deployment's rolling-update strategy, whose defaults start a replacement pod before removing an old one at this replica count.
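The Deployment manifest shown earlier does not itself ask for zone or node spreading. If you want the scheduler to prefer placing the two replicas apart, a topologySpreadConstraints entry on the pod spec is one way to express it. The fragment below is a sketch under that assumption; the constant name is hypothetical, and the value would be added under spec.template.spec of the RouterDeployment manifest.

```typescript
// Pod-spec fragment: prefer spreading router pods across zones and nodes.
// "ScheduleAnyway" keeps this a preference; "DoNotSchedule" would make it a
// hard requirement and can leave a pod Pending when only one zone has room.
const routerTopologySpread = [
  {
    maxSkew: 1,
    topologyKey: "topology.kubernetes.io/zone",
    whenUnsatisfiable: "ScheduleAnyway",
    labelSelector: { matchLabels: { app: "router" } },
  },
  {
    maxSkew: 1,
    topologyKey: "kubernetes.io/hostname",
    whenUnsatisfiable: "ScheduleAnyway",
    labelSelector: { matchLabels: { app: "router" } },
  },
];
```

The label selector matches the app: router label the Deployment already sets, so no other manifest changes are needed for the constraint to apply.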

API key management: kubectl secret over CDK

The Anthropic API key is created as a Kubernetes Secret with kubectl, not via CDK or CloudFormation. This is a deliberate choice. CDK synthesizes each stack into a CloudFormation template, written to cdk.out locally and uploaded to the bootstrap S3 bucket on deploy; any value passed to addManifest or a Helm values map is embedded in that template in plaintext. An API key in a CDK manifest value therefore ends up in local build output, S3, and CloudFormation's stored template, where it is harder to rotate, audit, and control access to.

Creating the secret manually and referencing it by name keeps the key out of IaC state. The Deployment spec references router-api-keys by name; CDK does not know its contents.

For production, use AWS Secrets Manager: store the key there, grant an IRSA role secretsmanager:GetSecretValue on that secret ARN, and use the External Secrets Operator to sync it into a Kubernetes Secret automatically. That adds one more component but gives you rotation, audit logging, and no manual kubectl steps during deploys.
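As a rough illustration of that production setup, the External Secrets Operator is driven by an ExternalSecret resource that names a secret store and the remote key to sync. The manifest below is a sketch only: the store name aws-secrets-manager and the Secrets Manager key hybrid-llm/anthropic-api-key are hypothetical placeholders, the apiVersion depends on the operator release you install, and the store with its IRSA role has to exist separately.

```typescript
// Manifest sketch that could be passed to cluster.addManifest(...).
// Values marked hypothetical are placeholders, not names from this series.
const routerExternalSecret = {
  apiVersion: "external-secrets.io/v1beta1", // check against your operator version
  kind: "ExternalSecret",
  metadata: { name: "router-api-keys", namespace: "router" },
  spec: {
    refreshInterval: "1h",
    secretStoreRef: {
      kind: "ClusterSecretStore",
      name: "aws-secrets-manager", // hypothetical store name
    },
    // The Kubernetes Secret the router Deployment already references.
    target: { name: "router-api-keys" },
    data: [
      {
        secretKey: "ANTHROPIC_API_KEY",
        remoteRef: { key: "hybrid-llm/anthropic-api-key" }, // hypothetical
      },
    ],
  },
};
```

Only the reference to the secret appears in the manifest, so the key value itself still stays out of the CloudFormation template.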

The ALB Ingress and HTTPS

The RouterIngress annotation alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443},{"HTTP":80}]' combined with ssl-redirect: "443" configures the ALB to:

Accept HTTPS on 443 (TLS terminated at the ALB)
Accept HTTP on 80 and redirect it to 443
Forward decrypted traffic to router pods on port 80

For HTTPS to work, the ALB needs a TLS certificate, because an HTTPS listener cannot be created without one. The alb.ingress.kubernetes.io/certificate-arn annotation (not shown above) points to an ACM certificate ARN. For internal development without a certificate, drop the HTTPS entry from listen-ports and the ssl-redirect annotation so the ALB serves plain HTTP only; that is acceptable for development, not for production. Add the annotation once you have an ACM certificate for your domain:

alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:<ACCOUNT>:certificate/<UUID>
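One way to keep the development and production variants in a single stack is to derive the annotations from an optional certificate ARN. The sketch below assumes a hypothetical albAnnotations helper and a hypothetical optional config value holding the ARN; neither appears in the stack above. Without an ARN it returns an HTTP-only listener, and with one it adds the HTTPS listener, the redirect, and the certificate.

```typescript
// Hypothetical helper: builds the Ingress annotations for the router.
function albAnnotations(certificateArn?: string): Record<string, string> {
  const base: Record<string, string> = {
    "kubernetes.io/ingress.class": "alb",
    "alb.ingress.kubernetes.io/scheme": "internet-facing",
    "alb.ingress.kubernetes.io/target-type": "ip",
    "alb.ingress.kubernetes.io/healthcheck-path": "/health",
  };
  if (!certificateArn) {
    // Development: plain HTTP only, no redirect.
    return { ...base, "alb.ingress.kubernetes.io/listen-ports": '[{"HTTP":80}]' };
  }
  // Production: TLS terminated at the ALB, HTTP redirected to HTTPS.
  return {
    ...base,
    "alb.ingress.kubernetes.io/listen-ports": '[{"HTTPS":443},{"HTTP":80}]',
    "alb.ingress.kubernetes.io/ssl-redirect": "443",
    "alb.ingress.kubernetes.io/certificate-arn": certificateArn,
  };
}
```

The RouterIngress manifest would then set annotations: albAnnotations(certificateArn) in place of the literal annotations object.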

Deploy the Router Stack

Step 1: Create the API key Secret

Create the Kubernetes Secret manually before running cdk deploy. This keeps the key out of CDK/CloudFormation state:

# Replace with your actual Anthropic API key.
kubectl create namespace router 2>/dev/null || true
kubectl -n router create secret generic router-api-keys \
  --from-literal=ANTHROPIC_API_KEY=sk-ant-...

The namespace must exist before creating the Secret. The CDK deploy will create the namespace via addManifest, but you need it now for the kubectl create secret command. Creating it manually first is fine: addManifest applies manifests with kubectl apply, which succeeds when the namespace already exists.

Step 2: Deploy the stack

cdk deploy HybridLlmRouter

CDK builds the router Docker image, pushes it to ECR, and applies the Kubernetes manifests. The deploy takes 2–3 minutes. You can watch the router pods start:

kubectl -n router get pods -w

Once both pods are Running, check the ALB address:

kubectl -n router get ingress router
# ADDRESS column shows the ALB DNS name once provisioned (~60s after deploy).

Verify End-to-End

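# These commands use plain HTTP, which matches an HTTP-only development listener.
# With a certificate and ssl-redirect in place, port 80 answers with a redirect,
# so switch to https:// and the domain on your certificate.
ALB=$(kubectl -n router get ingress router -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')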

# Route to local vLLM (explicit model name).
curl -s http://${ALB}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 50
  }' | jq '.choices[0].message.content'

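# Check which backend answered. -D - prints the response headers to stdout and
# -o /dev/null discards the body, so the request stays a POST (-I would send
# HEAD, which conflicts with -d).
curl -s -o /dev/null -D - http://${ALB}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"Hi"}],"max_tokens":10}' \
  | grep -i x-router-backend
# Expected: x-router-backend: local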

# Route to Claude (explicit model name).
curl -s http://${ALB}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude",
    "messages": [{"role": "user", "content": "Explain the CAP theorem in depth."}],
    "max_tokens": 500
  }' | jq '.choices[0].message.content'

# Auto-routing: short prompt → local.
curl -s http://${ALB}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [{"role": "user", "content": "Classify the sentiment: Great product!"}],
    "max_tokens": 10
  }' -v 2>&1 | grep "x-router-backend"
# Expected: x-router-backend: local

Verify cold-start fallback

Scale vLLM to zero and confirm the router falls back to Claude:

kubectl -n inference scale deployment/vllm --replicas=0

curl -s http://${ALB}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"Hello."}],"max_tokens":20}' \
  -v 2>&1 | grep "x-router-backend"
# Expected: x-router-backend: cloud (vLLM was not ready)

# Restore the replica.
kubectl -n inference scale deployment/vllm --replicas=1

The fallback is silent from the caller's perspective — the response arrives with content, not a 503. The x-router-backend: cloud header is the only signal that the local model was bypassed.

Tearing Down

# Delete the Ingress first, while the load balancer controller is still running.
# The controller, not CloudFormation, owns the ALB, and it removes the ALB when
# the Ingress is deleted.
kubectl -n router delete ingress router

cdk destroy HybridLlmRouter
cdk destroy HybridLlmInference
cdk destroy HybridLlmAddons
cdk destroy HybridLlmNodeGroups
cdk destroy HybridLlmCluster
cdk destroy HybridLlmNetwork

What's Next

The platform now has all three layers working together: GPU inference via vLLM, KEDA-driven autoscaling that drives GPU nodes to zero overnight, and a hybrid router that splits traffic between the local model and Claude based on request characteristics and backend health.

What the platform lacks is visibility into how that split is actually working in production. You can see the X-Router-Backend header on individual responses, but you cannot answer: What fraction of requests went local vs. cloud this week? What is the p95 latency for local inference? What does GPU utilization look like across the Spot fleet? How much did the Anthropic API cost this month relative to the vLLM GPU spend?

In Part 7 we wire up the observability stack: OpenTelemetry traces through the router, Prometheus and Grafana for GPU and vLLM metrics, and Langfuse for per-request token and cost telemetry — connecting back to the patterns in LLM Observability on Kubernetes. With that stack in place, the routing heuristics become tunable by data rather than guesswork.