In Part 4 we installed the two add-ons every production EKS cluster needs: the AWS Load Balancer Controller for ALB provisioning, and Karpenter for node lifecycle management — including the GPU NodePool configured with consolidationPolicy: WhenEmpty that terminates GPU nodes the moment their last pod exits. The cluster now has everything except actual workloads.

Part 5 deploys the first real workload: vLLM model servers running on the GPU pool. vLLM is an OpenAI-compatible inference server that loads open-source model weights and serves them over the same API surface as the cloud providers, so anything that can talk to the OpenAI SDK can talk to a vLLM instance without changes. We load Qwen2.5-7B-Instruct weights from Amazon S3 via an init container, install KEDA to scale replicas based on live queue depth from Prometheus metrics, and add a cron trigger that keeps one replica warm during business hours and lets the fleet drop to zero overnight.

When this part is done: a request to the vLLM service returns a real generated response, Karpenter has provisioned a GPU node to schedule the replica, and Karpenter will terminate that node when KEDA scales replicas to zero at end of day.

What We Are Deploying

vLLM is a high-throughput inference server from UC Berkeley. Its key technical property is PagedAttention — a memory management algorithm that treats the GPU's KV cache like virtual memory, allowing far more concurrent requests than naive implementations that pre-allocate the maximum sequence length per request. For 7B-class models on a single A10G, vLLM sustains three to five times the token throughput of a sequential Hugging Face transformers pipeline. It also exposes /v1/chat/completions — the OpenAI wire format — which is how the hybrid router in Part 6 addresses it.

KEDA (Kubernetes Event-driven Autoscaling) extends the Kubernetes HPA with a plugin model for external metric sources: SQS queue depth, Redis list length, Prometheus metric values, and dozens more. For inference workloads the Prometheus scaler is the right fit: vLLM exposes request queue depth as vllm:num_requests_waiting, and KEDA scales replicas proportionally to that depth.

Model weights live in Amazon S3. We stage them there once with huggingface-cli and aws s3 sync, and vLLM pods download them at startup via an init container. The download runs over the S3 VPC gateway endpoint from Part 1 — private, fast, no NAT bandwidth charges.

Model Weights: Staging to S3

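Stage the model weights to S3 once, before the first vLLM pod starts. The target is the bucket that the inference stack below creates, so the bucket has to exist before the sync can run. Run this from any machine with the Hugging Face CLI and AWS credentials: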

# Quote the extras so shells such as zsh do not expand the brackets.
pip install "huggingface_hub[cli]"

# Download Qwen2.5-7B-Instruct locally (~14 GB at float16).
huggingface-cli download Qwen/Qwen2.5-7B-Instruct \
  --local-dir /tmp/Qwen2.5-7B-Instruct \
  --local-dir-use-symlinks False

# Sync to S3. Replace <ACCOUNT_ID> with your AWS account ID.
aws s3 sync /tmp/Qwen2.5-7B-Instruct \
  s3://hybrid-llm-model-weights-<ACCOUNT_ID>/models/Qwen2.5-7B-Instruct/

Once the weights are staged, every vLLM pod downloads them from the private S3 endpoint without Hugging Face rate limits or external gating.
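The sync target has to match the bucket name the CDK stack generates, which is built from config.clusterName and the account ID. The sketch below spells out that relationship. It assumes config.clusterName is "hybrid-llm" (inferred from the bucket name in the sync command above); the function names and the sample account ID are illustrative only.

```typescript
// Assumed value of config.clusterName, inferred from the sync command.
const CLUSTER_NAME = "hybrid-llm";

// Mirrors the bucketName template used by the inference stack.
function modelBucketName(clusterName: string, accountId: string): string {
  return `${clusterName}-model-weights-${accountId}`;
}

// Prefix that both `aws s3 sync` and the init container must agree on.
function modelPrefixUri(clusterName: string, accountId: string, modelDir: string): string {
  return `s3://${modelBucketName(clusterName, accountId)}/models/${modelDir}/`;
}

// "123456789012" is a placeholder account ID.
const weightsUri = modelPrefixUri(CLUSTER_NAME, "123456789012", "Qwen2.5-7B-Instruct");
// weightsUri is "s3://hybrid-llm-model-weights-123456789012/models/Qwen2.5-7B-Instruct/"
```

If the cluster name in your config differs, the bucket name in the sync command changes with it.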

A Fifth CDK Stack: Inference

The inference stack installs KEDA and a minimal Prometheus, creates the IRSA role that allows vLLM pods to read from S3, and deploys the vLLM workload manifests.

// lib/inference-stack.ts
import * as cdk from "aws-cdk-lib";
import { Construct } from "constructs";
import * as eks from "aws-cdk-lib/aws-eks";
import * as iam from "aws-cdk-lib/aws-iam";
import * as s3 from "aws-cdk-lib/aws-s3";
import { config } from "./config";

interface InferenceStackProps extends cdk.StackProps {
  cluster: eks.Cluster;
  oidcProvider: iam.OpenIdConnectProvider;
}

export class InferenceStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props: InferenceStackProps) {
    super(scope, id, props);

    const modelBucket = this.createModelBucket();
    const vllmRole = this.createVllmRole(props.cluster, props.oidcProvider, modelBucket);

    this.installKeda(props.cluster);
    this.installPrometheus(props.cluster);
    this.deployVllm(props.cluster, vllmRole, modelBucket);
  }

  private createModelBucket(): s3.Bucket {
    return new s3.Bucket(this, "ModelWeightsBucket", {
      bucketName: `${config.clusterName}-model-weights-${this.account}`,
      blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
      encryption: s3.BucketEncryption.S3_MANAGED,
      removalPolicy: cdk.RemovalPolicy.RETAIN,
    });
  }

  private createVllmRole(
    cluster: eks.Cluster,
    oidcProvider: iam.OpenIdConnectProvider,
    modelBucket: s3.Bucket
  ): iam.Role {
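    // The issuer is a deploy-time token: string methods such as replace() do not
    // act on its real value, and it cannot be used directly as a map key.
    // CfnJson defers building the condition map until deployment.
    const issuer = oidcProvider.openIdConnectProviderIssuer;
    const irsaConditions = new cdk.CfnJson(this, "VllmIrsaConditions", {
      value: {
        [`${issuer}:sub`]: "system:serviceaccount:inference:vllm",
        [`${issuer}:aud`]: "sts.amazonaws.com",
      },
    });

    const role = new iam.Role(this, "VllmRole", {
      assumedBy: new iam.WebIdentityPrincipal(
        oidcProvider.openIdConnectProviderArn,
        { StringEquals: irsaConditions }
      ),
      // IAM descriptions accept only Latin-1 characters, so keep this plain ASCII.
      description: "IRSA role for vLLM pods: S3 read for model weights",
    });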

    role.addToPolicy(
      new iam.PolicyStatement({
        actions: ["s3:GetObject", "s3:ListBucket"],
        resources: [modelBucket.bucketArn, `${modelBucket.bucketArn}/*`],
      })
    );

    return role;
  }

  private installKeda(cluster: eks.Cluster): void {
    const ns = cluster.addManifest("KedaNamespace", {
      apiVersion: "v1",
      kind: "Namespace",
      metadata: { name: "keda" },
    });

    const chart = cluster.addHelmChart("Keda", {
      chart: "keda",
      repository: "https://kedacore.github.io/charts",
      namespace: "keda",
      release: "keda",
      version: "2.17.0",
      values: {
        resources: {
          operator: {
            requests: { cpu: "100m", memory: "128Mi" },
            limits: { cpu: "500m", memory: "256Mi" },
          },
          metricServer: {
            requests: { cpu: "100m", memory: "128Mi" },
            limits: { cpu: "500m", memory: "256Mi" },
          },
        },
      },
    });

    chart.node.addDependency(ns);
  }

  private installPrometheus(cluster: eks.Cluster): void {
    const ns = cluster.addManifest("MonitoringNamespace", {
      apiVersion: "v1",
      kind: "Namespace",
      metadata: { name: "monitoring" },
    });

    // Minimal Prometheus — scrapes vLLM pods by annotation.
    // Part 7 expands this into the full kube-prometheus-stack with
    // dashboards, alerting, and GPU metrics.
    const chart = cluster.addHelmChart("Prometheus", {
      chart: "prometheus",
      repository: "https://prometheus-community.github.io/helm-charts",
      namespace: "monitoring",
      release: "prometheus",
      version: "25.27.0",
      values: {
        server: {
          resources: {
            requests: { cpu: "200m", memory: "512Mi" },
            limits: { cpu: "1", memory: "1Gi" },
          },
          retention: "7d",
          persistentVolume: { enabled: false },
        },
        extraScrapeConfigs: `
- job_name: vllm
  kubernetes_sd_configs:
    - role: pod
      namespaces:
        names: [inference]
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: "true"
    - source_labels: [__meta_kubernetes_pod_ip,
                      __meta_kubernetes_pod_annotation_prometheus_io_port]
      action: replace
      separator: ":"
      target_label: __address__
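    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
      action: replace
      target_label: __metrics_path__
      regex: (.+)
    # Attach the pod namespace as a label; the KEDA query filters on it.
    - source_labels: [__meta_kubernetes_namespace]
      action: replace
      target_label: namespace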
`,
        alertmanager: { enabled: false },
        "prometheus-node-exporter": { enabled: false },
        "kube-state-metrics": { enabled: false },
        pushgateway: { enabled: false },
      },
    });

    chart.node.addDependency(ns);
  }

  private deployVllm(
    cluster: eks.Cluster,
    role: iam.Role,
    modelBucket: s3.Bucket
  ): void {
    const ns = cluster.addManifest("InferenceNamespace", {
      apiVersion: "v1",
      kind: "Namespace",
      metadata: { name: "inference" },
    });

    const sa = cluster.addServiceAccount("VllmSA", {
      name: "vllm",
      namespace: "inference",
      annotations: { "eks.amazonaws.com/role-arn": role.roleArn },
    });
    sa.node.addDependency(ns);

    // set -e aborts the script if the sync fails, so the sentinel file is
    // only written after a successful download.
    const downloadScript = [
      "set -e",
      "if [ -f /model-cache/.download-complete ]; then",
      "  echo 'Model already present, skipping download.'; exit 0;",
      "fi",
      `aws s3 sync s3://${modelBucket.bucketName}/models/Qwen2.5-7B-Instruct/ /model-cache/Qwen2.5-7B-Instruct/ --no-progress`,
      "touch /model-cache/.download-complete",
    ].join("\n");

    const deployment = cluster.addManifest("VllmDeployment", {
      apiVersion: "apps/v1",
      kind: "Deployment",
      metadata: { name: "vllm", namespace: "inference", labels: { app: "vllm" } },
      spec: {
        replicas: 1,
        selector: { matchLabels: { app: "vllm" } },
        template: {
          metadata: {
            labels: { app: "vllm" },
            annotations: {
              "prometheus.io/scrape": "true",
              "prometheus.io/port": "8000",
              "prometheus.io/path": "/metrics",
            },
          },
          spec: {
            serviceAccountName: "vllm",
            terminationGracePeriodSeconds: 300,
            tolerations: [
              { key: "nvidia.com/gpu", operator: "Exists", effect: "NoSchedule" },
            ],
            affinity: {
              nodeAffinity: {
                requiredDuringSchedulingIgnoredDuringExecution: {
                  nodeSelectorTerms: [{
                    matchExpressions: [{
                      key: "node.kubernetes.io/purpose",
                      operator: "In",
                      values: ["gpu-inference"],
                    }],
                  }],
                },
              },
            },
            initContainers: [{
              name: "model-downloader",
              image: "amazon/aws-cli:2.22.0",
              command: ["sh", "-c"],
              args: [downloadScript],
              volumeMounts: [{ name: "model-cache", mountPath: "/model-cache" }],
              resources: {
                requests: { cpu: "1", memory: "4Gi" },
                limits: { cpu: "2", memory: "8Gi" },
              },
            }],
            containers: [{
              name: "vllm",
              image: "vllm/vllm-openai:v0.7.3",
              args: [
                "--model", "/model-cache/Qwen2.5-7B-Instruct",
                "--served-model-name", "qwen2.5-7b",
                "--max-model-len", "32768",
                "--tensor-parallel-size", "1",
                "--gpu-memory-utilization", "0.90",
                "--port", "8000",
              ],
              ports: [{ containerPort: 8000, name: "http" }],
              resources: {
                requests: { cpu: "4", memory: "16Gi", "nvidia.com/gpu": "1" },
                limits:   { cpu: "8", memory: "24Gi", "nvidia.com/gpu": "1" },
              },
              volumeMounts: [
                { name: "model-cache", mountPath: "/model-cache" },
                { name: "shm", mountPath: "/dev/shm" },
              ],
              readinessProbe: {
                httpGet: { path: "/health", port: 8000 },
                initialDelaySeconds: 60,
                periodSeconds: 10,
                failureThreshold: 30,
              },
              livenessProbe: {
                httpGet: { path: "/health", port: 8000 },
                initialDelaySeconds: 120,
                periodSeconds: 30,
                failureThreshold: 3,
              },
            }],
            volumes: [
              { name: "model-cache", emptyDir: { sizeLimit: "30Gi" } },
              { name: "shm", emptyDir: { medium: "Memory", sizeLimit: "8Gi" } },
            ],
          },
        },
      },
    });
    deployment.node.addDependency(sa);

    cluster.addManifest("VllmService", {
      apiVersion: "v1",
      kind: "Service",
      metadata: { name: "vllm", namespace: "inference" },
      spec: {
        selector: { app: "vllm" },
        ports: [{ name: "http", port: 80, targetPort: 8000 }],
        type: "ClusterIP",
      },
    }).node.addDependency(ns);

    cluster.addManifest("VllmPdb", {
      apiVersion: "policy/v1",
      kind: "PodDisruptionBudget",
      metadata: { name: "vllm", namespace: "inference" },
      spec: {
        maxUnavailable: 1,
        selector: { matchLabels: { app: "vllm" } },
      },
    }).node.addDependency(ns);

    cluster.addManifest("VllmScaledObject", {
      apiVersion: "keda.sh/v1alpha1",
      kind: "ScaledObject",
      metadata: { name: "vllm", namespace: "inference" },
      spec: {
        scaleTargetRef: { name: "vllm" },
        minReplicaCount: 0,
        maxReplicaCount: 4,
        pollingInterval: 15,
        cooldownPeriod: 300,
        triggers: [
          {
            // Scale up 1 replica per 5 waiting requests.
            type: "prometheus",
            metadata: {
              serverAddress: "http://prometheus-server.monitoring.svc.cluster.local",
              metricName: "vllm_inference_load",
              query: 'sum(vllm:num_requests_waiting{namespace="inference"})',
              threshold: "5",
              ignoreNullValues: "true",
            },
          },
          {
            // Keep 1 replica warm during business hours to avoid cold starts.
            // Off-hours both triggers request 0 → KEDA scales to 0.
            type: "cron",
            metadata: {
              timezone: "America/New_York",
              start: "0 8 * * 1-5",
              end:   "0 22 * * 1-5",
              desiredReplicas: "1",
            },
          },
        ],
      },
    }).node.addDependency(deployment);
  }
}

Update bin/app.ts to wire in the fifth stack:

// bin/app.ts
import * as cdk from "aws-cdk-lib";
import { NetworkStack } from "../lib/network-stack";
import { ClusterStack } from "../lib/cluster-stack";
import { NodeGroupStack } from "../lib/node-group-stack";
import { AddonsStack } from "../lib/addons-stack";
import { InferenceStack } from "../lib/inference-stack";
import { config } from "../lib/config";

const app = new cdk.App();
const env = { region: config.region };

const network = new NetworkStack(app, "HybridLlmNetwork", { env });

const cluster = new ClusterStack(app, "HybridLlmCluster", {
  env,
  vpc: network.vpc,
});

new NodeGroupStack(app, "HybridLlmNodeGroups", {
  env,
  cluster: cluster.cluster,
  nodeRole: cluster.nodeRole,
});

new AddonsStack(app, "HybridLlmAddons", {
  env,
  cluster: cluster.cluster,
  oidcProvider: cluster.oidcProvider,
});

new InferenceStack(app, "HybridLlmInference", {
  env,
  cluster: cluster.cluster,
  oidcProvider: cluster.oidcProvider,
});

Walking Through the Decisions

Why vLLM

The obvious question is why use vLLM instead of loading a model with the Hugging Face transformers library directly. The answer is throughput and memory efficiency.

transformers processes requests one at a time by default. Under any real concurrency, a second request waits for the first to complete before the GPU is touched. vLLM uses continuous batching: incoming requests join the active batch mid-generation, so the GPU is always working on as many requests as its memory allows simultaneously. On a single A10G running Qwen2.5-7B at float16, continuous batching multiplies throughput by three to five times compared to sequential inference.

The second advantage is PagedAttention. Standard attention implementations pre-allocate KV cache memory for the maximum sequence length of each request regardless of how long the actual output is — a request with max_tokens=4096 reserves 4096 tokens of KV cache even if it generates 20. PagedAttention allocates KV cache in pages and releases them as requests complete, meaning more concurrent requests can coexist in GPU memory without OOM errors.
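The difference is easy to put in numbers. The following is a back-of-the-envelope sketch, not vLLM's actual allocator: the model dimensions (28 layers, 4 key/value heads, head dimension 128) are assumptions taken from the published Qwen2.5-7B configuration, and the 16-token page size is assumed as well.

```typescript
// Rough KV-cache arithmetic for a single request.
interface KvConfig {
  layers: number;        // transformer layers
  kvHeads: number;       // key/value heads (grouped-query attention)
  headDim: number;       // dimension per head
  bytesPerValue: number; // 2 for float16
}

// Assumed Qwen2.5-7B dimensions; verify against the model's config.json.
const QWEN25_7B: KvConfig = { layers: 28, kvHeads: 4, headDim: 128, bytesPerValue: 2 };

// Each token stores one key and one value vector per KV head per layer.
function kvBytesPerToken(c: KvConfig): number {
  return 2 * c.layers * c.kvHeads * c.headDim * c.bytesPerValue;
}

// Naive scheme: reserve the full max_tokens budget up front.
function preallocatedBytes(c: KvConfig, maxTokens: number): number {
  return kvBytesPerToken(c) * maxTokens;
}

// Paged scheme: reserve whole pages only for tokens that already exist.
function pagedBytes(c: KvConfig, tokensSoFar: number, pageTokens: number): number {
  const pages = Math.ceil(tokensSoFar / pageTokens);
  return kvBytesPerToken(c) * pages * pageTokens;
}

const reservedNaive = preallocatedBytes(QWEN25_7B, 4096); // 234881024 bytes (224 MiB)
const reservedPaged = pagedBytes(QWEN25_7B, 20, 16);      // 1835008 bytes (1.75 MiB)
```

Under these assumptions the 20-token response holds two pages instead of a 4096-token reservation, and the difference stays available for other requests.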

vLLM also exposes /v1/chat/completions — the OpenAI wire format. The hybrid router in Part 6 sends requests using the same client code it uses for the cloud providers. No adapter layer needed.
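As an illustration of what "same wire format" means for a caller, here is a minimal sketch that builds the HTTP request for the vLLM Service defined in this stack. The helper and type names are hypothetical, and the in-cluster hostname is derived from the Service name and namespace used above.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// In-cluster address of the Service in this stack (name "vllm",
// namespace "inference", port 80).
const VLLM_BASE_URL = "http://vllm.inference.svc.cluster.local";

// Builds an OpenAI-style chat completion request for any compatible server.
function buildChatRequest(
  baseUrl: string,
  model: string,
  messages: ChatMessage[],
  maxTokens: number
): HttpRequest {
  return {
    url: `${baseUrl}/v1/chat/completions`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages, max_tokens: maxTokens }),
  };
}

const haikuRequest = buildChatRequest(
  VLLM_BASE_URL,
  "qwen2.5-7b",
  [{ role: "user", content: "Write a haiku." }],
  50
);
```

Pointing the same builder at a cloud provider only changes the base URL, the model name, and an authorization header; the body shape stays the same.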

Model weight storage: S3 over Hugging Face Hub

The init container could download weights with huggingface-cli download at startup. That works for one pod on a test cluster. It is problematic at scale:

Hugging Face Hub has rate limits and throttles burst downloads.
The model files (~14 GB for float16) cross the public internet without the private VPC endpoint, billing NAT data charges on every pod start.
Gated models (Llama 3.x requires accepting Meta's license) require a Hugging Face access token injected as a Secret, adding operational overhead.

Staging to S3 once and downloading over the private VPC gateway endpoint from Part 1 costs nothing in bandwidth, has no rate limits, and puts model distribution on infrastructure you control. The --no-progress flag reduces init container log noise.

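The .download-complete sentinel file is a small optimization. An emptyDir volume lives as long as its pod, so if the init container runs again for the same pod (for example when the pod sandbox is recreated after a node reboot), it exits immediately rather than re-downloading 14 GB. A container restart triggered by a liveness probe does not re-run init containers at all, and a replacement pod always starts with an empty volume and downloads again.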

emptyDir for model cache, not a PVC

We use emptyDir with a 30 GB size limit rather than a PersistentVolumeClaim. Model weights are read-only artifacts: vLLM reads from disk once at startup, then works entirely in GPU memory. A PVC persists across pods and nodes; an emptyDir lasts only for the lifetime of its pod, surviving container restarts but not pod replacement. For a workload that re-downloads for every new pod anyway, the PVC's persistence guarantee adds complexity without benefit.

The GPU instances from Part 3 have instance store NVMe storage (configured with instanceStorePolicy: RAID0 in the Karpenter EC2NodeClass). With that policy the instance store volumes are assembled into the node's ephemeral storage, and a disk-backed emptyDir uses that storage. A g5.xlarge instance store is a 250 GB NVMe: fast, local, and abundant for 7B model weights.

For multi-node tensor-parallel deployments (70B+ models split across GPUs on different nodes), persistent shared storage via Amazon EFS or FSx for Lustre becomes necessary. That is outside the scope of this series.

/dev/shm

The shm volume mounts a tmpfs filesystem at /dev/shm with 8 GB. PyTorch uses POSIX shared memory for inter-process communication between workers and for certain tensor operations. The default /dev/shm in a Kubernetes container is 64 MB — sized for a desktop OS, not a model server. vLLM workers writing large tensor slices to shared memory will hit this limit silently, causing crashes that are confusing to diagnose because nothing in the main process logs explains them.

Eight GB is a safe upper bound for 7B-class models. For larger models or higher parallelism, scale it proportionally to GPU memory size.

GPU resources: request equals limit

The vLLM container sets nvidia.com/gpu: "1" in both requests and limits. For GPU resources this is not a stylistic choice; it follows from how Kubernetes treats extended resources. GPU allocation is exclusive: a container requesting one GPU gets that GPU entirely, and no other container can use it. Extended resources cannot be overcommitted, so when both values are specified they must be equal, and a GPU request without a matching limit is rejected.

CPU and memory use requests for scheduling and limits as a ceiling. GPU uses both as a hard exclusive binding. Set them equal.
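A small pre-deploy check can catch a mismatch before the manifest reaches the cluster. This is a simplified sketch of the rule, not the API server's validation code; the function names are hypothetical, and "extended resource" is approximated as any resource name containing a domain prefix such as nvidia.com/.

```typescript
type ResourceList = Record<string, string>;

interface ContainerResources {
  requests?: ResourceList;
  limits?: ResourceList;
}

// Simplified: extended resources carry a domain prefix ("nvidia.com/gpu");
// native ones such as "cpu" and "memory" do not.
function isExtendedResource(name: string): boolean {
  return name.indexOf("/") !== -1;
}

// Returns a list of problems; an empty list means the resources are consistent.
function checkExtendedResources(r: ContainerResources): string[] {
  const requests = r.requests || {};
  const limits = r.limits || {};
  const problems: string[] = [];
  const seen: string[] = [];

  for (const name of Object.keys(requests).concat(Object.keys(limits))) {
    if (!isExtendedResource(name) || seen.indexOf(name) !== -1) continue;
    seen.push(name);
    if (limits[name] === undefined) {
      problems.push(`${name}: request set without a limit`);
    } else if (requests[name] !== undefined && requests[name] !== limits[name]) {
      problems.push(`${name}: request ${requests[name]} differs from limit ${limits[name]}`);
    }
  }
  return problems;
}

// The resources block from the vLLM container above passes the check.
const vllmProblems = checkExtendedResources({
  requests: { cpu: "4", memory: "16Gi", "nvidia.com/gpu": "1" },
  limits: { cpu: "8", memory: "24Gi", "nvidia.com/gpu": "1" },
});
```

CPU and memory are deliberately ignored here, since a request below the limit is normal for them.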

terminationGracePeriodSeconds: 300

When Kubernetes terminates a pod — whether from a KEDA scale-down, Karpenter consolidation, or a node drain — it sends SIGTERM to the container and waits for terminationGracePeriodSeconds before sending SIGKILL. vLLM v0.7+ handles SIGTERM gracefully: it stops accepting new requests and drains the running batch before exiting.

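A 5-minute (300s) grace period covers most in-flight requests at Qwen2.5-7B scale. Single-stream decoding of a 7B float16 model on an A10G is bounded by memory bandwidth at roughly 40 tokens per second (about 600 GB/s divided by about 15 GB of weights), so a max_tokens=4096 generation needs at least around 100 seconds, and longer under heavy batching; 300 seconds leaves headroom for worst-case token counts and high-parallelism batches.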

Without a sufficient grace period, a scale-down event kills the container mid-token, the client receives a truncated response, and the request must be retried. Set the grace period to the maximum plausible inference duration, not the minimum.

Why KEDA instead of native HPA

Out of the box, Kubernetes' native HPA scales on CPU or memory utilization. For inference workloads, neither is the right signal. CPU utilization on a GPU workload is misleading: the GPU does the inference work, and CPU can be near-zero while the GPU is fully saturated. Memory utilization tracks CPU memory, not GPU VRAM.

The HPA can consume custom metrics through the Metrics API, but wiring that requires deploying a custom metrics adapter — KEDA's metrics server is that adapter. So you pay the same complexity cost either way; KEDA just makes the wiring declarative with a ScaledObject instead of custom adapter configuration.

KEDA also adds two things the native HPA does not offer. First, it has native connectors for external metric sources, so you declare a trigger type rather than implementing an adapter per source. Second, it correctly manages minReplicaCount: 0. The HPA does not scale a workload to or from zero replicas by default; KEDA handles the transitions between zero and one replica itself and leaves scaling between one and maxReplicaCount to the HPA it manages.

The Prometheus trigger and null values

The query sum(vllm:num_requests_waiting{namespace="inference"}) has a subtle failure mode: when all replicas are scaled to zero, there are no pods to scrape, and the metric series disappears from Prometheus. A missing series returns an empty result, not zero. With ignoreNullValues set to "false", KEDA treats the empty result as a query error. With "true" (which is also the scaler's default; the manifest sets it explicitly to make the intent visible), the empty result is interpreted as zero waiting requests, which is the semantically correct behavior when no pods are running.
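The behavior can be modeled in a few lines. This is a simplified sketch of the scaler's logic as described here, not KEDA's implementation; the function names are hypothetical, and the proportional rule is the usual HPA-style ceiling of metric value over threshold.

```typescript
// samples: values returned by the PromQL query; empty when no series match.
function queueDepth(samples: number[], ignoreNullValues: boolean): number {
  if (samples.length === 0) {
    if (ignoreNullValues) return 0; // no pods scraped: treat as an empty queue
    throw new Error("prometheus query returned an empty result");
  }
  return samples[0];
}

// One replica per `threshold` waiting requests, rounded up.
function replicasForQueue(waiting: number, threshold: number): number {
  return Math.ceil(waiting / threshold);
}

// With the fleet at zero the query is empty, which maps to zero replicas.
const replicasAtZero = replicasForQueue(queueDepth([], true), 5); // 0
// Twelve waiting requests against a threshold of 5 call for three replicas.
const replicasUnderLoad = replicasForQueue(queueDepth([12], true), 5); // 3
```

The result is still subject to minReplicaCount and maxReplicaCount on the ScaledObject.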

The cron trigger and cold start

The two-trigger design addresses a fundamental problem with scale-to-zero inference: cold start. Loading Qwen2.5-7B from S3, initializing CUDA, and warming the model takes 2–4 minutes on first pod start. For a request that arrives during the scale-to-zero window, that is a 2–4 minute wait before the first token is generated.

The cron trigger keeps one replica warm during business hours at a known cost — one GPU node, roughly $0.35/hr on Spot. Off-hours, both triggers request zero replicas, KEDA scales the deployment to zero pods, and Karpenter's WhenEmpty consolidation terminates the GPU node. GPU costs drop to zero overnight.
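How the two triggers combine can be sketched as follows: each trigger proposes a replica count, the larger proposal wins, and the result is clamped to the ScaledObject bounds. This is a simplified model that ignores KEDA's activation mechanics; the names are hypothetical, and the caller is assumed to supply the weekday and hour already converted to the trigger's timezone.

```typescript
interface LocalTime {
  weekday: number; // 0 = Sunday ... 6 = Saturday, in the trigger's timezone
  hour: number;    // 0-23, in the trigger's timezone
}

// Mirrors start "0 8 * * 1-5" and end "0 22 * * 1-5":
// Monday through Friday, from 08:00 up to (not including) 22:00.
function cronWindowActive(t: LocalTime): boolean {
  const isWeekday = t.weekday >= 1 && t.weekday <= 5;
  return isWeekday && t.hour >= 8 && t.hour < 22;
}

// queueReplicas is what the Prometheus trigger asks for on its own.
function fleetSize(
  queueReplicas: number,
  t: LocalTime,
  minReplicas = 0,
  maxReplicas = 4
): number {
  const cronReplicas = cronWindowActive(t) ? 1 : 0;
  const wanted = Math.max(queueReplicas, cronReplicas);
  return Math.min(maxReplicas, Math.max(minReplicas, wanted));
}

const wednesdayMorning = fleetSize(0, { weekday: 3, hour: 10 }); // 1: warm replica
const wednesdayNight = fleetSize(0, { weekday: 3, hour: 23 });   // 0: scale to zero
const saturdayBurst = fleetSize(3, { weekday: 6, hour: 12 });    // 3: queue-driven
```

Passing minReplicas = 1 reproduces the always-on variant for platforms with traffic at all hours.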

Adjust the cron start/end times to your traffic pattern. If your platform receives inference requests at all hours, set minReplicaCount: 1 and let the Prometheus trigger handle burst scaling without scale-to-zero.

Full scale-from-zero without cold-start latency requires the hybrid router from Part 6. The router can hold a queued request while vLLM warms up, or fall back to a cloud model for the first request and retry locally once the replica is ready. The warm-up plumbing lives in the router, not in vLLM itself.

Deploy the Inference Stack

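The weights must be in S3 before a vLLM pod can start successfully (the aws s3 sync step above). Because this stack also creates the bucket, on a first deployment run the sync as soon as the bucket exists, then delete any vLLM pod that started against the empty prefix so that its init container runs again. Deploy with: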

cdk deploy HybridLlmInference

This installs KEDA, Prometheus, and the vLLM workload manifests. Helm chart installations take 2–3 minutes. The first vLLM pod will be in Init:0/1 for 3–5 minutes while the init container downloads model weights; container restarts within the same pod keep the emptyDir and do not repeat the download, while every new pod downloads again.

Watch the progression:

# Watch pod status in the inference namespace.
kubectl get pods -n inference -w

# Follow init container logs during weight download.
kubectl -n inference logs -f deployment/vllm -c model-downloader

# Once the main container starts, follow vLLM startup.
kubectl -n inference logs -f deployment/vllm -c vllm

vLLM's startup sequence loads model weights from disk to GPU memory — you will see Loading weights... and GPU blocks: 1024 before the server enters the ready state. The /health endpoint begins responding only after the model is fully loaded.

Verify vLLM Is Working

Health check and model list

# Port-forward the service for local testing.
kubectl -n inference port-forward svc/vllm 8080:80 &

curl http://localhost:8080/health
# Expected: HTTP 200

curl http://localhost:8080/v1/models | jq '.data[].id'
# Expected: "qwen2.5-7b"

First inference request

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-7b",
    "messages": [{"role": "user", "content": "Summarize what Kubernetes is in two sentences."}],
    "max_tokens": 100
  }' | jq '.choices[0].message.content'

A real generated response confirms the full chain: Karpenter provisioned the GPU node, the init container loaded weights from S3, the NVIDIA device plugin allocated the GPU to the container, and vLLM is serving the OpenAI API.

Check vLLM Prometheus metrics

curl -s http://localhost:8080/metrics | grep 'vllm:num_requests'
# vllm:num_requests_running 0
# vllm:num_requests_waiting 0

This is the endpoint Prometheus scrapes; the KEDA scaler then queries Prometheus for these values every 15 seconds. Under load, vllm:num_requests_waiting rises; KEDA scales replica count accordingly.

Watch KEDA Scale Up

Send concurrent requests to push queue depth above the threshold:

for i in $(seq 1 20); do
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"qwen2.5-7b","messages":[{"role":"user","content":"Write a haiku."}],"max_tokens":50}' &
done
wait

Watch KEDA react:

# ScaledObject status — shows current replicas and last scale event.
kubectl -n inference describe scaledobject vllm

# Or watch the HPA that KEDA creates and manages underneath.
kubectl -n inference get hpa -w

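KEDA targets an average of 5 waiting requests per replica, so a sustained queue of 20 calls for 4 replicas, the configured maximum. Because vLLM batches many requests concurrently, 20 short requests may be absorbed without any queueing; if vllm:num_requests_waiting stays at zero, raise the request count or max_tokens until it climbs. Once replicas are added, Karpenter provisions additional GPU nodes to accommodate pods that cannot fit on the existing one. Watch Karpenter provision them: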

kubectl get nodeclaims -w
kubectl -n karpenter logs deployment/karpenter | grep -i "launched\|nodeclaim"

After load subsides and the 300-second cooldown passes, KEDA scales back down. At 10 PM Eastern the cron trigger drops its desired replicas to zero; KEDA scales the deployment to zero pods; Karpenter's WhenEmpty policy terminates the now-empty GPU nodes.

# Confirm Karpenter terminates the GPU node after scale-to-zero.
kubectl -n karpenter logs deployment/karpenter | grep -i "terminat\|consolidat"

Tearing Down

cdk destroy HybridLlmInference
cdk destroy HybridLlmAddons
cdk destroy HybridLlmNodeGroups
cdk destroy HybridLlmCluster
cdk destroy HybridLlmNetwork

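The S3 model weights bucket has removalPolicy: RETAIN and is not deleted by cdk destroy. The ~14 GB of weights remain in S3 until you delete the bucket manually. The .download-complete sentinel lives in each pod's emptyDir, not in S3, so it does not carry over to a rebuild. If you rebuild the tutorial, note that the stack creates the bucket under a fixed name, so a retained bucket of that name has to be removed (or imported) before the stack can be deployed again. For a full cleanup: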

aws s3 rb s3://hybrid-llm-model-weights-<ACCOUNT_ID> --force

Before destroying HybridLlmAddons, delete any live Ingress objects that have provisioned ALBs — they create real AWS resources CDK does not track, and a live ALB will block VPC deletion.

What's Next

The cluster now has end-to-end GPU inference: Karpenter provisions GPU nodes on demand, vLLM loads model weights and serves the OpenAI API, and KEDA scales both replicas and GPU nodes with live request queue depth. GPU costs approach zero overnight.

What is missing is the routing layer. Every request goes directly to vLLM and there is no mechanism to decide at runtime whether a given request is best handled by the local model or by a cloud provider. That decision — model selection, latency vs. cost tradeoff, fallback logic — is the job of the hybrid router we build in Part 6. The router also addresses the cold-start gap: it holds queued requests while a vLLM replica is warming up, so clients see a short delay rather than an error when the platform scales from zero.