> Blog Post

Building a Hybrid LLM Platform on EKS, Part 7: Observability and Cost Telemetry

In Part 6 we built the hybrid router — the TypeScript service that splits every request between the local vLLM model and the Anthropic API, with a 2-second health-check fallback when vLLM is warming up. The platform now works end-to-end, but it is a black box. You can see individual X-Router-Backend headers, but you cannot answer: what fraction of traffic is going local right now? What is the p95 latency for each backend? Is the GPU being fully utilized? How much is the Anthropic API costing per hour compared to the amortized GPU spend?

Part 7 adds three observability layers that answer those questions:

  1. Infrastructure metrics — upgrade the thin Prometheus from Part 5 to kube-prometheus-stack, add the NVIDIA DCGM exporter for GPU utilization and VRAM metrics, and wire the existing vLLM Prometheus endpoint into Grafana dashboards.
  2. Distributed traces — instrument the TypeScript router with the OpenTelemetry SDK, deploy an OTel Collector and Grafana Tempo in-cluster, and surface end-to-end request traces in Grafana.
  3. LLM cost telemetry — add the Langfuse SDK to the router so every inference request is logged as a generation with model, token counts, latency, and computed cost.

By the end, a single Grafana instance shows GPU utilization alongside inference latency alongside cloud-vs-local token spend — the data needed to tune the routing thresholds from Part 6 by evidence rather than guesswork.

Instrumenting the Router

The router changes come first because the CDK stack below references new environment variables the router needs. We add one new file (tracer.ts), update index.ts, and update package.json.

router/src/tracer.ts

The OTel SDK must be initialized before any other imports. We isolate the setup in its own module and import it as the very first line of index.ts.

// router/src/tracer.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";
import { Resource } from "@opentelemetry/resources";
import { ATTR_SERVICE_NAME } from "@opentelemetry/semantic-conventions";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

export const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: "hybrid-llm-router",
  }),
  traceExporter: new OTLPTraceExporter({
    // OTEL_EXPORTER_OTLP_ENDPOINT is set by the Deployment env vars.
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Disable noisy fs/dns auto-instrumentation; keep http.
      "@opentelemetry/instrumentation-fs": { enabled: false },
      "@opentelemetry/instrumentation-dns": { enabled: false },
    }),
  ],
});

sdk.start();

process.on("SIGTERM", () => sdk.shutdown().finally(() => process.exit(0)));

Updates to router/src/index.ts

Add import "./tracer" as the first line, then layer in spans and Langfuse generation logging around the routing decision:

import "./tracer"; // must precede all other imports
import { trace, SpanStatusCode } from "@opentelemetry/api";
import Langfuse from "langfuse";
// ... existing imports unchanged ...

const tracer = trace.getTracer("hybrid-llm-router", "1.0.0");

const langfuse = new Langfuse({
  secretKey: process.env.LANGFUSE_SECRET_KEY ?? "",
  publicKey: process.env.LANGFUSE_PUBLIC_KEY ?? "",
  baseUrl:
    process.env.LANGFUSE_BASE_URL ??
    "http://langfuse.monitoring.svc.cluster.local:3000",
  flushAt: 1,     // flush after every event for low-latency telemetry
  flushInterval: 0,
});

Replace the /v1/chat/completions handler with the instrumented version:

app.post("/v1/chat/completions", async (c) => {
  return tracer.startActiveSpan("chat.completions", async (span) => {
    try {
      const body = await c.req.json<ChatBody>();
      const model = body.model ?? "auto";
      const messages = body.messages ?? [];
      const maxTokens = body.max_tokens ?? 512;
      const isStream = body.stream ?? false;

      let useLocal = routeToLocal(model, messages, maxTokens);

      if (useLocal && !(await vllmIsReady())) {
        useLocal = false;
        body.model = "claude-sonnet-4-6";
      }

      const backend = useLocal ? "local" : "cloud";

      span.setAttributes({
        "llm.model.requested": model,
        "llm.backend": backend,
        "llm.max_tokens": maxTokens,
        "llm.prompt_tokens_estimated": Math.round(estimateTokens(messages)),
      });

      const start = Date.now();
      const response = useLocal
        ? await forwardToVllm(body, isStream)
        : await forwardToAnthropic(body, isStream);
      const latencyMs = Date.now() - start;

      span.setAttributes({ "llm.latency_ms": latencyMs });
      span.setStatus({ code: SpanStatusCode.OK });

      // Log to Langfuse — fire-and-forget, does not block the response.
      void logToLangfuse(model, backend, messages, response, latencyMs);

      return response;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: String(err) });
      throw err;
    } finally {
      span.end();
    }
  });
});

// Shape of the OpenAI-compatible fields we read from a non-streaming response.
interface CompletionBody {
  model?: string;
  usage?: { prompt_tokens?: number; completion_tokens?: number };
  choices?: Array<{ message?: { content?: string } }>;
}

async function logToLangfuse(
  requestedModel: string,
  backend: string,
  messages: Message[],
  response: Response,
  latencyMs: number,
): Promise<void> {
  // The caller does not await this function, so nothing may escape it:
  // the whole body sits inside one try/catch, including the flush.
  try {
    if (response.headers.get("Content-Type")?.includes("text/event-stream")) {
      // Streaming: log metadata only. We deliberately do not clone here,
      // because an unread clone of a stream buffers the body in memory.
      langfuse.generation({
        name: "chat-completion",
        model: requestedModel,
        input: messages,
        metadata: { backend, streaming: true, latencyMs },
      });
    } else {
      // Clone before the first await so reading the body here doesn't
      // consume it for the caller. Response.json() takes no type argument,
      // so the result is cast instead.
      const body = (await response.clone().json()) as CompletionBody;
      langfuse.generation({
        name: "chat-completion",
        model: body.model ?? requestedModel,
        input: messages,
        output: body.choices?.[0]?.message?.content ?? "",
        usage: {
          input: body.usage?.prompt_tokens ?? 0,
          output: body.usage?.completion_tokens ?? 0,
        },
        metadata: { backend, latencyMs },
      });
    }
    await langfuse.flushAsync();
  } catch {
    // Best-effort: never let telemetry break the response path.
  }
}

Updated router/package.json

{
  "name": "hybrid-llm-router",
  "version": "1.0.0",
  "scripts": {
    "dev": "tsx watch src/index.ts",
    "build": "tsc",
    "start": "node dist/index.js"
  },
  "dependencies": {
    "@anthropic-ai/sdk": "^0.40.0",
    "@hono/node-server": "^1.13.0",
    "@opentelemetry/api": "^1.9.0",
    "@opentelemetry/auto-instrumentations-node": "^0.54.0",
    "@opentelemetry/exporter-trace-otlp-grpc": "^0.56.0",
    "@opentelemetry/resources": "^1.29.0",
    "@opentelemetry/sdk-node": "^0.56.0",
    "@opentelemetry/semantic-conventions": "^1.28.0",
    "hono": "^4.6.0",
    "langfuse": "^3.28.0"
  },
  "devDependencies": {
    "@types/node": "^22.0.0",
    "tsx": "^4.19.0",
    "typescript": "^5.6.0"
  }
}

A Seventh CDK Stack: Observability

The observability stack upgrades the thin Prometheus from Part 5, adds the NVIDIA DCGM GPU exporter, deploys Grafana Tempo as the trace backend, installs the OTel Collector, and installs Langfuse.

Before deploying this stack, remove this.installPrometheus(props.cluster) from InferenceStack. kube-prometheus-stack takes over the monitoring namespace and will conflict with the standalone chart.

// lib/observability-stack.ts
import * as cdk from "aws-cdk-lib";
import { Construct } from "constructs";
import * as eks from "aws-cdk-lib/aws-eks";
import { config } from "./config";

interface ObservabilityStackProps extends cdk.StackProps {
  cluster: eks.Cluster;
}

export class ObservabilityStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props: ObservabilityStackProps) {
    super(scope, id, props);

    const ns = this.ensureMonitoringNamespace(props.cluster);

    this.installKubePrometheusStack(props.cluster, ns);
    this.installDcgmExporter(props.cluster, ns);
    this.installTempo(props.cluster, ns);
    this.installOtelCollector(props.cluster, ns);
    this.installLangfuse(props.cluster, ns);
  }

  private ensureMonitoringNamespace(cluster: eks.Cluster): cdk.aws_eks.KubernetesManifest {
    return cluster.addManifest("MonitoringNamespace", {
      apiVersion: "v1",
      kind: "Namespace",
      metadata: { name: "monitoring" },
    });
  }

  private installKubePrometheusStack(
    cluster: eks.Cluster,
    ns: cdk.aws_eks.KubernetesManifest,
  ): void {
    const chart = cluster.addHelmChart("KubePrometheusStack", {
      chart: "kube-prometheus-stack",
      repository: "https://prometheus-community.github.io/helm-charts",
      namespace: "monitoring",
      release: "kube-prometheus-stack",
      version: "65.2.0",
      values: {
        prometheus: {
          prometheusSpec: {
            retention: "15d",
            storageSpec: {
              volumeClaimTemplate: {
                spec: {
                  storageClassName: "gp3",
                  resources: { requests: { storage: "50Gi" } },
                },
              },
            },
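            // Let Prometheus select every PodMonitor and ServiceMonitor in the
            // cluster (the DCGM exporter's, for example), not only those
            // labelled with this Helm release.
            podMonitorSelectorNilUsesHelmValues: false,
            serviceMonitorSelectorNilUsesHelmValues: false,
            // Scrape pods annotated with prometheus.io/scrape: "true". This
            // picks up vLLM pods from Part 5 and router pods from Part 6.
            additionalScrapeConfigs: [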
              {
                job_name: "annotated-pods",
                kubernetes_sd_configs: [{ role: "pod" }],
                relabel_configs: [
                  {
                    source_labels: ["__meta_kubernetes_pod_annotation_prometheus_io_scrape"],
                    action: "keep",
                    regex: "true",
                  },
                  {
                    source_labels: [
                      "__meta_kubernetes_pod_ip",
                      "__meta_kubernetes_pod_annotation_prometheus_io_port",
                    ],
                    action: "replace",
                    separator: ":",
                    target_label: "__address__",
                  },
                  {
                    source_labels: ["__meta_kubernetes_pod_annotation_prometheus_io_path"],
                    action: "replace",
                    target_label: "__metrics_path__",
                    regex: "(.+)",
                  },
                ],
              },
            ],
          },
        },
        grafana: {
          adminPassword: "changeme",
          persistence: { enabled: true, storageClassName: "gp3", size: "10Gi" },
          additionalDataSources: [
            {
              name: "Tempo",
              type: "tempo",
              url: "http://tempo.monitoring.svc.cluster.local:3100",
              access: "proxy",
              isDefault: false,
            },
          ],
          dashboardProviders: {
            "dashboardproviders.yaml": {
              apiVersion: 1,
              providers: [{
                name: "default",
                orgId: 1,
                folder: "",
                type: "file",
                options: { path: "/var/lib/grafana/dashboards/default" },
              }],
            },
          },
        },
        alertmanager: { enabled: false },
      },
    });

    chart.node.addDependency(ns);
  }

  private installDcgmExporter(
    cluster: eks.Cluster,
    ns: cdk.aws_eks.KubernetesManifest,
  ): void {
    // NVIDIA DCGM Exporter runs as a DaemonSet on GPU nodes and exports
    // per-GPU metrics — utilization, VRAM used/free, temperature, power draw —
    // to Prometheus.
    const chart = cluster.addHelmChart("DcgmExporter", {
      chart: "dcgm-exporter",
      repository: "https://nvidia.github.io/dcgm-exporter/helm-charts",
      namespace: "monitoring",
      release: "dcgm-exporter",
      version: "3.5.0",
      values: {
        tolerations: [
          // Must tolerate the GPU taint from Part 3 to schedule on GPU nodes.
          { key: "nvidia.com/gpu", operator: "Exists", effect: "NoSchedule" },
        ],
        affinity: {
          nodeAffinity: {
            requiredDuringSchedulingIgnoredDuringExecution: {
              nodeSelectorTerms: [{
                matchExpressions: [{
                  key: "node.kubernetes.io/purpose",
                  operator: "In",
                  values: ["gpu-inference"],
                }],
              }],
            },
          },
        },
        serviceMonitor: { enabled: true },
      },
    });

    chart.node.addDependency(ns);
  }

  private installTempo(
    cluster: eks.Cluster,
    ns: cdk.aws_eks.KubernetesManifest,
  ): void {
    // Grafana Tempo is the trace backend. The OTel Collector forwards spans to
    // it via OTLP gRPC. Grafana reads traces from it using the Tempo datasource
    // configured above.
    const chart = cluster.addHelmChart("Tempo", {
      chart: "tempo",
      repository: "https://grafana.github.io/helm-charts",
      namespace: "monitoring",
      release: "tempo",
      version: "1.10.3",
      values: {
        tempo: {
          storage: {
            trace: {
              backend: "local",
              local: { path: "/var/tempo/traces" },
            },
          },
          resources: {
            requests: { cpu: "200m", memory: "512Mi" },
            limits: { cpu: "1", memory: "2Gi" },
          },
        },
        persistence: {
          enabled: true,
          storageClassName: "gp3",
          size: "20Gi",
        },
      },
    });

    chart.node.addDependency(ns);
  }

  private installOtelCollector(
    cluster: eks.Cluster,
    ns: cdk.aws_eks.KubernetesManifest,
  ): void {
    // The OTel Collector receives traces from the router on port 4317 (gRPC)
    // and forwards them to Grafana Tempo.
    const chart = cluster.addHelmChart("OtelCollector", {
      chart: "opentelemetry-collector",
      repository: "https://open-telemetry.github.io/opentelemetry-helm-charts",
      namespace: "monitoring",
      release: "otel-collector",
      version: "0.108.0",
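      values: {
        // Without this override the chart names its Service
        // "otel-collector-opentelemetry-collector", which would not match the
        // OTEL_EXPORTER_OTLP_ENDPOINT host the router is configured with.
        fullnameOverride: "otel-collector",
        // Recent versions of this chart require an explicit image repository.
        image: { repository: "otel/opentelemetry-collector-contrib" },
        mode: "deployment",
        replicaCount: 1,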
        config: {
          receivers: {
            otlp: {
              protocols: {
                grpc: { endpoint: "0.0.0.0:4317" },
                http: { endpoint: "0.0.0.0:4318" },
              },
            },
          },
          processors: {
            batch: {},
            memory_limiter: {
              check_interval: "5s",
              limit_percentage: 80,
              spike_limit_percentage: 25,
            },
          },
          exporters: {
            otlp: {
              endpoint: "http://tempo.monitoring.svc.cluster.local:4317",
              tls: { insecure: true },
            },
          },
          service: {
            pipelines: {
              traces: {
                receivers: ["otlp"],
                processors: ["memory_limiter", "batch"],
                exporters: ["otlp"],
              },
            },
          },
        },
        resources: {
          requests: { cpu: "100m", memory: "256Mi" },
          limits: { cpu: "500m", memory: "512Mi" },
        },
      },
    });

    chart.node.addDependency(ns);
  }

  private installLangfuse(
    cluster: eks.Cluster,
    ns: cdk.aws_eks.KubernetesManifest,
  ): void {
    // Langfuse: self-hosted LLM observability. For this tutorial we use the
    // bundled PostgreSQL. Production should use an RDS instance — set
    // postgresql.enabled: false and supply an external DATABASE_URL.
    const chart = cluster.addHelmChart("Langfuse", {
      chart: "langfuse",
      repository: "https://langfuse.github.io/langfuse-k8s",
      namespace: "monitoring",
      release: "langfuse",
      version: "1.3.0",
      values: {
        langfuse: {
          salt: "change-me-random-32-chars",
          nextauth: {
            secret: "change-me-random-32-chars",
            url: "http://langfuse.monitoring.svc.cluster.local:3000",
          },
        },
        postgresql: {
          enabled: true,
          auth: {
            postgresPassword: "langfuse-pg-password",
            database: "langfuse",
          },
        },
        service: {
          type: "ClusterIP",
          port: 3000,
        },
        resources: {
          requests: { cpu: "250m", memory: "512Mi" },
          limits: { cpu: "1", memory: "1Gi" },
        },
      },
    });

    chart.node.addDependency(ns);
  }
}

Update bin/app.ts to add the seventh stack. Unlike the three stacks before it, ObservabilityStack takes no oidcProvider prop, because it does not use IRSA:

// bin/app.ts
import * as cdk from "aws-cdk-lib";
import { NetworkStack } from "../lib/network-stack";
import { ClusterStack } from "../lib/cluster-stack";
import { NodeGroupStack } from "../lib/node-group-stack";
import { AddonsStack } from "../lib/addons-stack";
import { InferenceStack } from "../lib/inference-stack";
import { RouterStack } from "../lib/router-stack";
import { ObservabilityStack } from "../lib/observability-stack";
import { config } from "../lib/config";

const app = new cdk.App();
const env = { region: config.region };

const network = new NetworkStack(app, "HybridLlmNetwork", { env });

const cluster = new ClusterStack(app, "HybridLlmCluster", {
  env,
  vpc: network.vpc,
});

new NodeGroupStack(app, "HybridLlmNodeGroups", {
  env,
  cluster: cluster.cluster,
  nodeRole: cluster.nodeRole,
});

new AddonsStack(app, "HybridLlmAddons", {
  env,
  cluster: cluster.cluster,
  oidcProvider: cluster.oidcProvider,
});

new InferenceStack(app, "HybridLlmInference", {
  env,
  cluster: cluster.cluster,
  oidcProvider: cluster.oidcProvider,
});

new RouterStack(app, "HybridLlmRouter", {
  env,
  cluster: cluster.cluster,
  oidcProvider: cluster.oidcProvider,
});

new ObservabilityStack(app, "HybridLlmObservability", {
  env,
  cluster: cluster.cluster,
});

Also update RouterStack to pass the new OTel and Langfuse environment variables to the router Deployment:

// In RouterStack.deployRouter(), add to the env array:
{ name: "OTEL_EXPORTER_OTLP_ENDPOINT",
  value: "http://otel-collector.monitoring.svc.cluster.local:4317" },
{ name: "LANGFUSE_BASE_URL",
  value: "http://langfuse.monitoring.svc.cluster.local:3000" },
{
  name: "LANGFUSE_SECRET_KEY",
  valueFrom: { secretKeyRef: { name: "router-api-keys", key: "LANGFUSE_SECRET_KEY" } },
},
{
  name: "LANGFUSE_PUBLIC_KEY",
  valueFrom: { secretKeyRef: { name: "router-api-keys", key: "LANGFUSE_PUBLIC_KEY" } },
},

Walking Through the Decisions

kube-prometheus-stack replaces the thin Prometheus

Part 5 installed prometheus-community/prometheus — a minimal single-component chart — because that was the earliest point where KEDA needed a metrics source. The thin chart served that purpose, but it has no Grafana, no alerting, no ServiceMonitor CRDs, and no node or Kubernetes state metrics.

kube-prometheus-stack is the production-grade replacement. It ships Prometheus, Grafana, Alertmanager, the Prometheus Operator (which adds ServiceMonitor and PodMonitor CRDs), kube-state-metrics for Kubernetes object state, and node-exporter for per-node CPU/memory/disk metrics. All of it is wired together at install time; our values switch Alertmanager off (alertmanager.enabled: false). The upgrade from thin Prometheus to kube-prometheus-stack is a namespace-level swap: the existing vLLM pod annotations (prometheus.io/scrape: "true") are picked up immediately by the additionalScrapeConfigs we carry forward from Part 5.
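To make the relabel rules in additionalScrapeConfigs less opaque, here is a minimal TypeScript model of Prometheus's replace action. It is a simplified sketch for illustration only: RelabelRule and applyReplace are hypothetical names, the pod IP and port are made up, and real Prometheus uses RE2 regexes and supports more actions. The join, anchored match, and substitute flow follows the documented semantics.

```typescript
// Simplified model of a Prometheus `replace` relabel rule.
// RelabelRule and applyReplace are illustrative names, not a Prometheus API.
type Labels = Record<string, string>;

interface RelabelRule {
  sourceLabels: string[];
  targetLabel: string;
  separator?: string; // Prometheus default: ";"
  regex?: string; // Prometheus default: "(.*)"
  replacement?: string; // Prometheus default: "$1"
}

function applyReplace(labels: Labels, rule: RelabelRule): Labels {
  // 1. Join the source label values with the separator (missing labels are "").
  const joined = rule.sourceLabels
    .map((name) => labels[name] ?? "")
    .join(rule.separator ?? ";");
  // 2. Match the regex anchored at both ends.
  const anchored = new RegExp(`^(?:${rule.regex ?? "(.*)"})$`);
  if (!anchored.test(joined)) return labels; // no match: labels unchanged
  // 3. Write the expanded replacement into the target label.
  const value = joined.replace(anchored, rule.replacement ?? "$1");
  return { ...labels, [rule.targetLabel]: value };
}

// Labels as service discovery might present them for an annotated pod.
// The IP and port are made-up values; the pod has no prometheus.io/path annotation.
const discovered: Labels = {
  __address__: "10.0.3.17",
  __metrics_path__: "/metrics",
  __meta_kubernetes_pod_ip: "10.0.3.17",
  __meta_kubernetes_pod_annotation_prometheus_io_port: "8000",
};

// Second rule from the config: build "<pod ip>:<annotated port>".
const withAddress = applyReplace(discovered, {
  sourceLabels: [
    "__meta_kubernetes_pod_ip",
    "__meta_kubernetes_pod_annotation_prometheus_io_port",
  ],
  separator: ":",
  targetLabel: "__address__",
});

// Third rule: override the metrics path only when the annotation is non-empty.
const relabeled = applyReplace(withAddress, {
  sourceLabels: ["__meta_kubernetes_pod_annotation_prometheus_io_path"],
  regex: "(.+)",
  targetLabel: "__metrics_path__",
});

console.log(relabeled.__address__, relabeled.__metrics_path__);
```

The regex "(.+)" on the path rule is what keeps the default /metrics path in place when a pod carries no prometheus.io/path annotation: an empty string does not match, so the rule is skipped.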

NVIDIA DCGM Exporter for GPU visibility

The most important single metric on an inference cluster is GPU utilization. Without it you cannot answer whether the model server is compute-bound or memory-bound, whether the Karpenter GPU node is actually working or sitting idle, or whether a Spot interruption left a node partially degraded.

NVIDIA's DCGM Exporter runs as a DaemonSet on every GPU node and exposes a Prometheus-compatible /metrics endpoint with metrics like:

  • DCGM_FI_DEV_GPU_UTIL: GPU compute utilization (0–100%)
  • DCGM_FI_DEV_MEM_COPY_UTIL: memory bandwidth utilization (0–100%)
  • DCGM_FI_DEV_FB_USED: framebuffer (VRAM) used, in MiB
  • DCGM_FI_DEV_POWER_USAGE: power draw in watts
  • DCGM_FI_DEV_SM_CLOCK: streaming multiprocessor clock frequency, in MHz

The DaemonSet tolerates the nvidia.com/gpu=present:NoSchedule taint from Part 3 and uses the same node affinity — without both, it would not schedule on GPU nodes and you would see no GPU metrics at all.
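As a rough illustration of what the exporter's /metrics output looks like and how the VRAM numbers combine, the sketch below parses a few lines of Prometheus text format and derives the fraction of framebuffer in use. The sample scrape is made up, parseSample and vramUsedFraction are hypothetical helpers, and the calculation assumes total VRAM is roughly used plus free (DCGM_FI_DEV_FB_FREE), ignoring any reserved memory.

```typescript
interface Sample {
  name: string;
  labels: Record<string, string>;
  value: number;
}

// Parse one line of the simple `name{label="value",...} number` shape.
// Comment lines (starting with "#") and blank lines return null.
function parseSample(line: string): Sample | null {
  const match = /^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)/.exec(line.trim());
  if (!match) return null;
  const labels: Record<string, string> = {};
  const labelPattern = /([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"/g;
  let pair: RegExpExecArray | null;
  while ((pair = labelPattern.exec(match[2] ?? "")) !== null) {
    labels[pair[1]] = pair[2];
  }
  return { name: match[1], labels, value: Number(match[3]) };
}

// Fraction of VRAM in use for one GPU, or undefined if a metric is missing.
function vramUsedFraction(samples: Sample[], gpu: string): number | undefined {
  const find = (name: string): number | undefined =>
    samples.find((s) => s.name === name && s.labels.gpu === gpu)?.value;
  const used = find("DCGM_FI_DEV_FB_USED");
  const free = find("DCGM_FI_DEV_FB_FREE");
  if (used === undefined || free === undefined || used + free === 0) return undefined;
  return used / (used + free);
}

// Made-up scrape output for a single GPU.
const scrape = `
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="0",Hostname="gpu-node-a"} 18432
DCGM_FI_DEV_FB_FREE{gpu="0",Hostname="gpu-node-a"} 4096
`;

const samples = scrape
  .split("\n")
  .map((line) => parseSample(line))
  .filter((s): s is Sample => s !== null);

const fraction = vramUsedFraction(samples, "0");
console.log(fraction);
```

In practice you would compute the same ratio in a Grafana panel rather than in application code; the sketch only shows which two series feed it.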

OTel SDK initialization must be the first import

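The auto-instrumentations loaded by @opentelemetry/sdk-node patch Node.js core modules such as http and https to inject trace context. These patches must apply before any module that uses those APIs is loaded. If import Anthropic from "@anthropic-ai/sdk" runs before the SDK starts, the Anthropic HTTP client may never be patched and its requests are invisible to tracing.

The import "./tracer" as the first line of index.ts ensures the SDK starts before anything else. This holds when the router is compiled to CommonJS, as the absence of "type": "module" in package.json suggests: each import becomes a require() call, and those run in source order. Under native ES modules, import order alone is not enough for the patches to apply, and the SDK needs the loader-based setup described in the OpenTelemetry documentation.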
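The ordering requirement is easier to see in a toy model. The sketch below is not how OpenTelemetry patches modules internally (it hooks module loading); fakeHttp and both clients are stand-ins invented for illustration. It only shows why a reference captured before patching escapes instrumentation.

```typescript
// A stand-in for a core module such as `http`. Purely illustrative.
const fakeHttp = {
  request: (url: string): string => `response from ${url}`,
};

// A library that captured its own reference to `request` when it was loaded,
// before any instrumentation ran.
const earlyClient = { send: fakeHttp.request };

// "Instrumentation": wrap the function and record every call that goes through it.
const tracedUrls: string[] = [];
const originalRequest = fakeHttp.request;
fakeHttp.request = (url: string): string => {
  tracedUrls.push(url);
  return originalRequest(url);
};

// A library loaded after instrumentation sees the wrapped function.
const lateClient = { send: fakeHttp.request };

earlyClient.send("https://early.example");
lateClient.send("https://late.example");

// Only the call made through the late client was recorded.
console.log(tracedUrls);
```

Both calls succeed, which is what makes this failure mode easy to miss: nothing breaks, the early client's traffic is simply absent from the traces.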

What to put on spans

Spans should carry attributes that answer questions you will actually ask. For the inference router, the critical ones are:

  • llm.model.requested: what the caller asked for ("auto", "local", "claude")
  • llm.backend: which backend actually served the request ("local" or "cloud")
  • llm.max_tokens: the caller's requested output budget
  • llm.prompt_tokens_estimated: a cheap approximation of prompt size (for routing post-analysis)
  • llm.latency_ms: time from the routing decision until the forwarding call resolves with a Response (roughly time to first byte for streaming requests)

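The gap between llm.model.requested and llm.backend is where cold-start fallbacks and heuristic overrides show up. In Grafana's Tempo Explore view, the TraceQL query { span.llm.model.requested = "local" && span.llm.backend = "cloud" } finds every request that asked for the local model but was served by the cloud, so you can see how often vLLM's warm-up is causing cloud fallback and decide whether to raise the baseline replica count. Requests sent with "auto" that fall back look the same as ordinary cloud routing under these two attributes alone.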
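If you export span attributes for offline analysis, the same comparison is a few lines of code. This is a minimal sketch assuming the attributes have already been pulled into plain objects; RouterSpanAttrs, isExplicitLocalFallback, and fallbackRate are hypothetical names, and the sample data is invented.

```typescript
// The two routing attributes the router puts on each span.
interface RouterSpanAttrs {
  "llm.model.requested": string;
  "llm.backend": "local" | "cloud";
}

// True when the caller asked for the local model but the cloud served it.
function isExplicitLocalFallback(attrs: RouterSpanAttrs): boolean {
  return attrs["llm.model.requested"] === "local" && attrs["llm.backend"] === "cloud";
}

// Share of explicit-local requests that fell back to the cloud (0 when there are none).
function fallbackRate(spans: RouterSpanAttrs[]): number {
  const localRequests = spans.filter((s) => s["llm.model.requested"] === "local");
  if (localRequests.length === 0) return 0;
  return localRequests.filter(isExplicitLocalFallback).length / localRequests.length;
}

// Invented sample: three explicit-local requests, one of which fell back.
const exportedSpans: RouterSpanAttrs[] = [
  { "llm.model.requested": "local", "llm.backend": "local" },
  { "llm.model.requested": "local", "llm.backend": "cloud" },
  { "llm.model.requested": "auto", "llm.backend": "cloud" },
  { "llm.model.requested": "local", "llm.backend": "local" },
];

const rate = fallbackRate(exportedSpans);
console.log(rate);
```

Note that the "auto" request served by the cloud is not counted: from these two attributes alone it cannot be told apart from a normal routing decision.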

Langfuse for what Prometheus cannot tell you

Prometheus is excellent at infrastructure metrics — utilization percentages, throughput rates, error counts. It is not designed for the data that matters most for an LLM platform:

  • Token counts per request — Prometheus can count requests; it cannot count tokens inside each request without a custom metric your code emits explicitly.
  • Cost per request — you need token counts and a model-aware pricing lookup to compute cost. Langfuse does this natively for every model it knows about.
  • Input/output content — Prometheus stores time-series numbers, not text. Langfuse stores the actual prompts and completions so you can review surprising outputs, identify routing mistakes, and build evaluation datasets.
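The cost calculation itself is simple arithmetic once token counts and a price table exist. The sketch below shows the shape of it; it is not Langfuse's implementation, the model names are invented, and the prices are placeholders rather than real list prices.

```typescript
interface Usage {
  inputTokens: number;
  outputTokens: number;
}

interface Pricing {
  inputUsdPerMTok: number; // USD per million input tokens
  outputUsdPerMTok: number; // USD per million output tokens
}

// PLACEHOLDER prices for invented model names. Replace with real figures.
const pricing: Record<string, Pricing> = {
  "example-cloud-model": { inputUsdPerMTok: 3, outputUsdPerMTok: 15 },
  // Local inference has no per-token fee; its cost is the amortized GPU spend.
  "example-local-model": { inputUsdPerMTok: 0, outputUsdPerMTok: 0 },
};

// Per-request cost in USD, or undefined when the model has no known price.
function requestCostUsd(model: string, usage: Usage): number | undefined {
  const price = pricing[model];
  if (!price) return undefined;
  return (
    (usage.inputTokens * price.inputUsdPerMTok +
      usage.outputTokens * price.outputUsdPerMTok) /
    1e6
  );
}

const cloudCost = requestCostUsd("example-cloud-model", {
  inputTokens: 1200,
  outputTokens: 300,
});
console.log(cloudCost);
```

Returning undefined for an unknown model, rather than zero, keeps "free" and "unpriced" distinguishable when the numbers are aggregated later.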

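The logToLangfuse function clones the response before it is sent to the caller, parses the body for token usage, and logs a generation asynchronously. The void keyword on the call site is deliberate: it marks the promise as intentionally not awaited, so telemetry never delays the response. The try/catch inside the function keeps a failed body parse from surfacing as an error, because Langfuse telemetry must never block or throw on the response path.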
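One way to make the fire-and-forget guarantee explicit is a small wrapper that starts the task immediately and routes any rejection to a handler instead of leaving it unhandled. This is a minimal sketch; fireAndForget is a hypothetical helper, not part of the router or the Langfuse SDK.

```typescript
// Start an async task without awaiting it. Any error it throws or rejects with
// is passed to onError, so it cannot become an unhandled rejection.
function fireAndForget(
  task: () => Promise<unknown>,
  onError: (err: unknown) => void = () => {},
): void {
  void (async () => {
    try {
      // The task begins running synchronously, up to its first await.
      await task();
    } catch (err) {
      onError(err);
    }
  })();
}

const telemetryErrors: unknown[] = [];
let started = false;

fireAndForget(
  async () => {
    started = true; // runs before fireAndForget returns
    throw new Error("telemetry backend unreachable");
  },
  (err) => {
    telemetryErrors.push(err);
  },
);

// The caller continues immediately; the error is collected on a later tick.
console.log(started);
```

Starting the task synchronously matters for the router: a response has to be cloned before the handler returns it, and a wrapper that deferred the task to a later tick would lose that guarantee.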

Tempo as the trace backend

Grafana Tempo stores traces in object storage (or local disk for the tutorial) and integrates directly with Grafana via the Tempo datasource. Because kube-prometheus-stack already ships Grafana, adding Tempo as a datasource means traces and metrics live in one UI. An engineer investigating a latency spike can pivot from a Grafana metrics graph to the individual traces that contributed to the spike without leaving the window.

The alternative, Jaeger, is mature and well-documented, but requires a separate UI. For a platform where the primary dashboards are already in Grafana, that split adds friction. Tempo also keeps storage cheap by not indexing every span field: it is built around lookup by trace ID, with TraceQL search layered on top, which is a fine fit for investigating specific requests surfaced by Langfuse.

The bundled PostgreSQL in Langfuse is not for production

The postgresql.enabled: true value in the Langfuse Helm values starts a single-replica PostgreSQL pod inside the cluster. This is convenient for the tutorial — no external dependency — but has no HA, no automated backup, and data is lost if the PVC is deleted. For production, provision an RDS PostgreSQL instance, set postgresql.enabled: false, and provide the connection string via langfuse.databaseUrl pointing to the RDS endpoint. The RDS instance can live in a CDK stack alongside the cluster infrastructure.

Deploy the Observability Stack

Step 1: Remove the thin Prometheus from InferenceStack

In lib/inference-stack.ts, delete the installPrometheus method and remove its call from the constructor. ObservabilityStack recreates the monitoring namespace, and kube-prometheus-stack reinstalls Prometheus inside it.

If the thin chart is already deployed, uninstall it first:

helm -n monitoring uninstall prometheus

Step 2: Deploy the observability stack

cdk deploy HybridLlmObservability

The observability stack deploys five Helm charts. kube-prometheus-stack is the slowest because it creates CRDs and waits for the Prometheus Operator to reconcile. Total deploy time is 5–8 minutes.

Step 3: Add Langfuse keys to the router Secret

Langfuse generates API keys after first login, so it has to be running before this step. Open the Langfuse UI (the port-forward command is in the verification section below), create an account, generate a project key pair, then add both keys to the router Secret:

kubectl -n router patch secret router-api-keys \
  --type=merge \
  -p '{"stringData":{"LANGFUSE_SECRET_KEY":"sk-lf-...","LANGFUSE_PUBLIC_KEY":"pk-lf-..."}}'

Step 4: Redeploy the router

# Redeploy the router to pick up the new OTel/Langfuse env vars.
cdk deploy HybridLlmRouter

Do this after the Secret is patched. The Deployment references both keys through secretKeyRef, and router pods cannot start while a referenced key is missing.

Verify the Observability Stack

Grafana

kubectl -n monitoring port-forward svc/kube-prometheus-stack-grafana 3000:80 &
# Open http://localhost:3000 — admin / changeme

Navigate to Dashboards → Kubernetes / Compute Resources / Namespace (Pods) and select the inference namespace. You should see vLLM pod CPU and memory. Then navigate to Explore, select the Prometheus datasource, and query:

vllm:num_requests_waiting{namespace="inference"}

For GPU metrics, query:

DCGM_FI_DEV_GPU_UTIL{namespace="monitoring"}

A value above zero means the GPU is computing inference. A value of zero while vLLM shows running requests indicates the pod is not correctly bound to the GPU — check the device plugin logs from Part 3.
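The same check can be scripted against Prometheus's HTTP API instead of Grafana's Explore view. The sketch below covers only the pure parts: building the /api/v1/query URL and reading an instant-vector response. The base URL assumes a local port-forward to Prometheus on port 9090, buildQueryUrl and maxValue are hypothetical helpers, and the sample payload is invented.

```typescript
// Subset of the Prometheus instant-query response that we read.
interface InstantQueryResponse {
  status: "success" | "error";
  data?: {
    resultType: string;
    // Each series carries [unix timestamp, value-as-string].
    result: Array<{ metric: Record<string, string>; value: [number, string] }>;
  };
}

// Build an instant-query URL; the PromQL expression must be URL-encoded.
function buildQueryUrl(baseUrl: string, promql: string): string {
  return `${baseUrl}/api/v1/query?query=${encodeURIComponent(promql)}`;
}

// Highest value across all returned series, or undefined if nothing usable came back.
function maxValue(response: InstantQueryResponse): number | undefined {
  const data = response.data;
  if (response.status !== "success" || !data || data.resultType !== "vector") {
    return undefined;
  }
  const values = data.result.map((series) => Number(series.value[1]));
  return values.length > 0 ? Math.max(...values) : undefined;
}

// Assumed base URL: a local port-forward to the Prometheus service.
const url = buildQueryUrl(
  "http://localhost:9090",
  'DCGM_FI_DEV_GPU_UTIL{namespace="monitoring"}',
);

// Invented payload in the documented response shape, with two GPUs.
const samplePayload: InstantQueryResponse = {
  status: "success",
  data: {
    resultType: "vector",
    result: [
      { metric: { gpu: "0" }, value: [1700000000, "37"] },
      { metric: { gpu: "1" }, value: [1700000000, "0"] },
    ],
  },
};

const busiest = maxValue(samplePayload);
console.log(url, busiest);
```

To run it for real, fetch the URL and pass the parsed JSON body to maxValue; a result of undefined means the exporter's series are not being scraped at all, which is a different problem from a GPU reporting zero utilization.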

Traces in Grafana Tempo

ALB=$(kubectl -n router get ingress router -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
curl -s http://${ALB}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","messages":[{"role":"user","content":"What year did Kubernetes 1.0 ship?"}],"max_tokens":50}'

In Grafana, go to Explore → Tempo and search by service name hybrid-llm-router. You should see the trace for the request above, with the llm.backend attribute showing local or cloud depending on whether vLLM was warm.

Langfuse dashboard

kubectl -n monitoring port-forward svc/langfuse 3001:3000 &
# Open http://localhost:3001 — log in with the account you created

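The Langfuse Generations view shows each request with model name, token counts, latency, and cost. Cost is computed from the model's known pricing, so a model Langfuse does not recognize (the local vLLM model, for example) shows token counts without a cost unless you define a price for it in Langfuse. The Analytics tab shows aggregates over time: total tokens and cost per day, split by model and backend.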

Tearing Down

kubectl -n router delete ingress router

cdk destroy HybridLlmObservability
cdk destroy HybridLlmRouter
cdk destroy HybridLlmInference
cdk destroy HybridLlmAddons
cdk destroy HybridLlmNodeGroups
cdk destroy HybridLlmCluster
cdk destroy HybridLlmNetwork

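The Prometheus and Grafana PVCs are not deleted automatically when the stack is destroyed; the claims and the volumes behind them can remain as orphans. Clean them up after destroying HybridLlmObservability but before destroying HybridLlmCluster, while kubectl can still reach the cluster: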

kubectl -n monitoring delete pvc --all

What's Next

The platform is now fully instrumented. You can see GPU utilization alongside inference latency, distributed traces that show exactly when the router fell back from local to cloud, and per-request cost data in Langfuse that makes the cloud-vs-local trade-off visible in dollar terms.

In Part 8 we put the platform under load: write integration tests that exercise both routing paths, run a realistic traffic simulation against the ALB, and use the Grafana and Langfuse data we wired up in this part to verify that the routing heuristics and autoscaling are behaving as designed. Part 8 is also where we collect the sample workloads — classification, summarization, multi-step reasoning — that demonstrate the hybrid approach working as intended.