Observability for LLM Applications on Kubernetes: Tokens, Traces, and Cost per Request

Once you have an LLM running in production, the first question stops being "does it work?" and starts being "is it actually working — and what is each call costing us?" Traditional application observability does not answer that. A 200 OK on /v1/chat/completions tells you the request did not crash. It does not tell you the model took eleven seconds to start streaming, that the user's prompt blew the context window, that a single retrieval-heavy request burned forty cents in cloud tokens, or that your local vLLM pod is sitting at 14% GPU utilization while the autoscaler refuses to scale down.

LLM workloads need their own observability stack. The signals that matter — tokens, latency-to-first-token, KV cache pressure, model routing decisions, cost-per-request — do not show up on a default Grafana dashboard. This post covers what to instrument, how to wire it together on Kubernetes, and the dashboards we build for teams running production LLM services at Entuit.

We assume you are running something like the architecture from our self-hosting LLMs on Kubernetes post — vLLM or Ollama behind a Service, possibly with a hybrid cloud router in front, scaled by KEDA. The instrumentation here is the same whether you serve one model or twenty.

The Four Signals That Actually Matter

Standard RED metrics (Rate, Errors, Duration) do not capture what makes LLM workloads different. A single inference request can take anywhere from 200ms to two minutes. A "successful" response can be a hallucination. Two requests with identical latency can cost wildly different amounts depending on output length. You need a different vocabulary.

Token throughput. Input tokens per second, output tokens per second, broken down by model and route. This is the closest thing LLM serving has to a "queries per second" metric, but it matters more: input tokens are ingested in parallel during prefill while output tokens are generated one at a time, so output is usually 5-10x slower per token and dominates capacity planning.

Time-to-first-token (TTFT) and inter-token latency (ITL). End-to-end response time is misleading because output length varies. TTFT captures how long the user waited before anything started appearing on screen — this is the perceived latency for streaming UIs. ITL captures the steady-state generation speed once streaming begins. Most user complaints about "slow AI" are TTFT problems caused by KV cache pressure or cold model loading, not throughput problems.

GPU and KV cache utilization. Standard nvidia-smi GPU utilization is deceptive for inference workloads — it shows the percentage of time any CUDA kernel was running, not how saturated the GPU actually is. For vLLM, the real signal is KV cache block utilization. When KV cache hits 90%+, new requests queue or get preempted. When it sits at 20%, you are paying for idle capacity.

Cost per request. Input tokens × input price plus output tokens × output price, at the rates of whichever route served the request. For cloud-routed requests, this is straightforward arithmetic. For local inference, it is amortized GPU hours divided by tokens served. Either way, you want this as a histogram, not an average: the p95 and p99 of cost-per-request are where unit economics quietly die.

Everything else (error rate, queue depth, request size distribution, prompt cache hit ratio) is useful, but if you only instrument four things, instrument these.
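
The cost arithmetic above can be sketched in a few lines. This is a minimal illustration, not our production accounting: the function names, the example prices, the $2.50 GPU hourly rate, and the three million tokens per hour are all assumptions chosen to make the numbers easy to follow.

```typescript
// Price in USD per token. Real prices belong in config.
type Price = { inputPerToken: number; outputPerToken: number };

// Cloud route: plain arithmetic on the provider's token counts.
function cloudCostUsd(inputTokens: number, outputTokens: number, price: Price): number {
  return inputTokens * price.inputPerToken + outputTokens * price.outputPerToken;
}

// Local route: amortize the hourly GPU cost over the tokens served in that hour,
// then charge the request for its share.
function localCostUsd(
  requestTokens: number,
  gpuHourlyUsd: number,
  tokensServedPerHour: number,
): number {
  if (tokensServedPerHour <= 0) return 0; // nothing served, nothing to attribute
  return requestTokens * (gpuHourlyUsd / tokensServedPerHour);
}

// A retrieval-heavy request: 20k input tokens, 1k output tokens.
const examplePrice: Price = { inputPerToken: 3 / 1e6, outputPerToken: 15 / 1e6 };
console.log(cloudCostUsd(20000, 1000, examplePrice).toFixed(4));
console.log(localCostUsd(21000, 2.5, 3000000).toFixed(4));
```

The local figure moves with utilization: the same request costs more on a GPU that served fewer tokens that hour, which is why idle capacity shows up directly in unit economics.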

The Stack

We use three layers, each doing one thing well:

┌────────────────────────────────────────────────────────────┐
│                  Your LLM Application                       │
│             (OTEL SDK + token accounting)                   │
└─────────────────────┬──────────────────┬───────────────────┘
                      │                  │
                      ▼                  ▼
        ┌──────────────────────┐  ┌─────────────────────┐
        │   OTEL Collector     │  │     Langfuse        │
        │ (traces + metrics)   │  │ (prompt/completion  │
        │                      │  │  logging + evals)   │
        └──────┬───────────┬───┘  └─────────────────────┘
               │           │
               ▼           ▼
        ┌──────────┐  ┌──────────┐
        │ Tempo /  │  │Prometheus│
        │ Jaeger   │  │          │
        └──────────┘  └────┬─────┘
                           ▼
                      ┌─────────┐
                      │ Grafana │
                      └─────────┘

OpenTelemetry for traces and infrastructure metrics. Every request gets a span; spans carry token counts, model name, route decision, and cost as attributes.
Prometheus + Grafana for aggregate metrics and dashboards. vLLM exposes a /metrics endpoint out of the box; we scrape it and join it with application-level metrics.
Langfuse (or Phoenix, or Arize) for prompt and completion logging, evaluation, and replay. This is the job Prometheus cannot do: storing the actual text of prompts and outputs for debugging and offline evaluation.

Avoid the temptation to put everything in one tool. Prometheus is wrong for storing 8KB prompt strings; Langfuse is wrong for high-cardinality counter aggregation. Use each for what it is good at.

Instrumenting the Application

The application layer is where you generate the data that nothing else can see — which user made the request, which prompt template was used, which model the router picked, and how many tokens came back. Without this, your dashboards will show you "the system is healthy" while individual users are getting garbage.

Here is the pattern we use, in TypeScript with the OpenTelemetry API. clientFor and estimateTokens are thin wrappers of our own around the Anthropic and OpenAI clients and are not shown:

import { trace, metrics, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("llm.app");
const meter = metrics.getMeter("llm.app");

// Counters and histograms for aggregate dashboards
const inputTokens = meter.createCounter("llm.tokens.input");
const outputTokens = meter.createCounter("llm.tokens.output");
const ttftHistogram = meter.createHistogram("llm.time_to_first_token", { unit: "ms" });
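// OTEL's default bucket boundaries (0, 5, 10, 25, ...) are far too coarse for dollar amounts,
// so suggest finer ones. `advice` needs @opentelemetry/api 1.7 or later.
const costHistogram = meter.createHistogram("llm.cost_per_request", {
  unit: "usd",
  advice: { explicitBucketBoundaries: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5] },
});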

// Per-model pricing — keep this in config, not code
const PRICING: Record<string, { input: number; output: number }> = {
  "claude-opus-4-7":   { input: 15.0 / 1e6, output: 75.0 / 1e6 },
  "claude-sonnet-4-6": { input:  3.0 / 1e6, output: 15.0 / 1e6 },
  "qwen2.5:32b":       { input:  0.0,       output:  0.0 }, // local
};

type Message = { role: string; content: string };

async function callLlm(model: string, messages: Message[], route: string, userId: string): Promise<string> {
  return tracer.startActiveSpan("llm.completion", async (span) => {
    span.setAttribute("llm.model", model);
    span.setAttribute("llm.route", route); // "cloud" or "local"
    span.setAttribute("llm.user_id", userId);
    span.setAttribute("llm.prompt_tokens_est", estimateTokens(messages));

    const start = performance.now();
    let firstTokenAt: number | null = null;
    const outText: string[] = [];

    try {
      const stream = clientFor(model).stream({ model, messages });
      for await (const chunk of stream) {
        if (firstTokenAt === null) {
          firstTokenAt = performance.now();
          const ttftMs = firstTokenAt - start;
          span.setAttribute("llm.ttft_ms", ttftMs);
          ttftHistogram.record(ttftMs, { model, route });
        }
        outText.push(chunk.delta);
      }

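      // Final accounting: canonical token counts from the response
      const inTokens = stream.usage.input_tokens;
      const outTokens = stream.usage.output_tokens;
      // An unpriced model should not fail a completed request; flag it on the span instead
      const price = PRICING[model] ?? { input: 0, output: 0 };
      span.setAttribute("llm.pricing_known", model in PRICING);
      const cost = inTokens * price.input + outTokens * price.output;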

      span.setAttribute("llm.input_tokens", inTokens);
      span.setAttribute("llm.output_tokens", outTokens);
      span.setAttribute("llm.cost_usd", cost);

      inputTokens.add(inTokens, { model, route });
      outputTokens.add(outTokens, { model, route });
      costHistogram.record(cost, { model, route });

      return outText.join("");
    } catch (e) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: String(e) });
      span.recordException(e as Error);
      throw e;
    } finally {
      span.end();
    }
  });
}

A few things to notice:

Token counts come from the model, not estimates. Estimating with tiktoken is fine for pre-flight checks, but for cost accounting you want the canonical count the provider returns in the response usage object. vLLM, Ollama, OpenAI, and Anthropic all return this, though on OpenAI-compatible streaming endpoints you have to ask for it by setting stream_options.include_usage in the request.
TTFT is measured on the client when the first streamed chunk arrives, because that is the wait the user actually sees. If you are not streaming, you cannot measure TTFT, and you should be streaming for any user-facing application.
The route attribute is what makes hybrid setups debuggable. When a user complains a response was slow or wrong, the first question is "did this go to cloud or local?" Tag it once at the call site and you will never have to guess.
High-cardinality attributes go on spans, low-cardinality attributes go on metrics. user_id belongs on a trace span (where it is searchable but not aggregated). model and route belong on both, because they are low-cardinality and useful for dashboards.
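
To make the TTFT and ITL definitions concrete, here is a small sketch that derives both from chunk arrival times. The function and type names are hypothetical, and it measures gaps between chunks rather than tokens, so it only approximates ITL when a chunk carries more than one token.

```typescript
type StreamTiming = { ttftMs: number | null; meanItlMs: number | null };

// requestStartMs: when the request was sent.
// chunkArrivalsMs: arrival time of each content-bearing chunk, oldest first.
function streamTiming(requestStartMs: number, chunkArrivalsMs: number[]): StreamTiming {
  const first = chunkArrivalsMs[0];
  const last = chunkArrivalsMs[chunkArrivalsMs.length - 1];
  if (first === undefined || last === undefined) {
    return { ttftMs: null, meanItlMs: null }; // nothing streamed, nothing to measure
  }
  const gaps = chunkArrivalsMs.length - 1;
  return {
    ttftMs: first - requestStartMs,
    // Mean gap between consecutive chunks; not defined for a single chunk.
    meanItlMs: gaps > 0 ? (last - first) / gaps : null,
  };
}

// A request that waits 800 ms for the first chunk, then streams one chunk every 50 ms.
console.log(streamTiming(1000, [1800, 1850, 1900, 1950]));
```

Two requests with the same end-to-end time can look very different here: a long TTFT with fast ITL points at queueing or cold loading, while a short TTFT with slow ITL points at generation throughput.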

Scraping vLLM and Ollama

vLLM exposes Prometheus metrics on /metrics by default. The metrics that matter:

Metric	What it tells you
`vllm:num_requests_running`	Active requests in the batch
`vllm:num_requests_waiting`	Requests queued (this is your "things are bad" signal)
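`vllm:gpu_cache_usage_perc`	KV cache utilization as a 0-1 fraction; the real saturation signal (newer vLLM releases rename it to `vllm:kv_cache_usage_perc`, so check your /metrics output)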
`vllm:time_to_first_token_seconds`	Histogram of TTFT from the server's perspective
`vllm:time_per_output_token_seconds`	ITL histogram
`vllm:prompt_tokens_total`	Cumulative input tokens served
`vllm:generation_tokens_total`	Cumulative output tokens generated

If your Prometheus is configured for annotation-based service discovery (the prometheus.io/* relabeling convention), scraping it takes three annotations on the vLLM Service:

apiVersion: v1
kind: Service
metadata:
  name: vllm-qwen-coder
  labels:
    app: vllm
    model: qwen2.5-coder-32b
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8000"
    prometheus.io/path: "/metrics"
spec:
  selector:
    app: vllm
    model: qwen2.5-coder-32b
  ports:
    - name: http
      port: 8000
      targetPort: 8000

If you are using kube-prometheus-stack, use a ServiceMonitor instead. The Prometheus Operator does not act on prometheus.io/* annotations unless you add a matching scrape config yourself, and ServiceMonitors are easier to manage at scale.

Ollama is less complete. It does not expose a Prometheus endpoint natively, so we run a small sidecar that polls /api/ps and /api/show and translates the JSON into Prometheus metrics. For most production work this is a sign you should be on vLLM anyway — Ollama is excellent for development and small-scale local inference, but it lacks the operational surface area you need for serious observability.
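
The translation step in such a sidecar is small. The sketch below covers only the /api/ps half and assumes the response fields name, size, and size_vram; the ollama_* metric names are our own invention, and the HTTP polling and the /metrics listener are left out.

```typescript
// Subset of the Ollama /api/ps response that this sketch reads.
type OllamaPs = {
  models: { name: string; size: number; size_vram: number }[];
};

// Escape a label value for the Prometheus text exposition format.
function escapeLabel(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render gauges in Prometheus text format. Metric names are hypothetical.
function renderOllamaMetrics(ps: OllamaPs): string {
  const lines: string[] = [];
  lines.push("# TYPE ollama_loaded_models gauge");
  lines.push(`ollama_loaded_models ${ps.models.length}`);
  lines.push("# TYPE ollama_model_size_bytes gauge");
  for (const m of ps.models) {
    lines.push(`ollama_model_size_bytes{model="${escapeLabel(m.name)}"} ${m.size}`);
  }
  lines.push("# TYPE ollama_model_vram_bytes gauge");
  for (const m of ps.models) {
    lines.push(`ollama_model_vram_bytes{model="${escapeLabel(m.name)}"} ${m.size_vram}`);
  }
  return lines.join("\n") + "\n";
}

const samplePs: OllamaPs = {
  models: [{ name: "qwen2.5:32b", size: 21000000000, size_vram: 21000000000 }],
};
console.log(renderOllamaMetrics(samplePs));
```

Note what is missing compared with vLLM: there is no queue depth, no TTFT histogram, and no cache utilization to translate, which is the operational gap described above.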

GPU Metrics That Are Not Lies

NVIDIA's stock GPU exporter, dcgm-exporter, gives you per-GPU utilization, memory used, temperature, power draw, and ECC errors. Deploy it as a DaemonSet on your GPU nodes:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.0-ubuntu22.04
          ports:
            - name: metrics
              containerPort: 9400
          securityContext:
            runAsNonRoot: false
            capabilities:
              add: ["SYS_ADMIN"]
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi

The trap with GPU metrics is that DCGM_FI_DEV_GPU_UTIL — the headline number — measures the percentage of time at least one kernel was active. A vLLM pod doing single-stream inference can sit at 95% on that metric while only using a small fraction of the SM (streaming multiprocessor) capacity. The metrics that better reflect actual saturation are:

DCGM_FI_PROF_SM_ACTIVE: fraction of time at least one warp was resident on an SM, averaged across all SMs (closer to real utilization)
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE: fraction of cycles the tensor cores were busy (most relevant for transformer inference)
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE: framebuffer memory used vs free (the hard ceiling)

Pair these with vllm:gpu_cache_usage_perc and you have a real picture of whether your GPU is actually working hard or just plugged in.
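
One way to combine these signals is a small classifier over a snapshot of the four metrics. This is a heuristic sketch, not a vLLM or DCGM feature: the type and function names are hypothetical, the 0.9 and 0.2 KV cache cut-offs come from the rule of thumb earlier in this post, and the 0.3 activity and 90% utilization cut-offs are arbitrary placeholders to tune. Watch the units: DCGM_FI_DEV_GPU_UTIL is a percentage, while the DCGM profiling metrics and the vLLM cache metric are 0-1 fractions.

```typescript
type GpuSnapshot = {
  gpuUtilPercent: number; // DCGM_FI_DEV_GPU_UTIL, 0-100
  smActive: number;       // DCGM_FI_PROF_SM_ACTIVE, 0-1
  tensorActive: number;   // DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, 0-1
  kvCacheUsage: number;   // vllm:gpu_cache_usage_perc, 0-1
};

type Assessment = {
  verdict: "saturated" | "underused" | "ok";
  utilIsMisleading: boolean;
};

function assessGpu(s: GpuSnapshot): Assessment {
  const computeIdle = s.smActive < 0.3 && s.tensorActive < 0.3;
  // Headline utilization is high while the compute units are mostly idle.
  const utilIsMisleading = s.gpuUtilPercent >= 90 && computeIdle;
  let verdict: Assessment["verdict"] = "ok";
  if (s.kvCacheUsage >= 0.9) {
    verdict = "saturated"; // new requests will queue or be preempted
  } else if (s.kvCacheUsage <= 0.2 && computeIdle) {
    verdict = "underused"; // paying for idle capacity
  }
  return { verdict, utilIsMisleading };
}

// Single-stream inference: the headline number says 95%, everything else says idle.
console.log(assessGpu({ gpuUtilPercent: 95, smActive: 0.12, tensorActive: 0.05, kvCacheUsage: 0.14 }));
```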

Capturing Prompts and Completions with Langfuse

Metrics tell you the system is unhealthy. They do not tell you why. For that, you need the actual text of the prompts and completions, structured so you can search them, replay them, and run evaluations against them. We use Langfuse for this; Phoenix and Arize are also reasonable choices.

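Langfuse runs in your cluster as a web app and ingest API backed by Postgres; v3 and later also need ClickHouse, Redis, and blob storage. Value names differ between chart versions, so check the chart's values.yaml first. A minimal install looks like: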

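helm repo add langfuse https://langfuse.github.io/langfuse-k8s
helm repo update
# Generate the secrets once and store them; changing the salt later invalidates hashed API keys
helm install langfuse langfuse/langfuse \
  --namespace observability --create-namespace \
  --set postgresql.enabled=true \
  --set langfuse.nextauth.secret="$(openssl rand -hex 32)" \
  --set langfuse.salt="$(openssl rand -hex 32)"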

In the application, every LLM call becomes a Langfuse trace with the prompt, completion, model, tokens, cost, and latency attached:

import { Langfuse } from "langfuse";

const langfuse = new Langfuse();

async function callLlm(model: string, messages: Message[], route: string, userId: string): Promise<string> {
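  // Put userId on the trace itself so traces can be looked up by user in Langfuse
  const lfTrace = langfuse.trace({ name: "llm.request", userId, metadata: { route } });
  const generation = lfTrace.generation({
    name: "llm.completion",
    model,
    input: messages,
  });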

  // ... existing callLlm body ...

  generation.end({
    output: responseText,
    usage: { input: inTokens, output: outTokens, unit: "TOKENS" },
  });
  return responseText;
}

This pays off in three ways:

Debugging. When a user reports "the bot gave me a weird answer," you can find the exact prompt and completion in seconds, including the system prompt and any RAG context that was injected.
Evaluation. You can pull a sample of production traffic and run automated evaluations against it — checking for hallucinations, format compliance, or regressions after a prompt change.
Prompt versioning. Langfuse can store prompts as versioned artifacts. Roll forward and backward without a code deploy when you find a prompt regression in production.

What Langfuse should not do is replace your metrics pipeline. Querying "p95 TTFT over the last hour" in Langfuse is slow and expensive. Querying it in Prometheus is instant. Use each for what it is good at.

The Dashboards

We build three dashboards for every LLM deployment. They are mostly the same shape regardless of stack.

Dashboard 1: Service Health. TTFT (p50/p95/p99), ITL (p50/p95/p99), request rate by model, error rate by error class, KV cache utilization, queue depth. This is the on-call dashboard — the one that should be open when the alert fires. Set TTFT p95 alerts at a threshold that reflects your actual user experience (we typically use 2 seconds for chat UIs, 500ms for autocomplete).

Dashboard 2: Cost and Unit Economics. Cost per request (p50/p95/p99), cost per user per day, tokens served by model and route, cloud-vs-local split as a percentage of total tokens, cost trend over the last 30 days. This is the dashboard you send the finance team. The unit economics question — "are we losing money on every request?" — is answerable from here.

Dashboard 3: Model Behavior. Output token distribution (are responses getting longer over time? a common sign of a regressed prompt), prompt cache hit ratio if you use Anthropic or OpenAI prompt caching, route mix over time, top users by token consumption, top prompt templates by cost. This is the one that catches "the new feature is silently 10x more expensive than the old feature."

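A simple Prometheus query for cost-per-request p95, grouped by model and route. The exact series name depends on how your exporter normalizes OTEL names; some append the unit, giving llm_cost_per_request_usd_bucket: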

histogram_quantile(0.95,
  sum by (le, model, route) (
    rate(llm_cost_per_request_bucket[5m])
  )
)
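
It helps to see what histogram_quantile does with the buckets, because the answer is only as good as the bucket boundaries. The sketch below is a simplified re-implementation of the linear interpolation, not Prometheus's own code, and it ignores native histograms and several edge cases. The bucket counts are made-up data for 100 requests that all cost under $0.50.

```typescript
// Cumulative bucket counts, sorted by upper bound, ending with le = Infinity.
type Bucket = { le: number; count: number };

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const top = buckets[buckets.length - 1];
  if (top === undefined || top.count === 0 || q < 0 || q > 1) return NaN;
  const rank = q * top.count;
  let lower = 0; // lower bound of the current bucket
  let below = 0; // cumulative count below the current bucket
  for (const b of buckets) {
    if (b.count >= rank) {
      const inBucket = b.count - below;
      // In the +Inf bucket (or an empty one) there is nothing to interpolate.
      if (b.le === Infinity || inBucket === 0) return lower;
      return lower + (b.le - lower) * ((rank - below) / inBucket);
    }
    lower = b.le;
    below = b.count;
  }
  return NaN;
}

// Coarse boundaries: every request lands in the (0, 5] bucket.
const coarse: Bucket[] = [
  { le: 0, count: 0 }, { le: 5, count: 100 }, { le: 10, count: 100 }, { le: Infinity, count: 100 },
];
// The same requests with cost-sized boundaries.
const fine: Bucket[] = [
  { le: 0.01, count: 60 }, { le: 0.05, count: 90 }, { le: 0.1, count: 98 },
  { le: 0.5, count: 100 }, { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, coarse)); // about 4.75
console.log(histogramQuantile(0.95, fine));   // about 0.081
```

With the coarse buckets the p95 comes out near $4.75 even though no request cost more than fifty cents, so a cost histogram needs boundaries sized to the dollar amounts you actually expect.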

And for the cloud-vs-local token split, useful for verifying your hybrid routing is actually doing what you think:

sum by (route) (rate(llm_tokens_output_total[5m]))
/ ignoring(route) group_left
sum (rate(llm_tokens_output_total[5m]))
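
The division in that query computes each route's share of total output tokens. The same arithmetic in TypeScript, with hypothetical names and made-up rates, in case you want to check the split in application code or a test:

```typescript
type RouteRate = { route: string; tokensPerSec: number };

// Each route's fraction of the total output-token rate.
function routeShares(rates: RouteRate[]): { route: string; share: number }[] {
  let total = 0;
  for (const r of rates) total += r.tokensPerSec;
  return rates.map((r) => ({
    route: r.route,
    share: total > 0 ? r.tokensPerSec / total : 0, // no traffic: report zero for every route
  }));
}

console.log(routeShares([
  { route: "cloud", tokensPerSec: 300 },
  { route: "local", tokensPerSec: 900 },
]));
```

If the router is supposed to keep most traffic local and the cloud share creeps upward, that shows up here before it shows up on the bill.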

Alerts That Are Worth Paging On

Alert fatigue is a real risk with LLM workloads because so many of the signals are noisy by nature. We keep page-worthy alerts narrow:

TTFT p95 above threshold for 5 minutes. This catches KV cache saturation, cold starts, and upstream API degradation.
Cost per request p99 doubles week-over-week. Catches prompt regressions, runaway agents, and accidental model swaps.
vLLM num_requests_waiting > 0 for 2+ minutes. Means the autoscaler is not keeping up — either KEDA is misconfigured or you have hit your node pool ceiling.
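GPU memory used > 95%. OOM is much worse on GPU than CPU because the pod usually crashes hard. Catch it before that. vLLM preallocates memory up to its --gpu-memory-utilization setting (0.9 by default), so keep the threshold above that baseline.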
Error rate > 1% sustained. A single 500 is fine. A sustained error rate means a model is broken or a provider is degraded.

Things we have learned not to page on: individual slow requests (LLMs are bursty), GPU utilization alone (it lies, see above), and Langfuse evaluation regressions (better as a daily report).
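
The "for 5 minutes" qualifier is what separates a sustained breach from a single slow request. The sketch below shows the idea in simplified form; it is not how Prometheus evaluates a `for` clause internally, and the names, the 2000 ms threshold, and the sample values are assumptions.

```typescript
type Sample = { tMs: number; value: number };

// Length of the current unbroken run of samples above the threshold.
// Samples must be sorted oldest first.
function breachDurationMs(samples: Sample[], threshold: number): number {
  let runStart: number | null = null;
  for (const s of samples) {
    if (s.value > threshold) {
      if (runStart === null) runStart = s.tMs; // a new breach run begins
    } else {
      runStart = null; // any healthy sample resets the run
    }
  }
  const last = samples[samples.length - 1];
  if (runStart === null || last === undefined) return 0;
  return last.tMs - runStart;
}

function shouldPage(samples: Sample[], threshold: number, forMs: number): boolean {
  return breachDurationMs(samples, threshold) >= forMs;
}

// TTFT p95 sampled once a minute: one spike, then recovery.
const spike: Sample[] = [
  { tMs: 0, value: 600 }, { tMs: 60000, value: 9000 }, { tMs: 120000, value: 650 },
];
console.log(shouldPage(spike, 2000, 300000)); // false: the run was reset
```

A lone spike never pages because the next healthy sample resets the run, which is the behavior you want for bursty LLM traffic.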

What This Buys You

The teams that get the most value from LLM observability are not the ones with the prettiest dashboards. They are the ones who can answer four questions in under a minute:

Is the service healthy right now? (TTFT, error rate, queue depth)
What is each request costing us, and is it trending the right direction? (cost histogram, route mix)
When this specific user complained, what actually happened? (Langfuse trace lookup by user_id)
Is our GPU capacity sized correctly? (KV cache utilization, tensor core activity)

If you cannot answer all four, you are operating an LLM service the same way you operate a web service — and the failure modes are different enough that this will eventually hurt. Token cost is silent. Hallucinations are silent. GPU under-utilization is silent. None of these show up as a 500.

Build the instrumentation early, before you need it. The cost is a few hundred lines of code and a few dashboards. The alternative is debugging a $40,000 monthly bill or a degraded user experience with no signal to go on — and at that point, the answer is always the same: "we need to add observability first."