← All posts

ai-infrastructure

11 posts tagged “ai-infrastructure

Building a Hybrid LLM Platform on EKS, Part 5: Serving Local Models with vLLM and KEDA

Part 5 of our hands-on EKS series. We deploy vLLM model servers on the GPU pool from Part 4, load Qwen2.5-7B model weights from Amazon S3 via an init container, and wire KEDA autoscaling that scales replicas with live queue depth and drives GPU nodes to zero overnight.

Building a Hybrid LLM Platform on EKS, Part 7: Observability and Cost Telemetry

Part 7 of our hands-on EKS series. We instrument the TypeScript router with OpenTelemetry, upgrade Prometheus to kube-prometheus-stack for GPU and vLLM metrics, add Grafana Tempo for distributed traces, and wire Langfuse so every request shows its backend, token count, and dollar cost.

Building a Hybrid LLM Platform on EKS, Part 6: The Hybrid Router

Part 6 of our hands-on EKS series. We build a TypeScript/Hono router that sits in front of both vLLM and the Anthropic API, routes each request to the right backend based on model name and complexity heuristics, and falls back to cloud when the local model is cold-starting.

Building a Hybrid LLM Platform on EKS, Part 8: Testing, Load, and Examples

The final part of our EKS series. We write integration tests with Vitest, load-test the ALB with k6, build three real-world TypeScript workloads that prove the hybrid routing works, and use the Grafana and Langfuse dashboards from Part 7 to verify the platform under traffic.

Building a Hybrid LLM Platform on EKS, Part 4: Platform Add-ons, the Load Balancer Controller, and Karpenter

Part 4 of our hands-on EKS series. We install the two add-ons every production EKS cluster needs: the AWS Load Balancer Controller so Kubernetes Ingress objects provision real ALBs, and Karpenter for cost-aware autoscaling — including the GPU NodePool that scales to zero between inference workloads.

Building a Hybrid LLM Platform on EKS, Part 3: Node Groups, GPU AMIs, and the NVIDIA Device Plugin

Part 3 of our hands-on EKS series. We add worker nodes to the empty cluster from Part 2: a CPU system pool for add-ons and the hybrid router, a GPU pool for vLLM model servers, the NVIDIA device plugin DaemonSet, and the taints and labels that make scheduling predictable.

Building a Hybrid LLM Platform on EKS, Part 2: The Control Plane, IAM, and IRSA

Part 2 of our hands-on EKS series. We provision the EKS cluster into the VPC from Part 1, wire up OIDC federation and IRSA so pods authenticate without static credentials, and end with a working kubectl connection to a real cluster.

Building a Hybrid LLM Platform on EKS, Part 1: Architecture and the Network Foundation

Part 1 of a hands-on series building the EKS-based hybrid LLM platform referenced throughout this blog. We map out the full architecture, then provision the VPC, subnets, NAT, and VPC endpoints with AWS CDK — the network foundation every later part builds on.

Observability for LLM Applications on Kubernetes: Tokens, Traces, and Cost per Request

How to instrument self-hosted and hybrid LLM workloads with OpenTelemetry, Prometheus, and Langfuse — tracking time-to-first-token, tokens per second, GPU utilization, and unit economics down to the individual request.

Self-Hosting LLMs on Kubernetes: A Practical Guide

How to deploy, serve, and autoscale open-source large language models on Kubernetes with vLLM — from GPU node pools and deployment manifests to KEDA-based autoscaling and production guardrails.

FinOps for AI Infrastructure: Beyond Cloud Cost Tags

Traditional FinOps practices fall short for AI workloads. Here's how to build a cost management strategy that accounts for GPU economics.