← All posts

llm

8 posts tagged “llm

The Cost-Efficient AI Stack: Ship AI Features Without the Runaway Bill

Most teams overpay for AI by routing every request to a frontier model. This is the architecture we build instead — hybrid cloud+local routing, self-hosted inference, agent orchestration, and cost-per-request observability — and the single principle that ties it together: send each unit of work to the cheapest model that can do it well.

Securing Self-Hosted LLMs and AI Agents on Kubernetes

Harden self-hosted vLLM and AI agents on Kubernetes: an auth/rate-limit gateway, gVisor tool sandboxing, prompt-injection guardrails, scoped secrets, and signed model weights — mapped to the OWASP LLM Top 10.

Building a Hybrid LLM Platform on EKS, Part 1: Architecture and the Network Foundation

Part 1 of a hands-on series building the EKS-based hybrid LLM platform referenced throughout this blog. We map out the full architecture, then provision the VPC, subnets, NAT, and VPC endpoints with AWS CDK — the network foundation every later part builds on.

Build a Personal AI Dev Environment: Hybrid Models, Local Inference, and a Workflow That Costs Almost Nothing

The production patterns we deploy for teams — hybrid cloud/local routing, self-hosted models, agent orchestration — scaled down to a single developer's workstation. A practical guide to building a personal AI dev environment with Ollama, Claude Code, and a local router that keeps your token bill near zero.

The Agent Control Plane: Frontier Models Plan, Your Kubernetes Fleet Executes

How to orchestrate a fleet of AI agents using a shared task queue — frontier models like Claude handle planning and decomposition, while a local Kubernetes worker pool runs the high-volume execution tasks. Covers the task ledger, dynamic task creation, lane-based routing, and KEDA autoscaling.

Observability for LLM Applications on Kubernetes: Tokens, Traces, and Cost per Request

How to instrument self-hosted and hybrid LLM workloads with OpenTelemetry, Prometheus, and Langfuse — tracking time-to-first-token, tokens per second, GPU utilization, and unit economics down to the individual request.

The Hybrid AI Playbook: Cloud Models for Thinking, Local Models for Doing

How to cut your AI costs by 60-80% using a hybrid approach — Claude or GPT for planning and complex reasoning, local models like Llama and Qwen for execution tasks like code generation, summarization, and data extraction.

Self-Hosting LLMs on Kubernetes: A Practical Guide

How to deploy, serve, and autoscale open-source large language models on Kubernetes with vLLM — from GPU node pools and deployment manifests to KEDA-based autoscaling and production guardrails.