> Blog Post

Building a Hybrid LLM Platform on EKS, Part 2: The Control Plane, IAM, and IRSA

In Part 1 we built the network: a properly sized VPC with tagged subnets, NAT egress, and VPC endpoints that keep GPU image pulls off the metered NAT gateway. Everything else in this series lives inside that network. Now we drop the cluster into it.

Part 2 does three things. It provisions the EKS control plane (the managed API server and etcd at the core of the cluster) as a second CDK stack that consumes the VPC from Part 1. It wires up the OIDC provider that makes IRSA possible. And it creates the IAM roles that worker nodes and pods use to authenticate to AWS services without static credentials anywhere. By the end you will have a kubectl connection to an empty-but-real cluster, ready for the node groups we add in Part 3.

What IAM Looks Like on EKS

Before writing any CDK, it is worth understanding the two layers of identity EKS deals with — they are different problems and CDK solves them differently.

Node IAM role. Every EC2 instance that joins the cluster needs an IAM role so it can call AWS APIs: register itself with EKS, pull container images from ECR, write logs to CloudWatch. This role is attached to the EC2 instance (via an instance profile) and is shared by every pod running on that node. It is a floor, not a ceiling — it should have only what the node itself needs, not the broader permissions individual pods require. We will create this role now and attach it to node groups in Part 3.

Pod IAM roles (IRSA). Individual pods (the load balancer controller, the vLLM model server, the hybrid router) often need their own AWS permissions, distinct from each other and from the node. IRSA (IAM Roles for Service Accounts) solves this by linking a Kubernetes service account to an IAM role via OIDC federation: the pod presents a Kubernetes-issued JWT to AWS STS, STS validates it against the cluster's OIDC provider registered in IAM, and STS returns temporary credentials scoped to that specific IAM role. No credentials in Secrets, no overprivileged node roles, no manual credential rotation, and the node role stays minimal.

IRSA is not optional for a well-run cluster. We set up the OIDC provider now because every add-on we install in Parts 4–7 uses it.
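To make the federation concrete, here is a minimal sketch of the trust policy document an IRSA role typically carries. The function name buildIrsaTrustPolicy, the account ID, and the issuer URL are placeholders made up for illustration, and the ARN assumes the standard aws partition. CDK generates the equivalent document for you, so treat this as a picture of the shape rather than code from the series.

```typescript
// Sketch only: the shape of an IRSA trust policy. All names and values here
// are hypothetical placeholders, not taken from the companion repository.
interface TrustPolicy {
  Version: string;
  Statement: Array<{
    Effect: "Allow";
    Principal: { Federated: string };
    Action: "sts:AssumeRoleWithWebIdentity";
    Condition: { StringEquals: Record<string, string> };
  }>;
}

function buildIrsaTrustPolicy(
  accountId: string,
  issuerUrl: string,
  namespace: string,
  serviceAccount: string
): TrustPolicy {
  // IAM refers to the provider by the issuer URL without its https:// scheme.
  const issuerHostPath = issuerUrl.replace(/^https:\/\//, "");
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        // Assumes the commercial "aws" partition.
        Principal: {
          Federated: `arn:aws:iam::${accountId}:oidc-provider/${issuerHostPath}`,
        },
        Action: "sts:AssumeRoleWithWebIdentity",
        Condition: {
          StringEquals: {
            // Only this service account in this namespace may assume the role.
            [`${issuerHostPath}:sub`]: `system:serviceaccount:${namespace}:${serviceAccount}`,
            [`${issuerHostPath}:aud`]: "sts.amazonaws.com",
          },
        },
      },
    ],
  };
}

const policy = buildIrsaTrustPolicy(
  "111122223333",
  "https://oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE",
  "kube-system",
  "aws-load-balancer-controller"
);
console.log(JSON.stringify(policy, null, 2));
```

The two condition keys are the part worth remembering: sub pins the role to one service account, and aud pins the token to STS.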

The Cluster Stack

Here is the complete CDK stack for the control plane. We will walk through each section below.

// lib/cluster-stack.ts
import * as cdk from "aws-cdk-lib";
import { Construct } from "constructs";
import * as ec2 from "aws-cdk-lib/aws-ec2";
import * as eks from "aws-cdk-lib/aws-eks";
import * as iam from "aws-cdk-lib/aws-iam";
import { config } from "./config";

interface ClusterStackProps extends cdk.StackProps {
  vpc: ec2.Vpc;
}

export class ClusterStack extends cdk.Stack {
  /** OIDC provider, consumed when creating IRSA roles in later stacks. */
  public readonly oidcProvider: iam.OpenIdConnectProvider;
  /** Node IAM role, attached to node groups in Part 3. */
  public readonly nodeRole: iam.Role;
  /** The EKS cluster itself. */
  public readonly cluster: eks.Cluster;

  constructor(scope: Construct, id: string, props: ClusterStackProps) {
    super(scope, id, props);

    const clusterRole = this.createClusterRole();
    this.nodeRole = this.createNodeRole();
    this.cluster = this.createCluster(props.vpc, clusterRole);
    this.oidcProvider = this.createOidcProvider();

    new cdk.CfnOutput(this, "ClusterName", { value: this.cluster.clusterName });
    new cdk.CfnOutput(this, "OidcIssuer", {
      value: this.cluster.clusterOpenIdConnectIssuerUrl,
    });
  }

  private createClusterRole(): iam.Role {
    return new iam.Role(this, "ClusterRole", {
      assumedBy: new iam.ServicePrincipal("eks.amazonaws.com"),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName("AmazonEKSClusterPolicy"),
      ],
    });
  }

  private createNodeRole(): iam.Role {
    const role = new iam.Role(this, "NodeRole", {
      assumedBy: new iam.ServicePrincipal("ec2.amazonaws.com"),
      managedPolicies: [
        // Lets the node join the cluster and receive scheduling decisions.
        iam.ManagedPolicy.fromAwsManagedPolicyName("AmazonEKSWorkerNodePolicy"),
        // Required for the VPC CNI to assign pod IPs and manage ENIs.
        iam.ManagedPolicy.fromAwsManagedPolicyName("AmazonEKS_CNI_Policy"),
        // Read-only ECR access so nodes can pull container images.
        iam.ManagedPolicy.fromAwsManagedPolicyName(
          "AmazonEC2ContainerRegistryReadOnly"
        ),
        // SSM agent — enables shell access without opening SSH/port 22.
        iam.ManagedPolicy.fromAwsManagedPolicyName(
          "AmazonSSMManagedInstanceCore"
        ),
      ],
    });

    // CloudWatch Container Insights and Fluent Bit log shipping.
    role.addManagedPolicy(
      iam.ManagedPolicy.fromAwsManagedPolicyName("CloudWatchAgentServerPolicy")
    );

    return role;
  }

  private createCluster(vpc: ec2.Vpc, role: iam.Role): eks.Cluster {
    return new eks.Cluster(this, "Cluster", {
      clusterName: config.clusterName,
      version: eks.KubernetesVersion.V1_32,
      role,
      vpc,
      vpcSubnets: [{ subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS }],
      endpointAccess: eks.EndpointAccess.PRIVATE,
      defaultCapacity: 0, // we define node groups explicitly in Part 3
      // Envelope encryption for etcd secrets using a KMS key CDK creates.
      secretsEncryptionKey: this.createSecretsKey(),
      clusterLogging: [
        eks.ClusterLoggingTypes.API,
        eks.ClusterLoggingTypes.AUDIT,
        eks.ClusterLoggingTypes.AUTHENTICATOR,
        eks.ClusterLoggingTypes.CONTROLLER_MANAGER,
        eks.ClusterLoggingTypes.SCHEDULER,
      ],
    });
  }

  private createSecretsKey(): cdk.aws_kms.Key {
    return new cdk.aws_kms.Key(this, "SecretsKey", {
      description: `EKS etcd envelope encryption — ${config.clusterName}`,
      enableKeyRotation: true,
      removalPolicy: cdk.RemovalPolicy.RETAIN,
    });
  }

  private createOidcProvider(): iam.OpenIdConnectProvider {
    return new iam.OpenIdConnectProvider(this, "OidcProvider", {
      url: this.cluster.clusterOpenIdConnectIssuerUrl,
      clientIds: ["sts.amazonaws.com"],
    });
  }
}

Update bin/app.ts to instantiate the cluster stack and pass it the VPC:

// bin/app.ts
import * as cdk from "aws-cdk-lib";
import { NetworkStack } from "../lib/network-stack";
import { ClusterStack } from "../lib/cluster-stack";
import { config } from "../lib/config";

const app = new cdk.App();
const env = { region: config.region };

const network = new NetworkStack(app, "HybridLlmNetwork", { env });

new ClusterStack(app, "HybridLlmCluster", {
  env,
  vpc: network.vpc,
});

Walking Through the Decisions

The cluster role is not the node role

Two IAM roles are in play here and they are entirely distinct. The cluster role (eks.amazonaws.com principal) is what the EKS service itself assumes to manage the control plane on your behalf — provisioning ENIs, managing security groups, updating the cluster endpoint. It only ever needs AmazonEKSClusterPolicy.

The node role (ec2.amazonaws.com principal) is what EC2 instances assume when they join the cluster. It needs the three core EKS node policies: worker node policy (cluster join), CNI policy (pod IP management), and ECR read (image pulls), plus SSM if you want shell access and CloudWatch if you want Container Insights. That is already five managed policies; the point is to keep it to exactly what the node infrastructure needs, trusting IRSA to give pods any narrower, workload-specific permissions.

Private endpoint only

endpointAccess: eks.EndpointAccess.PRIVATE means the Kubernetes API server is reachable only inside the VPC, not from the public internet. This is the correct setting for a production cluster. The alternative (PUBLIC_AND_PRIVATE, which is the CDK construct's default; EKS itself defaults to public-only) exposes the API endpoint publicly with IAM as the gatekeeper. That is convenient and not inherently broken, but it adds unnecessary surface area: anything that can reach the endpoint can attempt authentication, and a misconfiguration in IAM or aws-auth can have a much larger blast radius.

With a private endpoint, kubectl needs network connectivity into the VPC. Either run it from inside the VPC (a bastion, a developer VPN, an AWS Cloud9 environment, or an AWS CloudShell VPC environment attached to the cluster's subnets), or open a Systems Manager Session Manager session to an instance that is already inside it. We will use SSM access via the node role we just defined, which is exactly why AmazonSSMManagedInstanceCore is there; that path becomes available once nodes exist in Part 3.

If you are running this tutorial from a laptop without a VPN into the VPC, temporarily switch to PUBLIC_AND_PRIVATE and lock it down to your IP with eks.EndpointAccess.PUBLIC_AND_PRIVATE.onlyFrom("<your-ip>/32"), which sets the endpoint's publicAccessCidrs. Change it back before any production use.
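In CDK the IP restriction is expressed through the onlyFrom method on EndpointAccess, which takes CIDR blocks rather than bare addresses. A small sketch of turning a single IPv4 address into a /32 block follows; toHostCidr is a hypothetical helper, and 203.0.113.10 is a documentation address standing in for your own.

```typescript
// Hypothetical helper: turn one IPv4 address into a /32 CIDR block and
// reject anything that is not a plain dotted-quad address.
function toHostCidr(ip: string): string {
  const parts = ip.trim().split(".");
  const valid =
    parts.length === 4 &&
    parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255);
  if (!valid) {
    throw new Error(`Not an IPv4 address: ${ip}`);
  }
  return `${parts.join(".")}/32`;
}

const myCidr = toHostCidr("203.0.113.10");
console.log(myCidr); // 203.0.113.10/32

// In the cluster props this would be used roughly as:
//   endpointAccess: eks.EndpointAccess.PUBLIC_AND_PRIVATE.onlyFrom(myCidr),
```

Rejecting input that already contains a prefix length (such as 0.0.0.0/0) is deliberate: it makes it harder to open the endpoint to the world by accident.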

defaultCapacity: 0

The CDK EKS construct will, by default, create a managed node group of two m5.large instances just to give you something to run workloads on. We do not want that. We are going to define our own node groups in Part 3: a CPU system pool and a GPU pool with separate taints, instance types, and scaling policies. Setting defaultCapacity: 0 tells CDK to skip the default group so we start with a control plane only.

Envelope encryption for Secrets

The secretsEncryptionKey option enables KMS envelope encryption for Kubernetes Secrets stored in etcd. Kubernetes itself only base64-encodes Secret values, which is encoding, not encryption. With this option, the API server encrypts each Secret with a data key that is in turn protected by the KMS key we create in createSecretsKey, so reading a Secret out of etcd requires access to that key. We set enableKeyRotation: true and removalPolicy: RETAIN so the key survives a cdk destroy and existing Secrets do not become permanently unreadable if the stack is torn down accidentally.

This is a security baseline, not a silver bullet — RBAC and IRSA are still the primary controls. But it is a checkbox that compliance frameworks always ask about and it costs almost nothing.

Control plane logging

We enable all five log types. They ship to CloudWatch Logs under /aws/eks/<cluster-name>/cluster. This is the diagnostic record for everything that happens at the control plane: API server component logs (api), the record of who made which API request (audit), IAM authentication decisions (authenticator), reconciliation loops (controllerManager), and scheduling decisions (scheduler). Audit logs in particular are invaluable for understanding what called what when something goes wrong, and they feed the observability stack in Part 7. You pay standard CloudWatch Logs ingestion and storage rates; on an empty cluster that is small, though audit volume grows with API traffic.
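All five log types land in the same log group and are told apart by stream name prefix. The sketch below shows that mapping as a lookup you might use when filtering streams; the helper names controlPlaneLogGroup and logTypeOf, and the stream names in the usage lines, are made up for illustration.

```typescript
// Stream-name prefixes EKS uses for each control plane log type.
const streamPrefixes = {
  api: "kube-apiserver-",
  audit: "kube-apiserver-audit-",
  authenticator: "authenticator-",
  controllerManager: "kube-controller-manager-",
  scheduler: "kube-scheduler-",
} as const;

type LogType = keyof typeof streamPrefixes;

// Hypothetical helper: the log group EKS writes control plane logs to.
function controlPlaneLogGroup(clusterName: string): string {
  return `/aws/eks/${clusterName}/cluster`;
}

// Hypothetical helper: classify a stream by its prefix. "audit" is checked
// before "api" because "kube-apiserver-" is a prefix of "kube-apiserver-audit-".
function logTypeOf(streamName: string): LogType | undefined {
  const ordered: LogType[] = [
    "audit",
    "api",
    "authenticator",
    "controllerManager",
    "scheduler",
  ];
  return ordered.find((t) => streamName.startsWith(streamPrefixes[t]));
}

console.log(controlPlaneLogGroup("hybrid-llm")); // /aws/eks/hybrid-llm/cluster
console.log(logTypeOf("kube-apiserver-audit-0123abcd")); // audit
console.log(logTypeOf("kube-apiserver-0123abcd")); // api
```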

The OIDC provider

createOidcProvider creates an IAM OIDC provider that trusts the cluster's built-in OIDC issuer. This is the link that makes IRSA work. When a pod in the cluster presents a Kubernetes-signed JWT to AWS STS (the AssumeRoleWithWebIdentity call), STS validates the token against this provider before issuing temporary credentials.

We expose oidcProvider as a public property so later stacks can use it when constructing IRSA trust policies. The pattern in Part 4, for example, will look like:

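// The issuer is a deploy-time token, so the condition keys are wrapped in
// CfnJson; a token cannot be used directly as an object key.
const issuerHostPath = clusterStack.oidcProvider.openIdConnectProviderIssuer;

const lbcConditions = new cdk.CfnJson(stack, "LbcRoleConditions", {
  value: {
    [`${issuerHostPath}:sub`]:
      "system:serviceaccount:kube-system:aws-load-balancer-controller",
    [`${issuerHostPath}:aud`]: "sts.amazonaws.com",
  },
});

const lbcRole = new iam.Role(stack, "LoadBalancerControllerRole", {
  assumedBy: new iam.WebIdentityPrincipal(
    clusterStack.oidcProvider.openIdConnectProviderArn,
    { StringEquals: lbcConditions }
  ),
});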

The StringEquals condition is what gives IRSA its precision: only the specific service account in the specific namespace can assume this role — no other pod on the cluster, regardless of what node it runs on.
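A toy model makes the exact-match behaviour easy to see. The evaluator below is not how IAM is implemented and ignores everything except StringEquals; the issuer value, the claim sets, and the function name stringEqualsMatches are all invented for the illustration.

```typescript
type Claims = Record<string, string>;

// Toy model of a StringEquals block: every listed key must be present in the
// token's claims with exactly the listed value. Real IAM evaluation has more
// rules; this only illustrates the exact-match behaviour.
function stringEqualsMatches(
  condition: Record<string, string>,
  claims: Claims
): boolean {
  return Object.keys(condition).every((key) => claims[key] === condition[key]);
}

const issuer = "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE"; // placeholder

const condition: Record<string, string> = {
  [`${issuer}:sub`]:
    "system:serviceaccount:kube-system:aws-load-balancer-controller",
  [`${issuer}:aud`]: "sts.amazonaws.com",
};

// Claims as they would be keyed for the load balancer controller's pod.
const controllerPod: Claims = {
  [`${issuer}:sub`]:
    "system:serviceaccount:kube-system:aws-load-balancer-controller",
  [`${issuer}:aud`]: "sts.amazonaws.com",
};

// A pod using a different service account in a different namespace.
const otherPod: Claims = {
  [`${issuer}:sub`]: "system:serviceaccount:default:vllm",
  [`${issuer}:aud`]: "sts.amazonaws.com",
};

console.log(stringEqualsMatches(condition, controllerPod)); // true
console.log(stringEqualsMatches(condition, otherPod)); // false
```

Because the match is on the full subject string, a service account with the same name in another namespace fails just as the unrelated pod does.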

Deploy the Cluster

The cluster stack depends on the network stack, so deploy in order:

# If you already deployed Part 1, skip this.
cdk deploy HybridLlmNetwork

cdk deploy HybridLlmCluster

The control plane takes roughly 12–15 minutes — EKS is standing up managed etcd, an API server, and the VPC integrations. This is the one step in the series where you genuinely wait. Use the time to read the Part 3 outline.

Getting kubectl Access

Once the deploy finishes, update your local kubeconfig:

aws eks update-kubeconfig \
  --name hybrid-llm \
  --region us-east-1

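If you set endpointAccess: PRIVATE, you need to run kubectl from within the VPC. The easiest path without setting up a VPN is an AWS CloudShell VPC environment: create it in the cluster's VPC and private subnets, with a security group that the cluster's security group allows on port 443. A default CloudShell session runs outside your VPC and cannot reach a private endpoint. CloudShell has kubectl and aws pre-installed: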

# From CloudShell, or any host with VPC access:
kubectl cluster-info
kubectl get nodes  # Expect: No resources found — no node groups yet.

If cluster-info returns the API server address and get nodes returns an empty list (not an error), the control plane is healthy. An empty node list is correct — we have no node groups until Part 3.

What CDK Did to aws-auth

When CDK's EKS construct creates a cluster, it does so through a custom resource that assumes a dedicated creation role. As the cluster creator, that role gets system:masters access implicitly; it does not appear in the aws-auth ConfigMap. CDK's kubectl handler assumes the same role to call the Kubernetes API during deployment: installing Helm charts and applying manifests as part of the cdk deploy process.

Your personal AWS IAM identity is not automatically added. To use kubectl from your own credentials, you need to add yourself to aws-auth:

// In ClusterStack, after cluster creation:
this.cluster.awsAuth.addMastersRole(
  iam.Role.fromRoleArn(
    this,
    "AdminRole",
    `arn:aws:iam::${this.account}:role/YourAdminRole`
  )
);

Or for individual user ARNs:

this.cluster.awsAuth.addUserMapping(
  iam.User.fromUserArn(this, "Admin", `arn:aws:iam::${this.account}:user/your-user`),
  { groups: ["system:masters"] }
);

In a real environment you would integrate with AWS IAM Identity Center (SSO) and use short-lived role credentials rather than long-lived user credentials. For the tutorial, adding your role or user as system:masters is sufficient.
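For orientation, this is roughly what a role mapping looks like once it is rendered into the mapRoles field of aws-auth. CDK writes this for you; the sketch only illustrates the shape. The helper names, the admin username, and the role path are hypothetical. One detail worth knowing: aws-auth expects role ARNs without an IAM path, so role/team/YourAdminRole has to be written as role/YourAdminRole.

```typescript
interface RoleMapping {
  rolearn: string;
  username: string;
  groups: string[];
}

// Hypothetical helper: drop any IAM path from a role ARN, since aws-auth
// matches on the ARN without a path.
function stripRolePath(roleArn: string): string {
  const match = /^(arn:[^:]+:iam::\d{12}:role\/)(?:.*\/)?([^/]+)$/.exec(roleArn);
  if (!match) {
    throw new Error(`Not an IAM role ARN: ${roleArn}`);
  }
  return `${match[1]}${match[2]}`;
}

// Hypothetical helper: render mappings as the YAML list stored in mapRoles.
function toMapRolesYaml(mappings: RoleMapping[]): string {
  return mappings
    .map((m) =>
      [
        `- rolearn: ${stripRolePath(m.rolearn)}`,
        `  username: ${m.username}`,
        `  groups:`,
        ...m.groups.map((g) => `    - ${g}`),
      ].join("\n")
    )
    .join("\n");
}

const mapRoles = toMapRolesYaml([
  {
    rolearn: "arn:aws:iam::111122223333:role/team/YourAdminRole",
    username: "admin",
    groups: ["system:masters"],
  },
]);
console.log(mapRoles);
```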

Verify the Control Plane Health

Before moving on, confirm the basics:

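# API server should be reachable and return client and server versions.
kubectl version

# The API server's readiness checks (including etcd) should all report ok.
kubectl get --raw='/readyz?verbose'

# The aws-auth ConfigMap appears in kube-system once a mapping has been added.
kubectl -n kube-system get configmap aws-auth -o yaml

The readyz check is a quick sanity test that the API server can reach etcd and has finished its startup hooks. We use it instead of kubectl get componentstatuses, which has been deprecated since Kubernetes 1.19 and is not a reliable signal on a managed control plane; likewise, kubectl version no longer accepts the old --short flag. If a check fails or kubectl cannot connect at all, look at the CloudWatch control plane logs: the kube-apiserver and kube-controller-manager streams are where EKS surfaces internal issues.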
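The API server's verbose readiness report (kubectl get --raw='/readyz?verbose') prints one line per check, prefixed with [+] for a pass and [-] for a failure, followed by a summary line. If you want to script the health check, a sketch like the following pulls out the failing checks. The function name parseReadyz and the sample output are illustrative, not captured from a real cluster.

```typescript
interface ReadyzReport {
  passed: string[];
  failed: string[];
}

// Hypothetical helper: split verbose readyz output into passed and failed
// check names. Lines without a [+] or [-] prefix are ignored.
function parseReadyz(output: string): ReadyzReport {
  const report: ReadyzReport = { passed: [], failed: [] };
  for (const line of output.split("\n")) {
    const match = /^\[([+-])\](\S+)/.exec(line.trim());
    if (!match) continue; // e.g. the trailing "readyz check passed" summary
    const sign = match[1];
    const name = match[2];
    if (name === undefined) continue;
    if (sign === "+") {
      report.passed.push(name);
    } else {
      report.failed.push(name);
    }
  }
  return report;
}

// Illustrative sample, not real cluster output.
const sample = [
  "[+]ping ok",
  "[+]etcd ok",
  "[-]informer-sync failed: reason withheld",
  "readyz check failed",
].join("\n");

const report = parseReadyz(sample);
console.log(report.failed); // [ 'informer-sync' ]
```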

Tearing Down

Destroy in reverse order:

cdk destroy HybridLlmCluster
cdk destroy HybridLlmNetwork

The KMS key has removalPolicy: RETAIN, so it survives the destroy. For a complete cleanup you have to schedule its deletion yourself from the KMS console or CLI; KMS enforces a waiting period of 7 to 30 days before the key is actually deleted. That is intentional: it prevents accidentally making secrets unrecoverable.

What's Next

You now have a working EKS control plane: a private API server, full control plane logging, KMS-encrypted Secrets, an OIDC provider for IRSA, and a node role ready to attach to node groups. The cluster is real and kubectl-reachable, but it has no nodes and cannot schedule workloads yet.

In Part 3 we add the nodes: a CPU system pool for add-ons and the router, and a GPU pool for model servers. We will cover GPU-specific AMIs, the NVIDIA device plugin DaemonSet, node taints and labels for scheduling affinity, and the instance type choices that balance GPU memory, on-demand cost, and Spot availability.

The full source for Parts 1 and 2 is in the companion repository (link to follow). Each part adds a stack to the same CDK app; because the cluster stack consumes the VPC from the network stack, deploy them in order and destroy them in reverse.