Networking on Talos Linux with Cilium

Table of Contents

You’ve got Talos running and a cluster that comes up clean. But a cluster without real networking isn’t much. Talos ships with Flannel, a basic CNI that gets the job done. Cilium is the step up once you want observability and real control over traffic. Instead of IP-range rules that drift as pods restart, you write identity-based rules: allow role=frontend to reach role=api, wherever they run. eBPF enforcement gives you that control, plus packet-level observability of which pods are talking and what’s being dropped, without sidecars.

I make the assumption that you’re starting fresh. If you already have Talos running with workloads, swapping the CNI means updating your machine config and reapplying it to each node, but you can do that without resetting. The caveats come later. For now, assume you’re building from scratch.

Why Cilium, not Flannel #

Flannel works. It hands out pod IPs, sets up an overlay, and gets out of the way. That’s what Talos ships it for. But once you have real workloads, the list of things Flannel deliberately doesn’t do catches up. Five reasons made Cilium the obvious choice for me.

First, Flannel has no policy engine at all. If you want to control which pods can talk to which, the usual fix is to bolt Calico on alongside it for enforcement. Cilium ships with policy as a first-class feature, and that alone is enough to make the switch.

Second, the policy model is identity-based and goes up to layer 7. Cilium gives every pod a numeric identity based on its labels, and writes rules against those identities instead of IPs. Pod IPs change every time a deployment restarts, but identities stay stable. And because eBPF can read packet contents, rules can go further than ports. You can say things like “service A can GET /products but not POST /products”, or “service A can write to Kafka topic X only”. Calico and Flannel stop at L3 and L4. To get the same L7 control with them, you need an Istio or Linkerd sidecar in every pod.

Third, Gateway API and service mesh capabilities are part of Cilium. No separate controller, no extra CRDs from a different vendor, no second project to operate.

Fourth, observability is built in. Hubble watches every flow as it passes through the eBPF programs that are already running, and shows you who is talking to whom, HTTP verbs, DNS queries, dropped packets, and latency. Without sidecars or extra code in your apps.

Fifth, the wider ecosystem is moving in this direction. I used Calico before. Cilium has graduated from the CNCF and is now a default or supported option on the major managed-Kubernetes platforms.

Underneath all of it is eBPF. Cilium runs its forwarding, policy, and observability as small programs that hook into the Linux kernel directly. They live inside a sandbox the kernel verifies, so forwarding decisions, policy checks, and flow telemetry all happen in one pass through the kernel. No iptables chains, no extra controller pods, no sidecars. It’s also what lets Cilium replace kube-proxy entirely. A Service IP packet hits an eBPF program that does an O(1) hash-map lookup to pick a backend, instead of walking an iptables chain that grows with the cluster.

Any one of these would be useful on its own. What makes Cilium compelling at scale isn’t any single feature. It’s that eBPF lets them all live in one stack instead of needing a separate tool for each. Pod networking, service routing, policy, and observability belong in the same place, and eBPF is what makes putting them there actually work.

Cilium on Talos: the setup #

With the case for Cilium out of the way, the setup on Talos is mostly about turning off the defaults so Cilium can take over. That means disabling the bundled Flannel and kube-proxy, then pointing Cilium at the Talos-specific bits.

Installing Cilium #

Talos doesn’t need a system extension for Cilium, but it does need two machine config changes before Cilium can take over. Disable the bundled Flannel CNI and disable the default kube-proxy so Cilium can replace it:

cluster:
  network:
    cni:
      name: none
  proxy:
    disabled: true

kube-proxy is the component that programs iptables rules so Service ClusterIPs and NodePorts route to the right pods. Cilium can do the same job in eBPF. It intercepts service traffic in the kernel without writing iptables rules, which is faster and gives you Service-level visibility in Hubble. That’s kubeProxyReplacement, and it only works if the default kube-proxy is out of the way.

With kube-proxy gone, Cilium needs another way to reach the Kubernetes API. Talos exposes KubePrism on localhost:7445, a local load balancer that fronts the control plane endpoints. Cilium uses that as its API endpoint.

Apply the machine config changes, then install Cilium with a values file:

# cilium-values.yaml
ipam:
  mode: kubernetes
kubeProxyReplacement: true
k8sServiceHost: localhost
k8sServicePort: 7445
cgroup:
  autoMount:
    enabled: false
  hostRoot: /sys/fs/cgroup
securityContext:
  capabilities:
    ciliumAgent:
      - CHOWN
      - KILL
      - NET_ADMIN
      - NET_RAW
      - IPC_LOCK
      - SYS_ADMIN
      - SYS_RESOURCE
      - DAC_OVERRIDE
      - FOWNER
      - SETGID
      - SETUID
    cleanCiliumState:
      - NET_ADMIN
      - SYS_ADMIN
      - SYS_RESOURCE
gatewayApi:
  enabled: true
hubble:
  enabled: true
  relay:
    enabled: true
  ui:
    enabled: true

The cgroup and securityContext blocks are Talos-specific. Talos doesn’t auto-mount cgroup v2 the way a typical distro does, so Cilium has to be told where to find it. The capability lists are narrower than Cilium’s defaults because Talos runs with a tighter security profile and rejects pods that ask for more than they need.

helm repo add cilium https://helm.cilium.io
helm repo update

helm install cilium cilium/cilium --namespace kube-system -f cilium-values.yaml

Before doing anything else, check that Cilium came up cleanly. The cilium CLI is a separate binary that talks to the agents and reports their state:

cilium status --wait

That waits for every node’s agent and the operator to report ready, then prints a summary of what’s running. For a deeper check, run cilium connectivity test. It deploys a small set of pods and runs pod-to-pod, pod-to-service, and DNS tests across the cluster. It takes a few minutes but it’s the surest way to confirm Cilium is doing what it should before you put real workloads on top.

Once that passes, networking works. You can now deploy workloads.

Network policy: deny-all and then allow #

Once Cilium is running, the default behavior matters: no policy exists, so all traffic is allowed. Cilium only starts enforcing restrictions once you create a policy. By default it doesn’t lock anything down.

A deny-all policy is your starting point:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: default-deny
  namespace: production
spec:
  endpointSelector: {}
  ingress: []
  egress: []

That blocks everything in the namespace, both inbound and outbound. From there, you explicitly allow what’s needed.

Allow traffic from frontends to APIs:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      role: api
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: frontend

Allow your APIs to talk to an external service:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-api-to-upstream
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      role: api
  egress:
  - to:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          role: upstream-service

DNS is the one exception every namespace needs. Workloads resolve names through CoreDNS in kube-system, so use a CiliumClusterwideNetworkPolicy to allow that traffic cluster-wide:

apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-dns
spec:
  endpointSelector: {}
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: "53"
    - protocol: TCP
      port: "53"

Both UDP and TCP need to be allowed. UDP handles the common case, but DNS falls back to TCP whenever a response is too large for a single UDP packet. Think DNSSEC, long TXT records, or large answer sets. A UDP-only policy works until it doesn’t, and the failure mode is intermittent name resolution that’s annoying to debug.

Without this, your pods can’t resolve anything. Easy to miss when you’re thinking about application-layer policy.

Hubble: observability without the overhead #

Hubble is Cilium’s observability layer, watching every packet interaction at microsecond precision. The kicker is that it’s virtually zero-cost. It runs in the kernel alongside eBPF, so there are no logging sidecars and no extra code in your apps.

Hubble can show you:

Which pods are talking to which.
Protocol-level information (HTTP methods, DNS queries).
Dropped packets and policy violations.
L7 metrics (latency, error rates) without touching application code.

For a homelab where you’re the sole operator, this is invaluable: you can see exactly what’s flowing through your cluster without any additional tooling.

The Helm values from the install enabled the Hubble server, the relay (which aggregates flows across nodes), and the UI. To get to the UI, port-forward it:

kubectl -n kube-system port-forward svc/hubble-ui 12000:80

Open http://localhost:12000 and pick a namespace to see the live flow graph. For a CLI view, install the hubble CLI and run hubble observe --namespace production to tail packets, or hubble observe --verdict DROPPED to see what your policies are blocking. That’s the fastest way to debug a misconfigured CiliumNetworkPolicy.

Cilium Gateway API for HTTP routing #

Cilium implements the Gateway API, a Kubernetes standard for L7 routing. It replaces Ingress with a more expressive model: a Gateway resource defines how traffic enters the cluster, and HTTPRoute resources define the actual routing rules.

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: shared-gateway
  namespace: gateway-api-system
spec:
  gatewayClassName: cilium
  listeners:
  - name: http
    port: 80
    protocol: HTTP
    allowedRoutes:
      namespaces:
        from: All
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api-route
  namespace: production
spec:
  parentRefs:
  - name: shared-gateway
    namespace: gateway-api-system
  hostnames:
  - api.example.com
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1
    backendRefs:
    - name: api-service
      port: 8080

Gateway API is a CNCF standard, not a Cilium-specific abstraction, so you can swap implementations later without rewriting your routes.

One caveat: Cilium’s Gateway API support is nearly complete, but TCP routes aren’t supported yet. If you need to route raw TCP traffic (databases, gRPC over TCP), you’ll need something else for now. Active work is ongoing. See the GitHub discussion at https://github.com/cilium/cilium/issues/21929.

L2 announcement for bare-metal load balancing #

If you don’t have a cloud load balancer, Cilium can announce service IPs on the local network with ARP. That used to mean running MetalLB or another bare-metal load balancer project alongside your CNI. Cilium does it directly, no extra component to install or operate. When a pod serving traffic crashes, Cilium moves the IP to a healthy pod and sends a gratuitous ARP update. The network switch immediately sends traffic to the new node.

Three pieces wire this up: the feature has to be enabled in Cilium, you need a pool of IPs that Cilium can hand out to Services of type LoadBalancer, and a policy that tells Cilium which IPs to actually announce on which interfaces.

First, enable the feature in your Cilium values and reapply:

l2announcements:
  enabled: true

Then create the IP pool. Pick a small range on your home network that nothing else uses. Here it’s 10.0.0.240/28, sixteen addresses Cilium can assign to LoadBalancer Services:

apiVersion: cilium.io/v2
kind: CiliumLoadBalancerIPPool
metadata:
  name: homelab-pool
spec:
  blocks:
    - cidr: "10.0.0.240/28"

Finally, the announcement policy. It tells Cilium which nodes should answer ARP for those IPs and over which interfaces:

apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
  name: homelab-announce
spec:
  serviceSelector:
    matchLabels: {}
  interfaces:
    - eth0
    - enp1s0
  loadBalancerIPs: true

serviceSelector: matchLabels: {} matches every Service in the cluster. Narrow it down if you only want certain Services exposed. interfaces lists the physical interfaces Cilium should send the gratuitous ARP from. Match these to whatever your nodes use (ip link on a node will tell you). loadBalancerIPs: true tells the announcer to broadcast the IPs the pool hands out to Services of type LoadBalancer.

Two annotations on a Service give you finer control over how the pool assigns IPs. lbipam.cilium.io/ips asks for a specific IP instead of taking whatever comes next out of the pool. Useful when you want a Service to keep a fixed address that doesn’t shift as the pool fills up:

apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: production
  annotations:
    lbipam.cilium.io/ips: "10.0.0.241"
spec:
  type: LoadBalancer
  selector:
    role: api
  ports:
    - port: 80
      targetPort: 8080

lbipam.cilium.io/sharing-key goes the other way. Services with the same key share a single IP from the pool, as long as their ports don’t overlap. That lets you put a few small Services behind one address instead of burning a pool IP per Service:

apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: production
  annotations:
    lbipam.cilium.io/sharing-key: "shared-1"
spec:
  type: LoadBalancer
  selector:
    role: api
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: admin
  namespace: production
  annotations:
    lbipam.cilium.io/sharing-key: "shared-1"
spec:
  type: LoadBalancer
  selector:
    role: admin
  ports:
    - port: 8080
      targetPort: 8080

Both Services land on the same pool IP. Traffic to port 80 reaches api, traffic to port 8080 reaches admin. Sharing stays inside a namespace by default. Add lbipam.cilium.io/sharing-cross-namespace on both Services with a comma-separated list of namespaces (or * for all) when you need to share across namespaces.

For a bare-metal homelab, this is your path to expose services. BGP is the alternative for larger setups, but L2 is simpler and works well on a single network.

Wrapping up #

With Cilium in place you have networking, identity-based policy, observability through Hubble, HTTP routing through the Gateway API, and a way to expose Services on a home network without depending on a cloud load balancer. The cluster can route traffic, enforce policy, and let you actually see what’s happening on the wire.

What’s still missing is persistent storage. Pods can talk to each other, but they can’t write anything that survives a reschedule. That’s a story for a separate post.