Why I Run Talos Linux: A Minimal OS Built for Kubernetes

I wanted an immutable operating system for my Kubernetes nodes — one I could rebuild from scratch with a single command and no manual steps. I looked at Flatcar (the successor to CoreOS Container Linux), the minimal images from Fedora and Ubuntu, and Talos Linux. Talos won. It has an API that lets me automate the entire node lifecycle, and combined with a PXE boot server, I can wipe and re-bootstrap a three-node cluster in a few minutes.

That same Talos node has no SSH daemon. No port 22. No shell either. No apt, no yum, no /etc to edit, no systemd units to discover, no package upgrades to schedule. The only way to talk to a node is the API. It feels strict for a day, then quiet, then obvious.

This post is what I learned along the way, including the configuration files you need to get your own Talos Linux cluster up and running.

Why the OS matters #

The operating system underneath your Kubernetes cluster still needs care. Kernel patches, security advisories, sysctl tweaks, slow drift between nodes — that’s all real work, and it doesn’t disappear because the cluster on top is healthy. Running k3s on Ubuntu isn’t wrong. But you’re maintaining a full general-purpose OS underneath: hundreds of packages you didn’t install, services you don’t use, and a shell that invites a quick fix on a single node that nobody remembers a month later.

A distribution built specifically for Kubernetes removes most of that work. There’s less code on the box, a smaller attack surface to patch, and far fewer moving parts that can drift between nodes. The OS stops being a thing you maintain and starts being a thing you configure once and replace when you want a new version.

Talos isn’t unique in that idea. Flatcar and the immutable variants of Fedora and Ubuntu point in the same direction. What separates Talos is the API and the model around it.

Talos is a Kubernetes operating system #

Talos is purpose-built to run Kubernetes and nothing else. The whole OS is the minimum a Kubernetes node needs.

What you don’t get:

  • A shell. No bash, sh, or busybox to drop into.
  • SSH. Nothing on port 22.
  • A package manager. You can’t apt install anything.
  • systemd, cron, init.d scripts, or a writable /etc.
  • A way to make a quick change on one node by hand.

What you do get:

  • A Linux kernel.
  • containerd and kubelet, configured and supervised.
  • machined (the supervisor) and apid (the gRPC API).
  • One declarative machine configuration per node.

You configure a node by sending it a YAML machine config. machined applies it. Hostname, network interfaces, install disk, kubelet flags, kernel parameters, registry mirrors, time servers — all in that one file. The root filesystem is read-only at runtime. Upgrades swap to the other partition, A/B style. If the new image fails to boot, the node falls back automatically. There’s no in-place package upgrade that can leave you halfway between versions.

The day-to-day habits all change. You don’t tail -f a log on a node, you ask the API. You don’t edit a config file, you re-apply the machine config. You don’t drop into a host shell, because there isn’t one. After the first week, you stop reaching for any of the old habits.

Talos runs on any bare-metal hardware with the right set of system extensions. The same OS runs on a datacenter rack and on three mini-PCs in my office. Same kernel, same API, same upgrade model. It’s not a scaled-down distribution — it’s the same distribution, configured for different hardware.

System extensions: opt-in capabilities #

The core stays small because anything beyond “run Kubernetes” is added through system extensions baked into the boot image at build time.

A non-exhaustive list of what’s available:

  • iSCSI tooling for network-attached storage.
  • DRBD kernel modules for replicated block storage.
  • NVIDIA and Intel GPU drivers.
  • Intel and AMD CPU microcode.
  • qemu-guest-agent for VM deployments.
  • ZFS modules.
  • Tailscale.

You don’t apt install any of this. You declare which extensions you want, and Talos’s Image Factory builds a custom image with those extensions baked in. The image is immutable, exactly like the kernel.

For a homelab, the hosted Image Factory is the easiest path. If you’d rather not depend on a hosted service, the Image Factory is open source under MPL 2.0 and you can run your own from the siderolabs/image-factory repository. The schematic and image URLs work the same way against your own instance.

That’s the model. Small core, declarative extension list, custom image. “How do I install X on the node?” stops being a question. The answer is always: add X to the schematic, rebuild the image, upgrade the node.

The schematic is a small YAML file that tells the Image Factory which extensions to include:

customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/iscsi-tools
      - siderolabs/util-linux-tools
      - siderolabs/intel-ucode
      - siderolabs/i915-ucode
      - siderolabs/qemu-guest-agent

POST it to the Image Factory and you get back a schematic ID, a hex string that uniquely identifies that extension set:

curl -X POST --data-binary @schematic.yaml https://factory.talos.systems/schematics
# {"id":"376567988e75d96c6f9f96e6c07f83b80beae62e5c1b6c4b3b9e8b22ad9f1234"}

That ID goes into two places: the boot image URL, and the installer image reference inside your machine config (so talosctl upgrade keeps the same extensions). For the rest of this post I’ll use abc123 as the schematic ID.
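To see where the ID lands, here is a small shell sketch that assembles both references from a schematic ID and a Talos version. The `abc123` value is the placeholder from above, not a real schematic ID:

```shell
# Assemble the two Image Factory references from a schematic ID and version.
# "abc123" is a placeholder schematic ID, not a real one.
SCHEMATIC=abc123
TALOS_VERSION=v1.12.7

# 1. Boot image URL (ISO, for the initial install):
ISO_URL="https://factory.talos.systems/image/${SCHEMATIC}/${TALOS_VERSION}/metal-amd64.iso"

# 2. Installer image reference, placed inside the machine config so that
#    talosctl upgrade keeps pulling an image with the same extensions:
INSTALLER_IMAGE="factory.talos.systems/installer/${SCHEMATIC}:${TALOS_VERSION}"

echo "$ISO_URL"
echo "$INSTALLER_IMAGE"
```

Bump `TALOS_VERSION` and the same schematic ID carries your extension set across upgrades.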

My setup #

My cluster is three mini-PCs on a shelf: 6-core CPU, 500 GB NVMe, a single 1 Gb NIC, 16 GB of RAM each. They’re identical, and all three run the control plane and schedule workloads. There’s no special node, so if one dies the other two keep going and the cluster carries on.

Talos doesn’t require this hardware. It runs on Raspberry Pis, on bare metal in a datacenter, on cloud VMs, on Proxmox guests. Use what you have.

One thing to think about before you buy: make sure the memory is upgradeable. I went with mini-PCs that have 16 GB soldered to the board, which means I can’t add more. 16 GB per node is enough to run a three-node cluster with a foundation layer (CNI, storage, ingress, observability) plus workloads that actually do useful things — but only just. If you’re planning to run memory-heavy workloads, pick hardware that takes 32 GB or 64 GB from the start, so you have headroom when you need it. I have to be careful with what I deploy; with a SODIMM slot, you wouldn’t have to.

Build a custom Talos image #

With a schematic ID in hand, you have a few options to get Talos onto a node.

For most homelabs the simplest path is the ISO. Download it from the Image Factory, write it to a USB stick, boot the node. It comes up in maintenance mode, waiting for a machine config:

https://factory.talos.systems/image/abc123/v1.12.7/metal-amd64.iso

What I run is PXE. An old mini-PC in the corner runs k3s with Matchbox, which serves Talos boot images to my nodes over the network. When a node boots, it pulls the right image from Matchbox automatically. Together with talosctl I can wipe and re-bootstrap the entire cluster with a single command and a few minutes of waiting.

You don’t need a separate machine for PXE. A laptop works, as long as it can answer PXE boot requests alongside your router’s DHCP (proxy DHCP). For a one-time install you’ll never repeat, the USB-stick route is easier.

PXE itself runs in two modes. Either Talos boots from the network every time and never touches the disk, or it boots from the network once, writes Talos to disk, and subsequent boots come from disk. I do the latter. If my Matchbox server is offline for any reason, my cluster nodes still come up on their own.
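For reference, a Matchbox profile for network-booting Talos looks roughly like the following. This is a sketch, not a copy of my setup: the Image Factory kernel/initramfs paths and the kernel arguments are assumptions based on the Talos PXE documentation, so verify them against the docs for your version before using it.

```json
{
  "id": "talos",
  "name": "Talos (boot from network, install to disk)",
  "boot": {
    "kernel": "https://factory.talos.systems/image/abc123/v1.12.7/kernel-amd64",
    "initrd": ["https://factory.talos.systems/image/abc123/v1.12.7/initramfs-amd64.xz"],
    "args": [
      "initrd=initramfs-amd64.xz",
      "init_on_alloc=1",
      "slab_nomerge",
      "pti=on",
      "console=tty0",
      "talos.platform=metal"
    ]
  }
}
```

Matchbox groups then map individual machines (by MAC address, for example) to a profile like this one.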

For the installer image referenced inside the machine config (used by talosctl upgrade), the URL pattern is:

factory.talos.systems/installer/abc123:v1.12.7

Generate and apply machine configs #

Talos’s control plane uses a virtual IP that floats to whichever control-plane node is healthy. That VIP is what kubectl and talosctl talk to once the cluster is running. The mechanism is built on Talos’s embedded etcd: a node only takes over the VIP after etcd has elected it.

That matters during bootstrap. Until etcd is initialized, the VIP isn’t bound to anything. Your nodes start in maintenance mode with no etcd, no VIP. So when you apply configs and bootstrap, you have to target each node by its own IP. After the cluster is up, the VIP works for everything.

For the VIP itself, pick an unused IP on your LAN. I use 192.168.1.10. That address goes into the machine config patches a little further down — Talos handles the failover from there.

For generating the configs, you can’t use the VIP yet, because etcd hasn’t been bootstrapped and the VIP isn’t bound to anything. Pick one of your nodes and use its address as the cluster endpoint instead. I use 192.168.1.11, the IP of my first control-plane node:

talosctl gen config homelab https://192.168.1.11:6443 --output-dir _out

That writes three files into _out/:

  • controlplane.yaml — base config for control-plane nodes.
  • worker.yaml — base config for worker-only nodes.
  • talosconfig — client credentials, with the cluster endpoint set to the node IP you just used.

The base configs are usable as-is for a default cluster, but every real setup needs a few overrides: install disk, network interface name, static IP per node, the VIP, and the installer image (so upgrades pull your custom one with the extensions). I keep those in a small patch file per node.

The patch for my first control-plane node, cp-01.patch.yaml:

machine:
  install:
    disk: /dev/nvme0n1
    image: factory.talos.systems/installer/abc123:v1.12.7
    wipe: false
  network:
    hostname: cp-01
    interfaces:
      - interface: enp1s0
        dhcp: false
        addresses:
          - 192.168.1.11/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.1.1
        vip:
          ip: 192.168.1.10
    nameservers:
      - 192.168.1.1
      - 1.1.1.1
  time:
    servers:
      - pool.ntp.org
cluster:
  allowSchedulingOnControlPlanes: true

vip.ip is the bit doing the work. Every control-plane node gets the same patch with the same VIP, and Talos handles the failover internally — no extra software, no keepalived. allowSchedulingOnControlPlanes: true is what makes a three-node cluster useful as a homelab; without it, workloads wouldn’t schedule on these nodes.

For the second and third nodes, copy this file and change two lines: hostname (cp-02, cp-03) and addresses (192.168.1.12/24, 192.168.1.13/24). The VIP stays the same on all three.
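If you’d rather not copy by hand, a short shell loop can stamp out the other two patches. This is a self-contained sketch: the `printf` writes a trimmed stand-in for `cp-01.patch.yaml` so the snippet runs on its own, but in practice you would run the `sed` loop against the full patch above.

```shell
# Trimmed stand-in for cp-01.patch.yaml so this snippet is self-contained;
# in practice, use the full patch file from above.
printf '%s\n' \
  'machine:' \
  '  network:' \
  '    hostname: cp-01' \
  '    interfaces:' \
  '      - interface: enp1s0' \
  '        addresses:' \
  '          - 192.168.1.11/24' > cp-01.patch.yaml

# Derive cp-02 and cp-03: only the hostname and the address change.
for i in 2 3; do
  sed -e "s/cp-01/cp-0${i}/" \
      -e "s#192\.168\.1\.11/24#192.168.1.1${i}/24#" \
      cp-01.patch.yaml > "cp-0${i}.patch.yaml"
done
```

The VIP line is deliberately untouched by the loop, since it must be identical on all three nodes.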

If you also want dedicated workers, the patch is a strict subset (no VIP, no allowSchedulingOnControlPlanes):

machine:
  install:
    disk: /dev/nvme0n1
    image: factory.talos.systems/installer/abc123:v1.12.7
  network:
    hostname: worker-01
    interfaces:
      - interface: enp1s0
        dhcp: false
        addresses:
          - 192.168.1.21/24
        routes:
          - network: 0.0.0.0/0
            gateway: 192.168.1.1
    nameservers:
      - 192.168.1.1

Apply each patch onto the base config and send it to the right node by IP:

talosctl machineconfig patch _out/controlplane.yaml \
  --patch @cp-01.patch.yaml \
  --output cp-01.yaml

talosctl apply-config --insecure -n 192.168.1.11 -f cp-01.yaml
talosctl apply-config --insecure -n 192.168.1.12 -f cp-02.yaml
talosctl apply-config --insecure -n 192.168.1.13 -f cp-03.yaml

--insecure is fine here: the nodes are in maintenance mode, they don’t have certificates yet, and the config you’re sending is what sets up the trust roots for everything after.

The nodes then reboot, install Talos to disk, and try to form a cluster. etcd needs one node to initialize — pick any of the three and bootstrap it. Because the VIP isn’t bound yet, target a node IP directly with -e:

export TALOSCONFIG=_out/talosconfig
talosctl bootstrap -n 192.168.1.11 -e 192.168.1.11

A note on talosctl flags. You can persist endpoints and nodes in the talosconfig with talosctl config endpoint and talosctl config node, but I prefer to pass them explicitly with -e / --endpoints and -n / --nodes. One less piece of hidden state. After bootstrap, the talosconfig already has a working endpoint (the node IP you used during gen config), so you can drop -e from most commands and just pass -n for the node you want to target.

Give etcd a minute, then grab the kubeconfig:

talosctl kubeconfig . -n 192.168.1.11
export KUBECONFIG=$PWD/kubeconfig
kubectl get nodes

If everything’s healthy:

NAME    STATUS   ROLES           AGE   VERSION
cp-01   Ready    control-plane   2m    v1.35.4
cp-02   Ready    control-plane   2m    v1.35.4
cp-03   Ready    control-plane   2m    v1.35.4

That’s a working three-node Talos cluster. From here on, you’re in normal Kubernetes territory.

Day-2 is just the API #

Once the cluster is up, the two operations that show off the model best are upgrades and resets.

An upgrade is one command per node:

talosctl upgrade -n 192.168.1.11 \
  --image factory.talos.systems/installer/abc123:v1.13.0

Talos pulls the new installer image, writes it to the inactive partition, reboots, and the node comes up on the new version. If the new image fails to boot, it falls back to the previous partition automatically. Repeat for the other nodes. There’s no apt upgrade, no half-upgraded state, no postinst scripts to wonder about.

A reset is also one command:

talosctl reset -n 192.168.1.11 --reboot --graceful=false

That wipes the node back to maintenance mode, ready for a fresh config. When I’m rebuilding the cluster and want to be certain nothing from yesterday is leaking into today, that’s the command.

The point isn’t that these commands are magic. It’s that they’re symmetric with everything else: the OS is an API, and day-2 operations are just more API calls.

The honest pain point: templating #

I’ve made Talos sound great, and I think it is. The part I haven’t solved well is configuration.

Machine configs grow. The patch I showed earlier is small. Real ones aren’t. Add registry mirrors, CNI overrides, kubelet extra args, kernel parameters for specific workloads, time-server overrides, KubeSpan settings, and per-environment differences, and the patches stop being small. You end up with a matrix: environments × node roles × hardware variants × extensions. A few of each and you’re managing dozens of subtly different configs that share most of their content.

Talos’s own patch system helps. You can layer multiple patches, and there’s a JSON Patch flavour for surgical edits. But it’s still YAML on YAML, and the moment you want one value derived from another — “the VIP is the first IP in the LAN range” — you’re back to wishing for variables.
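For the surgical case, the JSON Patch flavour is a list of RFC 6902 operations, usually written as YAML. A small sketch: the paths below follow the machine-config structure used elsewhere in this post, but double-check them against your config version before applying.

```yaml
# RFC 6902-style patch, usable with talosctl's --patch flag.
# Retarget the hostname and append an extra nameserver.
- op: replace
  path: /machine/network/hostname
  value: cp-02
- op: add
  path: /machine/network/nameservers/-
  value: 9.9.9.9
```

Handy for one-field changes, but it doesn’t solve the derived-values problem either: every value is still a literal.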

The tool I’m currently trying is talm. The pitch is roughly “Helm for Talos machine configs”: templates, values files, render, apply. It feels like the right shape. I’m not ready to call it a confident endorsement — I’m still working out where its conventions clash with mine — but it’s the closest thing I’ve used to “this is how machine configs should be managed at scale”.

Alternatives I’ve tried or considered: raw talosctl patches with shell scripts (fine for two environments, painful at five), Jsonnet (powerful, steep learning curve for a one-person homelab), Kustomize-on-YAML (technically possible, fights you because Talos configs aren’t Kubernetes resources). None of them feel right.

If you’re starting with Talos, don’t optimise for templating on day one. Plain patches per node, accept the duplication, and get a feel for which parts actually vary. Then go shopping for a tool. That’s where I am, and talm is the current candidate.

What this unlocks #

When the OS underneath is declarative, immutable, and boring, everything above it gets easier. You stop worrying about node drift. You stop writing one-off scripts to fix a node. You stop blaming the OS when something breaks in the cluster, because the OS is barely there to blame.

What you get back is attention. The platform layer — CNI, storage, ingress, GitOps controller, identity provider — moves into focus, because it’s no longer competing with the OS for your time. The cluster starts to feel like an API: declare what you want, send it, watch it converge.

That’s where the interesting work is. The OS isn’t the platform. The platform sits on top of it. But the platform is a lot more enjoyable to build when the layer underneath isn’t fighting you.

Should you run Talos Linux? #

If you’re running Kubernetes anywhere — homelab, edge, production, side project — Talos is worth a serious look. The model takes a few days to settle in. Once it does, you won’t want to go back to a general-purpose OS for cluster nodes.

A few practical takeaways:

  • Make sure your nodes are upgradeable. Don’t be me — I bought mini-PCs with soldered memory and I’ve been working around that ever since.
  • Talos ships with a default CNI. Get familiar with Talos first, then swap it. I run Cilium and recommend it: feature-rich, secure, and the observability is genuinely useful when you’re operating the cluster yourself.
  • Use talosctl dashboard. It’s an in-CLI dashboard with node health, logs, and service status at a glance. Good for a quick check before diving into individual logs, and a nice way to learn what’s happening when you’re new to Talos and not yet a CLI wizard.

I haven’t logged into a node, patched a node, or wondered about the state of a node for months. That’s the bit that sold me, and it’s the bit that’s hardest to convey until you’ve felt it. Try it on three boxes for a weekend. You’ll know within an hour whether it’s for you.