Storage on Talos Linux with Linstor and DRBD

With Talos under the cluster and Cilium handling the network, pods can talk to each other but they still can’t persist anything. That’s what this post fixes.

Every Kubernetes persistent volume needs backing storage. Local NVMe on each node is the fastest option, but if a node dies, the data dies with it. Network-attached storage adds redundancy but costs you latency. Replicated block storage splits the difference: data lives on the node where the workload is running, but gets replicated to other nodes so you can survive a failure.

Linstor with DRBD is the straightforward implementation of that last option. DRBD is a kernel-level block replication layer: reads come from local NVMe, writes replicate synchronously to other nodes. It’s operationally simpler than Ceph and faster than Longhorn, which does the same thing in userspace.

Understanding DRBD and Linstor #

DRBD is a Linux kernel module that mirrors a block device across the network. It’s been around since the late 1990s. Not flashy, but simple, reliable, and battle-tested. From the filesystem’s perspective, a DRBD device looks like a regular disk: writes go to both the local copy and the remote simultaneously, reads come from the local copy.

That read/write split is what makes the performance story interesting. Reads come from local NVMe at full speed. Writes have to cross the network to replicate, so they incur latency, but that’s a deliberate trade-off: your data survives a node failure because it lives on multiple nodes.

Linstor is the controller layer that automates DRBD. Rather than managing DRBD devices by hand on each node, you define volumes in Kubernetes and Linstor handles creating the DRBD device, configuring replication, and recovering from node failures.

Why Linstor over CephFS and Longhorn #

The hardware sets the terms: three nodes, 16 GB RAM each, one 500 GB NVMe per node shared with the OS. Within those constraints, Linstor isn’t just a reasonable choice. It’s the one that fully respects the resource envelope.

Ceph is a distributed storage system that provides block storage, object storage, and shared filesystems in a single self-healing cluster. It’s the natural comparison for replicated storage at scale, and on the right hardware it’s the right answer. On 16 GB nodes it isn’t. A viable Ceph deployment needs 6-8 GB of RAM per node. The default osd_memory_target is 4 GiB per OSD. Lower values are possible but performance degrades, and Ceph’s own documentation advises against going below 2 GB, so sticking with the default is the only sensible choice for production use. At 4 GiB, a single OSD already consumes a quarter of the available memory on a 16 GB node before the monitor, the OS, or any workload gets a byte. Add it all up and Ceph’s footprint runs to 40-50% of available RAM before the cluster does any real work. There’s also the shared-disk problem: Ceph’s bluestore requires dedicated block devices it can manage end-to-end, including checksum validation on every read. Sharing the OS NVMe with Ceph means either carving out a partition (losing the integrity properties that justify Ceph’s complexity in the first place) or accepting constant IOPS contention with the OS. CRUSH maps, healing traffic, and OSD daemons add real overhead on top of that.

Longhorn is Rancher’s block storage solution for Kubernetes, replicated across nodes and managed through the Kubernetes API. It’s lighter and operationally simpler than Ceph, and its replication model is easy to understand without knowing DRBD internals. Longhorn implements replication in userspace: engine and replica processes run per volume, scaling memory cost with volume count and landing in the 2-3 GB range per node at modest scale.

Linstor stays flat. The controller runs at around 512 MB; the per-node satellite at roughly 256 MB; replication happens in the kernel via DRBD with no userspace process per volume and no cache that grows with cluster size. Total userspace cost per node stays under 1 GB regardless of how many volumes are in play. It also sits cleanly on a partition of the existing NVMe (backed by a ZFS pool in this setup), with no dedicated device required and no architectural mismatch with the hardware. On constrained hardware that’s not a tuning preference. It’s the difference between a cluster that runs comfortably and one that’s perpetually under memory pressure.
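
To put the three footprints side by side, here is a rough back-of-the-envelope sketch using only the figures quoted above. It treats 16 GB as 16384 MB and rounds down, and the share helper is a name I made up for illustration, so read the output as an approximation rather than a measurement:

```shell
#!/bin/sh
# Rough per-node memory share on a 16 GB node, using the figures quoted in the text.
NODE_MB=16384

share() {
  # $1 = footprint in MB; prints the integer percentage of node RAM (rounded down)
  echo $(( $1 * 100 / NODE_MB ))
}

echo "Ceph (6-8 GB):        $(share 6144)-$(share 8192)%"
echo "Longhorn (2-3 GB):    $(share 2048)-$(share 3072)%"
echo "Linstor (512+256 MB): $(share $((512 + 256)))%"
```

The Linstor line is the worst case: only the node hosting the controller pays for both the controller and a satellite.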

Linstor does have real operational concerns at the org level: a small expert pool, a kernel module dependency, and recovery procedures that require understanding DRBD internals. These are legitimate reasons larger teams choose Longhorn or Ceph for operational simplicity. In a homelab, those concerns weaken substantially. As long as the configuration is correct (quorum enabled, Piraeus HA controller running, failure behavior validated), the operational profile is fine for this scale.

Setting up Linstor on Talos #

Getting Linstor running on Talos takes three things: a partition for Linstor to manage, the DRBD kernel module added via a system extension, and the Linstor operator installed in the cluster.

Step 1: Partition the disk with VolumeConfig #

Talos is immutable, so partitioning happens through machine config. Define two volumes: one that caps Talos’s EPHEMERAL partition (the /var data partition, which by default grows to fill the disk), and one for Linstor storage (linstor-storage).

These are separate manifests in your machine config:

---
apiVersion: v1alpha1
kind: VolumeConfig
name: EPHEMERAL
provisioning:
  maxSize: 100GiB
  grow: false
---
apiVersion: v1alpha1
kind: UserVolumeConfig
name: linstor-storage
provisioning:
  diskSelector:
    match: disk.transport == 'nvme'
  minSize: 200GiB

The diskSelector targets NVMe disks on the node, minSize reserves at least 200 GiB for Linstor, and the EPHEMERAL volume caps at 100 GiB so the rest of the disk stays available.

EPHEMERAL deliberately doesn’t have a diskSelector here. In my homelab each node has a single NVMe disk, so there’s no ambiguity. Talos installs to the only disk it sees and the partition for linstor-storage ends up on the same one. On a node with multiple disks you’d want to pin EPHEMERAL with its own diskSelector so it lands where you expect.

Important: Talos won’t repartition a disk it thinks is already in use. If you’re deploying to a node that was previously configured, the order matters. First apply the updated machine config (with the new VolumeConfig manifests) so the node has the desired layout stored, then reset to wipe the disk. On the next boot Talos reinstalls and partitions according to the config you just applied:

talosctl apply-config --nodes <node-ip> --file controlplane.yaml
talosctl reset --nodes <node-ip> --wipe-mode all # this is the default wipe-mode

If you reset first and apply the config afterwards, Talos comes back up with the old layout and refuses to repartition the disk on a subsequent config apply, so the new VolumeConfig never takes effect.

Step 2: Add DRBD system extension and rebuild the image #

Linstor needs the DRBD kernel module. Add it to your Talos system extensions schematic:

customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/drbd
      - siderolabs/zfs

DRBD provides the kernel module; ZFS provides the filesystem layer for snapshots and data integrity.

POST this schematic to the Image Factory:

curl -X POST --data-binary @schematic.yaml https://factory.talos.systems/schematics
# {"id":"abc123"}

That gives you a new schematic ID. Update your installer image reference in your machine config:

machine:
  install:
    image: factory.talos.systems/installer/abc123:v1.12.7

Upgrade all your nodes with the new image:

talosctl upgrade --nodes <node-ip> --image factory.talos.systems/installer/abc123:v1.12.7

talosctl upgrade reboots the node, so do this one node at a time and wait for each one to come back Ready before moving to the next. On a three-node cluster the whole process is a few minutes of rolling downtime; running it in parallel will take the cluster offline.
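
The one-node-at-a-time rule is easy to script. The following is a minimal sketch, not a hardened tool: rolling_upgrade and its "<node-ip>=<node-name>" argument format are my own invention, and it assumes talosctl upgrade blocks until the node has rebooted (its wait behavior in recent releases) before kubectl checks readiness:

```shell
#!/bin/sh
# Upgrade nodes strictly one after another, stopping at the first failure.
rolling_upgrade() {
  # $1 = installer image; remaining args = "<node-ip>=<k8s-node-name>" pairs
  image="$1"; shift
  for pair in "$@"; do
    ip="${pair%%=*}"     # text before the first "="
    name="${pair#*=}"    # text after the first "="
    talosctl upgrade --nodes "$ip" --image "$image" || return 1
    kubectl wait --for=condition=Ready "node/$name" --timeout=10m || return 1
  done
}

# Usage (hypothetical addresses and node names):
#   rolling_upgrade factory.talos.systems/installer/abc123:v1.12.7 \
#     10.0.0.11=node-1 10.0.0.12=node-2 10.0.0.13=node-3
```

Returning on the first failure is deliberate: if one node doesn’t come back Ready, the other two keep the cluster up while you investigate.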

Once all nodes are upgraded, the DRBD and ZFS kernel modules are installed on every node.
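
It’s worth confirming that the module is actually loaded, not just installed. As far as I know, Talos loads the kernel modules listed under machine.kernel.modules, and the Piraeus guide for Talos uses a patch of the shape below. Treat this as a sketch under that assumption; drbd_modules_patch and the file name are mine:

```shell
#!/bin/sh
# Print a machine config patch that asks Talos to load the DRBD modules at boot.
drbd_modules_patch() {
  cat <<'EOF'
machine:
  kernel:
    modules:
      - name: drbd
        parameters:
          - usermode_helper=disabled
      - name: drbd_transport_tcp
EOF
}

# Usage against a real node (not run here):
#   talosctl get extensions --nodes <node-ip>                 # extensions present?
#   talosctl read /proc/modules --nodes <node-ip> | grep drbd # module loaded?
#   drbd_modules_patch > drbd-modules.yaml
#   talosctl patch machineconfig --nodes <node-ip> --patch @drbd-modules.yaml
```

If drbd already shows up in /proc/modules on your nodes, the patch is unnecessary.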

Step 3: Install Linstor #

Linstor is a Kubernetes operator. Install it with Helm:

helm repo add piraeusdatastore https://charts.piraeusdatastore.io
helm repo update

helm install piraeus-datastore piraeusdatastore/piraeus \
  --namespace storage-system --create-namespace

That installs the Piraeus operator, which manages the Linstor controller and satellite agents.

Step 4: Bootstrap the Linstor cluster #

With the operator running, two more resources bring Linstor to life. The LinstorCluster initializes the control plane; the LinstorSatelliteConfiguration tells each node how to configure its storage. The LinstorCluster is minimal:

apiVersion: piraeus.io/v1
kind: LinstorCluster
metadata:
  name: linstorcluster
  namespace: storage-system
spec: {}

The LinstorSatelliteConfiguration does more work. It defines the storage pool on each node and includes a Talos compatibility patch. On a standard Ubuntu node, Piraeus uses init containers to load DRBD at runtime via modprobe, coordinate with systemd for clean shutdown, and access /lib/modules, /usr/src, and /etc/lvm. None of that applies to Talos: DRBD is already loaded as a system extension from Step 2, there is no systemd, and the filesystem is immutable. The patches section removes those init containers and volume mounts so the satellite pods start cleanly instead of getting stuck looking for paths that don’t exist:

apiVersion: piraeus.io/v1
kind: LinstorSatelliteConfiguration
metadata:
  name: worker-storage
  namespace: storage-system
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  storagePools:
    - name: data
      zfsPool: {}
      source:
        hostDevices:
          - /dev/<linstor-storage-partition>  # block device from Step 1
  patches:
    - target:
        kind: Pod
        name: satellite
      patch: |
        apiVersion: v1
        kind: Pod
        metadata:
          name: satellite
        spec:
          initContainers:
            - name: drbd-shutdown-guard
              $patch: delete
            - name: drbd-module-loader
              $patch: delete
          volumes:
            - name: run-systemd-system
              $patch: delete
            - name: run-drbd-shutdown-guard
              $patch: delete
            - name: systemd-bus-socket
              $patch: delete
            - name: lib-modules
              $patch: delete
            - name: usr-src
              $patch: delete
            - name: etc-lvm-backup
              $patch: delete
            - name: etc-lvm-archive
              $patch: delete
          containers:
            - name: linstor-satellite
              volumeMounts:
                - mountPath: /etc/lvm/backup
                  name: etc-lvm-backup
                  $patch: delete
                - mountPath: /etc/lvm/archive
                  name: etc-lvm-archive
                  $patch: delete

Before applying this, replace the placeholder device path with the actual partition. The quickest way is talosctl get volumestatus -n <node>, which lists all volumes with their phase and device location:

talosctl get volumestatus -n <node-ip>

Look for the linstor-storage entry in the output (Talos lists user volumes with a u- prefix, so expect an ID like u-linstor-storage). The LOCATION column gives you the device path (e.g. /dev/nvme0n1p3).

Once you have the path, wipe the partition before applying the manifests. Linstor needs a clean device to initialize the ZFS pool, because existing filesystem metadata will cause ZFS creation to fail. Talos mounts user volumes at /var/mnt/<name>, so linstor-storage will be mounted and needs to be unmounted first. Run a debug pod with --profile sysadmin to get the privileges needed:

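node="<node-name>"          # Kubernetes node name, from kubectl get nodes
partition="/dev/nvme0n1p3"  # example: LOCATION of linstor-storage on that node

kubectl -n kube-system debug -it --profile sysadmin --image=alpine "node/${node}" \
  -- sh -c "
    apk add --quiet util-linux
    umount /host/var/mnt/linstor-storage 2>/dev/null || true
    rm -rf /host/var/mnt/linstor-storage 2>/dev/null || true
    wipefs -a ${partition}
  "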

alpine doesn’t ship with wipefs, so the script installs util-linux first. The umount is best-effort: if the volume isn’t mounted yet, it silently continues. With the device clean, substitute the path in hostDevices and apply both manifests.

After wiping, talosctl get volumestatus will show linstor-storage in a failed phase. That’s expected, because Talos can no longer mount the partition once the filesystem is gone. Linstor takes ownership of the raw device instead, and Talos won’t interfere with it.

Before moving on, verify Linstor came up cleanly. The operator should reconcile both manifests into running pods and a registered storage pool on every worker:

kubectl get pods -n storage-system
kubectl get linstorcluster
kubectl get linstorsatelliteconfiguration

The cluster status should report Available, and each worker node should have a satellite pod in Running. For a deeper check, exec into the controller and ask Linstor itself:

kubectl -n storage-system exec deploy/linstor-controller -- linstor node list
kubectl -n storage-system exec deploy/linstor-controller -- linstor storage-pool list

node list should show every worker as Online, and storage-pool list should list a data pool on each one. If a pool is missing, the satellite log on that node (kubectl logs -n storage-system <satellite-pod>) will usually point at the wrong device path or a partition that wasn’t wiped.

Step 5: Configure storage pools and StorageClasses #

With Linstor bootstrapped and holding ownership of the raw devices, the next step is to expose that storage to Kubernetes. StorageClasses define how Linstor provisions volumes: which pool to draw from, how many replicas to create, and how Kubernetes binds volumes to nodes.

Local storage (single replica):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: linstor.csi.linbit.com
parameters:
  linstor.csi.linbit.com/storagePool: "data"
  linstor.csi.linbit.com/layerList: "storage"
  linstor.csi.linbit.com/allowRemoteVolumeAccess: "false"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

Replicated storage (three replicas):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: replicated
  annotations:
    storageclass.kubernetes.io/is-default-class: "false"
provisioner: linstor.csi.linbit.com
parameters:
  linstor.csi.linbit.com/storagePool: "data"
  linstor.csi.linbit.com/autoPlace: "3"
  linstor.csi.linbit.com/layerList: "drbd storage"
  linstor.csi.linbit.com/allowRemoteVolumeAccess: "true"
  property.linstor.csi.linbit.com/DrbdOptions/auto-quorum: suspend-io
  property.linstor.csi.linbit.com/DrbdOptions/Resource/on-no-data-accessible: suspend-io
  property.linstor.csi.linbit.com/DrbdOptions/Resource/on-suspended-primary-outdated: force-secondary
  property.linstor.csi.linbit.com/DrbdOptions/Net/rr-conflict: retry-connect
volumeBindingMode: Immediate
allowVolumeExpansion: true

autoPlace: "3" tells Linstor to place three replicas across nodes automatically. Every write goes to all three, trading some latency for the ability to survive a node failure. storagePool names the Linstor storage pool (data) defined in Step 4, which sits on the partition from Step 1, and layerList: "drbd storage" sets up the DRBD replication stack on top of raw block storage. allowRemoteVolumeAccess: "true" lets a pod use the volume over the network even when it’s scheduled on a node without a local replica, so scheduling never blocks on replica placement.

The four DrbdOptions entries handle split-brain and quorum edge cases. auto-quorum: suspend-io means DRBD suspends I/O when it can’t reach enough replicas to safely continue, so the pod hangs but data stays consistent. on-no-data-accessible extends that to the case where no replica is reachable at all. on-suspended-primary-outdated demotes a primary with stale data when nodes reconnect, so a lagging node can’t win arbitration with outdated writes. rr-conflict: retry-connect covers the case where a resync is needed but the node that would have to receive it is currently primary: the nodes keep retrying the connection rather than disconnecting for good. Together they say: when in doubt, pause rather than corrupt.
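
The quorum rule behind those settings is plain majority arithmetic. Here is a minimal sketch of it, ignoring tiebreaker resources and DRBD’s special handling of even splits; the function names are mine, not part of DRBD or Linstor:

```shell
#!/bin/sh
# Majority quorum: more than half of the replicas must be reachable.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of replicas that can be lost before quorum is gone.
tolerated_failures() {
  echo $(( $1 - $(quorum "$1") ))
}

echo "replicas=3 quorum=$(quorum 3) tolerated=$(tolerated_failures 3)"
echo "replicas=2 quorum=$(quorum 2) tolerated=$(tolerated_failures 2)"
```

With three replicas, one node can fail and writes continue on the remaining two. Under this simple model two replicas tolerate no failure at all, which is one reason autoPlace: "3" is a natural fit for a three-node cluster.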

Local vs replicated: choosing per-workload #

Local is your baseline: fast I/O, no network overhead, the performance of a direct NVMe attach. It’s the right choice for workloads that can recreate their data if a node fails: Elasticsearch, Prometheus, caches.

Replicated adds latency because every write has to reach three nodes. On a 1 Gbps network that’s a few extra milliseconds, which most workloads won’t notice. The local class exists for the ones that do.

The real value is choosing per-workload rather than picking one global strategy. A StatefulSet that needs fault tolerance uses the replicated StorageClass; a monitoring pod that scrapes metrics uses local and accepts losing historical data if the node dies.
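
In practice the per-workload choice comes down to one field on the PersistentVolumeClaim. A small sketch, assuming the two StorageClasses from Step 5; pvc_manifest and the claim names in the usage lines are hypothetical:

```shell
#!/bin/sh
# Print a PVC manifest for the given name, StorageClass, and size.
pvc_manifest() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
spec:
  storageClassName: $2
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: $3
EOF
}

# Usage (needs a cluster; names are examples):
#   pvc_manifest postgres-data replicated 20Gi | kubectl apply -f -
#   pvc_manifest prometheus-data local 50Gi | kubectl apply -f -
#   kubectl -n storage-system exec deploy/linstor-controller -- linstor resource list
```

Because the local class uses WaitForFirstConsumer, its claim stays Pending until a pod that mounts it is scheduled. The replicated claim binds immediately.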

Snapshots: point-in-time backups on ZFS #

ZFS is a filesystem and logical volume manager that gives you snapshots, compression, copy-on-write, and protection against silent data corruption. Snapshots are point-in-time copies of your volume that consume no extra space until the live data diverges from them.

That’s what makes them useful for backups: you can take a snapshot without pausing the workload, then back up the snapshot while the workload keeps writing. The snapshot stays consistent; the live data keeps changing.

Before you can use snapshots, the cluster needs the VolumeSnapshot CRDs and a snapshot-controller. These aren’t bundled with Kubernetes. They come from the kubernetes-csi/external-snapshotter project. The snapshot-controller is a cluster-level component that handles the Kubernetes-side orchestration: it watches VolumeSnapshot objects and binds them to VolumeSnapshotContent resources. The actual CSI calls (CreateSnapshot, DeleteSnapshot) are made by the csi-snapshotter sidecar, which runs inside the Piraeus CSI driver pod and is already included when you install Linstor. You only need to install the external piece: the CRDs and the controller.

git clone https://github.com/kubernetes-csi/external-snapshotter/
cd external-snapshotter
git checkout v8.5.0
kubectl kustomize client/config/crd | kubectl apply -f -
kubectl -n kube-system kustomize deploy/kubernetes/snapshot-controller | kubectl apply -f -

This installs three CRDs (VolumeSnapshot, VolumeSnapshotClass, VolumeSnapshotContent) and a snapshot-controller Deployment in kube-system. Install it once; it works with any CSI driver that supports snapshots.

A VolumeSnapshotClass works like a StorageClass: it tells Kubernetes which CSI driver handles snapshot operations and what to do with the snapshot when it’s deleted. The deletionPolicy: Retain means that deleting a VolumeSnapshot object won’t delete the underlying snapshot data, which is what you want for backups. The Velero label lets the Velero backup tool discover this class automatically. Wiring up Velero for full cluster backups is out of scope for this post, but the label costs nothing to add now if you want to use it in the future.

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: linstor-snapshot-class
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: linstor.csi.linbit.com
deletionPolicy: Retain

With the class in place, taking a snapshot is a single manifest. You reference the PVC you want to snapshot, and the snapshot-controller coordinates with the Linstor CSI driver to create the point-in-time copy:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: database-backup-2026-06-01
  namespace: production
spec:
  volumeSnapshotClassName: linstor-snapshot-class
  source:
    persistentVolumeClaimName: database-pvc
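
Snapshot creation is asynchronous, so a backup script should wait until the snapshot reports readyToUse before relying on it. A minimal sketch using kubectl wait; the wait_for_snapshot helper and its default timeout are my own choices:

```shell
#!/bin/sh
# Block until a VolumeSnapshot reports status.readyToUse=true, or time out.
wait_for_snapshot() {
  # $1 = namespace, $2 = VolumeSnapshot name, $3 = timeout (defaults to 120s)
  kubectl -n "$1" wait "volumesnapshot/$2" \
    --for=jsonpath='{.status.readyToUse}'=true \
    --timeout="${3:-120s}"
}

# Usage:
#   wait_for_snapshot production database-backup-2026-06-01
```

The function returns kubectl’s exit status, so a non-zero result means the snapshot never became ready within the timeout.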

Once the snapshot is complete, you can restore from it by creating a new PVC that references the snapshot:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: database-restored
  namespace: production
spec:
  storageClassName: replicated
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  dataSource:
    name: database-backup-2026-06-01
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io

Linstor snapshots need a snapshot-capable backing pool, such as ZFS or thin-provisioned LVM. This setup uses ZFS, which is why it’s included in the system extensions schematic in Step 2.

When the choice for Linstor DRBD would change #

Two specific futures change the answer.

If the hardware refreshes to 64 GB+ per node with dedicated OSD disks and ideally a 10 GbE cluster network, Ceph becomes genuinely worth the migration cost. At that tier, the integration story (block storage, object storage, and shared filesystem in one self-healing cluster) justifies the operational complexity, and CephFS unlocks workload patterns the current stack can’t serve.

If the operational cost of Linstor starts showing up as recurring problems (kernel module issues after Talos upgrades, recovery procedures that go wrong, quorum misbehavior), Longhorn’s simpler operational model becomes worth the memory premium and migration effort. The honest test is hours spent fighting the storage layer in the last six months. If that number is near zero, the migration isn’t justified.

Until one of those conditions holds, Linstor fits the hardware, delivers the guarantees, and integrates cleanly with the rest of the stack.

Wrapping up #

Linstor gives you replicated block storage that survives a node failure, local volumes when you don’t need replication, and ZFS snapshots for point-in-time backups. The operational story is light: no CRUSH maps, no healing traffic, no userspace replication daemons competing with your workloads. DRBD handles the replication in the kernel, Linstor coordinates volumes across nodes, and once it’s bootstrapped you mostly leave it alone.

Combined with Cilium for networking and Talos under everything else, you have a cluster that can route traffic, enforce policy, persist data, and survive a node going down. When something breaks, you have the tools to see what’s happening: Cilium shows you the traffic, Linstor shows you the replication status, Talos shows you the node state. Everything is observable, declarative, and manageable from kubectl and talosctl.

That’s the foundation of a platform you actually want to operate.