Cloud Native: Privilege and Security#

This project was inspired by running LXD containers for the last few years. They took the place of VMware VMs and I was really happy with how much easier they are to build and manage. I started to wonder, would it be possible to run systemd in an unprivileged container in Kubernetes? It turns out the answer is no. This post is what I learned from the experience and how I eventually implemented unprivileged servers.

Security and “Unsharing”#

Containerization is based on the idea of unsharing parts of a process's view of the system, called namespaces. There are eight namespaces in Linux. The descriptions below come from the unshare manual page:

  1. mount namespace: Mounting and unmounting filesystems will not affect the rest of the system, except for filesystems which are explicitly marked as shared.

  2. UTS namespace: Setting hostname or domainname will not affect the rest of the system.

  3. IPC namespace: The process will have an independent namespace for POSIX message queues as well as System V message queues, semaphore sets and shared memory segments.

  4. network namespace: The process will have independent IPv4 and IPv6 stacks, IP routing tables, firewall rules, the /proc/net and /sys/class/net directory trees, sockets, etc.

  5. PID namespace: Children will have a distinct set of PID-to-process mappings from their parent.

  6. cgroup namespace: The process will have a virtualized view of /proc/self/cgroup, and new cgroup mounts will be rooted at the namespace cgroup root.

  7. user namespace: The process will have a distinct set of UIDs, GIDs and capabilities.

  8. time namespace: The process can have a distinct view of CLOCK_MONOTONIC and/or CLOCK_BOOTTIME which can be changed using /proc/self/timens_offsets.

A container is just a program with one or more unshared namespaces. If you’ve worked with containers you’ve probably noticed that PIDs inside the container start at 1 and that processes have different PIDs outside of the container. In Kubernetes, each pod gets its own IP address. The containers in the pod share a single network namespace that has been unshared from the host. It’s a brilliant and effective way to isolate programs from each other.
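A quick way to see namespaces in action on any Linux box is to inspect a process's namespace memberships under /proc; a minimal sketch (the unshare invocation assumes the util-linux tools and sufficient privilege, so it's shown commented out):

```shell
# Every process's namespace memberships are visible under /proc.
# Two processes are in the same namespace exactly when these
# symlink targets (the namespace inode numbers) match.
ls -l /proc/self/ns

# Unshare the PID namespace; inside, ps reports the forked shell
# as PID 1, just like in a container:
#   sudo unshare --pid --fork --mount-proc ps -o pid,comm
```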

However, containerization is not a security technology. Running untrusted workloads, like a student shell, in Kubernetes carries a greater risk than running the same workload in a VM. But containers are lighter, more flexible, more convenient, and more robust and resilient than VMs.

Kubernetes and User Namespaces#

Alban Crequy has a great blog post with a deep explanation of the work to bring user namespaces to Kubernetes. User namespaces became part of Linux 3.8 in 2013. The user namespace allows UIDs inside of a container to work just like PIDs do: the container has a completely separate set of users with a mapping of IDs onto the host system. This, among other things, allows root in the container to be a non-root user on the host. Further, Linux capabilities, like the CAP_SYS_ADMIN capability that gives processes extra privileges, are isolated to their user namespace.
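The mapping itself is visible in /proc. Here's a sketch of what it looks like; the unshare invocation is shown commented out because it assumes a kernel with unprivileged user namespaces enabled:

```shell
# In the initial user namespace the mapping is the identity over
# the entire 32-bit UID range (inside-UID, outside-UID, count):
cat /proc/self/uid_map

# Create a new user namespace mapping UID 0 inside it to our real
# UID; id reports root, but the kernel translates it back outside:
#   unshare --user --map-root-user sh -c 'id -u; cat /proc/self/uid_map'
```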

User namespaces aren’t as simple as they might appear at first glance. If each pod uses a randomly assigned range of UIDs, what happens to files? The files in a PersistentVolume would have to be chown’d each time a pod attaches to it, possibly taking a very long time. There are solutions to these problems, but as of this writing the Kubernetes enhancement to support user namespaces has not yet been completed.

Until user namespaces are supported, root in a pod is root on the node. A container escape from a privileged container means the attacker has full control of the node.

Systemd and Privilege#

Systemd is the init program, PID 1, on modern Linux systems. It does an amazing job of optimizing the startup process and managing mounts, cgroups, services, and targets (formerly runlevels). For years there’s been an argument about whether it’s appropriate to run Systemd in a container. The Docker documentation states:

“It is generally recommended that you separate areas of concern by using one service per container.”

That’s not what Systemd is for; Systemd runs all the services! Daniel Walsh at Red Hat struggled for years to get Systemd to run in a Docker container, then to run in an unprivileged Docker container, and finally triumphed by making Systemd run in an unprivileged Podman container.

In his latest blog article he warns:

“That being said, there are also lots of reasons not to run systemd in containers. The main one is that systemd/journald controls the output of containers, whereas tools like Kubernetes and OpenShift expect the containers to log directly to stdout and stderr. So, if you are going to manage your containers via Orchestrator like these, then you should think twice about using systemd-based containers. Additionally, the upstream community of Docker and Moby were often hostile to the use of systemd in a container.”

I relied heavily on Daniel’s articles to help me understand what I needed to do to get Systemd to run. Unfortunately, there’s no way to make it work in my current Kubernetes implementation, MicroK8s. The reason comes down to how /sys/fs/cgroup is mounted.

On my laptop the mount is read/write as you would expect:

$ mount | grep cgroup
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)

I checked an unprivileged LXD container running Ubuntu 22.04. LXD enables user namespaces and, in unprivileged mode, apparently makes /sys/fs/cgroup writable:

$ mount | grep cgroup
none on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime)

Finally I opened a shell on a Jupyter notebook that I have running using the fantastic Zero to JupyterHub with Kubernetes:

$ mount | grep cgroup
cgroup on /sys/fs/cgroup type cgroup2 (ro,nosuid,nodev,noexec,relatime)

Notice that /sys/fs/cgroup is mounted read-only. When this is true there is no way to make Systemd start. According to the documentation, if you could make /sys/fs/cgroup/systemd writable, Systemd would start, but there’s no way to do that from inside the container (for obvious reasons), though I tried anyway.
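The check is easy to reproduce from a shell inside any container; a minimal sketch:

```shell
# Systemd needs to create its own sub-hierarchy under /sys/fs/cgroup,
# so a read-only mount is a hard stop:
if [ -w /sys/fs/cgroup ]; then
    echo "cgroup fs is writable: systemd can create its hierarchy"
else
    echo "cgroup fs is read-only: systemd cannot start"
fi
```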

The Workaround#

Container escapes are possible (even trivial) for the root user in a privileged container. For my dream of a big, shared Linux server that’s okay because students don’t have root privilege anyway, so for now I’m running it as a privileged container. I also want to assign personal servers to students so they can use apt and dnf and do rooty things. Those servers pose a risk if they are privileged, so I came up with a workaround: don’t run Systemd.

The SSH server runs fine as a single-service container. When running in unprivileged mode I simply change the entrypoint and command of the container in the pod definition:

    command: ["/usr/bin/bash"]
    args: ["-c", "mkdir /run/sshd; /etc/rc.local; exec /usr/sbin/sshd -e -D"]
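For context, here is roughly where those two lines live in the pod definition; the pod name, image, and port below are hypothetical placeholders, not my actual configuration:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: student-server        # placeholder name
spec:
  containers:
    - name: server
      image: registry.example.com/student-server:latest  # placeholder image
      securityContext:
        privileged: false     # the unprivileged case described here
      command: ["/usr/bin/bash"]
      args: ["-c", "mkdir /run/sshd; /etc/rc.local; exec /usr/sbin/sshd -e -D"]
      ports:
        - containerPort: 22
```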

Notice that I’m manually running /etc/rc.local as Systemd would and then exec’ing /usr/sbin/sshd directly. In some ways this makes for a great UNIX environment. But it’s lacking some features that I miss:

  1. It’s impossible to override the hostname; that requires CAP_SYS_ADMIN. So the hostnames come from Kubernetes and are long and weird.

  2. Multiple users work, but without the support of cgroups it’s not possible to confine users’ CPU and memory usage as would be the case on a normal system.

  3. It’s not possible to run podman or docker inside of the container. Podman needs to create cgroups and Docker requires a daemon. This is a huge problem for CIS-92 because I want students to be able to build their own containers.
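You can confirm the missing privileges from inside the pod by inspecting the effective capability set; a sketch (capsh comes from the libcap tools and may not be installed, so it's shown commented out):

```shell
# CapEff is a bitmask of the capabilities the process actually
# holds; in an unprivileged pod, CAP_SYS_ADMIN (bit 21) is cleared:
grep CapEff /proc/self/status

# Decode the mask into capability names if capsh is available:
#   capsh --decode=$(awk '/CapEff/ {print $2}' /proc/self/status)
```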


There’s no way yet to run Systemd in an unprivileged Kubernetes container. I have a good solution for the “Big UNIX Server” environment because I trust the root user, but I don’t yet have a replacement for the “personal server” environment for the rest of my classes. This will change when Kubernetes implements user namespaces. Hopefully soon!