systemd containers on OpenShift with cgroups v2
systemd in a container is a practical reality of migrating nontrivial applications to container infrastructure. It is not the “cloud native” way, but many applications written in The Before Times cannot be broken up and rearchitected without a huge cost. And so, there is a demand to run containers that run systemd, which in turn manages application services.
FreeIPA is one example. Its traditional environment is a dedicated Linux server (ignoring replicas). There are many services which both interact among themselves, and process requests from external clients and other FreeIPA servers. The engineering effort to redesign FreeIPA as a suite of several containerised services is expected to be very high. Our small team focused on bringing FreeIPA to OpenShift therefore decided to pursue the “monolithic container” approach.
Support for systemd containers in OpenShift, without hacks, is a prerequisite for this approach to be viable. In this post I experiment with systemd containers in OpenShift and share my results.
Test application: HTTP server §
To test systemd containers on OpenShift, I created a Fedora-based container running the nginx HTTP server. I enable the nginx systemd service and set the default command to /sbin/init, which is systemd. The server doesn’t host any interesting content, but if it responds to requests we know that systemd is working. The Containerfile definition is:
FROM fedora:33-x86_64
RUN dnf -y install nginx && dnf clean all && systemctl enable nginx
EXPOSE 80
CMD [ "/sbin/init" ]
I built the container on my workstation and tagged it test-nginx.
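For reference, something like the following would build the image and push it to the registry referenced later in the Pod definition (the quay.io repository name is taken from that definition; the exact commands may differ from what I ran):
% podman build -t test-nginx .
% podman tag test-nginx quay.io/ftweedal/test-nginx:latest
% podman push quay.io/ftweedal/test-nginx:latest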
To check that the container works, I ran it locally and performed an HTTP request via curl:
% podman run --detach --publish 8080:80 test-nginx
2d8059e555c821d9ffcccd84bee88996207794957696c54e8d29787e8c33fab3
% curl --head localhost:8080
HTTP/1.1 200 OK
Server: nginx/1.18.0
Date: Thu, 25 Mar 2021 00:22:23 GMT
Content-Type: text/html
Content-Length: 5564
Last-Modified: Mon, 27 Jul 2020 22:20:49 GMT
Connection: keep-alive
ETag: "5f1f5341-15bc"
Accept-Ranges: bytes
% podman kill 2d8059e5
2d8059e555c821d9ffcccd84bee88996207794957696c54e8d29787e8c33fab3
The container works properly in podman. I proceed to testing it on OpenShift.
Running (privileged user) §
I performed my testing on an OpenShift 4.8 nightly cluster. The exact build is 4.8.0-0.nightly-2021-03-26-010831. As far as I’m aware, with respect to systemd and cgroups there are no major differences between OpenShift 4.7 (which is Generally Available) and the build I’m using. So results should be similar on OpenShift 4.7.
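If you want to check which build your own cluster is running, the ClusterVersion object reports it (output will of course vary by cluster):
% oc get clusterversion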
The Pod definition for my test service is:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: quay.io/ftweedal/test-nginx:latest
I create the Pod, operating with the cluster admin credential. After a few seconds, the pod is running:
% oc create -f pod-nginx.yaml
pod/nginx created
% oc get -o json pod/nginx | jq .status.phase
"Running"
Verifying that the service is working §
pod/nginx is running, but it is not exposed to other pods in the cluster, or to the outside world. To test that the server is working, I will expose it on the hostname nginx.apps.ft-48dev-5.idmocp.lab.eng.rdu2.redhat.com. First, observe that performing an HTTP request from my workstation fails because the service is not available:
% curl --head nginx.apps.ft-48dev-5.idmocp.lab.eng.rdu2.redhat.com
HTTP/1.0 503 Service Unavailable
pragma: no-cache
cache-control: private, max-age=0, no-cache, no-store
content-type: text/html
Now I create Service and Route objects to expose the nginx server. The Service definition is:
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  selector:
    app: nginx
  ports:
  - protocol: TCP
    port: 80
And the Route definition is:
apiVersion: v1
kind: Route
metadata:
  name: nginx
spec:
  host: nginx.apps.ft-48dev-5.idmocp.lab.eng.rdu2.redhat.com
  to:
    kind: Service
    name: nginx
I create the objects:
% oc create -f service-nginx.yaml
service/nginx created
% oc create -f route-nginx.yaml
route.route.openshift.io/nginx created
After a few seconds I performed the HTTP request again, and it succeeded:
% curl --head nginx.apps.ft-48dev-5.idmocp.lab.eng.rdu2.redhat.com
HTTP/1.1 200 OK
server: nginx/1.18.0
date: Tue, 30 Mar 2021 08:16:23 GMT
content-type: text/html
content-length: 5564
last-modified: Mon, 27 Jul 2020 22:20:49 GMT
etag: "5f1f5341-15bc"
accept-ranges: bytes
set-cookie: 6cf5f3bc2fa4d24f45018c591d3617c3=6f2f093d36d535f1dde195e08a311bda; path=/; HttpOnly
cache-control: private
This confirms that the systemd container is running properly on OpenShift 4.8.
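As an additional check of the application itself, you can ask systemd inside the container about the nginx unit via oc exec (assuming systemctl behaves normally in this image, it should report the unit as active):
% oc exec pod/nginx -- systemctl is-active nginx
active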
Low-level details §
Now I will inspect some low-level details of the container. I’ll do that in a debug shell on the worker node. So first, I query the pod’s worker node name and container ID:
% oc get -o json pod/nginx \
| jq '.spec.nodeName,
.status.containerStatuses[0].containerID'
"ft-48dev-5-f24l6-worker-0-q7lff"
"cri-o://d9d106cb65e4c965737ef66f15bd5b9e0988c386675e3404e24fd36e58284638"
Now I enter a debug shell on the worker node:
% oc debug node/ft-48dev-5-f24l6-worker-0-q7lff
Starting pod/ft-48dev-5-f24l6-worker-0-q7lff-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.8.1.64
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4#
I use crictl to query the namespaces of the container:
sh-4.4# crictl inspect d9d106 \
| jq .info.runtimeSpec.linux.namespaces[].type
"pid"
"network"
"ipc"
"uts"
"mount"
Observe that there are pid and mount namespaces (among others), but no cgroup namespace. The worker node and container are using cgroups v1.
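Another way to see which cgroup version the node uses is to check the filesystem type mounted at /sys/fs/cgroup; on a cgroups v1 host it is a tmpfs containing the per-controller hierarchies, whereas on cgroups v2 it is a single cgroup2 filesystem:
sh-4.4# stat -f -c %T /sys/fs/cgroup   # tmpfs => cgroups v1; cgroup2fs => cgroups v2
tmpfs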
The container_manage_cgroup SELinux boolean is off:
sh-4.4# getsebool container_manage_cgroup
container_manage_cgroup --> off
Now let’s see what processes are running in the container. We can query the PID of the initial container process via crictl inspect. Then I use pgrep(1) with the --ns option, which lists processes executing in the same namespace(s) as the specified PID:
sh-4.4# crictl inspect d9d106 | jq .info.pid
14591
sh-4.4# pgrep --ns 14591 | xargs ps -o user,pid,cmd --sort pid
USER PID CMD
root 14591 /sbin/init
root 14625 /usr/lib/systemd/systemd-journald
systemd+ 14636 /usr/lib/systemd/systemd-resolved
root 14642 /usr/lib/systemd/systemd-homed
root 14643 /usr/lib/systemd/systemd-logind
root 14646 /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 xterm
dbus 14647 /usr/bin/dbus-broker-launch --scope system --audit
dbus 14651 dbus-broker --log 4 --controller 9 --machine-id 2f2fcc4033c5428996568ca34219c72a --max-bytes 536870912 --max-fds 4096 --max-matches 16384 --audit
root 14654 nginx: master process /usr/sbin/nginx
polkitd 14655 nginx: worker process
polkitd 14656 nginx: worker process
polkitd 14657 nginx: worker process
polkitd 14658 nginx: worker process
polkitd 14659 nginx: worker process
polkitd 14660 nginx: worker process
polkitd 14661 nginx: worker process
polkitd 14662 nginx: worker process
The PID column shows the PIDs from the point of view of the host’s PID namespace. The first process (PID 1 inside the container) is systemd (/sbin/init). systemd has started other system services, and also nginx.
systemd is running as root on the host. The other processes run under various system accounts. The container does not have its own user namespace. This pod was created by a privileged account, which allows it to run as root on the host.
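One way to see which SCC admitted the pod is the openshift.io/scc annotation that OpenShift adds to pods; for a pod created with the cluster admin credential I would expect a permissive SCC here:
% oc get -o json pod/nginx | jq '.metadata.annotations["openshift.io/scc"]'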
Running (unprivileged user) §
I created an unprivileged user called test, and granted it the admin role (so it can create pods).
% oc create user test
user.user.openshift.io/test created
% oc adm policy add-role-to-user admin test
clusterrole.rbac.authorization.k8s.io/admin added: "test"
I did not grant the test account any Security Context Constraints (SCCs) that would allow it to run privileged containers or use host user accounts (including root).
Now I create the same nginx pod, this time as the test user. The pod fails to execute:
% oc --as test create -f pod-nginx.yaml
pod/nginx created
% oc get pod/nginx
NAME READY STATUS RESTARTS AGE
nginx 0/1 CrashLoopBackOff 1 23s
Let’s inspect the logs to see what went wrong:
% oc logs pod/nginx
%
There is no output. This baffled me, at first. Eventually I learned that Kubernetes, by default, does not allocate pseudo-terminal devices to containers. You can overcome this on a per-container basis by including tty: true in the Container object definition:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: quay.io/ftweedal/test-nginx:latest
    tty: true
With the pseudo-terminal enabled, oc logs now shows the error output:
% oc logs pod/nginx
systemd v246.10-1.fc33 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +ZSTD +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=unified)
Detected virtualization container-other.
Detected architecture x86-64.
Welcome to Fedora 33 (Container Image)!
Set hostname to <nginx>.
Failed to write /run/systemd/container, ignoring: Permission denied
Failed to create /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod3bbed45f_634a_4f60_bb07_5f080c483f0f.slice/crio-90dead4cf549b844c4fb704765edfbba9e9e188b30299f484906f15d22b29fbd.scope/init.scope control group: Permission denied
Failed to allocate manager object: Permission denied
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...
The user executing systemd does not have permission to write to the cgroup filesystem. Although cgroups are hierarchical, cgroups v1 does not support delegating management of part of the hierarchy to unprivileged users. But cgroups v2 does support this.
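For background, delegation in cgroups v2 amounts to handing ownership of a subtree of the hierarchy to an unprivileged user, who can then create and manage child cgroups within it. Roughly (the path and UID here are made up for illustration; in practice the container runtime and systemd arrange this):
# as root on a cgroups v2 host (illustration only)
mkdir /sys/fs/cgroup/delegated
chown -R 1000:1000 /sys/fs/cgroup/delegated
# UID 1000 can now create child cgroups under the delegated subtree
# and manage the processes within it, without any extra privileges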
Set the SYSTEMD_LOG_LEVEL environment variable to info or debug to get more detail in the systemd log output.
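For example, the variable can be set via the standard Kubernetes env field in the container spec (shown here as a fragment of the Pod definition above):
spec:
  containers:
  - name: nginx
    image: quay.io/ftweedal/test-nginx:latest
    tty: true
    env:
    - name: SYSTEMD_LOG_LEVEL
      value: debug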
Enabling cgroups v2 §
We can enable cgroups v2 (only) on worker nodes via the following MachineConfig object:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: enable-cgroupv2-workers
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
  - systemd.unified_cgroup_hierarchy=1
  - cgroup_no_v1="all"
  - psi=1
After creating the MachineConfig, the Machine Config Operator applies the configuration change and restarts each worker node, one by one. This occurs over several minutes.
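Once a worker has rebooted, a node debug shell can confirm that cgroups v2 is in effect, using the same filesystem type check as before:
sh-4.4# stat -f -c %T /sys/fs/cgroup   # cgroup2fs => cgroups v2
cgroup2fs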
Running (unprivileged; cgroups v2) §
After some time, all worker nodes have the updated kernel configuration to enable cgroups v2 and disable cgroups v1. I again created the pod as the unprivileged test user. And again, pod execution failed. But this time the error is different:
% oc --as test create -f pod-nginx.yaml
pod/nginx created
% oc get pod
NAME READY STATUS RESTARTS AGE
nginx 0/1 Error 1 12s
% oc logs pod/nginx
systemd v246.10-1.fc33 running in system mode. (+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +ZSTD +SECCOMP +BLKID +ELFUTILS +KMOD +IDN2 -IDN +PCRE2 default-hierarchy=unified)
Detected virtualization container-other.
Detected architecture x86-64.
Welcome to Fedora 33 (Container Image)!
Set hostname to <nginx>.
Failed to write /run/systemd/container, ignoring: Permission denied
Failed to create /init.scope control group: Permission denied
Failed to allocate manager object: Permission denied
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...
The error suggests that the container now has its own cgroup namespace. I can confirm it by creating a pod debug container…
% oc debug pod/nginx
Starting pod/nginx-debug ...
Pod IP: 10.130.2.10
If you don't see a command prompt, try pressing enter.
sh-5.0$
…finding out the node and container ID…
% oc get -o json pod/nginx-debug \
| jq '.spec.nodeName,
.status.containerStatuses[0].containerID'
"ft-48dev-5-f24l6-worker-0-qv7kq"
"cri-o://e870d022d1c53adf94e36877312fcfef5ef0431ad9cf1fbe9c9d2ace02bee858"
…and analysing the container sandbox in a node debug shell:
sh-4.4# crictl inspect e870d02 \
| jq .info.runtimeSpec.linux.namespaces[].type
"pid"
"network"
"ipc"
"uts"
"mount"
"cgroup"
The output confirms that the pod has a cgroup namespace. Despite this, the unprivileged user running systemd in the container does not have permission to manage the namespace. The oc logs output demonstrates this.
container_manage_cgroup SELinux boolean §
I have one more thing to try. The container_manage_cgroup SELinux boolean was disabled on the worker nodes (per the default configuration). Perhaps it is still needed, even when using cgroups v2. I enabled it on the worker node (directly from the debug shell, for now):
sh-4.4# setsebool container_manage_cgroup on
I again created the nginx pod as the test user. It failed with the same error as the previous attempt, when container_manage_cgroup was off. So that was not the issue, or at least not the immediate issue.
Next steps §
At this point, I have successfully enabled cgroups v2 on worker nodes. Container sandboxes have their own cgroup namespace. But inside the container, systemd fails with permission errors when it attempts some cgroup management.
The next step is to test the systemd container in OpenShift with both cgroups v2 and user namespaces enabled. Both of these features are necessary for securely running a complex, systemd-based application in OpenShift. My hope is that enabling them together is the last step to getting systemd-based containers working properly in OpenShift. I will investigate and report the results in an upcoming post.