Multiple users in user namespaces on OpenShift
In the previous post I confirmed that user namespaced pods are working in OpenShift 4.7. There are some rough edges, and the feature must be explicitly enabled in the cluster. But it fundamentally works.
One area I identified for a follow-up investigation is the behaviour of containers that execute multiple processes as different users. The correct and “expected” behaviour is important for systemd-based containers (among other scenarios). I did not anticipate any problems, but this is something we need to verify as part of the effort to bring FreeIPA to OpenShift. This post records my steps to verify that multi-user containers work as needed in user namespaces on OpenShift.
Setup §
Cluster configuration §
I configured the cluster as recorded in my earlier post, User namespaces in OpenShift via CRI-O annotations.
Test program §
I wrote a small Python program to serve as the container entrypoint.
This program will run as root (in the namespace). For each of several hardcoded system accounts, it invokes fork(2) to duplicate the process. The parent process then calls setuid(2) to switch to that account and execlp(3) to replace itself with the sleep(1) program, while the child continues the loop; after the last account, the surviving process sleeps for an hour. The duration of each sleep(1) depends on the UID of the system account that executes it.
Outside the container, we will be able to observe whether the program (and its child processes) are running, and which user accounts they are running under.
The source of the test program:
import os, pwd, time

users = ['root', 'daemon', 'operator', 'nobody', 'mail']

for user in users:
    ent = pwd.getpwnam(user)
    uid = ent.pw_uid
    if os.fork() != 0:
        os.setuid(uid)
        os.execlp('sleep', 'sleep', str(3000 + uid))

time.sleep(3600)
Container §
The Containerfile is simple. Based on fedora:33-x86_64, it copies the Python program into the container and defines the entry point:
FROM fedora:33-x86_64
COPY test_multiuser.py .
ENTRYPOINT ["python3", "test_multiuser.py"]
I built the container and pushed it to quay.io/ftweedal/test-multiuser:latest.
Pod specification §
The pod YAML is:
apiVersion: v1
kind: Pod
metadata:
  name: multiuser-test
  annotations:
    io.kubernetes.cri-o.userns-mode: "auto:size=65536;map-to-root=true"
spec:
  containers:
  - name: multiuser-test
    image: quay.io/ftweedal/test-multiuser:latest
    securityContext:
      runAsUser: 0
      runAsGroup: 0
  securityContext:
    sysctls:
    - name: "net.ipv4.ping_group_range"
      value: "0 65535"
The io.kubernetes.cri-o.userns-mode annotation tells CRI-O to run the pod in a user namespace. The runAsUser and runAsGroup fields tell CRI-O to execute the entry point process as root (inside the namespace).
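For reference, the annotation value packs the userns mode and its options into a single string. A minimal sketch of how that value decomposes (this helper is mine, not part of CRI-O; the option semantics are defined by CRI-O itself):

def parse_userns_mode(value):
    # e.g. "auto:size=65536;map-to-root=true" -> ("auto", {...})
    mode, _, rest = value.partition(":")
    options = dict(opt.split("=", 1) for opt in rest.split(";") if opt)
    return mode, options

print(parse_userns_mode("auto:size=65536;map-to-root=true"))
# ('auto', {'size': '65536', 'map-to-root': 'true'})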
Verification §
I created the pod:
% oc --as test create -f multiuser-test.yaml
pod/multiuser-test created
After a short time, I queried the status, node and container ID of the pod:
% oc get -o json pod multiuser-test \
| jq '.status.phase,
.spec.nodeName,
.status.containerStatuses[0].containerID'
"Running"
"ft-47dev-1-4kplg-worker-0-qjfcj"
"cri-o://ee693645f41aa5b54b890862778f173ebaf465f741231426c9e80237aa60660b"
Next I opened a debug shell on the worker node and queried the container PID (process ID):
% oc debug node/ft-47dev-1-4kplg-worker-0-qjfcj
Starting pod/ft-47dev-1-4kplg-worker-0-qjfcj-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.8.0.165
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# crictl inspect ee69364 | jq .info.pid
2445729
I viewed the user map of the process:
sh-4.4# cat /proc/2445729/uid_map
0 265536 65536
This confirms that the container is in a user namespace. The UID range 0–65535 in the container is mapped to 265536–331071 on the host. That is in line with what I expect.
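As a quick check of that arithmetic (the three values are the fields of the uid_map line above):

inside, outside, length = 0, 265536, 65536   # container start, host start, length
print(f"container {inside}-{inside + length - 1} "
      f"maps to host {outside}-{outside + length - 1}")
# container 0-65535 maps to host 265536-331071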
Now let’s see what else is running in that namespace. We can use pgrep(1) with the --ns PID option, which selects all processes in the same namespace(s) as PID. Then ps(1) can tell us which users are executing those processes.
sh-4.4# pgrep --ns 2445729 \
| xargs ps -o user,pid,cmd --sort pid
USER PID CMD
265536 2445729 sleep 3000
265538 2445766 sleep 3002
265547 2445767 sleep 3011
331070 2445768 sleep 68534
265544 2445769 sleep 3008
265536 2445770 python3 stuff.py
The entry point spawned the expected five child processes, each running as a different user. This is the host view of the processes. Subtracting the base of the uid_map from each UID, we observe that the UIDs in the namespace are 0, 2, 11, 65534 and 8. These are the UIDs of the five accounts declared in the test program.
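That subtraction is easy to script. A small sketch, using the base and size from the uid_map above and the host UIDs from the ps output:

base, size = 265536, 65536                            # uid_map: 0 265536 65536
host_uids = [265536, 265538, 265547, 331070, 265544]  # from the ps output

for host_uid in host_uids:
    assert base <= host_uid < base + size             # each UID is inside the mapping
    print(host_uid, "->", host_uid - base)
# 265536 -> 0, 265538 -> 2, 265547 -> 11, 331070 -> 65534, 265544 -> 8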
Conclusion §
Containers that use multiple users work as expected when using user namespaces in OpenShift.
The as-yet-unstated assumption is that the mapped UID range includes all the UIDs actually used by the containerised application. Different applications use different UIDs, and different operating systems define different UIDs for system accounts. So take care that the UID map requested via the CRI-O annotation suits the container and the application.
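One way to sanity-check this is sketched below. It is only a rough heuristic, assuming the application’s accounts are all declared in the image’s /etc/passwd; run it with the image’s python3:

import pwd

# UIDs of every account declared in the image (a proxy for what the app may use).
needed = {ent.pw_uid for ent in pwd.getpwall()}

size = 65536    # the size requested in the userns-mode annotation
unmapped = sorted(uid for uid in needed if uid >= size)
if unmapped:
    print("UIDs outside a mapping of size", size, ":", unmapped)
else:
    print("all", len(needed), "account UIDs fall inside the mapping")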
Note that mapped UID ranges in Linux need not be contiguous (either outside or inside the container). That is, a process may have multiple lines in its /proc/<PID>/uid_map, mapping several non-overlapping and not-necessarily-adjacent ranges. I am talking about the Linux user namespace feature here; I have not yet checked whether CRI-O + OpenShift admits this more complex scenario, but it is fundamentally possible.
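To make that concrete, here is a sketch of a host-UID lookup over a multi-line uid_map. The two-range map in the example is hypothetical; I have not produced such a map with CRI-O:

def to_host_uid(uid_map_text, container_uid):
    # Each uid_map line is: <start inside> <start outside> <length>.
    for line in uid_map_text.splitlines():
        if not line.strip():
            continue
        inside, outside, length = map(int, line.split())
        if inside <= container_uid < inside + length:
            return outside + (container_uid - inside)
    raise ValueError(f"UID {container_uid} is not mapped")

# Hypothetical non-contiguous map: only UIDs 0-99 and 65534 are mapped.
example = """
0       265536  100
65534   265636  1
"""
print(to_host_uid(example, 42))      # 265578
print(to_host_uid(example, 65534))   # 265636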
The nobody user in Fedora has UID 65534. Therefore a “simple mapping” must have a size of at least 65535 to make the nobody account usable in a user namespaced pod. OK, let’s round that up to 65536 = 2^16. With a total UID space of 2^32 = 2^(16+16), you are limited to fewer than 65536 separate mappings. That sounds like a lot, but the limit could become a problem in large, complex environments. Most applications, however, use only a handful of UIDs. Non-contiguous UID mappings could dramatically increase the number of ranges available, by not mapping UIDs that applications do not use. But there is substantial complexity in defining and managing non-contiguous UID mappings.
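For the avoidance of doubt, the arithmetic behind that ceiling:

NOBODY = 65534              # UID of the nobody account in Fedora
min_size = NOBODY + 1       # a simple 0-based mapping must cover 0..65534
size = 2 ** 16              # rounded up to 65536
total_uids = 2 ** 32        # size of the Linux UID space
print(min_size, size, total_uids // size)
# 65535 65536 65536  (65536 is the ceiling; reserved host ranges reduce it further)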