Tags: openshift, security, containers

Creating user namespaces inside containers

Over the last year I have experimented with user namespace support in OpenShift. That is, making OpenShift run workloads inside a separate user namespace. We’re trying to drive this feature forward, but some people have reservations. Does having processes running as root inside a user namespace present an increased security risk? What if there are kernel bugs…

If you’re worried about the security of user namespaces, OpenShift or Kubernetes user namespace support doesn’t change the game at all. As I demonstrate in this post, you can create and use user namespaces inside your workloads right now.

Demo

I tested on OpenShift 4.9.0 in the default configuration. So, no explicit user namespace support. I used a stock Fedora container image with the following Pod spec:

apiVersion: v1
kind: Pod
metadata:
  name: fedora
spec:
  containers:
  - name: fedora
    image: registry.fedoraproject.org/fedora:34-x86_64
    command: ["sleep", "3600"]
    securityContext:
      capabilities:
        drop:
        - CHOWN
        - DAC_OVERRIDE
        - FOWNER
        - FSETID
        - SETPCAP
        - NET_BIND_SERVICE

The Pod will run under the restricted SCC. I explicitly drop a number of default capabilities.

Next I created a project named userns and a new user named me.

% oc new-project userns
Now using project "userns" on server "https://api.ci-ln-cih2n32-f76d1.origin-ci-int-gce.dev.openshift.com:6443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app rails-postgresql-example

to build a new example application in Ruby. Or use kubectl to deploy a simple Kubernetes application:

    kubectl create deployment hello-node --image=k8s.gcr.io/serve_hostname

% oc create user me
user.user.openshift.io/me created

% oc adm policy add-role-to-user edit me
clusterrole.rbac.authorization.k8s.io/edit added: "me"

Operating as me I created the pod:

% oc --as me create -f pod-fedora.yaml
pod/fedora created

Soon after, the pod is running. I can see what node it is running on, and its CRI-O container ID:

% oc get -o json pod/fedora \
    | jq '.status.phase,
          .spec.nodeName,
          .status.containerStatuses[0].containerID'
"Running"
"ci-ln-cih2n32-f76d1-sjtwq-worker-a-qr5hr"
"cri-o://d164163951604b7fc9506b3a390ec6a14c76dc6077406fc7b5ffcbf81c406f68"

Next I started a shell in my container. I’ll leave it running for now, and come back to it later:

% oc exec -it pod/fedora /bin/sh
sh-5.1$

In another terminal, I opened a debug shell on the worker node. Then I used crictl to find out the process ID (pid) of the main container process.

% oc debug node/ci-ln-cih2n32-f76d1-sjtwq-worker-a-qr5hr
Starting pod/ci-ln-cih2n32-f76d1-sjtwq-worker-a-qr5hr-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.2
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# crictl inspect d1641639 | jq .info.pid
18668

Next I used pgrep to find all the processes that share the same set of namespaces as process 18668. In other words, processes running in the same pod sandbox.

sh-4.4# pgrep --ns 18668 \
    | xargs ps -o user,pid,cmd --sort pid
USER         PID CMD
1000580+   18668 sleep 3600
1000580+   26490 /bin/sh

There are two processes, running under an unprivileged UID. The UID comes from a unique range allocated for the userns project. These two processes are the main container process (sleep) and the shell that I exec'd a few steps ago. As expected.
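To check that a UID really falls inside a project's allocated range, a small helper is enough. This is a minimal sketch with some assumptions of mine: OpenShift records the range on the project's namespace in the `openshift.io/sa.scc.uid-range` annotation in `start/size` form, and the value `1000580000/10000` used below is illustrative (the start matches the UID seen later in uid_map; the size is a guess). The function name `uid_in_range` is hypothetical.

```shell
# The annotation can be inspected with, for example:
#   oc get namespace userns -o yaml
# and looks like:  openshift.io/sa.scc.uid-range: 1000580000/10000  (assumed value)

# uid_in_range RANGE UID
# Succeeds if UID lies within RANGE, where RANGE is "start/size".
uid_in_range() {
  start=${1%/*}   # text before the slash
  size=${1#*/}    # text after the slash
  [ "$2" -ge "$start" ] && [ "$2" -lt $((start + size)) ]
}
```

For instance, `uid_in_range 1000580000/10000 1000580000 && echo yes` would print `yes` under the assumed range.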

Now for the fun part. Back to the shell we opened in pod/fedora. Observe that this shell process has an empty capability set:

sh-5.1$ grep Cap /proc/$$/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000000000000000
CapAmb: 0000000000000000
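Each Cap* field is a hexadecimal bitmask with one bit per capability, so all zeros means no capabilities at all. As a rough sketch of how to read such a mask, the hypothetical helper `decode_caps` below lists the bits that are set. The bit numbers are taken from linux/capability.h; only a handful are named here, and anything else is printed as `cap_N`.

```shell
# Map a capability bit number to its name (partial table).
cap_name() {
  case "$1" in
    0)  echo CHOWN ;;
    1)  echo DAC_OVERRIDE ;;
    3)  echo FOWNER ;;
    4)  echo FSETID ;;
    5)  echo KILL ;;
    6)  echo SETGID ;;
    7)  echo SETUID ;;
    8)  echo SETPCAP ;;
    10) echo NET_BIND_SERVICE ;;
    *)  echo "cap_$1" ;;
  esac
}

# decode_caps HEXMASK
# Print the names of the capabilities set in HEXMASK, or "none".
decode_caps() {
  mask=$((0x$1))
  bit=0
  out=""
  while [ "$mask" -ne 0 ]; do
    if [ $((mask & 1)) -eq 1 ]; then
      out="$out${out:+ }$(cap_name "$bit")"
    fi
    mask=$((mask >> 1))
    bit=$((bit + 1))
  done
  echo "${out:-none}"
}
```

Running `decode_caps 0000000000000000` prints `none`, which matches every line of the output above.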

And yet, using unshare(1) I was able to create a new user namespace. The -r option says to map root in the new user namespace to the user that created the namespace. And that is indeed what happens:

sh-5.1$ unshare -U -r
[root@fedora /]# id
uid=0(root) gid=0(root) groups=0(root),65534(nobody)

I confirmed it via the node debug shell. I ran pgrep again, this time restricting the search to processes in the same pid namespace as process 18668. The --nslist option gives the list of namespaces to match (all namespaces when not specified).

sh-4.4# pgrep --ns 18668 --nslist pid \
    | xargs ps -o user,pid,cmd --sort pid
USER         PID CMD
1000580+   18668 sleep 3600
1000580+   26490 /bin/sh
1000580+   36704 -sh
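The extra process at the end of that listing is the shell started by unshare. One way to check from the node that it really sits in a different user namespace from the main container process is to compare the namespace links under /proc. This is a minimal sketch; the helper names `userns_of` and `same_userns` are mine, and it assumes it runs as root on the node (for example in the debug shell after `chroot /host`).

```shell
# userns_of PID
# Print the user namespace identifier of a process, e.g. "user:[4026531837]".
userns_of() {
  readlink "/proc/$1/ns/user"
}

# same_userns PID1 PID2
# Succeed if both processes are in the same user namespace.
same_userns() {
  [ "$(userns_of "$1")" = "$(userns_of "$2")" ]
}
```

With the pids from this demo, `same_userns 18668 26490` should succeed, while `same_userns 18668 36704` should fail because the last process was moved into a new user namespace.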

The new shell has pid 36704. Observe that UID 0 in the container maps to UID 1000580000:

sh-4.4# cat /proc/36704/uid_map
         0 1000580000          1
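Each uid_map line has three fields: the first ID inside the namespace, the first ID outside it, and the length of the range. As a sketch of how the translation works, the hypothetical helper `map_uid` below reads a uid_map on standard input and prints the outside UID for a given inside UID, or `unmapped` if no line covers it.

```shell
# map_uid INSIDE_UID < uid_map
# Translate a UID inside the namespace to the UID outside it.
map_uid() {
  awk -v uid="$1" '
    uid >= $1 && uid < $1 + $3 {
      printf "%d\n", $2 + (uid - $1)
      found = 1
      exit
    }
    END { if (!found) print "unmapped" }
  '
}
```

On the node, `map_uid 0 < /proc/36704/uid_map` gives 1000580000 for the map shown above, while any other UID is unmapped. The kernel reports unmapped IDs as the overflow ID 65534, which is why `id` listed a `65534(nobody)` group inside the new namespace.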

Discussion

You can create and use user namespaces inside your containers without any special support from OpenShift or Kubernetes. Therefore, an OpenShift or Kubernetes feature that runs workloads in an isolated user namespace by default does not increase the risk of container escapes or privilege escalation related to processes running as uid 0 in a user namespace.

This is not to gloss over the fact that other parts of a “workloads in user namespaces” feature have to be designed and implemented with care. Particular aspects include pod admission and selection of the unprivileged UIDs to map to. But on the question of the security of the Linux user namespaces feature itself, a first-class OpenShift or Kubernetes feature doesn’t introduce any new risk. Whatever risk there is, is there right now.

If some critical security issue with user namespaces emerges and you need an urgent mitigation, the only option is to alter the container runtime seccomp policies to block the unshare(2) syscall. This is an advanced topic, involving changes to node configuration. For details, see Configuring seccomp profiles in the official OpenShift documentation.
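To give a feel for what such a change involves, here is a rough sketch of a single rule in the JSON seccomp profile format used by container runtimes, emitted by a hypothetical shell helper `unshare_deny_rule`. It is a fragment, not a complete profile: I am assuming it would be added to the `syscalls` list of whatever profile the nodes already use, and how that profile is deployed is covered by the OpenShift documentation, not by this sketch. Note that this rule blocks unshare(2) for every namespace type, and that user namespaces can also be created with clone(2) and the CLONE_NEWUSER flag, so a real mitigation would probably need to cover that path too.

```shell
# Print a seccomp rule that makes unshare(2) fail with EPERM (errno 1).
# Fragment only: it belongs inside the "syscalls" array of an existing profile.
unshare_deny_rule() {
  cat <<'EOF'
{
  "names": ["unshare"],
  "action": "SCMP_ACT_ERRNO",
  "errnoRet": 1
}
EOF
}
```

With a rule like this in effect, the `unshare -U -r` command from the demo would be expected to fail with "Operation not permitted".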