Tags: openshift, security, containers

User namespace support in OpenShift 4.7

In a previous post I investigated how to use the annotation-based user namespace support in CRI-O 1.20. At the end of that post, I was stuck. Now that OpenShift 4.7 has been released, where do things stand?

User namespaces are working

Using the same setup, and a similar pod specification, I am able to run the pod in a user namespace. The process executes as root inside the namespace, and as an unprivileged account outside the namespace.

I won’t repeat all the setup here, but one important difference is that I granted the anyuid SCC to the account that creates the pod (named test):

% oc adm policy add-scc-to-user anyuid test
securitycontextconstraints.security.openshift.io/anyuid added to: ["test"]

The pod definition is:

% cat userns-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-test
  annotations:
    io.kubernetes.cri-o.userns-mode: "auto:size=65536;map-to-root=true"
spec:
  containers:
  - name: userns-test
    image: freeipa/freeipa-server:fedora-31
    command: ["sleep", "3601"]
    securityContext:
      runAsUser: 0
      runAsGroup: 0
  securityContext:
    sysctls:
    - name: "net.ipv4.ping_group_range"
      value: "0 65535"

Note the io.kubernetes.cri-o.userns-mode annotation. That activates the user namespace feature. The runAsUser and runAsGroup fields in securityContext are also important.

I create the pod. After a few moments I observe that it is running, and query the node and container ID:

% oc --as test create -f userns-test.yaml
pod/userns-test created

% oc get -o json pod userns-test \
    | jq .status.phase
"Running"

% oc get -o json pod userns-test \
    | jq .spec.nodeName
"ft-47dev-1-4kplg-worker-0-qjfcj"

% oc get -o json pod userns-test \
    | jq ".status.containerStatuses[0].containerID"
"cri-o://92bf6c3b61337f18f4c963450b5db76cbcd4aa73e2659759ba2725f4d0f8aac7"

In a debug shell on the worker node, I use crictl to find out the pid of the pod’s (first) process:

% oc debug node/ft-47dev-1-4kplg-worker-0-qjfcj
Starting pod/ft-47dev-1-4kplg-worker-0-qjfcj-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.8.0.165
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# crictl inspect 92bf6c3b | jq .info.pid
937107

Earlier versions of crictl have the PID in the top-level object (jq selector .pid). The selector is now .info.pid.
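If you need the same command to work across both layouts, a minimal sketch is to let jq fall back from one selector to the other. The function name container_pid is hypothetical, chosen only for illustration, and it assumes jq is installed on the node.

```shell
# Hypothetical helper: print the container PID from `crictl inspect` JSON
# on stdin. Tries the newer .info.pid location first, then the older
# top-level .pid, using jq's alternative operator (//).
container_pid() {
    jq '.info.pid // .pid'
}
```

It would be used as `crictl inspect 92bf6c3b | container_pid`.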

Now we can query the UID map of the container process:

sh-4.4# cat /proc/937107/uid_map
         0     200000      65536

This shows that the process is running as uid 0 (root) in the namespace, and uid 200000 outside the namespace. The mapped range is contiguous and has size 65536, which agrees with the annotation:

io.kubernetes.cri-o.userns-mode: "auto:size=65536;map-to-root=true"

This is great!
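To make the reading of uid_map concrete, here is a minimal sketch of the lookup, assuming the three-column format shown above (first UID inside the namespace, first UID outside it, length of the range). The function name ns_to_host_uid is my own invention for illustration; it is not part of crictl or any other tool.

```shell
# Hypothetical helper: translate a UID inside the user namespace to the
# UID seen outside it. Reads uid_map lines (ns_start host_start length)
# on stdin; prints the outside UID, or exits non-zero if unmapped.
ns_to_host_uid() {
    awk -v uid="$1" '
        uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
        END { if (!found) exit 1 }
    '
}
```

For the container above, `ns_to_host_uid 0 < /proc/937107/uid_map` prints 200000, and a UID at or beyond 65536 inside the namespace has no mapping at all.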

They still require a privileged service account

In my earlier investigation I found that users require the anyuid SCC (or equivalent) to create user-namespaced pods running as specific UIDs (e.g. root) inside the pod. This is still the case. Rescinding anyuid from user test and (re)creating the pod results in an error:

% oc adm policy remove-scc-from-user anyuid test
securitycontextconstraints.security.openshift.io/anyuid removed from: ["test"]

% oc --as test create -f userns-test.yaml
Error from server (Forbidden): error when creating
"userns-test.yaml": pods "userns-test" is forbidden: unable to
validate against any security context constraint:
[spec.containers[0].securityContext.runAsUser: Invalid value: 0:
must be in the ranges: [1000630000, 1000639999]]
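As far as I understand it, the range in that message comes from the project's openshift.io/sa.scc.uid-range annotation, which is written as start/size. The value 1000630000/10000 used below is inferred from the error message rather than read from the cluster, and the function names are hypothetical. A minimal sketch of the arithmetic:

```shell
# Hypothetical helper: expand a "start/size" UID range into "first last".
uid_range_bounds() (
    start=${1%/*}
    size=${1#*/}
    echo "$start $((start + size - 1))"
)

# Hypothetical helper: succeed if UID $1 lies within the "start/size" range $2.
uid_in_range() (
    bounds=$(uid_range_bounds "$2")
    first=${bounds% *}
    last=${bounds#* }
    [ "$1" -ge "$first" ] && [ "$1" -le "$last" ]
)
```

Here `uid_range_bounds 1000630000/10000` gives the same bounds as the error message, and `uid_in_range 0 1000630000/10000` fails, which is why runAsUser: 0 is rejected without anyuid.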

At the end of my previous post, I wrote:

The security context constraint (SCC) is prohibiting the use of uid 0 for the container process. Switching to a permissive SCC might allow me to proceed, but it would also mean using a more privileged OpenShift user account. Then that privileged account could then create containers running as root in the system user namespace. We want user namespaces in OpenShift so that we can avoid this exact scenario. So resorting to a permissive SCC (e.g. anyuid) feels like the wrong way to go.

After giving this more thought, my opinion has shifted. This is still an important gap in overall security, and it should be addressed. But even though it currently requires a privileged account to create user-namespaced pods, the fact that you can do it at all is a huge win.

In other words, the user namespace support in its current form is still a giant leap forward. Previously, many kinds of applications could not run securely in OpenShift. The service account privileges caveat may be unacceptable to some, but I hope it will be addressed in time.

Inconsistent treatment of `securityContext`

The PodSpec I used above (with success) is:

containers:
- name: userns-test
  image: freeipa/freeipa-server:fedora-31
  command: ["sleep", "3601"]
  securityContext:
    runAsUser: 0
    runAsGroup: 0
securityContext:
  sysctls:
  - name: "net.ipv4.ping_group_range"
    value: "0 65535"

Note there are two securityContext fields. The first, in the Container spec, is a SecurityContext object. The second, in the PodSpec, is a PodSecurityContext object.

The runAsUser and runAsGroup fields can be specified in either of these objects (or both, with SecurityContext taking precedence). I can move these fields to the PodSecurityContext, as below.

containers:
- name: userns-test
  image: freeipa/freeipa-server:fedora-31
  command: ["sleep", "3601"]
securityContext:
  runAsUser: 0
  runAsGroup: 0
  sysctls:
  - name: "net.ipv4.ping_group_range"
    value: "0 65535"

According to the documentation, this object should have the same meaning as the previous one. But there is a critical behavioural difference! I create and examine the pod as before:

% oc --as test create -f userns-test.yaml
pod/userns-test created

% oc get -o json pod userns-test \
    | jq .status.phase
"Running"

% oc get -o json pod userns-test \
    | jq .spec.nodeName
"ft-47dev-1-4kplg-worker-0-qjfcj"

% oc get -o json pod userns-test \
    | jq ".status.containerStatuses[0].containerID"
"cri-o://c90760e88ee8493bfdb9af661c18afef139b79541160850ceac125b0c62e1de3"

And in the node debug shell, I query the uid_map for the container:

sh-4.4# crictl inspect c90760e | jq .info.pid
1022187
sh-4.4# cat /proc/1022187/uid_map
         1     200001      65535
         0          0          1

This subtle change to the object definition caused OpenShift to run the process as root in the container and on the host! Given that the Kubernetes documentation implies that the two configurations are equivalent, this is a dangerous situation. I will file a ticket to bring this to the attention of the developers.
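Because this failure is silent (the pod runs either way), it may be worth checking the map mechanically. The following is a minimal sketch, assuming the same three-column uid_map format; the name maps_host_root is hypothetical.

```shell
# Hypothetical helper: succeed (exit 0) if any uid_map line on stdin maps
# some namespace UID to UID 0 outside the namespace. UIDs are non-negative,
# so an outside range includes 0 only when it starts at 0 and is non-empty.
maps_host_root() {
    awk '$2 == 0 && $3 > 0 { hit = 1 } END { exit !hit }'
}
```

Run as `maps_host_root < /proc/1022187/uid_map && echo "host root is mapped"`, it flags the second pod but not the first one.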

Continuing investigation

There are two particular lines of investigation I need to pursue from here. The first is to confirm that setuid(2) and related functionality work properly in the namespaced container. This is important for containers that run multiple processes as different users. I do not anticipate any particular issues here. But I still need to verify it.

This is not the cloud native way. But this is the approach we are taking, for now. “Monolithic container” is a reasonable way to bring complex, traditional software systems into the cloud. As long as it can be done securely.

The other line of investigation is to find out how user-namespaced containers interact with volume mounts. If multiple containers, perhaps running on different nodes, read and write the same volume, what are the UIDs on that volume? Do we need stable, cluster-wide subuid/subgid mappings? If so, how can that be achieved? I expect I will have much more to say about this in upcoming posts.

Conclusion

CRI-O annotation-based user namespaces work in OpenShift 4.7. But there are some caveats, and at least one scary “gotcha”. Nevertheless, for simple workloads the feature does work well. It is a big leap forward for running more kinds of workloads without compromising the security of your cluster.

In time, I hope the account privilege (SCC) caveat and securityContext issues can be resolved. I will file tickets and continue to discuss these topics with the OpenShift developers. And my investigations into more complex workloads and multi-node considerations shall continue.