Running Pods in user namespaces without privileged SCCs
In previous posts I demonstrated how to run workloads in an
isolated user namespace on OpenShift. There are still some caveats
to doing this. One of these relates to Security Context
Constraints (SCCs), a security policy mechanism in OpenShift. In
particular, it appeared necessary to admit the Pod via the anyuid
SCC, or one with similar high privileges. This meant that although
the workload itself runs under unprivileged UIDs, the account that
creates the Pod would need privileges to create Pods that run under
arbitrary host UIDs. This is not a desirable situation.
I have investigated that matter further, and it turns out that you
can run a workload in a user namespace even via the default
restricted
SCC. But the configuration is not intuitive, and the
reasons why it must be configured that way are convoluted. In
this post I explain the challenges that arise when running a user
namespaced Pod under the restricted
SCC, and demonstrate the
solution.
This post assumes a basic knowledge of Security Context Constraints. If you are unfamiliar with SCCs, the DevConf.cz 2022 presentation Introduction to Security Context Constraints (slides, video) by Alberto Losada and Mario Vázquez will bring you up to speed.
Cluster configuration §
I am testing on an OpenShift 4.10 (pre-release) cluster. Some
changes to worker node configuration are required. The following
MachineConfig
object defines those changes:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: idm-4-10
spec:
  kernelArguments:
    - systemd.unified_cgroup_hierarchy=1
    - cgroup_no_v1="all"
    - psi=1
  config:
    ignition:
      version: 3.1.0
    systemd:
      units:
        - name: "override-runc.service"
          enabled: true
          contents: |
            [Unit]
            Description=Install runc override
            After=network-online.target rpm-ostreed.service
            [Service]
            ExecStart=/bin/sh -c 'rpm -q runc-1.0.3-992.rhaos4.10.el8.x86_64 || rpm-ostree override replace --reboot https://ftweedal.fedorapeople.org/runc-1.0.3-992.rhaos4.10.el8.x86_64.rpm'
            Restart=on-failure
            [Install]
            WantedBy=multi-user.target
    storage:
      files:
        - path: /etc/subuid
          overwrite: true
          contents:
            source: data:text/plain;charset=utf-8;base64,Y29yZToxMDAwMDA6NjU1MzYKY29udGFpbmVyczoyMDAwMDA6MjY4NDM1NDU2Cg==
        - path: /etc/subgid
          overwrite: true
          contents:
            source: data:text/plain;charset=utf-8;base64,Y29yZToxMDAwMDA6NjU1MzYKY29udGFpbmVyczoyMDAwMDA6MjY4NDM1NDU2Cg==
        - path: /etc/crio/crio.conf.d/99-crio-userns.conf
          overwrite: true
          contents:
            source: data:text/plain;charset=utf-8;base64,W2NyaW8ucnVudGltZS53b3JrbG9hZHMub3BlbnNoaWZ0LXVzZXJuc10KYWN0aXZhdGlvbl9hbm5vdGF0aW9uID0gImlvLm9wZW5zaGlmdC51c2VybnMiCmFsbG93ZWRfYW5ub3RhdGlvbnMgPSBbCiAgImlvLmt1YmVybmV0ZXMuY3JpLW8udXNlcm5zLW1vZGUiLAogICJpby5rdWJlcm5ldGVzLmNyaS1vLmNncm91cDItbW91bnQtaGllcmFyY2h5LXJ3IiwKICAiaW8ua3ViZXJuZXRlcy5jcmktby5EZXZpY2VzIgpdCg==
The main parts of this MachineConfig are:

- The kernelArguments enable cgroupsv2, which is not strictly required for this demo, but is required for running systemd-based workloads.
- The override-runc.service systemd unit installs a custom version of runc that implements the new OCI Runtime Specification cgroup ownership semantics. This should be the default behaviour in future versions of OpenShift, perhaps as soon as OpenShift 4.11.
- /etc/subuid and /etc/subgid provide a sub-id mapping range for CRI-O to use when creating Pods with user namespaces (contents decoded below).
- /etc/crio/crio.conf.d/99-crio-userns.conf defines the io.openshift.userns workload type for CRI-O (contents decoded below). It is not strictly necessary for this demo either, but it is required for systemd-based workloads to run successfully. The default CRI-O configuration in OpenShift 4.10 provides the io.openshift.builder workload type, which is sufficient if your workload does not need to manage cgroups.
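For reference, the base64 payloads in the storage section decode as follows. /etc/subuid and /etc/subgid both contain:

core:100000:65536
containers:200000:268435456

And /etc/crio/crio.conf.d/99-crio-userns.conf contains:

[crio.runtime.workloads.openshift-userns]
activation_annotation = "io.openshift.userns"
allowed_annotations = [
  "io.kubernetes.cri-o.userns-mode",
  "io.kubernetes.cri-o.cgroup2-mount-hierarchy-rw",
  "io.kubernetes.cri-o.Devices"
]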
Aside from the node configuration changes, I (as cluster admin) also created a project and a user account to use in the subsequent steps:
% oc new-project test
Now using project "test" on server "https://api.ci-ln-5rkyxfb-72292.origin-ci-int-gce.dev.rhcloud.com:6443".
…
% oc create user test
user.user.openshift.io/test created
% oc adm policy add-role-to-user edit test
clusterrole.rbac.authorization.k8s.io/edit added: "test"
I did not assign any special SCCs to the test
user account.
Remember to wait for the Machine Config Operator to finish updating
the worker nodes before proceeding with Pod creation. You can use
oc wait
to await this condition:
% oc wait mcp/worker \
--for condition=updated --timeout=-1s
Problem demonstration §
The objective is to run a Pod in a user namespace, with that Pod
being admitted via the default restricted
SCC. We will start with
the following Pod definition:
apiVersion: v1
kind: Pod
metadata:
  name: fedora
  annotations:
    io.openshift.userns: "true"
    io.kubernetes.cri-o.userns-mode: "auto:size=65536"
spec:
  containers:
    - name: fedora
      image: registry.fedoraproject.org/fedora:35-x86_64
      command: ["sleep", "3600"]
The io.openshift.userns
annotation selects the CRI-O workload
profile that we added via the MachineConfig
above. This profile
enables several other annotations, but does not automatically
execute the Pod in a user namespace. For that, you must also
supply the io.kubernetes.cri-o.userns-mode
annotation. Its
argument tells CRI-O to automatically select a unique host UID range
of size 65536 to map into the container’s user namespace.
I created the Pod as user test:
% oc --as test create -f pod-fedora.yaml
pod/fedora created
Observe that it was admitted via the restricted
SCC:
% oc get -o json pod/fedora \
| jq '.metadata.annotations."openshift.io/scc"'
"restricted"
Unfortunately, the container is not running:
% oc get -o json pod/fedora \
| jq '.status.containerStatuses[].state'
{
"waiting": {
"message": "container create failed: time=\"2022-02-02T05:43:34Z\" level=error msg=\"container_linux.go:380: starting container process caused: setup user: cannot set uid to unmapped user in user namespace\"\n",
"reason": "CreateContainerError"
}
}
The core error message is: cannot set uid to unmapped user in
user namespace. This arises because, in the absence of a
runAsUser
specification in the PodSpec, the restricted
SCC has
defaulted it to a value from the UID range assigned to the project:
% oc get -o json pod/fedora \
| jq '.spec.containers[].securityContext.runAsUser'
1000650000
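This defaulting comes from the restricted SCC's runAsUser strategy; on a stock cluster that strategy should be MustRunAsRange, which, absent an explicit range in the SCC itself, falls back to the project's annotated range:

% oc get -o json scc/restricted | jq '.runAsUser'
{
  "type": "MustRunAsRange"
}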
The project UID range allocation is recorded in the project and namespace annotations:
% oc get -o json project/test namespace/test \
| jq '.items[].metadata.annotations."openshift.io/sa.scc.uid-range"'
"1000650000/10000"
"1000650000/10000"
OpenShift allocated to project test a range of 10000 UIDs starting at 1000650000. The error arises because UID 1000650000 is not mapped in the user namespace. The host UID range may be something like 200000–265535, whereas the sandbox's UID range is 0–65535.
I deleted the Pod and will try something different:
% oc delete pod/fedora
pod "fedora" deleted
Let’s say that we want to run the container process as UID 0
in
the Pod’s user namespace, as would be required for a systemd-based
workload. Instead of leaving it to the SCC machinery, I’ll set
runAsUser: 0
in the PodSpec myself:
apiVersion: v1
kind: Pod
metadata:
  name: fedora
  annotations:
    io.openshift.userns: "true"
    io.kubernetes.cri-o.userns-mode: "auto:size=65536"
spec:
  containers:
    - name: fedora
      image: registry.fedoraproject.org/fedora:35-x86_64
      command: ["sleep", "3600"]
      securityContext:
        runAsUser: 0
This time the test
user cannot even create the Pod:
% oc --as test create -f pod-fedora.yaml
Error from server (Forbidden): error when creating "pod-fedora.yaml"…
I’ve trimmed the rather long error message, but the core problem is:
spec.containers[0].securityContext.runAsUser: Invalid value:
0: must be in the ranges: [1000650000, 1000659999]
The restricted
SCC only allows runAsUser
values that fall in the
project's assigned UID range. And this is what we would expect. The
problem is that the admission machinery has no awareness of user
namespaces. It cannot discern that runAsUser: 0
means that we
want to run as UID 0
inside the user namespace, whilst mapped to
an unprivileged UID on the host.
The problem is twofold. First, we cannot control the UID
mapping that CRI-O gives us, so we cannot make it coincide with the
project's UID range. Second, the SCC admission checks and
defaulting are oblivious to user namespaces. runAsUser
is
interpreted as referring to host UIDs, and the restricted
SCC
restricts (or defaults) us to values that are not mapped in the
Pod’s user namespace.
Solution §
The map-to-root
option in the userns-mode
annotation provides a
solution to this dilemma. It takes whatever value runAsUser is set
to, and ensures that that host UID gets mapped to UID 0 in the
Pod's user namespace. The updated PodSpec is:
apiVersion: v1
kind: Pod
metadata:
  name: fedora
  annotations:
    io.openshift.userns: "true"
    io.kubernetes.cri-o.userns-mode: "auto:size=65536;map-to-root=true"
spec:
  securityContext:
    runAsUser: 1000650000
  containers:
    - name: fedora
      image: registry.fedoraproject.org/fedora:35-x86_64
      command: ["sleep", "3600"]
Now the Pod is able to run:
% oc --as test create -f pod-fedora.yaml
pod/fedora created
% oc get -o json pod/fedora \
| jq '.spec.nodeName, .status.containerStatuses[].state'
"ci-ln-fizz88k-72292-9phfc-worker-c-7s99v"
{
"running": {
"startedAt": "2022-02-02T06:20:49Z"
}
}
We can observe the UID mapping:
% oc rsh pod/fedora cat /proc/self/uid_map
1 265536 65535
0 1000650000 1
This shows that UID 0 in the Pod's user namespace maps to UID
1000650000 in the parent (host) user namespace. The remaining
UIDs 1–65535 in the Pod's user namespace are mapped contiguously
from UID 265536 in the host user namespace.
Objective achieved.
Why runAsUser
must be specified §
Referring back to the PodSpec, why is it necessary to explicitly
specify runAsUser
? Doesn’t the SCC admission machinery
automatically set the default value? Well… yes, and no. The SCC
machinery defaults runAsUser
in each container’s
securityContext
field. But it does not set it in the Pod’s
securityContext
. And it is the Pod securityContext
that CRI-O
examines when processing the map-to-root
option. If it is unset,
CRI-O
will not set the mapping up properly and container(s) will
fail to run.
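To make the distinction concrete, here is roughly the relevant shape of the admitted object when runAsUser is omitted everywhere. This is a sketch based on the behaviour described above, not literal API output, and other defaulted fields are omitted:

spec:
  securityContext: {}           # no runAsUser defaulted here; this is what map-to-root consults
  containers:
    - name: fedora
      securityContext:
        runAsUser: 1000650000   # defaulted here by the restricted SCC; map-to-root ignores it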
The consequence of this is that the user or operator creating the
Pod must first examine the Project or Namespace object to learn what
its assigned UID range is. Then they must set the
spec.securityContext.runAsUser
field to the start value of that
range. The range assignment will certainly differ from project to
project so it cannot be hardcoded. This is a bit annoying: more
work for the human operator, or more automation behaviour to
implement and maintain.
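If you do want to automate it, one possible approach (a sketch; the sed pattern assumes the manifest already contains a runAsUser line, as pod-fedora.yaml above does) is to read the annotation and substitute the range start before creating the Pod:

% UID_START=$(oc get -o json namespace/test \
    | jq -r '.metadata.annotations."openshift.io/sa.scc.uid-range"' \
    | cut -d/ -f1)
% sed "s/runAsUser: .*/runAsUser: ${UID_START}/" pod-fedora.yaml \
    | oc --as test create -f -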
The simplest solution I can think of is to enhance the SCC
processing to also set spec.securityContext.runAsUser
if it is
unset. Then CRI-O would see the value it needs to see.
Alternatively CRI-O could be enhanced to check the container
securityContext
if the runAsUser
is not specified in the Pod
securityContext
. But to me this seems ill-principled because
different containers (in the same Pod) could specify different
values, and there is no obvious “right” way to resolve the
ambiguities.
Using multiple UIDs §
Although I have a nice range of 65536 UIDs mapped in the Pod's user
namespace, I am not able to run processes as any UID other than 0.
This is because the restricted SCC forcibly omits CAP_SETUID
(among others) from the capability bounding set of the container
process. Complex workloads, including any based on systemd, will
fail to run properly under such a constraint.
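You can see the culprit in the restricted SCC's requiredDropCapabilities; on a stock 4.10 cluster I would expect it to be:

% oc get -o json scc/restricted | jq '.requiredDropCapabilities'
[
  "KILL",
  "MKNOD",
  "SETUID",
  "SETGID"
]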
The simplest workaround is to admit the Pod via the anyuid
SCC.
But that undoes the good outcome achieved in this post!
An intermediate workaround is to create a new SCC that does not
forcibly deprive containers of CAP_SETUID
. This entails
administrative overhead.
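Such an SCC might look something like the following sketch, modelled on restricted but with SETUID and SETGID left out of the drop list. The name restricted-userns and the exact field values are illustrative, not a tested policy:

apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: restricted-userns
allowHostDirVolumePlugin: false
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegedContainer: false
allowedCapabilities: null
defaultAddCapabilities: null
requiredDropCapabilities:     # like restricted, but SETUID/SETGID not dropped
- KILL
- MKNOD
runAsUser:
  type: MustRunAsRange
seLinuxContext:
  type: MustRunAs
fsGroup:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
readOnlyRootFilesystem: false
volumes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- projected
- secret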
Such an SCC also increases the attack surface. The setuid(2)
system call
is restricted to UIDs mapped in the user namespace of the calling
process. If the calling process is in an isolated user namespace
that maps to unprivileged host UIDs, it is safe (up to kernel bugs)
to grant CAP_SETUID
to that process. But recall that user
namespaces are still opt-in; by default Pods use the host user
namespace. An SCC can use MustRunAsRange
to restrict the
initial container process to running as a user in the project’s
assigned UID range. But if that SCC also lets containers use
CAP_SETUID
, then it doesn’t really provide more protection than
anyuid.
A more robust solution would be to modify CRI-O to reinstate
CAP_SETUID
and related capabilities when the Pod runs in a user
namespace. I will raise the topic with the CRI-O maintainers, as
solving this problem is important for our use case, and probably
other “legacy” workloads too.
Conclusion §
In this post I demonstrated how to run workloads in a user namespace
on OpenShift, under the default restricted
SCC. The map-to-root
option is critical to accomplishing this. There is an unfortunate
“rough edge” in that the workload must specifically refer to the UID
range assigned to the namespace in which the Pod will live, which
means additional work for, or complexity in, the operator (human or
otherwise).
Despite this progress, if you need to run processes under different
UIDs in the container(s), the restricted
UID won’t work because it
deprives the container process of the CAP_SETUID
capability. You
must go back to admitting the workload via anyuid
or a similar
SCC, which is a significant erosion of the security boundaries
between containers and the host. This issue will be the subject of
future investigations.