User namespaces in OpenShift via CRI-O annotations
In a recent post I covered the lack of user namespace support in OpenShift, and discussed the upcoming CRI-O feature for user namespacing of containers, controlled by annotations.
I now have an OpenShift nightly cluster deployed. It uses a prerelease version of CRI-O v1.20, which includes this new feature. So it’s time to experiment! This post records my investigation of this feature.
Preliminaries §
I’ll skip the details of deploying the nightly (4.7) cluster
(because they are not important). What is important is that I
created a MachineConfig
to enable the CRI-O user namespace
annotation feature, as described in my previous post.
As in the initial investigation, I created a new user account and project namespace for the experiments:
% oc new-project test
Now using project "test" on server "https://api.permanent.idmocp.lab.eng.rdu2.redhat.com:6443".
% oc create user test
user.user.openshift.io/test created
% oc adm policy add-role-to-user admin test
clusterrole.rbac.authorization.k8s.io/admin added: "test"
Creating a user namespaced pod - Attempt 1 §
I defined a pod that just runs sleep, but uses the new annotation
to run it in a user namespace. The map-to-root=true directive says
that the “beginning” of the host uid range assigned to the container
should map to uid 0 (i.e. root) in the container.
$ cat userns-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-test
  annotations:
    io.kubernetes.cri-o.userns-mode: "auto:map-to-root=true"
spec:
  containers:
    - name: userns-test
      image: freeipa/freeipa-server:fedora-31
      command: ["sleep", "3601"]
Create the pod:
$ oc --as test create -f userns-test.yaml
pod/userns-test created
After a few seconds, does everything look OK?
$ oc get pod userns-test
NAME READY STATUS RESTARTS AGE
userns-test 0/1 ContainerCreating 0 14s
Hm, 14 seconds seems a long time to be stuck at ContainerCreating.
What does oc describe reveal?
$ oc describe pod/userns-test
Name: userns-test
Namespace: test
Priority: 0
Node: ft-47dev-2-27h8r-worker-0-j4jjn/10.8.1.106
Start Time: Mon, 30 Nov 2020 12:41:34 +0000
Labels: <none>
Annotations: io.kubernetes.cri-o.userns-mode: auto:map-to-root=true
openshift.io/scc: restricted
Status: Pending
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> Successfully assigned test/userns-test to ft-47dev-2-27h8r-worker-0-j4jjn
Warning FailedCreatePodSandBox <invalid> (x96 over 20m) kubelet, ft-47dev-2-27h8r-worker-0-j4jjn Failed to create pod sandbox: rpc error: code = Unknown desc = error creating pod sandbox with name "k8s_userns-test_test_e4f69d50-e061-46ca-b933-000bcea3363a_0": could not find enough available IDs
The node failed to create the pod sandbox. To spare you scrolling to read the unwrapped error message, I’ll reproduce it:
Failed to create pod sandbox: rpc error: code = Unknown
desc = error creating pod sandbox with name
"k8s_userns-test_test_e4f69d50-e061-46ca-b933-000bcea3363a_0":
could not find enough available IDs
My initial reaction to this error is: this is good! It seems that CRI-O is attempting to create a user namespace for the container, but cannot. Another problem to solve, but we seem to be on the right track.
/etc/subuid §
I had not yet done any host configuration related to user namespace
mappings. But I had a feeling that the /etc/subuid and /etc/subgid
files would come into play. According to subuid(5):
Each line in /etc/subuid contains a user name and a range of subordinate user ids that user is allowed to use.
The description in subgid(5)
is similar.
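For example, a (hypothetical) entry like the following would allow
the user alice to use the 65536 subordinate uids 100000 through
165535:
alice:100000:65536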
If the user attempting to create the containers doesn't have a
sufficient range of unused host uids and gids to use, it follows
that it will not be able to create the user namespace for the pod.
I used a debug shell to observe the current contents of
/etc/subuid
and /etc/subgid
on worker nodes:
sh-4.4# cat /etc/subuid
core:100000:65536
sh-4.4# cat /etc/subgid
core:100000:65536
The user core owns a uid and gid range of size 65536, starting at
uid/gid 100000 (that is, host IDs 100000 through 165535). There are
no other ranges defined.
At this point, I have a strong feeling we need to define uid and gid
ranges for the appropriate user, and then things will hopefully
start working. The next question is: who is the appropriate user?
That is, in OpenShift which user is responsible for creating the
containers and, in this case, the user namespaces? Again on the
worker node debug shell, I queried which user is running crio:
sh-4.4# ps -o user,pid,cmd -p $(pgrep crio)
USER PID CMD
root 1791 /usr/bin/crio --enable-metrics=true --metrics-port=9537
crio
is running as the root
user, which is not surprising.
So we will need to add mappings for the root
user to the mapping
files.
MachineConfig for modifying /etc/sub[ug]id §
I will create a MachineConfig to append the mappings to /etc/subuid
and /etc/subgid. First we need the base64 encoding of the line we
want to add:
$ echo "root:200000:268435456" | base64
cm9vdDoyMDAwMDA6MjY4NDM1NDU2Cg==
The MachineConfig
definition (note that it is scoped to the
worker
role):
$ cat machineconfig-subuid-subgid.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: subuid-subgid
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
        - path: /etc/subuid
          append:
            - source: data:text/plain;charset=utf-8;base64,cm9vdDoyMDAwMDA6MjY4NDM1NDU2Cg==
        - path: /etc/subgid
          append:
            - source: data:text/plain;charset=utf-8;base64,cm9vdDoyMDAwMDA6MjY4NDM1NDU2Cg==
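As a sanity check, the payload in the data URL can be decoded to
confirm it is exactly the line we want to append:
$ echo cm9vdDoyMDAwMDA6MjY4NDM1NDU2Cg== | base64 -d
root:200000:268435456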
Creating the MachineConfig
object:
$ oc create -f machineconfig-subuid-subgid.yaml
machineconfig.machineconfiguration.openshift.io/subuid-subgid created
After a few moments, checking the machineconfigpool/worker object
revealed that the cluster was in a degraded state:
$ oc get -o json mcp/worker |jq '.status.conditions[-2:]'
[
  {
    "lastTransitionTime": "2020-12-01T02:55:52Z",
    "message": "Node ft-47dev-2-27h8r-worker-0-f8bnl is reporting: \"can't reconcile config rendered-worker-a37679c5cfcefb5b0af61bb3674dccc4 with rendered-worker-3cbd4cabeedd441500c83363dbf505fd: ignition file /etc/subuid includes append: unreconcilable\"",
    "reason": "1 nodes are reporting degraded status on sync",
    "status": "True",
    "type": "NodeDegraded"
  },
  {
    "lastTransitionTime": "2020-12-01T02:55:52Z",
    "message": "",
    "reason": "",
    "status": "True",
    "type": "Degraded"
  }
]
The error message is:
Node ft-47dev-2-27h8r-worker-0-f8bnl is reporting: "can't
reconcile config rendered-worker-a37679c5cfcefb5b0af61bb3674dccc4
with rendered-worker-3cbd4cabeedd441500c83363dbf505fd: ignition
file /etc/subuid includes append: unreconcilable"
Upon further investigation, I learned that the Machine Config
Operator does not support append
operations. This is because appends are, in general, neither
idempotent nor commutative. So I will
try again with a new machine config that completely replaces the
/etc/subuid
and /etc/subgid
files.
The new content shall be:
core:100000:65536
root:200000:268435456
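One way to produce the base64 payload for this two-line replacement
content (note the trailing newline) is:
$ printf 'core:100000:65536\nroot:200000:268435456\n' | base64
Y29yZToxMDAwMDA6NjU1MzYKcm9vdDoyMDAwMDA6MjY4NDM1NDU2Cg==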
The updated MachineConfig
definition is:
$ cat machineconfig-subuid-subgid.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: subuid-subgid
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
        - path: /etc/subuid
          overwrite: true
          contents:
            source: data:text/plain;charset=utf-8;base64,Y29yZToxMDAwMDA6NjU1MzYKcm9vdDoyMDAwMDA6MjY4NDM1NDU2Cg==
        - path: /etc/subgid
          overwrite: true
          contents:
            source: data:text/plain;charset=utf-8;base64,Y29yZToxMDAwMDA6NjU1MzYKcm9vdDoyMDAwMDA6MjY4NDM1NDU2Cg==
I replaced the MachineConfig
object:
$ oc replace -f machineconfig-subuid-subgid.yaml
machineconfig.machineconfiguration.openshift.io/subuid-subgid replaced
After a few moments the cluster was no longer degraded, and the
worker nodes began updating (a process that takes several minutes):
$ oc get mcp/worker
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
worker rendered-worker-a37679c5cfcefb5b0af61bb3674dccc4 False True False 4 0 0 0 3d20h
After READYMACHINECOUNT
reached 4
(all machines in the
worker
pool), I used a debug shell on one of the worker nodes to
confirm that the changes had been applied:
$ oc debug node/ft-47dev-2-27h8r-worker-0-j4jjn
Starting pod/ft-47dev-2-27h8r-worker-0-j4jjn-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.8.1.106
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# cat /etc/subuid
core:100000:65536
root:200000:268435456
sh-4.4# cat /etc/subgid
core:100000:65536
root:200000:268435456
Looks good!
Creating a user namespaced pod - Attempt 2 §
It’s time to create the user namespaced pod again, and see if it succeeds this time.
$ oc --as test create -f userns-test.yaml
pod/userns-test created
Unfortunately, the same FailedCreatePodSandBox
error occurred.
My subuid remedy was either incorrect or insufficient. I decided to
use a debug shell on the worker node to examine the system journal.
I searched for the error string could not find enough available IDs
and found it in the output of the hyperkube unit. A few lines above
that were some crio log messages, including:
Cannot find mappings for user \"containers\": No subuid
ranges found for user \"containers\" in /etc/subuid"
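For reference, a journal search along the following lines (from the
chroot'ed debug shell) should surface both messages; the amount of
context to show may need adjusting:
sh-4.4# journalctl | grep -B 10 'could not find enough available IDs'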
So, my mistake was defining ID map ranges for the root
user. I
should have used the containers
user. I fixed the
MachineConfig
definition to use the file content:
core:100000:65536
containers:200000:268435456
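Encoding this content the same way as before gives the new
contents.source payload for both files:
$ printf 'core:100000:65536\ncontainers:200000:268435456\n' | base64
Y29yZToxMDAwMDA6NjU1MzYKY29udGFpbmVyczoyMDAwMDA6MjY4NDM1NDU2Cg==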
Then I replaced the subuid-subgid
object and again waited for the Machine Config Operator to update
the worker nodes.
Creating a user namespaced pod - Attempt 3 §
Once again, the container remained at ContainerCreating. But the
error was different (lines wrapped for readability):
Failed to create pod sandbox: rpc error:
code = Unknown
desc = container create failed:
time="2020-12-01T06:40:49Z"
level=warning
msg="unable to terminate initProcess"
error="exit status 1"
time="2020-12-01T06:40:49Z"
level=error
msg="container_linux.go:366: starting container process caused:
process_linux.go:472: container init caused:
write sysctl key net.ipv4.ping_group_range:
write /proc/sys/net/ipv4/ping_group_range: invalid argument"
After a bit of research, here is my understanding of the situation:
CRI-O successfully created the pod sandbox (which includes the user
namespace) and is now initialising it. One of the initialisation
steps is to set the net.ipv4.ping_group_range
sysctl (the
subroutine is part of runc
), and this is failing. This step is
performed for all pods, but it is only failing when the pod is using
a user namespace.
net.ipv4.ping_group_range and user namespaces §
The net.ipv4.ping_group_range
sysctl defines the range of group
IDs that are allowed to send ICMP Echo packets. Setting it to the
full gid range allows ping
to be used in rootless containers,
without setuid or the CAP_NET_ADMIN
and CAP_NET_RAW
capabilities.
The CRI-O config key crio.runtime.default_sysctls
declares the
default sysctls that will be set in all containers. The default
OpenShift CRI-O configuration sets it to the full gid range:
sh-4.4# cat /etc/crio/crio.conf.d/00-default \
| grep -A2 default_sysctls
default_sysctls = [
"net.ipv4.ping_group_range=0 2147483647",
]
My working hypothesis is that setting the sysctl in the
user-namespaced container fails because the gid range in the sandbox
is not 0–2147483647
but much smaller. This could explain the
invalid argument
part of the error message.
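A rough way to test this hypothesis outside OpenShift, on any Linux
host with unprivileged user namespaces enabled (this unshare
invocation is my own sketch, not anything CRI-O itself does):
$ unshare --user --net --map-root-user \
    sysctl -w net.ipv4.ping_group_range="0 2147483647"
# expected to fail with "invalid argument": gid 2147483647 is not
# mapped in the new user namespace
$ unshare --user --net --map-root-user \
    sysctl -w net.ipv4.ping_group_range="0 0"
# expected to succeed: gid 0 is the only gid mapped by --map-root-user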
How to overcome this? I first thought to update the pod spec to specify a different value for the sysctl that reflects the actual gid range in the sandbox. And to do that, I have to calculate what that gid range is.
Computing the gid range §
I will work on the assumption that I must refer to the range as it appears in the namespace. That assumption could be wrong, but that’s where I’m starting.
Because I am using map-to-root=true
, the start value of the
range should be 0
. The second number in the
ping_group_range
sysctl value is not the range size but the end
gid (inclusive). CRI-O currently hard-codes a default user
namespace size of 65536
.
Because the size of the uid range is a critical parameter, I shall
from now on explicitly declare the desired size in the
userns-mode
annotation. This will protect the solution from
changes to the default range size. I probably won't need 65536
uids/gids but I’ll stick with the default for now.
io.kubernetes.cri-o.userns-mode: "auto:size=65536;map-to-root=true"
With a range of 65536 starting at 0 (that is, gids 0 through 65535),
the desired sysctl setting is net.ipv4.ping_group_range=0 65535.
Configuring the sysctl §
We need ping
to continue working in containers that are not
namespaced. Therefore, overriding or clearing the CRI-O
default_sysctls
config is not an option. Instead I need a way
to optionally set the net.ipv4.ping_group_range
sysctl to a
specified value on a per-pod basis.
You can specify sysctls to be set in a pod via the
spec.securityContext.sysctls
array (see Kubernetes
PodSecurityContext documentation). I updated the pod definition
to include the sysctl:
$ cat userns-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-test
  annotations:
    io.kubernetes.cri-o.userns-mode: "auto:size=65536;map-to-root=true"
spec:
  containers:
    - name: userns-test
      image: freeipa/freeipa-server:fedora-31
      command: ["sleep", "3601"]
  securityContext:
    sysctls:
      - name: "net.ipv4.ping_group_range"
        value: "0 65535"
As I write this, I don’t know yet how CRI-O behaves when both
default_sysctls
and the pod spec define the same sysctl. It
might just set the value from the pod spec, which is the behaviour I
need. Or it might first attempt to set the value from
default_sysctls (which would fail as before) and only afterwards set
the value from the pod spec.
Time to find out!
Creating a user namespaced pod - Attempt 4 §
$ oc --as test create -f userns-test.yaml
pod/userns-test created
# ... wait ...
$ oc get pod userns-test
NAME READY STATUS RESTARTS AGE
userns-test 0/1 CreateContainerError 0 118s
OK, progress was made! It did not get stuck at ContainerCreating;
this time we got a CreateContainerError. This means that the CRI-O
sysctl behaviour is what we were hoping for. As for the new error,
oc describe gave the detail:
Error: container create failed:
time="2020-12-01T12:38:45Z"
level=error
msg="container_linux.go:366: starting container process caused:
setup user: cannot set uid to unmapped user in user namespace"
My guess is that CRI-O is ignoring the fact that the pod is in a
user namespace and is attempting to execute the process using the
same uid as it would if the pod were not in a user namespace. The
uid is outside the mapped range (0–65535). For my next attempt I
will add runAsUser and runAsGroup to the securityContext.
But first, some other quick observations. A user namespace was indeed created for this pod!
sh-4.4# lsns -t user
NS TYPE NPROCS PID USER COMMAND
4026531837 user 277 1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 16
4026532599 user 1 684279 200000 /usr/bin/pod
We can examine the uid and gid maps for the namespace:
sh-4.4# cat /proc/684279/uid_map
0 200000 65536
sh-4.4# cat /proc/684279/gid_map
1 200001 65535
0 1000610000 1
The format of these map files is inside-ID, outside-ID, length (see
user_namespaces(7)): container uids 0–65535 map to host uids
200000–265535, and container gids 1–65535 map to host gids
200001–265535. It surprised me that container gid 0 is mapped to
host gid 1000610000. I don't know what consequences this might have;
for now I am just noting it.
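The value 1000610000 looks like it comes from the uid/gid block that
OpenShift pre-allocates to the project (recorded in the
openshift.io/sa.scc.uid-range and openshift.io/sa.scc.supplemental-groups
annotations on the namespace). One way to inspect it:
$ oc get namespace test -o yaml | grep 'openshift.io/sa.scc'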
Because the pod sandbox does exist, I also decided to see if I could get a debug shell:
$ oc debug pod/userns-test
Starting pod/userns-test-debug, command was: sleep 3601
Pod IP: 10.129.3.170
If you don't see a command prompt, try pressing enter.
sh-5.0$ id
uid=1000610000(1000610000) gid=0(root) groups=0(root),1000610000
It worked! But the debug shell cannot be running in the user
namespace; the uid (1000610000) is too high. Running lsns
in my worker node debug shell confirms it; the namespace still has
only one process running in it:
sh-4.4# lsns -t user
NS TYPE NPROCS PID USER COMMAND
4026531837 user 282 1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 16
4026532599 user 1 684279 200000 /usr/bin/pod
Creating a user namespaced pod - Attempt 5 §
I once again deleted the userns-test
pod. As proposed above, I
modified the pod security context to specify that the entry point
should be run as uid 0 and gid 0:
$ cat userns-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: userns-test
  annotations:
    io.kubernetes.cri-o.userns-mode: "auto:size=65536;map-to-root=true"
spec:
  containers:
    - name: userns-test
      image: freeipa/freeipa-server:fedora-31
      command: ["sleep", "3601"]
  securityContext:
    runAsUser: 0
    runAsGroup: 0
    sysctls:
      - name: "net.ipv4.ping_group_range"
        value: "0 65535"
Here we go:
$ oc --as test create -f userns-test.yaml
Error from server (Forbidden): error when creating
"userns-test.yaml": pods "userns-test" is forbidden: unable to
validate against any security context constraint:
[spec.containers[0].securityContext.runAsUser: Invalid value: 0:
must be in the ranges: [1000610000, 1000619999]]
sad trombone
I don’t have a clear idea how I could proceed. The security context
constraint (SCC) is prohibiting the use of uid 0
for the
container process. Switching to a permissive SCC might allow me to
proceed, but it would also mean using a more privileged OpenShift
user account. That privileged account could then create
containers running as root
in the system user namespace. We
want user namespaces in OpenShift so that we can avoid this exact
scenario. So resorting to a permissive SCC (e.g. anyuid
) feels
like the wrong way to go.
It could be that it’s the only way to go for now, and that more nuanced security policy mechanisms must be implemented before user namespaces can be used in OpenShift to achieve the security objective. In any case, I’ll be reaching out to other engineers and OpenShift experts for their suggestions.
For now, I’m calling it a day! See you soon for the next episode.