OpenShift and user namespaces
FreeIPA in its current form is very much not a “cloud native” application. Neither is the current FreeIPA container, which runs all the required services under systemd. My current team is working on operationalising FreeIPA for the OpenShift container platform. Our initial efforts are focused on getting this “monolithic” container to run in OpenShift, securely. Although we recognise we may eventually need to split up the container, doing so will be a major engineering effort. We want to have a working proof of concept as early as possible, so that we (and others) can start the important integration work (e.g. with Keycloak / RHSSO).
This “lift and shift” of a complex traditional application to
OpenShift results in a container that needs to run several processes
as a variety of users, including root. OpenShift isolates
containers (actually pods, which consist of one or more containers)
in their own PID namespace. This is good, but if we are to run
container processes as root (in the container), we do not want
them to also be root on the host. Rather, they should map to an
unprivileged account. If we want secure multitenancy of multiple
IDM servers on a single worker node, we want the user accounts in
different IDM pods to map to disjoint sets of unprivileged users on
the host.
Linux user_namespaces(7) provide this kind of isolation. To
what extent are user namespaces supported in OpenShift? We needed
to find out, in order to decide how to proceed with the FreeIPA
OpenShift effort. In this blog post I discuss my investigation and
findings.
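To make the goal concrete, here is a minimal sketch of the mapping we are after: each pod gets its own block of host UIDs, so that root in one pod and root in another land on different, unprivileged host accounts. The names (allocate_ranges, to_host_uid), the base UID of 100000 and the block size of 65536 are my own illustrative assumptions, not anything OpenShift or the kernel prescribes.

```python
from typing import Dict, List, NamedTuple, Optional


class UidRange(NamedTuple):
    """One contiguous mapping from container UIDs to host UIDs."""
    container_start: int
    host_start: int
    length: int


def allocate_ranges(pods: List[str], base: int = 100000,
                    size: int = 65536) -> Dict[str, UidRange]:
    """Assign each pod its own non-overlapping block of host UIDs.

    `base` and `size` are illustrative values, not OpenShift defaults.
    """
    return {
        pod: UidRange(container_start=0, host_start=base + i * size, length=size)
        for i, pod in enumerate(pods)
    }


def to_host_uid(r: UidRange, container_uid: int) -> Optional[int]:
    """Translate a container UID to its host UID, or None if unmapped."""
    offset = container_uid - r.container_start
    if 0 <= offset < r.length:
        return r.host_start + offset
    return None


ranges = allocate_ranges(["idm-a", "idm-b"])  # hypothetical pod names
print(to_host_uid(ranges["idm-a"], 0))  # root in idm-a -> 100000
print(to_host_uid(ranges["idm-b"], 0))  # root in idm-b -> 165536
```

With disjoint blocks like these, no UID in one pod can coincide with a UID in another pod, or with root, on the host.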
Investigating current OpenShift behaviour
To investigate the use (or not) of user namespaces I deployed pods on our team’s OpenShift cluster, running a simple command, and observed the effects on the worker node.
As cluster admin, I created a new project:
% oc new-project test
Now using project "test" on server "https://api.permanent.idmocp.lab.eng.rdu2.redhat.com:6443".
...
To avoid the cluster admin user’s SCC bindings applying to pod
creation, I created a user named test and granted it the project
(not cluster) admin role. Subsequent pod creation operations were
performed as user test.
% oc create user test
user.user.openshift.io/test created
% oc adm policy add-role-to-user admin test
clusterrole.rbac.authorization.k8s.io/admin added: "test"
Next I deployed a basic pod (as user test) and inspected it to
find out which worker node it was scheduled on, and the CRI-O
container ID:
% cat pod-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - name: idm-test
    image: freeipa/freeipa-server:fedora-31
    command: ["sleep", "3600"]
% oc --as test create -f pod-test.yaml
pod/test created
% oc get -o json pod test \
| jq .spec.nodeName
"permanent-bdd7p-worker-9r4b6"
% oc get -o json pod test \
| jq ".status.containerStatuses[0].containerID"
"cri-o://a9c0cf0ac9c0c352b82a74cccf830dfa8c33aae28138808eb7bdd9d53aae2d1f"
Next, I opened a debug shell on the worker node and inspected the container to find its PID:
% oc debug node/permanent-bdd7p-worker-9r4b6
Starting pod/permanent-bdd7p-worker-9r4b6-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.8.3.215
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# crictl inspect a9c0cf0ac | jq .pid
1311115
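If you repeat this lookup often, the two jq queries used earlier can be folded into a small script. The following is only a sketch: locate_container is a name I made up, and it assumes the JSON shape that `oc get -o json pod` printed above (spec.nodeName and status.containerStatuses[0].containerID).

```python
import json
from typing import Tuple


def locate_container(pod_json: str) -> Tuple[str, str, str]:
    """Return (node name, runtime, container ID) for a pod's first container."""
    pod = json.loads(pod_json)
    node = pod["spec"]["nodeName"]
    # containerID has the form "<runtime>://<id>", e.g. "cri-o://a9c0cf...".
    full_id = pod["status"]["containerStatuses"][0]["containerID"]
    runtime, _, container_id = full_id.partition("://")
    return node, runtime, container_id


# Trimmed-down sample using the values from the transcript above.
sample = json.dumps({
    "spec": {"nodeName": "permanent-bdd7p-worker-9r4b6"},
    "status": {"containerStatuses": [{
        "containerID": "cri-o://a9c0cf0ac9c0c352b82a74cccf830dfa"
                       "8c33aae28138808eb7bdd9d53aae2d1f",
    }]},
})
node, runtime, container_id = locate_container(sample)
print(node, runtime, container_id[:9])
```

The container ID (or a unique prefix of it) is what gets passed to crictl inspect on the worker node.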
Next I looked at which user the process is running under, and the UID map of the process:
sh-4.4# ls -l -d /proc/1311115
dr-xr-xr-x. 9 1000620000 root 0 Nov 5 05:34 /proc/1311115
sh-4.4# cat /proc/1311115/uid_map
0 0 4294967295
The process was running as user 1000620000, and the UID map has an
offset of 0 and a size of 2^32 - 1 (4294967295). Which is to say, this
process is running in the same user namespace as the host. We can
use the lsns command to confirm that everything on this node
(including all container processes) shares the single user
namespace:
sh-4.4# lsns -t user
NS TYPE NPROCS PID USER COMMAND
4026531837 user 296 1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 18
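For readers unfamiliar with the uid_map format: each line holds three numbers, namely the first ID inside the namespace, the first ID outside it, and the length of the range. The sketch below parses that format and checks for the identity map we observed. The function names are my own, and the "0 100000 65536" map is a hypothetical example of what a pod in its own user namespace might show, not output from our cluster.

```python
from typing import List, Optional, Tuple

# Each entry is (first ID inside namespace, first ID outside, length).
IdMap = List[Tuple[int, ...]]


def parse_id_map(text: str) -> IdMap:
    """Parse the contents of /proc/<pid>/uid_map (or gid_map)."""
    return [
        tuple(int(field) for field in line.split())
        for line in text.splitlines()
        if line.strip()
    ]


def is_identity_map(id_map: IdMap) -> bool:
    """True for the single 0 -> 0 mapping that the initial namespace shows."""
    return id_map == [(0, 0, 4294967295)]


def to_outside_id(id_map: IdMap, inside_id: int) -> Optional[int]:
    """Translate an ID inside the namespace to the ID outside, if mapped."""
    for inside, outside, length in id_map:
        if inside <= inside_id < inside + length:
            return outside + (inside_id - inside)
    return None


host_map = parse_id_map("0 0 4294967295\n")  # what we saw on the worker node
shifted = parse_id_map("0 100000 65536\n")   # hypothetical per-pod mapping
print(is_identity_map(host_map))   # True
print(is_identity_map(shifted))    # False
print(to_outside_id(shifted, 0))   # 100000
```

An identity map on its own is only suggestive, which is why the lsns output above is the real confirmation that a single user namespace is in use.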
As a result, if we use runAsUser to specify a different user
under which to run the container, the container will run as the
specified user both in the container and on the host. The
following transcript demonstrates this.
Delete the pod test:
% oc delete pod test
pod "test" deleted
Add the anyuid SCC to user test:
% oc adm policy add-scc-to-user anyuid test
securitycontextconstraints.security.openshift.io/anyuid added to: ["test"]
Create the pod (as user test):
% oc --as test create -f pod-test.yaml
pod/test created
Following the same procedure as earlier, find the PID (it was
1381728) and observe that it is running as root (UID 0) on the host:
sh-4.4# ls -l -d /proc/1381728
dr-xr-xr-x. 9 root root 0 Nov 5 05:55 /proc/1381728
Consequences for FreeIPA
Traditional applications sometimes assume they will run as root
or some other “reserved” user. FreeIPA is such a case. Likewise,
running systemd in a container means running as UID 0 (from the
container’s point of view).
The lack of user namespace use in OpenShift means that for a process
to run under a particular UID in the container, it must run as that
user on the host too. If your application needs to be root, it
will be root on the host. Other kinds of namespaces (e.g. pid, mnt,
uts, among others) do mitigate the security risk. But if a rogue
process can escalate privileges and escape the other sandbox(es),
the result could be catastrophic.
FreeIPA has a large attack surface: it is composed of many components, some of which are large, complex projects in their own right, and several of which are implemented in C or leverage C libraries. In the absence of user namespaces the risk of container host or co-tenant compromise (even by accident) seems high.
This all assumes that containers do not have user namespace
isolation and that FreeIPA continues to require running processes in
the FreeIPA container as fixed UIDs (probably including root).
I will now discuss possible ways to eliminate these assumptions.
User namespace support in Kubernetes
OpenShift is built on the Kubernetes container platform. Kubernetes Enhancement Proposal KEP-127 proposes user namespace support. The ticket has been open for 4 years and has since seen several efforts to formalise the proposal, the most recent of which is kubernetes/enhancements#2101 (rendered). There have also been several experimental implementations (e.g. #55707, #64005), none of which was accepted (yet).
There has been a recent resurgence of activity on this KEP, and related discussions and pull requests. But that has happened before. I believe that every new (or resurrected) discussion or experiment can move you closer to the goal, and that there can be several false starts before things happen. Maybe this time it will happen… or maybe not.
Right now there is no final proposal and no implementation plan. As a team we cannot proceed on the assumption that Kubernetes will support user namespaces. We will certainly present our case to OpenShift engineering internally at Red Hat, but we have to look at other options.
User namespace support in CRI-O
The CRI-O container runtime recently implemented support for running each pod in a separate user namespace, via annotations on the pod, e.g.:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    io.kubernetes.cri-o.userns-mode: "auto"
spec:
  ...
Using annotations means that no explicit support in Kubernetes is
required. All that is required is that Kubernetes is using the
CRI-O container runtime, and that CRI-O is configured to enable this
feature. OpenShift 4.x does use CRI-O, so we’re halfway there. The
remaining step is to enable the feature in crio.conf:
allow_userns_annotation = true
The developer Giuseppe Scrivano kindly published a screencast showing the feature in action (2 minutes). This feature is not yet in a supported release but is available on the v1.20 branch and is included in OpenShift nightly builds.
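For experimenting with the annotation, it may be handy to generate the pod manifest rather than edit YAML by hand. The sketch below builds the same test pod used earlier in this post, with the userns-mode annotation added, and prints it as JSON (which oc accepts just like YAML). The build_pod helper is my own invention. I have not yet verified how CRI-O treats the result, so take this as a starting point only.

```python
import json
from typing import Any, Dict, List

# Annotation key as given in the CRI-O example above.
USERNS_ANNOTATION = "io.kubernetes.cri-o.userns-mode"


def build_pod(name: str, image: str, command: List[str],
              userns_mode: str = "auto") -> Dict[str, Any]:
    """Build a minimal Pod manifest carrying the CRI-O userns annotation."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            "annotations": {USERNS_ANNOTATION: userns_mode},
        },
        "spec": {
            "containers": [
                {"name": name, "image": image, "command": command},
            ],
        },
    }


pod = build_pod("test", "freeipa/freeipa-server:fedora-31", ["sleep", "3600"])
print(json.dumps(pod, indent=2))
```

The printed manifest can be saved to a file and created with oc create -f, as with pod-test.yaml earlier.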
Splitting the FreeIPA container
If Kubernetes or CRI-O user namespace support does not solve our problem (in our desired timeframe) then there is more pressure to abandon the monolithic container and devote our efforts to a “split-service” FreeIPA/IDM application. In this scenario, the various services that make up FreeIPA (LDAP, KDC, HTTP, CA and others) would each run as an unprivileged process in its own container.
This would be a big engineering effort. Apart from FreeIPA as a whole, most of the constituent services are also “traditional” applications that make assumptions about their environment and execution context—assumptions that do not hold in the OpenShift container paradigm.
There is a general (albeit unevenly distributed) feeling in the team that in the long run this effort is inevitable. I do hold this view myself, but also recognise that the sooner we can have a working proof of concept, the better. That is the main reason we are initially pursuing the monolithic container approach.
Next steps
My next step will be to install an OpenShift cluster based on the nightly builds (which include CRI-O v1.20) and experiment with the annotation-based user namespace support. It seems to be what we want, or a big step in the right direction, but we need to confirm it. Expect a follow-up to this article with my findings, hopefully in the next week!