Tags: dogtag, ldap, troubleshooting

Dogtag, number ranges and VLV indices

In a previous post I explained Dogtag’s identifier range management. This is how a Dogtag replica knows what range it should use to assign serial numbers, request IDs, etc. What that article did not cover is how Dogtag, at startup, works out where it is up to in the range. In this post I explain how it uses LDAP Virtual List View to do that, how it can break, and how to fix it.

LDAP Virtual List View

The LDAP protocol has an optional extension called Virtual List View (VLV), which is specified in an expired Internet-Draft. VLV supports result paging and is an extension of the Server Side Sort (SSS) control (RFC 2891). For a search that is covered by a VLV index, a client can specify a page size and offset and get just that portion of the result. It can also seek a specified attribute value and return nearby results.
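To illustrate the paging form of the control, here is a small sketch that builds the value for the vlv search extension of OpenLDAP's ldapsearch, which takes before/after/offset/count for offset-based requests. The helper name vlv_page is made up for this example. It assumes offsets are 1-based and passes a count of 0, meaning the client has no estimate of the list size.

```shell
# Build the ldapsearch "-E vlv=..." value for a given page.
# vlv_page is a hypothetical helper, not part of any tool.
# Usage: vlv_page PAGE_SIZE PAGE_NUMBER   (pages are numbered from 1)
vlv_page() (
    size=$1
    page=$2
    offset=$(( (page - 1) * size + 1 ))   # VLV offsets start at 1
    after=$(( size - 1 ))                 # target entry plus (size - 1) following
    printf 'vlv=0/%d/%d/0\n' "$after" "$offset"
)

vlv_page 10 1    # first page of ten entries
vlv_page 10 3    # third page of ten entries
```

The printed value would be passed to ldapsearch with -E, alongside an sss extension naming the sort attribute.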

In 389DS / RHDS, a VLV index is defined by two objects under cn=config. One of the VLV indices used in Dogtag is the search of all certificates sorted by serial number:

dn: cn=allCerts-pki-tomcat,
    cn=ipaca, cn=ldbm database, cn=plugins, cn=config
objectClass: top
objectClass: vlvSearch
cn: allCerts-pki-tomcat
vlvBase: ou=certificateRepository,ou=ca,o=ipaca
vlvScope: 1
vlvFilter: (certstatus=*)

dn: cn=allCerts-pki-tomcatIndex, cn=allCerts-pki-tomcat,
    cn=ipaca, cn=ldbm database, cn=plugins, cn=config
objectClass: top
objectClass: vlvIndex
cn: allCerts-pki-tomcatIndex
vlvSort: serialno
vlvEnabled: 0
vlvUses: 0

The first object defines the search base, scope and filter. A VLV search must use the same base, scope and filter as this object. The second object declares which attribute is the sort key. To perform a VLV search the client must use both the SSS control (which chooses the sort key) and the VLV control (which selects the page or the value of interest).

Dogtag range initialisation

When Dogtag is starting up, for each active identifier range it has to determine the first unused number. It uses VLV searches to do this. For serial numbers, it uses the VLV index shown above. For request IDs and other ranges, there are other indices. The VLV search targets the upper limit of the range, and requests the preceding values. It then looks for the highest value in the result that is also within the active range. This is the last number that was used; we increment it to get the next available number.

To make it a bit more concrete, we can perform a VLV search ourselves using ldapsearch:

# ldapsearch -LLL -D "cn=Directory Manager" -w $DM_PASS \
    -b ou=certificateRepository,ou=ca,o=ipaca -s one \
    -E 'sss=serialno' -E 'vlv=1/0:09267911168' \
    '(certStatus=*)' 1.1
dn: cn=397,ou=certificateRepository,ou=ca,o=ipaca

dn: cn=267911185,ou=certificateRepository,ou=ca,o=ipaca

# sortResult: (0) Success
# vlvResultpos=2 count=177 context= (0) Success

In this search the target value (end of the active range) is 09267911168. This is the integer 267911168 preceded by a two-digit length value. This is needed because the serialno attribute has Directory String syntax, which is sorted lexicographically. The 1/0 part of the control is asking for one value preceding the target value, and zero values following it.
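The length prefix is easy to reproduce in the shell. This is only a sketch of the encoding as described above, not Dogtag's own code, and encode_id is a name I made up.

```shell
# Length-prefix a non-negative decimal number the way the text describes:
# two digits giving the number of digits, followed by the number itself.
# encode_id is a hypothetical helper name.
encode_id() (
    n=$1
    printf '%02d%s\n' "${#n}" "$n"
)

encode_id 267911168    # prints 09267911168
encode_id 397          # prints 03397

# As plain strings "267911168" sorts before "397"; with the prefix,
# string order agrees with numeric order.
printf '%s\n' "$(encode_id 267911168)" "$(encode_id 397)" | LC_ALL=C sort
```

The final command prints 03397 first, then 09267911168.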

The result contains two objects: 397 (which precedes the target) and 267911185 (which follows it). Why did we get a number following the target value? The target entry is the first entry whose sort attribute value is greater than or equal to the target value. In this way, results greater than the target can appear in the result, as happened here.

The search above relates to the range 1..267911168. The result tells us to initialise the repository with 397 as the “last used” number. The next certificate issued by this replica will have serial number 398.
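The selection step can be sketched in a few lines of shell. This is my paraphrase of the logic, not the Dogtag implementation. The name next_id is hypothetical, and I am assuming that the upper limit is inclusive and that an unused range starts at its lower limit.

```shell
# Pick the highest candidate within [lo, hi], then print the next number
# to assign. Hypothetical helper; it relies on shell integer arithmetic,
# so it only suits values that fit in 64 bits.
# Usage: next_id LO HI CANDIDATE...
next_id() (
    lo=$1
    hi=$2
    shift 2
    last=$(( lo - 1 ))   # assumption: nothing used yet means start at lo
    for n in "$@"; do
        if [ "$n" -ge "$lo" ] && [ "$n" -le "$hi" ] && [ "$n" -gt "$last" ]; then
            last=$n
        fi
    done
    echo $(( last + 1 ))
)

# Range 1..267911168 with the two entries returned by the search above.
next_id 1 267911168 397 267911185    # prints 398
```

The entry 267911185 is ignored because it lies outside the active range, which leaves 397 as the last used number.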

VLV index corruption

If a VLV index is corrupt or incomplete, Dogtag could initialise a repository with a too-low “last used” number. This could happen for serial numbers, request IDs or any other kind of managed range. When that happens, CA operations including certificate issuance or CSR submission could fail.

In fact, the ldapsearch above is from a customer case. A full search of the ou=certificateRepository showed thousands of certificates that were not included in the VLV index. If CA operations are failing due to LDAP “Object already exists” errors, you can perform this check to confirm or rule out VLV index corruption as the source of the problem. Keep in mind that VLV indices are maintained separately on each replica. Checks have to be performed on the replica where the problem is occurring.
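One way to make the comparison is to count the entries returned by a full search and compare that with the count reported in the VLV result. The two helpers below are hypothetical and only parse ldapsearch output. I am assuming the count in the vlvResult line reflects how many entries the index holds.

```shell
# Count entries in ldapsearch LDIF output read from stdin.
count_entries() {
    grep -c '^dn:'
}

# Extract the list size the server reports in the "# vlvResult" line.
vlv_count() {
    sed -n 's/^# vlvResult.* count=\([0-9][0-9]*\).*/\1/p'
}

# Sample taken from the VLV search output shown earlier.
sample='dn: cn=397,ou=certificateRepository,ou=ca,o=ipaca

dn: cn=267911185,ou=certificateRepository,ou=ca,o=ipaca

# sortResult: (0) Success
# vlvResultpos=2 count=177 context= (0) Success'

printf '%s\n' "$sample" | count_entries   # prints 2
printf '%s\n' "$sample" | vlv_count       # prints 177

# Against a live server (not run here), the full count would come from:
#   ldapsearch -LLL -D "cn=Directory Manager" -w "$DM_PASS" \
#       -b ou=certificateRepository,ou=ca,o=ipaca -s one \
#       '(certStatus=*)' 1.1 | count_entries
```

If the full count is much larger than the VLV count, the index is missing entries.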

Rebuilding VLV indices

389DS makes it easy to rebuild a VLV index. You create a task object and the DS takes care of it. For Dogtag, we even provide a template LDIF file for a task that reindexes all the VLV indices that Dogtag creates and uses.

First, copy and fill the template:

$ /bin/cp /usr/share/pki/ca/conf/vlvtasks.ldif .
$ sed -i "s/{instanceId}/pki-tomcat/g" vlvtasks.ldif
$ sed -i "s/{database}/ipaca/g" vlvtasks.ldif

Note that {database} should be replaced with ipaca in a FreeIPA instance, but for a standalone Dogtag deployment the correct value is usually ca. Now let’s look at the LDIF file:

dn: cn=index1160589769, cn=index, cn=tasks, cn=config
objectclass: top
objectclass: extensibleObject
cn: index1160589769
ttl: 10
nsinstance: ipaca
nsindexVLVAttribute: allCerts-pki-tomcatIndex
# ... 33 more nsindexVLVAttribute values

The cn is just a name for the task. I think you can put anything here. ttl specifies how many seconds 389DS will wait after the task finishes, before deleting it.

This task object refers to VLV indices in the Dogtag database. But you can see all that is needed to rebuild any VLV index is the nsinstance (name of the database) and the nsindexVLVAttribute (name of a VLV index).
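For example, a task that rebuilds a single VLV index could be generated like this. It is a sketch modelled on the template above. The function name and the rebuild- task name are made up, and the ttl is copied from the template.

```shell
# Print a reindex task entry for one VLV index.
# vlv_task_ldif and the "rebuild-..." task name are hypothetical.
# Usage: vlv_task_ldif DATABASE VLV_INDEX_NAME
vlv_task_ldif() {
    cat <<EOF
dn: cn=rebuild-$2, cn=index, cn=tasks, cn=config
objectclass: top
objectclass: extensibleObject
cn: rebuild-$2
ttl: 10
nsinstance: $1
nsindexVLVAttribute: $2
EOF
}

vlv_task_ldif ipaca allCerts-pki-tomcatIndex
```

The output can be redirected to a file and added with ldapadd in the same way as the full template.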

Now we add the object, wait a few seconds, and have a look at it:

$ ldapadd -x -D "cn=Directory Manager" -w $DM_PASS \
    -f vlvtasks.ldif
$ sleep 5
$ ldapsearch -x -D "cn=Directory Manager" -w $DM_PASS \
  -b "cn=index1160589769,cn=index,cn=tasks,cn=config"
dn: cn=index1160589769,cn=index,cn=tasks,cn=config
objectClass: top
objectClass: extensibleObject
cn: index1160589769
ttl: 10
nsinstance: ipaca
nsindexvlvattribute: allCerts-pki-tomcatIndex
# ... 33 more nsindexvlvattribute values
nsTaskCurrentItem: 0
nsTaskTotalItems: 1
nsTaskCreated: 20200916021128Z
nsTaskLog:: aXBhY2E6IEluZGV4aW #... (base64-encoded log)
nsTaskStatus: ipaca: Finished indexing.
nsTaskExitCode: 0

We can see that the task finished successfully, and there is some (truncated) log output if we want more details. After a few more seconds, 389DS will delete the object. You can increase the ttl if you want to keep the objects for longer.
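If you want to script the check rather than eyeball it, the exit code can be pulled out of the ldapsearch output. This is a minimal sketch with a hypothetical helper name. I am assuming that nsTaskExitCode only appears once the task has finished.

```shell
# Read a task entry (LDIF) on stdin and print its nsTaskExitCode, if present.
# Attribute names are matched case-insensitively. Hypothetical helper.
task_exit_code() {
    awk 'tolower($1) == "nstaskexitcode:" { print $2 }'
}

# Abbreviated copy of the task entry shown above.
sample='dn: cn=index1160589769,cn=index,cn=tasks,cn=config
nsTaskStatus: ipaca: Finished indexing.
nsTaskExitCode: 0'

code=$(printf '%s\n' "$sample" | task_exit_code)
if [ "$code" = "0" ]; then
    echo "task finished successfully"
elif [ -z "$code" ]; then
    echo "no exit code yet (task still running, or entry already deleted)"
else
    echo "task failed with exit code $code"
fi
```

In practice the sample would be replaced by the output of the ldapsearch command shown above.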

Discussion

This year I have encountered variations of this problem on several occasions. I don’t know what the cause(s) are, i.e. why VLV indices get corrupted or stop updating. Hopefully DS experts will be able to shed more light on the issue.

We are considering adding an automated check to the FreeIPA Health Check system, specifically for the range management VLVs. The GitHub ticket already contains some discussion and high level steps of how the check would work.

The proper fix for this issue is to move to UUIDs for all object identifiers. Serial numbers might need something different but it is the same idea. This work is on the roadmap. So many problems will go away when we make this change.

Historical commentary: I don’t know why the serialno, requestId and other attributes use Directory String syntax, which necessitates the length prefixing hack. Maybe SSS/VLV only work on strings (or it was thus in the past). The code predates our current VCS and the reasons are lost in time. The implication of this is that we can only handle numbers up to 99 decimal digits. Assumptions like this do bother me, but I think we are probably OK here. For my lifetime, anyway.

Except where otherwise noted, this work is licensed under a Creative Commons Attribution 4.0 International License.