notes.billmill.org / link_blog / 2024 / 07 /

German Strings

last updated: Jul 17, 2024

https://cedardb.com/blog/german_strings/

An article about a particular method ("German strings") for representing strings in databases.

For short (<12 byte) strings, it's a 4 byte length followed by 12 bytes, avoiding the need for a pointer dereference

 0                   1                   2                   3  
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             length                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                             string                            |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

For anything else, it's a 4 byte length, followed by 4 bytes of its prefix, 2 bits of "class", and 62 bits of pointer

 0                   1                   2                   3  
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             length                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                             prefix                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|cl.|                                                           |
+-+-+                                                           +
|                            pointer                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

They hint at, but don't talk about, the encoding of the string. I assume that they're utf-8 encoding the string into bytes, but don't say, for example, how they treat a long string which changes meaning or becomes invalid if you chop off the 13th byte. (For example, imagine the 11th and 12th bytes are a combining diaeresis or a 4-byte emoji spans the 12th byte)

They point to duckdb's string implementation as an example; as far as I can tell from a quick scan it doesn't use the two-bit class, but instead just a regular 64-bit pointer.

They also point to apache arrow and polars documentation of their string types.

Backlinks:

protocol

↑ up