German Strings

last updated: Jul 17, 2024

https://cedardb.com/blog/german_strings/

An article about a particular method ("German strings") for representing strings in databases.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            length                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                            string                             |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            length                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                            prefix                             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|cl.|                                                           |
+-+-+                                                           +
|                            pointer                            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
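To make the two 16-byte layouts above concrete, here is a minimal Python sketch that packs a byte string into either form. Several details are my assumptions, not something the diagrams or the article pin down: little-endian byte order, the two class bits sitting in the top two bits of the last 64-bit word, zero padding for short strings, and the names `pack_german_string`, `pointer` and `storage_class`, which are hypothetical. The pointer is just an integer here, since Python has no raw pointers.

```python
import struct

INLINE_MAX = 12  # bytes that fit in the 96-bit "string" field of the short form


def pack_german_string(data: bytes, pointer: int = 0, storage_class: int = 0) -> bytes:
    """Pack `data` into a 16-byte header following the two diagrams.

    Assumptions (not from the source): little-endian fields, class bits in
    the top two bits of the final 64-bit word, zero padding for short strings.
    """
    length = len(data)
    if length > 0xFFFFFFFF:
        raise ValueError("length does not fit in 32 bits")

    if length <= INLINE_MAX:
        # Short form: 32-bit length, then up to 12 content bytes.
        # The "12s" format pads with zero bytes when data is shorter.
        return struct.pack("<I12s", length, data)

    if not 0 <= storage_class < 4:
        raise ValueError("class must fit in 2 bits")
    if not 0 <= pointer < (1 << 62):
        raise ValueError("pointer must fit in 62 bits")

    # Long form: 32-bit length, 4-byte prefix, 2-bit class + 62-bit pointer.
    tagged_pointer = (storage_class << 62) | pointer
    return struct.pack("<I4sQ", length, data[:4], tagged_pointer)


short = pack_german_string(b"hello")
long_ = pack_german_string(b"a fairly long string", pointer=0x1234, storage_class=2)
print(len(short), len(long_))
```

Both forms come out at 16 bytes, and only the long form needs a pointer dereference to reach the full content.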

They hint at, but don't discuss, the encoding of the string. I assume they're encoding the string into bytes as UTF-8, but they don't say, for example, how they treat a long string that changes meaning or becomes invalid if you chop off the 13th byte. (For example, imagine the 11th and 12th bytes are a combining diaeresis, or a 4-byte emoji spans the 12th byte.)
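The truncation concern can be shown with a short Python sketch, assuming UTF-8 and a cut at a fixed byte count. The helper name `truncated_utf8` and the sample strings are hypothetical. This only shows what a byte-level cut does to the text; it says nothing about how the article's implementation actually handles the case.

```python
from typing import Optional


def truncated_utf8(text: str, n: int) -> Optional[str]:
    """Return the first n UTF-8 bytes of `text` decoded back to a string,
    or None if the cut lands in the middle of a multi-byte character."""
    head = text.encode("utf-8")[:n]
    try:
        return head.decode("utf-8")
    except UnicodeDecodeError:
        return None


# 11 ASCII bytes followed by a 4-byte emoji: cutting at 12 bytes splits the emoji.
split_emoji = truncated_utf8("a" * 11 + "\U0001F600", 12)

# 12 ASCII bytes followed by a combining diaeresis (2 bytes in UTF-8):
# the cut is valid UTF-8, but the final "a" has lost its diaeresis.
lost_mark = truncated_utf8("a" * 12 + "\u0308", 12)

print(split_emoji, lost_mark)
```

The first cut yields bytes that are not valid UTF-8 on their own, while the second yields valid text that reads differently from the original.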

They point to DuckDB's string implementation as an example; as far as I can tell from a quick scan, it doesn't use the two-bit class and instead stores a regular 64-bit pointer.

They also point to the Apache Arrow and Polars documentation of their string types.
