unicode is harder than you think

Jul 25, 2023


A nice walk through the history of Unicode, with code samples in Python and C, explaining why each encoding sucks (but UTF-8 sucks the least) and why Unicode is not easy.

The beauty of UTF-8 is that code that splits, searches or synchronises using ASCII symbols will work fine as-is, with little to no modification, even with Unicode text.
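To convince myself of this, here is a minimal sketch in Python. The helper name `split_utf8_fields` and the sample string are made up for illustration and are not from the article. It splits raw UTF-8 bytes on an ASCII comma without decoding first, which is safe because every byte of a multi-byte UTF-8 sequence is 0x80 or above and so can never be mistaken for an ASCII byte.

```python
def split_utf8_fields(raw, delimiter=b","):
    """Split UTF-8 bytes on an ASCII delimiter, then decode each field.

    Hypothetical helper for illustration only.
    """
    # Splitting happens on the raw bytes; no decoding is needed to find
    # the delimiter because ASCII bytes never appear inside a multi-byte
    # UTF-8 sequence.
    return [field.decode("utf-8") for field in raw.split(delimiter)]


# Sample text: "naive" with a diaeresis, three CJK characters, "cafe" with an acute.
text = "na\u00efve,\u65e5\u672c\u8a9e,caf\u00e9"
raw = text.encode("utf-8")

fields = split_utf8_fields(raw)
print(fields)

# Every byte of a multi-byte sequence has the high bit set.
print(all(b >= 0x80 for b in "\u65e5".encode("utf-8")))  # True
```

The same trick does not carry over to UTF-16 or UTF-32, where a byte equal to 0x2C can show up inside the encoding of a non-ASCII character.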

Another headache is the fact Unicode also may define special forms for the same letter or group of letters, which are visibly different but understood by humans to be derived from the same symbol.

A very common example of this is the ﬁ (U+FB01), ﬂ (U+FB02), ﬀ (U+FB00) and ﬃ (U+FB03) ligatures, which are ubiquitous in Latin text as a “more readable” form of the fi, fl, ff and ffi digraphs. In general, users expect oﬃce, oﬀice and office to be treated and rendered similarly, because they all represent the same identical word, but not necessarily without any visual difference.
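A quick sketch of what "treated similarly" can look like in practice, using Python's standard `unicodedata` module. The function name `fold_compat` is my own, and picking NFKC for the comparison is an assumption on my part rather than something the article prescribes. The ligatures are written as escapes so they survive copy and paste.

```python
import unicodedata


def fold_compat(text):
    """Replace compatibility characters such as ligatures with their
    plain equivalents (hypothetical helper, NFKC chosen as an assumption)."""
    return unicodedata.normalize("NFKC", text)


with_ffi = "o\ufb03ce"   # "office" spelled with the U+FB03 ffi ligature
with_ff = "o\ufb00ice"   # "office" spelled with the U+FB00 ff ligature
plain = "office"

# A naive comparison treats the three spellings as different strings.
print(with_ffi == plain)  # False

# After compatibility normalization they compare equal.
print(fold_compat(with_ffi) == fold_compat(with_ff) == plain)  # True
```

Note that this is lossy: once folded, there is no way to tell which spelling the user originally typed, so it is probably better suited to search and comparison than to storage.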

There is a section on the different Unicode normalization forms, which is very helpful for me as I always get them confused.
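As a note to self, here is a small sketch that runs one string through all four forms with Python's `unicodedata.normalize`. The helper name `normalization_table` and the sample word are mine, not the article's. Roughly: the "D" forms decompose, the "C" forms recompose, and the "K" forms additionally replace compatibility characters such as ligatures.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normalization_table(text):
    """Return a dict mapping each normalization form to the normalized text.

    Hypothetical helper for illustration only.
    """
    return {form: unicodedata.normalize(form, text) for form in FORMS}


# "fiance" written with the U+FB01 fi ligature and a precomposed e-acute (U+00E9).
sample = "\ufb01anc\u00e9"

for form, result in normalization_table(sample).items():
    # Show the code points so the differences are visible even when the
    # strings render identically.
    codepoints = " ".join(f"U+{ord(ch):04X}" for ch in result)
    print(f"{form:4} len={len(result)} {codepoints}")

# NFC  keeps both the ligature and the precomposed accent (5 code points).
# NFD  splits the accent into e + U+0301 but keeps the ligature (6).
# NFKC replaces the ligature with "fi" and keeps the accent composed (6).
# NFKD does both decompositions (7).
```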
