Metal Umlauts, Searching, and Other Unicode Fun
Unicode, the computer “alphabet” that includes all the characters you see on this page, plus most modern writing systems in common use, plus punctuation and currency (௹), plus arrows and mathematical notation, plus drawing symbols, plus emoji, and more, has a lot going on in it beyond the obvious complexity of multiple formats (UTF-8, UTF-16, GB18030, UTF-32, BOCU, SCSU, UTF-7, and probably others) and byte orderings. The part that has grabbed my interest recently is the idea of Normalization Forms, of which we have four.
- NFD: Canonical Decomposition
- NFC: Canonical Composition
- NFKD: Compatibility Decomposition
- NFKC: Compatibility Composition
Specifically, Normalization Form Canonical Decomposition interests me, because it represents each accented letter in a string as the base letter followed by any accents.
Better yet, in JavaScript, it’s easy to change normalization forms. Specifically, for these purposes, we want:
str.normalize('NFD');
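To make the decomposition concrete, here is a minimal sketch that compares the two canonical forms of a single accented letter. The names composed, decomposed, and toHex are mine, picked only for illustration.

```javascript
// "é" as one precomposed code point (U+00E9).
const composed = '\u00e9';
// NFD splits it into "e" (U+0065) plus a combining acute accent (U+0301).
const decomposed = composed.normalize('NFD');

// List the code points in a string as hexadecimal strings.
const toHex = (s) => [...s].map((ch) => ch.codePointAt(0).toString(16));

console.log(toHex(composed));   // [ 'e9' ]
console.log(toHex(decomposed)); // [ '65', '301' ]

// NFC goes the other way and recombines the pair.
console.log(decomposed.normalize('NFC') === composed); // true
```

Both strings render the same on screen, which is why a plain === comparison between them can be surprising.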
The decomposed letters have some nice uses.
Sorting
At least in English, diacritical marks are usually a marker for either history (fiancée, über, soupçon, Māori, piñata) or pronunciation (naïve, coöperate), rather than an element of spelling. This is especially true of names, where we generally want a person’s name to be represented properly (Karel Čapek, Charlotte Brontë, Beyoncé Knowles), and that name can come from anywhere in the world, but English treats the marks more as an affectation than a critical element of the name.
Of particular importance, here, is that we generally wish to sort a name with accented letters as if the accents don’t exist. So, we want piñata to sort identically to “pinata” and Čapek to sort like “Capek.”
The decomposed form allows us to do this by stripping the diacritical marks out of the string when we sort it.
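const sortedStrings = strings.sort((a, b) => {
  // Decompose each string, then strip the combining marks.
  const aNorm = a
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '');
  const bNorm = b
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '');
  // A comparator has to return 0 for equal keys to be consistent.
  if (aNorm === bNorm) {
    return 0;
  }
  return aNorm < bNorm ? -1 : 1;
});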
That admittedly looks a bit complicated, given the regular expression, but the entire process boils down to decomposing the strings, stripping off the diacritical marks (Unicode codepoints 0x0300 to 0x036f), and comparing the results.
In other words, by normalizing the name, the computer represents “Čapek” something like
[C] [caron] [a] [p] [e] [k]
Then, we remove any diacritical marks (the caron, or ˇ, in this case) by replacing them with nothing, leaving us with only the Latin letters.
Searching
More so than with sorting, it’s a better experience to search without regard for diacritical marks. For example, an increasing number of laws intended to suppress minority votes are posed as “exact match” measures, which require that handwritten voter registration documents be identical to personal identification documents, meaning that the exactness of accents and diacritical marks relies primarily on the comprehension and interest of an underpaid data entry clerk.
By the same token, even something with much lower stakes like searching an employee directory shouldn’t rely on the person searching for Beyoncé realizing that she has an acute accent in her name or that Human Resources input her name properly.
And neither of those even touches on the problem that a standard keyboard for English doesn’t have an easy way to type accented characters. So, even if a user has cleared the above hurdles, it’s still a waste of the user’s time to make them hunt down the exact spelling with diacritical marks.
We can solve this problem in a way similar to the sorting, normalizing and stripping both the target string and the corpus being searched.
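As a rough sketch of that idea, assuming a simple substring search over an array of names: the helpers foldDiacritics and searchIgnoringMarks are hypothetical names of my own, and the lowercasing step is an extra assumption so that capitalization doesn’t get in the way either.

```javascript
// Hypothetical helper: decompose, drop combining marks, and lowercase.
function foldDiacritics(text) {
  return text
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .toLowerCase();
}

// Hypothetical helper: return the entries that contain the query,
// ignoring diacritical marks and letter case on both sides.
function searchIgnoringMarks(entries, query) {
  const needle = foldDiacritics(query);
  return entries.filter((entry) => foldDiacritics(entry).includes(needle));
}

// Sample data: Beyoncé Knowles, Karel Čapek, Charlotte Brontë.
const directory = [
  'Beyonc\u00e9 Knowles',
  'Karel \u010capek',
  'Charlotte Bront\u00eb',
];

console.log(searchIgnoringMarks(directory, 'beyonce')); // [ 'Beyoncé Knowles' ]
console.log(searchIgnoringMarks(directory, 'capek'));   // [ 'Karel Čapek' ]
```

Because the original entries are returned rather than the folded ones, the user still sees the name spelled properly in the results.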
Metal Umlauts (or Meţal Uml̼aut̨s)
It’s a bit before my time, but one of my favorite television shows growing up (via re-runs and now streaming) is Mission: Impossible, in no small part because of the signage in their fictional foreign countries. Especially in earlier episodes, to make foreign countries seem both exotic and approachable to American audiences, show creator Bruce Geller had the idea of creating signs written mostly in English, but a version of English with clever misspellings representative of stereotypes of certain parts of the world, often including bogus diacritical marks.
For example, if you pay careful attention, you’ll spot both Zöna Restrik (for Restricted Area) and Prıziion Mılıtık (for Military Prison).
And, of course, if you’re a heavy metal music fan, you’re undoubtedly familiar with the similar but distinct Metal Umlaut, though its use seems surprisingly limited to the diaeresis (¨) mark.
If we wanted to do something like transforming English text to Gellerese, you’re on your own figuring out how to change the base spelling in a reasonable way. But adding bogus diacritical marks? That, we can definitely do.
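let output = '';
str = str.normalize('NFD');
for (let i = 0; i < str.length; i++) {
  const c = str[i];
  output += c;
  if (/[a-z]/i.test(c)) {
    // floor(log2(x)) for x in [0, 3) is 1 about a third of the time and
    // zero or negative otherwise, so most letters get no mark at all.
    const rLen = Math.floor(Math.log2(Math.random() * 3));
    for (let j = 0; j < rLen; j++) {
      // Pick a combining mark from U+0300 through U+036F, inclusive.
      const rCh = 0x0300 + Math.floor(Math.random() * 0x0070);
      output += String.fromCharCode(rCh);
    }
  }
}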
Again, we normalize the input string. But instead of removing diacritical marks as we’ve been doing, here we visit each character and, if it’s a letter, we pick a random-but-small number of diacritical marks to add (using log2() biases the count toward the low end), and then select the marks from that same 0x0300 to 0x036f range we previously needed to remove.
If desired, this can easily be made more “intelligent” with lists of diacritical marks that are more appropriate to that letter, so that you don’t end up with implausible combinations like what you see in the above section heading.
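One way such lists might look is sketched below. The plausibleMarks table, the decorate function, and its chance and random parameters are all hypothetical, and the choice of marks per letter is only my guess at what looks believable, not a linguistic reference.

```javascript
// Hypothetical table: combining marks that look plausible on each base letter.
const plausibleMarks = {
  a: ['\u0300', '\u0301', '\u0302', '\u0308', '\u030a'], // grave, acute, circumflex, diaeresis, ring
  e: ['\u0300', '\u0301', '\u0302', '\u0308'],           // grave, acute, circumflex, diaeresis
  o: ['\u0301', '\u0302', '\u0303', '\u0308'],           // acute, circumflex, tilde, diaeresis
  u: ['\u0301', '\u0302', '\u0308'],                     // acute, circumflex, diaeresis
  c: ['\u0327', '\u030c'],                               // cedilla, caron
  n: ['\u0303'],                                         // tilde
};

// Add at most one mark per letter, chosen from that letter's own list.
// `chance` is the probability of decorating a letter that has a list;
// `random` can be swapped out to make the result repeatable.
function decorate(text, chance = 0.5, random = Math.random) {
  let output = '';
  for (const ch of text.normalize('NFD')) {
    output += ch;
    const marks = plausibleMarks[ch.toLowerCase()];
    if (marks && random() < chance) {
      output += marks[Math.floor(random() * marks.length)];
    }
  }
  // Recompose so letters with precomposed forms become single code points.
  return output.normalize('NFC');
}

console.log(decorate('Restricted Area'));
```

Letters without an entry in the table pass through untouched, so the output stays readable even at a high chance value.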
In either case, it might be a good idea to call output.normalize('NFC') at the end to set the characters back to their “composed” forms.
Exception
One place where normalization has no effect is the Polish L-with-stroke (Ł or ł). It turns out that, as far as Unicode is concerned, those are letters unto themselves rather than modified letters. So, if you’re planning on using any of these techniques, you will want to take that into account.
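A minimal sketch of one way to take it into account, assuming a small hand-written lookup table applied after the usual stripping: the names unDecomposable and stripMarks are hypothetical, and the table covers only the two letters mentioned here.

```javascript
// NFD leaves the letter alone: there is no base letter plus mark to split into.
console.log('\u0141'.normalize('NFD') === '\u0141'); // true

// Hypothetical lookup table for letters with no canonical decomposition.
// Only the L-with-stroke pair is listed; extend it as other cases turn up.
const unDecomposable = { '\u0141': 'L', '\u0142': 'l' };

// Hypothetical helper: strip combining marks, then apply the lookup table.
function stripMarks(text) {
  return text
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .replace(/[\u0141\u0142]/g, (ch) => unDecomposable[ch]);
}

console.log(stripMarks('\u0141\u00f3d\u017a')); // "Łódź" becomes "Lodz"
```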
Other (Programming) Languages
The above sample code snippets are all in JavaScript, but the Windows API supports NormalizeString() and .NET has supported String.Normalize() for quite some time. Ruby supports string.unicode_normalize(). It shouldn’t be hard to find the equivalent for other languages, now that we know the key words to search for are “unicode normalize,” maybe throwing in “nfd” or “decomposed” to make the context clearer.
Happy umlauting!
Credits: Untitled header photograph from PxHere, made available under the CC0 1.0 Universal Public Domain Dedication.