Prefixen frequencies

On the way to lunch today, Bob asked where the word ``cantilever'' comes from[*]. Jerry and I didn't know. So, we tried to think of other ``canti-'' words to compare. The only one we could come up with was ``canticle'', which didn't seem related.

So, we backed off to looking for ``cant-'' words. We came up with:

  • cant, cantaloupe, cantankerous, canteen, cantina, cantor[**]

This seemed like a pretty slender set. But it got me thinking. There are a few really dominant four-letter prefixen like ``over-'', ``unde-'', ``anti-'', and ``semi-''. I wondered how the other four-letter starts to words lined up.

A few quick tests with /usr/share/dict/words under Mac OS X show that there are 18244 different combinations of the opening four letters among non-capitalized words of at least four letters. This is out of the 456976 (26^4) possible four-letter combinations, or about 4% of them.
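For anyone who wants to repeat the count, here is a minimal Python sketch. It is not necessarily how the quick tests above were run. The function names opener_counts and load_words are made up for illustration, and the filter assumes that ``non-capitalized'' means a word made up entirely of lowercase a-z, so the totals may differ a little from the numbers quoted here.

```python
from collections import Counter


def opener_counts(words, n=4):
    """Count how many words share each n-letter opener.

    Assumption: a "non-capitalized" word is one made up entirely of
    lowercase ASCII letters. Words shorter than n letters are skipped.
    """
    counts = Counter()
    for word in words:
        word = word.strip()
        if (len(word) >= n and word.isascii()
                and word.isalpha() and word.islower()):
            counts[word[:n]] += 1
    return counts


def load_words(path="/usr/share/dict/words"):
    """Read a one-word-per-line dictionary file into a list of words."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().split()
```

With a dictionary file in place, len(opener_counts(load_words())) gives the number of distinct four-letter openers, and 26 ** 4 gives the 456976 possible ones.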

The /usr/share/dict/words file has some things that may be considered repeats, like ``cantankerous'' and ``cantankerously''. Similarly, it only has one entry per homograph. But, if you're not too picky about dimpled chads, it will give a pretty decent idea of the big picture.

According to /usr/share/dict/words, there are 78 non-capitalized words which begin with ``cant-''. Compared to the other 18243 four-letter combinations that start words, this is quite a lot: starting 78 words puts it in the 97th percentile. A full 6142 (more than 1/3) of the 18244 four-letter openers only opened one word.
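The percentile and the single-word tally could be worked out along these lines. This is only a sketch: opener_stats is a hypothetical helper, counts is assumed to be a mapping from each four-letter opener to the number of words it starts, and ``percentile'' is taken to mean the share of openers that start strictly fewer words. A different definition of percentile would shift the figure slightly.

```python
def opener_stats(counts, opener):
    """Return (words_started, percentile, singletons) for one opener.

    counts: mapping of opener -> number of words it starts (assumed input).
    percentile: percentage of openers that start strictly fewer words
                than the given opener (one of several possible definitions).
    singletons: number of openers that start exactly one word.
    """
    target = counts[opener]
    total = len(counts)
    below = sum(1 for c in counts.values() if c < target)
    percentile = 100.0 * below / total
    singletons = sum(1 for c in counts.values() if c == 1)
    return target, percentile, singletons
```

Calling opener_stats(counts, "cant") on counts built from the real word list would be the way to check the 78, 97th-percentile, and 6142 figures.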

The top 10 four-letter openers accounted for almost 5% of the non-capitalized words of at least four letters in /usr/share/dict/words.[***]
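The share held by the top openers is a short calculation once the per-opener counts exist. A rough sketch, again assuming a mapping from opener to word count (the name top_share is hypothetical):

```python
from collections import Counter


def top_share(counts, k=10):
    """Return the k most common openers and their share of all counted words.

    counts: mapping of opener -> number of words it starts (assumed input).
    The share is a fraction between 0 and 1.
    """
    counter = Counter(counts)
    total = sum(counter.values())
    top = counter.most_common(k)
    share = sum(c for _, c in top) / total
    return top, share
```

A share just under 0.05 for k=10 would match the ``almost 5%'' figure above.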

I wonder if other languages have similar histograms.

