Unknown Language Identification



This page will attempt to identify the language of a sample of text by comparing the n-gram statistics in the sample to the n-gram statistics of a number of reference texts in different languages. It was developed from my dissertation at Georgetown University (The Genetic Classification of Languages by n-Gram Analysis: A Computational Technique, 1998), which was itself based on an algorithm known as Acquaintance, developed by Dr. Marc Damashek (see Damashek, M., Gauging Similarity with n-Grams: Language-Independent Categorization of Text, Science, 10 February 1995, Vol. 267, pp. 843-8).
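To give a rough idea of what comparing n-gram statistics can look like, here is a minimal Python sketch. It is not the Acquaintance algorithm or the code behind this page: the choice of character trigrams, the cosine similarity measure, and the names ngram_profile, cosine_similarity, and rank_languages are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngram_profile(text, n=3):
    """Return the relative frequency of each character n-gram in text."""
    # Lowercase and collapse whitespace so layout does not affect the counts.
    cleaned = " ".join(text.lower().split())
    counts = Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse n-gram frequency vectors."""
    dot = sum(weight * q.get(gram, 0.0) for gram, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def rank_languages(sample, references, n=3):
    """Rank reference languages by similarity to the sample, best first.

    references maps a language name to a reference text (hypothetical data
    supplied by the caller, standing in for the reference database).
    """
    sample_profile = ngram_profile(sample, n)
    scores = {
        name: cosine_similarity(sample_profile, ngram_profile(text, n))
        for name, text in references.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling rank_languages(sample_text, {"English": english_text, "Spanish": spanish_text}) returns the language names paired with their scores, highest first. As with the page itself, a sketch like this can only ever pick among the reference texts it is given.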

To understand how the algorithm works, what constitutes a good sample text, and how to interpret the output, see the Acquaintance information page. To see the list of reference languages currently in the database, go to the reference languages page. Note that if a language is not listed on the reference languages page, the algorithm cannot correctly identify a text in that language.

I hope to add more languages to my database as time permits, and as I come across good language samples. If you have a good sample of text in a language not represented here, please send it to me and I will add it to the database. A "good" sample is approximately 8,000 to 10,000 characters long and represents "typical" language. "Typical" language would be a sample from a story or news article, not a list of soccer scores. I have found that passages from the Bible, particularly the Gospels, usually contain good samples of typical text.

Please contact me with any questions or suggestions.

Steve Huffman

GLBH@email.msn.com


Copyright (c) Stephen Huffman 2000 All rights reserved