The Acquaintance algorithm uses a statistical approach to measure similarity among documents. The statistics are based on "n-grams," which are simply strings of n consecutive characters in a document. The n-gram distributions of a pair of documents are compared, and a score is computed that represents the similarity of those documents.
For each document in the reference database, Acquaintance creates a vector which records how many of each n-gram the document contains. The vector is normalized by dividing the count of each n-gram by the total number of n-grams in the document.
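The counting and normalizing step can be sketched in a few lines of Python. This is only an illustration, not the Acquaintance code itself: the function name `ngram_vector`, the sparse dictionary representation, and the default n-gram length are all assumptions made for the example.

```python
from collections import Counter

def ngram_vector(text, n=3):
    """Return {n-gram: relative frequency} over every n-character window.

    Hypothetical helper; the n-gram length used by Acquaintance is not
    stated here, so n=3 is only a placeholder default.
    """
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    if total == 0:
        # Text shorter than n characters yields no n-grams at all.
        return {}
    # Normalize: divide each count by the total number of n-grams.
    return {gram: count / total for gram, count in counts.items()}

# "banana" contains five 2-grams: ba, an, na, an, na
vec = ngram_vector("banana", n=2)
```

Because each count is divided by the total, the entries of a vector always sum to 1, so a long document and a short one end up on the same scale.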
While creating individual document vectors, the algorithm also creates a "centroid" vector which is essentially the average of all the normalized document vectors in the corpus of documents. This centroid provides the context within which document similarity is judged.
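As a rough sketch, the centroid can be computed as the component-wise mean of the normalized vectors. The text says only that the centroid is "essentially" the average, so the plain arithmetic mean below is an assumption, as are the function name `centroid` and the dictionary representation of the vectors.

```python
from collections import defaultdict

def centroid(vectors):
    """Average a list of sparse {n-gram: frequency} vectors component-wise.

    Hypothetical helper: an n-gram missing from a document counts as 0
    for that document.
    """
    if not vectors:
        return {}
    totals = defaultdict(float)
    for vec in vectors:
        for gram, freq in vec.items():
            totals[gram] += freq
    return {gram: total / len(vectors) for gram, total in totals.items()}

# Two toy "documents" that are already normalized
docs = [{"ab": 0.5, "bc": 0.5}, {"ab": 1.0}]
center = centroid(docs)
```

N-grams that are common to every document get a large centroid value, which is what lets the later subtraction step discount them.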
When a sample of text is submitted to Acquaintance, a vector of normalized n-gram counts is created from that sample just as with the reference vectors. The cosine of the angle between the sample text vector and each of the reference text vectors is computed (after subtracting the centroid from each). The higher the computed value, the greater the similarity between documents.
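The scoring step might look like the following minimal sketch. It assumes the vectors are sparse dictionaries of normalized n-gram frequencies; the name `centered_cosine` and the choice to return 0.0 when a vector coincides with the centroid are assumptions for illustration, not details taken from Acquaintance.

```python
import math

def centered_cosine(sample, reference, center):
    """Cosine of the angle between (sample - center) and (reference - center)."""
    grams = set(sample) | set(reference) | set(center)
    dot = norm_s = norm_r = 0.0
    for gram in grams:
        # Subtract the centroid component before comparing.
        s = sample.get(gram, 0.0) - center.get(gram, 0.0)
        r = reference.get(gram, 0.0) - center.get(gram, 0.0)
        dot += s * r
        norm_s += s * s
        norm_r += r * r
    if norm_s == 0.0 or norm_r == 0.0:
        # Assumption: a vector identical to the centroid carries no signal.
        return 0.0
    return dot / math.sqrt(norm_s * norm_r)

# Toy data: two reference "languages" and their centroid
center = {"ab": 0.5, "bc": 0.5}
ref_one = {"ab": 1.0}
ref_two = {"bc": 1.0}
sample = {"ab": 0.8, "bc": 0.2}

score_one = centered_cosine(sample, ref_one, center)
score_two = centered_cosine(sample, ref_two, center)
```

In this toy case the sample leans the same way as `ref_one` relative to the centroid, so `score_one` comes out close to 1.0, while `score_two` comes out close to -1.0.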
For a detailed explanation of the algorithm, see Marc Damashek's paper: Damashek M., "Gauging Similarity with n-Grams: Language-Independent Categorization of Text," Science, 10 February 1995, Vol. 267, pp. 843-848.
The scores range from 1.0, which means that the two documents are identical (the angle between them is 0 degrees, so its cosine is 1), to -1.0, which means that the two documents are utterly dissimilar (they are 180 degrees apart in space, as viewed from the origin). For documents at these extremes, interpretation of the scores is easy; in actual practice, you need to consider both the size of the sample text and the range of scores generated by Acquaintance.
However, in general, if
you can be fairly confident that the sample language you submitted was correctly identified. (If it happens that all the above conditions were met, and the language was not correctly identified, please save the sample and let me know!)
The size of your sample is important. Though it is sometimes possible to correctly identify the sample language in as few as a dozen characters, it is not recommended that you rely on such very short samples. Remember, this is a statistical approach, and it needs enough data to make meaningful comparisons. For reliable results, your sample should be at least 150 to 200 characters long (though if you just don't have a sample that large, try it anyway; with a sample of at least 50 characters, the result will probably be in the correct language family).
For very small samples, your scores will also be very small. For instance, if your sample is only 40 characters long, the best match might have a score of 0.025, with the next best 0.009. In such cases a gauge of the reliability of the results is the spread between the best and the next best score. If the best score is nearly twice that of the next best, it may well be the correct answer. If the top few scores are all very close (and are not closely related languages), then the results should be viewed with a great deal of suspicion. Of course if the top few scores are very close, and all are closely related dialects, then it is quite likely one of those languages is correct, though not necessarily the highest scoring one.
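The spread check described above can be sketched as a ratio between the top two scores. The function name `top_two_spread` and the language labels are hypothetical, and where to draw the "nearly twice" line is left to the reader; only the 0.025 and 0.009 scores come from the example in the text.

```python
def top_two_spread(scores):
    """Return (best_label, best_score / runner_up_score) for a score table."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to measure a spread")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_label, best), (_, runner_up) = ranked[0], ranked[1]
    if runner_up <= 0:
        # Assumption: a non-positive runner-up means the best score stands
        # alone if it is positive, and means nothing useful otherwise.
        ratio = float("inf") if best > 0 else 0.0
    else:
        ratio = best / runner_up
    return best_label, ratio

# Scores for a 40-character sample; the labels are placeholders
scores = {"language_a": 0.025, "language_b": 0.009, "language_c": 0.004}
label, ratio = top_two_spread(scores)
```

Here the ratio is a little under 2.8, comfortably past the "nearly twice" mark, so `language_a` would be a plausible answer despite the tiny absolute scores. A ratio close to 1.0 among unrelated languages would be the warning sign.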
If possible, try to choose your sample text to reflect what seems to be "typical" of the unknown language. By that I mean don't choose a section that is full of "international" words and names like 'computer', 'president', 'IBM', 'Paris', and so forth. Also, don't use text that has a high proportion of numbers and special characters. I suspect it would be possible to construct a very short text sample using only such terms and characters, for which it would be nearly impossible even for a human to determine the correct underlying language.
If the best score is at least 0.25, you can have a high degree of confidence that the algorithm has correctly identified the language of the sample text. Exceptions sometimes occur if the best scoring language is one of several very closely related languages in the database. In that case, your sample could score very well against all the highly related languages, and the highest scoring language may not be the correct one.
One of the nice things about this algorithm is that it tends to fail gracefully. That is, if your
it is likely that IF the highest scoring language is not correct, it is at least related to the correct language.
For instance, I do not currently have a sample of Shona in my database, but if you submit a sample of Shona, the best scoring language returned will be Swahili. Shona and Swahili are both Bantu languages, so at least the algorithm will put you in the right linguistic ballpark.
Please contact me with any questions or suggestions.
Steve Huffman
GLBH@email.msn.com
Copyright (c) Stephen Huffman 2000 All rights reserved