Tuesday, November 30, 2010

How confident is Google about its language detection?

Google provides a very handy online tool for language detection -- http://www.google.com/uds/samples/language/detect.html. After you input something and hit the Detect Language button, the result is shown below telling you what language Google thinks it is, together with whether Google thinks the result is reliable or not, and how confident it is.

The confidence level is a number between 0 and 1, so a value of 0.08 means Google has a confidence level of 8%.
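As a minimal sketch of how such a result could be read, the snippet below turns a confidence value into a percentage. The dictionary keys (`language`, `isReliable`, `confidence`) and the sample values are assumptions made for illustration, not output captured from the tool.

```python
def describe(result):
    """Format a detection result as a one-line summary."""
    percent = result["confidence"] * 100  # confidence is a fraction between 0 and 1
    reliable = "reliable" if result["isReliable"] else "not reliable"
    return f"{result['language']}: {percent:.2f}% confident ({reliable})"


# Hypothetical result, for illustration only.
sample = {"language": "en", "isReliable": False, "confidence": 0.08}
summary = describe(sample)
print(summary)
```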

I played with it and found some interesting results. First, I tried the word bell. Google thought it was English with a confidence level of 4.75%. Then I tried Jingle bells. Surprisingly, the two English words together got a lower confidence level, 1.36%. That may be because jingle on its own is detected as English with a confidence level of only 0.27%, and bells (with an s) drops to 1.62%. Still, the confidence for Jingle bells comes out as that of bells minus (yes, minus, not plus) that of jingle: 1.62% - 0.27% = 1.35%, within rounding of the 1.36% reported.
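The subtraction can be checked quickly with the numbers quoted above. This is only arithmetic on the reported values; it says nothing about how Google actually computes the confidence.

```python
# Confidence levels reported in this post, in percent.
observed = {"bell": 4.75, "bells": 1.62, "jingle": 0.27, "jingle bells": 1.36}

# Compare "bells minus jingle" with the value measured for "Jingle bells".
difference = observed["bells"] - observed["jingle"]
gap = abs(difference - observed["jingle bells"])

print(f"bells - jingle = {difference:.2f}%")
print(f"gap to the measured value = {gap:.2f} percentage points")
```

The difference comes to about 1.35%, roughly 0.01 percentage points away from the measured 1.36%.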

Let us continue --
  • Jingle bells, jingle bells, (1.36%. Repetition does not increase the confidence.)
  • Jingle all the way; (32.91%)
  • Oh! what fun it is to ride (59.7%)
  • In a one-horse open sleigh. (34.12%. The confidence drops.)
  • the whole thing (Jingle bells, jingle bells, Jingle all the way; Oh! what fun it is to ride In a one-horse open sleigh.) is 81.57%.
So, here are the rules --
  • a plural form (or another inflected form) lowers the confidence;
  • more words may lower the confidence;
  • repetition does not increase the confidence;
  • your input history does not help Google to build up the confidence;
  • overconfidence is bad. Google is not 100% confident with any words, so Google is conservative.
