Varun Ramakrishna pens a guest post on what it would take to get an Indian Babel fish.
Wordlens is an impressive translation app, recently acquired by Google, and a good step towards Douglas Adams' fantastical Babel fish: point your smartphone at a street sign in another language, and a translation appears on your phone's screen, superimposed on the sign, as if the sign had always been written in the translated language. Currently, the app only translates between a few European languages.
In a nation as diverse as India where so many languages are spoken and written, often in their own unique scripts, we cannot fail to ask the question: what would it take for us to get a Wordlens for Indian languages?
The computer vision algorithms used in Wordlens are mostly well understood and fairly straightforward. The real technical achievement is getting these algorithms to run in real-time on the phone's processor. The processing involves detecting the region of the image that contains text, unwarping it, binarizing the text image to normalize away color, background and illumination, and separating out each character.
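The last two of those steps can be sketched in a few lines. This is a minimal NumPy-only illustration, assuming Otsu thresholding for binarization and a column-projection split for character separation, both textbook choices; Wordlens' actual pipeline has not been published:

```python
import numpy as np

def otsu_threshold(gray):
    """Pick the threshold that maximizes between-class variance (Otsu's method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = gray.size
    sum_all = float(np.dot(np.arange(256), hist))
    w0 = sum0 = 0.0
    best_t, best_var = 0, 0.0
    for t in range(256):
        w0 += hist[t]           # pixels in the "dark" class so far
        if w0 == 0:
            continue
        w1 = total - w0         # pixels in the "light" class
        if w1 == 0:
            break
        sum0 += t * hist[t]
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def segment_characters(binary):
    """Split a binarized text line into per-character column spans by
    looking for empty (ink-free) columns between glyphs."""
    ink = binary.sum(axis=0)
    spans, start = [], None
    for x, v in enumerate(ink):
        if v and start is None:
            start = x
        elif not v and start is not None:
            spans.append((start, x))
            start = None
    if start is not None:
        spans.append((start, binary.shape[1]))
    return spans

# Toy example: light background (200) with two dark 3-column "characters".
gray = np.full((10, 12), 200, dtype=np.uint8)
gray[2:8, 1:4] = 30
gray[2:8, 7:10] = 30
t = otsu_threshold(gray)
binary = (gray <= t).astype(int)   # dark pixels = text
print(segment_characters(binary))  # [(1, 4), (7, 10)]
```

Real signboard images need the adaptive, illumination-aware variants of these steps that the article alludes to; a single global threshold only works here because the toy image is evenly lit.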
Once the text has been reduced to a standard form, what remains is to run an OCR (Optical Character Recognition) engine. The OCR engine does exactly what its name says: it examines a pattern and identifies any language characters that are present. An answer on Stack Overflow by a Wordlens engineer suggests they use an OCR engine developed in-house rather than a commercially available or open source implementation. This is probably because commercial and open source engines have been designed and trained to work on documents imaged in controlled conditions, while OCR "in the wild" (such as reading signboards on the fly) would require some modification.
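At its core, an OCR engine is a pattern classifier. Here is a deliberately toy illustration of the idea, with hypothetical hand-made templates; a real engine learns its character models from thousands of labelled samples and copes with far more variation in shape and font:

```python
import numpy as np

# Hypothetical 5x4 binary glyph templates; a real OCR engine learns its
# models from large sets of labelled character images.
TEMPLATES = {
    "I": np.array([[0, 1, 1, 0]] * 5),
    "L": np.array([[1, 0, 0, 0]] * 4 + [[1, 1, 1, 1]]),
}

def classify(glyph):
    """Nearest-template matching: return the label whose template differs
    from the input glyph in the fewest pixels."""
    return min(TEMPLATES, key=lambda c: int(np.sum(TEMPLATES[c] != glyph)))

# A noisy "L" (one pixel flipped) still matches the "L" template.
noisy = TEMPLATES["L"].copy()
noisy[0, 1] = 1
print(classify(noisy))  # L
```

Template matching of this sort breaks down quickly on varied fonts and noisy captures, which is exactly why production engines use trained statistical models instead.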
With the multitude of Indian languages and scripts, it seems like an app like this would find lots of users in India. So can this be done for Indian languages?
In principle, yes. The processing required is common across all languages except for the OCR. While OCR has been around for a while and commercial OCR engines achieve high accuracy on Latin-script languages, there are strikingly few good OCR engines for Indic languages. The gold-standard open source OCR engine, Tesseract, originally developed at HP Labs but now under Google's patronage (a version is used in the Google Books project), supports 34 languages but no Indian language. Although there are many academic publications on Indic OCR, without a robust, usable implementation that work is unlikely to reach the masses.
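Part of what makes Indic OCR harder is the scripts themselves. In Devanagari, for instance, the characters of a word hang from a connecting headline (the shirorekha), so the naive column-gap segmentation that works for Latin print sees the whole word as one glyph. A common first step in the academic literature is to remove the headline before segmenting; a crude sketch, assuming the headline is simply the row(s) with near-full ink coverage:

```python
import numpy as np

def strip_headline(binary):
    """Zero out rows with near-full ink coverage, a rough stand-in for
    shirorekha (headline) removal in Devanagari. Real systems locate the
    headline more carefully, e.g. via the peak of the row projection."""
    row_ink = binary.sum(axis=1)
    out = binary.copy()
    out[row_ink >= 0.9 * binary.shape[1], :] = 0
    return out

# Two glyphs joined by a solid top "headline" row.
word = np.zeros((6, 10), dtype=int)
word[0, :] = 1          # headline connects everything
word[1:5, 2:4] = 1      # glyph 1
word[1:5, 6:8] = 1      # glyph 2
stripped = strip_headline(word)
# Before: no empty column separates the glyphs; after: column 5 is empty.
print(word.sum(axis=0)[5], stripped.sum(axis=0)[5])  # 1 0
```

Conjunct consonants, vowel signs above and below the baseline, and the sheer number of distinct glyph shapes add further complications on top of this.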
Our best hope at this point is probably Google. If they decide to extend the Google Books project to Indian-language books, they will need to invest in improving Indic OCR, and that work could then be ported readily to an app like Wordlens.