Languator: Language Detection for Local Queries

When you have an address lookup service like photon you need language detection for the short local queries like “Berlin Alexanderplatz” to e.g. apply a specific analysis chain. Of course German language is rather simple, but detecting if it is French or Italian etc is much more complex.

I tried the language-detection from google code. And although it has high precision for normal text, it is relative slow (creating the profile and querying) and not specific to local queries although short generic queries should work (e.g. Twitter data). Another problem is the complex and inflexible Java API as it uses lots of statics etc. E.g. the size of the n-grams is not configurable.

So I hacked together a new language detection called languator in Java specific for short local queries and fed it with OpenStreetMap data. E.g. I used the germany.pbf to create the German language profile and france.pbf for the French profile and so on. If you keep in mind that you have lots of foreign names for POI and street in every country the following error values are okayish and better than what I found from other tools:

English: 22.9%, German: 7%, French: 20%, Italian: 18.8%

Another good thing is that you can feed languator with these 4 countries in 20 minutes and create millions of queries from those countries which takes ~10 minutes, which is at least 2 times faster than the google code tool.

I invite you to beat my numbers: fork and have fun!

Advertisements