Whether you want to catalog your mined public tweets or offer suggestions based on a user’s language preferences, Python can help detect the language of a given text with a little bit of hackery around the Natural Language Toolkit (NLTK).
Let’s get going by first installing NLTK and downloading some language data to make it useful. Just a note here: the NLTK is an incredible suite of tools that can act as your Swiss Army knife for almost any natural language processing job you might have, and we are just scratching the surface here.
Install the NLTK within a Python virtualenv.
    ~# mkdir nltk_test && cd nltk_test
    ~/nltk_test# virtualenv venv
    ~/nltk_test# . venv/bin/activate
    (venv)~/nltk_test# pip install nltk
Now we’re going to need some language data, hmm.
    (venv)~/nltk_test# venv/bin/python
    Python 2.7.3 (default, Aug 1 2012, 05:14:39)
    [GCC 4.6.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import nltk
    >>> nltk.download()
    NLTK Downloader
    ---------------------------------------------------------------------------
        d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
    ---------------------------------------------------------------------------
    Downloader>
Play around in the NLTK downloader interface for a while, particularly the list of available packages (enter l at the prompt), but all we really need to download are the punkt and stopwords packages.
Now we can finally start having some fun with a new script, detect_lang.py.
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    from sys import stdin  # how else should we get our input :)

    from nltk import wordpunct_tokenize  # function to split up our words
    from nltk.corpus import stopwords  # stopwords to detect language


    def get_language_likelihood(input_text):
        """Return a dictionary of languages and their likelihood of being the
        natural language of the input text
        """
        # Lowercase and tokenize once, keeping only the distinct words
        input_words = set(wordpunct_tokenize(input_text.lower()))

        language_likelihood = {}
        for language in stopwords.fileids():
            # Score = number of distinct stopwords of this language in the text
            language_likelihood[language] = len(
                input_words & set(stopwords.words(language)))

        return language_likelihood


    def get_language(input_text):
        """Return the most likely language of the given text"""
        likelihoods = get_language_likelihood(input_text)
        return max(likelihoods, key=likelihoods.get)


    if __name__ == '__main__':
        # Parenthesized print works under both Python 2 and Python 3
        print(get_language(stdin.read()))
Basically, what we’re doing above is counting how many words from each language’s stopword list also appear in our input text, and returning the language with the highest count.
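To make the idea concrete, here is a minimal sketch of the same overlap count that runs without NLTK. The tiny `SAMPLE_STOPWORDS` sets and the `overlap_scores` and `best_language` names are made up for illustration; the real script relies on NLTK's much larger stopword lists and its tokenizer.

```python
import re

# Hypothetical, hand-picked stopword sets for illustration only;
# NLTK's stopwords corpus is far larger and covers more languages.
SAMPLE_STOPWORDS = {
    "english": {"the", "is", "on", "and", "of", "a", "to"},
    "spanish": {"el", "la", "es", "y", "de", "en", "un"},
}


def overlap_scores(text, stopword_sets=SAMPLE_STOPWORDS):
    """Count how many distinct stopwords of each language occur in text."""
    # Simple stand-in for wordpunct_tokenize: lowercase runs of word characters.
    words = set(re.findall(r"\w+", text.lower()))
    return {lang: len(words & stops) for lang, stops in stopword_sets.items()}


def best_language(text, stopword_sets=SAMPLE_STOPWORDS):
    """Return the language whose stopword set overlaps the text the most."""
    scores = overlap_scores(text, stopword_sets)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(overlap_scores("The cat is on the table"))
    print(best_language("El gato es de la casa"))
```

Note that if the text contains no stopwords at all, every score is zero and `max` simply returns the first language it sees, so very short inputs deserve some suspicion.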
Let’s test it out!
    (venv)~/tmp/nltk_test# curl --compressed https://en.wikipedia.org/wiki/Barack_Obama 2> /dev/null | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | venv/bin/python detect_lang.py
    english
    (venv)~/tmp/nltk_test# curl --compressed https://es.wikipedia.org/wiki/Barack_Obama 2> /dev/null | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | venv/bin/python detect_lang.py
    spanish
Not too bad! We only stripped the HTML tags from the Wikipedia page for that, so some of the JavaScript is still in there and may throw off our detector, but this technique should work for most data. I found it works pretty well for detecting English tweets versus non-English tweets… more on that later.
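If the leftover JavaScript turns out to matter, one option is to drop the contents of script and style elements before detection. The following is only a sketch, assuming Python 3's standard `html.parser` module (in Python 2 the module is called `HTMLParser`); the `TextExtractor` class and `strip_html` function are hypothetical names, not part of NLTK or of the script above.

```python
from html.parser import HTMLParser  # Python 3; named HTMLParser in Python 2


class TextExtractor(HTMLParser):
    """Collect text content, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def get_text(self):
        # Join the pieces and collapse runs of whitespace
        return " ".join(" ".join(self._chunks).split())


def strip_html(html):
    """Return the text of an HTML document without script or style content."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()
```

With something like this in place, the `sed` step in the pipeline could be replaced by calling `strip_html` on the raw input before handing it to `get_language`.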
