Mining the Public Tweet Stream for Fun and Profit

The Twitter streaming API is still useful and interesting for those without firehose access; you just need to get your feet wet.

Because you should love Python, today we’ll focus on how to mine tweets from the Twitter streaming API using Python and Tweepy. Before we begin, you need to create a new application on Twitter and get your API keys and secrets ready to roll.

Got them? Good, let’s continue.

Oh also, I’m going to assume you’re using virtualenv. If not, no worries… but please reconsider, because you’re going to ruin your life.

So let’s set up our dumb little development environment.
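The setup commands aren’t reproduced here, but they’d be something like the following (the virtualenv directory name is my own choice):

```shell
# Create and activate a fresh virtualenv, then install Tweepy into it.
virtualenv venv
source venv/bin/activate
pip install tweepy
```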

Alrighty, so now we have Tweepy set up and we’re just about ready to get down to brass tacks. We’re going to be sucking down 1% of the Twitter feed via the sample streaming API… while that may not sound like a lot, it adds up, so let’s use SQLite to handle storage.

Really though, the complication pretty much ends there. Tweepy is an amazing library that makes the next part pretty easy.

Before we get into our Python code below, let’s quickly create a table to hold all of the information we want to keep by firing up the Python interpreter.
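The original interpreter session isn’t shown here, but a minimal sketch using Python’s built-in sqlite3 module might look like this (the column names are my own assumption):

```python
import sqlite3

# Open (or create) the database and add a table for captured tweets.
# The columns here are an assumption; keep whatever fields you care about.
conn = sqlite3.connect("tweets.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS tweets (
        id          INTEGER PRIMARY KEY,
        tweet_id    TEXT,
        screen_name TEXT,
        text        TEXT,
        created_at  TEXT
    )
""")
conn.commit()
conn.close()
```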

Okay, cool. Now we have a healthy place to store our tweets. Feel free to tweak that schema to capture whatever data you want; just be sure to modify the code below to match.

So, let’s put that code into a file and say okay, let’s go!
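The script itself didn’t survive here, but a sketch of what it might look like follows, assuming the classic (pre-4.0) Tweepy streaming API and the table from earlier; the function names and file names are my own:

```python
import sqlite3

def save_tweet(db_path, tweet_id, screen_name, text, created_at):
    """Append one tweet to the tweets table, creating it if needed."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "id INTEGER PRIMARY KEY, tweet_id TEXT, screen_name TEXT, "
        "text TEXT, created_at TEXT)"
    )
    conn.execute(
        "INSERT INTO tweets (tweet_id, screen_name, text, created_at) "
        "VALUES (?, ?, ?, ?)",
        (tweet_id, screen_name, text, created_at),
    )
    conn.commit()
    conn.close()

def run_stream(consumer_key, consumer_secret, access_token, access_secret):
    """Consume the ~1% public sample stream, saving each status as it arrives."""
    import tweepy  # assumes Tweepy 3.x, which provides StreamListener

    class StreamSaver(tweepy.StreamListener):
        def on_status(self, status):
            save_tweet("tweets.db", status.id_str,
                       status.author.screen_name, status.text,
                       str(status.created_at))

        def on_error(self, status_code):
            # Returning False disconnects the stream on errors (e.g. 420).
            return False

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    tweepy.Stream(auth, StreamSaver()).sample()
```

Calling run_stream() with your four credentials blocks forever and steadily fills tweets.db.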

That’s it! While this script runs, it will keep updating the SQLite database tweets.db, and after a few days you will have tons and tons of tweets from all around the world. Lucky you!

I set this up on an Amazon EC2 micro instance and let it run for a few days and pulled down about 400MiB of tweets, so it’s not a bad way to build that awesome dataset you’ve been craving.

Now, go save the world.

Oh, also, you can home in on specific users or keywords if you like using the streaming API’s filter functionality. Maybe more on that later; stick around.

Detecting Language with Python and the Natural Language Toolkit (NLTK)

Whether you want to catalog your mined public tweets or offer suggestions based on users’ language preferences, Python can detect a given language with a little bit of hackery around the Natural Language Toolkit (NLTK).

Let’s get going by first installing NLTK and downloading some language data to make it useful. Just a note here: the NLTK is an incredible suite of tools that can act as your Swiss Army knife for almost any natural language processing job you might have; we are just scratching the surface here.

Install the NLTK within a Python virtualenv.
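Inside an activated virtualenv, that’s simply:

```shell
pip install nltk
```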

Now we’re going to need some language data, hmm.
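The data comes via NLTK’s downloader. You can launch the interactive version, or fetch the two packages we need directly:

```shell
# Interactive downloader interface:
python -c "import nltk; nltk.download()"

# Or, non-interactively, grab just what we need:
python -m nltk.downloader punkt stopwords
```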

Play around in the NLTK downloader interface for a while, particularly the list of available packages (by entering l), but all we really need to download are the punkt and stopwords packages.

Now we can finally start having some fun with a new script.

Basically, what we’re doing above is seeing which language’s stopword list shares the most words with our input text and returning that language.
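The original script isn’t reproduced here, but the approach can be sketched as follows. score_languages is pure Python, while detect_language wires in NLTK’s stopword corpora; both function names are my own:

```python
def score_languages(tokens, stopword_sets):
    """Count, per language, how many of its stopwords appear in the tokens."""
    words = {t.lower() for t in tokens}
    return {lang: len(words & sw) for lang, sw in stopword_sets.items()}

def detect_language(text):
    """Guess the language of `text` using NLTK's stopword lists.

    Requires the `punkt` and `stopwords` packages downloaded earlier.
    """
    from nltk import wordpunct_tokenize
    from nltk.corpus import stopwords

    tokens = wordpunct_tokenize(text)
    stopword_sets = {lang: set(stopwords.words(lang))
                     for lang in stopwords.fileids()}
    scores = score_languages(tokens, stopword_sets)
    # The language whose stopword list overlaps the input the most wins.
    return max(scores, key=scores.get)
```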

Let’s test it out!

Not too bad! We tried to strip out most of the HTML from a Wikipedia page for that test, so some of the JavaScript calls are still present and may throw off our detector, but this technique should work for most data. I found it works pretty well for separating English tweets from non-English tweets… more on that later.

Manage your todo list on any device with todo.txt

While there are many options for managing your life with software, few solutions are truly portable and future proof. For years, I have found that Gina Trapani’s todo.txt shell script and the accompanying Android or iOS apps comfortably fit into a lightweight task management system.

You can install and configure it on your system by following the online installation guide.

However, if you want to set up your system so that you can sync it to your phone through the Android or iOS app, you will first need to configure Dropbox on your machine.

Once you have configured Dropbox on your system, you should have a folder that will remain continuously synced via the Dropbox daemon (default Dropbox folder location is ~/Dropbox). It is in this directory that you will need to configure a few things to get everything synced up properly.

There are several clever ways that you can configure this, but I will cover here how I have chosen to do this.


Because I like to keep track of much of my code on GitHub, I first forked the todo.txt-cli repository and cloned it to my local machine, but we will use the example of cloning from the project repository.
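The clone would look something like this (substitute your fork’s URL if you made one; the ~/.todo_src location is my own choice):

```shell
git clone https://github.com/todotxt/todo.txt-cli.git ~/.todo_src
```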

Once we have the repository cloned, we need to link the todo.sh script to wherever we like to keep our scripts.
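Assuming the clone lives in ~/.todo_src and ~/bin is on your PATH, the link might be:

```shell
ln -s ~/.todo_src/todo.sh ~/bin/todo
```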

One note here: you can choose to shorten the command to simply t rather than the full todo, to make accessing the CLI easier with less typing.

Configuring a syncing todo.txt directory

Next, we need to create a directory for our todo lists! I chose to create a hidden directory within my home directory, but you may choose wherever you wish.

It is important to note, however, that the todo directory needs to exist first in your Dropbox folder, so we will create this one first.
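Creating the Dropbox-side directory first:

```shell
mkdir -p ~/Dropbox/todo
```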

We can then symlink our configured todo.txt location to the Dropbox location.
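Assuming ~/.todo as the hidden directory mentioned above:

```shell
ln -s ~/Dropbox/todo ~/.todo
```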

Finally, we need to point our configuration to the correct place. To do this, you need to open your ~/.todo/todo.cfg file (which is now linked to ~/.todo_src/todo.cfg) in your favorite text editor and locate and modify the following line to point to ~/.todo:
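In a stock todo.cfg the line in question looks something like this (the exact default varies by version):

```shell
export TODO_DIR=$(dirname "$0")
```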

Should be changed to something like:
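Assuming the ~/.todo location used above:

```shell
export TODO_DIR="$HOME/.todo"
```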

Additionally, you should comment out any other definitions of the TODO_DIR environment variable.

Once this has been configured, as long as your Dropbox daemon is running, you should see files populate the Dropbox/todo directory online once you begin to use your new configuration.

Once that is complete, you can install and configure the todo.txt application for your mobile device and point it at the todo folder within your Dropbox. You should then see your todo list pop into your mobile application!

You can find additional help at the todo.txt website, best of luck.

Configuring multiple Flask sites with uWSGI and nginx on an Amazon EC2 Instance

For the lazy, we have a shell script that will do all this for you on GitHub!

While configuring our new Amazon EC2 instance, we found it a little too cumbersome, and somewhat poorly documented, to set up Flask with uWSGI behind nginx. There are a few great writeups on how to configure this, but they fell short of producing a working configuration on Amazon’s Linux AMI, which we will try to describe here.

After creating a new EC2 instance and logging in, we will need to install nginx, uWSGI and required libraries.
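On the Amazon Linux AMI that’s roughly the following; the exact package names are an assumption and may vary with the AMI version:

```shell
sudo yum install -y nginx gcc python-devel python-pip
sudo pip install uwsgi virtualenv
```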

Now we need a Flask project! Let’s just grab the latest simple repo from GitHub:
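Something like this, substituting the URL of the simple repository or your fork:

```shell
git clone https://github.com/<user>/simple.git
cd simple
```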

Alternatively, you can fork the project on GitHub and pull your fork instead; this is probably the desirable option.

We will quickly set up a virtualenv in our new simple repository.
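Assuming the repo ships a requirements.txt:

```shell
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
```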

Now let’s give it a test, shall we?

Looking good, let’s move it real quick to where we want it and fix the permissions to our liking.

Next, let’s configure uWSGI. Following Tyler’s writeup, let’s make some directories.
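The directories for the emperor’s vassal configs and its log; these locations are assumptions that the later config files must match:

```shell
sudo mkdir -p /etc/uwsgi/sites-available /etc/uwsgi/sites-enabled
sudo mkdir -p /var/log/uwsgi
```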

Now let’s create the uWSGI configuration file with `sudo vim /etc/init/uwsgi.conf` and give it the following content:
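A minimal Upstart job for the emperor might look like this; the uwsgi binary path is an assumption, so check yours with `which uwsgi`:

```
description "uWSGI emperor"

start on runlevel [2345]
stop on runlevel [06]

exec /usr/local/bin/uwsgi --emperor /etc/uwsgi/sites-enabled \
    --uid nginx --gid nginx \
    --logto /var/log/uwsgi/emperor.log
```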

A few things to note here: we are setting up the general uWSGI emperor log and telling it to run as the nginx user for permission purposes.

With uWSGI configured, we can start it up with `sudo start uwsgi`.

Now we can begin to configure our simple fork, which runs this blog, by creating a new file:

And we will configure it with the following content:
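A sketch of the vassal configuration; every path below, and the app:app module spec, is an assumption about where your checkout and its virtualenv live:

```ini
# /etc/uwsgi/sites-available/simple.ini (assumed location)
[uwsgi]
chdir = /var/www/simple
home = /var/www/simple/venv
module = app:app
socket = /var/www/simple/uwsgi.sock
chmod-socket = 664
```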

Now we need to link our configuration file to the enabled sites folder:
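Assuming the vassal file is named simple.ini:

```shell
sudo ln -s /etc/uwsgi/sites-available/simple.ini /etc/uwsgi/sites-enabled/simple.ini
```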

Finally, we can configure nginx to serve up our new blog.

And we will configure nginx to serve up content from our uWSGI emperor like so:
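A minimal server block; the domain and socket path are assumptions that must match your vassal config:

```nginx
server {
    listen 80;
    server_name example.com;

    location / {
        include uwsgi_params;
        uwsgi_pass unix:/var/www/simple/uwsgi.sock;
    }
}
```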

Save the file and let’s fire up nginx, we are ready for launch!
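Starting nginx (and making it survive reboots) on Amazon Linux:

```shell
sudo service nginx start
sudo chkconfig nginx on
```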

Hope that you have found this post helpful. Later we hope to describe how to automate this on Amazon EC2 for automatic scaling of your server fleet.