Mining the Public Tweet Stream for Fun and Profit

Even without firehose access, the Twitter streaming API is still useful and interesting; you just need to get your feet wet.

Because you should love Python, today we’ll focus on how to mine tweets from the Twitter streaming API using Python and Tweepy. Before we begin, you need to create a new application on Twitter and get your API keys and secrets ready to roll.

Got them? Good, let’s continue.

Oh, also: I’d like to assume you’re using virtualenv. If not, no worries… but please, you’re going to ruin your life.

So let’s set up our dumb little development environment.

Alrighty, so now we have Tweepy set up and we’re just about ready to get down to brass tacks. We’re going to be sucking down 1% of the Twitter feed via the sample streaming API. That may not sound like a lot, but it does add up, so let’s use sqlite to store it all.

Really though, the complication pretty much ends there. Tweepy is an amazing library that makes the next part pretty easy.

Before we get into our Python code below, let’s quickly create a table to hold all of the information we want to keep by entering our Python interpreter.
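As a rough sketch of what that table could look like, here is one way to create it with Python's built-in sqlite3 module. The table name tweets, the column names, and the helper create_tweet_table are assumptions made up for illustration; only the file name tweets.db comes from this post.

```python
import sqlite3

# Hypothetical schema: the table and column names are illustrative
# assumptions, not a required layout.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tweets (
    tweet_id     INTEGER PRIMARY KEY,
    created_at   TEXT,
    user_id      INTEGER,
    screen_name  TEXT,
    text         TEXT,
    coordinates  TEXT
)
"""


def create_tweet_table(path="tweets.db"):
    """Open (or create) the database at `path` and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn
```

Typing create_tweet_table().close() at the interpreter prompt would create tweets.db in the current directory. Because of IF NOT EXISTS, running it a second time leaves an existing table alone.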

Okay, cool. Now we have a healthy place to store our tweets. Feel free to tweak the table to capture whatever data you want; just be sure to modify the code below to match.
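How the stream gets wired up depends on which Tweepy version you have, so the sketch below only covers the storage half: taking one decoded tweet (a plain dict, as in Twitter's JSON) and writing it to sqlite. The names open_db, tweet_to_row and store_tweet, and the table layout, are hypothetical. The one thing worth copying is the coordinates handling: that field is a nested object rather than a string, and sqlite refuses to bind a dict, so this version stores it as JSON text.

```python
import json
import sqlite3


def open_db(path="tweets.db"):
    """Open the database and create the (hypothetical) tweets table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "tweet_id INTEGER PRIMARY KEY, created_at TEXT, user_id INTEGER, "
        "screen_name TEXT, text TEXT, coordinates TEXT)"
    )
    return conn


def tweet_to_row(tweet):
    """Flatten a decoded tweet dict into a tuple of values sqlite can bind."""
    user = tweet.get("user") or {}
    coords = tweet.get("coordinates")
    return (
        tweet.get("id"),
        tweet.get("created_at"),
        user.get("id"),
        user.get("screen_name"),
        tweet.get("text"),
        # coordinates is a nested object or None; serialize it to JSON text
        # instead of handing sqlite a dict.
        json.dumps(coords) if coords is not None else None,
    )


def store_tweet(conn, tweet):
    """Insert one tweet; a tweet_id that is already stored is skipped."""
    conn.execute(
        "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?, ?, ?)",
        tweet_to_row(tweet),
    )
    conn.commit()
```

A streaming callback would then just call store_tweet(conn, tweet) for each tweet it receives. Committing after every insert is simple rather than fast; batching commits is an easy tweak if the volume gets out of hand.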

So, let’s pretend we put that code into a file called tweet_gobbler.py and said okay, let’s go!

That’s it! Now, for as long as this script runs, it will keep updating the sqlite database tweets.db, and after a few days you will have tons and tons of tweets from all around the world. Lucky you!
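If you want to check on the haul while it grows, a count query is enough. This is a small sketch that assumes the table is called tweets; the helper name tweet_count is made up for illustration.

```python
import sqlite3


def tweet_count(conn):
    """Return the number of rows in the (assumed) tweets table."""
    return conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
```

From a second interpreter session, tweet_count(sqlite3.connect("tweets.db")) reports how many tweets have been saved so far.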

I set this up on an Amazon EC2 micro instance and let it run for a few days and pulled down about 400MiB of tweets, so it’s not a bad way to build that awesome dataset you’ve been craving.

Now, go save the world.

Oh, also, you can home in on specific users or tweets, if you like, using the streaming API’s filter functionality. Maybe more on that later; stick around.

  • Hywel Jones

    Very useful. Thanks. I had to delete the coordinates field though as I initially got ‘Error binding parameter 5 – probably unsupported type.’ I guess Twitter must have changed the detail of the coordinates since you wrote your code.