The Twitter streaming API is still useful and interesting for those without firehose access; you just need to get your feet wet.
Because you should love Python, today we’ll focus on how to mine tweets from the Twitter streaming API using Python and Tweepy. Before we begin, you need to create a new application on Twitter and get your API keys and secrets ready to roll.
Got them? Good, let’s continue.
Oh, also, I’m going to assume you’re using virtualenv. If not, no worries… but please, you’re going to ruin your life.
So let’s set up our dumb little development environment.
~# mkdir twitter_miner && cd twitter_miner
~/twitter_miner# virtualenv venv
~/twitter_miner# echo "Hurray!"
~/twitter_miner# . venv/bin/activate
(venv)~/twitter_miner# pip install tweepy
Alrighty, so now we have Tweepy set up and we’re just about ready to get down to brass tacks. We’re going to be sucking down 1% of the Twitter feed via the sample streaming API. While that may not sound like a lot, it does add up, so let’s use sqlite to handle the task.
Really though, the complication pretty much ends there. Tweepy is an amazing library that makes the next part pretty easy.
Before we get into our Python code below, let’s quickly create a table to hold all of the information we want to keep by entering our Python interpreter.
Python 2.6.8 (unknown, Sep 17 2012, 03:13:50)
[GCC 4.6.2 20111027 (Red Hat 4.6.2-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlite3
>>> conn = sqlite3.connect('tweets.db')
>>> curs = conn.cursor()
>>> curs.execute("CREATE TABLE tweets (tid integer, username text, created_at text, content text, reply_to text, coordinates text, source text)")
<sqlite3.Cursor object at 0x7fabc9f63e48>
Okay, cool. Now we have a healthy place to store our tweets. Feel free to tweak that table to capture whatever data you want, just be sure to modify the code below to match.
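If you'd rather not retype that in the interpreter every time, the same table can be created from a script. Here's a minimal sketch using only the standard library's sqlite3 module. The `init_db` helper name and the `IF NOT EXISTS` guard are my own additions for illustration, so treat them as assumptions rather than part of the setup above.

```python
import sqlite3

# Same columns as the table created in the interpreter session
SCHEMA = """
CREATE TABLE IF NOT EXISTS tweets (
    tid integer,
    username text,
    created_at text,
    content text,
    reply_to text,
    coordinates text,
    source text
)
"""


def init_db(path='tweets.db'):
    """Open (or create) the database at `path` and ensure the tweets table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


# ':memory:' keeps this demo from touching a real tweets.db
demo = init_db(':memory:')
print([row[1] for row in demo.execute("PRAGMA table_info(tweets)")])
demo.close()
```

Because of the `IF NOT EXISTS` guard, calling `init_db` again on an existing database should be harmless.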
# -*- coding: utf-8 -*-
import json
import sqlite3

# twitter client (written against the older Tweepy releases that provide
# StreamListener; it was removed in Tweepy 4)
import tweepy

# database interface
conn = sqlite3.connect('tweets.db')
curs = conn.cursor()


class StreamWatcherHandler(tweepy.StreamListener):
    """Handles all incoming tweets as discrete tweet objects."""

    def on_status(self, status):
        """Called when a status (tweet) object is received.

        See Tweepy's Status model for the available attributes.
        """
        try:
            tid = status.id_str
            usr = status.author.screen_name.strip()
            txt = status.text.strip()
            in_reply_to = status.in_reply_to_status_id
            # coordinates arrives as a dict (or None), which sqlite cannot
            # bind directly, so store it as JSON text
            coord = json.dumps(status.coordinates) if status.coordinates else None
            src = status.source.strip()
            cat = str(status.created_at)

            # Now that we have our tweet information, let's stow it away in our
            # sqlite database
            curs.execute("insert into tweets (tid, username, created_at, "
                         "content, reply_to, coordinates, source) "
                         "values(?, ?, ?, ?, ?, ?, ?)",
                         (tid, usr, cat, txt, in_reply_to, coord, src))
            conn.commit()
        except Exception as e:
            # Most errors we're going to see relate to the handling of UTF-8 messages (sorry)
            print(e)

    def on_error(self, status_code):
        print('An error has occurred! Status code = %s' % status_code)


def main():
    # establish stream
    consumer_key = "CONSUMER_KEY"
    consumer_secret = "CONSUMER_SECRET"
    auth1 = tweepy.auth.OAuthHandler(consumer_key, consumer_secret)

    access_token = "ACCESS_TOKEN"
    access_token_secret = "ACCESS_TOKEN_SECRET"
    auth1.set_access_token(access_token, access_token_secret)

    print("Establishing stream...")
    stream = tweepy.Stream(auth1, StreamWatcherHandler(), timeout=None)

    # Start pulling our sample streaming API from Twitter to be handled by StreamWatcherHandler
    stream.sample()


if __name__ == '__main__':
    try:
        main()
    except KeyboardInterrupt:
        # Ctrl-C lands here: flush anything pending and close up shop
        print("Disconnecting from database... ")
        conn.commit()
        conn.close()
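One thing worth trying before you point this at the live stream: sqlite3 only binds simple values (numbers, strings, None), so anything dict-shaped, like a tweet's coordinates, has to be serialized first. Below is a hedged sketch of that flattening step that runs without Tweepy or a network connection. `FakeStatus`, `FakeAuthor`, and `status_to_row` are hypothetical names I made up for the demo, and the sample values are invented.

```python
import json
import sqlite3
from collections import namedtuple
from datetime import datetime

# Hypothetical stand-ins for the status object Tweepy passes to on_status
FakeAuthor = namedtuple('FakeAuthor', 'screen_name')
FakeStatus = namedtuple(
    'FakeStatus',
    'id_str author text in_reply_to_status_id coordinates source created_at')


def status_to_row(status):
    """Flatten a status-like object into a tuple matching the tweets table."""
    # Dicts cannot be bound as SQL parameters, so store coordinates as JSON text
    coord = json.dumps(status.coordinates) if status.coordinates else None
    return (
        status.id_str,
        status.author.screen_name.strip(),
        status.created_at.isoformat(),
        status.text.strip(),
        status.in_reply_to_status_id,
        coord,
        status.source.strip(),
    )


conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE tweets (tid integer, username text, created_at text, "
             "content text, reply_to text, coordinates text, source text)")

# Invented sample data, not a real tweet
fake = FakeStatus(
    id_str='1',
    author=FakeAuthor(' example_user '),
    text=' hello world ',
    in_reply_to_status_id=None,
    coordinates={'type': 'Point', 'coordinates': [-122.4, 37.8]},
    source='web',
    created_at=datetime(2013, 1, 1, 12, 0, 0),
)

conn.execute("INSERT INTO tweets (tid, username, created_at, content, reply_to, "
             "coordinates, source) VALUES (?, ?, ?, ?, ?, ?, ?)",
             status_to_row(fake))
conn.commit()
print(conn.execute("SELECT tid, username, coordinates FROM tweets").fetchone())
```

Reading the coordinates back is then just a `json.loads` on that column.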
So, let’s pretend we put that code into a file called tweet_gobbler.py and said okay, let’s go!
~/twitter_miner# venv/bin/python tweet_gobbler.py
That’s it! For as long as this script runs, it will keep updating the sqlite database tweets.db, and after a few days you will have tons and tons of tweets from all around the world. Lucky you!
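Once tweets.db has some rows in it, you'll probably want to poke at them. Here's a small sketch, again using only sqlite3, that counts the stored tweets and lists the busiest accounts. The `top_users` helper is a name I invented, and the demo rows are made up so the example can run against an in-memory database instead of your real one.

```python
import sqlite3


def top_users(conn, limit=5):
    """Return (username, tweet_count) pairs, busiest accounts first."""
    query = ("SELECT username, COUNT(*) AS n FROM tweets "
             "GROUP BY username ORDER BY n DESC, username ASC LIMIT ?")
    return conn.execute(query, (limit,)).fetchall()


# Demo against an in-memory database with invented rows
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE tweets (tid integer, username text, created_at text, "
             "content text, reply_to text, coordinates text, source text)")
rows = [(1, 'alice', 'first'), (2, 'bob', 'hi'), (3, 'alice', 'second')]
conn.executemany(
    "INSERT INTO tweets (tid, username, content) VALUES (?, ?, ?)", rows)
conn.commit()

total = conn.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
print(total)            # 3
print(top_users(conn))  # [('alice', 2), ('bob', 1)]
```

To run it on the real thing, swap `':memory:'` for `'tweets.db'` and drop the demo inserts.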
I set this up on an Amazon EC2 micro instance and let it run for a few days and pulled down about 400MiB of tweets, so it’s not a bad way to build that awesome dataset you’ve been craving.
Now, go save the world.
Oh, also, you can home in on specific users or tweets if you like, using the streaming API’s filter functionality. Maybe more on that later, so stick around.
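In the meantime, here's a rough sketch of what that could look like. As far as I recall, Tweepy's stream object has a `filter` method that takes a `track` list of keywords and a `follow` list of user IDs given as strings, but check the docs for your installed version before relying on it. The `filter_kwargs` helper below is purely hypothetical; it only builds the arguments, so it runs without Tweepy or a connection.

```python
def filter_kwargs(keywords=None, user_ids=None):
    """Build keyword arguments intended for stream.filter() (assumed signature)."""
    kwargs = {}
    if keywords:
        kwargs['track'] = [str(k) for k in keywords]
    if user_ids:
        # The streaming API follows numeric user IDs, passed here as strings
        kwargs['follow'] = [str(u) for u in user_ids]
    if not kwargs:
        raise ValueError("give at least one keyword or user ID")
    return kwargs


# Invented example values
print(filter_kwargs(keywords=['python', 'sqlite'], user_ids=[123456]))

# With the stream object from tweet_gobbler.py, this call would stand in
# for stream.sample() (an assumption, not tested against a live stream):
# stream.filter(**filter_kwargs(keywords=['python']))
```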