How to install Python pandas Development Version on Mac OS X

The pandas data analysis module is quickly becoming the go-to tool for data analysis in Python. Certain features, such as joins and sorts, become extremely powerful when working with in-memory datasets. Oftentimes, operations that take hours to execute in Excel take only seconds using pandas.

As a recent re-convert to Mac OS X, I wanted to get set up with the development version of pandas on my new machine running Mac OS X 10.8.

To begin, we need to have a few things installed, particularly pip and Homebrew.

If you have not yet installed pip, and have a valid Python installation on your machine, simply run sudo easy_install pip in your terminal.

Once that’s done, we need to install a few system libraries before trying to install our Python libraries.
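With Homebrew in place, something along these lines should cover it. The exact formulae here are a guess (gfortran, for example, only matters if you also build NumPy/SciPy from source), and you will also want Apple’s command line tools installed so you have a C compiler:

    brew update
    brew install gfortran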

This will bring in all the compilers and libraries that we’re going to need to build our stuff later on.

Assuming that you want the following libraries installed at the global Python install level, rather than in a virtual environment, you can install the requirements to build pandas in a single line.
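A sketch of that one-liner; the exact list is my guess at what pandas needed to build at the time:

    sudo pip install numpy cython python-dateutil pytz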

With that, you should be able to clone the latest pandas repository and install the latest development version.
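Something like the following should do it (the repository lived under the pydata organization at the time):

    git clone https://github.com/pydata/pandas.git
    cd pandas
    sudo python setup.py install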

That’s pretty much it. If you have any problems, feel free to leave a comment.

Using Python and the NLTK to Find Haikus in the Public Twitter Stream

So after sitting around mining the public Twitter stream and detecting natural language with Python, I decided to have a little fun with all that data by detecting haikus.

The Natural Language Toolkit (NLTK) in Python, basically the Swiss army knife of natural language processing, allows for more than just natural language detection. The NLTK offers quite a few corpora, including the Carnegie Mellon University (CMU) Pronouncing Dictionary. This corpus contains quite a few features, but the one that piqued my interest was the syllable count for over 125,000 (English) words. With the ability to get the number of syllables for almost every English word, why not see if we can pluck some haikus from the public Twitter stream!

We’re going to feed Python a Tweet as a string and try to figure out whether it is a haiku, doing our best to split it into haiku form.

Building upon natural language detection with the NLTK, we should first filter out all the Tweets that are probably not English (to speed things up a little bit).
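A rough sketch of that filter, reusing the stopword-counting trick from the language detection post (the function name and the thresholding are my own hand-waving):

    # requires the NLTK stopwords corpus: nltk.download('stopwords')
    from nltk import wordpunct_tokenize
    from nltk.corpus import stopwords

    ENGLISH_STOPWORDS = set(stopwords.words('english'))
    OTHER_STOPWORDS = set(stopwords.words()) - ENGLISH_STOPWORDS

    def is_probably_english(text):
        # crude check: does the Tweet share more stopwords with English
        # than with every other language combined?
        tokens = set(word.lower() for word in wordpunct_tokenize(text))
        english_hits = len(tokens & ENGLISH_STOPWORDS)
        other_hits = len(tokens & OTHER_STOPWORDS)
        return english_hits > 0 and english_hits >= other_hits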

Once we have that out of the way, we can dig into the haiku detection.
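Here’s a rough sketch of that function, leaning on the CMU Pronouncing Dictionary for syllable counts; the word cleanup and the handling of unknown words are simplifications of the original hacky code:

    # requires the CMU Pronouncing Dictionary: nltk.download('cmudict')
    from nltk.corpus import cmudict

    # maps a word to one or more pronunciations (lists of phonemes)
    PRONUNCIATIONS = cmudict.dict()

    def syllable_count(word):
        """Count syllables via the CMU dict; return None for unknown words."""
        word = word.lower().strip('.,!?;:\'"#@')
        if word not in PRONUNCIATIONS:
            return None
        # vowel phonemes end in a stress digit (0, 1 or 2), so count those
        return min(sum(1 for phoneme in pron if phoneme[-1].isdigit())
                   for pron in PRONUNCIATIONS[word])

    def is_haiku(text):
        """Return the three haiku lines if text splits 5/7/5, otherwise False."""
        words = text.split()
        counts = []
        for word in words:
            count = syllable_count(word)
            if not count:            # unknown word (or no syllables): give up
                return False
            counts.append(count)
        if sum(counts) != 17:
            return False
        targets, lines, line, total = [5, 7, 5], [], [], 0
        for word, count in zip(words, counts):
            line.append(word)
            total += count
            if total == targets[len(lines)]:
                lines.append(' '.join(line))
                line, total = [], 0
            elif total > targets[len(lines)]:
                return False         # a word straddles a line boundary
        return lines if len(lines) == 3 else False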

So what we have now is a function, is_haiku, that will return a list of the three haiku lines if the given string is a haiku, or returns False if it’s (probably) not a haiku. I keep saying probably because this script isn’t perfect, but it works most of the time.

After all that hacky code, it’s just a matter of hooking it up to the public Twitter stream. Borrowing from the public Twitter stream mining code, we can pipe every Tweet into the is_haiku function and if it returns a list, add it to our database.
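In sketch form, the interesting part might look like this, assuming a stream listener wired up like the one in the Twitter mining post, with a haikus table instead of a tweets table:

    # inside a tweepy StreamListener subclass (see the mining post); a sketch only
    def on_status(self, status):
        if is_probably_english(status.text):
            lines = is_haiku(status.text)
            if lines:
                self.conn.execute(
                    'INSERT INTO haikus (username, line1, line2, line3) VALUES (?, ?, ?, ?)',
                    (status.user.screen_name, lines[0], lines[1], lines[2]))
                self.conn.commit()
        return True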

So running this for a while, we actually pick up some pretty entertaining Tweets. I have been running this script for a little while on a micro EC2 instance and created a basic site that shows them in haiku form, as well as a Twitter account that retweets every haiku that it finds.

Some samples of found haikus:


So it can be pretty interesting. What this exercise underlines is the public nature of your Tweets. There might be some robot out there mining all that stuff. In fact, every Tweet is archived by the Library of Congress, so be mindful of what you post.

I have posted the full script as a Gist that puts it all together. If you have any improvements or comments, feel free to contribute!

Deploying Your Flask Website with Python and Fabric

By the end of this post, you will be able to deploy your local code to any number of remote servers with a single command. This assumes, of course, that you have configured your web server with uWSGI to host your Flask website and that you have shell access to this server. Before proceeding, you should at least understand how to host a Flask site; otherwise a lot of this might not make sense, as most details relating to hosting are omitted here (particularly the idiosyncrasies of using a virtualenv in your hosting environment).

If you do not have shell access to your web server, maybe someone else can help. Maybe you need to check out the myriad of cloud hosting services out there (e.g. Amazon Web Services, which I use to host this blog, among other sites) and make the jump. You will learn a lot more this way and be able to do much more interesting things!

So, we have a little Flask website that we wish to be able to develop locally and then later deploy to our live (perhaps production) site.

First, locally, we should install Fabric; this can be in a virtual environment or not.
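That is just a quick pip install (this post was written against the classic Fabric 1.x API):

    pip install fabric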

Well, let’s assume that we have a Flask site, example_site; first, we’re going to need a setup.py.
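Something along these lines; the version number and the requirements list are placeholders:

    # setup.py -- a minimal sketch; metadata and requirements are placeholders
    from setuptools import setup

    setup(
        name='example_site',
        version='0.1',
        packages=['example_site'],
        include_package_data=True,
        install_requires=[
            'Flask',
        ],
    )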

What we’re doing above is saying that our Flask site, example_site, is a package in a directory with the same name. We define our requirements, package name, etc. here.

After we have a nice setup file, we can start scripting what’s called our fabfile. We must define two actions in the fabfile: a pack function and a deploy function.

pack() defines how to ball up our sources.

deploy() defines how to push it to the server and what to do once our sources are there.

I chose to configure my fabfile as below; the Fabric documentation covers the whole suite of features that are available to you.
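A sketch using the classic Fabric 1.x API; the host name, the remote virtualenv path, and the uWSGI restart step are all assumptions you will need to adapt:

    # fabfile.py -- a sketch; hosts and remote paths are assumptions
    from fabric.api import cd, env, local, put, run

    env.hosts = ['example.com']          # the server(s) to deploy to

    def pack():
        # ball up our sources as a source distribution tarball
        local('python setup.py sdist --formats=gztar', capture=False)

    def deploy():
        # figure out the release name and version from setup.py
        dist = local('python setup.py --fullname', capture=True).strip()
        # upload the tarball and unpack it on the remote end
        put('dist/%s.tar.gz' % dist, '/tmp/%s.tar.gz' % dist)
        run('tar xzf /tmp/%s.tar.gz -C /tmp' % dist)
        # install into the site's virtualenv, then clean up
        with cd('/tmp/%s' % dist):
            run('/var/www/example_site/env/bin/python setup.py install')
        run('rm -rf /tmp/%s /tmp/%s.tar.gz' % (dist, dist))
        # finally, poke uWSGI so it picks up the new code
        run('sudo service uwsgi restart')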

Once we are satisfied that our deploy scripts look good, we are ready to rock!

We can execute our pack and deploy in a single command:
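    fab pack deploy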

That’s it! Fabric will then ball up your sources, upload them to your remote server, and execute a script that handles those sources on the remote end. Once this is configured properly, developing is bliss. No matter what solution you choose, Fabric or otherwise, easy deployment of local code is incredibly important for your web project.

It allows you to focus on code rather than on operations that are repeatable and trivial.

Mining the Public Tweet Stream for Fun and Profit

The Twitter streaming API for those without firehose access is still useful and interesting; you just need to get your feet wet.

Because you should love Python, today we’ll focus on how to mine tweets from the Twitter streaming API using Python and Tweepy. Before we begin, you need to create a new application on Twitter and get your API keys and secrets ready to roll.

Got them? Good, let’s continue.

Oh also, I’d like to assume you’re using virtualenv. If not, no worries… but please, you’re going to ruin your life.

So let’s set up our dumb little development environment.
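Something like this, with the environment name being whatever you like:

    virtualenv venv
    source venv/bin/activate
    pip install tweepy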

Alrighty, so now we have Tweepy set up and we’re just about ready to get down to brass tacks. We’re going to be sucking down 1% of the Twitter feed via the sample streaming API… while that may not sound like a lot, it does add up, so let’s use sqlite to handle the task.

Really though, the complication pretty much ends there. Tweepy is an amazing library that makes the next part pretty easy.

Before we get into our Python code below, let’s quickly create a table to hold all of the information we want to keep by entering our Python interpreter.
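For example, from the interpreter (the column layout is just a guess at what you might want to keep):

    >>> import sqlite3
    >>> conn = sqlite3.connect('tweets.db')
    >>> cursor = conn.execute('''CREATE TABLE tweets
    ...     (id INTEGER PRIMARY KEY AUTOINCREMENT,
    ...      username TEXT, created_at TEXT, text TEXT)''')
    >>> conn.commit()
    >>> conn.close()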

Okay, cool. Now we have a healthy place to store our tweets. Feel free to tweak that to capture whatever data you want; just be sure to modify the code below to match.
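A sketch of that code, using the classic Tweepy StreamListener API of that era (newer Tweepy versions have since renamed these classes); the keys are placeholders for your own credentials:

    # tweet_gobbler.py -- a sketch; fill in your own API keys and secrets
    import sqlite3
    import tweepy

    CONSUMER_KEY = '...'
    CONSUMER_SECRET = '...'
    ACCESS_TOKEN = '...'
    ACCESS_TOKEN_SECRET = '...'

    class TweetGobbler(tweepy.StreamListener):
        def __init__(self):
            super(TweetGobbler, self).__init__()
            self.conn = sqlite3.connect('tweets.db')

        def on_status(self, status):
            # store the pieces of each tweet we care about
            self.conn.execute(
                'INSERT INTO tweets (username, created_at, text) VALUES (?, ?, ?)',
                (status.user.screen_name, str(status.created_at), status.text))
            self.conn.commit()
            return True

        def on_error(self, status_code):
            return True  # keep the stream alive on errors

    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
    stream = tweepy.Stream(auth, TweetGobbler())
    stream.sample()  # the 1% sample stream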

So, let’s pretend we put that code into a file called tweet_gobbler.py and said okay, let’s go!
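    python tweet_gobbler.py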

That’s it! While this script runs, it will continue to update the sqlite database tweets.db, and after a few days you will have tons and tons of tweets from all around the world. Lucky you!

I set this up on an Amazon EC2 micro instance and let it run for a few days and pulled down about 400MiB of tweets, so it’s not a bad way to build that awesome dataset you’ve been craving.

Now, go save the world.

Oh, also, you can home in on specific users or Tweets, if you like, using the streaming API’s filter functionality. Maybe more on that later; stick around.

Detecting Language with Python and the Natural Language Toolkit (NLTK)

Whether you want to catalog your mined public tweets or offer suggestions based on a user’s language preferences, Python can help detect a given language with a little bit of hackery around the Natural Language Toolkit (NLTK).

Let’s get going by first installing NLTK and downloading some language data to make it useful. Just a note here, the NLTK is an incredible suite of tools that can act as your swiss army knife for almost all natural language processing jobs you might have–we are just scratching the surface here.

Install the NLTK within a Python virtualenv.
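Inside your activated virtualenv, that is just:

    pip install nltk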

Now we’re going to need some language data, hmm.
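From the interpreter, fire up the NLTK downloader:

    >>> import nltk
    >>> nltk.download()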

Play around in the NLTK downloader interface for a while, particularly the list of available packages (by entering l), but basically all we need to download are the punkt and stopwords packages.

Now we can finally start having some fun with a new script, detect_lang.py.
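A sketch of the script; the function name is my choice, but the idea is exactly what is described below:

    # detect_lang.py -- a minimal sketch of stopword-based language detection
    from nltk import wordpunct_tokenize
    from nltk.corpus import stopwords

    def detect_language(text):
        """Return the language whose stopword list shares the most words with text."""
        tokens = set(word.lower() for word in wordpunct_tokenize(text))
        scores = {}
        for language in stopwords.fileids():
            stopword_set = set(stopwords.words(language))
            scores[language] = len(tokens & stopword_set)
        # the language with the most coincident stopwords wins
        return max(scores, key=scores.get)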

Basically what we’re doing above is seeing which language’s stopword list contains the most words coincident with our input text, and returning that language.

Let’s test it out!
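A rough, Python 2-era sketch of the kind of test described below: grab a Wikipedia page, crudely strip the HTML, and see what the detector says (the URL is an arbitrary example):

    import re
    import urllib2

    from detect_lang import detect_language

    html = urllib2.urlopen('http://en.wikipedia.org/wiki/Haiku').read()
    text = re.sub(r'<[^>]*>', ' ', html)   # crude tag stripping; inline scripts survive
    print(detect_language(text))           # hopefully 'english'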

Not too bad! We tried to strip out most of the HTML from a Wikipedia page for that, so some of the Javascript calls are still contained and may throw off our detector, but this technique should work for most data. I found it works pretty well for detecting English tweets versus non-English tweets… more on that later.