So after sitting around mining the public Twitter stream and detecting natural language with Python, I decided to have a little fun with all that data by detecting haikus.
The Natural Language Toolkit (NLTK) in Python, basically the Swiss army knife of natural language processing, allows for more than just natural language detection. The NLTK offers quite a few corpora, including the Carnegie Mellon University (CMU) Pronouncing Dictionary. This corpus contains quite a few features, but the one that piqued my interest was that it gives pronunciations, and therefore syllable counts, for over 125,000 (English) words. With the ability to get the number of syllables for most common English words, why not see if we can pluck some haikus from the public Twitter stream?
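As a rough illustration of where those syllable counts come from: each dictionary entry is a list of ARPAbet phonemes, and vowel phonemes carry a trailing stress digit (0, 1 or 2), so counting the phonemes that end in a digit gives the number of syllables. The sketch below uses a tiny hand-copied sample (`SAMPLE_PRONUNCIATIONS` and `count_syllables` are hypothetical names, and the exact phonemes should be treated as an assumption) so it runs without downloading the corpus. With NLTK installed, `cmudict.dict()` returns a mapping of the same shape.

```python
# A few entries in the shape returned by nltk.corpus.cmudict.dict():
# word -> list of pronunciations, each a list of ARPAbet phonemes.
# Hand-copied for illustration only; the real corpus is far larger.
SAMPLE_PRONUNCIATIONS = {
    "haiku": [["HH", "AY1", "K", "UW0"]],
    "radio": [["R", "EY1", "D", "IY0", "OW2"]],
    "rules": [["R", "UW1", "L", "Z"]],
}


def count_syllables(word, pronunciations=SAMPLE_PRONUNCIATIONS):
    """Count syllables in the first listed pronunciation, or None if unknown."""
    entries = pronunciations.get(word.lower())
    if not entries:
        return None
    # Vowel phonemes end in a stress digit; consonants do not.
    return sum(1 for phoneme in entries[0] if phoneme[-1].isdigit())


for word in ("haiku", "radio", "rules", "qwerty"):
    print(word, count_syllables(word))
```

Words missing from the table come back as `None`, which is worth handling explicitly, since Tweets are full of slang and misspellings that no dictionary covers.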
We’re going to feed Python a Tweet as a string and try to figure out whether it is a haiku, doing our best to split it up into haiku form.
Building upon natural language detection with the NLTK, we should first filter out all the Tweets that are probably not English (to speed things up a little bit).
```python
# natural language toolkit for stopword lists and syllable counting
# (needs the NLTK 'stopwords' and 'cmudict' corpora, e.g. via nltk.download)
import nltk
from nltk.corpus import cmudict, stopwords

ENGLISH_STOPWORDS = set(stopwords.words('english'))
NON_ENGLISH_STOPWORDS = set(stopwords.words()) - ENGLISH_STOPWORDS


def is_english(text):
    """Return True if text is probably English, False if text is probably not English
    """
    text = text.lower()
    words = set(nltk.wordpunct_tokenize(text))
    return len(words & ENGLISH_STOPWORDS) > len(words & NON_ENGLISH_STOPWORDS)
```
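To make the stopword heuristic concrete without the NLTK corpora, here is a toy version with two small hand-picked stopword sets. The set contents and the names `TOY_ENGLISH`, `TOY_OTHER` and `looks_english` are assumptions for illustration (the real lists are much larger), and a simple regex stands in for `nltk.wordpunct_tokenize`.

```python
import re

# Tiny hand-picked stopword sets, for illustration only.
TOY_ENGLISH = {"the", "is", "to", "and", "of", "a", "you", "that"}
TOY_OTHER = {"el", "la", "de", "que", "y", "le", "et", "der", "und"}


def looks_english(text):
    """True when more English stopwords than non-English ones appear in text."""
    # Split into runs of word characters or runs of punctuation.
    words = set(re.findall(r"\w+|[^\w\s]+", text.lower()))
    return len(words & TOY_ENGLISH) > len(words & TOY_OTHER)


print(looks_english("Life is too short to spend your time filled with hate"))
print(looks_english("La vida es demasiado corta y el tiempo vuela"))
```

Because the comparison is strict, a Tweet containing no stopwords at all is treated as not English, which is the cautious choice for a filter whose only job is to cut down the work.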
Once we have that out of the way, we can dig into the haiku detection.
```python
import re

# Load the pronouncing dictionary once; rebuilding it for every Tweet is slow.
PRONUNCIATIONS = cmudict.dict()


def is_haiku(text):
    """Return the three haiku lines if text splits 5/7/5, otherwise False."""
    # Numbers have no dictionary entry, so reject text containing digits.
    if any(ch.isdigit() for ch in text):
        return False

    words = nltk.wordpunct_tokenize(re.sub('[^a-zA-Z_ ]', '', text.lower()))
    targets = (5, 12, 17)  # running syllable totals where each line must end
    syl_count = 0
    lines = []  # the word that closes each haiku line

    for word in words:
        if word not in PRONUNCIATIONS:
            return False  # unknown word: its syllables cannot be counted
        # Vowel phonemes end in a stress digit, so counting them in the
        # first listed pronunciation gives the syllable count.
        syl_count += len([p for p in PRONUNCIATIONS[word][0] if p[-1].isdigit()])
        if len(lines) < 3 and syl_count == targets[len(lines)]:
            lines.append(word)

    if syl_count != 17 or len(lines) != 3:
        return False

    # Rebuild the lines from the original text to keep case and punctuation.
    final_lines = []
    str_tmp = ""
    counter = 0
    for word in text.split():
        str_tmp += word + " "
        if counter < 3 and lines[counter] in word.lower():
            final_lines.append(str_tmp.strip())
            counter += 1
            str_tmp = ""
    if str_tmp:
        final_lines.append(str_tmp.strip())
    # The substring match above is loose; only accept a clean three-way split.
    return final_lines if len(final_lines) == 3 else False
```
So what we have now is a function, is_haiku, that returns a list of the three haiku lines if the given string is a haiku, or False if it’s (probably) not a haiku. I keep saying probably because this script isn’t perfect, but it works most of the time.
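Stripped of tokenizing and dictionary lookups, the core idea is a running syllable total that must land exactly on 5, 12 and 17 at word boundaries. Here is a minimal sketch of just that step, assuming the syllable counts are already known. The `TOY_SYLLABLES` table and the `split_haiku` name are hypothetical, and unlike is_haiku this version returns `None` rather than False and keeps track of the words directly instead of matching them back against the original text.

```python
# Hand-written syllable counts for the words in the example sentence.
TOY_SYLLABLES = {
    "learn": 1, "to": 1, "write": 1, "haiku": 2,
    "there": 1, "are": 1, "rules": 1, "for": 1, "syllables": 3,
    "ham": 1, "radio": 3, "rest": 1,
}


def split_haiku(text, syllables=TOY_SYLLABLES):
    """Return three lines if text breaks at 5/12/17 syllables, else None."""
    targets = [5, 12, 17]  # cumulative totals at the end of each line
    lines, current, total = [], [], 0
    for word in text.split():
        count = syllables.get(word.lower())
        if count is None:
            return None  # unknown word: cannot count syllables
        total += count
        current.append(word)
        if targets and total == targets[0]:
            lines.append(" ".join(current))
            current = []
            targets.pop(0)
        elif targets and total > targets[0]:
            return None  # a word straddles a line break
        elif not targets:
            return None  # more than 17 syllables
    return lines if not targets else None


print(split_haiku("Learn to Write haiku there are rules for syllables ham radio rest"))
```

Bailing out as soon as a word straddles a boundary means most non-haiku Tweets are rejected after only a few words.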
After all that hacky code, it’s just a matter of hooking it up to the public Twitter stream. Borrowing from the public Twitter stream mining code, we can pipe every Tweet into the is_haiku function and if it returns a list, add it to our database.
```python
import tweepy  # StreamListener is the pre-4.0 tweepy API

# importing (unpublished) SQLAlchemy code so db saving is a lot easier
# see http://h6o6.com/2012/12/mining-the-public-tweet-stream-for-fun-and-profit/
from haiku import db
from haiku.models import TweetHaiku


class StreamWatcherListener(tweepy.StreamListener):
    def on_status(self, status):
        try:
            if is_english(status.text):
                haiku_result = is_haiku(status.text)
                if haiku_result is not False:
                    print("%s\nby %s at %s from %s\n\n" % (
                        status.text, status.author.screen_name,
                        status.created_at, status.source))
                    try:
                        tweet_haiku = TweetHaiku(status, haiku_result)
                        db.session.add(tweet_haiku)
                        db.session.commit()
                    except Exception as e:
                        # saving failed; show what we tried to store and why
                        print(haiku_result)
                        print(e)
        except Exception:
            # skip any Tweet that cannot be processed
            pass

    def on_error(self, status_code):
        print("An error has occurred! Status code = %s" % status_code)
        return True  # keep the dream alive


def main():
    # establish stream
    consumer_key = "TWITTER_CONSUMER_KEY"
    consumer_secret = "TWITTER_CONSUMER_SECRET"
    auth1 = tweepy.auth.OAuthHandler(consumer_key, consumer_secret)

    access_token = "TWITTER_ACCESS_TOKEN"
    access_token_secret = "TWITTER_ACCESS_TOKEN_SECRET"
    auth1.set_access_token(access_token, access_token_secret)

    print("Establishing stream")

    # sanity check with a known 5/7/5 string before going live
    haiku_test = "Learn to Write haiku there are rules for syllables ham radio rest"
    test_result = is_haiku(haiku_test)
    if test_result is not False and len(test_result) == 3:
        print("Haiku detection is (probably) working properly")
    else:
        print("Haiku detection (probably) broken :(")

    stream = tweepy.Stream(auth1, StreamWatcherListener(), timeout=None)
    stream.sample()


if __name__ == '__main__':
    try:
        main()
    except KeyboardInterrupt:
        print("\n Later gator!")
```
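Before connecting to the live stream, the filter-then-store flow can be tried offline. The sketch below is a stand-in built on assumptions: `collect_haikus` is a hypothetical helper, its two detector arguments take the place of is_english and is_haiku, the fake detectors exist only so the example runs without NLTK or Twitter access, and a plain list replaces the database.

```python
def collect_haikus(texts, is_english, is_haiku):
    """Run each text through both checks and keep the ones that pass."""
    found = []
    for text in texts:
        try:
            if not is_english(text):
                continue
            result = is_haiku(text)
        except Exception as exc:
            # One bad Tweet should not stop the run; report it and move on.
            print("skipped %r: %s" % (text, exc))
            continue
        if result is not False:
            found.append((text, result))
    return found


# Fake detectors, for illustration only.
def fake_is_english(text):
    return all(ord(ch) < 128 for ch in text)


def fake_is_haiku(text):
    words = text.split()
    return [" ".join(words)] if len(words) == 3 else False


samples = ["ham radio rest", "too short", "héllo wörld now"]
print(collect_haikus(samples, fake_is_english, fake_is_haiku))
```

Passing the detectors in as arguments keeps the flow testable on its own; swapping in the real functions and a database write gives the same shape as the stream listener.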
So running this for a while, we actually pick up some pretty entertaining Tweets. I have been running this script for a little while on a micro EC2 instance and created a basic site that shows them in haiku form, as well as a Twitter account that retweets every haiku that it finds.
Some samples of found haikus:
I still need you as my lover, but I also hate you fuck (fake name)
— Dorian Spence (@DorianSpence) January 5, 2013
What the heck is an apple? Who decided to eat one off a tree?
— Mason Mueller (@masonamueller) February 23, 2013
Let go of all the hate. Life is too short to spend your time filled with hate.
— LORD KNOWS (@sedgims) February 11, 2013
So it can be pretty interesting. What this exercise underlines is how public your Tweets are. There might be some robot out there mining all that stuff. In fact, every Tweet is archived by the Library of Congress, so be mindful of what you post.
I have posted the full script as a Gist that puts it all together. If you have any improvements or comments, feel free to contribute!
