Friday, May 28th, 2010
A Few of ITP’s Favorite Things
Process on medium and rights on animation
Bright copper html and warm woolen experiment
Brown paper studies tied up with strings
These are a few of ITP’s favorite things
Cream colored zomfg and crisp performing streudels
Partying latex and sensors with noodles
Wild geese that fly with javascript on their wings
These are a few of ITP’s favorite things
Breadboards that stay on my wack, and eyelashes
Organizer in white taiwanese with blue radioshack sashes
Silver white stressful that melt into springs
These are a few of ITP’s favorite things
When the arduino bites
When the template stings
When I’m feeling sad
I simply remember ITP’s favorite things
And then I don’t feel so bad
——————————-
For my final project in Adam Parrish’s Reading and Writing Electronic Text class, I focused on a rich and terrible source text: the ITP Student List. It is rich because of the sheer volume of revealing chatter, and terrible because of how difficult it was for me to extract the body of each message without the meta-text, quoted responses, ASCII art signatures, and the other detritus that we’ve trained ourselves to gloss over. As a result, most of the hours I spent on this project went to that problem, and not to the far more interesting one of doing something compelling with the results.
To get started, I copied a year’s worth of emails to the student list from a classmate who downloads her mail to Apple’s Mail program but doesn’t purge her inbox, bless her heart. Apple Mail, because it’s an Apple product and Apple can’t seem to resist mucking up standard formats, uses a modified version of the mbox format, which I was able to convert to regular mbox using the creatively named emlx to mbox converter.
Once I had extracted the mbox, it was trivial to turn it into text files, though the amount of raw text was pretty huge. To give you an idea: I started experimenting with a year’s worth of email from only one student, which came to 1,107,782 words, or more appropriately given the number of >> and other dingbats in a typical email, 9,524,210 characters (without spaces). Moby Dick has about 210,000 words. Not bad, anonymous ITP student!
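For the curious, counts like the ones above can be produced with a few lines of Python. This is only a sketch of one way to do it, not necessarily how I got my numbers; the function name and the sample string are made up for illustration.

```python
def count_words_and_chars(text):
    """Return (word_count, non_space_char_count) for a string."""
    # split() with no arguments breaks on any run of whitespace
    words = text.split()
    # Summing the token lengths counts every non-whitespace character
    chars = sum(len(w) for w in words)
    return len(words), chars


# A made-up snippet in the style of a quoted list email
sample = "> > On Mon, someone wrote:\n>> see you at the show"
word_count, char_count = count_words_and_chars(sample)
print(word_count, char_count)
```

Note that a naive split counts every stray > as a "word," which is part of why the character count is the more honest measure here.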
Next up was a long and harrowing battle to programmatically extract just the body of each email. The first problem is that there’s no consistent tag or phrase marking where the header information ends and the body of each email begins. The second problem is that because most people ‘quote’ the emails they’re replying to beneath the text of their own messages, there’s often no consistent dividing line between new messages and what’s simply being quoted. I tried BeautifulSoup and a number of XML and HTML parsers, as well as simply iterating through each message and eliminating lines that included header information. It was all a terrible mess, and left me with lists of attempted output files with names like these:
The debacle in question came when I realized that a day’s worth of previous attempts had been building on a source text which had eliminated good words along with bad, due to some faulty regular expressions. Eventually, I got help from Paul Paradiso with Python’s built-in mailbox library and email module, which allowed me to iterate through the messages and finally get rid of the headers. After that I was able to use a combination of for loops to go through an mbox and write only the lines I wanted to a new file. I still had to combine this process with quite a few experimental stabs at different regular expressions to filter out as much quoted material as I could. Here is an example from deep in the process. I tried one regular expression at a time to see what the results were, and the commented-out code reflects that process. (I also changed the email address in the example below, which originally reflected the email of the person who gave me the whole mess of emails):
import mailbox
import email
import re
from email.parser import Parser  # Python 3 spelling; the 2010 original used email.Parser

p = Parser()

if __name__ == "__main__":
    mbox = mailbox.mbox('mbox2')
    # Append mode: each run adds to the output file rather than overwriting it
    f = open("nonumbers2.txt", 'a')
    for message in mbox:
        mssg = p.parsestr(message.as_string())
        # walk() visits every MIME part; only the text parts are kept
        for part in mssg.walk():
            if part.get_content_maintype() == "text":
                payload = part.get_payload()
                lines = payload.split('\n')
                for line in lines:
                    # Strip every digit (Python 2 original: line.translate(None, '1234567890'))
                    line = line.translate(str.maketrans('', '', '1234567890'))
                    # if line.find('>') == 0:
                    #     continue
                    # if re.match('On (Sun|Mon|Tue|Wed|Thu|Fri|Sat), ',line):
                    #     continue
                    # if re.search('zz2408@nyu.edu',line):
                    #     continue
                    # if re.search('@lists.nyu.edu', line):
                    #     continue
                    # #if line.find('zz2408@nyu.edu') != -1:
                    # #    continue
                    # #if re.search('>', line):
                    #     continue
                    #     continue
                    # if line.find('@lists.nyu.edu'):
                    #     continue
                    # if line.find('To unsubscribe send a blank email to '):
                    #     continue
                    # f.write(line)
                    # f.write('\n')
                    #else:
                    f.write(line)
                    f.write('\n')
                    # break
    f.close()
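The commented-out experiments above can be gathered into one filter. What follows is a cleaned-up sketch, not the code I actually ran: the function names are invented, the address is the same placeholder as above, and the sample message is made up. One thing worth knowing if you try this yourself: str.find returns -1 when nothing is found, and -1 counts as true in an if statement, so a test like "if line.find('@lists.nyu.edu'):" skips almost every line. The sketch uses regular expression searches instead.

```python
import re

# Patterns adapted from the commented-out experiments; the address is a placeholder
ATTRIBUTION = re.compile(r'On (Sun|Mon|Tue|Wed|Thu|Fri|Sat), ')
BOILERPLATE = re.compile(
    r'zz2408@nyu\.edu|@lists\.nyu\.edu|To unsubscribe send a blank email to '
)


def keep_line(line):
    """Return True if the line looks like new text rather than quoted material."""
    # Quoted replies start with '>' (leading whitespace is ignored here,
    # which is slightly looser than the original find('>') == 0 check)
    if line.lstrip().startswith('>'):
        return False
    # "On Mon, ... wrote:" style attribution lines
    if ATTRIBUTION.match(line):
        return False
    # List footers and lines that mention the list or the placeholder address
    if BOILERPLATE.search(line):
        return False
    return True


def filter_body(payload):
    """Drop quoted and boilerplate lines from one message body."""
    return '\n'.join(line for line in payload.split('\n') if keep_line(line))


# A made-up message body for illustration
sample = (
    "Anyone have a spare breadboard?\n"
    "On Mon, May 3, someone wrote:\n"
    "> I left mine on the floor.\n"
    "To unsubscribe send a blank email to the list address\n"
)
cleaned = filter_body(sample)
print(cleaned)
```

This still misses plenty (attribution lines that wrap onto a second line, signatures, forwarded headers), which matches my experience that no single pattern catches everything.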
After all this I had thousands of fairly usable words, but even after getting rid of all numeric characters, I still had a bunch of crappy gibberish polluting my words. So I modified a bit of code from Digital Noah’s midterm to clean the list up:
# Assumes `words` is a set of candidate words built earlier in the script
wordsNull = set()
wordsTemp = []
for word in words:
    # Reject anything containing punctuation, plus very long or very short tokens
    if (any(ch in word for ch in "-_.'<:=>/~&#@*+")
            or len(word) > 30 or len(word) < 4):
        wordsNull.add(word)
for word in wordsNull:
    words.remove(word)
for word in words:
    wordsTemp.append(word)
Good enough! I spat out the results randomly and one at a time into that lovely American standard that only John Coltrane could possibly make cool, “My Favorite Things,” and the result is a look into the ITP zeitgeist. The code is really ugly and needs a lot of work, but feel free to take a look. Use the code pasted above, and this file, as well.
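If you just want the gist of the fill-in step without digging through my files, it might look roughly like this. This is a simplified sketch rather than my actual code: the template lines, the "{}" slot marker, the function name, and the tiny word list are all stand-ins for illustration.

```python
import random

# Hypothetical template: "{}" marks each slot to be filled with a list word
TEMPLATE = [
    "{} on {} and {} on {}",
    "Bright copper {} and warm woolen {}",
    "These are a few of ITP's favorite things",
]


def fill_template(lines, words, rng=random):
    """Fill every "{}" slot with a randomly chosen word, one at a time."""
    filled = []
    for line in lines:
        slots = line.count("{}")
        picks = [rng.choice(words) for _ in range(slots)]
        filled.append(line.format(*picks))
    return "\n".join(filled)


# A tiny stand-in for the cleaned word list
words = ["arduino", "breadboards", "javascript", "sensors", "template"]
# Seeding the generator makes a run repeatable; drop the seed for a new poem each time
poem = fill_template(TEMPLATE, words, random.Random(0))
print(poem)
```

Picking each word independently means repeats are possible, which I think suits a list where the same obsessions come up again and again.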
Thankfully, it’s all about the poetry.








