Archive for the 'rwet' Category

Friday, May 28th, 2010

A Few of ITP’s Favorite Things

Process on medium and rights on animation
Bright copper html and warm woolen experiment
Brown paper studies tied up with strings
These are a few of ITP’s favorite things

Cream colored zomfg and crisp performing streudels
Partying latex and sensors with noodles
Wild geese that fly with javascript on their wings
These are a few of ITP’s favorite things

Breadboards that stay on my wack, and eyelashes
Organizer in white taiwanese with blue radioshack sashes
Silver white stressful that melt into springs
These are a few of ITP’s favorite things

When the arduino bites
When the template stings
When I’m feeling sad
I simply remember ITP’s favorite things
And then I don’t feel so bad

——————————-
For my final project in Adam Parrish’s Reading and Writing Electronic Text class, I focused on a rich and terrible source text: the ITP Student List. It is rich because of the sheer volume of revealing chatter, and terrible because of how difficult it was for me to extract the body of each message without the meta-text, quoted responses, ASCII art signatures, and other detritus that we’ve trained ourselves to gloss over. As a result, most of the hours spent on this project went toward that problem, and not toward the far more interesting one of doing something compelling with the results.

To get started, I copied a year’s worth of emails to the student list from a classmate who downloads her mail to Apple’s Mail program but doesn’t purge her inbox, bless her heart. Apple Mail, because it’s an Apple product and Apple can’t seem to resist mucking up standard formats, uses a modified version of the mbox format, which I was able to convert to regular mbox format using the creatively named emlx to mbox converter.

Once I had extracted the mbox, it was trivial to turn the messages into text files, though the amount of raw text was pretty huge. To give you an idea: I started experimenting with a year’s worth of email from only one student, which came to 1,107,782 words or, more appropriately given the number of >> and other dingbats in a typical email, 9,524,210 characters (without spaces). Moby Dick has about 210,000 words. Not bad, anonymous ITP student!

Next up was a long and harrowing battle to programmatically extract just the body of each email. The first problem is that there’s no consistent tag or phrase marking where the header information ends and the body of each email begins. The second problem is that because most people ‘quote’ the emails they’re replying to beneath the text of their own messages, there’s often no consistent dividing line between new messages and what’s simply being quoted. I tried BeautifulSoup and a number of XML and HTML parsers, as well as simply iterating through each message and eliminating lines that included header information. It was all a terrible mess, and left me with lists of attempted output files with names like these:

angst and iteration

The debacle in question came when I realized that a day’s worth of previous attempts had been building on a source text which had eliminated good words along with bad, due to some faulty regular expressions. Eventually, I got help from Paul Paradiso with Python’s built-in mailbox library and email module, which allowed me to iterate through the messages and finally get rid of the headers. After that I was able to use a combination of for loops to go through an mbox and write only the lines I wanted to a new file. I still had to combine this process with quite a few experimental stabs at different regular expressions to filter out as much quoted material as I could. Here is an example from deep in the process. I iterated using one regular expression at a time to see what my results were, and the commented-out code reflects that process. (I also changed the email address in the example below, which originally reflected the email of the person who gave me the whole mess of emails):


import mailbox
import email
import re
from email.parser import Parser
p = Parser()

if __name__ == "__main__":
    mbox = mailbox.mbox('mbox2')
    f = open("nonumbers2.txt",'a')
    for message in mbox:
        mssg = p.parsestr(message.as_string())
        for part in mssg.walk():
            if part.get_content_maintype() == "text":
                payload = part.get_payload()
                lines = payload.split('\n')
                for line in lines:
                    # strip all digits from the line
                    line = re.sub('[0-9]', '', line)
                    # if line.find('>') == 0:
                    #     continue
                    # if re.match('On (Sun|Mon|Tue|Wed|Thu|Fri|Sat), ',line):
                    #     continue
                    # if re.search('zz2408@nyu.edu',line):
                    #     continue
                    # if re.search('@lists.nyu.edu', line):
                    #     continue
                    # #if line.find('zz2408@nyu.edu') != -1:
                    # #    continue
                    # #if re.search('>', line):
                    #     continue
                    #     continue
                    # if line.find('@lists.nyu.edu'):
                    #     continue
                    # if line.find('To unsubscribe send a blank email to '):
                    #     continue

                    # f.write(line)
                    # f.write('\n')

                    #else:
                    f.write(line)
                    f.write('\n')
                    # break
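
The commented-out experiments above can be gathered into a single predicate. What follows is a minimal sketch under stated assumptions, not the code used for the project: the names `NOISE_PATTERNS`, `is_noise`, and `clean_body` are hypothetical, the patterns are copied from the commented-out lines, and it assumes Python 3.

```python
import re

# Patterns copied from the commented-out experiments above.
NOISE_PATTERNS = [
    re.compile(r'^>'),                                   # quoted reply lines
    re.compile(r'^On (Sun|Mon|Tue|Wed|Thu|Fri|Sat), '),  # reply attribution lines
    re.compile(r'@lists\.nyu\.edu'),                     # list addresses
    re.compile(r'To unsubscribe send a blank email to '),
]

def is_noise(line):
    """Return True if the line matches any of the noise patterns."""
    return any(p.search(line) for p in NOISE_PATTERNS)

def clean_body(payload):
    """Drop noise lines and strip digits from the lines that remain."""
    kept = []
    for line in payload.split('\n'):
        if is_noise(line):
            continue
        kept.append(re.sub('[0-9]', '', line))
    return '\n'.join(kept)
```

One thing to watch when writing tests like these: `str.find` returns -1 when nothing is found, and -1 is truthy, so a bare `if line.find(x):` is true for nearly every line. Comparing the result against -1, or using `in`, avoids that.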

After all this I had thousands of fairly usable words, but even after getting rid of all numeric characters, I still had a bunch of crappy gibberish polluting my words. So I modified a bit of code from Digital Noah’s midterm to clean the list up:


for word in words:
    if "-" in word or "_" in word or ".." in word or "'" in word or "<" in word or ":" in word or "." in word or "=" in word or ">" in word or "/" in word or "~" in word or "&" in word or "=" in word or "#" in word or "@" in word or "*" in word or "+" in word or len(word) > 30 or len(word) < 4:
        wordsNull.add(word)

for word in wordsNull:
    words.remove(word)

for word in words:
    wordsTemp.append(word)
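
The same filter can be written more compactly with a character class. This is only a sketch: `BAD_CHARS`, `is_clean`, and `clean_words` are hypothetical names, and the rule (4 to 30 characters, none of the listed punctuation) is read off the snippet above. Unlike the snippet, it drops every occurrence of a rejected word rather than one per distinct word.

```python
import re

# Any one of the punctuation characters tested in the snippet above.
BAD_CHARS = re.compile(r"[-_.'<>:=/~&#@*+]")

def is_clean(word):
    """True for words of 4 to 30 characters with no rejected punctuation."""
    return 4 <= len(word) <= 30 and not BAD_CHARS.search(word)

def clean_words(words):
    """Return the clean words, preserving their original order."""
    return [w for w in words if is_clean(w)]
```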

Good enough! I spat out the results randomly and one at a time into that lovely American standard that only John Coltrane could possibly make cool, “My Favorite Things,” and the result is a look into the ITP zeitgeist.  The code is really ugly and needs a lot of work, but feel free to take a look. Use the code pasted above, and this file, as well.
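
The template-filling code itself is not pasted in this post, so the following is a guess at the general shape rather than the actual implementation. `TEMPLATE` and `fill_template` are hypothetical, the slot positions are inferred by comparing the poem with the song's lyrics, and the sample words are taken from the poem at the top of the post.

```python
import random

# Hypothetical template: each {} is a slot for one randomly chosen word.
TEMPLATE = [
    "{} on {} and {} on {}",
    "Bright copper {} and warm woolen {}",
    "Brown paper {} tied up with strings",
    "These are a few of ITP's favorite things",
]

def fill_template(lines, words, rng=random):
    """Fill every {} slot, in order, with a word drawn at random."""
    filled = []
    for line in lines:
        slots = line.count("{}")
        chosen = [rng.choice(words) for _ in range(slots)]
        filled.append(line.format(*chosen))
    return "\n".join(filled)

if __name__ == "__main__":
    sample_words = ["process", "medium", "rights", "animation",
                    "html", "experiment", "studies"]
    print(fill_template(TEMPLATE, sample_words))
```

Passing a seeded `random.Random` instance as `rng` makes a particular poem reproducible.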

Thankfully, it’s all about the poetry.

Posted by admin | Filed in itp, rwet | Comment now »

 

Saturday, February 6th, 2010

Limerick Enfilthizer

For my first project in Reading and Writing Electronic Text, I made a very simple program to take perfectly clean limericks and make them sound dirty.  A few selections:

—————————–

There was a Young Lady whosebleep
Were unique as to colour and size;
When she opened thembleep
People all turned bleep
And started ableepitybleep

—————————–

There was a Young Lady of Dobleep
Who bought a large bonnet for walking;
But its colour andbleep
So bedazzled herbleep
That she very soon went bleepitybleep

—————————–

There was a Young Person of bleep
Whose toilette was far from complete;
She dressed in ableep
Spickle-speckled with bleep
That ombliferous bleepitybleep

—————————–

There was an Old Person of bleep
Whose conduct was painful and bleep
He sate on the sbleep
Eating apples and bleep
That imprudent Old bleepitybleep

—————————–

There was an Old Man in ableep
Whobleep ‘I’m afloat, I’m afloat!’
When theybleep ‘No! you ain’t!’
He was ready to bleep
That unhappy Olbleepitybleep

—————————–

There was an Old Person of bleep
Who rushed through a field of blue Clover;
But some very largebleep
Stung his nose and his bleep
So he very soon wenbleepitybleep

—————————–

See all 35 dirtied limericks here.

—————————–

What I did:
I grabbed a slew of Edward Lear limericks from this flowery page and spent a lot of time trying to figure out how to do a find for “Limerick,” which was helpfully entered above every single poem, and then go back two lines to replace the last few characters of each poem. After banging my head against that for a while, I realized that most of the limericks ended in periods, and none of them had periods within the text. So I replaced each period and the fifteen characters preceding it with ‘bleepitybleep’ and then replaced each comma and the five characters before it with ‘bleep’ just to make it funnier. And now the world is a slightly filthier place. You’re welcome.

Code:

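import sys
for line in sys.stdin:
	line = line.strip()
	# positions of the first period and first comma, or -1 if absent
	offset = line.find(".")
	comma = line.find(",")
	if comma != -1:
		# the comma plus the five characters before it
		foo2 = line[comma -5:comma+1]
		line = line.replace(foo2, 'bleep')
	if offset != -1:
		# everything from fifteen characters before the period to the end
		foo = line[offset -15:]
		line = line.replace(foo, 'bleepitybleep')
	print(line)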
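
The listing above leans on `str.replace`, which replaces every occurrence of the slice it is given. When a comma falls within the first five characters of a line, the slice can also come out empty, and replacing an empty string inserts the bleep between every character. A variant that splices by position avoids both. This is a sketch, assuming Python 3: `enfilthize` and its parameters are hypothetical names, and it looks for the period after the comma has been bleeped rather than before.

```python
def enfilthize(line, comma_span=5, period_span=15):
    """Bleep the text before the first comma and before the first period."""
    line = line.strip()
    comma = line.find(",")
    if comma != -1:
        # Clamp at 0 so short lines do not produce a negative start index.
        start = max(comma - comma_span, 0)
        line = line[:start] + "bleep" + line[comma + 1:]
    period = line.find(".")
    if period != -1:
        start = max(period - period_span, 0)
        # Everything from the start position to the end of the line is replaced.
        line = line[:start] + "bleepitybleep"
    return line

if __name__ == "__main__":
    for raw in ["There was a Young Lady whose eyes,",
                "And started away in surprise."]:
        print(enfilthize(raw))
```

To use it as a filter like the original, call `enfilthize` on each line read from `sys.stdin`.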

Posted by admin | Filed in itp, rwet | 1 Comment »