Fun With Squeak: Stack Overflow Parser!

The first thing I did when I started with Squeak was to see if I could program a nice interface for parsing Stack Overflow data. I’ve tried before, with various languages, but Ruby was way too slow and Clojure’s interface to SAX seemed…pretty difficult, for some reason. Would Smalltalk be any different/easier?

Yes (both different and easier). Well…the parser was easier, anyways. I made my own subclass of the SAXHandler class, and gave it a message for collecting data from each element:

startElement: elementName attributeList: attributeList
   (elementName = 'row')
      ifTrue: [self saveAttributes: attributeList]
      ifFalse: [Transcript show: 'Parsed "', elementName, '" is not a <row> element. Skipping.'; cr]

Then, I defined a saveAttributes message (so I don’t have to rewrite too much if I want to find different data later). I started out simply getting the reputations from all the people and putting them in a Bag: a dictionary that keeps a counter for each different item in it (so, if there are 5 1s, the bag would have a key “1” and value “5”):

saveAttributes: attributeList
   "This method should be edited to provide proper searching for the desired attributes."
   | foundData |
   foundData := (attributeList at: 'Reputation') asNumber.
   self data add: foundData.

Note that “data” is a reference to an instance variable that is set to a new Bag when an instance of the class is created. It took me a while to figure out how to get this thing to actually return the bag so I could do stuff with it, and I finally ended up rewriting the SAXHandler’s message parseDocumentFromFile: (I named mine the much shorter parseFile:). I rewrote it so that it creates the SAX driver and handler and starts the parsing all in one method, and gave it a return statement that returns the data instance variable.

parseFile: fileName
   | stream driver parser |
   stream := FileDirectory default readOnlyFileNamed: fileName.
   driver := SAXDriver on: stream.
   driver validating: true.
   parser := self new driver: driver.
   parser useNamespaces: false.
   parser startDocument.
   parser parseDocument.
   ^ parser data

Thus, I could assign the results of the message to something or other like so (in a workspace):

aBag := StackOverflowParser parseFile: 'C:/dir/subdir/file.xml'.

Published on February 16, 2010 at 2:04 am

Data Collections Update: The First One!

It’s time to add some links to my dataroll (like a blogroll, but with data instead of blogs).

First, there’s Stack Overflow. Did you know that they, combined with Server Fault and Super User, dump all their data every month? Yep. Get the meat here, but be warned: some of the files are downright massive (I’m talking gigabyte-plus XML files). Don’t load a whole file into memory at once! Try something that streams the XML files, like StAX.
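If you'd rather not set up StAX, here is a minimal sketch of the same streaming idea in Python, using the standard library's iterparse. It assumes the dump looks like a root element wrapping lots of <row> elements that carry a Reputation attribute; the <users> root and the sample values below are made up for illustration, and count_reputations is just a name I picked.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_reputations(source):
    """Stream <row> elements from source and tally their Reputation values."""
    counts = Counter()
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the start of the root element; keep a handle on it.
    _event, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "row":
            if "Reputation" in elem.attrib:
                counts[int(elem.attrib["Reputation"])] += 1
            # Throw away rows we have already handled so they don't pile up
            # under the root while the rest of the file is read.
            root.clear()
    return counts


# Tiny made-up sample standing in for a real dump file (assumed layout).
SAMPLE = b"""<users>
  <row Id="1" Reputation="1" />
  <row Id="2" Reputation="1" />
  <row Id="3" Reputation="250" />
</users>"""

reputation_counts = count_reputations(io.BytesIO(SAMPLE))
print(reputation_counts)
```

For a real file you would pass the file name (or an open file) instead of the BytesIO sample. The point is that only one row is ever held at a time, so the size of the file stops mattering much.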

Then, there’s something called Project Vote Smart. They have an interesting API for public access that includes all sorts of information about political candidates, such as addresses, nicknames, birthplace, religion, and more. Their API also provides access to other fun things like info on cities, states, elections, committees, and districts. I haven’t had a whole lot of time to play around with it, but there might be potential here!

In keeping with the political aroma (it’s a sort of stuffy, cloying scent), another exciting website is Data.gov (and its British counterpart, which is supposedly better but might not be?). Data.gov has 1074 raw data entries, on all sorts of topics. Definitely worth checking out. Might impress your political science teacher, somehow! (I don’t know how to impress mine with this stuff yet, but I’ll let you know if I manage).

Finally, there is a website called Freebase. It seems to be a sort of attempt to programmatically access Wikipedia-esque entries on everything. I highly recommend giving it a look-see; you might be inspired!

Published on January 25, 2010 at 6:26 pm

Get some Dataz: Personal Finance

Maybe you want to be like those people who have all sorts of graphs about their life. Maybe you think it would be cool to be able to program something that will tell you why you do things, or at least what you are doing (because you sure don’t know!). Maybe your doctor was surprised when you said you were an engineer, because he expected engineers to have lots of graphs about their health.

Well, sorry, Doc: I’m not recording facts about my health until I figure out how to automate it. In the meantime, maybe I can graph something else about me: Personal Finance.

Okay, well, you could do this with Excel or Google Docs, if you wanted to be old school. The only thing is, you would have to enter all your financial data manually (ew) and that’s not the Programmer Way. You could try scraping your bank account’s website for your data (I tried this. Their login system was too complicated.) Or, you could try using Mint.

Mint is cool. I hardly use most of its features, like budgets and SMS alerts, but what I like about it is that it makes recording personal finance easy. See, once you give Mint your account data (it’s safe!), it will generate some nice info for you. It’ll even graph stuff for you. It also has a bunch of organizational tools like categorizing and tagging purchases. Nice, right? However, the biggest thing about Mint: you can export your transaction history into a CSV (Comma,Separated,Values) file. The feature is sort of hidden: there’s just an unobtrusive button right underneath your transaction list. Once you have it saved on your computer in CSV, of course, you can play with it however you want. Maybe throw some Python or Ruby at it, or try parsing it with R. I, for one, am going to try my hand with Clojure, my latest hobby.
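To give a taste of the "throw some Python at it" option, here is a rough sketch that adds up spending per category. The column names (Date, Description, Amount, Category) and the sample rows are assumptions made up for the example, so check them against the header line of your own export; total_by_category is likewise just a name I picked.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal


def total_by_category(csv_file):
    """Sum the Amount column for each Category in an exported CSV."""
    totals = defaultdict(Decimal)  # Decimal() starts each category at 0
    for row in csv.DictReader(csv_file):
        # Decimal avoids the rounding surprises floats give you with money.
        totals[row["Category"]] += Decimal(row["Amount"])
    return dict(totals)


# Made-up sample rows; the column names are an assumption about the export.
SAMPLE = """Date,Description,Amount,Category
1/20/2010,Coffee Shop,3.50,Restaurants
1/21/2010,Bookstore,12.00,Shopping
1/22/2010,Diner,8.25,Restaurants
"""

totals = total_by_category(io.StringIO(SAMPLE))
for category, amount in sorted(totals.items()):
    print(category, amount)
```

With a real export you would open the saved file (with newline="" as the csv module recommends) and pass that in instead of the StringIO sample.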

Published on January 24, 2010 at 1:35 am