The first thing I did when I started with Squeak was to see if I could program a nice interface for parsing Stack Overflow data. I’ve tried before, with various languages, but Ruby was way too slow and Clojure’s interface to SAX seemed…pretty difficult, for some reason. Would Smalltalk be any different/easier?
Yes (both different and easier). Well…the parser was easier, anyways. I made my own subclass of the SAXHandler class, and gave it a message for collecting data from each element:
startElement: elementName attributeList: attributeList
    "Save the attributes of each row element; log and skip anything else."
    (elementName = 'row')
        ifTrue: [self saveAttributes: attributeList]
        ifFalse: [Transcript show: 'Parsed "', elementName, '" is not a row element. Skipping.'; cr]
Then I defined a saveAttributes: message, so I don’t have to rewrite too much if I want to find different data later. I started out simply collecting everyone’s reputation and putting it in a Bag: a collection that keeps a count for each distinct item in it (so, if there are five 1s, the bag holds the item “1” with a count of “5”):
saveAttributes: attributeList
    "This method should be edited to provide proper searching for the desired attributes."
    | foundData |
    foundData := (attributeList at: 'Reputation') asNumber.
    self data add: foundData.
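To make the Bag behaviour concrete, here is a small workspace sketch of the counting described above. The sample reputations are made up for illustration and are not taken from the Stack Overflow data; the variable name is just a placeholder.

```smalltalk
"Workspace sketch; the sample values below are invented for illustration."
| reputations |
reputations := Bag new.
#(1 1 1 1 1 101 101 2500) do: [:each | reputations add: each].

reputations occurrencesOf: 1.      "5: five entries with a reputation of 1"
reputations occurrencesOf: 101.    "2"
reputations occurrencesOf: 7.      "0: items never added simply count as zero"
reputations size.                  "8: every occurrence counts toward the size"
reputations asSet size.            "3: number of distinct reputation values"
```

Evaluate each of the last five lines with "print it" to see the counts.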
Note that “data” is an instance variable that is initialized to a Bag when an instance of the class is created. It took me a while to figure out how to get this thing to actually return the bag so I could do stuff with it, and I finally ended up rewriting the SAXHandler’s message parseDocumentFromFile: (I named it the much shorter parseFile:). I rewrote it so that it creates the SAX driver and handler and starts the parsing all in one method, and gave it a return statement that answers the data instance variable.
parseFile: fileName
    "Class-side method: parse the named file and answer the collected data."
    | stream driver parser |
    stream := FileDirectory default readOnlyFileNamed: fileName.
    driver := SAXDriver on: stream.
    driver validating: true.
    parser := self new driver: driver.
    parser useNamespaces: false.
    parser startDocument.
    parser parseDocument.
    ^ parser data
Thus, I could assign the results of the message to something or other like so (in a workspace):
aBag := StackOverflowParser parseFile: 'C:/dir/subdir/file.xml'.
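Once the bag is in hand, a few standard collection messages are enough to start poking at it. This is only a sketch of the kind of thing you could evaluate in the same workspace; it assumes aBag is the Bag of reputations answered by parseFile: above, and that reputations are never negative.

```smalltalk
"Assumes aBag is the Bag answered by parseFile: above."
aBag size.                 "how many reputations were collected in total"
aBag asSet size.           "how many distinct reputation values occur"
aBag occurrencesOf: 1.     "how many users have a reputation of exactly 1"

"Highest reputation seen, starting from 0 (assumes no negative values)."
aBag inject: 0 into: [:highest :each | highest max: each].
```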
