Google Summer of Code 2010: Open States

This post originally appeared at but has moved here for posterity.

Hello! I’m Gabriel, a fourth-year Computer Engineering student at the University of Puerto Rico in Mayagüez. This summer I worked as a GSoC student developing new scrapers for the Open States Project. I worked on Colorado, Hawaii, Washington, Oregon, and the territory of Puerto Rico. I really enjoyed the whole experience. The work is very fulfilling, and coding in Python is always delightful and fun.

Writing scrapers can pose a series of problems. The Internet is full of inconsistent, unstructured, and badly written HTML. Thankfully, the lxml.html library is very good at handling all kinds of HTML, and it does so quite fast. It also has a powerful, well-documented API that makes scraping a whole lot easier. Still, inconsistencies can hurt. For example, a site sometimes uses different styles of HTML for different years. And sometimes the way the HTML is structured doesn’t help with scraping at all, and one has to resort to regular expressions and other techniques.
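
To make that last point concrete, here is a minimal sketch of the kind of fallback I mean. It is not code from the Open States repository: the two sample pages, the `scrape_bills` function, and the `BILL_LINE` pattern are all made up for illustration. The idea is to try the structured lxml.html route first and drop down to a regular expression only when the expected markup is missing.

```python
import re

import lxml.html

# Hypothetical pages: the same bill list published with two different
# HTML styles in two different years.
PAGE_2009 = """
<html><body>
<table id="bills">
  <tr><td class="bill-id">HB 1001</td><td class="title">Concerning water rights</td></tr>
  <tr><td class="bill-id">SB 12</td><td class="title">Concerning school funding</td></tr>
</table>
</body></html>
"""

PAGE_2010 = """
<html><body>
<p>HB 1001 - Concerning water rights<br>
SB 12 - Concerning school funding
</body></html>
"""

# Hypothetical pattern for lines such as "HB 1001 - Some title".
BILL_LINE = re.compile(r"^\s*([HS]B\s*\d+)\s*-\s*(.+?)\s*$", re.MULTILINE)


def scrape_bills(html):
    """Return a list of (bill_id, title) tuples from either page style."""
    doc = lxml.html.fromstring(html)

    # Structured case: the table gives us clean cells to query with XPath.
    rows = doc.xpath('//table[@id="bills"]//tr')
    if rows:
        return [
            (
                row.xpath('string(td[@class="bill-id"])').strip(),
                row.xpath('string(td[@class="title"])').strip(),
            )
            for row in rows
        ]

    # Unstructured case: no useful tags, so match the plain text instead.
    return BILL_LINE.findall(doc.text_content())


if __name__ == "__main__":
    for page in (PAGE_2009, PAGE_2010):
        print(scrape_bills(page))
```

Both pages yield the same list of tuples, so the rest of a scraper would not need to care which year's markup it was given. A real scraper would probably need more than two cases, but the shape stays the same.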

Writing the scrapers was more difficult for me at first. Something that helped a lot early on was looking at the scrapers for other states to see how they dealt with recurring problems. I also constantly checked the lxml.html, Python, and Open States documentation. All three are well documented, and I never had much of a problem on that front. Later I got more accustomed to the process and things went quite smoothly.

I definitely want to continue contributing to the Open States Project however I can. Sunlight is a great organization with a very important mission, and the Open States Project is definitely one of its most important projects. I would like to thank my mentor James Turk for all of his help and for being so cool and accommodating. I would also like to thank Google for helping the FOSS community and students through the Google Summer of Code program.