Scrax is a scraper designed for converting HTML archive pages from web comics into RSS feeds. This is quite useful for tracking many web comics at once, especially those with odd update schedules. It can be used for other, similar things too.
Scrax has two parts. First, a module that provides a few common functions that should work for a large majority of sites. Second, a set of scripts that use this module. These scripts look at specific web pages, parse out the links, and produce RSS.
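To give a rough idea of what "parse out links and produce RSS" involves, here is a minimal sketch in plain Ruby. It does not use the Scrax API: the names `xml_escape`, `extract_links`, and `links_to_rss`, the sample HTML, and the example.com URLs are all made up for illustration. A real scraper would likely use a proper HTML tokenizer or parser rather than a regular expression.

```ruby
# Hypothetical sketch: NOT the Scrax API. Shows the general idea of turning
# archive-page links into an RSS 2.0 feed using only core Ruby.

# Characters that must be escaped in XML text.
XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

def xml_escape(text)
  text.gsub(/[&<>"]/, XML_ESCAPES)
end

# Pull [url, title] pairs out of simple <a href="...">title</a> anchors.
# A regular expression is only good enough for tidy, predictable markup.
def extract_links(html)
  html.scan(/<a\s[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/mi)
end

# Build a minimal RSS 2.0 document from the [url, title] pairs.
def links_to_rss(feed_title, feed_url, links)
  items = links.map do |url, title|
    "    <item>\n" \
    "      <title>#{xml_escape(title)}</title>\n" \
    "      <link>#{xml_escape(url)}</link>\n" \
    "    </item>\n"
  end
  "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" \
  "<rss version=\"2.0\">\n" \
  "  <channel>\n" \
  "    <title>#{xml_escape(feed_title)}</title>\n" \
  "    <link>#{xml_escape(feed_url)}</link>\n" \
  "    <description>#{xml_escape(feed_title)}</description>\n" \
  "#{items.join}" \
  "  </channel>\n" \
  "</rss>\n"
end

# Made-up archive page for demonstration.
archive = <<HTML
<ul>
  <li><a href="http://example.com/comic/1">Page 1</a></li>
  <li><a class="new" href="http://example.com/comic/2">Page 2</a></li>
</ul>
HTML

puts links_to_rss('Example Comic', 'http://example.com/', extract_links(archive))
```

The sketch keeps link extraction and feed generation as separate steps, which is roughly the split between site-specific parsing and shared output code described above.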
The module and the scripts that use it are now distributed as separate archives, mostly because the scrax module is now a Ruby gem.
HTMLtoRSS was my first attempt at this. I think Scrax is a lot cleaner, but it isn’t quite as generic.
First, install the module. The best way is to use the gem; otherwise, unpack the zip somewhere and make sure that location is in Ruby’s load path.
Second, if you want to use the scrapers I wrote, unpack the Comics zip somewhere. They should just run from wherever you put them once the scrax module is properly installed.
This is a half version of sorts. I’m in the midst of reworking most of the guts. I think this is a fairly stable turning point where I [hopefully] haven’t broken anything, but am still bringing in the new stuff. Mostly what is happening is that I’m trying to make the Scrax API more flexible. Personally, I would like to be able to build quick-ish one-shot styled Scraxes in irb, and the previous API made that painful. I’m not there yet, but I think I’m moving in the right direction.
Scrax scripts output RSS 2.0, and are intended to be used within something like NetNewsWire 2.0. Use the included ‘runnable.rb’ script to set the execution bit on the comic-specific scripts. Then add the scripts with ‘File > New Special Subscription > Script…’. You then need to open the info window for the new script, and under ‘Script Settings’ change the type to ‘Shell’.
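The included ‘runnable.rb’ is the tool to use here, and its contents are not reproduced in this document. Purely as an illustration of what setting the execution bit involves, a sketch in Ruby might look like the following. The function name `make_runnable` and the `*.rb` glob are assumptions, not taken from Scrax.

```ruby
# Hypothetical sketch: NOT the contents of runnable.rb. Adds the execute
# bits to every Ruby script directly inside a directory.

def make_runnable(dir)
  Dir.glob(File.join(dir, '*.rb')).each do |path|
    mode = File.stat(path).mode & 0o7777   # keep only the permission bits
    File.chmod(mode | 0o111, path)         # add execute for user, group, other
  end
end
```

Calling `make_runnable` with the directory where the Comics zip was unpacked would mark each script there as executable, leaving its other permission bits unchanged.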
If a scrax script is acting funny, the first thing to try is deleting its saved state. On Macs, delete files in:
~/Library/Application Support/Scrax
On Unixes:
Installing via gem usually generates the API documentation automatically for you. Otherwise you can just run
rdoc scrax/*.rb
to generate it.
- Ruby 1.8
- htmltokenizer 1.0
- builder 2.0
- Hpricot 0.4
Scrax doesn’t require Hpricot, but a couple of the comic scraping scripts do. Future versions of Scrax will most likely require Hpricot.
Changes since Release 0.13
- Start of API redesign.
- Start using the better date finding/parsing code.
- Switch catharsiscomic to custom work; need to get around date format flipping.
- Rebuild countyoursheep scraper with new API design.
- Oots changed the way they do things.
- New comics Castlevania RPG, SkyFall, and Friendly Hostility
- Moved out some broken scrapers I don't feel like fixing.
- Added atom output, not used or accessible at the moment.