For today's Friday Puzzler (yes, I know it's been a while), I have something of a doozy. It may not be a 5-minute puzzle, but it could still be fun, and most of all, it will be helpful to the Model-Glue team. The Model-Glue docs (http://docs.model-glue.com/) were written using RoboHelp, and unfortunately, the original source files are missing. We need to get an "export" of the docs so that they can be republished in a new format. Your task, if you choose to accept it, is to write a scraper that can download and store each page of the documentation. It needs to keep the HTML for layout purposes, so plain text alone won't do.
Anyone up for that challenge?
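To make the task concrete, here is a minimal sketch of one way to approach it, not a finished tool. It assumes the pages are reachable through ordinary links and frames on a single host; RoboHelp output may build some of its navigation with JavaScript, in which case a plain link crawl like this could miss pages. The names crawl, extract_links, fetch_page and save_pages are made up for this illustration.

```python
from collections import deque
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects link targets from <a href>, <frame src> and <iframe src>."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = {"a": "href", "frame": "src", "iframe": "src"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(value)


def extract_links(html, base_url):
    """Return absolute URLs found in html, with #fragments removed."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, link)).url for link in parser.links]


def fetch_page(url):
    """Download one page as text (assumes UTF-8; bad bytes are replaced)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_page, max_pages=500):
    """Breadth-first crawl of one host; returns {url: html}."""
    host = urlparse(start_url).netloc
    pages = {}
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html  # keep the HTML untouched for layout purposes
        for link in extract_links(html, url):
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def save_pages(pages, out_dir):
    """Write each page under out_dir, mirroring the URL path."""
    for url, html in pages.items():
        rel = urlparse(url).path.lstrip("/")
        if not rel or rel.endswith("/"):
            rel += "index.html"
        target = Path(out_dir) / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(html, encoding="utf-8")
```

Something like save_pages(crawl("http://docs.model-glue.com/"), "mg-docs") would then mirror whatever the crawl can reach. The fetch parameter is injectable so the crawl logic can be tried against a fake site before pointing it at the real one.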
Comment 1 written by John Bliss on 13 March 2009, at 8:40 AM
http://softbytelabs.com/us/bw/
Comment 2 written by James Buckingham on 13 March 2009, at 8:41 AM
I did just that when a competitor of ours "stole" content off our site.
Comment 3 written by Steve Ross on 13 March 2009, at 8:50 AM
Comment 4 written by John Lyons on 13 March 2009, at 8:57 AM
http://www.gnu.org/software/wget/
Comment 5 written by Raymond Camden on 13 March 2009, at 9:10 AM
The other view is - I'd like to see a CF solution too, just for fun. :)
Comment 6 written by Rand Thacker on 13 March 2009, at 10:22 AM
I never knew about the recursive option. Amazing.
I learned something new today: Thanks John Lyons (and Ray for issuing the challenge to begin with).
Comment 7 written by Steve Ross on 14 March 2009, at 9:27 AM
Check out http://scrubyt.org. It lets you grab parts of pages (and even interact with them).
For example:
ebay_data = Scrubyt::Extractor.define do
  fetch 'http://www.ebay.com/'
  fill_textfield 'satitle', 'ipod'
  submit
  click_link 'Apple iPod'
  record do
    item_name 'APPLE NEW IPOD MINI 6GB MP3 PLAYER SILVER'
    price '$71.99'
  end
  next_page 'Next >', :limit => 5
end
Comment 8 written by Rick Stone - RoboHelp ACE on 14 March 2009, at 1:07 PM
If so, I have a page up at the link below that will assist in reverse engineering things.
http://tinyurl.com/2g8kd6
If anyone creates a utility to grab this content and convert it to basic HTML pages, I'd love to know about it so I may steer folks to obtain it if needed. It would be a nice one to see.
Cheers all... Rick :)
Comment 9 written by Ed Bartram on 24 March 2009, at 4:01 PM
I kept the code simple using CFHTTP calls and looping through the pages using FindNoCase() to strip out the desired content.
Is this what you were looking for?
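As a rough illustration of the find-and-strip approach described in this comment (not Ed's actual code, which isn't shown here), the sketch below does a case-insensitive search for a start and end marker and keeps the HTML between them. The helper names and the body-tag markers are assumptions chosen for the example. Note that ColdFusion's FindNoCase() is 1-based and returns 0 when nothing matches, while this Python version is 0-based and returns -1.

```python
import re


def find_no_case(needle, haystack, start=0):
    """Case-insensitive substring search.

    Returns the 0-based index of the first match at or after start,
    or -1 if there is none.
    """
    match = re.compile(re.escape(needle), re.IGNORECASE).search(haystack, start)
    return match.start() if match else -1


def extract_between(html, start_marker, end_marker):
    """Return the HTML between two markers, or None if either is missing."""
    begin = find_no_case(start_marker, html)
    if begin == -1:
        return None
    begin += len(start_marker)
    end = find_no_case(end_marker, html, begin)
    if end == -1:
        return None
    return html[begin:end]


# Hypothetical page and markers, just to show the call
page = "<html><BODY><h1>Title</h1><p>Text</p></body></html>"
content = extract_between(page, "<body>", "</body>")
```

Because the slice is taken from the original string, the markup inside the extracted region is preserved exactly, which matters when the HTML has to survive for layout.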
Comment 10 written by Raymond Camden on 25 March 2009, at 12:31 PM