"MY GOD JIM"
I feel like I'm in an episode of Star Trek all of a sudden (but I always thought I'd be Scotty, not Bones...)
Dude, are you telling me you just
WROTE a site ripper?? :shock: :hypno:
Based on your explanation, that's essentially what you're describing. I'm kind of shocked; you really are as crazy as me! :clap:
:imp:
By all means, I'd love to check it out. I wasn't going to post to this thread again till I had something worth reporting, but I guess I do (knock on wood). I totally scrapped the first-run stuff (including the program itself and all the iterations of corrupted crap it made) and am currently running a different one that seems to be doing much better so far.
At the time of this posting, it has been running 15 hours 7 min 28 sec and has downloaded 132,684 HTML files and 723 JPGs/GIFs. One thing I did here really, really helped the rip out: I posted a new thread in the 1.0 DF, right up top on page one, that contains direct links to every single index page in the archives. To the eye of the ripper, this turned the archive from something dozens or hundreds of levels deep into something 3 levels deep all the way through.
This made the crawl much faster and also ensured the ripper quickly and easily mapped out the ENTIRE archive right at the very start of the job, as opposed to crawling down and down, blindly guessing where it's going or how deep it gets (not sure why, but they all seem to choke out eventually that way)...
So anyway, the current rip looks like it may just work (not trying to jinx myself though). I'd love to see what you made, and if this one pukes out in the end as well, I'd love to give it a shot. There is one thing I need to ask, though. How hard would it be for you to revise its code so it could do a blanket rename operation on all the HTML files it creates (as opposed to naming them exactly as found in the URLs)? Btw, this would entail not just renaming the resulting file but also converting every occurrence of that link in all the other saved files to the new filename. Go here and you'll see the index shell that illustrates what I mean (hover over a link and check your status bar, or right-click and view properties, to see the revised link naming convention):
http://xwebarchives.org/1p0
I ask this because the original 1.0's URL titles are huge: post headlines converted to addresses, full of strange characters that I'm finding make for tough ripping as well as possible browsing issues in the rip. The rip I am making now is tweaked to rename ALL saved HTML files to DOS-compliant 8.3 names and revise all link occurrences in the rip to those names as it goes. I'm going to put it all in a flat directory (meaning all in the same giant folder) for easier browsing/linking/usage once it's online. So is a blanket HTML file rename operation something you could incorporate into this without too much effort? If not, that's totally OK; I could probably use a mass file editor to splice the link name changes in after the fact too, but it would be a lot more time consuming.
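In case it helps picture what I'm asking for, here's a rough Python sketch of the blanket rename idea. Everything in it is hypothetical (the 8.3 naming scheme, the function names); it assumes the rip is already one flat folder of saved HTML, and a real version would also need to handle URL-encoded links (%20 and friends), which this one doesn't.

```python
import os

def dos_name(index):
    """Generate a hypothetical 8.3-compliant filename like P0000001.HTM."""
    return "P{:07d}.HTM".format(index)

def rename_and_relink(root):
    """Rename every HTML file in the flat folder `root` to an 8.3 name,
    then rewrite every occurrence of the old names inside all saved files."""
    # Pass 1: build an old-name -> new-name map for every HTML file.
    mapping = {}
    for i, fname in enumerate(sorted(os.listdir(root)), start=1):
        if fname.lower().endswith((".html", ".htm")):
            mapping[fname] = dos_name(i)

    # Pass 2a: rename the files on disk.
    for old, new in mapping.items():
        os.rename(os.path.join(root, old), os.path.join(root, new))

    # Pass 2b: rewrite links in every saved file, longest old names first
    # so a shorter name that happens to be a prefix never clobbers a match.
    for fname in os.listdir(root):
        path = os.path.join(root, fname)
        with open(path, encoding="utf-8", errors="replace") as fh:
            text = fh.read()
        for old, new in sorted(mapping.items(), key=lambda kv: -len(kv[0])):
            text = text.replace(old, new)
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(text)
    return mapping
```

Basically the same idea as wget's --convert-links pass, just with a rename thrown in, so maybe not a huge change to bolt on.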
Jim you rock! You are officially a "hardcore" Xwebber! :clap: