*"Best Of Xweb 1.0"*?- Update

thank you

It's the fact that I know you guys love and use this site that keeps me so motivated to work on it.

As a side note, it seems I've got my foot in my mouth again here; maybe I'll just quit posting the blow-by-blow 'til I get something that actually works!! :rolleyes: Sometime last night while I slept, the dying, choking ripper crashed and, as a last-gasp spit at my efforts, took the other ripper right out with it, at about 30k posts...

So, we're still workin' it. Going to re-re-re-restart the rip using the new program today and hopefully see what it can pull off (pun intended) by the weekend, without its dysfunctional/jealous compatriot throwing a stick in its spokes...
 
Hi Mac

I started working on a little file manipulator for you to play with. It is a simple (so far, anyway) script written in Tcl (Tool Command Language).

Basically, it will read a list of file names from a text file (I am thinking this would be a list of the html files you ripped). After reading an html file name, it will find that file and open it for reading, then open a file of the same name for output in another directory. At that point it will read through the lines of the input html file one at a time, test each line against a set of exclusion rules, and only write the "good" lines into the output html file. Does that sound like a plan??
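Something along these lines, just a bare-bones sketch (the list file name, the directory names, and the exclusion patterns below are placeholders, not the actual script):

[CODE]
# Read a list of html file names, copy each file line by line into another
# directory, and skip any line that matches one of the exclusion patterns.

set exclusions {"google_ad" "<!-- nav" "banner.gif"}   ;# placeholder rules

set listfile [open "files.txt" r]
while {[gets $listfile name] >= 0} {
    if {$name eq ""} { continue }

    set in  [open [file join "input"  $name] r]
    set out [open [file join "output" $name] w]

    while {[gets $in line] >= 0} {
        set keep 1
        foreach pat $exclusions {
            if {[string first $pat $line] >= 0} { set keep 0; break }
        }
        if {$keep} { puts $out $line }
    }

    close $in
    close $out
}
close $listfile
[/CODE]

The real rules would be whatever you want stripped (ad blocks, nav bars, etc.); each pattern here is just a plain substring match.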

Tcl isn't the fastest language performance-wise, but it is pretty straightforward. Would you like me to email you a zip file with the sample script so you can see what I am talking about?
 
"MY GOD JIM"

I feel like I'm in an episode of Star Trek all of a sudden (but I always thought I'd be Scotty, not Bones...)

Dude, are you telling me you just WROTE a site ripper?? :shock: :hypno:

Based on your explanation, that's essentially what you are describing. I am kind of shocked; you really are as crazy as me! :clap: :pimp: :D

By all means, I'd love to check it out. I wasn't going to post to this thread again 'til I got something worth reporting, but I guess I do (knock on wood). I totally scrapped the first-run stuff (including the program itself and all the iterations of corrupted crap it made) and am currently running a different one that seems to be doing much better so far (knock on wood)...

At the time of this posting, it has been running 15 hours 7 min 28 sec and has downloaded 132,684 htmls and 723 jpg/gifs. I did one thing here that really, really helped the rip out: I posted a new thread in the 1.0 DF, right up top on page one, that contains direct links to every single index page in the archives. This (to the eye of the ripper) turned the archive from something dozens or hundreds of levels deep into something 3 levels deep all the way through.

This made the crawl much faster and also ensured the ripper mapped out the ENTIRE archive much more quickly/easily right at the very start of the job, as opposed to crawling down, down, down as it goes, blindly guessing where or how deep it's going (not sure why, but they all seem to choke out eventually going that route)...
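(For anyone curious, that "flat index" page is conceptually nothing more than one page of direct links. A purely hypothetical sketch in Jim's Tcl, just to show the idea; the URL pattern is invented, the real archive addresses go in its place, and 361 is the total index-page count:)

[CODE]
# Emit one page of direct links to every archive index page so a ripper
# sees the whole archive at depth 3 instead of crawling blind.
# The URL pattern below is invented, purely for illustration.

set out [open "master_index.html" w]
puts $out "<html><body>"
for {set page 1} {$page <= 361} {incr page} {
    puts $out "<a href=\"http://example.com/archive/index.php?page=$page\">Archive index page $page</a><br>"
}
puts $out "</body></html>"
close $out
[/CODE]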

So anyway, the current rip shot looks like it may just work (not trying to jinx myself, though). I'd love to see what you made, and if this one pukes out in the end as well, I'd love to give it a shot. There is one thing I need to ask, though. How hard would it be for you to revise its code so that it could do a blanket rename operation on all the html files created (as opposed to just naming them exactly as found in the urls)? And btw, this would entail not just renaming the resulting file but also converting every occurrence of that link in all the other saved files to the new renamed filename. Go here and you'll see the index shell that illustrates what I mean (hover over a link and check your status bar, or right-click and choose Properties, to see the revised link naming convention): http://xwebarchives.org/1p0

I ask this because the original 1.0's url titles are huge and contain post headlines converted to addresses, full of strange characters that I'm finding make for tough ripping as well as possible browsing issues in the rip. The rip I am making now is tweaked to rename ALL saved html files to DOS-compliant 8.3 names and revise all link occurrences in the rip to those names as it goes. I'm going to put it all in a flat directory (meaning all in the same giant folder) for easier browsing/linking/usage once it's online. So is a blanket html file rename operation something you could incorporate into this without too much effort? If not, that's totally ok; I could probably use a mass file editor to splice the link name changes in after the fact too, but it would be a lot more time-consuming.

Jim, you rock! You are officially a "hardcore" Xwebber! :clap: :cool:
 
No no no :)

I did not write a site ripper. That would be hard. :)

Earlier you said you would like to trim the size of the ripped archive. To do this you said you would need some type of file manipulation help to strip unwanted lines from the html files produced by the ripper.

All I have done is toss together a small example of what can be done fairly easily to manipulate the file contents.

I am not exactly sure what can be done about the file naming convention you used. I will have to look into it and see what I can come up with.
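My first rough idea would be something along these lines, though it is only a sketch and I have not tried it on anything near your file counts. It assumes every ripped page ends up in one flat folder; the "rip" directory and the p000001.htm naming scheme are placeholders, not your actual convention:

[CODE]
# Sketch: give every .html file in one flat folder a short 8.3-style name,
# rewrite every occurrence of the old names inside every file, then save
# each file under its new name and drop the original.

set dir "rip"
set files [lsort [glob -directory $dir -nocomplain *.html]]

# 1. Build the old-name -> new-name map.
set n 0
set maplist {}
foreach f $files {
    set newname [format "p%06d.htm" [incr n]]
    set newfor([file tail $f]) $newname
    lappend maplist [file tail $f] $newname
}

# 2. Rewrite the links in each file, then rename the file.
foreach f $files {
    set in [open $f r]
    set text [read $in]
    close $in

    set out [open [file join $dir $newfor([file tail $f])] w]
    puts -nonewline $out [string map $maplist $text]
    close $out
    file delete $f
}
[/CODE]

Whether that stays fast with 100,000+ files is the part I would still have to look into.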
 
cool :)

Ok, ok, I misunderstood you, I guess; I thought when you were talking about pulling files to manipulate, you meant pulling them off the site as opposed to off the hdd. Yes, absolutely, I would love to check that out, as there are only a couple of off-the-shelf options for that I'm looking at here and I'm not entirely thrilled with any of them. Let me ask you this: how large a folder, or how many files, could your script handle? We're talking about making a search/find/remove/replace-type change to 100,000+ files in a folder, probably... I'd love to check it out, heck yeah, email away! :cool:
 
failed

Latest rip resulted in failure. I did get all the index pages, though, which is the best result yet, and some of the post contents, but most of the post content pages somehow came out as binary garbage... back to the drawing board... I may just rip one or a few index pages and their contents at a time so I can progress in steps and rectify such errors as I go... obviously this is going to take more time... the saga continues...

In the meantime, I did put up a fully navigable rip of the index pages, just for mockup grins (there are no post contents yet though, and the urls are likely to all change again next rip, but for now it's at least something to look at anyway)...
 
success! (partial)

HOORAY!! This damn thing might fly after all; it looks as though I just pulled off a mini-rip of all the index pages plus the full post contents for the first 30 of them :clap: I figured out that if I do it in smaller 'sorties' instead of one giant 'battle', I seem to be making some ground here. Shooting for the next 30-page chunk now. I will upload what I've got so far to the archive server to play with soon... more to come...
 
CLOSURE? :)

Just wanted to bring closure to this saga: I have now figured it out! I had to set up a whole mirror index on our archive server linking to every post over on the old system and then rip from the mirror indexes (weird, I know, but it is finally working!)

I currently have about the first 100 archived index pages' worth of stuff on my hdd (this includes the indexes themselves and all posts' contents) out of 361 indexes total, and will be working on the rest over the next few days. I am doing it in sections of about 25-50 index pages (& their post contents) per run, and assembling the segments on my laptop to create the master, which will be uploaded when complete, probably in the next week or so.

Once that is done, and up, and verified working etc., and Google starts indexing it for our custom search widget, we'll probably just delete the existing mirror section we were lining up to do the "mass sort" on, get all that back out of the son of db, and link off to the rip mirror.

I appreciate those who offered support and participation throughout this ordeal. To quote Ricardo, "thanks for watching!" :clap: :cool: :)
 
:)

As a side note, the previous #s were skewed too; by the new numbers, by the time this is done it's looking like it's gonna be around 150k+ posts :shock: :hypno:
 
bummer!

Well, this just never seems to end. I finally got the whole archive on my hdd (it is in fact closer to 200k htm files of posts), and everything looked good (99%, only a couple hundred? errored posts out of the whole 200k), so I moved on to the mass file edits to tweak the links etc. for our purposes. Then, just about the time I am almost ready to start uploading all this to the archive server, I discover to my horror that the ripper has somehow managed to dump/remove 99% of all the smilies ever posted :mad: :sigh:

SO, either I've got to figure out why that happened and start the rip over, or the whole archive will just not have more than 3 or 4 smilies anywhere in it... which would make many posts look kinda weird, considering we used them rather a lot...


[Update: 3/6/09 - Got a successful re-re-re-re-re-rip: got indexes, got post contents, got smileys :) Mass-editing links now, and posted a new thread to announce that uploading commences... http://xwebforums.com/forum/index.php?posts/10843/]
 