"Best Of Xweb 1.0"?- Update

User1 · Feb 20, 2009

[UPDATE: The project outlined below has been placed "on hold" while I explore a final site rip option, read further down, stay tuned -Mac]

---

Who's got some free time to kill? :shh: :whistle:

I am recruiting a few (or?) volunteers to assist with setting up the 1.0 archive mirror on 'Son Of'

Qualifications are:

You're here often enough you consider yourself a 'regular' (be it lurker or poster)
You're comfortable enough with the basics of navigating the new site controls
You've got some free time to kick in occasionally tinkering with such a project

Work involves:

I can grant your account "move thread" permission applying only when you're in the mirror section, and as a team working in free time we do a mass sort of the current mirror contents, examining threads and moving them from the main mirror into either a 'tech' or 'nontech' subforum depending on the content of each thread. At present no transferring is involved, no posting, no editing, no deleting, just moving what we've got so far into appropriate buckets.

If you may be interested in taking part in this let me know by posting a reply here or sending me an email seattlex19 at fastmail.fm

User1 · Feb 20, 2009

BTW it may help to add

The primary goal of this sorting effort is not just for separation's sake. The driving concern in reality is that the original archive is simply too large to snap "everything" manually in this lifetime (or at least not any time soon)

Since we cannot get a direct db rip, and all the ripper programs I've tried have produced bunk/flawed results (not to mention that the output would still not be integrated or searchable from here), doing a mass manual dupe/transfer seems the only viable (if time-consuming) option.

I view duping the archive in some form as a nonquestion, it is a must, eventually.

Again though due to the original's sheer size and the constraints we're facing, it seems it may only be possible to do a sort of "Best Of SKIM" that would become the "condensed" 'Son Of' mirror.

I really don't want to get too deep into any business of deciding what is kept or goes in some willy-nilly or arbitrary way (slippery slope). Really I just think we approach it with a 'GIANT searchable Best Of' attitude: if it's a tech topic and going to make for sweet search results, it goes in the tech mirror; if it's a great DF thread about a get-together or a tall tale of Harbor Freight or Moxie, it goes in the nontech mirror; if it misses either bucket, it gets skipped and left on the original archive.

To me this would leave behind mainly ten years' worth of "paging user abc..." with zero or one responses, various tech questions with zero responses, single-post threads containing dead links and no or very brief discussion, etc. Search engine clutter would not make the cut; most everything else with any significant level of discussion would, by default.

This is the de-facto 'plan' as I'm envisioning it anyway, barring any sudden unexpected miracle db rip falling in my lap...

User1 · Feb 21, 2009

2 so far

We have 2 volunteers via email, who have been granted 'move' power in the mirror section to go read some archives and click 'move' a while in free time... anyone else?

As a side note, I've found a different ripper and am giving that a test run in parallel. Not holding out high hopes for that path, but working both angles, we will get there one way or the other, eventually

Karfrik · Feb 21, 2009

I would but....

Definitely not qualified. I'm still trying to learn how to post pics here after 9 yrs?... Darn :confuse2:

User1 · Feb 22, 2009

thank you Albert

It is the thought/support that counts

BTW to all, it looks like after several days of trial and error I might actually be very near a breakthrough on the ripper front... looks like I am just a few pages from having the entire thing ripped to a partition on my laptop hard drive

:wacko:

It's a start... I have already had to restart the ripper from scratch several times over the weekend, reconfiguring it to add several more elements to 'skip/ignore' each try, to get it to download just what I am after (the site and contents) and less of what I'm not (a gazillion other things linked out in ten years of archives, duplicate pages interlinked onsite, etc)
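
A minimal sketch of what 'skip/ignore' rules like these amount to, in Python. This is not the ripper's actual configuration (the thread never names the tool or its settings); the host name `archive.example.com`, the `view=thread` parameter, and the skipped file extensions are all hypothetical placeholders chosen for illustration.

```python
from urllib.parse import urlparse, parse_qs

# Every value below is a made-up placeholder, not the real site's setup.
ALLOWED_HOST = "archive.example.com"
SKIP_EXTENSIONS = (".zip", ".exe", ".mp3")
SKIP_QUERY = {"view": "thread"}  # e.g. a duplicate rendering of the same posts


def should_fetch(url, seen):
    """Return True if the URL is on-site, not ignored, and not fetched before."""
    parts = urlparse(url)
    if parts.hostname != ALLOWED_HOST:
        return False  # offsite link: leave it alone
    if parts.path.lower().endswith(SKIP_EXTENSIONS):
        return False  # file type we don't want in the mirror
    query = parse_qs(parts.query)
    for key, value in SKIP_QUERY.items():
        if value in query.get(key, []):
            return False  # ignored page variant
    # Drop the #fragment so page.html and page.html#top count as one page.
    canonical = parts._replace(fragment="").geturl()
    if canonical in seen:
        return False  # already downloaded via another link
    seen.add(canonical)
    return True
```

Each failed run typically adds one more entry to lists like these, which is why the restarts pile up.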

I may have to run it one last time though before I can get a rip archive that would be hostable online for users to access... the rip I'm going to end up with will probably be functional enough that it could be navigated via browser and maybe even searchable via google, but it's going to be utterly massive because it's built entirely of html files (probably about 100,000 of them total)... it's well over a gig on drive the way it's set up now... if I am going to get this hostable I will probably have to discard some things (like 'whole thread view' duplicates of every thread in the archive, and just make the mirror navigable in post view only, like old old school 1.0)

Stay tuned... more to come...

JimD · Feb 22, 2009

Hmmmmm

What does this development mean for the thread sorting project? I have spent many hours this weekend sorting. Should we wait until you figure out if the "rip" will be the final form of the old content?

User1 · Feb 22, 2009

well..

...as I said, I'm doing both in parallel. I spent a lot of hours loading those in here to sort (intended for you more as an occasional pastime that would help me/us out if that is ultimately the only route we succeed at) but yes, still trying other things that may promise an outside shot at a full rip (full would be better and a lot less work for both of us if a rip flies)...

I have really not been expecting a full rip to "work" due to several failed and flawed outcome experiments at it in the distant and recent past... this time I am trying something completely different from all I had tried before, and it suddenly is showing some promise... I have scrapped the whole rip I just made and am rerunning it one more time (starting now so will prob run against the old site for a dayish before I will see the sum of what it will produce to work with), even had to do some reconfig of the website to benefit the rip...

So, I would say, if you want to shuffle some threads around on the existing mirror by all means do, but don't kill yourself at it just yet, because if this rip actually does work out to be something small enough to host on our existing resource and search it with google or whatever, that would certainly be imo the winning route as it would save us months (or?) of manual labor to get where we want to go, in terms of a searchable mirror.

So stay tuned for the moment, I am going to go launch my final rip attempt here and will let you guys know the outcome of it tomorrow or whenever it finishes running...

JimD · Feb 22, 2009

OK... I can wait a while.

I think I was only averaging around 60-80 threads an hour. Some are obvious, some you have to read a little, and others you really have to read entirely because they are interesting.

It is funny seeing all the folks I thought were "new" since I got here in 2003, but now I find out they were here all along and maybe just quiet for a long while. Of course, Tony is a constant. There is lots of back story on the history of Xweb and you (Mac). Interesting stuff.

User1 · Feb 22, 2009

Agreed

Everyone here knows my memory's crap... it was really enlightening for me to take that long walkabout through (literally) every thread from day one '99 up to about 9/11... wow... Reading Stockton Brad back when he used to do nothing but post enthusiastic X sighting reports in the DF... seeing how many names have been part of Xweb since the very beginning (not the least of whom being, Mark Freeman, Eric Armstrong, Gregory Smith...) heck of a trip reading some of that stuff now... God did I really sound like that ten years ago? :hypno:

Stay tuned..

*PS* 60-80 sorts/hr is not a bad clip dude, doing the copy-paste-over avgs about 40-50/hr max (my best anyway) ...

User1 · Feb 22, 2009

HOLY COW

JIM! I just looked in there and saw how many threads you actually moved, dang dude! you weren't kidding! :shock: yes please hang back til Mon-Tue and let me see whether this last gasp rip effort is going to have legs... and thank you for moving all those (it is reassuring to know there are a few someones here as far gone over Xs and Xweb as me... :wacko:)

JimD · Feb 22, 2009

Don't forget Robert!!

I don't know how many he did or how many I did, but we were both in there sorting away the hours.

User1 · Feb 22, 2009

WOW

I should not have launched you guys yet. Yes, thank you both indeed, I didn't mean to exclude Rob, I just didn't know you were both already starting... I just meant start whenever time permits... Anyhow I was earnestly expecting to have no viable result with my last ditch rip efforts over here (evidenced by us all proceeding with the manual mirror), else I wouldn't have... but just Friday night out of nowhere it seems I may have struck upon a random mishmash of configuration settings in this certain ripper that may actually assemble something viable... and I was out til today while it ran over the weekend on my laptop...

Looks like this latest rip attempt, initiated today, is so far failing to grab about 50 posts per 10,000 snagged... some fail out for various reasons (people posted something with a bizarre character in it that the ripper choked on, can't parse, etc) but it will retry all the fails at the end of the cycle. Last run it totalled out near 100k page files saved with total fails at a couple hundred. That is more posts captured in any case than we'll ever get manually... so it seems worth a shot. Is it going to fly yet? I don't know, but I'm giving it a last big go as we speak... will update soon...
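
For scale, 50 misses per 10,000 is a 0.5% failure rate, and a couple hundred out of 100k is roughly 0.2%. The "retry all the fails at the end of the cycle" behaviour can be sketched in Python as below. This is an illustration only, not the ripper's internals; `rip`, `fetch`, and `max_passes` are hypothetical names, and `fetch` stands in for whatever actually downloads a page.

```python
def rip(urls, fetch, max_passes=3):
    """Fetch every URL; set failures aside and retry them after each full pass.

    `fetch` is any callable that returns page content or raises on failure.
    Returns (saved pages by URL, list of URLs that never succeeded).
    """
    saved = {}
    pending = list(urls)
    for _ in range(max_passes):
        if not pending:
            break  # nothing left to retry
        failed = []
        for url in pending:
            try:
                saved[url] = fetch(url)
            except Exception:
                failed.append(url)  # keep going; revisit after this pass
        pending = failed
    return saved, pending


def failure_rate(failed, total):
    """Fraction of pages that failed, e.g. 50 of 10,000 gives 0.005."""
    return failed / total
```

Deferring retries to the end keeps one bad page from stalling the whole run, and pages that only failed because of a momentary hiccup usually succeed on the second pass.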

User1 · Feb 23, 2009

more tweaks

hopefully 50th time's a charm here, had to rererestart the rip now. I discovered a way to throttle it back some so it's not just pounding the tar out of the server, and it may be less likely to 'miss' pages (some fails may have been a result of moving very quickly). The thing is, though, now it's probably going to run significantly longer to complete, so it could be a few days before we see the end result... will keep you all posted (pun)...
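
Throttling of this kind usually means enforcing a minimum gap between requests. A minimal Python sketch, assuming nothing about the actual ripper's settings: `throttled` and `min_interval` are hypothetical names, and the `clock` and `sleep` parameters exist only so the wrapper can be tested without really waiting.

```python
import time


def throttled(fetch, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Wrap `fetch` so consecutive calls start at least `min_interval` seconds apart."""
    last_call = [None]  # one-element list so the inner function can update it

    def wrapper(url):
        if last_call[0] is not None:
            wait = min_interval - (clock() - last_call[0])
            if wait > 0:
                sleep(wait)  # back off instead of hammering the server
        last_call[0] = clock()
        return fetch(url)

    return wrapper
```

The trade-off is the one described above: at a fixed delay per page, tens of thousands of pages add up to a much longer run.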

Black-Tooth · Feb 23, 2009

Mac...

I'd like to help... could I work with Jim?

I specifically would like to see the Best Of stuff sectionalized or indexed in some manner so it can be pulled up easily. Also, much of it needs to be edited, updated, and condensed.

Otherwise, I'm currently computer illiterate when it comes to HOW this can be done... but I feel I could possibly be an asset if I was coached.

User1 · Feb 23, 2009

of course

Tony (and anyone else who's emailed to join, I haven't been able to check email much this weekend) yes of course you are welcome to join in such efforts... at the moment though we're now waiting to see what will be the result/outcome of this latest site rip attempt. It looks (for the first time) as though it might actually work. If so then no sorting/condensing will be required (at least not for the mirror itself, although a newly compiled 'Best Of' for inclusion in the future wiki or as some part of 'Son Of' would still be doable in either case). At the time of this posting the rip has completed about 15k posts out of about 88k (with ZERO fails!) so it is probably going to be a few more days before we see the final result of it. Once that happens I will have a better idea what our next step is... stay tuned, and thank you!

User1 · Feb 24, 2009

Quick status report

The rip experiment is still ongoing, the rip itself will probably be complete tonight or tomorrow, then there are a number of modifications I will need to make to the resulting files before I can upload it all for test here. In the meantime I've put up a tiny (tiny tiny, about 1400 out of a projected total 88k files!) snippet consisting mainly of the index pages and a few random post contents, just so I could work out 404 errors (made it just like the old classic "no such message" error lol), some server config issues, and other things on the backend beforehand. As a side note, we're getting very serious about Xweb's future back here behind the scenes. I just secured us yet another web host (!) that will more easily accommodate the massive rip archives, and likely the forthcoming wiki as well, than the server we're running the forums on now. We may consolidate them at some future point but in any case the bottom line is that we are taking Xweb's evolution to the next level very seriously. I will post another update on all this in a couple more days or perhaps this weekend, when I have some more solid details to share...

JimD · Feb 24, 2009

I hope it works out Mac

I poked at it a bit and it would be awesome if you could get it to work. Looks just like Xweb 1.0... of course I guess it should.

Just no database behind it... correct?

User1 · Feb 24, 2009

correct

The original site archive is in a database and the forum code serves up requests on the fly so it is way smaller storage wise. Since we cannot get a direct db rip all we can do is hit the site from outside with the ripper and that captures the resulting html document and saves it to files. So from the user end it looks and acts identical (in terms of read only) just it is all stored as static files instead of a live system, thus, it is going to be gigantic by comparison. I think we can make it work though. After the rip is done I will need to attack those 88k files on my hdd with some kind of bulk/mass file editor (as yet to be figured out) in order to remove boatloads of stat tracking codes, popup ad codes, a bunch of scripts that will be useless bloat in a static scenario etc etc, and install on it a google custom search module, load it up on the newest xweb server and if all goes well we'd end up with an exact replica of Xweb 1.0 DF archives on our own server but with a full archive search function powered by google. Stay tuned...
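
One way the bulk cleanup step could look, sketched in Python. This is an assumption about the approach, not the editor actually used (that was still "as yet to be figured out"); `strip_scripts` and `clean_tree` are hypothetical helper names. It removes every `<script>...</script>` block from each `.html` file under a folder, working on raw bytes so odd characters in old posts are left untouched. A regex is a crude tool for HTML, so it should be tried on a copy of the files first.

```python
import re
from pathlib import Path

# Matches <script ...> ... </script>, any letter case, across line breaks.
SCRIPT_RE = re.compile(rb"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_scripts(html):
    """Return the page bytes with all script blocks removed."""
    return SCRIPT_RE.sub(b"", html)


def clean_tree(root):
    """Rewrite every .html file under `root` in place; return how many changed."""
    changed = 0
    for path in Path(root).rglob("*.html"):
        original = path.read_bytes()
        cleaned = strip_scripts(original)
        if cleaned != original:
            path.write_bytes(cleaned)
            changed += 1
    return changed
```

Stat trackers and popup ads that are not wrapped in script tags would need their own patterns, added the same way.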

User1 · Feb 25, 2009

rip is having a few issues

The rip is encountering a few issues, nothing, I am thinking, that can't be overcome, just a question of how much it will take to make this turkey fly... I still have to wait until tomorrow at the earliest to know whether the result as a whole will be something at-or-near usable or terminally dysfunctional. I have lined up a second ripper option that may end up getting the job done if this one ultimately produces crap, but it is, while more promising maybe, also a lot more complicated to set up... anyway, we're going to see. The more I pursue this the more it seems viable this time than before. Viable, just a pita

Anyway thought I'd say I'd like to 'officially' suspend sort work on the existing partial 'son of' mirror, I think I'd really like to take a whack at wringing these 2 rippers for all they're worth (the art of ripping apparently has evolved somewhat since I last tried it out)... and if it all still comes up bunk, then we'll examine where we stand in the son of transfer/sort and decide accordingly...

Thanks again to all involved, I'll update on any rip progress made on the weekend (I took the 'tiny sample' one down for now, it has to move to a different server, so...)

As a side note, if we ever get this thing sorted out, maybe Xweb could offer the 1.0 archive on cd to users who'd like a copy, for free+ship or free+small donation for the cause... but gotta get a right rip ourselves first... stay tuned

User1 · Feb 26, 2009

bad news, good news

Bad news is, after fighting with this ripper non stop for days, it has gotten to about 83k of 88k posts downloaded and due to the sheer size of the job it is freezing up and choking out. Not sure whether it is ever going to succeed at a full rip; it's starting to look like the result is going to be littered with broken links even if the files are there...

Good news is, I had 2 more rippers lined up at this point, and just got the second ripper off the ground, running right alongside the dying choking one that's been the utter bane of my existence, and this cheerful little ripper has only been running about 2 hours and has already pulled down nearly a quarter of the whole archive with fewer than a dozen errors so far...

...so, after (too) much grief, we may just get there after all :shh: "stay tuned"...:whistle:
