Question

Unfortunately, our hosting provider experienced 100% data loss, so I've lost all content for two hosted blog websites:

(Yes, yes, I absolutely should have done complete offsite backups. Unfortunately, all my backups were on the server itself. So save the lecture; you're 100% absolutely right, but that doesn't help me at the moment. Let's stay focused on the question here!)

I am beginning the slow, painful process of recovering the website from web crawler caches.

There are a few automated tools for recovering a website from internet web spider (Yahoo, Bing, Google, etc.) caches, like Warrick ^[1], but I had some bad results using this:

My IP address was quickly banned from Google for using it
I get lots of 500 and 503 errors and "waiting 5 minutes…"
Ultimately, I can recover the text content faster by hand

I've had much better luck by using a list of all blog posts, clicking through to the Google cache and saving each individual file as HTML. While there are a lot of blog posts, there aren't that many, and I figure I deserve some self-flagellation for not having a better backup strategy. Anyway, the important thing is that I've had good luck getting the blog post text this way, and I am definitely able to get the text of the web pages out of the Internet caches. Based on what I've done so far, I am confident I can recover all the lost blog post text and comments.

However, the images that go with each blog post are proving…more difficult.

Any general tips for recovering website pages from Internet caches, and in particular, places to recover archived images from website pages?

(And, again, please, no backup lectures. You're totally, completely, utterly right! But being right isn't solving my immediate problem… Unless you have a time machine…)

[1] http://warrick.cs.odu.edu/warrick.html

Answer 1

Here's my wild stab in the dark: configure your web server to return 304 for every image request, then crowd-source the recovery by posting a list of URLs somewhere and asking on the podcast for all your readers to load each URL and harvest any images that load from their local caches. (This can only work after you restore the HTML pages themselves, complete with the <img ...> tags, which your question seems to imply that you will be able to do.)

This is basically a fancy way of saying, "get it from your readers' web browser caches." You have many readers and podcast listeners, so you can effectively mobilize a large number of people who are likely to have viewed your web site recently. But manually finding and extracting images from various web browsers' caches is difficult, and the entire approach works best if it's easy enough that many people will try it and be successful. Thus the 304 approach. All it requires of readers is that they click on a series of links and drag off any images that do load in their web browser (or right-click and save-as, etc.) and then email them to you or upload them to a central location you set up, or whatever. The main drawback of this approach is that web browser caches don't go back that far in time. But it only takes one reader who happened to load a post from 2006 in the past few days to rescue even a very old image. With a big enough audience, anything is possible.

Answer 2

I was able to recover about a year's worth of blog posts from my bloglines cache. It looks like the images are in there too. I saved it as a complete website, which pulls all the files. You can download it here:

http://drop.io/horrorbackup

Good Luck!

Answer 3

Some of us follow you with an RSS reader and don't clear caches. I have blog posts that appear to go back to 2006. No images, from what I can see, but might be better than what you're doing now.

Answer 4

(1) Extract a list of the filenames of all missing images from the HTML backups. You'll be left with something like:

stay-puft-marshmallow-man.jpg
internet-properties-dialog.png
yahoo-homepage-small.png
password-show-animated.gif
tivo2.jpg
michael-abrash-graphics-program

(2) Do a Google Image Search for those filenames. It seems like MANY of them have been, um, "mirrored" by other bloggers and are ripe for the taking because they have the same filename.

(3) You could do this in an automated fashion if it proves successful for, say, 10+ images.

Answer 5

By going to Google Image search ^[1] and typing site:codinghorror.com ^[2] you can at least find the minified versions of all of your images. No, it doesn't necessarily help, but it gives you a starting point for retrieving those thousands of images.

Codinghorror images

It looks like Google stores a larger thumbnail in some cases:

Google vs. Bing

Google is on the left, Bing on the right.

[1] http://images.google.com/imghp?hl=en&tab=wi
[2] http://images.google.com/images?hl=en&safe=active&um=1&sa=1&q=site%3Acodinghorror.com&aq=f&oq=&aqi=&start=0&imgtbs=s

Answer 6

Sorry to hear about the blogs. Not going to lecture. But I did find what appears to be your images on Imageshack. Are they really yours or has somebody been keeping a copy of them around.

http://profile.imageshack.us/user/codinghorror

They seem to have what looks like 456 images that are full size. This might be the best bet for recovering everything. Maybe they can even provide you a dump.

Answer 7

Jeff, I have written something for you here ^[1]

In short what I propose you do is:

Configure the web server to return 304 for every image request. 304 means that the file is not modified and this means that the browser will fetch the file from its cache if it is present there. (credit: this SuperUser answer ^[2])
In every page in the website, add a small script to capture the image data and send it to the server.
Save the image data in the server.
Voila!

You can get the scripts from the given link.

(Sorry if this answer is not appropriate, but why am I getting down-voted?)

[1] http://www.diovo.com/2009/12/getting-cached-images-in-your-website-from-the-visitors/
[2] http://superuser.com/questions/82036/recovering-a-lost-website-with-no-backup/82060#82060

Answer 8

+1 on the dd recommendation if (1) the raw disk is available somewhere; and (2) the images were simple files. Then you can use a forensic 'data-carving' tool to (for example) pull out all credible ranges that appear to be JPGs/PNGs/GIFs. I've recovered 95%+ of the photos on an iPhone that was wiped this way.

The open source tools 'foremost' and its successor 'scalpel' can be used for this:

http://foremost.sourceforge.net/

http://www.digitalforensicssolutions.com/Scalpel/

Answer 9

Try this query on the Wayback Machine ^[1]:

http://web.archive.org/web/*sa_re_im_/http://codinghorror.com/*

This will get you all the images from codinghorror.com archived by archive.org. This returns 3878 images, some of which are duplicates. It will not be complete, but a good start none the less.

For the remaining images, you can use the thumbnails from a search engine cache, and then do a reverse look-up using these at http://www.tineye.com/ . You give it the thumbnail image, and it will give you a preview and a pointer to closely matching images found on the web.

[1] http://en.wikipedia.org/wiki/Wayback%5FMachine

Answer 10

You could always try archive.org, as well. Use the wayback machine. I've used this to recover images from my websites.

Answer 11

So, absolute worst case, you can't recover a thing. Damn.

Try grabbing the minified google ones, and putting them through TinEye ^[1], the reverse-image search engine. Hopefully it should grab any duplicates or rehosts people have made.

[1] http://www.tineye.com/

Answer 12

It is a long shot, but you could consider:

Posting the exact list of picture you are missing
crowd-sourcing the retrieval process through all your readers's internet cache.

For instance, see the Nirsoft Mozilla Cache Viewer ^[1]:

alt text

It can quickly dig up any "blog.stackoverflow.com" picture one might still have through a simple command line:

MozillaCacheView.exe -folder "C:\Documents and Settings\Administrator\Local Settings\Application Data\Mozilla\Firefox\Profiles\acf2c3u2.default\Cache" 
/copycache "http://blog.stackoverflow.com" "image" /CopyFilesFolder "c:\temp\blogso" /UseWebSiteDirStructure 0

Note: they have the same cache explorer for Chrome ^[2].

alt text

(I must have 15 days worth of blog.stackoverflow.com pictures in it)

And Internet Explorer ^[3], or Opera ^[4].

Then update the public list to reflect what the readers report finding in their cache.

[1] http://www.nirsoft.net/utils/mozilla%5Fcache%5Fviewer.html
[2] http://www.nirsoft.net/utils/chrome%5Fcache%5Fview.html
[3] http://www.nirsoft.net/utils/ie%5Fcache%5Fviewer.html
[4] http://www.nirsoft.net/utils/opera%5Fcache%5Fview.html

Answer 13

In the past I've used http://www.archive.org/ to pull up cached images. It's kind of hit or miss but it has worked for me.
Also, when trying to recover stock photos that I've used on an old site, www.tineye.com is great when I only have the thumbnails and I need the full size images.

I hope this helps you. Good Luck.

Answer 14

This is probably not the easiest or most full-proof solution, but services like Evernote typically save both the text and images when they are stored inside the application - maybe some helpful readers who saved your articles could save the images and send them back to you?

Answer 15

I've had great experiences with archive.org ^[1]. Even if you aren't able to extract all of your blog posts from the site, they keep periodical snapshots:

alt text

This way you can check out each page and see the blog posts you made. With the names of all the posts you can easily find them in Google's cache if archive.org doesn't have it. Archive tries to keep images, Google cache will have images, and I haven't emptied my cache recently so I can help you with the more recent blog posts :)

[1] http://archive.org/

Answer 16

Have you tried your own local browser cache? Pretty good chance some of the more recent stuff is still there. http://lifehacker.com/385883/resurrect-images-from-my-web-browser-cache

(Or you could compile a list of all missing images and everyone could check their cache to see if we can fill in the blanks)

Answer 17

A suggestion for the future: I use Windows Live Writer ^[1] for blogging and it saves local copies of posts on my machine, in addition to publishing them out to the blog.

[1] http://windowslivewriter.spaces.live.com/

Answer 18

The web archive caches the images. It's under heavy load right now, you should be ok until 2008 or so.

http://web.archive.org/web/20080618014552rn%5F2/www.codinghorror.com/blog/

Answer 19

About five years ago, an early incarnation of an external hard drive on which I was storing all my digital photos failed badly. I made an image of the hard drive using dd and wrote a rudimentary tool to recover anything that looked like a JPEG image. Got most of my photos out of that.

So, the question is, can you get a copy of the virtual machine disk image which held the images?

Answer 20

I suggest the combination of archive.org and a request anonymizer like [Tor][2]. I suggest using anonymizer because that way each of your requests will have a random IP and location and that way you can avoid getting banned by a archive.org (like Google did) for unusually high number of requests.

Good Luck, there are a lot of gems in that blog.

Answer 21

archive.org sometimes hides images. Get each URL manually (or write a short script) and query them for it like this:

string.Format("GET /*/{0}", nextUri)

Of course that's going to be quite a pain to search through.

I might have some in my browser cache. If I do I'll host them somewhere.

Answer 22

The wayback machine will have some. Google cache and similar caches will have some.

One of the most effective things you'll be able to do is to email the original posters, asking for help.

I do actually have some infrastructural recommendations, for after this is all cleaned up. The fundamental problem isn't actually backups, it's lack of site replication and lack of auditing. If you email me at the private email field's contents, later, when you're sort of back on your feet, I'd love to discuss the matter with you.

Answer 23

Jeff, you told in one of your SO podcasts, that you tried to use Flickr as CDN for your images. Maybe you still have that account and have some images there?

Some of the images could be found searching on Google Images ^[1] and click on "Find similar images", maybe there are copies on other sites.

If you need some crowd-source power, let me/us know!

[1] http://images.google.com/q=site%3Acodinghorror.com

Answer 24

You could use TinEye ^[1] to find duplicates of your images ^[2] by searching the thumbnails with google cache ^[3]. This will help only with images you've taken from others site, though.

[1] http://www.tineye.com
[2] http://www.tineye.com/search/001248c59ca6ac6ed00356013fb9089a196e41e2
[3] http://images.google.com/images?gbv=2&hl=es&sa=1&q=site%3Ablog.stackoverflow.com&btnG=Buscar+im%C3%A1genes&aq=f&oq=&start=0&imgtbs=s

Answer 25

If you're hoping to try to scrape users' caches, you may want to set the server to respond 304 Not Modified to all conditional-GET ('If-Modified-Since' or 'If-None-Match') requests, which browsers use to revalidate their cached material.

If your initial caching headers on static content like images were pretty liberal -- allowing things to be cached for days or months -- you could keep getting revalidate requests for a while. Set a cookie on those requests, and appeal to those users to run a script against their cache to extract the images they still have.

Beware, though: the moment you start putting up any textual content with inline resources that aren't yet present, you could be wiping out those cached versions as revalidators hit 404s.

Answer 26

I've managed to recover these files from my Safari cache on Snow Leopard:

bad-code-offset-back.jpg
bad-code-offset-front.jpg
code-whitespace-invisible.png
code-whitespace-visible.png
coding-horror-official-logo-small.png
coding-horror-text.png
codinghorror-search-logo1.png
crucial-ssd-128gb-ct128m225.jpg
google-microformat-results-forum.png
google-microformat-results-review.png
kraken-cthulhu.jpg
mail.png
powered-by-crystaltech-web-hosting.png
ssd-vs-magnetic-graph.png

If anyone else wants to try, I've written a Python script to extract them to ~/codinghorror/filename, which I've put online here ^[1].

I hope this helps.

[1] http://junk.cdslash.net/codinghorror.py

Answer 27

At the risk of pointing out the obvious, try mining your own computer's backups for the images. I know my backup strategy is haphazard enough that I have multiple copies of a lot of files hanging around on external drives, burned discs, and in zip/tar files. Good luck!

Answer 28

Very sorry to hear this and I am very annoyed for you, and the timing - I wanted an offline copy of a few of your posts and did HTTrack on your entire site but had to go out (this was a couple of weeks ago) and I stopped it.

If the host is half descent - and by the fact I am guessing you are a good customer... I would ask them to either send you the hard drives (as I am guessing they should be using RAID) or do some recovery themselves.

Whilst this may not be a fast process, I did this with one host for a client and was able to recover entire databases intact (... basically, the host tried an upgrade for the control panel they were using and messed it up.. but nothing was overwritten).

Whatever happens - Good luck from all your fans on the SO sites!

Answer 29

Did you get a chance to see if, your hosting provider has any backup at all (some older versions)?

Answer 30

How much is this data worth to you? If it's worth a significant sum (thousands of dollars) then consider asking your hosting provider for the hard drive used to store the data for your website (in the case of data loss due to hardware failure). You can then take the drive to ontrack or some other data recovery service to see what you can get off the drive. This might be tricky to negotiate due to the possibility of other people's unrecovered data on the drive as well, but if you really care about it you can probably work it out.

Answer 31

Have you tried doing a Google Image search, with the syntax site:codinghorror.com?

Answer 32

I can read old posts on my Google Reader account. Maybe that helps: relating to your horror .

Answer 33

Beta Bloglines ^[1] can access more of the bloglines archive than the previous poster recovered using the classic interface. I'm currently saving whatever cached images I can get from them back to the following dates/posts:

Feb 15, 2007 for codinghorror - Oldest post: Origami Software and Crease Patterns
May 14, 2008 for blog.stackoverflow - Oldest post: Podcast #5

Once it's finished saving and I've uploaded it somewhere I'll update this post.

Update: This is taking longer than I anticipated. I'm saving from Chrome and Firefox and merging the images so I get both sets of cached images.

Update: Looks like there aren't any differences between the two sets, so I'm just seeing what you've already restored, if I had anything in my cache at all.

[1] http://beta.bloglines.com/

Answer 34

I was going to suggest Warrick because it was written by one of my CS professors ^[1]. I'm sorry to hear that you had a bad experience with it. Maybe you can at least send him a note with some bug reports.

[1] http://www.harding.edu/fmccown/

Answer 35

Your images, ask SUN microsystems to give them back to you, they have made " an entire internet backup ^[1]" ... in a shipping container

"The Internet Archive offers long-term digital preservation to the ephemeral Internet," said Brewster Kahle, founder, the Internet Archive organization. "As more of the world's most valuable information moves online and data grows exponentially, the Internet Archive will serve as a living history to ensure future generations can access and continue to preserve these important documents over time."

Founded in 1996 by Brewster Kahle, the Internet Archive is a non-profit organization that has built a library of Internet sites and other cultural artifacts in digital form that include moving images, live audio, audio and text formats. The Archive offers free access to researchers, historians, scholars, and the general public; and also features "The Wayback Machine" -- a digital time capsule that allows users to see archived versions of Web pages across time. At the end of 2008, the Internet Archive housed over three petabtyes of information, which is roughly equivalent to about 150 times the information contained in the Library of Congress. Going forward, the Archive is expected to grow at approximately 100 terabytes a month.

alt text

more here ^[2] and here ^[3]

[1] http://www.sun.com/aboutsun/pr/2009-03/sunflash.20090325.1.xml
[2] http://gizmodo.com/5185947/sun-storing-the-entire-internet-in-a-shipping-container
[3] http://www.sun.com/featured-articles/2009-0325/feature/index.jsp

Answer 36

Most solutions use a combination of blog reader assistance, archive.org, and Google caching. Consider turning this data crisis into a blog recovery tool specification. Several features listed in the question and answers look ready to automate, given knowledge an owner would have of their root site.

Restore pages from archive.org, Google cache, or local cache using web spider that avoids bannable techniques
Check local cache, Google image search, and imageshack for matching file names
After initial recovery, make list of site's missing images and other URLs (e.g., return 304 code for images)
Add upload or contribution form for readers who have cached versions
Site owner previews and validates contributions
Resubmit recovered pages to search engines, if desired

Owners that derive a lot of value from quick recovery might offer a bounty for missing files or other outside assistance.

Answer 37

Just automate grabbing the individual Google page cache files.

Here's a Ruby script I used in the past.

http://pastie.org/739757

My script doesn't appear to have any sleeps. I didn't get IP banned for some reason, but I'd recommend adding one.

Answer 38

You could try to get the broken HDD from hosting company and give it to a hdd recovery service, I think you could find one. At least the backup images would probably be restored form there. Also this disk could be part of some mirror/RAID system and there is somewhere a mirror image?

Answer 39

Maybe you could crowd-source it asking us to look in our browser caches. I generally read Coding Horror via Google Reader, so my Firefox cache doesn't seem to have anything from codinghorror.com in it.

Others can look in their own Firefox cache by browsing to: about:cache?device=disk .

Answer 40

Just another shot at retrieving the content.

I was subscribed using feed burner. So might have some archives in my mail! You can ask others, who might be able to forward you those posts.