Recovering a lost website with no backup?
[+172] [40] Jeff Atwood
[2009-12-11 20:52:00]
[ backup website cache ]

Unfortunately, our hosting provider experienced 100% data loss, so I've lost all content for two hosted blog websites.

(Yes, yes, I absolutely should have done complete offsite backups. Unfortunately, all my backups were on the server itself. So save the lecture; you're 100% absolutely right, but that doesn't help me at the moment. Let's stay focused on the question here!)

I am beginning the slow, painful process of recovering the website from web crawler caches.

There are a few automated tools for recovering a website from internet web spider (Yahoo, Bing, Google, etc.) caches, like Warrick [1], but I had some bad results using it.

I've had much better luck by using a list of all blog posts, clicking through to the Google cache and saving each individual file as HTML. While there are a lot of blog posts, there aren't that many, and I figure I deserve some self-flagellation for not having a better backup strategy. Anyway, the important thing is that I've had good luck getting the blog post text this way, and I am definitely able to get the text of the web pages out of the Internet caches. Based on what I've done so far, I am confident I can recover all the lost blog post text and comments.

However, the images that go with each blog post are proving…more difficult.

Any general tips for recovering website pages from Internet caches, and in particular, places to recover archived images from website pages?

(And, again, please, no backup lectures. You're totally, completely, utterly right! But being right isn't solving my immediate problem… Unless you have a time machine…)

(3) This will be a nice test to see if images do live forever on the internet. - rick schott
(71) When somebody like Jeff Atwood himself can lose two entire websites in one fell swoop... Well. I'm going to review my own backup procedures, for one :P - community_owned
(197) @Phoshi: Jeff has some good articles on Coding Horror on backup. You should give them a quick read. - community_owned
(30) joshhunt wins one (1) internet. This offer may not be combined with other offers, exchanged, or substituted. No rainchecks. - Adam Davis
(6) @joshhunt: epic! - community_owned
(4) I have to ask: Peak (I believe the Trilogy host?) isn't going to be susceptible to anything like this, right? - community_owned
(22) The lengths some people will go to, to earn rep on SU... - Antony
(4) Crowd-sourced backup retrieval. Nice... - Luke
(2) In Google Reader I have 495 posts all the way back to March 5, 2007. As others have said, no images though. - community_owned
(1) I had to recover my wife's company site once from the Google cache as well. The hosting plan did include "nightly back-ups", but failed to deliver on that point when required. Lesson learned. As for recovery of the pictures, you can get quite a few from an image restricted query over*sa_re_im_/* - community_owned
Comment markup seems to mangle the query url. I have it in an answer below. - community_owned
(1) Offtopic but important and related question: does Jeff have offsite backups for the stackoverflow/serverfault/superuser websites? (I should check, he probably posted about it there. Oh, wait...) - community_owned
@CesarB: If all goes wrong, they still have the data dump. - Macha
@Macha: yes, but the data dump does not have non-public data, the loss of which could be a bit harder to recover from (there probably is a post at listing the database tables which are not in the public data dump. Oh, wait...). Not to mention the site's source code (though this last one is probably already "offsite" at least at the developer's machines, since AFAIK it is in a compiled language. I should look for a post at either or which tells which language it was written in...). - community_owned
(3) Are you going to use this as a good excuse to lose the post on NP Complete? Sorry, just had to... - community_owned
(12) Please don't refer to what you did as "backups" - if those files are on the same server, they're in no way "backups." - community_owned
Wait, I have Time Machine on my Mac! Does that count?? :-) (Ironically, in this case, having "Time Machine(TM)" actively backing up to an external drive WOULD have saved you!) - community_owned
(1) Time machine you say? How about a way-back machine? - community_owned
(1) Jeff, since comments are disabled on CH now, I'll comment here. I got news of your lost sites on SO, and I remember looking forward with great interest in how you responded to all of it. It was nice to see that your response was mature and humble. Thank you. - John at CashCommons
(2) This is why I'm a stickler for the old-fashioned "write it on my computer, then FTP it to the web server". If the server goes down I have all my pages on my computer and vice-versa. - DisgruntledGoat
(2) I'm missing how you actually solved this or any kind of follow up. Extra points for pointing out what you used to begin doing at least 1 offsite backup. - Cawas
I created a service just because I experienced losing my site... It is in its very early alpha/beta stage, so don't expect too much of it :) Also, its use is for retrieving the HTML; for now it doesn't retrieve the images automatically. - Dofs
[+153] [2009-12-11 21:08:54] John Siracusa

Here's my wild stab in the dark: configure your web server to return 304 for every image request, then crowd-source the recovery by posting a list of URLs somewhere and asking on the podcast for all your readers to load each URL and harvest any images that load from their local caches. (This can only work after you restore the HTML pages themselves, complete with the <img ...> tags, which your question seems to imply that you will be able to do.)

This is basically a fancy way of saying, "get it from your readers' web browser caches." You have many readers and podcast listeners, so you can effectively mobilize a large number of people who are likely to have viewed your web site recently. But manually finding and extracting images from various web browsers' caches is difficult, and the entire approach works best if it's easy enough that many people will try it and be successful. Thus the 304 approach. All it requires of readers is that they click on a series of links and drag off any images that do load in their web browser (or right-click and save-as, etc.) and then email them to you or upload them to a central location you set up, or whatever. The main drawback of this approach is that web browser caches don't go back that far in time. But it only takes one reader who happened to load a post from 2006 in the past few days to rescue even a very old image. With a big enough audience, anything is possible.
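
As a rough illustration of the 304 idea, here is a minimal Python sketch using only the standard library. It is not what was actually deployed for the site; the extension list, the port, and the names `is_image_request`, `NotModifiedHandler` and `serve` are assumptions made up for this example. It answers every request that looks like an image with 304 Not Modified and no body, so a browser that still holds a copy should fall back to it.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: the extensions the lost images used.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def is_image_request(path: str) -> bool:
    """Return True when the URL path (ignoring any query string) looks like an image."""
    return path.split("?", 1)[0].lower().endswith(IMAGE_EXTENSIONS)


class NotModifiedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if is_image_request(self.path):
            # A 304 carries no body; the browser is expected to use its cached copy.
            self.send_response(304)
            self.end_headers()
        else:
            # Everything else is out of scope for this sketch.
            self.send_error(404)


def serve(port: int = 8000) -> None:
    """Run the sketch server until interrupted (hypothetical port)."""
    HTTPServer(("", port), NotModifiedHandler).serve_forever()
```

Calling `serve()` starts it locally. In practice the same rule would more likely live in the real web server's configuration next to the restored HTML pages.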

(38) +1 for the most creative approach. Could actually work since CH has so many readers. - delux247
(13) implemented here?… - Jeff Atwood
(2) I think you could crawl your static files for the image tags and copy all of those into one giant page of images, instead of having everybody click each link. The implementation looks very impressive, hope it works out for you. - community_owned
OMG! Very nice analysis. - Soner Gönül
[+50] [2009-12-11 21:40:05] delux247

I was able to recover about a year's worth of blog posts from my bloglines cache. It looks like the images are in there too. I saved it as a complete website, which pulls all the files. You can download it here:

Good Luck!

(1) It's only 1MB? I guess that file doesn't contain all images. - splattne
That's not a bad start! - Marc Gravell
(13) cool -- it has some of the images, I'll add these to the folder, thank you! - Jeff Atwood
[+48] [2009-12-11 21:00:11] retracile

Some of us follow you with an RSS reader and don't clear caches. I have blog posts that appear to go back to 2006. No images, from what I can see, but might be better than what you're doing now.

+1 definitely. Google Reader doesn't, but I bet a desktop-based one would. - Nicolas Webb
(2) You could also ask people to check their browser caches. Those who view Coding Horror retro-style might have some of the images cached. - Alex Rozanski
I've got blog posts back to 2005 in GReader, but unfortunately, they don't have images, and they won't let me just export those as a series of pages... I could email them to you though, Jeff... - gms8994
Yeah, there was an implied "I'll send you what I have if you ask for it." in my answer as well. - retracile
At least the RSS will make it easier to import. - ScottKoon
Don't forget that typing into the Google also coughs up every blog post he ever made. - George Stocker
(2) Too many RSS readers assume images will never die. I know mine does :( - community_owned
(54) All CH posts for the past ~4 years. Should be easy enough to scrape from. - gms8994
[+40] [2009-12-11 21:20:25] Portman

(1) Extract a list of the filenames of all missing images from the HTML backups. You'll be left with something like:

  • stay-puft-marshmallow-man.jpg
  • internet-properties-dialog.png
  • yahoo-homepage-small.png
  • password-show-animated.gif
  • tivo2.jpg
  • michael-abrash-graphics-program

(2) Do a Google Image Search for those filenames. It seems like MANY of them have been, um, "mirrored" by other bloggers and are ripe for the taking because they have the same filename.

(3) You could do this in an automated fashion if it proves successful for, say, 10+ images.
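
Step (1) could be automated along these lines. This is only a sketch in Python's standard library, assuming the recovered posts are plain HTML with ordinary <img src="..."> tags; `ImageCollector` and `image_filenames` are names invented for the example.

```python
from html.parser import HTMLParser
from posixpath import basename
from urllib.parse import urlparse


class ImageCollector(HTMLParser):
    """Collects the file name of every <img src=...> in the HTML fed to it."""

    def __init__(self):
        super().__init__()
        self.filenames = set()

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Keep only the last path segment, dropping host and query string.
            name = basename(urlparse(src).path)
            if name:
                self.filenames.add(name)


def image_filenames(html: str) -> list:
    """Return the sorted, de-duplicated image file names referenced by one page."""
    collector = ImageCollector()
    collector.feed(html)
    return sorted(collector.filenames)
```

Running `image_filenames` over each saved page and merging the results gives the list of names to search for in step (2).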

[+36] [2009-12-11 20:58:19] George Stocker

By going to Google Image search [1] and typing [2] you can at least find the minified versions of all of your images. No, it doesn't necessarily help, but it gives you a starting point for retrieving those thousands of images.

Codinghorror images

It looks like Google stores a larger thumbnail in some cases:

Google vs. Bing

Google is on the left, Bing on the right.


(2) yeah, worst case, we'll have to scale up the thumbnails from Google. I hear Bing stores larger thumbnails, though? - Jeff Atwood
I don't know; I'm not a bing sort of guy. I don't even know if they do Image search like Google does. I'll find out and update said post. - George Stocker
(17) I don't know if this is you. But Imageshack seems to have many of your blog images. - Nick Berardi
They seem to have what looks like 456 images that are full size. This might be the best bet for recovering everything. Maybe they can even provide you a dump. - Nick Berardi
(21) Use the Google thumbnails as a start, then use to see if anyone is hosting a copy. - sep332
[+30] [2009-12-11 21:05:24] Nick Berardi

Sorry to hear about the blogs. Not going to lecture. But I did find what appear to be your images on Imageshack. Are they really yours, or has somebody been keeping a copy of them around?

They seem to have what looks like 456 images that are full size. This might be the best bet for recovering everything. Maybe they can even provide you a dump.

[+24] [2009-12-12 07:54:45] community_owned

Jeff, I have written something for you here [1]

In short what I propose you do is:

  1. Configure the web server to return 304 for every image request. 304 means that the file is not modified and this means that the browser will fetch the file from its cache if it is present there. (credit: this SuperUser answer [2])

  2. In every page in the website, add a small script to capture the image data and send it to the server.

  3. Save the image data in the server.

  4. Voila!

You can get the scripts from the given link.

(Sorry if this answer is not appropriate, but why am I getting down-voted?)


Super User answer isn't linked. - Nathaniel
@Nathaniel: FIXED - alexanderpas
[+20] [2009-12-11 21:58:15] gojomo

+1 on the dd recommendation if (1) the raw disk is available somewhere; and (2) the images were simple files. Then you can use a forensic 'data-carving' tool to (for example) pull out all credible ranges that appear to be JPGs/PNGs/GIFs. I've recovered 95%+ of the photos from a wiped iPhone this way.

The open source tools 'foremost' and its successor 'scalpel' can be used for this:

(2) wow -- great tips! - Jeff Atwood
(2) Photorec may also be of use once you get dd images. - community_owned
foremost is available via yum on Fedora - retracile
[+18] [2009-12-12 09:44:44] community_owned

Try this query on the Wayback Machine [1]:*sa_re_im_/*

This will get you all the images from the site that the Wayback Machine has archived. It returns 3878 images, some of which are duplicates. It will not be complete, but it is a good start nonetheless.

For the remaining images, you can use the thumbnails from a search engine cache, and then do a reverse look-up using these at . You give it the thumbnail image, and it will give you a preview and a pointer to closely matching images found on the web.


returns a 404 now? - rogerdpack
[+17] [2009-12-11 20:56:53] community_owned

You could always try, as well. Use the wayback machine. I've used this to recover images from my websites.

(3) Doesn't seem to have much of a cache for CodingHorror, at least. I do see images for blog.stackoverflow though. - community_owned
i rebuilt a website using internet wayback machine once but i tried a few times since and it really doesn't archive very many sites... - djangofan
Looks like it goes back to 2004 here*/ - Chris Nava
Thank goodness it didn’t have a robots.txt file huh? :) - Synetech
[+12] [2009-12-11 21:11:54] community_owned

So, absolute worst case, you can't recover a thing. Damn.

Try grabbing the minified google ones, and putting them through TinEye [1], the reverse-image search engine. Hopefully it should grab any duplicates or rehosts people have made.


[+10] [2009-12-11 21:16:02] community_owned

It is a long shot, but you could consider:

  • Posting the exact list of pictures you are missing
  • Crowd-sourcing the retrieval process through all your readers' internet caches.

For instance, see the Nirsoft Mozilla Cache Viewer [1]:

It can quickly dig up any "" picture one might still have through a simple command line:

MozillaCacheView.exe -folder "C:\Documents and Settings\Administrator\Local Settings\Application Data\Mozilla\Firefox\Profiles\acf2c3u2.default\Cache" 
/copycache "" "image" /CopyFilesFolder "c:\temp\blogso" /UseWebSiteDirStructure 0

Note: they have the same cache explorer for Chrome [2].

(I must have 15 days worth of pictures in it)

And Internet Explorer [3], or Opera [4].

Then update the public list to reflect what the readers report finding in their cache.


[+8] [2009-12-11 20:58:04] community_owned

In the past I've used to pull up cached images. It's kind of hit or miss but it has worked for me.
Also, when trying to recover stock photos that I've used on an old site, is great when I only have the thumbnails and I need the full size images.

I hope this helps you. Good Luck.

I looked through a few minutes ago for images and the few posts I clicked didn't have any showing. - George Stocker
releases the data months after they first indexed them. - Christian
[+8] [2009-12-11 20:58:33] community_owned

This is probably not the easiest or most foolproof solution, but services like Evernote typically save both the text and images when they are stored inside the application - maybe some helpful readers who saved your articles could save the images and send them back to you?

[+7] [2009-12-11 20:59:20] community_owned

I've had great experiences with [1]. Even if you aren't able to extract all of your blog posts from the site, they keep periodical snapshots:

This way you can check out each page and see the blog posts you made. With the names of all the posts you can easily find them in Google's cache if doesn't have it. Archive tries to keep images, Google cache will have images, and I haven't emptied my cache recently so I can help you with the more recent blog posts :)


I tried to get some data from the website of a company I used to work for a while ago. It was good for the text, less so for the images. But YMMV - ChrisF
I believe Google web cache does not store images. - Nathaniel
[+6] [2009-12-11 21:02:25] community_owned

Have you tried your own local browser cache? Pretty good chance some of the more recent stuff is still there.

(Or you could compile a list of all missing images and everyone could check their cache to see if we can fill in the blanks)

[+6] [2009-12-11 22:05:59] Matt Sherman

A suggestion for the future: I use Windows Live Writer [1] for blogging and it saves local copies of posts on my machine, in addition to publishing them out to the blog.


Plus, using Windows Live Writer is just good common sense. - community_owned
[+5] [2009-12-11 22:17:14] community_owned

The web archive caches the images. It's under heavy load right now, you should be ok until 2008 or so.

[+5] [2009-12-11 21:46:35] Sinan Ünür

About five years ago, an early incarnation of an external hard drive on which I was storing all my digital photos failed badly. I made an image of the hard drive using dd and wrote a rudimentary tool to recover anything that looked like a JPEG image. Got most of my photos out of that.
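
A rudimentary carver of that kind might look like the Python sketch below. It is not the tool described above and is far cruder than foremost or scalpel: it assumes each JPEG sits in one contiguous run of bytes, it stops at the first end-of-image marker (so a file with an embedded thumbnail can come out truncated), and the 20 MB size cap is an arbitrary assumption.

```python
JPEG_START = b"\xff\xd8\xff"  # start-of-image marker plus the first byte of the next marker
JPEG_END = b"\xff\xd9"        # end-of-image marker


def carve_jpegs(data: bytes, max_size: int = 20 * 1024 * 1024) -> list:
    """Return the byte ranges in `data` that start and end like a JPEG file."""
    found = []
    pos = 0
    while True:
        start = data.find(JPEG_START, pos)
        if start == -1:
            break
        # Look for the end marker, but no further than max_size bytes away.
        end = data.find(JPEG_END, start + len(JPEG_START), start + max_size)
        if end == -1:
            pos = start + 1  # no plausible end; keep scanning past this marker
            continue
        end += len(JPEG_END)
        found.append(data[start:end])
        pos = end
    return found
```

Each returned chunk can be written out as its own .jpg file and checked by eye. For a real dd image one would probably memory-map the file rather than read it whole.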

So, the question is, can you get a copy of the virtual machine disk image which held the images?

[+5] [2009-12-11 21:08:30] community_owned

I suggest the combination of and a request anonymizer like Tor [2]. I suggest using an anonymizer because that way each of your requests will have a random IP and location, so you can avoid getting banned by a (like Google did) for an unusually high number of requests.

Good Luck, there are a lot of gems in that blog.

Given that Jeff wants to make a donation to, abusing the anonymizer might not be absolutely unacceptable. But I still want to give you a kick for that. :-| - community_owned
[+4] [2009-12-11 21:02:49] community_owned

sometimes hides images. Get each URL manually (or write a short script) and query them for it like this:

string.Format("GET /*/{0}", nextUri)

Of course that's going to be quite a pain to search through.

I might have some in my browser cache. If I do I'll host them somewhere.

[+4] [2009-12-11 20:58:04] community_owned

The wayback machine will have some. Google cache and similar caches will have some.

One of the most effective things you'll be able to do is to email the original posters, asking for help.

I do actually have some infrastructural recommendations, for after this is all cleaned up. The fundamental problem isn't actually backups, it's lack of site replication and lack of auditing. If you email me at the private email field's contents, later, when you're sort of back on your feet, I'd love to discuss the matter with you.

[+4] [2009-12-11 21:30:45] splattne

Jeff, you said in one of your SO podcasts that you tried to use Flickr as a CDN for your images. Maybe you still have that account and have some images there?

Some of the images could be found by searching on Google Images [1] and clicking "Find similar images"; maybe there are copies on other sites.

If you need some crowd-source power, let me/us know!


[+3] [2009-12-11 22:49:11] community_owned

You could use TinEye [1] to find duplicates of your images [2] by searching the thumbnails with google cache [3]. This will help only with images you've taken from other sites, though.


(1) No, it would help with images others have taken from CH. - DisgruntledGoat
@DisgruntledGoat: I didn't even think of that at first :D - community_owned
[+3] [2009-12-11 22:02:12] gojomo

If you're hoping to try to scrape users' caches, you may want to set the server to respond 304 Not Modified to all conditional-GET ('If-Modified-Since' or 'If-None-Match') requests, which browsers use to revalidate their cached material.

If your initial caching headers on static content like images were pretty liberal -- allowing things to be cached for days or months -- you could keep getting revalidate requests for a while. Set a cookie on those requests, and appeal to those users to run a script against their cache to extract the images they still have.

Beware, though: the moment you start putting up any textual content with inline resources that aren't yet present, you could be wiping out those cached versions as revalidators hit 404s.

[+3] [2009-12-12 00:14:28] community_owned

I've managed to recover these files from my Safari cache on Snow Leopard:


If anyone else wants to try, I've written a Python script to extract them to ~/codinghorror/filename, which I've put online here [1].

I hope this helps.


[+3] [2009-12-12 01:05:44] lo_fye

At the risk of pointing out the obvious, try mining your own computer's backups for the images. I know my backup strategy is haphazard enough that I have multiple copies of a lot of files hanging around on external drives, burned discs, and in zip/tar files. Good luck!

[+2] [2009-12-11 21:23:48] wilhil

Very sorry to hear this and I am very annoyed for you, and the timing - I wanted an offline copy of a few of your posts and did HTTrack on your entire site but had to go out (this was a couple of weeks ago) and I stopped it.

If the host is half decent - and by the fact I am guessing you are a good customer... I would ask them to either send you the hard drives (as I am guessing they should be using RAID) or do some recovery themselves.

Whilst this may not be a fast process, I did this with one host for a client and was able to recover entire databases intact (... basically, the host tried an upgrade for the control panel they were using and messed it up.. but nothing was overwritten).

Whatever happens - Good luck from all your fans on the SO sites!

[+2] [2009-12-11 21:01:32] community_owned

Did you get a chance to see if your hosting provider has any backup at all (some older versions)?

it does not look good.. their backup program was unable to backup the virtual machine hard drive files, so there are no backups. - Jeff Atwood
[+2] [2009-12-11 21:14:02] Wedge

How much is this data worth to you? If it's worth a significant sum (thousands of dollars) then consider asking your hosting provider for the hard drive used to store the data for your website (in the case of data loss due to hardware failure). You can then take the drive to ontrack or some other data recovery service to see what you can get off the drive. This might be tricky to negotiate due to the possibility of other people's unrecovered data on the drive as well, but if you really care about it you can probably work it out.

the server was a VM as far as I know. - splattne
(1) @splattne even so, there's a non-zero chance a lot of the data could be recovered. - community_owned
Would have to be a highly specialised service. - community_owned
[+1] [2009-12-11 20:59:53] community_owned

Have you tried doing a Google Image search, with the syntax

[+1] [2009-12-11 21:34:04] community_owned

I can read old posts on my Google Reader account. Maybe that helps: relating to your horror.

[+1] [2009-12-12 12:46:31] Sam Hasler

Beta Bloglines [1] can access more of the bloglines archive than the previous poster recovered using the classic interface. I'm currently saving whatever cached images I can get from them back to the following dates/posts:

Feb 15, 2007 for codinghorror - Oldest post: Origami Software and Crease Patterns
May 14, 2008 for blog.stackoverflow - Oldest post: Podcast #5

Once it's finished saving and I've uploaded it somewhere I'll update this post.

Update: This is taking longer than I anticipated. I'm saving from Chrome and Firefox and merging the images so I get both sets of cached images.

Update: Looks like there aren't any differences between the two sets, so I'm just seeing what you've already restored, if I had anything in my cache at all.


[+1] [2009-12-12 19:49:24] community_owned

I was going to suggest Warrick because it was written by one of my CS professors [1]. I'm sorry to hear that you had a bad experience with it. Maybe you can at least send him a note with some bug reports.


[+1] [2009-12-12 23:13:02] community_owned

As for your images, ask Sun Microsystems to give them back to you: they have made "an entire internet backup [1]" ... in a shipping container.

"The Internet Archive offers long-term digital preservation to the ephemeral Internet," said Brewster Kahle, founder, the Internet Archive organization. "As more of the world's most valuable information moves online and data grows exponentially, the Internet Archive will serve as a living history to ensure future generations can access and continue to preserve these important documents over time."

Founded in 1996 by Brewster Kahle, the Internet Archive is a non-profit organization that has built a library of Internet sites and other cultural artifacts in digital form that include moving images, live audio, audio and text formats. The Archive offers free access to researchers, historians, scholars, and the general public; and also features "The Wayback Machine" -- a digital time capsule that allows users to see archived versions of Web pages across time. At the end of 2008, the Internet Archive housed over three petabytes of information, which is roughly equivalent to about 150 times the information contained in the Library of Congress. Going forward, the Archive is expected to grow at approximately 100 terabytes a month.

more here [2] and here [3]


[0] [2009-12-12 08:39:32] Matthew Glidden

Most solutions use a combination of blog reader assistance,, and Google caching. Consider turning this data crisis into a blog recovery tool specification. Several features listed in the question and answers look ready to automate, given knowledge an owner would have of their root site.

  1. Restore pages from, Google cache, or local cache using web spider that avoids bannable techniques
  2. Check local cache, Google image search, and imageshack for matching file names
  3. After initial recovery, make list of site's missing images and other URLs (e.g., return 304 code for images)
  4. Add upload or contribution form for readers who have cached versions
  5. Site owner previews and validates contributions
  6. Resubmit recovered pages to search engines, if desired

Owners that derive a lot of value from quick recovery might offer a bounty for missing files or other outside assistance.

[0] [2009-12-11 23:06:43] community_owned

Just automate grabbing the individual Google page cache files.

Here's a Ruby script I used in the past.

My script doesn't appear to have any sleeps. I didn't get IP banned for some reason, but I'd recommend adding one.
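
For what it's worth, the "add a sleep" advice could look like the following Python sketch. It is not the Ruby script mentioned above; `fetch_all`, the 5-second default delay and the injectable `fetch` and `sleep` parameters are all assumptions made for illustration, and the URLs to request are left to the caller.

```python
import time
import urllib.request


def fetch_all(urls, delay_seconds=5.0, fetch=None, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests; returns {url: body or None}."""
    if fetch is None:
        fetch = lambda url: urllib.request.urlopen(url, timeout=30).read()
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause between requests, not before the first one
        try:
            results[url] = fetch(url)
        except OSError:           # urllib.error.URLError is a subclass of OSError
            results[url] = None   # record the failure and move on
    return results
```

A longer or randomized delay may be safer; the right value depends on how the cache provider throttles clients, which this sketch does not know.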

google page caches don't save images, just text - community_owned
[0] [2009-12-11 23:58:35] community_owned

You could try to get the broken HDD from the hosting company and give it to an HDD recovery service; I think you could find one. At least the backup images would probably be restored from there. Also, this disk could be part of some mirror/RAID system, so is there a mirror image somewhere?

[0] [2009-12-11 21:09:12] community_owned

Maybe you could crowd-source it asking us to look in our browser caches. I generally read Coding Horror via Google Reader, so my Firefox cache doesn't seem to have anything from in it.

Others can look in their own Firefox cache by browsing to: about:cache?device=disk .

[0] [2009-12-11 21:11:35] community_owned

Just another shot at retrieving the content.

I was subscribed using FeedBurner, so I might have some archives in my mail! You can ask others, who might be able to forward you those posts.