Saving Internet information from the Memory Hole

Before releasing the name of Sergeant Robert Bales, who’s accused of a murderous rampage in Afghanistan, the US military tried to wipe information about him from the Internet. Since this is a tech blog, I won’t be talking here about why they did it or whether it was a good idea. The questions I’m addressing here are: (1) If you want to wipe some information from the Internet, can you do it? (2) If someone tries to wipe information which you suddenly realize you want, how much can you recover?

We’re talking here about the government deleting only information which it directly controls. Parts of Bales’ wife’s blog disappeared, but probably this happened with her cooperation. If a government can control all websites in the country, including search engines, and restrict access to those outside, it’s a very different game. Think of China and the Tiananmen Square events of 1989.

If the government hadn’t been so rushed for time, it might have done a much more effective job. Keeping Bales’ name out of the media for a week was probably pushing their limits. If they could have had another week, many search engine caches could have lapsed, making it harder but still far from impossible to find old pages.

Let’s suppose information on someone or something has been sent down the Internet Memory Hole, and you’re an investigative reporter who wants it back. How would you do it? If you do a search on Google, many of the hits will have “cached” links. This lets you look at Google’s latest cached version of the page, which may be the only available version or may have information that was recently taken out. That technique is good for information that’s not more than a few days old.
[Screenshot: a Google search result showing the "Cached" link]
For older information, you can look at the Internet Archive’s Wayback Machine. This site has a vast collection of old Web pages, but it’s still necessarily spotty. Usually it has only pages more than six months old, and anyone can ask not to have their pages archived.
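
The Wayback Machine’s snapshot URLs follow a predictable pattern, so you can construct a lookup by hand. Here’s a minimal sketch; the URL scheme is the real one, but the page and date are placeholders:

```python
# Build a Wayback Machine URL for a page as of a given date.
# The Wayback Machine redirects to the snapshot closest to the
# requested timestamp (format: YYYYMMDDhhmmss, may be truncated).

def wayback_url(page_url: str, timestamp: str) -> str:
    return f"https://web.archive.org/web/{timestamp}/{page_url}"

# Ask for a snapshot of a page as it looked around March 2012.
print(wayback_url("http://example.com/", "20120301"))
# → https://web.archive.org/web/20120301/http://example.com/
```

If no snapshot exists near that date, the site will tell you so; there’s no guarantee any given page was ever archived.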

If you looked at a page a few days ago and it’s no longer there, you may have a copy cached by your browser. Going into offline mode may improve your chances of seeing the cached page rather than a 404.

There may be copies of the vanished material elsewhere on the Internet. Some people fish out cached pages that have disappeared and post them, especially if they think the deletion was an attempt to hide something. I did this myself a few years ago: in 2004 John Kerry proposed mandatory service for high school students, making it illegal for them to graduate if they didn’t satisfy a federal service requirement. This stirred up a lot of anger, and the page proposing it disappeared from his website. I grabbed a cached copy from Google and posted it. That copy greatly increased my site’s hit counts for a while. If you can remember key phrases from the deleted page, using them in a search string may turn up copies.

There’s a Firefox add-on called “Resurrect Pages.” It offers to search several caches and mirrors if you get a 404 error. Another one is “ErrorZilla Plus.” I don’t have any experience with them.

Finding vanished information on the Internet is an art, and doubtless there are experts who know a lot more tricks than I’ve mentioned here.

A case study in website risk

In an earlier post, “Whose site is it anyway?” I discussed the risks to organizations of having their website’s eggs in one basket. Here’s a look at a situation that happened recently, omitting the names.

Organization X had a wiki for internal operations, hosted by a commercial company for a small annual fee. It had only one designated administrator, who lost interest in the organization. Well before the account’s renewal deadline, people in X were aware of the problem, and there was discussion of how to back the wiki up. The host offers an export function, but only an administrator can use it. Short of that, there are tools such as HTTrack, which can download the pages as HTML but not as editable wikitext.

Discussion happened. Not much else did. A few months later, logged-in users started seeing a warning that the account would expire in a matter of days. X contacted the hosting company, asking to transfer ownership of the account and pointing out that the wiki’s name was the organization’s legally registered name. But the account had been registered by the administrator, not the organization, and the company said (quite properly) that it couldn’t transfer ownership without the appropriate legal procedures.

Things got a bit frantic from there. The departed administrator was getting communications, but the messages conflicted about just what he was being asked to do. Was he supposed to reassign the account? Was he supposed to start a legal transfer of ownership, meanwhile letting the account lapse? Was he supposed to renew it and get reimbursed? Was someone at least making a backup while the clock was ticking? If he was supposed to add administrators, who would they be, and could the same scenario happen again if they left?

Fortunately, this story has a happy ending. The administrator decided to just go ahead and renew the account, leaving concerns about reimbursement for later. The new admin account was an email alias on Organization X’s domain, with multiple people assigned to it. For the time being at least, they’re out of the hole. I hope they start doing backups, of course.

Mirror, mirror on the web

Sometimes it’s helpful to have the same Web content at more than one location, so that it remains available if one site goes down for any reason. This can be useful for unpopular material that’s prone to denial-of-service attacks, or for material of lasting importance that shouldn’t go away if its primary maintainer drops the ball. For example, I’m the main force behind the Filk Book Index, a list of the contents of amateur songbooks produced by science fiction fans. It’s mirrored, with weekly updates, on a second site. If the primary site goes away, the index doesn’t, and people can download and save a version which is up to date or nearly so.

A widely used tool for mirroring a site is GNU wget. This article on FOSSwire discusses how to use wget to create a site mirror. Be aware, though, that wget doesn’t do anything but grab files. If your site has dynamic content or if it depends on files that aren’t downloadable, wget won’t help you.
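
For reference, a typical mirroring invocation looks something like the following sketch. It’s assembled in Python so the flags can be annotated; example.com is a placeholder, and actually running it requires wget to be installed:

```python
import subprocess

# Common wget flags for mirroring a static site:
#   --mirror           recursive download with timestamping
#   --convert-links    rewrite links so the copy browses locally
#   --page-requisites  fetch images/CSS needed to render each page
#   --no-parent        stay within the starting directory
cmd = [
    "wget", "--mirror", "--convert-links",
    "--page-requisites", "--no-parent",
    "http://example.com/",   # placeholder URL
]
print(" ".join(cmd))
# To actually run it (requires wget and network access):
# subprocess.run(cmd, check=True)
```

The same command works fine typed directly at a shell prompt, of course.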

Another tool is HTTrack. Unlike wget, it has a GUI. It’s currently available only for Windows, Linux, and Unix. Like wget, it can only grab publicly available files.

Search engines don’t always deal well with mirrors. If they detect a mirror site, they’ll often list just one version, and they may guess wrong about which one is the primary. This actually happened with the Filk Book Index; for a while, Google and other search sites were listing the mirror copy of the index but not the primary one. The solution is for the mirror site to create a file called robots.txt at the top level of the site directory, with the following content:

User-agent: *
Disallow: /

Be careful not to put that on your primary site, though, or search engines won’t list you at all! wget works best if your primary site doesn’t have a robots.txt at all, since wget respects its restrictions by default.

(In case you aren’t familiar with robots.txt: it’s a conventional file which search engines consult to determine which files on a site they shouldn’t index. It doesn’t actually prevent any access, so it’s not a security measure, but it’s respected by legitimate web crawlers.)
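
You can check how a well-behaved crawler will interpret a robots.txt using Python’s standard library, which implements the same convention. This sketch parses the blanket Disallow shown above (the mirror hostname is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Parse the same rules the mirror site would serve in its robots.txt.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# Every compliant crawler is barred from every path on the site.
print(rp.can_fetch("Googlebot", "http://mirror.example.com/index.html"))
# → False
```

Dropping the `Disallow: /` line (or changing it to `Disallow:`) flips the answer to True for everything.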

What about the increasingly common sites with dynamic content? You can mirror them, but it’s harder. You’ll have to send your collaborator a copy of all your files and make sure that any needed software is on the mirror site. If it depends on a database that’s regularly updated, you may be able to give the mirror site access to it. Of course, if the database goes down all the mirrors go down with it.

A mirror actually doesn’t have to be publicly visible, as long as it can be made visible on short notice. You could, for example, put the mirror in a directory which isn’t available to browsers, and run a periodic script that makes the mirror available if the primary stops being available for an extended period of time. Strictly private mirrors can be useful too, if making them available quickly isn’t an issue; they can prevent content from being lost.
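
The periodic check described above can be quite simple. Here’s a sketch under the assumption that the hidden mirror lives in a directory outside the web root; all the names and URLs are placeholders:

```python
import shutil
import urllib.request

def primary_is_up(url: str, timeout: float = 10.0) -> bool:
    """Return True if the primary site answers a GET without error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except OSError:  # connection refused, DNS failure, timeout, HTTP error...
        return False

def promote_mirror(hidden_dir: str, public_dir: str) -> None:
    """Copy the hidden mirror into the public web root."""
    shutil.copytree(hidden_dir, public_dir, dirs_exist_ok=True)

# A cron job might run something like:
#   if not primary_is_up("http://example.com/"):
#       promote_mirror("/home/backup/mirror", "/var/www/html")
```

A real script would want to require several consecutive failed checks before promoting, so a brief outage doesn’t trigger it.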

Your digital legacy

On Sunday, February 19, I’ll be on a panel at Boskone 49 on “Digital Estate — Virtual Property OR On the Internet, Nobody Knows You’re Dead.” The other panelists will be security guru Bruce Schneier and technology writer Daniel Dern. This post is, in part, research and practice for the panel.

“All flesh is like grass,” wrote the Apostle Peter, “but the word of the Lord remains forever.” It’s certainly true that words can outlast people. Will yours, if you’ve put them online? Do you always want them to? If you do nothing, Murphy’s Law may prevail. The flames you wrote on Usenet when you were young will survive, but the writing you value most may go down the digital drain.

It’s likely you have resources on many different sites. You might have a blog, a website, and a social networking account, and probably more than one of some of these. If it’s something you’re paying for, it could disappear when the payments stop. If it’s a free site, it might be terminated for inactivity. In either case, there might be material — restricted posts, private data, infrastructure — that no one can capture by looking at your site.

Some social networking and blogging sites allow family members or friends to “memorialize” an account. I’ve successfully requested this on LiveJournal for two friends who died late last year. How this is handled varies from one site to another. LiveJournal retains all posts. Facebook’s policy is to delete all status updates. Facebook’s approach maximizes privacy but could also wipe out information about someone’s last days that isn’t available anywhere else. Yahoo goes even further, giving your heirs no access to your account except the right to request its deletion. (It’s a little morbid to say “you,” but I have to use some pronoun, and it’s your own legacy you have to be most concerned about.)

In the stress and confusion following your death, your information might be lost. Perhaps no one will know what accounts you have or what their passwords are. (Normally the latter is a good thing, but not in this case.) The best plan may be to have a copy of everything that’s valuable on your own computer and to make sure someone in your family knows how to get at it. This is easier than scrambling around multiple websites with multiple accounts. If you do keep information online which you want to survive you, keep a list of accounts and passwords in a secure place, and make sure someone knows it exists.

It may help to put provisions in your will directing the disposal of your important online assets. I’ve seen a claim that Facebook will download the complete contents of a deceased user’s account to an heir, if you’ve specifically requested it. Such provisions may let your heirs override sites’ default policies. Not being a lawyer, I won’t offer any suggestions on how to phrase such directions, and I’m guessing a lot of lawyers don’t know either.

If you’ve written and uploaded stories, poems, or songs, then you might want to take steps to make sure they can legally stay online. A provision in your will to assign your copyrights could help, and — painful as it might be to realize this — your family members aren’t necessarily the best people to assign them to, especially if your works have only literary and not monetary value. Maybe your kids don’t really care for the beautiful fanfic you wrote. If you assign the copyright to someone who does care, it’s less likely to vanish into a legal black hole.

It’s a difficult area to think about and a difficult one to make the right decisions in, but some planning can make a difference.


Whose site is it anyway?

It’s a common scenario with small organizations. Someone in the group sets up a website, then later loses interest in the organization and wanders off — and no one else has access to maintain the site. If it’s a straightforward HTML website, you can just download the content and move it to a new home, but what if it’s a blog or wiki? The best you can do may be to get a snapshot of the site, minus any databases that control its content. Better than nothing, to be sure, but it could be a disappointment.

If you’re stuck in such a situation, there are tools such as HTTrack for downloading whole websites. That will at least give you the raw material for building a new site. But loss prevention is always better than piecemeal recovery. Many services have backup tools. For example, on WordPress you can export your site as XML. It may take a little work to move that to anything but another WordPress site, but you have all the material. The site maintainer for your organization should do regular backups or exports if that option is available. Don’t say you’re afraid they’ll walk out on you, of course; just say “in case you get sick” or whatever phrasing suits your group’s culture.

If the service allows more than one administrator, take advantage of that opportunity. (And the two admins shouldn’t be a married couple, since they could leave together.) Better yet, have the site administered by a group-owned account, with the password securely stored in more than one place. Accounts are usually tied to email addresses, so create an address or alias owned by the group and register the site with that address.

Sometimes people donate their own server space. This can be an attractive offer, since it costs nothing and makes you completely independent of any business that might vanish, change its terms of service, or arbitrarily cancel your account. But people can vanish too. (I’m speaking as someone who’s had two friends die in the past two months.) Weigh the risks, and keep a backup that will stay available.

Suggestions: Don’t keep all your eggs in one basket. Whatever approach your organization takes, make sure it isn’t critically dependent on one person.

You HAD mail

Email is messy. It adheres to standards when it’s being sent down the wire, but there’s no consistency about how it’s stored. This makes it easy to lose. Politicians are especially hard-hit by this problem, and often inadvertently lose all their email, especially when they leave office.

There are several problems with preserving email. The first is that it may not even be on a computer that you control. There are three major ways to get email: (1) POP3, a protocol which delivers all mail to your computer; (2) IMAP, where the mail is kept on the server and delivered to you as needed; and (3) webmail, where messages aren’t delivered to you as such, but are available for reading on your browser. Only in the first case can you directly export and back up all your mail. On the other hand, cases (2) and (3) store the mail on a server, where hopefully it’s backed up professionally.

Let’s look at these one at a time. POP3 (short for Post Office Protocol version 3), which is defined by IETF RFC 1939, defines a particular way of getting mail to your computer, but it says nothing about its format. RFC 822 defines the format of the message as a series of headers (e.g., “To:” and “From:”) followed by a body, but it says nothing about the format of the body except that it’s ASCII text. Yes, that’s right; all the JPEG images, applications, and Unicode Japanese text that you get in your email have to be sent as ASCII. How is this done? With yet another standard, Multipurpose Internet Mail Extensions or MIME. It takes six RFC standards to specify this.
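
Python’s standard email library implements these standards, and you can watch the ASCII encoding happen. This sketch (all addresses are placeholders, and the “image” is just a few fake bytes) builds a message with a binary attachment and shows that the serialized form is pure ASCII:

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "alice@example.com"
msg["To"] = "bob@example.com"
msg["Subject"] = "Vacation photo"
msg.set_content("Photo attached.")

# A binary payload gets base64-encoded by MIME, so the whole
# message is 7-bit ASCII on the wire.
msg.add_attachment(b"\x89PNG\r\n\x1a\n", maintype="image",
                   subtype="png", filename="photo.png")

raw = msg.as_bytes()
text = raw.decode("ascii")  # would raise if any byte weren't ASCII
print("base64" in text)
# → True (the Content-Transfer-Encoding: base64 header is present)
```

Decoding the attachment back to its original bytes is the receiving client’s job, again per the MIME rules.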

Then with POP3, the MIME-encoded mail with RFC 822 headers arrives on your computer for storage. So there must be yet another standard for this, right? Well … not really. Each email client has its own way of storing your messages. This can be a problem. If you switch from one client application to another, you may not be able to read your old mail any more. Fortunately, many mail clients use MBOX natively (e.g., Thunderbird) or are able to export mail to MBOX format. There’s an IETF standard for this, too: RFC 4155. Unfortunately, it’s more of an attempt to codify the existing chaos than to lay down a standard. It defines a “default” (their quotation marks) MBOX format, and today most client applications support it. It’s possible, though, that you’ll encounter MBOX files which some applications can’t import properly.
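
Python’s standard mailbox module reads and writes the “default” MBOX format, which makes it handy for inspecting an exported archive. A minimal round trip (the file path and addresses are placeholders):

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage

# Write one message into a fresh mbox file, then read it back.
path = os.path.join(tempfile.mkdtemp(), "archive.mbox")
box = mailbox.mbox(path)

msg = EmailMessage()
msg["From"] = "alice@example.com"
msg["Subject"] = "Keep this one"
msg.set_content("A message worth archiving.")
box.add(msg)
box.flush()
box.close()

# Each stored message comes back with its RFC 822 headers intact.
for stored in mailbox.mbox(path):
    print(stored["Subject"])
# → Keep this one
```

The same loop works on an MBOX file exported from Thunderbird or another client, which makes quick sanity checks on a backup easy.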

With IMAP (Internet Message Access Protocol), the situation is about the same except that your mail may not all be on your computer. You may have to take the extra step of downloading all your messages before you can export them, or exporting may just be very slow as all the messages have to be received. If you don’t keep up to date on this and you lose your account (for example, because a spammer hijacks it and your provider has to shut it down), you lose any unsaved mail.

With webmail, you’ve got the same risks and some others. Read the story of how author James Fallows’ wife had her Gmail account hijacked by a spammer and nearly lost all the mail in the account. Exporting from a webmail account can be difficult unless your service provides a way to do it. Some do; for instance, Gmail provides an option for getting all your mail by IMAP or POP.

If your mail is on your computer, you still might not know exactly where it is. Typically it’s buried in some directory you didn’t create. It’s different for each application, and you should find out for the application that you’re using, so you can make sure it’s being backed up.

Outlook, as you might expect, has its own issues. It has an export function but exports into a proprietary format. Fortunately, there are tools for converting this to MBOX format. One clever trick is to use Thunderbird, enhanced by Import-Export Tools, to “export” mail to MBOX format by importing it, even if you don’t otherwise use Thunderbird.

Another reason preserving email is messy is that it’s often 90% or more junk: notes of no lasting importance, spam, and things that could get you into a courtroom or out of your job. A short message might have a huge attachment that you don’t notice and don’t need. Saving all your email can take up an inconvenient amount of space after a while. If you have a good filtering system that organizes your messages into mailboxes, you can decide which boxes are worth saving and which should be thrown away.

Suggestions: With your own mail, make sure you have a copy of your important messages on your own computer and that they’re being backed up. If you’re responsible for site backups, make sure that people’s mail is getting backed up. If policy dictates that people’s mail should be saved when they leave, make sure that when their accounts are closed, their mail is saved in an application-neutral format such as MBOX, and that unauthorized people can’t get at it until it’s been checked for confidential information.

