Sometimes it’s helpful to have the same Web content at more than one location. Having a second copy means the content stays available if one site goes down for any reason. This can be useful with unpopular material that’s prone to denial-of-service attacks, or with material of lasting importance that shouldn’t go away if its primary maintainer drops the ball. For example, I’m the main force behind the Filk Book Index, a list of the contents of amateur songbooks produced by science fiction fans. Its primary home is massfilc.org, and it’s mirrored, with weekly updates, at freemars.org. If massfilc.org goes away, the index doesn’t, and people can download and save a version which is up to date or nearly so.

A widely used tool for mirroring a site is GNU wget. This article on FOSSwire discusses how to use wget to create a site mirror. Be aware, though, that wget doesn’t do anything but grab files. If your site has dynamic content or if it depends on files that aren’t downloadable, wget won’t help you.
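To give a rough idea of what this looks like in practice (the URL is a placeholder, and the exact options worth using depend on your version of wget and your site), a typical mirroring command runs along these lines:

wget --mirror --convert-links --page-requisites --no-parent https://www.example.org/

Here --mirror turns on recursive retrieval with timestamping, --convert-links rewrites links so the downloaded copy works when browsed locally, --page-requisites pulls in the images and stylesheets each page needs, and --no-parent keeps wget from wandering above the starting directory.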

Another tool is HTTrack. Unlike wget, it has a GUI. It’s currently available only for Windows, Linux, and Unix. Like wget, it can only grab publicly available files.
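HTTrack can also be run from the command line. If I remember its syntax correctly, a basic invocation looks something like this, where the URL and the output directory are just examples:

httrack https://www.example.org/ -O /home/user/mirror

The -O option tells it where to write the copied site.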

Search engines don’t always deal well with mirrors. If they detect a mirror site, they’ll often list just one version, and they may guess wrong about which one is the primary. This actually happened with the Filk Book Index; for a while, Google and other search sites were listing the copy on freemars.org but not the one on massfilc.org. The solution is for the mirror site to create a file called robots.txt at the top level of the site directory, with the following content:

User-agent: *
Disallow: /

Be careful not to put that on your primary site, though, or search engines won’t list you at all! Also, wget works best if the primary site has no robots.txt, since wget respects its restrictions by default.
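If you can’t simply remove robots.txt from the primary site, wget can be told to ignore it. The usual way, as far as I know, is the robots=off setting, for example:

wget -e robots=off --mirror --convert-links --page-requisites https://www.example.org/

Use that with some care; robots.txt is often there for a reason.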

(In case you aren’t familiar with robots.txt: it’s a plain text file at the top level of a site which search engines consult, by convention, to determine which files they shouldn’t crawl or index. It doesn’t actually prevent any access, so it’s not a security measure, but legitimate web crawlers respect it. Learn more about it here.)

What about the increasingly common sites with dynamic content? You can mirror them, but it’s harder. You’ll have to send whoever runs the mirror a copy of all your files and make sure that any software the site depends on is installed there as well. If the site depends on a database that’s regularly updated, you may be able to give the mirror site access to it. Of course, if the database goes down, all the mirrors go down with it.
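One workable approach, assuming a MySQL-backed site and SSH access to the mirror (the database name, paths, and host names here are only examples), is to ship a periodic database dump along with the files rather than sharing a live database:

mysqldump --single-transaction mydatabase | gzip > /home/user/backup/mydatabase.sql.gz
rsync -az /home/user/backup/ mirroruser@mirror.example.org:/home/mirroruser/backup/

The mirror then loads the dump into its own copy of the database, so it isn’t tied to yours when serving pages.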

A mirror actually doesn’t have to be publicly visible, as long as it can be made visible on short notice. You could, for example, put the mirror in a directory which isn’t available to browsers, and run a periodic script that makes the mirror available if the primary stops being available for an extended period of time. Strictly private mirrors can be useful too, if making them available quickly isn’t an issue; they can prevent content from being lost.
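Here’s a rough sketch of such a script, meant to be run from cron. The URL, paths, and timeout are placeholders, and a real version would want to see the primary fail several times in a row before flipping the switch:

#!/bin/sh
# If the primary site stops answering, copy the hidden mirror into the web root.
if ! curl --silent --fail --max-time 30 https://www.example.org/ > /dev/null; then
    cp -a /home/user/private-mirror/. /var/www/html/
fi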