Getting preservation out of the monastery

We’re seeing the beginnings of information on digital preservation for people outside the narrow world of libraries and archives, but so far a lot of it isn’t quite to the point. Rather than addressing the non-specialist, it addresses the idea of addressing the non-specialist. So just for this one post, I have to do the same myself to explain why we need to move further.

The Library of Congress is now offering a personal digital archiving kit. This is a good thing, but it’s not directed at the users and sysadmins; it’s described as “Guidance and resources for information professionals on how to organize and host your own Personal Digital Archiving Day.” Its YouTube videos on digital preservation include a number addressed to the generalist, but they often convey a feeling of embarrassment or talking down. The intent is good; the execution can be difficult.

Other sites likewise make an effort but don’t quite engage the target audience. There’s a site called Personal Archiving, but it’s not exactly about personal archiving; it’s about personal archiving conferences.

There are preservation-related tools which could be useful for a broad range of users but will scare most people away in their present form. DROID is a very useful tool for figuring out what kind of files you’ve got, but even serious geeks will be stumped by a listing that says they have “Tagged Image File Format” and “Portable Document Format” files, at least till they think to look at the first letter of each word and realize they’re just TIFF and PDF respectively.

The LoC’s Personal Archiving page makes a worthy effort, stressing basic practices such as identification, selection, and organization. But beyond this, we need to make serious technical information available in a form that doesn’t require initiation as a specialist. I’m talking about a level of information comparable to what people can easily find to create websites, network computers, and manage databases, information which has detail but isn’t couched in archivists’ jargon.

It isn’t easy to get the word out and get the information out at the same time. Making people aware of the issue of preservation means pounding on the basics, but getting real information out means diving into the technical nitty-gritty.

I’m talking about moving outside the safe boundaries of the National Archives and iPRES and creating courses in the computer science curriculum, books in the mainstream computer market. The title of this blog originally belonged to a book proposal which O’Reilly almost bit on, but then decided didn’t quite have the interest to justify it yet. The interest has to grow, and the technical material on preservation had better become widely available before the news media notice a coming crisis of disappearing cultural materials.

In the old days, preservation of writings was the job of the monasteries, and nobody else worried much about it. That can’t work today. Preservation needs to become part of computer literacy. This requires books, courses, websites, and software. There’s a lot of work to be done.

Mirror, mirror on the web

Sometimes it’s helpful to have the same Web content at more than one location. Doing this means that it continues to be available if one site goes down for any reason. This can be useful with unpopular material that’s prone to denial-of-service attacks, or with material of lasting importance that shouldn’t go away if its primary maintainer drops the ball. For example, I’m the main force behind the Filk Book Index, a list of the contents of amateur songbooks produced by science fiction fans. It’s mirrored, with weekly updates, at a second site. If the primary site goes away, the index doesn’t, and people can download and save a version which is up to date or nearly so.

A widely used tool for mirroring a site is GNU wget. An article on FOSSwire discusses how to use wget to create a site mirror. Be aware, though, that wget doesn’t do anything but grab files. If your site has dynamic content or if it depends on files that aren’t downloadable, wget won’t help you.
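As a rough illustration of what a mirroring job looks like, the Python sketch below builds a wget command line and runs it. The options shown (--mirror, --convert-links, --page-requisites, --no-parent, --directory-prefix) are standard wget options; the URL, the target directory, and the helper names are placeholders invented for this example, not anything the Filk Book Index actually uses.

```python
import subprocess
from typing import List


def build_mirror_command(site_url: str, target_dir: str) -> List[str]:
    """Return a wget command line that copies a static site into target_dir."""
    return [
        "wget",
        "--mirror",            # recursive download with timestamp checking
        "--convert-links",     # rewrite links so the copy works on its own
        "--page-requisites",   # also fetch images, stylesheets, and scripts
        "--no-parent",         # never climb above the starting directory
        "--directory-prefix", target_dir,
        site_url,
    ]


def run_mirror(site_url: str, target_dir: str) -> int:
    """Run wget (which must be installed) and return its exit status."""
    return subprocess.run(build_mirror_command(site_url, target_dir)).returncode


if __name__ == "__main__":
    # Placeholder values; substitute the site you actually maintain.
    print(" ".join(build_mirror_command("https://example.org/", "mirror-copy")))
```

Running something like this from a weekly scheduled job would give the kind of regularly refreshed copy described above, with the same limitation: only files that the server hands out publicly are captured.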

Another tool is HTTrack. Unlike wget, it has a GUI. It’s currently available only for Windows, Linux, and Unix. Like wget, it can only grab publicly available files.

Search engines don’t always deal well with mirrors. If they detect a mirror site, they’ll often list just one version, and they may guess wrong about which one is the primary. This actually happened with the Filk Book Index; for a while, Google and other search sites were listing the index on the mirror site but not the one on the primary site. The solution for this is for the mirror site to create a file called robots.txt at the top level of the site directory, with the following content:

User-agent: *
Disallow: /

Be careful not to put that on your primary site, though, or search engines won’t list you at all! Also, wget works best if your primary site doesn’t have a robots.txt at all, since by default wget obeys whatever restrictions that file contains.

(In case you aren’t familiar with robots.txt: it’s a plain-text file which search engines use by convention to determine which files on a site they shouldn’t index. It doesn’t actually prevent any access, so it’s not a security measure, but it’s respected by legitimate web crawlers.)
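To see how a well-behaved crawler would read those two lines, here is a small sketch using urllib.robotparser from Python’s standard library. The mirror URL is a placeholder and the helper name is_allowed is made up for the example.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content suggested above for the mirror site.
MIRROR_ROBOTS = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Report whether a crawler that honors robots.txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "https://mirror.example/index.html"  # placeholder address
    print(is_allowed(MIRROR_ROBOTS, page))  # mirror: everything is off limits
    print(is_allowed("", page))             # empty robots.txt: nothing is restricted
```

The second call shows why the primary site should not carry the same file: with no restrictions in place, crawlers are free to index everything.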

What about the increasingly common sites with dynamic content? You can mirror them, but it’s harder. You’ll have to send your collaborator a copy of all your files and make sure that any needed software is on the mirror site. If it depends on a database that’s regularly updated, you may be able to give the mirror site access to it. Of course, if the database goes down all the mirrors go down with it.

A mirror actually doesn’t have to be publicly visible, as long as it can be made visible on short notice. You could, for example, put the mirror in a directory which isn’t available to browsers, and run a periodic script that makes the mirror available if the primary stops being available for an extended period of time. Strictly private mirrors can be useful too, if making them available quickly isn’t an issue; they can prevent content from being lost.
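One way such a periodic script might decide when to act is sketched below in Python. It is only an outline under stated assumptions: the primary URL and the grace period are placeholders, the function names are invented, and the step that actually exposes the mirror (moving a directory, changing server configuration, and so on) depends on your hosting and is left out.

```python
import urllib.error
import urllib.request
from typing import Optional


def primary_is_up(url: str, timeout: float = 10.0) -> bool:
    """Return True if the primary site answers with a successful response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        return False


def update_outage(down_since: Optional[float], is_up: bool, now: float) -> Optional[float]:
    """Track when the current outage began; None means the primary is healthy."""
    if is_up:
        return None
    return down_since if down_since is not None else now


def should_publish(down_since: Optional[float], now: float, grace_seconds: float) -> bool:
    """Publish the mirror only after the primary has been down for the grace period."""
    return down_since is not None and now - down_since >= grace_seconds
```

Each scheduled run would call primary_is_up, pass the result and the stored outage start time to update_outage, save the new value, and expose the mirror once should_publish returns True. Waiting out a grace period keeps a brief hiccup on the primary from triggering the switch.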

Your digital legacy

On Sunday, February 19, I’ll be on a panel at Boskone 49 on “Digital Estate — Virtual Property OR On the Internet, Nobody Knows You’re Dead.” The other panelists will be security guru Bruce Schneier and technology writer Daniel Dern. This post is, in part, research and practice for the panel.

“All flesh is like grass,” wrote the Apostle Peter, “but the word of the Lord remains forever.” It’s certainly true that words can outlast people. Will yours, if you’ve put them online? Do you always want them to? If you do nothing, Murphy’s Law may prevail. The flames you wrote on Usenet when you were young will survive, but the writing you value most may go down the digital drain.

It’s likely you have resources on many different sites. You might have a blog, a website, and a social networking account, and probably more than one of some of these. If it’s something you’re paying for, it could disappear when the payments stop. If it’s a free site, it might be terminated for inactivity. In either case, there might be material (restricted posts, private data, infrastructure) that no one can capture by looking at your site.

Some social networking and blogging sites allow family members or friends to “memorialize” an account. I’ve successfully requested this on LiveJournal for two friends who died late last year. How this is handled varies from one site to another. LiveJournal retains all posts. Facebook’s policy is to delete all status updates. Facebook’s approach maximizes privacy but could also wipe out information about someone’s last days that isn’t available anywhere else. Yahoo goes even further, giving your heirs no access to your account except the right to request its deletion. (It’s a little morbid to say “you,” but I have to use some pronoun, and it’s your own legacy you have to be most concerned about.)

In the stress and confusion following your death, your information might be lost. Perhaps no one will know what accounts you have or what their passwords are. (Normally the latter is a good thing, but not in this case.) The best plan may be to have a copy of everything that’s valuable on your own computer and to make sure someone in your family knows how to get at it. This is easier than scrambling around multiple websites with multiple accounts. If you do keep information online which you want to survive you, keep a list of accounts and passwords in a secure place, and make sure someone knows it exists.

It may help to put provisions in your will directing the disposal of your important online assets. I’ve seen a claim that Facebook will download the complete contents of a deceased user’s account to an heir, if you’ve specifically requested it. Such provisions may let your heirs override sites’ default policies. Not being a lawyer, I won’t offer any suggestions on how to phrase such directions, and I’m guessing a lot of lawyers don’t know either.

If you’ve written and uploaded stories, poems, or songs, then you might want to take steps to make sure they can legally stay online. A provision in your will to assign your copyrights could help, and — painful as it might be to realize this — your family members aren’t necessarily the best people to assign them to, especially if your works have only literary and not monetary value. Maybe your kids don’t really care for the beautiful fanfic you wrote. If you assign the copyright to someone who does care, it’s less likely to vanish into a legal black hole.

It’s a difficult area to think about and a difficult one to make the right decisions in, but some planning can make a difference.


Video preservation

Planning so that your files will survive for a long time is tricky in general, and video is one of its trickiest areas. When even the designers of HTML5 can’t agree on a video format, what are the odds that your family’s or club’s movies will still be viewable in ten or twenty years? Even at the Library of Congress, there’s considerable uncertainty about digital video preservation strategies, and even big movie studios are at risk of not preserving their now all-digital movies.

Just figuring out what format you have is confusing. There are two things you have to know: the format of the file as a whole, called the “container,” and the way the bits represent the video, called the “encoding.” These are largely independent of each other, and the specifications for each can have multiple options. The same format may be referred to by different names, and different formats may be called by the same name.

Usually you create a video from a camera, and it probably doesn’t give you a lot of format options. If you process it with a video editor, you have more choices about the final format. The file suffix tells you what the container format is supposed to be but not what encoding was used. If it’s .MOV, you have a QuickTime container. If it’s .MP4, you have an MP4 container — which is not synonymous with MPEG-4, but rather with MPEG-4 Part 14. Both are MPEG-4 compliant but not at all compatible with each other.

It’s common to refer incorrectly to other MPEG-4 container files, including audio-only files, as MP4. If it’s not a Part 14 container, it shouldn’t be called MP4. On the other hand, its being a legitimate MP4 file tells you nothing about what encoding it uses, so not all “MP4” files are compatible with each other; likewise for QuickTime files.
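Since the suffix only says what the container is supposed to be, it can help to look at the first bytes of the file itself. The Python sketch below checks a few well-known signatures: "OggS" for Ogg, "RIFF" followed by "AVI " for AVI, the ASF header GUID, and the "ftyp" box used by QuickTime and the MPEG-4 family. It is a simplified illustration, not a full identifier: older QuickTime files without an "ftyp" box come back as unknown, and nothing here reveals which encoding is inside. The function name is made up for the example.

```python
# First 16 bytes of an ASF file: the ASF header object GUID.
ASF_GUID = bytes.fromhex("3026b2758e66cf11a6d900aa0062ce6c")


def sniff_container(header: bytes) -> str:
    """Guess the container format from the first bytes of a video file."""
    if header[:4] == b"OggS":
        return "Ogg"
    if header[:4] == b"RIFF" and header[8:12] == b"AVI ":
        return "AVI"
    if header[:16] == ASF_GUID:
        return "ASF"
    if header[4:8] == b"ftyp":
        brand = header[8:12]  # four-character "major brand" code
        if brand == b"qt  ":
            return "QuickTime"
        return "MPEG-4 family (brand %s)" % brand.decode("ascii", errors="replace")
    return "unknown"


if __name__ == "__main__":
    import sys

    # Usage: python sniff.py somefile.mov
    with open(sys.argv[1], "rb") as handle:
        print(sniff_container(handle.read(16)))
```

A file named with an .MP4 suffix that reports a QuickTime brand, or the reverse, is exactly the kind of naming confusion described above.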

Videos produced by current cameras and software usually will use the H.264 encoding, aka MPEG-4 Part 10. You may also run into MPEG-4 Part 2, which is based on H.263. If you have a strong preference for open-source, you may want to go with the Theora codec. The win is available source code and (hopefully) a lack of patent encumbrances, but the risk is that less software supports it. Preservation is always a matter of placing bets. If you use Theora, it should be in an Ogg container, not an MP4 container; the latter combination is technically MPEG-4 compliant, as a “private stream,” but may not be supported in the long term.

An older container format, Audio Video Interleave or AVI, still has strong support. It dates all the way back to Windows 3.1. Its Full Frame encoding option lets you store uncompressed video.

Microsoft’s current entry is Advanced Systems Format (ASF), often in combination with Windows Media Video (WMV) encoding. It’s widely supported but tends to be Windows-specific, so it may not be the best choice for long-term preservation.

Most video encodings are compressed, since uncompressed video takes a lot of space even by modern standards, and usually the compression is lossy (i.e., it isn’t possible to recover the original data without some loss of accuracy). There are ways to get uncompressed or lossless compressed encoding, but here we’re getting into esoteric areas which I’d best not touch.

The video encoding isn’t the whole story. An encoding such as H.264 is only a video encoding, and even if you’re a silent movie fan like me, you probably like sound in a lot of your movies. The audio encoding that goes with H.264 is usually MPEG-4 Part 3 Advanced Audio Coding, known for short as AAC, but this isn’t required. If it’s something else you could have preservation issues.

As I said, it’s a mess. I’m far from an expert in this area, but this article should give you an idea of the issues to look for.

Suggestions: Current advice varies a lot. Popular options today include a QuickTime (.MOV) or MPEG-4 Part 14 (.MP4) container with H.264 video and AAC audio if you want to go with software popularity, or Ogg with Theora and Vorbis if you value openness more. The older AVI is hardly dead. Pay attention to the encoding, not just the container format. Stay tuned for future developments and be prepared to migrate to new formats.
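To check the encodings and not just the container, one option is to ask a tool that reads the streams. The sketch below assumes FFmpeg’s ffprobe is installed, which is my choice for illustration and not something discussed above. It requests the codec type and name of each stream as JSON and groups them; the function names are invented for the example.

```python
import json
import subprocess
from typing import Dict, List


def probe(path: str) -> str:
    """Ask ffprobe (must be installed) for per-stream codec info as JSON text."""
    result = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "stream=codec_type,codec_name",
         "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def codecs_by_type(probe_json: str) -> Dict[str, List[str]]:
    """Group codec names by stream type, for example video and audio."""
    grouped: Dict[str, List[str]] = {}
    for stream in json.loads(probe_json).get("streams", []):
        kind = stream.get("codec_type", "unknown")
        grouped.setdefault(kind, []).append(stream.get("codec_name", "unknown"))
    return grouped


if __name__ == "__main__":
    import sys

    # Usage: python codecs.py somefile.mp4
    print(codecs_by_type(probe(sys.argv[1])))
```

In ffprobe’s naming, H.264 video shows up as h264 and AAC audio as aac, while the open alternative appears as theora and vorbis. Anything else in the listing is worth a closer look before you rely on the file for the long term.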