Interchange formats


Some file formats are good for long-term storage of files, because they’re likely to be usable for a long time. (“A long time” in computer terms means ten or twenty years; if you want files that really last, get a rock, a hammer, and a chisel.) These are preservation formats. There are also file formats which are good for moving files from one application to another. These are interchange formats. (See my last post, “Tied to an Application”, on why these are important.) The two have a lot of overlap but aren’t the same.

Both kinds of format share some requirements. Specifications should be publicly available. The format should represent the information without losing any. There shouldn’t be legal barriers to implementation.

The big difference is that interchange formats have to work right now. A format which is otherwise great doesn’t help most users if they can’t import it into a new application with available software. Interchange formats have to keep editing-related information. PDF is a great format for preservation, but try turning a PDF into an editable file. The results are usually disastrous if the file’s at all complex.

Is Microsoft Word a usable interchange format? Much as it makes me gag, I have to say yes in many cases. You can open Word files with quite a number of different applications. Someone got paid a lot to reverse-engineer the Word format, but the job has been done. A safer bet, though, might be to export from Word to ODF. The current version can do this natively, and the ODF Add-In on SourceForge claims to do it better. That way, whatever application is importing the file is following a published spec, leaving less room for bugs and other surprises.

RTF (Rich Text Format) is nominally an interchange format, but it’s actually a poor one. Its handling of character encoding is miserable and can result in garbled files when an application guesses wrong about a file’s encoding. It isn’t standardized.
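
To make the encoding problem concrete, here's a minimal Python sketch. RTF writes non-ASCII characters as hex escapes such as \'e9, and each escape is a byte in whatever code page the file declares or the reading application assumes. The function name decode_rtf_hex_escapes is made up for this illustration, and it handles nothing but those escapes; it isn't a real RTF parser.

```python
import re

# Matches RTF hex escapes of the form \'xx (two hex digits).
_HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")


def decode_rtf_hex_escapes(fragment: str, codepage: str) -> str:
    """Replace each \\'xx escape with the character that byte has in `codepage`.

    Hypothetical helper for illustration only: it ignores control words,
    groups, and Unicode escapes.
    """
    def replace(match):
        byte = bytes([int(match.group(1), 16)])
        return byte.decode(codepage, errors="replace")

    return _HEX_ESCAPE.sub(replace, fragment)


fragment = r"caf\'e9"  # "café" as a Windows application might write it

# Read with the code page the writer used, the text comes back intact.
print(decode_rtf_hex_escapes(fragment, "cp1252"))  # prints "café"

# Read with a different guess, the same byte becomes a different character.
print(decode_rtf_hex_escapes(fragment, "mac_roman"))
```

The bytes in the file never change; only the reader's assumption does, which is why the damage tends to show up only after the file has moved to another application or platform.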

Don’t count on any interchange format to give you exactly the same content with a new application. There will almost always be subtle differences in formatting from one application to another. Color profiles may be treated differently. Metadata might not be 100% preserved.

Don’t use JPEG for image interchange. Its compression is lossy, so some image data is thrown away each time the file is re-encoded along the way. TIFF is good for getting images from one application to another.

When you export a file, keep it in the original format as well, at least till you’re sure you’ve exported it to your satisfaction. If anything goes wrong, that leaves you a chance of exporting again with better tools or settings, or if all else fails of manually moving information over.

PDF/A for the long haul

PDF is a useful format. It’s an ISO standard. There’s reliable free software for reading it. It’s widely used and difficult to modify by accident. It can serve as a container for text, illustrations, audio, and video.

If used carelessly, though, it has its risks for long-term preservation. A PDF file isn’t necessarily self-contained. It might depend on fonts that aren’t stored in the file, or even pull in content from separate files. To avoid the risk of dependencies that might cause future problems, you can use PDF/A, a restricted subset of PDF. Good software for generating PDF usually includes a PDF/A option.

On Mac OS X, doing this is a little weird. You start by selecting “Print” from your application’s File menu. There will be a “PDF” button in the Print dialog. Clicking on it brings up a menu. Choose not “Save as PDF…” but “Save as Adobe PDF.” This launches an application to save the PDF. You’ll see a dialog like the following:

[Figure: Macintosh PDF save dialog]

In the “Adobe PDF settings” select one of the PDF/A options.

If you have Acrobat, conversion to PDF/A is much simpler.

Correction (29-Jan-2012): If you don’t have Acrobat installed on your computer, you don’t get the “Save as Adobe PDF” option. Sorry about that.

PDF/A documents are even harder to edit than regular PDFs, so you should hold off on converting a document to PDF/A if you’re still tweaking it.

There are two versions of PDF/A, called PDF/A-1 and PDF/A-2. A-1 is based on PDF 1.4, and A-2 is based on 1.7. PDF 1.7 is fully backward compatible with 1.4, so you’re safe using either. PDF/A-1 has two conformance levels, a and b, with a being the stricter; PDF/A-2 adds a third, u, which is Level b plus a requirement that all text map to Unicode. From a preservation standpoint, there isn’t much difference between a and b. Converting to Level b may be easier, since Level a requires you, or the generating software, to produce a tagged structure. The structure helps to make the document self-explanatory but may not be necessary.
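
If you're curious which PDF version a file claims, the first line of every PDF is a header comment such as %PDF-1.4. Here's a small Python sketch that reads it. The function name pdf_header_version is my own, and the sketch ignores the fact that a document's catalog can carry a /Version entry that overrides the header, so treat the result as a hint rather than the final word.

```python
import re

# The PDF header comment looks like "%PDF-1.4" and sits at the start of the file.
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_header_version(pdf_bytes: bytes):
    """Return (major, minor) from the PDF header, or None if no header is found.

    Hypothetical helper: only the first 1024 bytes are examined, and any
    /Version override in the document catalog is not consulted.
    """
    match = _HEADER.search(pdf_bytes[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# In-memory stand-in for the start of a real file.
print(pdf_header_version(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"))  # prints (1, 4)
```

With a real file you would pass in the bytes read from disk, for example the result of open(path, "rb").read(1024).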

There are quite a few restrictions which a file must satisfy to conform to PDF/A, including these:

  • All fonts must be embedded in the document, and there are some restrictions on font implementation. Conversion software will generally refuse to put fonts flagged as non-embeddable into PDF/A.
  • Color profiles must be used to guarantee device-independent color.
  • The document may not be encrypted.
  • There may not be audio, video, or JavaScript.
  • XMP metadata must be included.
  • Forms are allowed with restrictions.

Clearly there are legitimate things to do in PDF that can’t be done in PDF/A. Audio and video can’t be included. The requirements for fonts and profiles can make PDF/A files bigger than they’d otherwise be. But if it’s important to give files the best chance for long-term readability and the restrictions don’t impact necessary content, PDF/A is a good choice.
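
The XMP requirement has a handy side effect: a PDF/A file states its own version and conformance level in its metadata, using the pdfaid:part and pdfaid:conformance properties. The Python sketch below looks for those two properties in the raw bytes of a file. The function name declared_pdfa_level is invented for this example, and the sketch assumes the XMP packet is stored uncompressed. It only reports what the file claims; it doesn't check fonts, color profiles, or any of the other restrictions, so it's no substitute for a real validator.

```python
import re

# XMP may write a property as an element (<pdfaid:part>1</pdfaid:part>)
# or as an attribute (pdfaid:part="1"); accept either form.
_PART = re.compile(rb"pdfaid:part(?:>|=[\"'])\s*(\d)")
_CONFORMANCE = re.compile(rb"pdfaid:conformance(?:>|=[\"'])\s*([A-Za-z])")


def declared_pdfa_level(pdf_bytes: bytes):
    """Return a string like "PDF/A-1b" if the file declares one, else None.

    Hypothetical helper: scans raw bytes instead of parsing the PDF, so it
    misses metadata that is compressed or otherwise encoded.
    """
    part = _PART.search(pdf_bytes)
    if part is None:
        return None
    conformance = _CONFORMANCE.search(pdf_bytes)
    level = conformance.group(1).decode("ascii").lower() if conformance else ""
    return "PDF/A-" + part.group(1).decode("ascii") + level


# In-memory stand-in for a file's XMP packet.
sample = (
    b"%PDF-1.4\n"
    b"<rdf:Description>"
    b"<pdfaid:part>1</pdfaid:part>"
    b"<pdfaid:conformance>B</pdfaid:conformance>"
    b"</rdf:Description>"
)
print(declared_pdfa_level(sample))  # prints PDF/A-1b
```

A result of None means the file makes no PDF/A claim that this scan can see, which is what you'd expect from an ordinary PDF.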

Suggestions: Use PDF/A for documents that you expect to be needed for a long time. If you’re using PDF as a general-purpose container, though, it may not be a good choice.

Useful links: