Durable Format Documents

Storing your business records for the long term

© 2001, 2007 by Charles A. Plesums, Austin, Texas

There is no absolute right or wrong format for "permanent" storage of a document. Recently experts have started talking about durable formats (I like the term).

ASCII, the American Standard Code for Information Interchange, includes special codes that perform functions rather than display symbols, such as:

"Line Feed", LF, moves down one line on the page. ASCII character 10, hex 0A.

"Carriage Return", CR, moves to the left margin. ASCII character 13, hex 0D.

"Form Feed", FF, starts a new page. ASCII character 12, hex 0C.

In the early teletype days both a carriage return (move to the left margin) and a line feed (move down one line) were required to start a new line. (Carriage Return without line feed and reprinting part of the line was sometimes used to simulate bold output.) Some of today's programs accept any combination of CR, LF, or both, to start a new line.
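As a rough illustration of "accepting any combination", here is a minimal Python sketch that names the three control codes and splits text on CR LF, a lone CR, or a lone LF. The function name split_lines and the sample string are my own, made up for the example; real programs may handle the details differently.

```python
import re

# The three control codes described above, by their ASCII values.
LF = chr(10)   # Line Feed, hex 0A
CR = chr(13)   # Carriage Return, hex 0D
FF = chr(12)   # Form Feed, hex 0C

# CR LF is listed first so the pair counts as one line break, not two.
NEWLINE = re.compile(r"\r\n|\r|\n")

def split_lines(text):
    """Split text on CR LF, a lone CR, or a lone LF (hypothetical helper)."""
    return NEWLINE.split(text)

sample = "first" + CR + LF + "second" + LF + "third" + CR + "fourth"
print(split_lines(sample))   # ['first', 'second', 'third', 'fourth']
```

The order of the alternatives matters: if the lone CR were tried before CR LF, every CR LF pair would produce an extra empty line.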

Text files, defined in the old DOS days of PCs or before, are very durable because they are so simple, especially if you don't try to get tricky with them (too many lines per page, or a line so long that it cannot be displayed as a single line). The one open question is whether a "Line Feed", a "Carriage Return", or both should end each line. There is such a strong desire to support "all" ASCII coded text files that many programs can handle or convert any combination of CR and LF, but for greatest compatibility I recommend using CR followed by LF to indicate a new line. Microsoft Windows officially ignored the Form Feed and assumed programs (or printers) would automatically start a new page when necessary, typically after about 55 lines. Amazingly, the text file format is so durable that most programs properly handle the Form Feed (start a new page) character, despite the lack of official Windows support.
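The recommendation above (CR followed by LF, with Form Feed marking a new page) can be sketched in a few lines of Python. The helper names to_crlf and split_pages and the sample text are assumptions made for this example, not part of any standard.

```python
import re

def to_crlf(text):
    """Rewrite every line ending (CR LF, lone CR, lone LF) as CR LF."""
    return re.sub(r"\r\n|\r|\n", "\r\n", text)

def split_pages(text):
    """Split a text file into pages at each Form Feed (ASCII 12, hex 0C)."""
    return text.split("\x0c")

# Mixed line endings and one Form Feed between two pages.
sample = "page one\nline two\r\x0cpage two\r\n"

normalized = to_crlf(sample)
pages = split_pages(normalized)
print(len(pages))   # 2
```

When saving the result from Python, opening the file with newline="" keeps Python from translating the line endings a second time.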

HTML (Hypertext Markup Language, used in web pages on the Internet) looks good as a durable format. It is basically a text file with special codes (tags) to describe how the data should be formatted - large headings, smaller headings, paragraphs, and so forth. There have been many extensions to HTML, but the earliest features are still supported in the latest browsers. HTML is used for more than just web pages - some e-mail systems and help files use it as well. One feature of HTML is both an advantage and a disadvantage: users can override the default format in the document, allowing the document to be reformatted - for example, enlarged for the visually impaired. Therefore HTML cannot be used where the exact format of a document must be reproduced. (For example, many insurance regulators specify that certain parts of a policy must appear in a particular font, size, and style.)

XML (Extensible Markup Language) is a relatively new format, but is built on the same "SGML" foundation as HTML. It is like HTML in the use of text files to store the data, and tags to describe the data. However, in HTML the tags define the appearance but say nothing about the contents. In XML the tags define the contents, and a separate process defines the appearance. For example, the HTML <H1> tag defines the largest heading, while <H2> is the next smaller heading. In XML the tags might be <title> and <author>. It is easy to specify later that the <title> should be displayed as <H1>, but it would be much harder to figure out which <H1> line was the <title>. The rules of XML are simple, so people can "look inside an XML file" and generally interpret what it means. The intent of XML is that the tags are understandable to humans (like <title> and <author>). XML is being received with great excitement as a durable records storage format for documents where the content is important but the exact formatting does not have to be preserved. Records managers sometimes say the archeology of XML is good - if they discovered an XML file, they could probably figure out the meaning of the data.
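To make the "separate process defines the appearance" idea concrete, here is a small Python sketch using the standard xml.etree.ElementTree module. The sample record, the DISPLAY_AS table, and the render function are all hypothetical, invented for this illustration; a real system would more likely use a style sheet.

```python
import xml.etree.ElementTree as ET
from html import escape

# A made-up XML record: the tags describe content, not appearance.
record = "<book><title>An Example Title</title><author>A. N. Author</author></book>"

# Hypothetical display rule, kept separate from the data:
# which HTML tag should present each content tag.
DISPLAY_AS = {"title": "h1", "author": "h2"}

def render(xml_text):
    """Turn each child element into an HTML element chosen by DISPLAY_AS."""
    root = ET.fromstring(xml_text)
    lines = []
    for child in root:
        html_tag = DISPLAY_AS.get(child.tag, "p")   # unknown tags become paragraphs
        lines.append(f"<{html_tag}>{escape(child.text or '')}</{html_tag}>")
    return "\n".join(lines)

print(render(record))
```

Changing the appearance later means editing only DISPLAY_AS; the stored record is untouched. Going the other way, from a pile of headings back to "which one was the title", has no such simple table.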

PDF - The Adobe Acrobat Portable Document Format is a great way to publish a document. The "free" Acrobat Viewer allows the document to be viewed on practically any computer, exactly as prepared by the author. It is probably the best tool for that purpose available today. That free viewer is not without problems - it is a large, slow program by the standards of earlier computers, although that is not significant on today's larger, faster machines. (A few years ago, one company found that 8% of their customers who wanted to install the viewer could not run it on their computer.) Earlier versions of the viewer could not always print the documents (a large memory or PostScript feature was required on laser printers). It is an evolving technology - some documents coded in Acrobat version 2 were not readable using the version 4 and later viewers. At one company, thousands of documents created in version 4 were not readable by the version 5 viewer. This is not a big deal if you are working with dozens or hundreds of documents, have retained the original source documents (word processing files), and can reprocess the documents. It is a huge deal if this is the primary long-term storage format for millions of documents. These early problems have become less significant, but they do illustrate the vulnerability of a high-tech format. Adobe has cooperated with the standards community to produce a subset of the PDF format for archive use, PDF/A, but the larger files and performance issues reduce its acceptance. Although Adobe would disagree with me, I do not recommend PDF format as a durable way to store records.

A bitmap (scanned) image is pretty durable, but what is the file format and compression? Tagged Image File Format, TIFF, looks good as far into the future as we can see. It isn't a standard (it is owned by Adobe), but the specification is openly published, freely available, and widely accepted and implemented. Of course, you need to choose the "right" options (of the thousands of combinations available) in TIFF. And TIFF supports many compression schemes, so you must choose the right compression scheme as well. The T.6 compression standard defined for Group 4 fax by the ITU-T (the International Telecommunication Union's Telecommunication Standardization Sector, formerly CCITT) is widely accepted and supported for binary (black and white) documents. It may no longer be the most efficient, but as a well-established, widely accepted international standard, it certainly seems durable. IBM's Mixed Object Document Content Architecture, MO:DCA, was a great image format, at one time ahead of TIFF, but the documentation was terrible, so acceptance was poor, and support has faded. Proponents can argue that MO:DCA was better than TIFF, just as Beta was better than VHS videotape.
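Because the TIFF specification is openly published, it is practical to check which compression scheme a file claims to use without any vendor software. The following Python sketch reads the Compression tag (tag 259) from the first image directory of a TIFF; the value 4 means T.6 (Group 4 fax) compression. The function name tiff_compression and the tiny hand-built sample are assumptions for this example, and the sample is only a header fragment, not a viewable image.

```python
import struct

COMPRESSION_TAG = 259   # TIFF "Compression" tag number
GROUP4 = 4              # Compression value for T.6 (Group 4 fax) encoding

def tiff_compression(data):
    """Return the Compression value from the first image directory,
    or None if the tag is not present."""
    order = data[:2]
    if order == b"II":        # little-endian ("Intel") byte order
        endian = "<"
    elif order == b"MM":      # big-endian ("Motorola") byte order
        endian = ">"
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")
    (count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    for i in range(count):
        start = ifd_offset + 2 + 12 * i          # each directory entry is 12 bytes
        tag, field_type, n = struct.unpack(endian + "HHI", data[start:start + 8])
        if tag == COMPRESSION_TAG:
            # A single SHORT value sits in the first 2 bytes of the value field.
            (value,) = struct.unpack(endian + "H", data[start + 8:start + 10])
            return value
    return None

# Hand-built fragment: header plus a one-entry directory (Compression = 4).
header = b"II" + struct.pack("<HI", 42, 8)
directory = struct.pack("<H", 1) + struct.pack("<HHIHH", 259, 3, 1, 4, 0) + struct.pack("<I", 0)
print(tiff_compression(header + directory) == GROUP4)   # True
```

A check like this could run when documents are loaded into the archive, so that files using an unexpected compression scheme are caught while they can still be rescanned or converted.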

Color images can be compressed using the widely accepted JPEG standard (JPG). The new JPEG 2000 (JP2) has numerous advantages, but is not yet widely supported (vendors are still charging a high price for early programs), and the processing load for compression is still high enough to make it too slow for some systems. Many experts hoped that JPEG 2000 would catch on, be included in most programs, and become a durable format, but the original JPEG is still dominant. The multilayer JPEG 2000 (JPM) looks like it will have even more advantages for document imaging, but is not yet routinely available - definitely not yet a durable format. The biggest issue with color images is the file format rather than the compression. Each of the JPEG compression standards has its own file format, which holds one image per file. TIFF files support multiple pages in a single file, and also support JPEG compression, but there have been issues with various implementations of JPEG in TIFF files. Thus there is no obvious good answer for color images, especially if multiple pages are required in a single file.

Companies often want to keep files from Word, Excel, and other application programs "permanently." From personal experience, some documents only a few (5?) years old have become unusable, or have changed format so much that the document is very difficult to use. Sometimes the problem is that the file format has changed. Other times, the problem is that a necessary font or style sheet or other supporting element is no longer available. After many years of compatibility, many were feeling comfortable with the Microsoft .doc format, but that is being changed in Office 2007. Spreadsheet data stored as CSV (comma-separated values) files, or word processing documents stored as RTF (Rich Text Format) files, are more likely to be durable, but the native word processing and spreadsheet files are certainly NOT a durable format.
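As a small illustration of why CSV holds up well, here is a Python sketch that writes a few rows of spreadsheet-style data as CSV using the standard csv module and reads them back. The rows are made-up sample data; the point is only that the result is plain text that any future program (or person) can read.

```python
import csv
import io

# Made-up spreadsheet rows; note the comma inside the second field of row two.
rows = [
    ["policy", "holder", "premium"],
    ["A-100", "Smith, Jane", "125.50"],
]

buffer = io.StringIO()
writer = csv.writer(buffer)   # default dialect: quotes fields that contain commas, ends rows with CR LF
writer.writerows(rows)
text = buffer.getvalue()
print(text)

# Reading the text back recovers the original rows.
restored = list(csv.reader(io.StringIO(text)))
print(restored == rows)   # True
```

Formulas, fonts, and column widths are lost in this form, which is exactly the trade: only the data survives, but the data is what most records need to keep.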

COLD (Computer Output to Laser Disk) technology captures computer output for long term storage, retrieval, and presentation. Some COLD vendors convert the input print stream into a proprietary database so that the data is readily available for analysis or for generation of new reports. I do not consider database-oriented COLD storage, which requires the programs from a particular vendor to interpret the data, to be a durable format. Other COLD vendors store data in the original printer format. Nobody claims a printer format is a standard, but since there are hundreds of thousands of printers that use a format, most people consider it durable - the printers aren't likely to go away soon.


Send e-mail comments to Charlie@Plesums.com
