XML for document preservation
17:12, 26 Jan 2001 UTC | Eric van der Vlist

Isn't it surprising to find a recommendation, not yet three years old and unapproved by any official body, appraised by engineers, lawyers and archivists invited by their government to debate the long-term preservation of digital documents?

Yet this was just the case in a meeting organized by the MTIC [1], for the French Prime Minister, to present a "guide for the preservation of digital documents." It included presentations by Alain Bensoussan, a lawyer specializing in the issues of digital documents, and Catherine Dhérent representing the "Archives Nationales."

Despite their virtual nature, digital documents are threatened by the lack of long-term stability of their media. The French standard NF Z42-013 and law on the validity of digital documents as formal proof require that documents be written on non-rewritable media, guaranteed only over ten years -- a very brief period of time from the archivists' point of view.

This physical deterioration is aggravated by the short life cycle of the logical formats used to represent documents.

The long-term preservation of digital documents thus requires the setup of a dynamic process to schedule, run and audit the physical and logical migrations needed to keep documents alive.
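The scheduling part of such a process can be illustrated with a minimal sketch. It assumes a ten-year guarantee period for the media, as cited above; the `ArchivedDocument` record, its fields and both function names are hypothetical and are not taken from the guide.

```python
from dataclasses import dataclass
from datetime import date

# Assumption: media are trusted for ten years, the guarantee period cited above.
MEDIA_GUARANTEE_YEARS = 10


@dataclass
class ArchivedDocument:
    identifier: str      # hypothetical archive identifier
    written_on: date     # date the current physical copy was written
    logical_format: str  # e.g. "XML 1.0"


def migration_deadline(doc, guarantee_years=MEDIA_GUARANTEE_YEARS):
    """Return the date by which the physical copy should be rewritten."""
    target_year = doc.written_on.year + guarantee_years
    try:
        return doc.written_on.replace(year=target_year)
    except ValueError:
        # 29 February with a non-leap target year: fall back to 28 February.
        return doc.written_on.replace(year=target_year, day=28)


def due_for_migration(docs, today):
    """List the documents whose media guarantee has expired by `today`."""
    return [doc for doc in docs if migration_deadline(doc) <= today]


docs = [
    ArchivedDocument("doc-a", date(1990, 1, 1), "SGML"),
    ArchivedDocument("doc-b", date(2000, 6, 1), "XML 1.0"),
]
overdue = due_for_migration(docs, date(2001, 1, 26))
print([doc.identifier for doc in overdue])
```

A real process would also track logical migrations (format changes) and record an audit trail of each run; this sketch covers only the physical deadline.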

In this context, XML can be used for different purposes:

  • XML is a format that meets the requirements defined by the MTIC -- it's an open recommendation, easy to transform, that should be easy to migrate.
  • XML allows content to be separated from presentation, and the two to be stored separately.
  • The guide recommends defining an XML envelope for the documents, which would contain the description of the document, its requirements for preservation, access control and the history of its migrations.
  • XML is a good candidate for describing the metadata associated with the document -- possibly as a part of its envelope. The MTIC will set up a specific working group for this issue.
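To make the envelope idea concrete, here is a minimal sketch of what such a wrapper might look like. The guide's actual envelope format is not described here, so every element and attribute name below (`envelope`, `description`, `preservation`, `access-control`, `migration-history`, `content`) is an assumption chosen for illustration only.

```python
import xml.etree.ElementTree as ET


def build_envelope(description, preservation, access, migrations, content):
    """Wrap a document in a hypothetical preservation envelope.

    `migrations` is a list of (ISO date, note) pairs; `content` is the
    root element of the document being preserved.
    """
    envelope = ET.Element("envelope")
    ET.SubElement(envelope, "description").text = description
    ET.SubElement(envelope, "preservation").text = preservation
    ET.SubElement(envelope, "access-control").text = access

    # One <migration> entry per physical or logical migration performed.
    history = ET.SubElement(envelope, "migration-history")
    for when, note in migrations:
        ET.SubElement(history, "migration", {"date": when}).text = note

    # The preserved document itself is carried unchanged inside <content>.
    ET.SubElement(envelope, "content").append(content)
    return envelope


contract = ET.fromstring("<contract><clause>Example clause</clause></contract>")
envelope = build_envelope(
    description="Sample contract",
    preservation="Keep for 30 years",
    access="restricted",
    migrations=[("2001-01-26", "Copied to new non-rewritable media")],
    content=contract,
)
print(ET.tostring(envelope, encoding="unicode"))
```

Keeping the metadata in sibling elements rather than inside the document itself means the envelope can be updated at each migration without touching the preserved content.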

The presentation by Alain Bensoussan focused on the legal issues, showing that presentation also carries semantic value that courts may need, and that documents should therefore be kept with all the "drivers" needed to display them.

The issue could be controversial, since the configurations used by the author of a document and by its readers usually differ. Litigation over a contract edited as (X)HTML with tools from supplier A, and displayed with missing text by a customer's browser B, would probably be difficult to judge.

Any webmaster knows that such things are just too easy to reproduce, and this example gives a new perspective on the legal implications of the lack of conformance to standards in tools.

Copies of the presentations and of the guide should be available online soon.

[1] Mission interministérielle de soutien technique pour le développement des technologies de l'information et de la communication dans l'administration

Re: XML for document preservation (Maxime Coulon - 10:29, 29 May 2003)

Dear Sir or Madam, I'm a student at the Institute for Information in Amsterdam. I'm working on a research project about XML for a printing company. Is it possible to store digital books in a logical XML structure? Thank you in advance.

Re: XML for document preservation (Eric van der Vlist - 10:45, 28 Jan 2001)

My headlines are not always free from the rubbish teasing of an author who'd like to attract more readers to his stories ;=) ...

In this specific case, though, the introduction was based on a table (which I hope they'll publish soon) where XML was in the left column, entitled "recommended formats", while SGML was in the right column, entitled "other formats", together with older or proprietary formats.

This shows that they are contrasting the two formats probably not on the basis of their history or technical affiliation, but on how easy they are to read, process and transform, which is key to preserving documents.

Since an XML document can be read not only by all the SGML tools but also by all the "XML only" tools that are flourishing, I think this aspect does need to be taken into account, and that going further down this road they could just as well recommend a "safe" subset of XML -- I don't know if they are willing to do so, though.

Re: XML for document preservation (Rick Jelliffe - 04:41, 27 Jan 2001)

Rubbish (qualified)! XML is a profile of ISO 8879 (WebSGML). An example of how to describe it is given by ISO 8879 Annex L. SGML was created, in part, to allow archiving. Any government that wants to mandate XML for archiving just needs to specify ISO 8879 (WebSGML) as the base specification with the appropriate SGML Declaration and Additional Requirements document (for which they can turn to James Clark's note).

However, full SGML is better than XML at modeling compound document sets: one can add attributes to entities for example.

No matter what kind of SGML is used, there will always be the need for additional requirements: which graphics formats can be used (and exactly which versions of the formats), which naming conventions, which compression, which stylesheets, which schemas, and which hyperlinking and locating mechanism. SGML/XML enables this kind of superstructure to be built.

But the qualification is this: if "XML" is being applied loosely to mean "everything being done at the W3C" then certainly Eric's point is fair--there are lots of layers (layers subsequent to parsing) that are not from international standards.

xmlhack: developer news from the XML community
