Presentation matters
12:24, 27 May 2002 UTC | Eric van der Vlist

For many text editing tools, only presentation matters, which makes it difficult for applications to understand the structure of documents. At XML Europe 2002, David Slocombe explained how, on the contrary, the Exegenix converter reads presentation as a universal convention from which the structure of documents can be extracted.

The ideas behind Exegenix's approach are simple: "just as we do not expect people to read XML-tagged documents directly, we should not expect them to write them either" and "if humans can understand the structure of a document by having a look at its presentation in two dimensions, why computers couldn't do it too".

Starting from there, the company has developed a converter which performs a two-dimensional analysis of PDF and PostScript documents to generate structured XML documents, without any assumption about the tools and formats used to produce these documents. During his presentation and demonstration, Slocombe explained that their tests show that the conventions used to present documents seem to be universal and stable, a property which is not shared by the tools and markup used to author them.

The main advantage of such an approach is that, for the converter too, only presentation matters: it will detect a "heading 1" title whatever trick has been used to produce it, unlike a markup-based conversion, which relies on the good will of the author. It may seem surprising, but the easiest way to convert your Word documents into DocBook might well be to start by printing them as PostScript files!
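
To make the idea concrete, here is a minimal sketch of inferring structure from presentation alone. It is not Exegenix's algorithm, which the talk did not detail; it only assumes that each line of a page has already been extracted with its font size, and that text set larger than the body text is a heading. The names `Line`, `body_font_size` and `classify` are hypothetical and chosen for this illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Line:
    """One line of extracted page text (hypothetical input format)."""
    text: str
    font_size: float  # in points


def body_font_size(lines):
    """Return the font size that carries the most characters.

    Assumption: the body text is whatever size dominates the page.
    """
    weights = Counter()
    for line in lines:
        weights[line.font_size] += len(line.text)
    return weights.most_common(1)[0][0]


def classify(lines):
    """Label each line as 'heading N' or 'paragraph' from font size only."""
    body = body_font_size(lines)
    # Every distinct size larger than the body size becomes a heading
    # level; the largest size is level 1.
    larger = sorted({l.font_size for l in lines if l.font_size > body},
                    reverse=True)
    levels = {size: rank + 1 for rank, size in enumerate(larger)}
    result = []
    for line in lines:
        if line.font_size in levels:
            label = f"heading {levels[line.font_size]}"
        else:
            label = "paragraph"
        result.append((label, line.text))
    return result
```

No style name or markup is consulted here: a title produced with a proper "Heading 1" style and one produced by manually enlarging the font look the same to this function, which is the point made above. A real converter would presumably weigh many more cues, such as position, spacing and typeface.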

xmlhack: developer news from the XML community