For many text editing tools, only presentation
matters, which makes it difficult for applications to understand the structure of documents. David Slocombe explained
at XML Europe 2002 how, on the contrary, presentation can be read by the Exegenix converter as a universal
convention for extracting structure from documents.
The ideas behind Exegenix's approach are simple: "just as we do not expect
people to read XML-tagged documents directly, we should not expect them to
write them either" and "if humans can understand the structure of a document
by having a look at its presentation in two dimensions, why computers
couldn't do it too".
Starting from there, the company has developed a converter that performs a
two-dimensional analysis of PDF and PostScript documents to generate
structured XML documents, without making any assumptions about the tools and
formats used to produce these documents. During his presentation and demonstration,
Slocombe explained that the company's tests show that the conventions used to present
documents seem to be universal and stable, a property not shared by
the tools and markup used to author them.
The main advantage of this approach is that, for the converter too, only
presentation matters: it will detect a "heading 1" title whatever
trick has been used to produce it, unlike a markup-based conversion, which
relies on the goodwill of the author. It may seem surprising, but the
easiest way to convert your Word documents into DocBook might well be to
start by printing them to PostScript files!
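
To make the idea concrete, here is a minimal Python sketch of structure
detection driven only by presentation. It is not Exegenix's algorithm, which
the talk summary does not describe in detail. The Line record, the classify
and to_xml functions, the XML element names, and the font-size heuristic are
all assumptions made for illustration. The sketch supposes that text lines
have already been extracted from the page layout together with their font
sizes, treats the most common size as body text, and ranks any larger size
as a heading level.

```python
from collections import Counter
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Line:
    """Hypothetical record for one laid-out line of text."""
    text: str
    font_size: float  # in points


def classify(lines):
    """Label each line 'heading N' or 'para' from its font size alone."""
    # Assumption: the font size covering the most characters is body text.
    weights = Counter()
    for line in lines:
        weights[line.font_size] += len(line.text)
    body_size = weights.most_common(1)[0][0]

    # Every size larger than body text becomes a heading level,
    # with the largest size mapped to level 1.
    heading_sizes = sorted(
        {line.font_size for line in lines if line.font_size > body_size},
        reverse=True,
    )
    level = {size: i + 1 for i, size in enumerate(heading_sizes)}

    labelled = []
    for line in lines:
        if line.font_size in level:
            labelled.append((f"heading {level[line.font_size]}", line.text))
        else:
            labelled.append(("para", line.text))
    return labelled


def to_xml(labelled):
    """Serialize labelled lines to a small, made-up XML vocabulary."""
    root = ET.Element("document")
    for label, text in labelled:
        if label == "para":
            element = ET.SubElement(root, "para")
        else:
            element = ET.SubElement(root, "title", level=label.split()[1])
        element.text = text
    return ET.tostring(root, encoding="unicode")


sample = [
    Line("Introduction", 18),
    Line("Presentation conventions are stable across authoring tools.", 10),
    Line("Details", 14),
    Line("The converter looks only at how the page is laid out.", 10),
]
print(to_xml(classify(sample)))
```

Because the heuristic looks only at the rendered size, a title typed as
manually enlarged body text is labelled the same way as one produced with a
proper heading style, which is the property the article highlights. A real
converter would need many more cues (position, spacing, weight, numbering).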