Eric Raymond has released version 1.0.0 of doclifter, a Python 2.2 utility
that converts man pages and other troff/nroff/groff documents to
DocBook XML and SGML.
Both a source tarball and RPM package are available. A NEWS file provides a record of
changes.
The man page for the utility has more details about its features:
- converts man, mandoc, ms, me, and TkMan page sources
- parses command and C function synopses and converts them into
  DocBook markup (using the cmdsynopsis, arg, replaceable, etc. elements)
- recognizes 'stereotyped' patterns of markup and content (such as the
  use of italics in a FILES section to mark filenames) and 'lifts' them
  into DocBook markup
- recognizes things such as URLs, email addresses, man page references,
  and C program listings, and lifts them into DocBook markup
- maintains a record of semantic 'hints' that it picks up from analyzing
  source documents (especially from parsing command and function
  synopses), and provides a means to edit, add to, and save that record
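As a rough illustration of what "lifting" URLs, email addresses, and man page references might look like, here is a minimal Python sketch. It is not taken from doclifter: the regular expressions, the `lift` function name, and the choice of the DocBook `ulink`, `email`, and `citerefentry` elements as targets are all assumptions made for the example.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical patterns, for illustration only; not doclifter's own.
# One combined regex is used so each stretch of text is matched once.
TOKEN = re.compile(
    r'(?P<url>(?:https?|ftp)://[^\s<>"]+)'
    r'|(?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)'
    r'|(?P<page>[A-Za-z_][\w.+-]*)\((?P<sect>[1-9][a-z]?)\)'
)

def lift(text):
    """Wrap recognized patterns in plain text with DocBook inline markup."""
    out = []
    pos = 0
    for m in TOKEN.finditer(text):
        # Escape the ordinary text that precedes this match.
        out.append(escape(text[pos:m.start()]))
        if m.group("url"):
            url = m.group("url")
            out.append("<ulink url=%s>%s</ulink>" % (quoteattr(url), escape(url)))
        elif m.group("email"):
            out.append("<email>%s</email>" % escape(m.group("email")))
        else:
            # A man page reference such as ls(1).
            out.append(
                "<citerefentry><refentrytitle>%s</refentrytitle>"
                "<manvolnum>%s</manvolnum></citerefentry>"
                % (escape(m.group("page")), m.group("sect"))
            )
        pos = m.end()
    out.append(escape(text[pos:]))
    return "".join(out)
```

Calling `lift("See ls(1).")` wraps the reference in a `citerefentry` element and leaves the rest of the sentence as escaped text. A real converter would also have to deal with cases this sketch ignores, such as punctuation that trails a URL.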
Raymond writes that "doclifter does not do a
perfect job, only about 90% of one; the last 10% has to
be applied by a human recognizing patterns too subtle for
a computer. But doclifter will almost always produce
translations that are good enough to be usable before
hand-hacking."
The NEWS file in the distribution says that doclifter
was tested on all 5548 man pages in a full Red Hat 7.3
workstation install, and that only 5 percent of the
converted files required any post-conversion manual
correction. A TODO file in the distribution provides a
list of man pages that it is currently not able to convert
perfectly, and, for each man page, lists the reason why it
fails.
(It seems that around 65 percent of the conversion failures are
due to markup errors in the 'roff source for the pages, 20
percent or so are due to parenthetical comments in synopses,
which DocBook synopses do not support, and the remaining
15 percent or so are due to current deficiencies in doclifter.)
Note that there is also a bug in the part of the
implementation that deals with ISO character entities:
in some XML instances, doclifter generates
internal DTD subsets containing entity declarations
that reference the SGML versions of the ISO character-entity sets instead of the XML versions.
A workaround is simply to delete the ISO character-entity declarations from the generated XML
documents. The declarations are redundant at best,
because both the DocBook XML and SGML DTDs already reference the appropriate sets.
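The workaround can be scripted. The following Python sketch assumes that the offending declarations are parameter-entity declarations whose public identifiers begin with "ISO 8879", each paired with a `%name;` reference in the internal subset. That shape is an assumption, not something confirmed from doclifter's output, so check a generated file before relying on it. The `strip_iso_entities` name is hypothetical.

```python
import re

# Internal DTD subset: everything between "[" and "]>" in the DOCTYPE.
SUBSET = re.compile(r"(<!DOCTYPE[^\[>]*\[)(.*?)(\]\s*>)", re.DOTALL)

# Assumed shape of an ISO entity-set declaration, e.g.
#   <!ENTITY % ISOlat1 PUBLIC "ISO 8879:1986//ENTITIES Added Latin 1//EN">
# optionally followed by a quoted system identifier.
ISO_DECL = re.compile(
    r'<!ENTITY\s+%\s+(\w+)\s+PUBLIC\s+"ISO 8879[^"]*"(?:\s+"[^"]*")?\s*>\s*'
)

def strip_iso_entities(xml_text):
    """Remove ISO entity-set declarations and their references
    from the internal DTD subset, leaving everything else intact."""
    def clean(match):
        head, subset, tail = match.groups()
        names = ISO_DECL.findall(subset)       # names of the declared sets
        subset = ISO_DECL.sub("", subset)      # drop the declarations
        for name in names:                     # drop the matching %name; references
            subset = re.sub("%" + re.escape(name) + r";\s*", "", subset)
        return head + subset + tail
    return SUBSET.sub(clean, xml_text, count=1)
```

Only the first DOCTYPE declaration is touched, and other declarations in the internal subset are preserved. If the subset ends up empty, the remaining `[ ]` is still well-formed XML.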