guinix international

Computing Without Borders

 

XpinDoc

An XML parse instance, document processing utility library.

Introduction

XpinDoc is a programmer's library in C providing a framework for processing XML documents. This XpinDoc implementation may be built "on top" of either James Clark's expat parser, or Richard Tobin's RXP parser. (The implementation is designed to be easily adapted to other parsers; further work in this area is coming soon.)

The XpinDoc library is suitable for many common XML processing applications, such as document translation and data extraction. The programmer's interface offers a familiar, event-driven, user-definable callback architecture. The purpose of this interface is to simplify the application development effort to three steps:

  1. define methods for specific events in the document
  2. install them in the XpinDoc object
  3. let XpinDoc rip

The idea is to allow the developer to focus on the "logic" of the document itself. Compared to building an application up from scratch with a generic API such as SAX/SAX2, the XpinDoc library provides some advantages:

  • built-in dispatch mechanism to user-designated handlers
  • node container objects, with all element, attribute, namespace, and content data readily available
  • accessible hierarchical list of parent element nodes, from current element to document root
  • installable chardata input filter
  • flexible capture or output modes; even in output mode, application can capture chardata within the "text" object provided by the current node
  • stack-based output "writer" object: direct chardata to output, file, pipe, text object, or null
  • "tagpath" element node descriptor, using fnmatch()-style wildcard and comparison methods
  • supports alternative/customized "xpin_reader" interfaces; source input need not even be XML
  • etc...
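
To make the "chardata input filter" item above more concrete, here is a minimal stand-alone sketch of the kind of transformation such a filter might apply before text reaches HTML output. The function name escape_html and its buffer-based signature are assumptions for illustration only; they are not part of the XpinDoc API, which hands the filter an XpinDoc object and writes through its put methods.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of XpinDoc): copy len bytes of character
 * data into out, replacing '<', '>' and '&' with HTML entity references.
 * Returns the number of bytes written, not counting the terminating nul,
 * or -1 if out is too small to hold the result.
 */
int
escape_html(const char *data, size_t len, char *out, size_t outsize)
{
    size_t used = 0;
    size_t i;

    if (outsize == 0)
        return -1;

    for (i = 0; i < len; ++i) {
        const char *rep = &data[i];   /* default: copy the byte unchanged */
        size_t      replen = 1;

        if (data[i] == '<')      { rep = "&lt;";  replen = 4; }
        else if (data[i] == '>') { rep = "&gt;";  replen = 4; }
        else if (data[i] == '&') { rep = "&amp;"; replen = 5; }

        /* keep one byte in reserve for the terminating nul */
        if (used + replen + 1 > outsize)
            return -1;
        memcpy(out + used, rep, replen);
        used += replen;
    }
    out[used] = '\0';
    return (int)used;
}
```

An XpinDoc filter would perform the same substitutions but emit each piece through the XpinDoc object instead of a caller-supplied buffer.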

Using expat as the default XML parser, XpinDoc is designed primarily for processing standalone documents. However, documents with external entities and default attribute definitions are easily pre-parsed with a supplemental external parser, such as xmllint from Daniel Veillard's libxml, rxp from Richard Tobin's RXP, or osx from the OpenSP/OpenJade project (originally the SP/nsgmls parser by James Clark). Such external parsers may also be useful to validate the document against an external DTD--or other validation scheme--as the document is authored. Final processing of the document may then be passed to the XpinDoc application. As an example, consider the following command pipe where xpinman is a XpinDoc application:

osx xpindoc_manual.xml | xpinman | tidy > xpindoc_manual.html

Here osx is used to validate the XML document and resolve external entities, xpinman performs the translation, and tidy further cleans up the HTML output.

Strengths:

  • for those who prefer C
  • lightweight, portable, fast
  • namespace support
  • ability to "lock-in" XML processing applications and distribute standalone binaries
  • may be built with alternative parsers
  • tested on FreeBSD, OpenBSD and Linux platforms

Limitations:

  • present implementation does not support "wide" characters
  • build/install engineering needs work
  • this documentation is pathetic

Synopsis

#include <xpindoc.h>

void
my_html_filter(XpinDoc X, const char *data, int len)
{
    char       *c = (char *)data;

    while(len){
        if(*c == '<') X->put_string(X, "&lt;");
        else if(*c == '>') X->put_string(X, "&gt;");
        else if(*c == '&') X->put_string(X, "&amp;");
        else X->put_char(X, *c);
        ++c;
        --len;
    }

    return;
}


void
my_startdoc(XpinDoc xpin)
{
   xpin->put_string(xpin, "\n");
   // ...
}


void
my_title(XpinDoc xpin)
{
    int       event_type = xpin->event_type(xpin);

    if(event_type == XPIN_START_ELEMENT){
        xpin->put_string(xpin, "");
    }

    if(event_type == XPIN_END_ELEMENT){
        xpin->put_string(xpin, "\n");
    }
}

/* ... */

void
my_enddoc(XpinDoc xpin)
{
    xpin->put_string(xpin, "");
}


int
main(int argc, char **argv)
{
  XpinDoc  xpin = new_xpindoc(XPIN_DEFAULT);

  if(xpin == NULL)
      my_die("error creating xpindoc");

  /* set default mode to output: */
  xpin->set_datamode(xpin, XPIN_OUTPUT);

  /* filter incoming chardata content for html output: */
  xpin->set_filter(xpin, &my_html_filter);

  /* set the event handlers: */
  xpin->set_handler(xpin, "Start_Document", &my_startdoc);
  xpin->set_handler(xpin, "<title/>", &my_title);
  xpin->set_handler(xpin, "/*/para/emph", &my_bolder);
  xpin->set_handler(xpin, "<?html-css?>", &my_css);
  xpin->set_handler(xpin, "End_Document", &my_enddoc);
  // ...

  /* run the parse: */
  xpin->parse_stream(xpin, stdin);

  /* clean up: */
  xpin->free(xpin);
  return 0;
}
</pre>
</div><p>The above application snippet, saved in the file
<tt>testxpin.c</tt>, may be compiled and linked with the XpinDoc
library as follows:</p>
<div class="commandline"><p><tt>gcc -Wall -O2 -o testxpin testxpin.c -lxpindoc</tt></p></div><hr class="section"></hr><h2 class="section">Installation</h2>
<p>Compile the code and install the library. You are now ready to
write XpinDoc applications.</p>
<p>In the current release, the Makefile targets include:</p>
<ul><li>"make library", to build a static library</li>
<li>"make xpinman", to build the XpinDoc application for processing
the documentation</li>
<li>"make manual", to translate the manual into HTML format</li></ul><p>Note: make targets for building a shared library, or for
installing the library, documentation, etc., are not provided in
this release. These steps may be easily performed "by hand"
according to one's platform and system preferences.</p>
<p>Note also that the current release requires at least one of the
supported XML parsers, expat or RXP, which are available
separately. These libraries should be built before building
XpinDoc. The default Makefile with XpinDoc supports the expat
library; see Makefile.rxp and README.rxp for working with the RXP
parser.</p>
<hr class="section"></hr><h2 class="section">Getting Started</h2>
<p>A XpinDoc application is accessed and controlled through a
top-level <tt><b>XpinDoc</b></tt> object:</p>
<div class="codeblock">
<pre>
#include <xpindoc.h>

int
my_main()
{
    int      flags = XPIN_NAMESPACE | XPIN_QNAME;

    XpinDoc  X = new_xpindoc(flags);

    // ...
}
</pre>
</div><p>The flags argument may be used to control the features of the
parser object. Flags may be combined (OR'd) as shown. The flags
currently implemented include:</p>
<dl><dt><b>XPIN_DEFAULT</b></dt>
<dd>No flags are set.</dd>
<dt><b>XPIN_NAMESPACE</b></dt>
<dd>Enables namespace-aware parsing.</dd>
<dt><b>XPIN_QNAME</b></dt>
<dd>If namespace processing is enabled, the qualified element name
(prefix:local_name) is used for the tagname and tagpath expressions
of elements in a foreign namespace; otherwise, the local_name is
used.</dd></dl><p>A XpinDoc parse may be configured to operate in one of two
modes, through the set_datamode() method:</p>
<div class="codeblock">
<pre>
int
my_main()
{
    int mode = XPIN_OUTPUT;

    // ... 

    X->set_datamode(X, mode);

    // ...
}
</pre>
</div><p>The mode argument may take one of the following values:</p>
<dl><dt><b>XPIN_CAPTURE</b></dt>
<dd>Content is "captured" to the Text object of the current node
(default)</dd>
<dt><b>XPIN_OUTPUT</b></dt>
<dd>Content is output through the current Writer object</dd></dl><p>The set_mydata() method may be used to pass any arbitrary
supplementary data to event handlers, where it may be retrieved by
the mydata() method:</p>
<div class="codeblock">
<pre>
void
my_handler(XpinDoc X)
{
    //...

    struct mydata *mydata = (struct mydata *)X->mydata(X);

    //...
}


int
my_main()
{
    struct mydata *mydata = NULL;
    // ... 

    X->set_mydata(X, mydata);

    // ...
}
</pre>
</div><p>The heart of a XpinDoc application is the set_handler()
method:</p>
<div class="codeblock">
<pre>
void
my_handler(XpinDoc X)
{
    //...
}


int
my_main()
{
    // ... 

    err = X->set_handler(X, keystr, &my_handler);

    // ...
}
</pre>
</div><p>Where the keystr argument is a nul-terminated constant character
string taking one of the following forms:</p>
<dl><dt><b>"<tag>"</b></dt>
<dd>handler is installed for start element events with tagname
matching tag</dd>
<dt><b>"</tag>"</b></dt>
<dd>handler is installed for end element events with tagname
matching tag</dd>
<dt><b>"<tag/>"</b></dt>
<dd>handler is installed for start and end element events with
tagname matching tag</dd>
<dt><b>"/tagpath"</b></dt>
<dd>handler is installed for start and end element events with a
tagpath matching "/tagpath"</dd>
<dt><b>"<?pi?>"</b></dt>
<dd>handler is installed for processing instruction event with
target matching pi</dd></dl><p>Additionally, several default handlers may be installed by
specifying keystr as one of the following exact (case-insensitive)
strings:</p>
<dl><dt><b>"Start_Document"</b></dt>
<dd>handler is installed for the start of the document</dd>
<dt><b>"End_Document"</b></dt>
<dd>handler is installed for the end of the document</dd>
<dt><b>"Start_Element"</b></dt>
<dd>handler is installed as the default handler for start element
events</dd>
<dt><b>"End_Element"</b></dt>
<dd>handler is installed as the default handler for end element
events</dd>
<dt><b>"NS_Start_Element"</b></dt>
<dd>handler is installed as the default handler for start element
events for elements having a non-null namespace URI (requires
namespace processing enabled)</dd>
<dt><b>"NS_End_Element"</b></dt>
<dd>handler is installed as the default handler for end element
events for elements having a non-null namespace URI (requires
namespace processing enabled)</dd>
<dt><b>"Processing_Instruction"</b></dt>
<dd>handler is installed as the default handler for processing
instruction events</dd>
<dt><b>"Start_CDATA"</b></dt>
<dd>handler is installed for the start of CDATA blocks</dd>
<dt><b>"End_CDATA"</b></dt>
<dd>handler is installed for the end of CDATA blocks</dd>
<dt><b>"ERROR"</b></dt>
<dd>handler is installed for error events raised by XpinDoc</dd></dl><hr class="section"></hr><h2 class="section">Handler Dispatch Logic</h2>
<p>To develop a XpinDoc application, it is necessary to understand
the simple dispatch logic used in calling the installed handlers.
By way of illustration, the following (ugh!) ASCII chart sketches
the flow of control that XpinDoc uses for processing a start
element event:</p>
<div class="codeblock">
<pre>
_
    Test         Description                   Action
    ---------    --------------------------    --------------------

1.  NAMESPACE    is node namespace non-null
                 and handler installed ?       yes --> call handler --+
                                                                      |
                      no                                              |
                                                                      V
                      |
                      |
                      V

2.  "/tagpath"   is handler installed for
                 node matching tagpath
                 expression ?
                 (LIST search)                 yes --> call handler --+
                                                                      |
                      no                                              |
                                                                      V
                      |
                      |
                      V

3.  "<tag>"      is handler installed for
                 node matching tag ?
                 (HASH search)                 yes --> call handler --+
                                                                      |
                      no                                              |
                                                                      V
                      |
                      |
                      V

4.  default      is a default handler
                 installed ?                   yes --> call handler --+
                                                                      |
                      no                                              |
                                                                      |
                      |                                               |
                      |                                               V
                      |
                      +-->  (do nothing)   -->            continue parse
</pre>
</div><p>A brief explanation and rationale for the dispatch logic:</p>
<p>At most one handler will be called for each element event.</p>
<p>(1.) If namespace-aware parsing is on, and if a
"NS_Start_Element" handler is defined, and if the current element
has a non-null namespace URI, the defined handler will be called
for the element. That is, as described in the following XpinDoc
snippet:</p>
<div class="codeblock">
<pre>
Xpin_Node  N = X->node(X);

  if((X->ns_parser(X) != 0) && (N->ns_uri(N) != NULL)){
     //...
</pre>
</div><p>This allows an application, if it chooses, to divert all
elements that do not belong to the native namespace to a single
special handler.</p>
<p>(2.) If a handler is defined for a tagpath expression matching
the current element, this handler will be called for the element.
Tagpath expressions are able to specify an element's position and
relationship to other elements in a document more specifically than
the tagname. This allows fine-grained control to be applied before
more general control.</p>
<p>As an example, consider a handler installed for the tagpath
expression "/*/emph/emph", and another installed for the tagname
"<emph>". The tagpath handler will be called for nested
"<emph>" elements, while the tagname handler will catch other
instances.</p>
<p>Note that tagpath handlers are installed in a list object, and
items are tested in the same order as they are inserted. The first
matching handler will be used for the element. This means that the
application should install more specific tagpath handlers before
less specific handlers. That is, a handler for "/*/item/list"
should be installed before "/*/list".</p>
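<p>The effect of installation order can be sketched without the
library itself. The following is a minimal illustration, assuming
that tagpath expressions are compared with the POSIX fnmatch()
function and no flags (so that "*" may span "/" separators) and
that a tagpath looks like "/doc/item/list". The struct and function
names are hypothetical stand-ins, not XpinDoc internals:</p>

```c
#include <fnmatch.h>
#include <stddef.h>

/* Hypothetical stand-in for an entry in the tagpath handler list. */
struct tagpath_entry {
    const char *pattern;   /* fnmatch()-style tagpath expression */
    const char *handler;   /* handler name, for illustration only */
};

/*
 * Return the handler name of the first entry whose pattern matches path,
 * scanning in insertion order, or NULL if no entry matches.
 * Assumption: fnmatch() is called with no flags.
 */
const char *
first_tagpath_match(const struct tagpath_entry *list, size_t n,
                    const char *path)
{
    size_t i;

    for (i = 0; i < n; ++i) {
        if (fnmatch(list[i].pattern, path, 0) == 0)
            return list[i].handler;
    }
    return NULL;
}

/* Specific pattern installed first. */
const struct tagpath_entry good_order[] = {
    { "/*/item/list", "nested_list" },
    { "/*/list",      "any_list"    },
};

/* General pattern installed first: it shadows the specific one. */
const struct tagpath_entry bad_order[] = {
    { "/*/list",      "any_list"    },
    { "/*/item/list", "nested_list" },
};
```

<p>With good_order, the path "/doc/item/list" selects "nested_list"
and "/doc/list" selects "any_list". With bad_order, both paths
select "any_list", because the general pattern is tested first and
also matches the nested case.</p>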
<p>Note also that if namespace processing is enabled with usage
of qualified names (XPIN_NAMESPACE | XPIN_QNAME), the tagpath
expression for elements in a foreign namespace will include the
namespace prefix. The application may install handlers for elements
in a foreign namespace by specifying the prefix in the tagpath
expression, such as "/*/book:para" or "/*/groff:tbl", etc. All
elements with a particular namespace prefix may be handled by using
a wildcard tagpath expression such as "/*/db:*".</p>
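<p>How the two flags determine the element name that appears in a
tagpath or tagname can be pictured roughly as follows. This is only
a sketch: the SKETCH_* constants and the effective_name() function
are hypothetical, and the real flag values are whatever xpindoc.h
defines.</p>

```c
#include <stdio.h>

/* Hypothetical flag values standing in for XPIN_NAMESPACE and XPIN_QNAME. */
enum {
    SKETCH_NAMESPACE = 1 << 0,
    SKETCH_QNAME     = 1 << 1
};

/*
 * Write the element name used for handler matching into buf.
 * With both flags set and a non-NULL prefix (an element in a foreign
 * namespace), the result is "prefix:local_name"; otherwise it is just
 * "local_name". Returns buf, or NULL if buf is too small.
 */
const char *
effective_name(int flags, const char *prefix, const char *local_name,
               char *buf, size_t bufsize)
{
    int want_qname = (flags & SKETCH_NAMESPACE) && (flags & SKETCH_QNAME)
                     && prefix != NULL;
    int n;

    if (want_qname)
        n = snprintf(buf, bufsize, "%s:%s", prefix, local_name);
    else
        n = snprintf(buf, bufsize, "%s", local_name);

    /* snprintf reports the length it wanted; reject truncated results */
    return (n >= 0 && (size_t)n < bufsize) ? buf : NULL;
}
```

<p>Under these assumptions, a "para" element with prefix "book"
yields "book:para" when both flags are set, and plain "para" when
only namespace processing is enabled.</p>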
<p>A handler installed with a tagpath expression of "/*" will act
as a default handler for any elements not previously matched. Note
that such a handler would effectively prevent any tagname handler
from being called.</p>
<p>(3.) If a handler is defined for the element's tagname, this
handler will be called for the element. Tagname handlers are
installed in hash objects, so handlers may be installed in any
order.</p>
<p>Note also that if namespace processing is enabled with usage
of qualified names (XPIN_NAMESPACE | XPIN_QNAME), the tagname for
elements in a foreign namespace will include the namespace prefix.
The application may then install handlers for elements in a foreign
namespace by specifying the prefix in the tagname, such as
"<html:table>", "<poem:verse>", etc.</p>
<p>(4.) Finally, if no handler for the element has yet been found,
and a default handler has been installed for "Start_Element", then
this handler will be used for the element. Otherwise the
application will not call any handler for the element, and parsing
will continue to the next event.</p>
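<p>The four steps above can be condensed into a short stand-alone
sketch. It is not XpinDoc's implementation: the registry struct,
the function names, and the use of fnmatch() with no flags are all
assumptions made for illustration, and the tagname lookup is a
linear search here where XpinDoc is described as using a hash.</p>

```c
#include <fnmatch.h>
#include <stddef.h>
#include <string.h>

typedef void (*handler_fn)(void);

struct keyed_handler {
    const char *key;            /* tagpath pattern or tagname */
    handler_fn  fn;
};

/* Hypothetical registry of installed start element handlers. */
struct registry {
    handler_fn                  ns_start;       /* "NS_Start_Element" */
    const struct keyed_handler *tagpaths;       /* tested in insertion order */
    size_t                      n_tagpaths;
    const struct keyed_handler *tagnames;       /* a hash in XpinDoc; linear here */
    size_t                      n_tagnames;
    handler_fn                  default_start;  /* "Start_Element" */
};

/* Select at most one handler for a start element event; NULL means none. */
handler_fn
select_start_handler(const struct registry *r, int ns_parser,
                     const char *ns_uri, const char *tagpath,
                     const char *tagname)
{
    size_t i;

    /* 1. namespace handler, for elements with a non-null namespace URI */
    if (ns_parser && ns_uri != NULL && r->ns_start != NULL)
        return r->ns_start;

    /* 2. first matching tagpath expression */
    for (i = 0; i < r->n_tagpaths; ++i)
        if (fnmatch(r->tagpaths[i].key, tagpath, 0) == 0)
            return r->tagpaths[i].fn;

    /* 3. exact tagname */
    for (i = 0; i < r->n_tagnames; ++i)
        if (strcmp(r->tagnames[i].key, tagname) == 0)
            return r->tagnames[i].fn;

    /* 4. default handler, which may itself be NULL */
    return r->default_start;
}

/* Sample handlers and registry, following the emph example above. */
void on_foreign(void) {}
void on_nested_emph(void) {}
void on_emph(void) {}
void on_any(void) {}

const struct keyed_handler sample_tagpaths[] = {
    { "/*/emph/emph", on_nested_emph }
};
const struct keyed_handler sample_tagnames[] = {
    { "emph", on_emph }
};
const struct registry sample = {
    on_foreign, sample_tagpaths, 1, sample_tagnames, 1, on_any
};
```

<p>With this sample registry, a nested emph element at
"/doc/para/emph/emph" selects on_nested_emph, a plain
"/doc/para/emph" falls through to on_emph, and an unrelated
"/doc/para" reaches on_any.</p>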
<hr class="section"></hr><h2 class="section">XpinDoc Objects</h2>
<p>A XpinDoc object provides:</p>
<ul><li>a parser for the document instance</li>
<li>a registry for callback handlers</li>
<li>access to objects generated during the parse</li></ul><p>During the course of a parse, a XpinDoc application may access
one or more of the following objects:</p>
<dl><dt><tt>Xpin_Event</tt></dt>
<dd>An XML event and its associated data, such as:
<ul><li>start element</li>
<li>end element</li>
<li>processing instruction</li>
<li>etc.</li></ul></dd><dt><tt>Xpin_Node</tt></dt>
<dd>Container object with access methods to the character data
within an XML element, also providing:
<ul><li>tagname, and namespace information</li>
<li>XpinDoc "tagpath"</li>
<li>element attributes</li>
<li>element contents</li></ul></dd><dt><tt>Xpin_PI</tt></dt>
<dd>An XML processing instruction.</dd></dl><p>Each of these objects is described in its own section below.</p>
<p>To be continued...</p>
<hr class="section"></hr><h2 class="section">History</h2>
<p>XpinDoc isn't particularly innovative or ground-breaking.
Historically, XpinDoc follows from an earlier SGML utility of mine
called "SpinDoc" implemented in Python. This, in turn, was
influenced primarily by David Megginson's SGMLS.pm library in Perl.
(Megginson's work, of course, went on to be highly influential in
the development of SAX.) XpinDoc has also been influenced by
instant/transpec, Cost, and other SGML/XML tools.</p>
<hr class="section"></hr><h2 class="section">Conclusions</h2>
<p>Please see the
<a href="../../software/xpindoc.html">source distribution</a>
for additional documentation
and sample XpinDoc applications.</p>
<!-- ======================= -->
<!-- ##### END CONTENT ##### -->
<!-- ======================= -->
</td>
</tr>
<!-- copyright notice: -->
<tr>
<td>
<hr />
<small>Copyright © 2002 - 2005, Wayne Marshall.
  All rights reserved.<br />
  Last edit 2005.03.07, wcm.
</small>
</td>
</tr>
</table>

</tr>
</table>

</body>
</html>