Publishing

22 September 2011

Document Management

Norman Walsh

MarkLogic Corporation

Agenda

Introduction

Learning Objectives

An understanding of the challenges and opportunities afforded by the promise of reusable XML documents.

Modern publishing environments demand reuse and repurposing of content to maximize its value.

  • What do we mean by reuse and repurpose?

  • Non-technical challenges

  • Technical challenges

Some reuse scenarios

Reuse is using the same content in different documents.

  1. Write two documents that share several common figures

  2. Write two books that share several chapters

  3. Write two help sets that share several topics

  4. Write two web pages that share the same boilerplate (copyrights, legal notices, etc.)

Some repurposing scenarios

Repurposing is presenting the same content in different media.

  1. Publish a document on US Letter and A4 paper

  2. Publish a document in print and on the web

  3. Publish a document in print, on the web, and as an EPUB

  4. Publish a document in print and as an “app”

  5. Publish a document as an iPhone app and an Android app

Some reuse involves repurposing, some repurposing involves reuse. These words don't have a strict, technical meaning.

Non-technical challenges

Understanding context

  • Most writing happens in a particular context.

  • That context is based on expectations about the reader:

    • If you're reading chapter 5, you've read chapters 1-4

    • If you're reading chapter 5, it's preceded by chapters 1-4

    • If you're reading the Unix guide, you're on a Unix system

More context

The author may have other context in mind

  • The document is printed on paper

  • Figures are always on the right-hand side of a spread

  • Procedures never break across page boundaries

  • The document is printed in black-and-white

etc.

Context impacts reuse

  • Reuse and repurposing place content in new contexts

  • In the worst case, they place it in contexts that are incompatible with the one in which it was written:

    • “In the preceding chapter, we…”

    • “As the figure on the right shows, …”

  • To avoid the worst case, reuse is limited by context

Interlude

Do these notions of context seem coherent? Are there other notions of context (on the document management side, as distinct from the delivery side) that have been overlooked?

Discuss.

Authors are revolting

  • Maximizing reuse requires learning to write differently

  • Sometimes it requires using new tools

  • Sometimes it breaks established boundaries of authorship

    • Writing books becomes writing topics

  • Sometimes it breaks established boundaries of control

    • Presentation and formatting are often removed from the author's control

It may be challenging to convince authors that the necessary changes have benefits that justify the costs.

Solutions to the non-technical challenges

Solving these problems is highly dependent on the particular circumstances.

  • Edicts from above?

  • Chocolate?

To the largest extent possible, make sure everyone who will be affected by a project is involved in the planning and development stages, so that everyone is committed to the goals.

Reuse in XML documents

The mechanics of reuse

  • Store the components you want to reuse in separate “files”

  • Write the “main” document so that it references those components

  • Resolve those references and process the resulting document

In the discussion that follows, we'll mostly be talking about a single composite document. If you can build one, you can build more than one with the same techniques.

Reusing graphics

Graphics, and other non-XML resources, are the easy case:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>...</title>
</head>
<body>
...
<img src="somegraphic.png" alt="Some graphic" />
...
</body>
</html>

Or, in DocBook:

<mediaobject>
  <alt>Some graphic</alt>
  <imageobject>
    <imagedata fileref="somegraphic.png"/>
  </imageobject>
</mediaobject>

Why are graphics easy?

  • Graphics aren't properly part of the “XML content” of the document

    [Figure: document tree. html contains head and body; head contains title; body contains ..., img, ...; img points to the external file somegraphic.png]
  • Most XML processes don't care

  • Some processes (XML to PDF) will care

Reusing XML

In the XML case, “resolving references” is a transformation:

[Figure: before resolution. doc contains a chapter and an include; the include references file.xml, a chapter with two para children]
[Figure: after resolution. doc contains two chapter elements; the second holds the two para children]

XML reuse techniques

There are roughly three ways to reuse XML:

  • XML entities

  • XInclude

  • Construction from stand-off markup

    • DITA maps

    • DocBook assemblies

Or some proprietary mechanism, likely similar to one of those.

Aside: How XML tools work

Octets (bits on disk) are interpreted as characters (based on some media type), those characters are parsed to produce some sort of a data model, and most XML tools work on that data model.

[Figure: processing pipeline. octets -> characters -> parse -> infoset; the infoset feeds xslt, xinclude (followed by xslt), or xquery]
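
The chain described above can be sketched in a few lines of Python. This is only an illustration: it assumes UTF-8 input, uses the standard library's xml.etree.ElementTree as a simplified stand-in for the infoset, and the sample document is made up.

```python
import xml.etree.ElementTree as ET

# Octets: the bytes as they sit on disk (UTF-8 is assumed here).
octets = b"<doc><title>Caf\xc3\xa9</title><para>paragraph</para></doc>"

# Characters: the octets decoded according to the assumed encoding.
characters = octets.decode("utf-8")

# Parse: the characters become a tree, the data model the tools see.
tree = ET.fromstring(characters)

# Later steps work on the tree, not on the original octets or characters.
title = tree.findtext("title")
para_count = len(tree.findall("para"))
print(title, para_count)
```

Anything resolved by the parser (entities, for example) is already gone by the time a tool sees the tree.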

Reuse with XML entities

XML entities are resolved by the parser. They operate at a much lower level than other techniques:

<!DOCTYPE doc [
<!ENTITY chap2 SYSTEM "chap2body.xml">
]>
<doc>
<chapter>First chapter...</chapter>
<chapter>
&chap2;
</chapter>
</doc>

Where chap2body.xml, an external parsed entity (extParsedEnt in the XML grammar), contains:

<para>paragraph</para>
<para>paragraph</para>

XML entities after parsing

After parsing, this is the document other tools see:

<doc>
<chapter>First chapter...</chapter>
<chapter>
<para>paragraph</para>
<para>paragraph</para>
</chapter>
</doc>

XML entities

  • Work with almost any parser

  • Are invisible to most XML processes

  • Are a kind of textual substitution

  • Require a doctype declaration and processors which read “external markup declarations”.

  • Apply validation to the entire, expanded document if validation is applied

XML entity constraints

  • The document you start with must have a literal root element

  • Included documents cannot have their own doctype declarations

  • Expansion must succeed; errors are fatal

  • The entity you include can have multiple root nodes

  • Can only include whole files

Reuse with XInclude

XInclude processing takes place after parsing.

<doc xmlns:xi="http://www.w3.org/2001/XInclude">
<chapter>First chapter...</chapter>
<xi:include href="chap2.xml"/>
</doc>

Where chap2.xml contains:

<chapter>
<para>paragraph</para>
<para>paragraph</para>
</chapter>

After XInclude

After applying XInclude processing, this is the document other tools see:

<doc xmlns:xi="http://www.w3.org/2001/XInclude">
<chapter>First chapter...</chapter>
<chapter xml:base="chap2.xml">
<para>paragraph</para>
<para>paragraph</para>
</chapter>
</doc>
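
A minimal sketch of the same step in Python, using the standard library's xml.etree.ElementInclude. The in-memory FILES table and the loader function are assumptions made so the example runs without touching disk; the output may differ from the slide in details such as xml:base.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical in-memory "files" so the sketch runs without touching disk.
FILES = {
    "chap2.xml": "<chapter><para>paragraph</para><para>paragraph</para></chapter>",
}

def loader(href, parse, encoding=None):
    """Return the parsed element (parse="xml") or the raw text (parse="text")."""
    source = FILES[href]
    return ET.fromstring(source) if parse == "xml" else source

doc = ET.fromstring(
    '<doc xmlns:xi="http://www.w3.org/2001/XInclude">'
    "<chapter>First chapter...</chapter>"
    '<xi:include href="chap2.xml"/>'
    "</doc>"
)

# The transformation step: xi:include elements are replaced in place.
ElementInclude.include(doc, loader=loader)

child_tags = [child.tag for child in doc]
para_count = len(doc.findall("chapter/para"))
print(child_tags, para_count)
```

Before the call there is a pre-XIncluded tree; after it, the same object is the post-XIncluded tree.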

XInclude

  • Requires an XInclude processor (or appropriate configuration option)

  • Is logically a transformation like any other. There's a pre-XIncluded document and a post-XIncluded document.

  • Operates on two or more distinct, separate documents (well, usually)

  • May apply validation to either the individual documents, or the composite document, or both.

    • N.B. DTD validation cannot practically be applied to the composite document

  • Can address subsections of a file via XPointer

  • Is recursive: all or nothing

XInclude constraints

  • Both the including and the included document must be well-formed

  • XInclude is agnostic to the presence or absence of doctype declarations

XInclude Fallback

Fallback can be used to recover from resource errors:

<doc xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:include href="http://example.com/tiger.svg">
  <xi:fallback>
    <xi:include href="kitten.svg">
      <xi:fallback></xi:fallback>
    </xi:include>
  </xi:fallback>
</xi:include>
</doc>
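
The fallback logic can be sketched by hand in Python. This is a simplified illustration, not a conforming XInclude processor: it ignores text and tail content, does not recurse into loaded documents, and the AVAILABLE table and load function are hypothetical stand-ins for real fetching.

```python
import xml.etree.ElementTree as ET

XI = "{http://www.w3.org/2001/XInclude}"

# Hypothetical resources: only the local kitten is available.
AVAILABLE = {"kitten.svg": '<svg id="kitten"/>'}

def load(href):
    """Load a resource or raise OSError, standing in for a failed fetch."""
    if href not in AVAILABLE:
        raise OSError("cannot load " + href)
    return ET.fromstring(AVAILABLE[href])

def resolve(parent):
    """Replace xi:include children of parent, using xi:fallback on failure."""
    index = 0
    while index < len(parent):
        child = parent[index]
        if child.tag != XI + "include":
            resolve(child)
            index += 1
            continue
        try:
            replacement = [load(child.get("href"))]
        except OSError:
            fallback = child.find(XI + "fallback")
            if fallback is None:
                raise  # no fallback: the resource error is fatal
            resolve(fallback)  # a fallback may itself contain includes
            replacement = list(fallback)
        parent.remove(child)
        for offset, node in enumerate(replacement):
            parent.insert(index + offset, node)
        index += len(replacement)

doc = ET.fromstring(
    '<doc xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="http://example.com/tiger.svg"><xi:fallback>'
    '<xi:include href="kitten.svg"><xi:fallback/></xi:include>'
    "</xi:fallback></xi:include></doc>"
)
resolve(doc)
result_ids = [child.get("id") for child in doc]
print(result_ids)
```

An empty fallback, like the innermost one here, simply makes the include disappear when its resource is missing.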

XIncluding plain text

One common use of XInclude (in software documentation anyway) is to include examples. Sometimes that means you want the text of the document, not its XML essence.

<programlisting language="xml">
  <xi:include href="chap2body.xml" parse="text"/>
</programlisting>
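
A small Python sketch of the difference, again with the standard library's xml.etree.ElementInclude and a made-up in-memory loader. With parse="text" the included markup arrives as characters rather than elements.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical file content, served from memory.
FILES = {"chap2body.xml": "<para>paragraph</para>\n<para>paragraph</para>\n"}

def loader(href, parse, encoding=None):
    """Return the parsed element (parse="xml") or the raw text (parse="text")."""
    source = FILES[href]
    return ET.fromstring(source) if parse == "xml" else source

listing = ET.fromstring(
    '<programlisting language="xml" '
    'xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="chap2body.xml" parse="text"/>'
    "</programlisting>"
)
ElementInclude.include(listing, loader=loader)

# The markup is now text content; serializing the tree escapes it.
serialized = ET.tostring(listing, encoding="unicode")
print(len(listing), repr(listing.text))
print(serialized)
```

Note that chap2body.xml is not a well-formed document on its own (it has two top-level elements), which is fine here because it is never parsed as XML.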

XInclude challenge

Suppose we wanted to accurately reproduce the original entities example:

<para>paragraph</para>
<para>paragraph</para>

Where chap2.xml actually contains:

<chapter>
<para>paragraph</para>
<para>paragraph</para>
</chapter>

XInclude + XPointer

XPointer lets you reach into a document.

<doc xmlns:xi="http://www.w3.org/2001/XInclude">
<chapter>First chapter...</chapter>
<chapter>
  <xi:include href="chap2.xml"
              xpointer="xpath(/*/*)"/>
</chapter>
</doc>

After XInclude

After applying XInclude processing in this case, other tools see:

<doc xmlns:xi="http://www.w3.org/2001/XInclude">
<chapter>First chapter...</chapter>
<chapter>
  <para xml:base="chap2.xml">paragraph</para>
  <para xml:base="chap2.xml">paragraph</para>
</chapter>
</doc>

XPointer schemes

Standard schemes:

  • #foo or id(foo), the element with the ID “foo”

  • element(/1/2), the second child of the root element

  • element(foo/2/3), the third child of the second child of the element with the ID “foo”

  • xmlns(db=http://docbook.org/ns/docbook), defines a namespace for a subsequent expression

A registry of extension schemes is maintained at http://www.w3.org/2005/04/xpointer-schemes/.

  • There are a bunch...but support is on a per-implementation basis

  • Of them, xpath is probably the most widely supported
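
To make the element() child sequences concrete, here is a rough Python sketch of how such a pointer could be resolved. It assumes IDs are carried by xml:id attributes; the function name and sample document are made up, and real processors do more (DTD-declared IDs, error reporting).

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def resolve_element_pointer(root, pointer):
    """Resolve the data of an element() pointer, e.g. "/1/2" or "foo/2/3".

    Assumption: IDs are carried by xml:id attributes.
    Returns the selected element, or None if the pointer selects nothing.
    """
    name, *steps = pointer.split("/")
    steps = [int(step) for step in steps]
    if name:
        # Start at the element with the given ID.
        node = next((e for e in root.iter() if e.get(XML_ID) == name), None)
    elif steps and steps[0] == 1:
        # A leading /1 selects the root element itself.
        node, steps = root, steps[1:]
    else:
        node = None
    for step in steps:
        if node is None or not 1 <= step <= len(node):
            return None
        node = node[step - 1]  # child sequences count from 1
    return node

doc = ET.fromstring(
    '<doc><a/><b xml:id="foo"><c/><d><e/><f/><g/></d></b></doc>'
)
second_child = resolve_element_pointer(doc, "/1/2")
deep = resolve_element_pointer(doc, "foo/2/3")
print(second_child.tag, deep.tag)
```

Child sequences are positional, so they break silently when the target document is restructured; ID-based pointers are usually more robust.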

Reuse with stand-off markup

  • DITA and DocBook provide stand-off markup for building documents from components

    • DITA calls them maps

    • DocBook calls them assemblies

  • There's no “root” document that pulls in the components

  • Instead, the assembly describes how the pieces are pulled together

DocBook assembly markup

<assembly xmlns="http://docbook.org/ns/docbook">
  <resources>
    <resource xml:id="r" fileref="rsrc.xml"/></resources>

  <structure xml:id="h" type="helpsystem"></structure>
  <structure xml:id="b" type="book"></structure>

  <relationships></relationships>
  <transforms></transforms>
</assembly>

DocBook assembly example

<assembly xmlns="http://docbook.org/ns/docbook"
          xmlns:xlink="http://www.w3.org/1999/xlink">
  <resources>
    <resource xml:id="xidi.overview"
              fileref="xidi-overview.xml"/>
    <resource xml:id="scr.book.build"
              fileref="scr-book-build.xml"/></resources>

  <structure xml:id="xidi.help.system"
             type="helpsystem"
             defaultformat="helpsystem">
    <output format="pdf" file="xidi-help-system.pdf"/>
    <output format="helpsystem ohj"/>
    <filterout condition="manual.only"/>
    <title>XIDI Help System</title>
    <info>
      <abstract>
        <para>This is the help system…
        </para>
      </abstract>
    </info>
    <revhistory>
      <revision>
        <revnumber>0.1</revnumber>
        <date>1 August 2009</date>
      </revision>
    </revhistory>
    <module>
      <output file="sys-toc.html"/>
      <toc/>
      <toc role="procedures"/>
    </module>
    <module xml:id="help.xidi.overview" >
      <output file="overview.html"/>
      <title>XIDI Help System Overview</title>
      <module resourceref="help.overview.intro"
              contentonly="true" omittitles="true"/>
      <module resourceref="xidi.overview">
        <output file="ovr-xidi.html"/>
      </module>
    </module>
  </structure>

  <structure xml:id="user.guide" type="book">
    <output renderas="book"/>
    <output format="html"
            file="xidi-user-guide.html"/>
    <output format="pdf"
            file="xidi-user-guide.pdf"/>
    <title>XIDI User Guide</title>
    <toc/>
    <toc role="figures"/>
    <toc role="tables"/>
    <toc role="procedures"/>
    <module resourceref="xidi.overview"
            renderas="chapter"/>
    <module resourceref="xidi.create.intro"
            renderas="chapter"/>
  </structure>

  <relationships>
    <relationship linkend="xidi.help.system"
                  type="path">
      <association>New User Introduction</association>
      <instance linkend="help.xidi.overview"/>
      <instance linkend="help.svn.overview"/>
      <instance linkend="help.ex.new.help.sys"/>
    </relationship>

    <relationship type="collection">
      <association>Advanced User Topics</association>
      <instance linkend="xidi.parameters.syntax"/>
      <instance linkend="svn.properties"/>
    </relationship>
  </relationships>

  <transforms>
    <transform grammar="dita"
               fileref="dita2docbook.xsl"/>
    <transform name="tutorial"
               fileref="docbook2tutorial.xsl"/>
  </transforms>
</assembly>
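
As a rough illustration of how a tool might read an assembly, the Python sketch below lists the files each structure pulls in. The cut-down assembly and the files_for function are hypothetical; only the element and attribute names (resource, structure, module, fileref, resourceref) come from the example above.

```python
import xml.etree.ElementTree as ET

DB = "{http://docbook.org/ns/docbook}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# A cut-down, hypothetical assembly in the same shape as the example.
ASSEMBLY = """
<assembly xmlns="http://docbook.org/ns/docbook">
  <resources>
    <resource xml:id="overview" fileref="overview.xml"/>
    <resource xml:id="install" fileref="install.xml"/>
  </resources>
  <structure xml:id="user.guide" type="book">
    <module resourceref="overview"/>
    <module resourceref="install"/>
  </structure>
  <structure xml:id="quick.help" type="helpsystem">
    <module resourceref="overview"/>
  </structure>
</assembly>
"""

def files_for(assembly, structure_id):
    """List the files a structure pulls in, in module order."""
    resources = {
        r.get(XML_ID): r.get("fileref") for r in assembly.iter(DB + "resource")
    }
    for structure in assembly.iter(DB + "structure"):
        if structure.get(XML_ID) == structure_id:
            return [
                resources[m.get("resourceref")]
                for m in structure.iter(DB + "module")
                if m.get("resourceref") in resources
            ]
    raise KeyError(structure_id)

assembly = ET.fromstring(ASSEMBLY)
guide_files = files_for(assembly, "user.guide")
help_files = files_for(assembly, "quick.help")
print(guide_files, help_files)
```

The same resource (overview.xml) feeds both deliverables, and neither source file knows it is being reused.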

Blind interchange

  • In most cases, in order for partners to exchange documents, both partners must understand all of the markup in the exchanged documents.

  • In other words, I can't usefully exchange DocBook with someone expecting TEI.

  • Blind interchange describes the situation where partners exchange documents without prior knowledge of each other's markup

  • It requires adhering to a set of constraints that allow one element to be a “subtype” of another with the guarantee that processing the subtype like its “supertype” will do something useful

  • It is a feature of DITA

Validation

Why validate?

  • Most processes, especially in publishing, are transformative: XML to HTML, XML to PDF, XML to EPUB, etc.

  • Those transformations are written by people who believe they understand the structure of the documents to be transformed

  • If the structure differs from expectations, the results will be ugly at best, catastrophically misleading at worst

  • The more complex the process, the more important it is to understand the incoming markup

  • Validation is the easiest way to catch markup errors

When do you validate?

  • Ideally, while you're typing your documents

  • Absolutely, before you do anything else with them!

Schema Languages

There are three significant grammar-based schema technologies:

  • Document Type Definitions (DTDs)

  • W3C XML Schemas

  • RELAX NG grammars

There are other, non-grammar-based technologies, of which Schematron is probably the best known.

DTDs

  • Widely available (supported by almost all tools)

  • Normatively part of the XML specification

    • But validation is optional

  • Not written in XML-document syntax

    • Poor support for documentation

    • Not usable in some environments

  • Supports entities (a text-based macro language)

  • Not namespace aware

  • Very limited data type support

W3C XML Schema

  • Supported by many tools

  • Also developed at the W3C

  • Written in XML-document syntax

  • Namespace aware

  • Extensive but not extensible data type support

  • Hierarchical data types (typed object graphs)

  • Grammars must be unambiguous

RELAX NG

  • Supported by some tools

  • Developed at OASIS

  • Written in XML-document syntax

    • With a very popular, official compact (non-XML) syntax

  • Namespace aware

  • Supports all the XML Schema data types, plus is extensible

  • Grammars may be ambiguous

  • No obvious support for typed object graphs

A Document Example

<doc xmlns="http://www.xmlsummerschool.com/example/ns"
     status="draft">
<head>
  <title>A Sample Document</title>
  <date>2011-09-22T09:00:00+01:00</date>
  <author>Norman Walsh</author>
</head>
<body>
  <p>Paragraph. <em>Important</em> paragraph.</p>
  <p>Paragraph.<fn><p>Redundant, ain't he?</p>
  </fn></p>
</body>
</doc>

But what are the rules?

What makes one of our documents one of ours and not something else? When is a purchase order not a cocktail recipe?

  • A doc consists of a head and a body, in that order

  • A head contains a title, date, and author, in any order

  • A body only contains p elements

  • A p contains text, em, or fn elements mixed together
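
To see what these four rules amount to, here is a hand-written checker in Python. It is only a sketch of what a schema states declaratively: the function names and messages are made up, and a real schema language would be used in practice.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.xmlsummerschool.com/example/ns}"

def local(element):
    """Strip the namespace part of a tag name."""
    return element.tag.replace(NS, "")

def check(doc):
    """Return a list of violations of the four structural rules."""
    problems = []
    if [local(c) for c in doc] != ["head", "body"]:
        problems.append("doc must contain head then body")
        return problems
    head, body = doc
    if sorted(local(c) for c in head) != ["author", "date", "title"]:
        problems.append("head must contain one title, date, and author")
    if any(local(c) != "p" for c in body):
        problems.append("body may only contain p elements")
    for p in body.iter(NS + "p"):
        if any(local(c) not in ("em", "fn") for c in p):
            problems.append("p may only contain text, em, or fn")
    return problems

good = ET.fromstring(
    '<doc xmlns="http://www.xmlsummerschool.com/example/ns" status="draft">'
    "<head><date>2011-09-22T09:00:00+01:00</date><title>T</title>"
    "<author>A</author></head>"
    "<body><p>Paragraph. <em>Important</em> paragraph.</p></body></doc>"
)
bad = ET.fromstring(
    '<doc xmlns="http://www.xmlsummerschool.com/example/ns">'
    "<body><p>Paragraph.</p></body><head/></doc>"
)
print(check(good))
print(check(bad))
```

Every new rule means more code here; a schema language lets you state the rules and leave the checking to a validator.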

More rules

The “rules” about a document exist in a spectrum from simple, structural rules all the way to business process/workflow rules.

  • Paragraphs in footnotes can't themselves have footnotes

  • Dates have to be real (ISO 8601) dates

  • Dates have to be expressed in UTC

  • Documents can have at most four footnotes

  • Documents with the status “final” can only be published on Thursdays

  • Author names have to be in the master author database

  • Documents can have at most four footnotes per page

DTD Rules

  • <!ELEMENT doc (head, body)>
    <!-- Documentation, what documentation? -->
  • <!ATTLIST doc
              xmlns   CDATA          #FIXED
                "http://www.xmlsummerschool.com/example/ns"
              status  (draft|final)  #IMPLIED>
  • <!ELEMENT head (title, date, author)>
    • Forces a fixed order, but the rule says any order.
  • <!ELEMENT head (title & date & author)>
    • The & connector is SGML only; XML DTDs do not allow it.
  • <!ELEMENT head (title | date | author)>
    • Allows exactly one of the three.
  • <!ELEMENT head (title | date | author)+>
    • Allows multiple titles, dates, and authors; doesn't require one of each.

DTD Rules (continued)

  • <!ELEMENT title (#PCDATA)*>
    <!ELEMENT date (#PCDATA)*>
    <!ELEMENT author (#PCDATA)*>
    • Allows any string as a date

  • <!ELEMENT body (p+)>

DTD Rules (continued)

  • <!ELEMENT p (#PCDATA|em|fn)*>
  • <!ELEMENT em (#PCDATA|em|fn)*>
  • <!ELEMENT fn (p+)>
  • But is this really sufficient?

  • <p>This is some text.
    <fn><p>With footnote text.
        <fn><p>Which is also text.</p></fn></p>
    </fn>
    Is that what we intended?</p>
  • There's nothing in DTDs to exclude nesting.
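
Since the DTD cannot express this constraint, one option is a separate check after validation. A minimal Python sketch, assuming un-namespaced elements as in the fragment above (the function name is made up):

```python
import xml.etree.ElementTree as ET

def nested_footnotes(root):
    """Return fn elements that contain another fn at any depth."""
    return [
        fn for fn in root.iter("fn")
        if any(inner is not fn for inner in fn.iter("fn"))
    ]

sample = ET.fromstring(
    "<p>This is some text."
    "<fn><p>With footnote text."
    "<fn><p>Which is also text.</p></fn></p></fn>"
    " Is that what we intended?</p>"
)
clean = ET.fromstring("<p>Text.<fn><p>One footnote.</p></fn></p>")

offenders = nested_footnotes(sample)
print(len(offenders), len(nested_footnotes(clean)))
```

The check catches nesting at any depth, including a footnote hidden inside an em inside a footnote paragraph.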

XML Schema Rules

XML Schemas are XML documents, so they have to have a root element.

<schema xmlns="http://www.w3.org/2001/XMLSchema"
 xmlns:d="http://www.xmlsummerschool.com/example/ns"
 elementFormDefault="qualified"
 targetNamespace="http://www.xmlsummerschool.com/example/ns">

<annotation>
  <documentation>
    <p xmlns="http://www.w3.org/1999/xhtml">
      This is documentation.
    </p>
  </documentation>
</annotation>

<!-- declarations go here -->
</schema>

XML Schema Rules (continued)

<complexType name="Document">
  <sequence>
    <element name="head" type="d:Head"/>
    <element name="body" type="d:Body"/>
  </sequence>
  <attribute name="status" type="d:Status"/>
</complexType>

XML Schema Rules (continued)

<simpleType name="Status">
  <restriction base="string">
    <enumeration value="draft"/>
    <enumeration value="final"/>
  </restriction>
</simpleType>

XML Schema Rules (continued)

<complexType name="Head">
  <all>
    <element name="title" type="string"/>
    <element name="date" type="dateTime"/>
    <element name="author" type="string"/>
  </all>
</complexType>

<complexType name="Body">
  <sequence minOccurs="0" maxOccurs="unbounded">
    <element ref="d:p"/>
  </sequence>
</complexType>

XML Schema Rules (continued)

<element name="p">
  <complexType mixed="true">
    <choice minOccurs="0" maxOccurs="unbounded">
      <element ref="d:em"/>
      <element ref="d:fn"/>
    </choice>
  </complexType>
</element>

<element name="em">
  <complexType mixed="true">
    <choice minOccurs="0" maxOccurs="unbounded">
      <element ref="d:em"/>
      <element ref="d:fn"/>
    </choice>
  </complexType>
</element>

XML Schema Rules (continued)

<element name="fn">
  <complexType mixed="true">
    <choice minOccurs="1" maxOccurs="unbounded">
      <element name="p">
        <complexType mixed="true">
          <choice minOccurs="0" maxOccurs="unbounded">
            <element ref="d:em"/>
          </choice>
        </complexType>
      </element>
    </choice>
  </complexType>
</element>

(Anyone see the bug?)

RELAX NG Rules

RELAX NG grammars are XML documents, so they have to have a root element.

<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
 ns="http://www.xmlsummerschool.com/example/ns"
 datatypeLibrary
   ="http://www.w3.org/2001/XMLSchema-datatypes">
<div>
  <p xmlns="http://www.w3.org/1999/xhtml">This
  is some documentation.
  The div wrapper is just for grouping.
  </p>

  <start>
    <ref name="doc"/>
  </start>
</div>

<!-- declarations go here -->
</grammar>

RELAX NG Rules (continued)

<define name="doc">
  <element name="doc">
    <attribute name="status">
      <choice>
        <value>draft</value>
        <value>final</value>
      </choice>
    </attribute>
    <group>
      <ref name="head"/>
      <ref name="body"/>
    </group>
  </element>
</define>

RELAX NG Rules (continued)

<define name="head">
  <element name="head">
    <interleave>
      <ref name="date"/>
      <ref name="title"/>
      <ref name="author"/>
    </interleave>
  </element>
</define>

<define name="date">
  <element name="date">
    <data type="dateTime"/>
  </element>
</define>

RELAX NG Rules (compact syntax)

One of the appealing features of RELAX NG is its compact syntax.

default namespace
  = "http://www.xmlsummerschool.com/example/ns"
namespace h = "http://www.w3.org/1999/xhtml"

[
   h:p [ "This is some documentation. The div"
         " wrapper is just for grouping." ]
]
div {
   start = doc
}

RELAX NG Rules (compact syntax, continued)

doc =
   element doc {
      attribute status { "draft" | "final" },
      (head, body)
   }

head =
   element head {
      (date & title & author)
   }

date   = element date   { xsd:dateTime }
title  = element title  { text }
author = element author { text }

RELAX NG Rules (compact syntax, continued)

body =
   element body {
      (p+)
   }

p =
   element p {
      (text | em | fn)*
   }

em =
   element em {
      (text | em | fn)*
   }

RELAX NG Rules (compact syntax, continued)

fn =
   element fn {
      (limitedp+)
   }

limitedp =
   element p {
      (text | limitedem)*
   }

limitedem =
   element em {
      (text | limitedem)*
   }

Schematron

  • Schematron can be used to evaluate extra-grammatical constraints

  • Essentially arbitrary XPath expressions are evaluated in the context of appropriate elements

  • Schematron rules can be embedded in RELAX NG and XML Schema documents for convenience

Schematron example

Recall our earlier constraints:

  • Dates have to be expressed in UTC

  • Documents can have at most four footnotes

<s:schema
 xmlns:s="http://purl.oclc.org/dsdl/schematron"
 queryBinding="xslt2">
 <s:ns prefix="ex"
    uri="http://www.xmlsummerschool.com/example/ns"/>
 <s:ns prefix="xs"
    uri="http://www.w3.org/2001/XMLSchema"/>

Schematron example (continued)

Dates in UTC:

<s:pattern name="Dates in UTC">
     <s:rule context="ex:date">
       <s:assert
    test="timezone-from-dateTime(xs:dateTime(.))
            = xs:dayTimeDuration('PT0H')"
       >Dates must be expressed in UTC.</s:assert>
     </s:rule>
   </s:pattern>

Schematron example (continued)

At most four footnotes.

<s:pattern name="At most four footnotes">
      <s:rule context="/*">
         <s:assert test="count(//ex:fn) &lt;= 4"
         >At most four footnotes are allowed.
         </s:assert>
      </s:rule>
   </s:pattern>
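
For comparison, the same two constraints can be sketched as ordinary Python. This is not Schematron, just an illustration of what the assertions test; the check function is made up, and a date without a timezone is treated here as not being UTC.

```python
from datetime import datetime, timedelta
import xml.etree.ElementTree as ET

NS = {"ex": "http://www.xmlsummerschool.com/example/ns"}

def check(doc):
    """Return messages for the two constraints; an empty list means success."""
    messages = []
    for date in doc.iterfind(".//ex:date", NS):
        # fromisoformat() needs an explicit offset; map a trailing Z to +00:00.
        stamp = datetime.fromisoformat(date.text.strip().replace("Z", "+00:00"))
        if stamp.utcoffset() != timedelta(0):
            messages.append("Dates must be expressed in UTC.")
    if len(doc.findall(".//ex:fn", NS)) > 4:
        messages.append("At most four footnotes are allowed.")
    return messages

doc = ET.fromstring(
    '<doc xmlns="http://www.xmlsummerschool.com/example/ns" status="draft">'
    "<head><title>A Sample Document</title>"
    "<date>2011-09-22T09:00:00+01:00</date>"
    "<author>Norman Walsh</author></head>"
    "<body><p>Paragraph.<fn><p>Redundant, ain't he?</p></fn></p></body>"
    "</doc>"
)
print(check(doc))
```

The sample document from earlier fails the first constraint, because its date carries a +01:00 offset, and passes the second. Note that the element names must be namespace-qualified in both versions, or the tests silently match nothing.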

Revisiting XInclude

Consider this document:

<doc xmlns="http://www.xmlsummerschool.com/example/ns"
     xmlns:xi="http://www.w3.org/2001/XInclude"
     status="draft">
<head>
  <title>A Sample Document</title>
  <date>2011-09-22T09:00:00+01:00</date>
  <author>Norman Walsh</author>
</head>
<xi:include href="body.xml"/>
</doc>

Is it valid?

  • Before XInclude processing?

  • After XInclude processing?

  • Both before and after?

Managing documents

What's to manage?

  • A single book might consist of a few dozen “resources”

  • A set of books might consist of a few hundred resources

  • The documentation for three products across 14 languages and six configurations might consist of many thousands of resources

Options

  • Filesystem

  • Source code control system

  • Database

  • Content management system

Filesystem management

  • Conceptually easy and familiar

  • Search, backup, etc. all work exactly as they do for the other files on your system

  • Versioning, locking, conflict resolution all absent

SCCS management

  • Familiar to programmers, source code control systems provide a layer of versioning, locking, and conflict resolution on top of the filesystem

  • Examples: Subversion is centralized; Mercurial and Git are decentralized.

  • Works mostly like the filesystem

Database management

  • Databases provide a whole new range of capabilities: indexing, searching, etc.

  • Not generally like a filesystem, may require new practices

  • Traditional relational databases are not a good fit for XML. Just. Don't. Go. There.

  • XML and (some) NoSQL databases are a better fit.

  • MarkLogic, ahem, makes an excellent database for XML.

Content management system

  • Usually built on top of a database

  • Provide yet more features for management and workflow

  • Often provide features for designing and implementing management workflows (for example, no document can be published until Q/A has signed off on it)

Workflow

Options

  • Part of your management system

    • For example, MarkLogic Document Library Services & Content Processing Framework

    • RSuite or another CMS

  • Make or Ant

  • XSLT or XQuery

  • XProc

Make

  • Traditional Unix, text-based tool

  • Filesystem based

  • Tracks dependencies and keeps things “up-to-date”

  • Drives command-line tools

# XSLT and XINCLUDE are assumed to name the relevant command-line tools.
all: webtech.html

webtech.html: webtech.inc dbstyle.xsl \
              graphics/figure1.png
	$(XSLT) $< dbstyle.xsl $@

webtech.inc: webtech.xml
	$(XINCLUDE) < $< > $@
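
The "up-to-date" test at the heart of Make can be sketched in Python as a timestamp comparison. This is a simplification under stated assumptions: the needs_rebuild function is made up, and real Make also handles chains of dependencies, phony targets, and more.

```python
import os
import tempfile

def needs_rebuild(target, dependencies):
    """True if target is missing or older than any of its dependencies."""
    if not os.path.exists(target):
        return True
    built = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > built for dep in dependencies)

with tempfile.TemporaryDirectory() as workdir:
    source = os.path.join(workdir, "webtech.xml")
    target = os.path.join(workdir, "webtech.inc")
    with open(source, "w") as handle:
        handle.write("<doc/>")
    missing = needs_rebuild(target, [source])  # target not built yet

    with open(target, "w") as handle:
        handle.write("<doc/>")
    os.utime(source, (1000, 1000))  # pretend the source is old
    os.utime(target, (2000, 2000))  # and the target is newer
    fresh = needs_rebuild(target, [source])

    os.utime(source, (3000, 3000))  # the source was edited afterwards
    stale = needs_rebuild(target, [source])

print(missing, fresh, stale)
```

Because the test relies on file timestamps, it only works when every resource lives in the filesystem.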

Ant

  • Java and XML based tool

  • Filesystem based

  • Allows authors to build flow graphs

  • Drives Java or command-line tools; extensible in Java

<project name="example" default="pubdoc" basedir=".">
  <description>An example ant file</description>

  <property name="build.dir" value="output"/>

  <target name="init">
    <mkdir dir="${build.dir}"/>
  </target>

  <target name="pubdoc" depends="init,xinclude">
    <xslt in="webtech.inc" style="dbstyle.xsl"
          out="${build.dir}/webtech.html"/>
  </target>

  <target name="xinclude">
    <xslt in="webtech.xml" style="xinclude.xsl"
          out="webtech.inc"/>
  </target>
</project>

XSLT or XQuery

  • XML technologies

  • Capable of nearly arbitrary transformation

  • Built into some databases and content management systems

XProc

  • XML based, designed for XML processing

  • Allows authors to write simple, mostly declarative pipelines with a rich, and extensible, vocabulary of steps

<p:pipeline xmlns:p="http://www.w3.org/ns/xproc"
            version="1.0">

<p:xinclude/>

<p:xslt>
  <p:input port="stylesheet">
    <p:document href="dbstyle.xsl"/>
  </p:input>
</p:xslt>

</p:pipeline>

Technical hodge-podge

URI and entity resolvers

  • Entities and other external resources are accessed via URIs

  • Proxies and resolvers can intercede

  • Most resolvers use XML Catalogs

<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog"
         prefer="public">

  <system systemId
="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
          uri="/share/doctypes/xhtml1-strict.dtd"/>

  <system systemId
="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
          uri="/share/doctypes/xhtml1-transitional.dtd"/>

</catalog>
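
What a resolver does with such a catalog can be sketched in Python. This is a deliberately small illustration that handles only system entries; the function names are made up, and real resolvers support many more entry types (public identifiers, rewriting, delegation).

```python
import xml.etree.ElementTree as ET

CAT = "{urn:oasis:names:tc:entity:xmlns:xml:catalog}"

CATALOG = """
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog" prefer="public">
  <system systemId="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
          uri="/share/doctypes/xhtml1-strict.dtd"/>
</catalog>
"""

def load_system_map(catalog_xml):
    """Map each system identifier in the catalog to its local URI."""
    catalog = ET.fromstring(catalog_xml)
    return {
        entry.get("systemId"): entry.get("uri")
        for entry in catalog.iter(CAT + "system")
    }

def resolve(system_id, system_map):
    """Return the local copy if the catalog lists one, else the original."""
    return system_map.get(system_id, system_id)

system_map = load_system_map(CATALOG)
local_copy = resolve(
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd", system_map
)
unlisted = resolve("http://example.com/other.dtd", system_map)
print(local_copy, unlisted)
```

The document keeps its public, canonical URI; only the resolver knows that a local copy is used instead.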

Mixing namespaces

  • XML namespaces provide a global naming mechanism

  • This facilitates the mixing of different vocabularies:

    • MathML and SVG in DocBook

    • XInclude in TEI

    • Recipe markup in a purchase order

  • Generally speaking, this requires tools to understand the mixture

NVDL

  • How do you validate documents that use multiple namespaces?

  • One approach is to include the mixtures in the schema: the DocBook 5.0 schema knows that MathML can occur in equations, for example

  • NVDL, the Namespace-based Validation Dispatching Language, is another approach

  • An NVDL document describes how to decompose a mixed document into individual documents that can be validated independently
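
The first step of any namespace-based dispatch is finding out which namespaces a document uses. The Python sketch below shows only that inventory step, on a made-up mixed document; it does not implement NVDL, which also splits the document into sections and sends each one to a schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def namespaces_used(root):
    """Count elements per namespace URI ("" for no namespace)."""
    counts = Counter()
    for element in root.iter():
        tag = element.tag
        uri = tag[1:].split("}", 1)[0] if tag.startswith("{") else ""
        counts[uri] += 1
    return counts

mixed = ET.fromstring(
    '<article xmlns="http://docbook.org/ns/docbook">'
    "<equation>"
    '<math xmlns="http://www.w3.org/1998/Math/MathML"><mi>x</mi></math>'
    "</equation></article>"
)
counts = namespaces_used(mixed)
for uri, count in counts.items():
    print(uri, count)
```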

NVDL Example

<rules xmlns="http://purl.oclc.org/dsdl/nvdl/ns/structure/1.0"
       startMode="docbook">

<mode name="docbook">
  <namespace ns="http://docbook.org/ns/docbook">
    <validate schema="rng/docbook.rng"
              useMode="attach"/>
    <validate schema="sch/docbook.sch"
              useMode="attach"/>
  </namespace>
</mode>

<mode name="attach">
  <anyNamespace>
    <attach/>
  </anyNamespace>
</mode>

</rules>

Q&A

The floor is yours...