Managing XML Assets in a Developing Environment

Norman Walsh

Sun Microsystems, Inc.

About the speaker

  • Norm is an XML Standards Architect at Sun Microsystems, Inc.

  • Has more than a decade of experience with SGML and XML publishing systems.

  • Elected member of the W3C Technical Architecture Group; also chair of the XML Processing Model Working Group, co-chair of the XML Core Working Group, and member of, and editor for, the XSL Working Group at the W3C.

  • Chair of the DocBook Technical Committee at OASIS. Also editor for the Entity Resolver Technical Committee and a member of the RELAX NG Technical Committee.

  • Specification lead for JSR 206, the Java API for XML Processing in the Java Community Process.

  • Original developer and project lead for the DocBook DSSSL and DocBook XSL Stylesheet projects. Creator and contributor to numerous open-source projects.

Goals

  • Understanding the information that you have

  • Adding structure to your information

    • Off the shelf or roll your own?

    • What kind of markup and how much?

  • Planning for the future

    • Business needs evolve

    • Change happens

Credit where credit is due

  • Developing SGML DTDs

  • Terry Allen, Jon Bosak, Paul Grosso, Eve Maler, Murray Maloney

How do we interact with documents?

  • Creation and modification

  • Storage and archiving

  • Use

Creating and modifying documents

  • Authoring

  • Editing

  • Validation

  • Review

  • Conversion

  • Transformation

Storing and archiving documents

  • Classification

  • Assembly

  • Reuse

  • Exchange

Using documents

  • Printing

    • Navigational structures

    • Indexes

    • Tables of Contents

  • Reading online

    • Navigational structures

    • Searching

  • Extraction

  • Analysis

What is markup for?

  • Enforce requirements about the structure and meaning of information assets

  • Improve management of whole and partial documents

  • Support applications that format, index, and otherwise process documents

  • Provide metadata about the content of documents

Kinds of markup

  • Procedural

    • Linear flow with embedded formatting commands

    • troff, RTF, WordStar “dot commands”; TeX tends in this direction, though LaTeX demonstrates that you can do declarative markup with TeX too

    • Office documents without using “styles”

  • Declarative markup

    • Hierarchical structure with semantic identifiers

    • SGML, XML

    • Office documents with rigorous use of carefully designed styles

Contextual markup

  • Explicit rules about what goes where

  • Structural integrity

  • Searching

  • Cross-referencing

Inventing markup

  • It's as much art as it is science

  • It's a collaboration between domain experts, technologists, managers, and other groups that have responsibility for information assets

  • It's very dependent on the nature of the information involved and the ways that it is used now (and may be used in the future)

  • Sometimes it makes sense to roll your own

  • Sometimes it makes sense to adopt a standard

Roll your own?

  • Fits your needs explicitly

    • At least, to the extent that your analysis and design identified and codified those needs

  • It's hard work

  • Remember the whole toolchain

Use a standard?

  • Provides the benefit of an existing community of users

  • Has significant tool advantages

  • Possibly not quite the right markup

  • When does one-size-fits-all fit you?

Example: A fine margarita

1½ oz   Tequila
½ oz   Paula's Texas Orange
1 oz   Lime juice

Rub the rim of a margarita glass with the rind of a lime and dip rim in salt. Shake ingredients with ice and strain into glass.

Margarita recipe (Recipe Markup)

<beverage virgin="false">
  <name>Margarita</name>
  <source>Paula Angerstein</source>
  <ingredientList>
    <ingredient>
      <quantity units="oz">1.5</quantity>
      <name>Tequila</name>
    </ingredient>
    <ingredient>
      <quantity units="oz">0.5</quantity>
      <name>Paula’s Texas Orange</name>
    </ingredient>
    <ingredient>
      <quantity units="oz">1</quantity>
      <name>Lime Juice</name>
    </ingredient>
  </ingredientList>
  <preparation>
    <p>Rub the rim of a margarita glass with the rind of a lime and
dip rim in salt. Shake ingredients with ice and strain into
glass.</p>
  </preparation>
</beverage>

Margarita recipe (HTML Markup)

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Margarita</title>
<meta name="source" content="Paula Angerstein"/>
<meta name="virgin" content="false"/>
</head>
<body>
<h1>Margarita</h1>
<dl>
<dt><span class="q">1½</span> <span class="u">oz</span> Tequila</dt>
<dt><span class="q">½</span> <span class="u">oz</span> Paula's Texas Orange</dt>
<dt><span class="q">1</span> <span class="u">oz</span> Lime juice</dt>
</dl>
<p>Rub the rim of a margarita glass with the rind
of a lime and dip rim in salt. Shake ingredients
with ice and strain into glass.</p>
</body>
</html>

Embrace and extend

  • Select a standard that looks close and customize it

  • Some interchange and tool advantages

  • Markup that more closely fits your needs

  • Customization can be hard too

Margarita recipe (Extended DocBook Markup)

<article xmlns="http://docbook.org/ns/docbook" version="5.0-extension recipes">
<info>
<title>Margarita</title>
<releaseinfo role="virgin">false</releaseinfo>
<author><personname>Paula Angerstein</personname>
</author>
</info>
<ingredientlist>
<ingredient units="oz" quantity="1.5">Tequila</ingredient>
<ingredient units="oz" quantity="0.5">Paula's Texas Orange</ingredient>
<ingredient units="oz" quantity="1">Lime juice</ingredient>
</ingredientlist>
<section xml:id="preparation">
<title>Preparation</title>
<para>Rub the rim of a margarita glass with the rind of a lime and dip
rim in salt. Shake ingredients with ice and strain into glass.</para>
</section>
</article>

How much markup?

  • Costs/benefits

    • More markup = more cost

    • More markup = more benefit?

  • Markup results

    • Insufficient markup

    • Too much markup

    • Incorrect markup

  • Remember your authors

CRLF (Insufficient markup)

<glossentry>crlf: /ker´l@f/, /kru´l@f/, /C·R·L·F/, n.

    (often capitalized as ‘CRLF’) A carriage return (CR, ASCII
0001101) followed by a line feed (LF, ASCII 0001010). More loosely,
whatever it takes to get you from the end of one line of text to the
beginning of the next line. See newline. Under Unix influence this
usage has become less common (Unix uses a bare line feed as its
‘CRLF’).</glossentry>

From The Jargon File, version 4.4.7 by Eric S. Raymond

CRLF (Reasonable markup)

<glossentry xml:id="crlf">
<glossterm>crlf</glossterm>
<!-- pronunciation? -->
<glossdef>
<para>(often capitalized as ‘CRLF’) A carriage return (CR,
<acronym>ASCII</acronym> 0001101) followed by a line feed (LF,
<acronym>ASCII</acronym> 0001010). More loosely, whatever it takes to
get you from the end of one line of text to the beginning of the next
line. <xref linkend="newline"/>. Under <productname>Unix</productname>
influence this usage has become less common
(<productname>Unix</productname> uses a bare line feed as its
‘CRLF’).</para>
</glossdef>
</glossentry>

CRLF (Too much markup?)

<glossentry xml:id="crlf">
<glossterm>crlf</glossterm>
<glossdef>
<para><parenthetical-remark>(often capitalized as
‘<acronym>CRLF</acronym>’)</parenthetical-remark> A <charname>carriage
return</charname> <parenthetical-remark>(<acronym>CR</acronym>,
<acronym>ASCII</acronym>
<binary>0001101</binary>)</parenthetical-remark> followed by a
<charname>line feed</charname>
<parenthetical-remark>(<acronym>LF</acronym>, <acronym>ASCII</acronym>
<binary>0001010</binary>)</parenthetical-remark>. More loosely, whatever it takes to
get you from the end of one line of text to the beginning of the next
line. <xref linkend="newline"/>. Under <productname>Unix</productname>
influence this usage has become less common
<parenthetical-remark>(<productname>Unix</productname> uses a bare
<charname>line feed</charname> as its
‘<acronym>CRLF</acronym>’)</parenthetical-remark>.</para>
</glossdef>
</glossentry>

CRLF (Wrong markup)

<glossentry xml:id="crlf">
<glossterm>crlf</glossterm>
<glossdef>
<para>(often capitalized as ‘CRLF’) A carriage return (CR,
<acronym>ASCII</acronym> 0001101) followed by a line feed (LF,
<abbrev>ASCII</abbrev> <acronym>0001010</acronym>). More loosely,
whatever it takes to get you from the end of one line of text to the
beginning of the next line. <link linkend="foo">newline</link>. Under
<trademark>Unix</trademark> influence this usage has become less
common (<trademark>Unix</trademark> uses a bare
<foreignphrase>line feed</foreignphrase> as its
‘CRLF’).</para>
</glossdef>
</glossentry>

Thinking about markup

  • How big is the content you're identifying?

    • Word sized?

    • Phrase sized?

    • Paragraph sized?

    • Container sized?

  • Enforce useful structural rules

Tree diagrams

  • A visual model for XML structures

Recipe tree diagram

Defining your markup

  • Different validation technologies impose different design constraints

  • Common constraint languages

    • DTD

    • W3C XML Schema

    • RELAX NG Grammar

    • Schematron

DTD

  • Widely understood (at least by us old timers)

  • Supported by every validating XML parser (by definition)

  • Supports entities

  • No ambiguity allowed*

  • No co-constraints*

  • Not in XML document syntax

  • Almost no data types

* We'll come back to ambiguity and co-constraints in a moment.
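
As a rough illustration of these trade-offs, the recipe markup shown earlier could be declared with a DTD along the following lines. This is only a sketch, not the DTD behind the original example: the content models, the default for virgin, and the decision to make units required are all assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE beverage [
  <!-- The declarations use DTD syntax, not XML element syntax. -->
  <!ELEMENT beverage (name, source, ingredientList, preparation)>
  <!ATTLIST beverage virgin (true|false) "false">
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT source (#PCDATA)>
  <!ELEMENT ingredientList (ingredient+)>
  <!ELEMENT ingredient (quantity, name)>
  <!-- Almost no data types: a DTD cannot say that quantity is a number. -->
  <!ELEMENT quantity (#PCDATA)>
  <!ATTLIST quantity units CDATA #REQUIRED>
  <!ELEMENT preparation (p+)>
  <!ELEMENT p (#PCDATA)>
]>
<beverage virgin="false">
  <name>Margarita</name>
  <source>Paula Angerstein</source>
  <ingredientList>
    <ingredient>
      <quantity units="oz">1.5</quantity>
      <name>Tequila</name>
    </ingredient>
    <ingredient>
      <quantity units="oz">0.5</quantity>
      <name>Paula's Texas Orange</name>
    </ingredient>
    <ingredient>
      <quantity units="oz">1</quantity>
      <name>Lime Juice</name>
    </ingredient>
  </ingredientList>
  <preparation>
    <p>Rub the rim of a margarita glass with the rind of a lime and
dip rim in salt. Shake ingredients with ice and strain into
glass.</p>
  </preparation>
</beverage>
```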

W3C XML Schema

  • Supports typed object graphs

  • Supports scoped identity constraints

  • No ambiguity allowed

  • No co-constraints

  • Widely considered hard to understand
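
For comparison with the untyped text of a DTD, here is a minimal sketch of how W3C XML Schema might type the quantity element from the recipe markup. The choice of xs:decimal and xs:token is an assumption made for illustration.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- quantity holds a decimal number and carries a required units attribute. -->
  <xs:element name="quantity">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:decimal">
          <xs:attribute name="units" type="xs:token" use="required"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

With this declaration, a value such as 1.5 is accepted and a value such as "a splash" is rejected.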

RELAX NG Grammar

  • Easy to customize

  • Allows ambiguity

  • Supports co-constraints

  • Supports simple data types

  • Not (yet) as widely supported in tools

Schematron

  • Allows very sophisticated validation with co-constraints, etc.

  • Not grammar based

  • Ideal in combination with one of the other languages
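
A small sketch of the kind of rule that Schematron can add on top of a grammar, written against the recipe markup. The rule itself (a virgin beverage may not list Tequila) is a hypothetical business rule invented for this example.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <!-- Applies only to beverages that claim to be alcohol free. -->
    <rule context="beverage[@virgin = 'true']">
      <assert test="not(ingredientList/ingredient/name = 'Tequila')">
        A virgin beverage must not list Tequila as an ingredient.
      </assert>
    </rule>
  </pattern>
</schema>
```

The grammar (DTD, W3C XML Schema, or RELAX NG) still checks the structure; the Schematron rule checks a relationship between an attribute value and content elsewhere in the document.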

Ambiguity

If you know where you are in a grammar, can you tell what must come next?

Consider the case where you want to allow optional documentation elements to come either before or after an optional product element. One compact notation for this content model would be: documentation*, product?, documentation*

  • This is ambiguous

  • If you see a documentation element, you can't tell whether it matches the first documentation* or the second without looking ahead for a product element.

Ambiguity (continued)

  • Sometimes it's possible to restate ambiguous content models unambiguously: (a,b)|(a,c) is the same as a,(b|c).

  • Sometimes doing so really complicates the content model

  • Sometimes it can't be done without relaxing or otherwise changing the constraints
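
The documentation and product example can be sketched in RELAX NG, which accepts the ambiguous form as written. The wrapper element name component and the pattern names are hypothetical; the second definition is one possible unambiguous restatement, documentation*, (product, documentation*)?, which accepts the same sequences.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0">
  <start>
    <element name="component">
      <ref name="ambiguous"/>
    </element>
  </start>

  <!-- documentation*, product?, documentation* -->
  <define name="ambiguous">
    <zeroOrMore><ref name="documentation"/></zeroOrMore>
    <optional><ref name="product"/></optional>
    <zeroOrMore><ref name="documentation"/></zeroOrMore>
  </define>

  <!-- documentation*, (product, documentation*)? -->
  <define name="unambiguous">
    <zeroOrMore><ref name="documentation"/></zeroOrMore>
    <optional>
      <ref name="product"/>
      <zeroOrMore><ref name="documentation"/></zeroOrMore>
    </optional>
  </define>

  <define name="documentation">
    <element name="documentation"><text/></element>
  </define>

  <define name="product">
    <element name="product"><text/></element>
  </define>
</grammar>
```

In the second form, every documentation element before product matches the first repetition and every one after it matches the second, so no lookahead is needed.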

Co-constraints

  • This attribute or this content model.

  • If this attribute, then also this attribute.

  • If this attribute value, then this content model.

  • If this attribute value, then also this attribute.
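
A sketch of the "if this attribute value, then this content model" case in RELAX NG, using the quantity element from the recipe markup. The unit value dash and the rule that dashes are counted in whole numbers are assumptions made for illustration.

```xml
<element name="quantity"
         xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <choice>
    <!-- units="oz": the content may be any decimal number. -->
    <group>
      <attribute name="units"><value>oz</value></attribute>
      <data type="decimal"/>
    </group>
    <!-- units="dash": the content must be a positive whole number. -->
    <group>
      <attribute name="units"><value>dash</value></attribute>
      <data type="positiveInteger"/>
    </group>
  </choice>
</element>
```

Under this pattern, units="oz" with 1.5 is valid, while units="dash" with 1.5 is not.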

Define your scope

  • Why are you doing this?

  • What are your goals?

  • What documents are you interested in?

Why are you investing in markup?

  • Better validation?

  • Greater productivity?

  • Multiple presentation formats?

  • Improved searching?

  • Personalization?

What documents are part of your project?

  • Technical documentation

  • Reference materials

  • Correspondence

  • Purchase orders/business documents

Design background

  • Identify potential needs; define them thoroughly

  • Classify them into categories

  • Validate your needs against similar data

    • Don't worry about the angle brackets now

Document analysis

  • Identify the basic structures you need to encode

    • Legacy documents

      • Books/articles/whitepapers

      • Text/tables/lists/graphics/video/equations/etc.

      • Multiple languages

      • Character sets

  • Classify the structures into logical groups

  • Validate the analysis

Recognizing what to model

  • Structural components

  • Content components

  • Presentational components

  • Metadata components

Structural components

  • Books

  • Chapters

  • Tables/figures/examples

  • Lists/list items

  • Paragraphs

Content components

  • Part numbers, measurements

  • Quantities, prices

  • Postal addresses, phone numbers

  • Commands and functions

  • Descriptions

Presentational components

  • Special formatting (emphasis or verbatim)

  • Required or forbidden line/paragraph/page breaks

  • Indented regions

  • Boxes, borders, and shading

Metadata components

  • Necessary metadata

  • Cross references and other links

  • Co-occurrence constraints

Maximize value

  • Use semantic indicators (discard formatting)

  • Avoid duplication of data (headers/footers, ToCs, release dates)

  • Identify content already maintained somewhere else

  • Identify labeled containers

  • Look for wrappers

  • Look for block vs. inline containers

Schema design

  • Keep processing expectations in mind

  • Select the structures that the schema should address

  • Build the models

  • Populate the locations where authors have choices (what's allowed where?)

    • Word sized

    • Phrase sized

    • Paragraph sized

    • Container sized

Schema design (continued)

  • Be generous, don't exclude similar elements without good cause

  • Make connections within the model

    • Document assembly instructions

    • Implicit cross references

    • Explicit cross references

  • Make connections to the outside world

  • Validate the model

Top-down or bottom-up?

  • Overall document hierarchy

  • Mid-level elements

  • Low-level elements

(Probably a little of both.)

Validation

  • Balance costs and benefits

  • Be realistic

  • Enforce restrictions

    • Don't encourage “tag abuse” (we'll come back to that)

  • Understand the benefits

    • Make sure your authors do too

Avoid tag abuse

One of the hardest aspects of markup design is choosing markup that's “rich enough” without encouraging authors to resort to tag abuse:

  • Choosing markup for its formatting effect

  • Inconsistent markup because only formatting is considered important

  • Using the wrong markup (a paragraph that begins “Note:” instead of a note element)

  • Using elements without really understanding what they mean (copyright vs. trademark)
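
The difference between tag abuse and the intended markup can be sketched in DocBook. The sentence about salting the rim is invented for this example.

```xml
<section xmlns="http://docbook.org/ns/docbook" version="5.0">
  <title>Preparation</title>

  <!-- Tag abuse: a paragraph dressed up to look like a note. -->
  <para><emphasis role="bold">Note:</emphasis> Salt only the outer rim
  of the glass.</para>

  <!-- Declarative alternative: the element says what the content is. -->
  <note>
    <para>Salt only the outer rim of the glass.</para>
  </note>
</section>
```

Both may look the same in print, but only the second can be found, restyled, or extracted as a note.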

Complete the implementation

  • Write your schema (harness the power of XML elements, attributes, etc.)

    • Angle brackets at last!

  • Consider using a set of related schemas

  • Modularity

  • Extensibility

  • Planning for change

XML markup

  • Elements

  • Attributes

  • Processing instructions

  • Comments

  • Text

  • Identifiers

Elements

  • Element content

  • Simple content (maybe typed values, depending on schema language)

  • Mixed content
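
The three kinds of content can be seen in a fragment of the recipe markup. The emphasis element is hypothetical; it is not part of the earlier example and is added here only to show mixed content.

```xml
<beverage virgin="false">
  <ingredientList>
    <!-- Element content: ingredient contains only child elements. -->
    <ingredient>
      <!-- Simple content: text only, which some schema languages
           can type (for example, as a decimal number). -->
      <quantity units="oz">1.5</quantity>
      <name>Tequila</name>
    </ingredient>
  </ingredientList>
  <preparation>
    <!-- Mixed content: text interleaved with child elements. -->
    <p>Shake ingredients with <emphasis>ice</emphasis> and strain
    into glass.</p>
  </preparation>
</beverage>
```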

Attributes

  • Delimited list of values

  • Simple types (selection depends on schema language)

  • Common attributes

    • xml:id

    • xml:base

Processing instructions

  • Allowed anywhere

  • Not checked by grammar-based validation

  • Very flat internal structure:

    <?pitarget some content goes here?>
    
  • Often presented like pseudo-elements:

    <?pitarget this="that" that="the other"?>
    

    But that's just a convention; there's usually no validation.

  • Use sparingly

Comments

  • Usually ignored

  • Often used to “comment out” blocks of text

    • They don't nest!

  • Avoid structured comments

    • Use processing instructions instead

Identifiers

  • Globally unique

  • Locally unique

Globally unique identifiers

What does global mean?

  • In markup, it often means “document wide”. (But consider your assembly technologies.)

  • XML provides an attribute type, “ID”, for this purpose. W3C XML Schema and RELAX NG use that type too.

  • The XML attribute type “IDREF” is for pointing to things with IDs.

  • Spell the name of your ID attributes this way: “xml:id”.

  • It's easiest to point to things that have IDs. Think carefully about how you want to manage them.

  • The easy answer: allow them optionally everywhere.
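
One way to allow identifiers optionally everywhere is to define them once and reference that definition from every element. The RELAX NG sketch below assumes hypothetical recipe, title, and seealso elements and a hypothetical pattern name common.attributes; whether ID uniqueness and IDREF targets are actually checked depends on the validator.

```xml
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start><ref name="recipe"/></start>

  <!-- Defined once, referenced from every element. -->
  <define name="common.attributes">
    <optional>
      <attribute name="xml:id"><data type="ID"/></attribute>
    </optional>
  </define>

  <define name="recipe">
    <element name="recipe">
      <ref name="common.attributes"/>
      <element name="title">
        <ref name="common.attributes"/>
        <text/>
      </element>
      <zeroOrMore>
        <!-- linkend points at the xml:id of some other element. -->
        <element name="seealso">
          <ref name="common.attributes"/>
          <attribute name="linkend"><data type="IDREF"/></attribute>
          <empty/>
        </element>
      </zeroOrMore>
    </element>
  </define>
</grammar>
```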

Locally unique identifiers

  • Unique within a particular context: for example, all recipe titles must be unique within a single cookbook

  • Some schema languages support this better than others
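
W3C XML Schema is one of the languages that supports this directly, through scoped identity constraints. The sketch below assumes hypothetical cookbook, recipe, and title elements and requires each recipe title to be unique within its cookbook.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="cookbook">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="recipe" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="title" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <!-- Scope: one cookbook. Within it, no two recipes may have
         the same title. -->
    <xs:unique name="uniqueRecipeTitle">
      <xs:selector xpath="recipe"/>
      <xs:field xpath="title"/>
    </xs:unique>
  </xs:element>
</xs:schema>
```

Two cookbooks in the same collection can each contain a Margarita; one cookbook cannot contain two.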

Using multiple schemas

  • Interchange

  • Reference

  • Authoring

  • Conversion

  • Presentation

Designing for modularity

  • Entities

  • XInclude

  • Macro processing

Entities

  • Can be used for both simple content and structural content

  • External parsed entities can have multiple root elements (but that makes them hard to validate independently)

  • They're supported by all XML parsers

  • ID/IDREF checking works as expected

  • They're an artifact of DTDs and use DTD syntax:

    <!DOCTYPE book [
    <!ENTITY chap01 SYSTEM "chapter-01.xml">
    ]>
    <book xmlns="...">
      ...
      &chap01;
      ...
    </book>
    

XInclude

  • Transcludes one document into another

  • Transcluded document must be a well-formed document

  • No ID/IDREF checking across document boundaries

  • Requires namespace support; is widely supported.
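
For comparison with the entity example, here is a sketch of the same chapter pulled in with XInclude. The DocBook namespace and the file name chapter-01.xml are assumptions carried over for illustration.

```xml
<book xmlns="http://docbook.org/ns/docbook"
      xmlns:xi="http://www.w3.org/2001/XInclude"
      version="5.0">
  <title>Cocktails</title>
  <!-- chapter-01.xml must be a well-formed document on its own. -->
  <xi:include href="chapter-01.xml"/>
</book>
```

No DOCTYPE declaration is needed, but the document must be processed by an XInclude-aware tool before the chapter appears in place.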

“Macro processing”

  • You can “roll your own” here, too

  • Tools like XSLT, especially XSLT 2.0, provide all the power necessary to define your own macro processing.

  • Such solutions are usually much like XInclude: they require well-formed documents for transclusion and provide no automatic ID/IDREF checking across document boundaries.

  • (Use a standard vs. roll your own again)
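
A minimal sketch of roll-your-own transclusion in XSLT 1.0. The m:include element and its namespace URI are hypothetical names invented for this example; everything else is copied through unchanged.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:m="http://example.com/ns/macro"
                exclude-result-prefixes="m">

  <!-- Identity template: copy anything not matched elsewhere. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Replace <m:include href="..."/> with the root element of the
       referenced document, expanding any nested includes as well. -->
  <xsl:template match="m:include">
    <xsl:apply-templates select="document(@href)/*"/>
  </xsl:template>

</xsl:stylesheet>
```

As with XInclude, the referenced file has to be well-formed, and nothing here checks ID/IDREF consistency across the assembled result.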

Things to watch for

  • Suspiciously similar elements (div1, div2, div3 vs divtitle1, divtitle2, divtitle3)

  • Schizophrenic elements (a single list element that can contain either items or messages)

  • Recursive elements

  • Elements that reinvent other markup (entities, XInclude)

  • Baroque content models

Things to watch for (continued)

  • Ambiguous content models

  • Elements that can mistakenly be empty

  • Limited occurrences

  • Problematic mixed content

Modeling considerations

  • One element in different contexts

  • One element with attribute values

  • Containers versus flat structures

  • Documents as data

  • Generated text

  • Reuse (reader context)

Document conversion

  • How many legacy formats?

    • Legacy paper?

  • Consistency of original information

  • Automated conversion or rekeying?

  • Best case scenario

  • Worst case scenario

Best case scenario

  • The legacy documents have a lot of implicit or explicit structure

  • The legacy format exposes those structures

  • Those structures are used with complete regularity

In this case, programmatic conversion will be very valuable.

Worst case scenario

  • The legacy documents have no useful implicit or explicit structure

  • The explicit structures that they do have (e.g., “styles”) are used in wildly inconsistent ways

  • There's no way to extract any useful structure from the legacy format

In this case, programmatic conversion won't help much.

Delivery

  • Output to paper

  • Output to the web

  • Specialty requirements

    • Government/industry standards

So you still want to design your own markup language?

  • No, you don't, really.

  • Atom, DocBook, HTML, NLM, TEI, etc.

    • Subsets

    • Supersets

  • Well, if you must...

How to fail

  • Skip the markup design phase

  • Don't consider the markup design as important as the other tools

  • Don't plan for the future

  • Develop it in isolation from the rest of the production system

  • Don't understand your goals

  • Don't keep good records

How to succeed

  • Accept change

  • Accept that no model is perfect

  • Analyze first, model later

    • No angle brackets until the analysis is done

    • Everyone has to understand the analysis and design

How to succeed (continued)

  • Record everything

    • Write down the rationale for all decisions

How to succeed (continued)

  • Choose names carefully

  • Be systematic

  • Set limits (avoid too much markup)

  • Establish useful markup (vs. absolutely correct markup)

Example: Absolutely correct markup

A DocBook msgset describes a set of possibly related messages.

<msgset>
  <title>Some messages</title>
  <msgentry>
    <msg>
      <msgmain>
        <msgtext>Some message</msgtext>
      </msgmain>
      <msgrel>
        <msgtext>Some related message</msgtext>
      </msgrel>
      <msgsub>
        <msgtext>Some sub-message</msgtext>
      </msgsub>
    </msg>
    <msginfo>
      <msgaud>Network administrators</msgaud>
      <msglevel>Error</msglevel>
      <msgorig>NIC hardware</msgorig>
    </msginfo>
    <msgexplan>
      <para>…</para>
    </msgexplan>
  </msgentry>
  <!-- ... -->
</msgset>

Example: Useful markup

In practice, that much detail is overwhelming to authors.

<msgset>
  <title>Some messages</title>
  <simplemsgentry msgaud="netadmin" msglevel="error"
                  msgorig="NIChardware">
    <msgtext>Some message</msgtext>
    <msgexplan>
      <para>…</para>
    </msgexplan>
  </simplemsgentry>
  <!-- ... -->
</msgset>

How to succeed (continued)

  • Abstract away from presentation

  • Go beyond the legacy

    • What else would be useful in the future?

    • Leverage domain experts as much as possible

  • Iterate over the design

Be adaptable

  • Define a reporting procedure

  • Encourage feedback

  • Define a stable and sane update policy

  • Take maintenance seriously

  • Be responsive

Q&A