XProc: An XML Pipeline Language

XProc Development
Working Group Goals
What's New?
Use Cases for XProc
Common features
Hasn't this been done already?
Design Goals

W3C XML Processing Model Working Group started in late 2005.
Many familiar names: Erik Bruchez, Andrew Fang, Paul Grosso, Rui Lopes, Murray Maloney, Alex Milowski, Michael Sperberg-McQueen, Jeni Tennison, Henry Thompson, Richard Tobin, Alessandro Vernet, Norman Walsh (Chair), Mohamed Zergaoui

According to its charter, the goals of the XML Processing Model Working Group are to develop two Recommendation Track documents:

An XML Processing Language (XProc)
An XML Processing Model which defines or describes the default processing for an XML document.

I thought we'd finish before our charter expired on 31 Oct 2007 :-)
But we didn't :-(
But we did get to Last Call :-)
But it didn't stick :-(
Latest draft published on 29 November 2007 :-)
I think we'll finish this winter/spring :-)

A defaulting story for syntactic simplicity
A revised mechanism for dealing with parameters
A mechanism for dealing with complex namespace issues
Support for XPath 1.0 and XPath 2.0
A revised approach to XSLT
A few new steps

From XML Processing Model Requirements and Use Cases:

Apply a Sequence of Operations
XInclude Processing
Parse/Validate/Transform
Document Aggregation
Single-file Command-line Document Processing
Multiple-file Command-line Document Generation
Extracting MathML
Style an XML Document in a Browser
Run a Custom Program
XInclude and Sign
…

Start with a document or documents
Apply one or more processes, perhaps conditionally, perhaps iteratively
Catch and recover from errors, if they occur
Produce a document or documents

Well, yes: Apache Ant, Cocoon Sitemaps, GNU JAXP Library: Package gnu.xml.pipeline, Jelly : Executable XML, MT Pipeline Overview, NetKernel - Service Oriented MicroKernel and XML Application Server, Oracle XML Developer's Kit Home, Re-Interpreting the XML Pipeline Note: Adding Streaming and On-Demand Invocation, Schemachine (a pipelined Xml validation framework), ServingXML, smallx: Project Home Page, Strawman: bringing the framework within the schemas, SXPipe: Simple XML Pipelines, Xerces Native Interface, XML-ECHO, XML Pipeline Definition Language Version 1.0, XML Pipeline Language (XPL) Version 1.0 (Draft), XPipe

Standardization, not design by committee
Able to support a wide variety of steps
Prepared quickly
Few optional features
Relatively declarative
Amenable to streaming
“The simplest thing that will get the job done.”

Pipeline Concepts
Atomic Pipeline Steps
The Pipeline Analogy
What flows through pipes?
Steps can be grouped into pipelines
Compound Pipeline Steps
Pipelines are Steps

Pipelines are composed of steps; steps perform specific processes
Steps are connected together so the output of one step can be consumed by another
Steps may have options and parameters
XPath expressions are used to compute option and parameter values, identify documents or portions of documents to process, and to select what steps are performed.

Most steps are atomic, black boxes that perform a task:

Document¹ → XInclude → Document²
Load → Document
Document¹, Stylesheet → XSLT 1.0 → Document²
Documents^i…k, Stylesheet → XSLT 2.0 → Documents^m…n
Document → Render-to-PDF
Document¹, Document² → Compare → Document³

XML¹ → XInclude → XML², XSL → XSLT → XML³

If you think of the steps as physical boxes and the connections between them as actual pipes, you can imagine water flowing through the boxes:

The “water” of our pipelines are XML documents.
In the specific case of XProc, we mean documents conceptually:
- Not SAX events or StAX streams
- Not DOM elements or XOM nodes
- Not XDMs or PSVIs
(or rather, any of those.)

Consider the XInclude+XSLT steps from the earlier slide:

→ XInclude → XML², XSL → XSLT →

You can construct a pipeline that performs these steps:

→ XInclude → XML², XSL → XSLT →

Some steps contain other steps, the task they perform is at least partly determined by the steps they contain.
These steps provide the basic control structures of XProc:
- Grouping
- Conditional evaluation
- Exception handling
- Iteration
- Selective processing
- Pipelines

Pipelines can be called as atomic steps:

Document¹ → XInclude+XSLT → Document²

A Little Terminology
A Little Syntax
Declaration for p:xinclude
A two step pipeline fragment
Some observations
A two step pipeline fragment, simplified
Anatomy of an atomic step
Anatomy of an atomic step (2)
Anatomy of an atomic step (3)
Anatomy of an atomic step (4)
Inputs
Anatomy of an atomic step (5)
Options
Anatomy of an atomic step (6)
Parameters
Compound steps
Anatomy of a compound step
Anatomy of a compound step (2)
Outputs
Compound step, simplified
Pipelines
A sample pipeline
Language Constructs
Conditional processing
p:choose
What version of XPath is that, anyway?
Iteration
p:for-each
Selective processing
p:viewport
Exception handling
p:try
Building libraries
p:pipeline-library
Standard step library
What version of XSLT is that, anyway?
Namespaces, why did it have to be namespaces?
Namespace pain
Namespace pain (2)
Namespace pain (3)
Namespace “relief”
Implementations
Q&A

Steps have a type and a name (for example, an “XSLT” step named “db2html”)
Steps have named ports
The names of the ports on any given step are a fixed part of its signature (the XSLT step has two input ports, “source” and “stylesheet” and two output ports “result” and “secondary”)
Inputs and outputs bind document streams to ports (the “stylesheet” input port of the “db2html” step is bound to a particular document).

In a pipeline document, the step element identifies the type of the step. The name attribute provides its name (<p:xslt name="db2html">)
Inputs and output bindings are associated with a particular port with the p:input and p:output elements
- The p:pipe element provides a binding to another step
- The p:document element provides a binding to a particular URI
The p:declare-step element defines the signature for an atomic step.

  1 <p:declare-step type="p:xinclude">
  2    <p:input port="source"/>
       <p:output port="result"/>
  4    <p:option name="fixup-xml-base" value="false"/>
       <p:option name="fixup-xml-lang" value="false"/>
  6 </p:declare-step>

The XProc specification includes the declarations for all the standard components. Implementors can provide additional steps and may provide facilities that allow users to write their own.

  1 <p:xinclude name="expand">
  2   <p:input port="source">…</p:input>
    </p:xinclude>
  4 
    <p:xslt name="db2html">
  6   <p:input port="source">
        <p:pipe step="expand" port="result"/>
  8   </p:input>
      <p:input port="stylesheet">
 10     <p:document href="docbook.xsl"/>
      </p:input>
 12 </p:xslt>

Many pipelines are linear or mostly linear.
Many steps have a pretty obvious “primary” input and “primary” output
Taken together, these to observations allow us to introduce a simple syntactic shortcut: in two adjacent steps, in the absence of an explicit binding, the primary output of the first step is automatically connected to any unbound input(s) of the second.

(In fact, the specification goes a little further, providing the concept of a default readable port.)

  1 <p:xinclude name="expand"/>
  2 
    <p:xslt name="db2html">
  4   <p:input port="stylesheet">
        <p:document href="docbook.xsl"/>
  6   </p:input>
    </p:xslt>

  1 <p:xslt name="db2html" version="2.0">
  2   <p:input port="source">
        <p:pipe step="expand" port="result"/>
  4   </p:input>
      <p:input port="stylesheet">
  6     <p:document href="docbook.xsl"/>
      </p:input>
  8   <p:option name="initial-mode" select="$imode"/>
      <p:parameter name="home"
 10                value="http://example.com/"/>
    </p:xslt>

  1 <p:xslt name="db2html" version="2.0">
  2   <p:input port="source">
        <p:pipe step="expand" port="result"/>
  4   </p:input>
      <p:input port="stylesheet">
  6     <p:document href="docbook.xsl"/>
      </p:input>
  8   <p:option name="initial-mode" select="$imode"/>
      <p:parameter name="home"
 10                value="http://example.com/"/>
    </p:xslt>

  1 <p:xslt name="db2html" version="2.0">
  2   <p:input port="source">
        <p:pipe step="expand" port="result"/>
  4   </p:input>
      <p:input port="stylesheet">
  6     <p:document href="docbook.xsl"/>
      </p:input>
  8   <p:option name="initial-mode" select="$imode"/>
      <p:parameter name="home"
 10                value="http://example.com/"/>
    </p:xslt>

  1 <p:xslt name="db2html" version="2.0">
  2   <p:input port="source">
        <p:pipe step="expand" port="result"/>
  4   </p:input>
      <p:input port="stylesheet">
  6     <p:document href="docbook.xsl"/>
      </p:input>
  8   <p:option name="initial-mode" select="$imode"/>
      <p:parameter name="home"
 10                value="http://example.com/"/>
    </p:xslt>

A p:input identifies input to a port; its subelements identify a document or sequence of documents:

<p:document href="uri"/> reads input from a URI.
<p:inline>...</p:inline> provides the input as literal content in the pipeline document.
<p:pipe step="stepName" port="portName"/> reads from a readable port on some other step.
<p:empty/> is an empty sequence of documents.
Unbound inputs automatically bind to the default readable port.

  1 <p:xslt name="db2html" version="2.0">
  2   <p:input port="source">
        <p:pipe step="expand" port="result"/>
  4   </p:input>
      <p:input port="stylesheet">
  6     <p:document href="docbook.xsl"/>
      </p:input>
  8   <p:option name="initial-mode" select="$imode"/>
      <p:parameter name="home"
 10                value="http://example.com/"/>
    </p:xslt>

Semantically, the version attribute is indistinguishable from:

  <p:option name="version" value="2.0"/>

Steps only accept the options that they declare.
Steps can compute option values using XPath.
Options are strings.

  1 <p:xslt name="db2html" version="2.0">
  2   <p:input port="source">
        <p:pipe step="expand" port="result"/>
  4   </p:input>
      <p:input port="stylesheet">
  6     <p:document href="docbook.xsl"/>
      </p:input>
  8   <p:option name="initial-mode" select="$imode"/>
      <p:parameter name="home"
 10                value="http://example.com/"/>
    </p:xslt>

Steps only accept parameters if they are declare a parameter input port.
Any number of parameters can be passed to steps.
Parameter names are not known in advance.
Pipelines can manipulate sets of parameters for different steps.
There's a special defaulting rule for the common case that parameters passed to the p:pipeline should automatically be passed to the steps that the pipeline contains.

Compound steps contain other steps (subpipelines)
Compound steps don't have separate declarations; the number of inputs, outputs, options, and parameters that they accept can vary on each instance.
There's no mechanism in XProc V1.0 for user-defined compound steps.

  1 <p:group name="xincxform">
  2   <p:output port="result">
        <p:pipe step="db2html" port="result"/>
  4   </p:output>
    
  6   <p:xinclude/>
      <p:xslt name="db2html">
  8     <p:input port="stylesheet">
          <p:document href="docbook.xsl"/>
 10     </p:input>
      </p:xslt>
 12 </p:group>

  1 <p:group name="xincxform">
  2   <p:output port="result">
        <p:pipe step="db2html" port="result"/>
  4   </p:output>
    
  6   <p:xinclude/>
      <p:xslt name="db2html">
  8     <p:input port="stylesheet">
          <p:document href="docbook.xsl"/>
 10     </p:input>
      </p:xslt>
 12 </p:group>

A p:output binds the output port of a compound step only. Outputs of atomic steps are never bound.

<p:document href="uri"/>, <p:inline>...</p:inline>, and <p:empty/> provide the specified content as the output on the port.
<p:pipe step="stepName" port="portName"/> provides the specified output as output on the port. It must be bound to a step in the subpipeline.
Unbound outputs automatically bind to the primary output port of the last step (in document order) in the contained pipeline.

  1 <p:group>
  2   <p:output port="result"/>
    
  4   <p:xinclude/>
    
  6   <p:xslt>
        <p:input port="stylesheet">
  8       <p:document href="docbook.xsl"/>
        </p:input>
 10   </p:xslt>
    </p:group>

A p:pipeline is a compound step. It has several special properties:

It's the thing that the pipeline processor actually runs. Processing always begins with a pipeline.
A pipeline can import other pipelines or pipeline libraries.
A pipeline can call an imported pipeline (or itself) as an atomic step.

  1 <p:pipeline name="main">
  2   <p:input port="document" primary="true"/>
      <p:input port="stylesheet"/>
  4   <p:output port="result"/>
    
  6   <p:xinclude/>
    
  8   <p:xslt version="2.0">
        <p:input port="stylesheet">
 10       <p:pipe step="main" port="stylesheet"/>
        </p:input>
 12   </p:xslt>
    </p:pipeline>

Conditional processing: p:choose
Iteration: p:for-each
Selective processing: p:viewport
Exception handling: p:try/p:catch
Building libraries: p:pipeline-library

Choose one of a set of subpipelines based on runtime evaluation of an XPath expression
The XPath context can be any document, even documents generated by preceding steps.
Constraint: all of the subpipelines must have the same number of inputs and outputs, with the same names. This makes the actual subpipeline selected at runtime irrelevant for static analysis.

  1 <p:choose>
  2   <p:when test="/root[@version=2]">
        <p:output port="result"/>
  4     ...
      </p:when>
  6   <p:when test="/root">
        <p:output port="result"/>
  8     ...
      </p:when>
 10   <p:otherwise>
        <p:output port="result"/>
 12     ...
      </p:otherwise>
 14 </p:choose>

(Or, the reason our last call didn't stick...)

Pipeline authors can specify either XPath 1.0 or XPath 2.0
Implementors can implement either XPath 1.0 or XPath 2.0
If XPath 1.0 is requested and the processor uses XPath 2.0, it must use XPath 1.0 compatibility mode
If XPath 2.0 is requested and the processor uses XPath 1.0, it must only evaluate expressions that it knows would give the same result.

Apply the same subpipeline to a sequence of documents
A sequence can be constructed in several ways:
- As the result of a previous step
- By selecting nodes from a document or set of documents
- Literally in the p:input binding
XProc doesn't support counted iteration (do this three times) or iteration to a fixed point (do this until some condition is true)

  1 <p:for-each>
  2   <p:iteration-source select="//chapter"/>
      <p:output port="result"/>
  4 
      <p:xslt>
  6     <p:input port="stylesheet">
          <p:document href="docbook.xsl"/>
  8     </p:input>
      </p:step>
 10 </p:for-each>

Apply a subpipeline to some subtree in a document
In other words, process just a “data island” in a document, without changing the surrounding context

  1 <p:viewport match="h:div[@class='chapter']">
  2   <p:output port="result"/>
    
  4   <p:insert position="first-child">
        <p:input port="insertion">
  6       <p:inline>
    	<hr xmlns="http://www.w3.org/1999/xhtml"/>
  8       </p:inline>
        </p:input>
 10   </p:insert>
    </p:viewport>

Try to run the specified subpipeline
If something goes wrong, catch the error and try the recovery subpipeline
All the output from the initial pipeline must be discarded
If something goes wrong in the catch, the try fails
Constraint: both the subpipelines must have the same number of inputs and outputs, with the same names. This makes the actual subpipeline that produces the output irrelevant for static analysis.

  1 <p:try name="tryit">
  2   <p:group>
        <p:output port="out"/>
  4     ...
      </p:group>
  6   <p:catch>
        <p:output port="out"/>
  8     ...
          <p:input ...>
 10         <p:pipe step="tryit" port="errors"/>
          </p:input>
 12     ...
      </p:catch>
 14 </p:try>

Useful pipelines can be stored in a library.
Libraries can be imported to provide that functionality in a new pipeline.

  1 <p:pipeline-library namespace="...">
  2 
      <p:import href="os-library.xml"/>
  4 
      <p:pipeline name="xinclude-and-style">...
  6 
      <p:pipeline name="xinclude-and-db2html">...
  8 
      <p:pipeline name="get-tide-information">...
 10 
    </p:pipeline-library>

30 required steps
- add-attribute, add-xml-base, compare, count, delete, directory-list, error, escape-markup, http-request, identity, insert, label-elements, load, make-absolute-uris, namespace-rename, pack, parameters, rename, replace, set-attributes, sink, split-sequence, store, unescape-markup, string-replace, unwrap, wrap, wrap-sequence, xinclude, xslt
10 optional steps
- exec, hash, uuid, validate-with-relax-ng, validate-with-schematron, validate-with-xml-schema, www-form-urldecode, www-form-urlencode, xquery, xsl-formatter

(Or, the other reason our last call didn't stick...)

We used to have two steps, a required XSLT 1.0 step and an optional XSLT 2.0 step
Now we have a single XSLT step with a version attribute
Authors can request the version they want, implementors can provide the version they want, with rules for what to do when they don't line up.
Because automic steps have a fixed signature, there are bits of the XSLT signature that don't make a lot of sense in the XSLT 1.0 case, but it's mostly ok.

QNames in XPath expressions are just strings.
The binding of prefixes to namespace names depends on context.
Options, containing XPath expressions that contain QNames, can be passed from one context to another.
Doctor, doctor! It hurts when I do that!

(Deep breath everyone, this is not pretty.)

Imagine that I have a configuration file that contains XPath expressions:

  1 <my:config xmlns:my="http://www.example.com/ns/my"
  2            xmlns:c="http://www.example.com/ns/contacts">
      <my:filter test="c:person/c:name='Jones'" />
  4 </my:config>

(Credit to Jeni Tennison for this example.)

Now imagine that I want to use elements from that context in an option to a step:

  1 <p:matching-documents>
  2   <p:option name="test"
        select="concat('/h:html/h:head/rdf:RDF[',
  4                    /my:config/my:filter/@test, ']')">
        <p:pipe step="top" source="config" />
  6   </p:option>
    </p:matching-documents>

That's equivalent to passing the following string as the test option:

"/h:html/h:head/rdf:RDF[c:person/c:name='Jones']"

But what if the namespace binding for “c:” isn't in this pipeline; or worse, what if it has a different binding? (Remember, the string came from a different context.)

And for the worst case of all, suppose this step is in a pipeline that you imported from a library and can't change?

We can fix this with the p:namespaces element. Here's that step again:

  1 <p:matching-documents>
  2   <p:option name="test"
        select="concat('/h:html/h:head/rdf:RDF[',
  4                    /my:config/my:filter/@test, ']')">
        <p:pipe step="top" source="config" />
  6     <p:namespaces element="/my:config/my:filter"/>
      </p:option>
  8 </p:matching-documents>

Now the bindings for all of the prefixes used in @test will be explicitly provided.

XProc implementations are starting to surface:

My own, XML Pipeline Processor, https://xproc.dev.java.net
yax, http://yax.sourceforge.net/
xprocxq, http://code.google.com/p/xprocxq/
XSP.NET, http://xsp-net.sourceforge.net/
Cocoatron, http://cocoatron.com/

Plus several others not yet publicly announced, to the best of my knowledge.

We're making progress
Second, and hopefully final, last call after the holidays
Recommendation this spring?
Your feedback is solicited!

XProc: An XML Pipeline Language

Norman Walsh

Sun Microsystems, Inc.

Contents

Background

XProc Development

Working Group Goals

XProc Status

What's New?

Use Cases for XProc

Common features

Hasn't this been done already?

Design Goals

Pipeline Concepts

Pipeline Concepts

Atomic Pipeline Steps

The Pipeline Analogy

What flows through pipes?

Steps can be grouped into pipelines

Compound Pipeline Steps

Pipelines are Steps

XProc Pipelines

A Little Terminology

A Little Syntax

Declaration for p:xinclude

A two step pipeline fragment

Some observations

A two step pipeline fragment, simplified

Anatomy of an atomic step

Anatomy of an atomic step (2)

Anatomy of an atomic step (3)

Anatomy of an atomic step (4)

Inputs

Anatomy of an atomic step (5)

Options

Anatomy of an atomic step (6)

Parameters

Compound steps

Anatomy of a compound step

Anatomy of a compound step (2)

Outputs

Compound step, simplified

Pipelines

A sample pipeline

Language Constructs

Conditional processing

p:choose

What version of XPath is that, anyway?

Iteration

p:for-each

Selective processing

p:viewport

Exception handling

p:try

Building libraries

p:pipeline-library

Standard step library

What version of XSLT is that, anyway?

Namespaces, why did it have to be namespaces?

Namespace pain

Namespace pain (2)

Namespace pain (3)

Namespace “relief”

Implementations

Q&A

`p:choose`

`p:for-each`

`p:viewport`

`p:try`

`p:pipeline-library`