XDM Text Serialization

NOTE: This page is a work in progress. It is not complete and is currently available for collaboration purposes only.

A proposal for text serialization of XDM for purposes of data interchange and interoperability.

Abstract

The XQuery 1.0 and XPath 2.0 Data Model (XDM) references a Serialization format for XDM. This format was intended to serialize either full XML documents or parsed external entities. Given arbitrary XDM, the serialized result cannot be reconstructed into the original XDM. For data interchange in particular, the losses are severe: sequences are "normalized" into a single rooted XML tree, adjacent atomic values are concatenated, and type information is discarded.

This proposal defines a common format for serializing XDM data for data interchange and interoperability while preserving more of the XDM information.

Intent

The intent of this specification is to provide a common format for interchange of data between XML tools which produce and consume XDM data. Examples of such tools include XPath 2.0, XSLT and XQuery, as well as other tools such as XProc, xmlsh and countless custom-written programs using the XDM data model.

In current implementations there is no standard model for XDM data, either within the same environment and language or across languages and environments. For example, suppose an XQuery operation produces a sequence that should be passed as a parameter to an XSLT transformation. There is no standardized way to exchange that data. In practice, either the same vendor's tools must be used within the same language and process, or the results must be serialized in a proprietary format and reconstituted in the target using the same proprietary format. Even with a single vendor's implementations, interchange is not always easy because of differences in internal data formats or languages, or because data must cross process or machine boundaries.

The existing text serialization proposal for XDM is inadequate for data interchange:

Sequences are transformed into a single rooted XML tree.
Adjacent atomic values are converted to text and concatenated.
Type information of parentless atomic values is discarded.
Parentless attributes are not serialized.
The distinction between element and document is lost.
The empty sequence () and the empty string "" are serialized identically.

These and other limitations make the existing XDM serialization proposal unsuitable for data interchange.
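The loss of item boundaries and types can be illustrated with a minimal Python sketch. It models only the atomic-value part of sequence normalization and assumes, as the W3C serialization rules do, that adjacent values are joined with a single space. The function name normalize_atomics is hypothetical.

```python
def normalize_atomics(sequence):
    """Join the string forms of atomic values with single spaces.

    Assumption: this mimics only the atomic-value part of the sequence
    normalization performed by the existing XDM serialization.
    """
    return " ".join(str(item) for item in sequence)


three_items = ["1", "2", "3"]   # a sequence of three strings
one_item = ["1 2 3"]            # a sequence of one string
typed_items = [1, 2, 3]         # a sequence of three integers

# All three sequences produce the same text, so neither the item count
# nor the item types can be recovered from the serialized form.
print(normalize_atomics(three_items) == normalize_atomics(one_item))
print(normalize_atomics(three_items) == normalize_atomics(typed_items))
```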

This proposal provides for a standard serialization format so that XDM data can be interchanged across tools, vendors, languages, environments and machines while maintaining most of the original XDM information.

Definitions

For the purposes of this document the following definitions apply:

XDM XQuery 1.0 and XPath 2.0 Data Model
XDM Tool A program, module, function or API which can produce or consume XDM Data in some form.
XDM Consumer An XDM Tool which can consume XDM data (e.g. as context item, parameters, external variables).
XDM Producer An XDM Tool which can produce XDM data (e.g. as a query result, result document, return value).
Environment An instance of the runtime of a single language with native language types, typically a single process.
XDM data An instance of the XDM Data Model
XDM Stream A representation of a single instance of XDM data (a sequence of zero or more items) as a sequence of characters (Unicode codepoints).

Goals

This proposal expresses multiple goals, not all of which may be possible to achieve. The use cases describe concrete examples of many of the goals, while this summary provides the intent.

A common text representation of XDM data preserving as much of the XDM model as reasonable including
- Individuality of sequence items.
- Types of atomic items.
- Parentless attributes.
- Support for all seven types of XDM nodes (document, element, attribute, text, namespace, processing instruction, and comment).
A representation that can be easily implemented using existing vendors' XML technology.

Some purposes for which this format could be used include

Exchange of XDM data between XDM Tools in different environments
Exchange of XDM data between XDM Tools from different vendors in the same environment
Exchange of XDM data between XDM Tools from the same vendor in the same environment where it is difficult to preserve the vendor's native data structure
Exchange of XDM data between XDM Tools and tools which are not XDM capable, or with limited XDM capability.
Provide a human readable and editable format of XDM data
Output of XDM data from test cases using XDM Tools for the purposes of validation and comparison
A format for use in XML Pipeline Processors so that steps can be implemented by different vendors or in different languages.
Incremental appending of XDM data to a file or stream without having to go back and rewrite or remove end markers.
Incremental reading of XDM data. Each item in the sequence should be readable or skipped without consuming the entire stream.

XDM Information Preserved and Lost

Preserving all of the information in the XDM is very difficult, which is likely why a serialization model for XDM has not been specified. This proposal recognizes that not all XDM information is equally important. In the context of the Use Cases, and with the goal of reasonable implementation using existing vendor libraries, this proposal aims to preserve some XDM information at the expense of the rest.

XDM Information preserved

The XDM Model defines values as a sequence of zero or more items. Each item is one of the following types

Atomic Type
Node type
- document, element, attribute, text, namespace, processing instruction, comment.

Each type has a value. Atomic types have string values, and node types have XML values.

An XDM serialization format should preserve the following attributes

Sequences
- Sequences should not be normalized. Sequences should preserve the individuality, count and type of items. Adjacent atomic items should not be concatenated (normalized).
Atomic Types and values for parentless atomic types
- Atomic types are preserved with the expanded QName for the type and the string value.
Nodes
- Node values are preserved for each of the 7 XDM node types.

Serialized XDM will retain information about the descendants of nodes in the sequence being serialized, but it will not retain information about their ancestors.
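As a rough illustration of the information listed above, the following Python sketch models a sequence whose items keep their individuality, atomic type and node kind. The class and field names (NodeKind, AtomicValue, Node) are hypothetical and are not part of this proposal.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Union


class NodeKind(Enum):
    """The seven XDM node kinds."""
    DOCUMENT = "document"
    ELEMENT = "element"
    ATTRIBUTE = "attribute"
    TEXT = "text"
    NAMESPACE = "namespace"
    PROCESSING_INSTRUCTION = "processing-instruction"
    COMMENT = "comment"


@dataclass(frozen=True)
class AtomicValue:
    type_uri: str      # namespace URI of the type's expanded QName
    type_name: str     # local name of the type
    string_value: str  # string form of the value


@dataclass(frozen=True)
class Node:
    kind: NodeKind
    xml_value: str     # the node and its descendants; ancestors are not kept


Item = Union[AtomicValue, Node]

XS = "http://www.w3.org/2001/XMLSchema"

# Three distinct items: the integer 1, the string "1", and an element.
sequence: List[Item] = [
    AtomicValue(XS, "integer", "1"),
    AtomicValue(XS, "string", "1"),
    Node(NodeKind.ELEMENT, "<a><b/></a>"),
]
```

Because the type is stored with each atomic value, the integer 1 and the string "1" remain distinguishable, which the normalized XML form cannot express.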

XDM Information NOT preserved

Ancestor information. A serialized XDM node will NOT maintain information about its ancestors. For example, if a node $a is serialized then $a/.. is NOT maintained.
Serialized XDM will not retain information about node identity: that is, the recipient of the serialized XDM will not be able to determine whether two serialized elements originated from the same node or merely from two nodes that were deep-equal to each other.

Schema and Type information. XDM Serialization does NOT transfer type definitions. The consumer of the serialized XDM is assumed to have access to the same schema as the producer of the serialized XDM: that is, a QName identifying a type is assumed to have the same meaning to both the producer and consumer.
Type annotation is NOT preserved explicitly for child nodes. However, by associating a schema with elements or documents, type annotation can be reconstructed using a schema-aware parser.

Use Cases

Use cases are concrete examples that demonstrate the goals.

Use Case 1 Exchange of XDM data between XDM Tools in different environments
Use Case 2 Exchange of XDM data between XDM Tools from different vendors in the same environment
Use Case 3 Exchange of XDM data between XDM Tools from the same vendor in the same environment where it is difficult to preserve the vendor's native data structure
Use Case 4 Exchange of XDM data between XDM Tools and tools which are not XDM capable.
Use Case 5 Provide a human readable and editable format of XDM data
Use Case 6 Output of XDM data from test cases using XDM Tools for the purposes of validation and comparison
Use Case 7 A format for use in XML Pipeline Processors so that steps can be implemented by different vendors or in different languages.
Use Case 8 Creating and reading expanding files, such as logging data.

Serialization Format

The serialization format is described in terms of a stream of Unicode characters (not bytes). The conversion between bytes and characters is an encoding property. In-memory representations likely need no encoding. Files should be stored in UTF-8 with no leading byte order mark (BOM).
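A minimal Python sketch of the file encoding rule follows. The helper names write_stream and read_stream and the file name example.xdm are hypothetical. The sketch relies on the fact that Python's "utf-8" codec, unlike "utf-8-sig", does not write a byte order mark.

```python
import codecs
import os
import tempfile


def write_stream(path, text):
    # "utf-8" (unlike "utf-8-sig") never writes a byte order mark.
    # newline="" stops Python from translating "\n" on output.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)


def read_stream(path):
    # newline="" hands line endings to the caller unchanged.
    with open(path, "r", encoding="utf-8", newline="") as f:
        return f.read()


path = os.path.join(tempfile.mkdtemp(), "example.xdm")  # hypothetical file name
write_stream(path, "caf\u00e9\n")

with open(path, "rb") as f:
    raw = f.read()

# True when the file does not begin with the UTF-8 BOM bytes.
print(not raw.startswith(codecs.BOM_UTF8))
```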

Serialization Format Properties

The Goals and Use Cases provide the rationale from which the desired properties of a serialization format can be derived.

An in-memory and an on-"disk" (file) XDM Stream should be easily convertible to each other. Use Case 1, Use Case 4.
An XDM Stream must be able to represent all Unicode codepoints expressible in XDM (and by inference XML).
An XDM Stream containing one sequence of 1 item should be represented the same as an XDM Stream containing 1 item.
A text or binary concatenation of two XDM Streams should produce the same result as the XDM Stream for the concatenation of the 2 XDM sequences. Use Case 8.
Serialized XDM must be viewable in standard text file tools. Use Case 5.
Serialized XDM must survive changes to line ending character sequences, which may be introduced when copying the file or saving it from a text editor. Use Case 1, Use Case 5.
An XDM Stream must be able to be generated incrementally without writing an end marker (indicating the end of the entire stream), and read without having to wait for an EOF marker. Use Case 8.
An XDM Stream should be parseable such that each item can be efficiently recognized and extracted without having to parse the content of the item. Use Case 8.

Abstract Form

The Serialization Format Properties suggest a possible Abstract Form comprised of a sequence of zero or more occurrences of the following markup (where "+" means one or more and "*" means zero or more).

START WHITESPACE+ KIND WHITESPACE+ ITEM WHITESPACE*

The empty sequence is represented by an empty stream.

START is the start marker character sequence
WHITESPACE is any whitespace character (space, tab, CR, NL)
KIND is the item kind as an enumeration.
ITEM is the serialized form of the item. Within ITEM (the serialized form of the XDM Item) any occurrence of the START character sequence must be entity escaped.
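The abstract form can be sketched as a small Python writer. The proposal does not yet fix a START marker or the KIND names, so the marker "%XDM", its escaped form, and the kind names used here are hypothetical placeholders.

```python
START = "%XDM"               # hypothetical marker; the proposal does not fix one
ESCAPED_START = "&#37;XDM"   # same text with "%" written as a character reference


def write_item(kind, item):
    """Return one record: START WHITESPACE+ KIND WHITESPACE+ ITEM WHITESPACE*."""
    # Escape any START sequence inside ITEM so that START marks only
    # the beginning of a record.
    safe_item = item.replace(START, ESCAPED_START)
    return f"{START} {kind} {safe_item}\n"


def write_sequence(items):
    """Serialize (kind, item) pairs; the empty sequence is the empty stream."""
    return "".join(write_item(kind, item) for kind, item in items)


first = [("element", "<a/>"), ("atomic", 'xs:integer "1"')]
second = [("comment", "<!--note-->")]

# Concatenating two streams gives the stream of the concatenated sequences.
print(write_sequence(first) + write_sequence(second) == write_sequence(first + second))
```

Because no end marker is written, a producer can append further records to an existing stream at any time.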

START

The START marker is a character sequence which represents the beginning of a single XDM Item. This sequence is not allowed to occur anywhere else in the XDM Stream. This allows an XDM Consumer to perform some kinds of processing of XDM Streams without having to parse the ITEM values. For example, a large XDM Stream can be split into multiple streams by recognizing only a single character sequence.
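The splitting example can be sketched in Python as follows. The marker "%XDM" and the function names are hypothetical. The sketch assumes a valid stream, in which START occurs only at the beginning of each record.

```python
START = "%XDM"  # hypothetical marker; the proposal does not fix one


def split_records(stream):
    """Split a stream into per-item records without parsing any ITEM."""
    # Every record begins with START and START occurs nowhere else,
    # so the text before the first START is empty for a valid stream.
    pieces = stream.split(START)
    return [START + piece for piece in pieces[1:]]


def split_stream(stream, size):
    """Break one stream into smaller streams of at most `size` items each."""
    records = split_records(stream)
    return ["".join(records[i:i + size]) for i in range(0, len(records), size)]


stream = '%XDM element <a/>\n%XDM atomic xs:integer "1"\n%XDM comment <!--x-->\n'
parts = split_stream(stream, 2)

# Rejoining the parts reproduces the original stream.
print("".join(parts) == stream)
```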

WHITESPACE

WHITESPACE is any whitespace character (space, tab, CR, NL). Unlike in XML serialization, normalization of CR/NL sequences to NL is not required outside of ITEM serialization: wherever WHITESPACE occurs in the production syntax, multiple adjacent occurrences have the same meaning as a single one, so it is not necessary to distinguish CR/NL from NL.

KIND

KIND is a character sequence which identifies one of the 8 XDM item kinds: the seven types of XDM nodes (document, element, attribute, text, namespace, processing instruction, and comment) and atomic types.

ITEM

ITEM is the serialized form of a single XDM item. (see Item Serialization)
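Putting START, WHITESPACE, KIND and ITEM together, a reader might look like the following Python sketch. The marker "%XDM" and the kind names are hypothetical, and the ITEM text is returned as is, without unescaping or further parsing.

```python
import re

START = "%XDM"  # hypothetical marker; the proposal does not fix one
WS = r"[ \t\r\n]"

# START WHITESPACE+ KIND WHITESPACE+ ITEM WHITESPACE*
# ITEM runs up to the next START marker or the end of the stream.
RECORD = re.compile(
    re.escape(START) + WS + r"+(\S+)" + WS + r"+(.*?)" + WS
    + r"*(?=" + re.escape(START) + r"|\Z)",
    re.DOTALL,
)


def read_items(stream):
    """Yield (kind, item) pairs; ITEM is returned as text, still escaped."""
    for match in RECORD.finditer(stream):
        yield match.group(1), match.group(2)


# CR/NL and NL between records are treated alike.
stream = '%XDM element <a/>\r\n%XDM atomic xs:integer "1"\n'
for kind, item in read_items(stream):
    print(kind, item)
```

Since each record is matched independently, a consumer can stop after any item or skip items by kind without interpreting their content.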

Item Serialization

Each of the 8 XDM kinds (7 node types and atomic types) is serialized into a character sequence as follows.

document node

Document nodes start with <?xml?> and are serialized as specified in XML Output

element node

Element nodes are serialized as specified in XML Output

attribute node

Attribute nodes are serialized as

['{' namespace uri '}'] [ prefix ':' ] name WHITESPACE value

If there is no namespace uri then it is omitted

If there is no prefix then it is omitted

"value" is the value of the attribute, quoted as per the specifications of XML Serialization [TBD where is attribute serialization defined?]

Example

{http://www.test.com}test:attr="value"
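The example above can be reproduced with a short Python sketch. The function name serialize_attribute is hypothetical. Two assumptions are made because the details are still TBD: the name and value are joined with "=" as in the example, and the value is quoted with the standard library's xml.sax.saxutils.quoteattr.

```python
from xml.sax.saxutils import quoteattr


def serialize_attribute(name, value, prefix=None, namespace_uri=None):
    """Serialize a parentless attribute in the form shown in the example."""
    parts = []
    if namespace_uri:
        parts.append("{" + namespace_uri + "}")  # omitted when there is no URI
    if prefix:
        parts.append(prefix + ":")               # omitted when there is no prefix
    parts.append(name)
    # Assumption: quoteattr stands in for the quoting rules that are still TBD.
    # It escapes &, < and > and chooses a quote character safe for the value.
    return "".join(parts) + "=" + quoteattr(value)


print(serialize_attribute("attr", "value", prefix="test",
                          namespace_uri="http://www.test.com"))
```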

text node

Text nodes are serialized as quoted strings [TBD define quoting mechanism, is it the same as the attribute values ?]

Example:

"Some text in a text node"

namespace node

TBD: Should support for namespace nodes be optional? I cannot create or generate a namespace node in XQuery even though it is part of XDM.

processing instruction node

Processing instructions are serialized identically to the XML Serialization for processing instructions. [TBD Reference to PI serialization spec]

Example

<?This is a processing instruction?>

comment nodes

Comment nodes are serialized identically to the XML Serialization for comments. [TBD Reference to comment serialization spec]

Example

<!-- This is a comment -->

atomic values

Atomic values are serialized in the following form

[ '{' type uri '}' ] prefix ':' typename WHITESPACE value

For XML Schema built-in types no "type uri" is required

For user defined types a 'type uri' is required

Prefix is required. The "xs" prefix is predeclared to be "http://www.w3.org/2001/XMLSchema"

"value" is the string representation of the atomic value as defined by http://www.w3.org/TR/xpath-functions/#casting as if cast to "xs:string". The resulting value is then quoted. [TBD reference to quoting spec]

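The atomic value form can be sketched in Python as follows. The function name serialize_atomic, the "my" prefix and the example.com type URI are hypothetical. The value is assumed to be already cast to its xs:string form, and quoting uses xml.sax.saxutils.quoteattr as a stand-in for the quoting rules that are still TBD.

```python
from xml.sax.saxutils import quoteattr

XS_URI = "http://www.w3.org/2001/XMLSchema"


def serialize_atomic(type_name, string_value, prefix="xs", type_uri=XS_URI):
    """Serialize as [ '{' type uri '}' ] prefix ':' typename WHITESPACE value."""
    # Built-in XML Schema types rely on the predeclared "xs" prefix,
    # so the type URI is written only for other (user defined) types.
    uri_part = "" if type_uri == XS_URI else "{" + type_uri + "}"
    return f"{uri_part}{prefix}:{type_name} {quoteattr(string_value)}"


print(serialize_atomic("integer", "42"))
print(serialize_atomic("sku", "A-1", prefix="my",
                       type_uri="http://example.com/types"))  # hypothetical type
print(serialize_atomic("string", ""))
```

With this form the empty string is still written as a record with an empty quoted value, so it stays distinct from the empty sequence, which is an empty stream.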