Revision [120]

This is an old revision of XDMSerialize made by DavidLee on 2011-01-10 05:35:43.
 

XDM Text Serialization


A proposal for text serialization of XDM for purposes of data interchange and interoperability.

Abstract

The XQuery 1.0 and XPath 2.0 Data Model (XDM) references a Serialization format for XDM. This format is "lossy": its output cannot be reconstructed into the original XDM. For purposes of data interchange the losses are severe: sequences are "normalized" into a single rooted XML tree, adjacent atomic values are concatenated, and type information is discarded.
This proposal defines a standard format for serializing XDM data for purposes of data interchange and interoperability, preserving more of the XDM information.

Intent

The intent of this specification is to provide a standard format for interchange of data between XML tools which produce and consume XDM data. Examples of such tools include XPath 2.0, XSLT, and XQuery, as well as other tools such as XProc, xmlsh, and countless custom-written programs that use the XDM data model.

In current implementations there is no standard model for XDM data, either within the same environment and language or across languages and environments. For example, if an XQuery operation produces a sequence and that sequence must be passed as a parameter to an XSLT transformation, there is no standardized way to exchange the data. In practice, either the same vendor's tools must be used within the same language and process, or the results must be serialized in a proprietary format and reconstituted in the target using the same proprietary format. Even with a single vendor's implementations, interchange is not always easy because of differences in internal data formats and languages, or the need to transfer data across process or machine boundaries.

The existing text serialization proposal for XDM is inadequate for data interchange.

These and other limitations make the existing XDM serialization proposal unsuitable for data interchange.
This proposal provides for a standard serialization format so that XDM data can be interchanged across tools, vendors, languages, environments and machines while maintaining most of the original XDM information.

Definitions

For the purposes of this document the following definitions apply

Goals


This proposal expresses multiple goals, not all of which may be possible to achieve. The use cases describe concrete examples of many of the goals, while this summary provides the intent.


Some purposes for which this standard could be used include

XDM Information Preserved and Lost


Preserving all of the information in the XDM is very difficult, which is likely why a serialization model for XDM has not been specified. This proposal recognizes that not all XDM information is equally important. In the context of the Use Cases, and with the goal of reasonable implementation using existing vendor libraries, this proposal aims to preserve some XDM information at the expense of other information.

XDM Information preserved

The XDM Model defines values as a sequence of zero or more items. Each item is one of the following types

Each type has a value. Atomic types have string values, and node types have XML values.

An XDM serialization format should preserve the following attributes



XDM Information NOT preserved



Use Cases


Use cases are concrete examples that demonstrate the goals.


Serialization Format

The serialization format is described in terms of a stream of Unicode characters (not bytes). The conversion between bytes and characters is an Encoding property. In-memory representations likely need no encoding. File formats should be stored in UTF-8 with no leading Byte Order Mark (BOM).
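
The file-format recommendation above can be sketched in Python, chosen here only for illustration since the proposal does not name an implementation language. The function names are hypothetical, and the lenient handling of a stray BOM on input is an assumption rather than something the proposal states.

```python
import codecs

def encode_xdm_stream(text: str) -> bytes:
    """Encode a serialized character stream as UTF-8 without a BOM."""
    # The plain "utf-8" codec never emits a byte order mark
    # (the "utf-8-sig" codec would prepend one).
    return text.encode("utf-8")

def decode_xdm_stream(data: bytes) -> str:
    """Decode UTF-8 bytes back into a character stream."""
    # Assumption: tolerate a leading BOM written by other tools by skipping it.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")
```

Keeping encoding in its own step means the rest of a producer or consumer can work purely on characters, which matches the way the format is described.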

Serialization Format Properties

The Goals and Use Cases provide the rationale from which the desired properties of a serialization format can be derived.



Abstract Form


The Serialization Format Properties suggest a possible Abstract Form consisting of a sequence of zero or more occurrences of the following markup (where "+" means one or more and "*" means zero or more).

START WHITESPACE+ KIND WHITESPACE+ ITEM WHITESPACE*
The empty sequence is represented by an empty stream.
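
A minimal sketch of a writer and reader for this abstract form follows. The marker string `#XDM-ITEM#`, the function names, and the treatment of ITEM as an opaque, already-serialized string are all assumptions made for illustration; the proposal does not fix them here. The sketch also assumes that whitespace at either edge of an ITEM belongs to the separators, which would not be safe for items whose own text begins or ends with whitespace.

```python
import re

START = "#XDM-ITEM#"  # hypothetical marker; the proposal does not define one here

def write_items(items):
    """Serialize (kind, item_text) pairs as START WS+ KIND WS+ ITEM WS*."""
    return "".join(f"{START} {kind} {text}\n" for kind, text in items)

# WS+ KIND WS+ ITEM WS*, applied to the text that follows one START marker.
_WS = r"[ \t\r\n]"
_RECORD = re.compile(rf"{_WS}+([^ \t\r\n]+){_WS}+(.*?){_WS}*\Z", re.DOTALL)

def read_items(stream):
    """Parse a stream back into (kind, item_text) pairs."""
    if stream == "":
        return []  # the empty sequence is an empty stream
    head, *chunks = stream.split(START)
    if head != "":
        raise ValueError("stream must begin with START")
    items = []
    for chunk in chunks:
        match = _RECORD.match(chunk)
        if match is None:
            raise ValueError(f"malformed record: {chunk!r}")
        items.append((match.group(1), match.group(2)))
    return items
```

For example, `read_items(write_items([("element", "<a>x y</a>")]))` gives back the same single pair, because the reader never looks inside the ITEM text beyond trimming the separators.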



START

The START marker is a character sequence which represents the beginning of a single XDM Item. This sequence is not allowed to occur anywhere else in the XDM stream. This allows an XDM Consumer to perform some kinds of processing of XDM streams without having to parse the ITEM values. For example, a large XDM stream can be split into multiple streams by recognizing only a single character sequence.
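
The splitting example can be sketched as follows. The marker string and the function name `split_stream` are hypothetical; the only property the code relies on is the one stated above, namely that START never occurs anywhere except at the beginning of an item.

```python
START = "#XDM-ITEM#"  # hypothetical marker; the proposal does not define one here

def split_stream(stream, items_per_part):
    """Split a stream into smaller streams without parsing any ITEM."""
    if items_per_part < 1:
        raise ValueError("items_per_part must be at least 1")
    # Collect the offset of every START marker; each begins exactly one item.
    offsets = []
    pos = stream.find(START)
    while pos != -1:
        offsets.append(pos)
        pos = stream.find(START, pos + len(START))
    # Cut the stream at every items_per_part-th marker.
    parts = []
    for i in range(0, len(offsets), items_per_part):
        begin = offsets[i]
        j = i + items_per_part
        end = offsets[j] if j < len(offsets) else len(stream)
        parts.append(stream[begin:end])
    return parts
```

Each returned part is itself a stream in the same form, so the parts can be handed to separate consumers and, for a stream that begins with START, concatenated back into the original.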

WHITESPACE

WHITESPACE is any whitespace character (space, tab, CR, NL). Normalization of CR/NL sequences to NL is not required outside of ITEM serialization (as it is in XML serialization), because wherever WHITESPACE occurs in the production syntax, multiple adjacent occurrences have the same meaning as a single one, so it is not necessary to distinguish CR/NL from NL.

KIND

KIND is a character sequence which identifies one of the 8 XDM item kinds: the seven kinds of XDM nodes (document, element, attribute, text, namespace, processing instruction, and comment) and atomic types.
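
One way a consumer might model the 8 kinds is an enumeration, sketched below. The token spellings (for example `processing-instruction` and `atomic`) and the names `Kind` and `parse_kind` are assumptions; this section names the kinds but does not specify the character sequences used for them.

```python
from enum import Enum

class Kind(Enum):
    """The 8 XDM item kinds; the token spellings are assumed, not specified."""
    DOCUMENT = "document"
    ELEMENT = "element"
    ATTRIBUTE = "attribute"
    TEXT = "text"
    NAMESPACE = "namespace"
    PROCESSING_INSTRUCTION = "processing-instruction"
    COMMENT = "comment"
    ATOMIC = "atomic"

    @property
    def is_node(self):
        """True for the seven node kinds, False for atomic values."""
        return self is not Kind.ATOMIC

def parse_kind(token):
    """Map a KIND token to a Kind, rejecting unknown tokens."""
    try:
        return Kind(token)
    except ValueError:
        raise ValueError(f"unknown KIND token: {token!r}") from None
```

Rejecting unknown tokens early lets a consumer fail before it attempts to interpret an ITEM it has no rule for.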

ITEM

ITEM is the serialized form of a single XDM item (see Item Serialization).

Item Serialization

Each of the 8 XDM kinds (7 node types and atomic types) is serialized into a character sequence as follows.

document


Documents are serialized as specified in XML Output.



