DOM Parsing and Serialization

DOMParser, XMLSerializer, innerHTML, and similar APIs

W3C Editor's Draft

More details about this document
This version:
https://www.downtownmelody.com/_x/dzNjLmdpdGh1Yi5pbw/DOM-Parsing/
Latest published version:
https://www.downtownmelody.com/_x/d3d3LnczLm9yZw/TR/DOM-Parsing/
Latest editor's draft:
https://www.downtownmelody.com/_x/dzNjLmdpdGh1Yi5pbw/DOM-Parsing/
History:
https://www.downtownmelody.com/_x/d3d3LnczLm9yZw/standards/history/DOM-Parsing/
Editor:
(Microsoft)
Feedback:
[email protected] with subject line DOM-Parsing (archives)
Test Suites
https://www.downtownmelody.com/_x/dzNjLXRlc3Qub3Jn/domparsing/
https://www.downtownmelody.com/_x/dzNjLXRlc3Qub3Jn/html/syntax/
Participate
We are on GitHub.
Bugzilla Bug list.
GitHub Issues.
Commit history.
Mailing list.

Abstract

This specification defines APIs for the parsing and serializing of HTML and XML-based DOM nodes for web applications.

Status of This Document

This section describes the status of this document at the time of its publication. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.downtownmelody.com/_x/d3d3LnczLm9yZw/TR/.

This document was published by the Web Applications Working Group as an Editor's Draft.

Publication as an Editor's Draft does not imply endorsement by W3C and its Members.

This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This document is governed by the 03 November 2023 W3C Process Document.

Candidate Recommendation Exit Criteria

This specification will not advance to Proposed Recommendation before the spec's test suite is completed and two or more independent implementations pass each test, although no single implementation must pass each test. We expect to meet these criteria no sooner than 24 October 2014. The group will also create an Implementation Report.

Conformance

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

The IDL fragments in this specification must be interpreted as required for conforming IDL fragments, as described in the Web IDL specification. [WEBIDL]

Requirements phrased in the imperative as part of algorithms (such as "strip any leading space characters" or "return false and terminate these steps") are to be interpreted with the meaning of the key word ("must", "should", "may", etc) used in introducing the algorithm.

Conformance requirements phrased as algorithms or specific steps may be implemented in any manner, so long as the end result is equivalent. (In particular, the algorithms defined in this specification are intended to be easy to follow, and not intended to be performant.)

User agents may impose implementation-specific limits on otherwise unconstrained inputs, e.g. to prevent denial of service attacks, to guard against running out of memory, or to work around platform-specific limitations.

When a method or an attribute is said to call another method or attribute, the user agent must invoke its internal API for that attribute or method so that e.g. the author can't change the behavior by overriding attributes or methods with custom properties or functions in ECMAScript. [ECMA-262]
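
As a rough illustration of the requirement above, the following sketch models an implementation whose public members delegate to an internal routine. FakeElement, markup, wrappedMarkup, and internalSerializeChildren are hypothetical names invented for this example; they are not part of any real DOM interface, and actual user agents implement this in native code.

```javascript
// Hypothetical internal routine, standing in for a user agent's private
// serialization API. Authors cannot reach or replace it.
function internalSerializeChildren(element) {
  return element.childNames.map((name) => `<${name}></${name}>`).join("");
}

// FakeElement is an illustrative plain class, not a real DOM interface.
class FakeElement {
  constructor(childNames) {
    this.childNames = childNames;
  }

  // Public attribute that an author is able to shadow.
  get markup() {
    return internalSerializeChildren(this);
  }

  // Conceptually "calls" markup, but goes through the internal routine
  // rather than the public, overridable property.
  wrappedMarkup() {
    return `<div>${internalSerializeChildren(this)}</div>`;
  }
}

const element = new FakeElement(["span", "em"]);

// An author shadows the public attribute with a custom getter.
Object.defineProperty(element, "markup", { get: () => "hijacked" });

console.log(element.markup);          // the author's override
console.log(element.wrappedMarkup()); // unaffected by the override
```

Because wrappedMarkup never reads the public markup property, the author's override cannot change its result.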

Unless otherwise stated, string comparisons are done in a case-sensitive manner.

If an algorithm calls into another algorithm, any exception thrown by the latter (unless it is explicitly caught) must cause the former to terminate and the exception to be propagated up to its caller.

Extensibility

Vendor-specific proprietary extensions to this specification are strongly discouraged. Authors must not use such extensions, as doing so reduces interoperability and fragments the user base, allowing only users of specific user agents to access the content in question.

If vendor-specific extensions are needed, the members should be prefixed by vendor-specific strings to prevent clashes with future versions of this specification. Extensions must be defined so that the use of extensions neither contradicts nor causes the non-conformance of functionality defined in the specification.

When vendor-neutral extensions to this specification are needed, either this specification can be updated accordingly, or an extension specification can be written that overrides the requirements in this specification. Such an extension specification becomes an applicable specification for the purposes of conformance requirements in this specification.

1. Introduction

A document object model (DOM) is an in-memory representation of various types of Nodes where each Node is connected in a tree. The [HTML5] and [DOM4] specifications describe DOM and its Nodes in greater detail.

Parsing is the term used for converting a string representation of a DOM into an actual DOM, and Serializing is the term used to transform a DOM back into a string. This specification concerns itself with defining various APIs for both parsing and serializing a DOM.

For example, the innerHTML API is a common way to both parse and serialize a DOM. If a particular Node has the following in-memory DOM:
HTMLDivElement (nodeName: "div")
┃
┣━ HTMLSpanElement (nodeName: "span")
┃  ┃
┃  ┗━ Text (data: "some ")
┃
┗━ HTMLElement (nodeName: "em")
   ┃
   ┗━ Text (data: "text!")
And the HTMLDivElement node is stored in a variable myDiv, then to serialize myDiv's children simply get (read) the Element's innerHTML property (this triggers the serialization):
var serializedChildren = myDiv.innerHTML;
// serializedChildren has the value:
// "<span>some </span><em>text!</em>"

To parse new children for myDiv from a string (replacing its existing children), simply set the innerHTML property (this triggers parsing of the assigned string):

myDiv.innerHTML = "<span>new</span><em>children!</em>";

This specification describes two flavors of parsing and serializing: HTML and XML (with XHTML being a type of XML). Each follows the rules of its respective markup language. The above example shows HTML parsing and serialization. The specific algorithms for HTML parsing and serializing are defined in the [HTML5] specification. This specification contains the algorithm for XML serializing. The grammar for XML parsing is described in the [XML10] specification.

Round-tripping a DOM means serializing it and then immediately parsing the serialized string back into a DOM. Ideally, this process does not result in any data loss with respect to the identity and attributes of the Node in the DOM. Round-tripping is especially tricky for an XML serialization, which must be concerned with preserving the Node's namespace identity in the serialization (whereas namespaces are ignored in HTML).

Consider the XML serialization of the following in-memory DOM:
Element (nodeName: "root")
┃
┗━ HTMLScriptElement (nodeName: "script")
   ┃
   ┗━ Text (data: "alert('hello world')")
An XML serialization must include the HTMLScriptElement Node's namespace in order to preserve the identity of the script element, and to allow the serialized string to round-trip through an XML parser. Assuming that root is in a variable named root:
var xmlSerialization = new XMLSerializer().serializeToString(root);
// xmlSerialization has the value:
// "<root><script xmlns="http://www.w3.org/1999/xhtml">alert('hello world')</script></root>"
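
The round trip described above could be sketched roughly as follows. The helper name roundTripXml and its injectable serializer and parser parameters are assumptions made for this illustration; by default the helper relies on the XMLSerializer and DOMParser objects that browsers provide.

```javascript
// Sketch of a round trip: serialize a node, then parse the result again.
// The serializer and parser default to the browser's XMLSerializer and
// DOMParser; accepting them as parameters is an assumption of this sketch
// so that the helper can also be exercised with stand-in objects.
function roundTripXml(
  node,
  serializer = new XMLSerializer(),
  parser = new DOMParser()
) {
  // Serialize the node to an XML string.
  const markup = serializer.serializeToString(node);
  // Immediately parse that string back into a new XML document.
  const reparsed = parser.parseFromString(markup, "application/xml");
  return { markup, reparsed };
}

// In a browser, with root holding the "root" Element from the example:
//   const { markup, reparsed } = roundTripXml(root);
//   // reparsed is a new Document built from the serialized string
```

Comparing the namespaceURI of the elements in the reparsed document with those of the original nodes is one way to check whether namespace identity survived the round trip.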

The term context object means the object on which the API being discussed was called.

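The following terms are understood to represent their respective namespaces in this specification, which makes it easier to read:

  • The HTML namespace is http://www.w3.org/1999/xhtml
  • The XML namespace is http://www.w3.org/XML/1998/namespace
  • The XMLNS namespace is http://www.w3.org/2000/xmlns/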

2. APIs for parsing and serializing DOM

2.1 The DOMParser interface

The definition of DOMParser has moved to the HTML Standard.

2.2 The XMLSerializer interface

The definition of XMLSerializer has moved to the HTML Standard.

2.3 The InnerHTML mixin

The definition of InnerHTML has moved to the HTML Standard.

2.4 Extensions to the Element interface

The definition of outerHTML has moved to the HTML Standard.

The definition of insertAdjacentHTML has moved to the HTML Standard.

2.5 Extensions to the Range interface

The definition of createContextualFragment has moved to the HTML Standard.

3. Algorithms for parsing and serializing

3.1 Parsing

The definition of fragment parsing algorithm has moved to the HTML Standard.

3.2 Serializing

The definition of fragment serializing algorithm has moved to the HTML Standard.

3.2.1 XML Serialization

An XML serialization differs from an HTML serialization in the following ways:

  • Elements and attributes will always be serialized such that their namespaceURI is preserved. In some cases this means that an existing prefix, prefix declaration attribute or default namespace declaration attribute might be dropped, substituted or changed. An HTML serialization does not attempt to preserve the namespaceURI.
  • Elements not in the HTML namespace that contain no children are serialized using the empty-element tag syntax (i.e., according to the XML EmptyElemTag production).

Otherwise, the algorithm for producing an XML serialization is designed to produce a serialization that is compatible with the HTML parser. For example, elements in the HTML namespace that contain no child nodes are serialized with an explicit begin and end tag rather than using the empty-element tag syntax.
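
The difference in how childless elements are written out can be illustrated with a minimal sketch. It assumes elements are modeled as plain objects with namespaceURI and localName properties, and the helper serializeChildlessElement is a hypothetical name. The sketch deliberately ignores namespace declarations, prefixes, attributes, and any special cases defined elsewhere in the specification.

```javascript
const HTML_NAMESPACE = "http://www.w3.org/1999/xhtml";

// Hypothetical helper: writes out an element that has no children.
// The element is a plain object, not a real DOM node.
function serializeChildlessElement(element) {
  const name = element.localName;
  if (element.namespaceURI === HTML_NAMESPACE) {
    // HTML-namespace elements keep an explicit begin and end tag so the
    // output stays compatible with the HTML parser.
    return `<${name}></${name}>`;
  }
  // Elements in any other namespace (or in no namespace) use the
  // empty-element tag syntax.
  return `<${name}/>`;
}

const htmlScript = { namespaceURI: HTML_NAMESPACE, localName: "script" };
const svgCircle = { namespaceURI: "http://www.w3.org/2000/svg", localName: "circle" };

console.log(serializeChildlessElement(htmlScript)); // explicit end tag
console.log(serializeChildlessElement(svgCircle));  // empty-element tag
```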

Note

Per [DOM4], Attr objects do not inherit from Node, and thus cannot be serialized by the XML serialization algorithm. An attempt to serialize an Attr object will result in an empty string.

To produce an XML serialization of a Node node given a flag require well-formed, run the following steps:

  1. Let namespace be a context namespace with value null. The context namespace tracks the XML serialization algorithm's current default namespace. The context namespace is changed when either an Element Node has a default namespace declaration, or the algorithm generates a default namespace declaration for the Element Node to match its own namespace. The algorithm assumes no namespace (null) to start.
  2. Let prefix map be a new namespace prefix map.
  3. Add the XML namespace with prefix value "xml" to prefix map.
  4. Let prefix index be a generated namespace prefix index with value 1. The generated namespace prefix index is used to generate a new unique prefix value when no suitable existing namespace prefix is available to serialize a node's namespaceURI (or the namespaceURI of one of node's attributes). See the generate a prefix algorithm.
  5. Return the result of running the XML serialization algorithm on node passing the context namespace namespace, namespace prefix map prefix map, generated namespace prefix index reference to prefix index, and the flag require well-formed. If an exception occurs during the execution of the algorithm, then catch that exception and throw an "InvalidStateError" DOMException.
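
The five steps above can be sketched as a small wrapper function. This is only an approximation under stated assumptions: produceXmlSerialization and the serializeNode callback are hypothetical names, the namespace prefix map is modeled as a Map from namespace URI to a list of prefixes, the prefix index is wrapped in an object so that it can be passed by reference, and a plain Error named "InvalidStateError" stands in for a real DOMException.

```javascript
const XML_NAMESPACE = "http://www.w3.org/XML/1998/namespace";

// serializeNode is a hypothetical callback standing in for the XML
// serialization algorithm that handles each node type.
function produceXmlSerialization(node, requireWellFormed, serializeNode) {
  // Step 1: the context namespace starts out as null (no namespace).
  const namespace = null;

  // Steps 2 and 3: a new prefix map, seeded with the "xml" prefix.
  // Modeling it as Map<namespaceURI, prefix[]> is an assumption.
  const prefixMap = new Map([[XML_NAMESPACE, ["xml"]]]);

  // Step 4: the generated prefix index starts at 1. It is wrapped in an
  // object so the callee can update it through the shared reference.
  const prefixIndex = { value: 1 };

  // Step 5: run the serialization; any exception is caught and replaced
  // by an "InvalidStateError".
  try {
    return serializeNode(node, namespace, prefixMap, prefixIndex, requireWellFormed);
  } catch (cause) {
    const error = new Error("XML serialization failed");
    error.name = "InvalidStateError"; // stand-in for a DOMException
    throw error;
  }
}
```

A caller would pass its own per-node serialization routine as serializeNode; whatever that routine throws surfaces to the caller as an error named "InvalidStateError".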

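Each of the following algorithms for producing an XML serialization of a DOM node takes as input a node to serialize and the following arguments:

  • A context namespace namespace
  • A namespace prefix map prefix map
  • A generated namespace prefix index prefix index
  • The require well-formed flag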

The XML serialization algorithm produces an XML serialization of an arbitrary DOM node node based on the node's interface type. Each referenced algorithm is to be passed the arguments as they were received by the caller and is to return its result to the caller. Re-throw any exceptions. If node's interface is:

Element
Run the algorithm for XML serializing an Element node node.
Document
Run the algorithm for XML serializing a Document node node.
Comment
Run the algorithm for XML serializing a Comment node node.
Text
Run the algorithm for XML serializing a Text node node.
DocumentFragment
Run the algorithm for XML serializing a DocumentFragment node node.
DocumentType
Run the algorithm for XML serializing a DocumentType node node.
ProcessingInstruction
Run the algorithm for XML serializing a ProcessingInstruction node node.
An Attr object
Return an empty string.
Anything else
Throw a TypeError. Only