XPath Embedded DSL

The XPath specification allows navigation of XML documents via a DSL that describes routes through a document using a combination of axes, steps and predicates. It has a limited number of these abstractions, but together they create a powerful and direct querying language that remains simple to use.

Scales provides this power via both a traditional string-based approach and an embedded DSL that leverages Scala's syntactic flexibility to mimic the XPath syntax.

The DSL uses the existing Scales abstractions to the full, and works via a zipper over the XmlTree itself. Each navigation step through the tree creates new zippers and new paths through the tree.

Wherever possible (with the exception of the namespace:: axis) the range of behaviours closely follows the specification, with like-for-like queries matching 100%. Instead of matching on prefixes, Scales matches against fully qualified expanded QNames (qualifiedName in the QName Functions), so no prefix context is required to evaluate a query.
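As an illustration of matching on expanded names rather than prefixes, here is a minimal sketch in plain Scala. The Name class and its members are hypothetical stand-ins, not the Scales QName type; the sketch only assumes that a name carries a namespace URI, a local name and an optional prefix.

```scala
object ToyQName {
  // Hypothetical stand-in for a qualified name; NOT the Scales QName type.
  final case class Name(namespace: String, local: String, prefix: Option[String] = None) {
    // Expanded form, e.g. {test:uri:attribs}attr3
    def expanded: String = s"{$namespace}$local"

    // Match on namespace and local name only; the prefix plays no part.
    def matches(other: Name): Boolean =
      namespace == other.namespace && local == other.local
  }

  val a = Name("test:uri:attribs", "attr3", Some("pre"))
  val b = Name("test:uri:attribs", "attr3", Some("other")) // same name, different prefix
  val c = Name("test:uri", "attr3", Some("pre"))           // same prefix, different namespace
}
```

In this sketch a matches b but not c, which is why no prefix-to-namespace mapping has to be supplied when the query is evaluated.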

Internally, perhaps unsurprisingly, XPath is implemented as a combination of filter, map and flatMap. When retrieving results (e.g. converting to an Iterable) the results are sorted into document order. This can be expensive for large result sets (see Unsorted Results for alternatives).
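The following is a minimal sketch of how steps and tests can be expressed with filter, map and flatMap. It uses a toy tree in plain Scala; Node, Elem, Text and Context are hypothetical names and do not reflect the actual Scales implementation, which works via zippers.

```scala
object ToyXPath {
  // Hypothetical toy tree; NOT the Scales XmlTree.
  sealed trait Node
  final case class Elem(name: String, children: List[Node] = Nil) extends Node
  final case class Text(value: String) extends Node

  // A context is the list of nodes currently matched; each step returns a new one.
  final case class Context(nodes: List[Node]) {
    // child axis: flatMap every matched node to its children
    def child: Context = Context(nodes.flatMap {
      case Elem(_, kids) => kids
      case _: Text       => Nil
    })

    // element name test: filter the current matches
    def named(name: String): Context = Context(nodes.filter {
      case Elem(n, _) => n == name
      case _          => false
    })

    // extracting values: map over the matches
    def strings: List[String] = nodes.map(text)
  }

  // Concatenated text content of a node
  def text(n: Node): String = n match {
    case Text(v)       => v
    case Elem(_, kids) => kids.map(text).mkString
  }

  val doc: Node =
    Elem("Elem", List(
      Elem("Child"),
      Text("Mixed Content"),
      Elem("Child2", List(Elem("Subchild", List(Text("text")))))
    ))

  // Roughly Child2/Subchild evaluated with Elem as the context node
  val result: List[String] =
    Context(List(doc)).child.named("Child2").child.named("Subchild").strings
}
```

Here result contains the single string "text". The toy skips sorting and duplicate removal, which a real implementation has to perform when results are retrieved.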

Simple Usage Examples

Given the following document:

  val ns = Namespace("test:uri")
  val nsa = Namespace("test:uri:attribs")
  val nsp = nsa.prefixed("pre")

  val builder =
    ns("Elem") /@ (nsa("pre", "attr1") -> "val1",
                   "attr2" -> "val2",
                   nsp("attr3") -> "val3") /(
      ns("Child"),
      "Mixed Content",
      ns("Child2") /( ns("Subchild") ~> "text" )
    )

we can easily query for the Subchild:

  // top produces a Path from a Tree, in this case an XPath
  val path = top(builder)

  val res = path \* ns("Child2") \* ns("Subchild")
  res.size // 1

  string(res) // text
  qname(res) // Subchild

XPath Crash Course

Scales Xml follows the XPath spec fairly closely and accordingly represents the concepts of context, location steps and axes, full details of which can be found in the XPath Standard.

The context, which can be thought of as current "place" in the document, is represented by the following:

Location steps are a combination of axis, node test and predicates, e.g. /*fred, which represents the child axis, an element node test and a predicate against a no-namespace local name of "fred".

As the XPath adds more axes, steps and predicates the context changes, reducing or expanding possible matches as it develops. Scales Xml's XPath DSL represents that context with the XPath class, where each operation on that class returns another immutable instance for the next context.

As with XPath, Scales Xml predicates, axes and node tests can be chained, with the current context (the self axis in XPath) always represented by the resulting Scales XPath object. Only when the underlying results are used (for example by the string or qname functions) do they leave the XPath object and get transformed into a list of matching nodes, which is ordered by default.

XPath Axes

Scales supports all of the useful XPath axes, each of which can be used against a given context (an instance of Scales XPath). Full details of each axis are given in the XPath specification.

XPath Axis | Scales DSL | Details
ancestor | ancestor_:: | All the parents of this context
ancestor-or-self | ancestor_or_self_:: | All the parents of this context and this node
attribute | *@ | All the attributes for a given context; often combined directly with a name
child | \ or \+ to expand XmlItems | Children of this context. NB: \ alone in the Scales DSL simply removes the initialNode setting required by \\. If the children should be expanded (e.g. to use .filter directly) then \+ will "unpack" the child nodes.
descendant | descendant_:: | All children, and their children
descendant-or-self | descendant_or_self_:: | This node and all descendants, also known as \\
following | following_:: | All nodes that follow this context in document order, without child nodes of this context
following-sibling | following_sibling_:: | All direct children of this context's parent node that follow in document order
parent | \^ | The parent context of this context. For elements it represents the parent element and for attributes the containing element.
preceding | preceding_:: | All nodes that precede this context in document order, excluding the parent nodes
preceding-sibling | preceding_sibling_:: | All previous children of the parent in the current context, in document order
self | The XPath object itself via . | The current context node within a document

A commonly used abbreviation not listed as an axis above is \\, which means descendant_or_self_::. The difference is that \\ also supports possible eager evaluation and, as per the spec, the notion of \\ at the beginning of an expression.

NB: the Scales Embedded XPath DSL does not support the namespace axis. If you have a requirement for it then it can be looked at (please send an email to the mailing list to discuss possible improvements).

Node Tests

The Scales embedded XPath DSL views the majority of node tests as predicates:

XPath Node Test | Scales DSL | Details
node() | .\+ | Returns a new context for all the children below a given context
text() | .text | Returns a new context for all the text and cdata below a given context
comment() | .comment | Returns a new context for all the comments below a given context

Scales XML also adds:

Predicates

There are three areas allowing for predicates within XPaths:

The first two are special cased, as in the XPath spec, as they are the most heavily used predicates (using the above example document):

  // QName based match
  val attributeNamePredicates = path \@ nsp("attr3")
  string(attributeNamePredicates) // "val3"
  
  // predicate based match
  val attributePredicates = path \@ ( string(_) == "val3" )
  qualifiedName(attributePredicates) // {test:uri:attribs}attr3

  // Find child descendants that contain a Subchild 
  val elemsWithASubchild = path \\* ( _ \* ns("Subchild"))
  string(elemsWithASubchild) // text
  qualifiedName(elemsWithASubchild) // {test:uri}Child2

In each case the XmlPath (or AttributePath) is passed to the predicate. There are also a number of shortcuts for the common QName-based matches and for positional matches on elements:

  val second = path \*(2) // path \* 2 is also valid but doesn't read like \*[2]
  qname(second) // Child2

The developer can choose to ignore namespaces (not recommended) by using the *:* and *:@ predicates instead (equivalent to string xpath /*= "x").

Predicate Construction

All the predicates in Scales are built from two simple building blocks:

  1. XmlPath => Boolean - via the XPath.filter function
  2. AttributePath => Boolean - via the AttributeAxis.*@ function

The various base node types and filters are based on these functions, for example the element predicate * is implemented as:

def *(pred : XmlPath => Boolean) : XPath[T] = 
  filter(x => x.isItem == false && pred(x))

In turn \* can be seen as a combination of the \ child step and the * predicate (via xflatMap) and is provided as syntactic sugar.

Similarly text is implemented using filter.

All of the standard set of predicates (and axis combinations) can be found in the XPath ScalaDoc. Clicking the right arrow for many of the functions will lead you to the Definition Classes docs and their code.

Chaining Predicates

Predicates can be chained on the context itself, i.e. the XPath object, for example:

val pathsCombinedPredicates =
    root.\*(ns("Child")).
      *(_.\@( nsp("attr3") )) // context is still Child matches, but has additionally reduced it to only items with an attribute of attr3

This represents /root/*ns:Child[.\@nsp:attr3] where the * Scales Xml element predicate allows matching on the self axis. The same chaining is available on the attribute axis represented by the AttributePaths class.
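To show what chaining means for the context, here is a minimal sketch in plain Scala. Elem and Context are hypothetical toy types, not the Scales XPath or AttributePaths classes; the sketch only assumes that each predicate narrows the current matches and returns a new context.

```scala
object ToyPredicates {
  // Hypothetical element model; NOT the Scales types.
  final case class Elem(name: String, attributes: Map[String, String] = Map.empty)

  // Each predicate narrows the matches and returns a new immutable context.
  final case class Context(elems: List[Elem]) {
    def filter(pred: Elem => Boolean): Context = Context(elems.filter(pred))
  }

  val children = Context(List(
    Elem("Child", Map("attr3" -> "val3")),
    Elem("Child"),
    Elem("Child2", Map("attr3" -> "val3"))
  ))

  // Two chained predicates on the same context...
  val chained: Context =
    children.filter(_.name == "Child").filter(_.attributes.contains("attr3"))

  // ...select the same elements as one combined predicate.
  val combined: Context =
    children.filter(e => e.name == "Child" && e.attributes.contains("attr3"))
}
```

Both chained and combined hold only the first Child element. The context stays on the Child matches throughout; the second predicate only reduces them.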

Positional Predicates

XPath Position Function | Scales DSL | Details
position() | pos_<, pos_==, pos() and pos_> | Functions that work against the current position within a context
last() | last_<, last_== and last_> | Functions that work against the size of a given context
position() == last() | pos_eq_last | Takes the last item in a context

These positional tests are more difficult to model, but they can be used in the same way as position() and last() in XPath.

So, for example:

  // /*[position() = last()]
  val theLast = path.\.pos_eq_last
  qname(theLast) // Elem

  // //*[position() = last()]
  val allLasts = path.\\*.pos_eq_last
  allLasts map(qname(_)) // List(Elem, Child2, Subchild)

  // all elems with more than one child
  // //*[ ./*[last() > 1]]
  val moreThanOne = path.\\*( _.\*.last_>(1) )
  qname(moreThanOne) // Elem

  // all elems that aren't the first child
  // //*[ position() > 1]
  val notFirst = path.\\*.pos_>(1)
  qname(notFirst) // Child2
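One reason positional tests are harder to model is that position() and last() are relative to each node's siblings, not to the flattened result set. The following is a minimal sketch of that idea in plain Scala; Elem and withPositions are hypothetical and say nothing about how Scales tracks positions internally. The toy tree mirrors only the elements of the example document.

```scala
object ToyPositions {
  // Hypothetical element-only tree; NOT the Scales types.
  final case class Elem(name: String, children: List[Elem] = Nil)

  // Every element in document order, paired with its 1-based position
  // among its siblings and the number of siblings (position(), last()).
  def withPositions(root: Elem): List[(Elem, Int, Int)] = {
    def go(siblings: List[Elem]): List[(Elem, Int, Int)] =
      siblings.zipWithIndex.flatMap { case (e, i) =>
        (e, i + 1, siblings.size) :: go(e.children)
      }
    go(List(root))
  }

  val doc = Elem("Elem", List(
    Elem("Child"),
    Elem("Child2", List(Elem("Subchild")))
  ))

  // Analogue of //*[position() = last()]
  val lasts: List[String] =
    withPositions(doc).collect { case (e, pos, last) if pos == last => e.name }

  // Analogue of //*[position() > 1]
  val notFirst: List[String] =
    withPositions(doc).collect { case (e, pos, _) if pos > 1 => e.name }
}
```

In this toy, lasts is List("Elem", "Child2", "Subchild") and notFirst is List("Child2"), in line with the DSL examples above.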

Direct Filtering

The xflatMap, xmap, xfilter and filter methods allow extra predicate usage where the existing XPath 1.0 functions don't suffice.

The filter method accepts a simple XmlPath => Boolean, whereas the other varieties work on the matching sets themselves.

It is not recommended to use these functions for general use as they primarily exist for internal re-use.

Unsorted Results and Views

To meet expected XPath usage, results are sorted in document order and checked for duplicates. If this is not necessary, but speed of matching over a result set is (for example lazy querying over a large set), then the raw functions (either raw or rawLazy) are good choices.

The viewed function, however, uses views as its default type and may help add further lazy evaluation. Whilst tests have shown that lazy evaluation takes place, it's worth profiling your application to see if it actually impacts performance in the expected fashion.
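The trade-off can be sketched in plain Scala without Scales. Match and its docIndex field are hypothetical; the sketch only assumes that every match knows where it sits in document order, and it does not use the raw, rawLazy or viewed functions themselves.

```scala
object ToyOrdering {
  // Hypothetical match result: a name plus its index in document order.
  final case class Match(name: String, docIndex: Int)

  // Matches as they might arrive from combining several steps:
  // out of document order and containing a duplicate.
  val rawMatches: List[Match] = List(
    Match("Subchild", 3), Match("Child", 1), Match("Subchild", 3), Match("Child2", 2)
  )

  // Sorted and de-duplicated: the whole result set is needed up front.
  val inDocumentOrder: List[Match] = rawMatches.distinct.sortBy(_.docIndex)

  // Unsorted and lazy: the iterator stops at the first hit.
  val firstRaw: Option[Match] = rawMatches.iterator.find(_.name.startsWith("Child"))
}
```

Sorting touches every match before the first one can be returned, whereas the lazy lookup stops as soon as a match is found. Whether that matters in practice depends on the size of the result set.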

See the XmlPaths trait for more information.

Scales Xml 0.5.0
