<article>
<artheader>
<title>XML challenges to programming language design</title>
<authorgroup>
<author>
<firstname>Per</firstname> <surname>Bothner</surname>
<affiliation>
<address>
<email>per@bothner.com</email>
</address>
</affiliation>
</author>
</authorgroup>
</artheader>
<abstract>
<para>
People are using XML-based languages for a number of applications.
The paper discusses what we can learn from this, and various
programming language ideas and features inspired by XML
applications and tools.
Some but not all of the ideas have prototype implementations
within Kawa.
</para>
</abstract>

<sect1><title>Introduction</title>
<para>
A number of popular <quote>programming</quote> languages use XML's syntax.
Some languages are used for generating XML or HTML.
Since these need to be able to construct XML/HTML fragments,
it is reasonable to use XML syntax for constructors and perhaps
the whole language.  This category includes Sun's JSP and W3C's XSLT.
Other languages are used to lay out documents and GUI
interfaces.  These applications are characterized by nested graphical
regions that may have different properties, such as size, font,
and color.  Using XML attributes to specify optional properties
works quite well, and matches what people are used to from HTML.
Examples of such languages include
Mozilla's XUL <xref linkend="XUL"/>, Microsoft's XAML <xref linkend="XAML"/>,
and W3C's XSL-FO.
These and other languages succeed in spite of XML's clumsy syntax,
even though Lisp-like languages
have long provided a more flexible and compact notation for
expressing nested data structures.
While the XML world could learn from the Lisp world,
there are also some interesting things for Lisp people to learn from XML.
Of course Lisp people have long provided libraries and ideas for XML
processing.<!-- FIXME [list examples here]--></para>
<para>
In this article I will make my own suggestions.
Some of these ideas and features have been implemented with the
Kawa <xref linkend="Kawa"/> framework,
which is both a Scheme implementation written in
Java, and a compiler framework for compiling high-level languages to
Java bytecodes.
See particularly my KRL dialect, which is based on Bruce Lewis's BRL,
and my new Q2 proto-language.</para>
</sect1>

<sect1 id="Constructors">
<title>Data constructor syntax</title>
<para>
We need a good syntax for constructing nested objects.
In many simple programs most of the program consists of constructors,
so it is good to have an efficient syntax, and avoid noise
keywords such as <literal>new</literal> or <literal>make</literal>.
We need a convenient syntax for attributes or keyword parameters.</para>
<para>
We also need syntax for function calls, as well as syntax for
control structures such as loops and conditionals.
JSP uses XML syntax for all of these, and there are popular
<quote>tag libraries</quote> which are essentially function libraries
whose functions are called using XML syntax.
The XQuery <xref linkend="XQuery"/> language, while
powerful and elegant in many ways, has a more traditional mixed-style syntax,
with XML constructors in XML syntax, but function calls and expressions
closer to C/Java syntax.  There are some advantages to using different
syntactic styles for different operations, but it does make it harder
when you need to change or refactor your code.</para>
<para>
Lisp languages of course have a unified syntax for function application
and data structures, but we still need a way to distinguish the two.
The traditional mechanism is to use quotation or quasi-quotation
for data, and some XML-in-Lisp embeddings use quotation conventions.
However, quotation is a read-time operation, and returns values that
are fixed early.  Too early if you want to associate richer
type information with the data structures, perhaps because you want
to do schema validation, or use different object types
for different element tags.</para>
<para>
My suggestion is to distinguish function calling from
object construction not by syntax but by name lookup.
This requires a declaration of the element type, though I show below
how the use of namespaces makes this convenient.</para>
<programlisting>
(define (process-para p) ...)
;; Define para as a constructor for a
;; type with mixed child content.
(define-element para mixed)

(process-para
  (para (string-upcase "data")))
</programlisting>
<para>
This yields, in XML syntax:</para>
<programlisting>
&lt;para&gt;DATA&lt;/para&gt;
</programlisting>
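<para>
As a rough sketch of how name lookup alone can select construction,
the following portable R7RS Scheme binds <literal>para</literal> to a
constructor that returns a record rather than a list.
The record type, this version of <literal>define-element</literal>
(which accepts but ignores the content model), and the
<literal>node-to-xml</literal> helper are assumptions made for
illustration; they are not Kawa's implementation.</para>
<programlisting>
(import (scheme base) (scheme char))

;; Hypothetical node representation:
;; an element is a record, not a list.
(define-record-type element
  (make-element tag children)
  element?
  (tag element-tag)
  (children element-children))

;; Stand-in for define-element: bind NAME
;; to a constructor for elements tagged NAME.
;; The content model is ignored here.
(define-syntax define-element
  (syntax-rules ()
    ((_ name content-model)
     (define (name . children)
       (make-element 'name children)))))

;; Serialize a node; text is assumed to be
;; a string and is not escaped.
(define (node-to-xml node)
  (if (element? node)
      (let ((tag (symbol-&gt;string
                  (element-tag node))))
        (string-append
         "&lt;" tag "&gt;"
         (apply string-append
                (map node-to-xml
                     (element-children node)))
         "&lt;/" tag "&gt;"))
      node))

(define-element para mixed)

(node-to-xml (para (string-upcase "data")))
;; =&gt; "&lt;para&gt;DATA&lt;/para&gt;"
</programlisting>
<para>
Because <literal>para</literal> is an ordinary binding, the call
<literal>(para ...)</literal> looks like any other function call;
only the declaration decides that it builds a node.</para>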
</sect1>

<sect1 id="Nodes">
<title>Typed node values</title>
<para>
Some XML/HTML-producing languages, such as JSP and PHP,
generate textual XML output.  I.e. web pages are produced
by explicitly or implicitly writing strings to a special file.
This is OK when the goal is just generating web pages,
but awkward if we want to be able to process existing XML data.
Furthermore, the result of a constructor should have a real type,
distinct from lists or vectors.
In other words, an element constructor returns an object.
</para>
<para>
The <quote>standard</quote> representation of XML elements and
other node types is W3C's DOM interface, but that may not always
be the best representation.  We may want to customize the
class used to represent a specific element type (as in JAXB).
In that case XML is just a special input/output representation,
and the structure of the XML file might not directly match
the internal representation.
</para>
<para>
Both DOM and the more abstract XQuery/XSLT data model allow you to
directly access a parent element from a child node.
This requires a lot of either re-parenting or (conceptual) copying,
which is awkward.  Lisp's model where you can get at the children
from a parent but not vice versa seems preferable.
Instead, we can reference a node indirectly using
the <firstterm>path</firstterm> we used to access it, rather like a
Unix filesystem, which distinguishes file names from inodes.
A path is a pointer, but it also includes a trail of parent pointers, so
we can easily get to parent and sibling nodes.</para>
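<para>
One possible reading of this idea, sketched in portable Scheme:
a path is a list whose first item is the node itself and whose
remaining items are its ancestors up to the root.
The list-based node representation and all of the
<literal>path-</literal> names are assumptions made for this sketch.</para>
<programlisting>
(import (scheme base))

;; A node is (tag child ...); anything
;; that is not a pair is a leaf.
;; A path is (node parent grandparent ...).
(define (root-path node) (list node))

(define (path-node path) (car path))

;; The parent's path, or #f at the root.
(define (path-parent path)
  (if (null? (cdr path)) #f (cdr path)))

;; Paths for the children of the path's node.
(define (path-children path)
  (let ((node (path-node path)))
    (if (pair? node)
        (map (lambda (child) (cons child path))
             (cdr node))
        '())))

(define doc '(html (head "t") (body "b")))
(define body-path
  (cadr (path-children (root-path doc))))

(path-node body-path)
;; =&gt; (body "b")
(eq? doc (path-node (path-parent body-path)))
;; =&gt; #t
</programlisting>
<para>
The nodes themselves hold no parent pointers, so the same node can be
shared by several parents; siblings are reached through
<literal>path-children</literal> of the parent path.</para>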
</sect1>

<sect1 id="Namespaces">
<title>Namespaces</title>
<para>
XML, like Common Lisp's package system, uses two-level names:
A <firstterm>qualified name</firstterm> consists of a local (simple) name,
and a globally-unique URI (URL).  The URI corresponds
to a Common Lisp package, but a URI is a string, rather than an object.
Qualified names (QNames) match if both the local name and the
URI are the same, which in practice is not very different from interned
Common Lisp symbols, which are the same if the local name and the
package are the same.</para>
<para>
In a Lisp program we write symbols using package prefixes, while in
an XML document we write qualified names using a namespace prefix, a colon,
and a local name.  A difference is that in Lisp the set of package
names (and nicknames) is maintained in a global table, while
in XML there is a local mapping from namespace prefix to URI.
This mapping is part of the lexical scope rather than the
global scope.
I think that is a better solution, as it lets
different modules use different short prefixes.
Note that in Common Lisp resolving a name to a symbol is a read-time
operation.  I suggest it should be done at the same time
as macros and special forms are recognized.  This allows
<literal>let</literal>-forms to define local namespace aliases (nicknames).
</para>
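<para>
The matching rule can be sketched in portable Scheme.
Here a qualified name is modeled as a pair of URI and local name, and
a lexical prefix mapping as an association list; both representations,
and the names <literal>resolve</literal> and <literal>qname=?</literal>,
are assumptions for illustration only.</para>
<programlisting>
(import (scheme base))

;; A QName is (uri . local), both strings.
(define (make-qname uri local)
  (cons uri local))

;; QNames match when URI and local name match;
;; the prefix plays no part.
(define (qname=? a b)
  (and (string=? (car a) (car b))
       (string=? (cdr a) (cdr b))))

;; Resolve PREFIX using the bindings that are
;; in scope: an alist of (prefix . uri).
(define (resolve prefix local bindings)
  (let ((entry (assq prefix bindings)))
    (if entry
        (make-qname (cdr entry) local)
        (error "unbound namespace prefix"
               prefix))))

;; Two modules pick different short prefixes
;; for the same namespace.
(define module-a
  '((xhtml . "http://www.w3.org/1999/xhtml")))
(define module-b
  '((h . "http://www.w3.org/1999/xhtml")))

(qname=? (resolve 'xhtml "p" module-a)
         (resolve 'h "p" module-b))
;; =&gt; #t
</programlisting>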
<para>
The Kawa keyword <literal>define-namespace</literal>
defines a name as a prefix that aliases a namespace URI.
At the same time it implicitly declares all the functions
that use the same prefix to be constructors that when called
create node objects.
It is reasonable to also provide a <literal>let-namespace</literal> keyword,
though Kawa does not currently do so.</para>
<programlisting>
(define-namespace xhtml
  "http://www.w3.org/1999/xhtml")

(xhtml:p "some text "
  (let-namespace
    (math "http://.../MathML")
    (math:xx)))
</programlisting>
<para>
If you print this using XML syntax you might get:</para>
<programlisting>
&lt;xhtml:p
  xmlns:xhtml="http://www.w3.org/1999/xhtml"&gt;
  some text
  &lt;math:xx
    xmlns:math="http://.../MathML"/&gt;
&lt;/xhtml:p&gt;
</programlisting>
<para>
An imported module can be bound to a namespace prefix:</para>
<programlisting>
(import mat &lt;matrix-functions&gt;)
(mat:transpose (mat:zero 2 3))
</programlisting>

</sect1>

<sect1 id="Sequences">
<title>Sequences</title>
<para>
The XQuery 1.0 and XSLT 2.0 languages use <firstterm>sequences</firstterm> in an
interesting way.  Such a sequence differs from the Common Lisp concept
of the same name in that you cannot directly nest sequences.
Equivalently, there is no difference between a value and a singleton
sequence consisting of just that value.  In that respect a sequence
is similar to Lisp/Scheme's <quote>multiple values</quote>.
However, a sequence is a first-class value in that a variable or parameter
can be bound to a sequence.
(Kawa represents an XQuery sequence and Scheme multiple values
the same way.)</para>
<para>
One might think that non-nestable sequences are too limiting,
and of course a language needs a way to provide nested data structures.
Nodes were discussed above, and arrays are discussed below.
Non-nestable sequences do provide a nice functional model for composing
program fragments, as we'll see.</para>
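<para>
The non-nesting rule can be emulated in portable Scheme.
In this sketch a sequence is modeled as a list and a single item as
itself, under the assumption that items are never lists;
<literal>seq</literal> and <literal>items-of</literal> are hypothetical
names, not part of XQuery or Kawa.</para>
<programlisting>
(import (scheme base))

;; View any value as a list of items.
(define (items-of v)
  (if (list? v) v (list v)))

;; Concatenate the arguments, flattening one
;; level; a one-item result is the item itself.
(define (seq . parts)
  (let ((items
         (apply append (map items-of parts))))
    (if (and (pair? items)
             (null? (cdr items)))
        (car items)
        items)))

(seq 1 (seq 2 3) (seq))  ; =&gt; (1 2 3)
(seq (seq 5))            ; =&gt; 5
</programlisting>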
</sect1>

<sect1 id="Expressions">
<title>Statements are expressions</title>
<para>
Scheme and Common Lisp are expression-oriented <quote>mostly-functional</quote>
languages, but the looping and <literal>prog</literal> constructs
don't fit very smoothly into this.</para>
<para>
First consider the Scheme <literal>&lt;body&gt;</literal> syntax,
which is one or more declarations or expressions, and whose
result is that of the final expression.
We can modify the definition so that the result is the sequence resulting from
concatenating <emphasis>all</emphasis> the sequences resulting from
the sub-expressions.
We also define declarations, assignments, and similar statements
to return zero values.  Most existing
code should work as is.  When needed, a <literal>discard</literal>
function can be used to ignore all its argument values, returning zero values.
The concatenation operator becomes the same as the
statement separator operator; in a non-Lisp-syntax language both
might use semi-colon or line separators.</para>
<para>
If evaluating a loop results in the concatenation of the values from
each iteration, then the value of a loop is the same as unfolding the
loop to yield a statement sequence, which is very intuitive behavior.
Using sequence concatenation in
this way yields pleasant and natural semantics for expression languages.
As the XQuery language shows, it is also very convenient for
processing XML and similar data structures.</para>
<para>
Here is an example:
</para>
<programlisting>
(define x
  (let r ((i 0))
    (+ 100 i)
    (if (&lt; i 5)
      (r (+ i 1)))
    i))
</programlisting>
<para>
Each time <literal>r</literal>'s body is evaluated it yields two values:
initially the result of <literal>(+ 100 i)</literal>
and finally the parameter <literal>i</literal>.
In between is a recursive call.
All the values are appended, yielding a sequence of 12 values:</para>
<programlisting>
100 101 102 103 104 105 5 4 3 2 1 0
</programlisting>
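<para>
The same result can be reproduced in standard Scheme by making the
concatenation explicit.
This is only an emulation, assuming that a sequence is modeled as a
list and that an <literal>if</literal> without an alternative
contributes zero values when its test is false:</para>
<programlisting>
(import (scheme base))

;; Each body expression yields a list,
;; and the body appends them in order.
(define (r i)
  (append (list (+ 100 i))
          (if (&lt; i 5)
              (r (+ i 1))
              '())
          (list i)))

(r 0)
;; =&gt; (100 101 102 103 104 105 5 4 3 2 1 0)
</programlisting>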
<para>The FLWOR-expression of XQuery is a powerful and elegant
way to map over sequences.  Loosely, a FLWOR-expression
iterates over a sequence, and for each item in the sequence it binds a
local variable, and evaluates an expression within that scope.  The
result is the concatenation of all the results.
Common Lisp has a whole set of mapping functions,
including <literal>mapcar</literal>, which returns the list of
results of applying a function, and <literal>mapcan</literal>, which
returns the concatenation of the lists resulting from applying a function.
We don't need this distinction if sequences do not nest.
A simple Scheme syntax:
</para>
<programlisting>
(do-each (<replaceable>var</replaceable> <replaceable>sequence</replaceable>)
  <replaceable>body</replaceable>)
</programlisting>
<para>
This evaluates <replaceable>sequence</replaceable>.
Then each value yielded by <replaceable>sequence</replaceable>
is bound to <replaceable>var</replaceable>,
and <replaceable>body</replaceable> is evaluated.
The value of the <literal>do-each</literal> is the concatenation
of the result of each evaluation of <replaceable>body</replaceable>.
</para>
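<para>
A minimal emulation of <literal>do-each</literal> in portable Scheme,
again assuming that sequences are modeled as lists, so every body
expression must return a list.
The body should be free of side effects, since standard
<literal>map</literal> does not fix the order in which it applies
its function.</para>
<programlisting>
(import (scheme base))

(define-syntax do-each
  (syntax-rules ()
    ((_ (var sequence) body ...)
     ;; Concatenate the results of all body
     ;; expressions over all iterations.
     (apply append
            (map (lambda (var)
                   (append body ...))
                 sequence)))))

(do-each (x '(1 2 3))
  (list x (* x x)))
;; =&gt; (1 1 2 4 3 9)

;; An iteration may contribute no values,
;; which gives filtering for free.
(do-each (x '(1 2 3 4 5))
  (if (odd? x) (list x) '()))
;; =&gt; (1 3 5)
</programlisting>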
</sect1>

<sect1 id="Arrays">
<title>Arrays</title>
<para>
Since sequences don't nest, we need a real data structure that
supports nesting.
An array is a single value that contains a multi-dimensional
mapping from integer tuples to sequences. Usually each component
of an array is a single value, but there seems to be no reason to
disallow sequences of other lengths, especially for modifiable arrays.
If our language is like Scheme in supporting first-class functions
using the same namespace as other values, then it seems reasonable
to use function call syntax for array indexing.
APL-like array operations correspond to higher-order functions.</para>
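<para>
As an illustration, a read-only two-dimensional array can be
represented in portable Scheme as a procedure of its indexes, so that
indexing is literally a function call and an element-wise operation is
a higher-order function.
The names <literal>build-array</literal> and
<literal>array-map</literal> are hypothetical.</para>
<programlisting>
(import (scheme base))

;; Build a ROWS x COLS array whose cell (i, j)
;; holds (f i j); the result is indexed by
;; calling it.
(define (build-array rows cols f)
  (let ((cells (make-vector (* rows cols))))
    (do ((i 0 (+ i 1))) ((= i rows))
      (do ((j 0 (+ j 1))) ((= j cols))
        (vector-set! cells
                     (+ (* i cols) j)
                     (f i j))))
    (lambda (i j)
      (vector-ref cells (+ (* i cols) j)))))

;; Element-wise operation, computed lazily.
(define (array-map f a)
  (lambda (i j) (f (a i j))))

(define m
  (build-array 2 3
    (lambda (i j) (+ (* 10 i) j))))

(m 1 2)                 ; =&gt; 12
((array-map - m) 1 2)   ; =&gt; -12
</programlisting>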
<para>
The primary operations on sequences are concatenation and iteration;
the primary operation on arrays is indexing.
It follows that a string should be a sequence of characters,
not an array of characters:  Random access in a string is not
a semantically meaningful operation.
</para>
</sect1>

<sect1 id="Attributes">
<title>Attributes and keywords</title>
<para>
XML elements may have named string-valued attributes,
which are useful for specifying optional properties.
Such attributes are similar to keyword parameters,
so it makes sense to use the same syntax for both.
XML attributes come before the <quote>body</quote>
or <quote>children</quote> of the element, while in
Common Lisp (and many other languages with keyword parameters)
the keyword parameters come after the unnamed parameters.
Listing the attributes first makes sense when attributes tend
to be shorter, or their value may influence the processing of the
main contents.  These concerns suggest we follow XML conventions.</para>
<para>
XML attribute values are restricted to string values, while
Lisp keyword parameters may be arbitrary values.  However, it is
worth noting that the <quote>meaning</quote> of an attribute may be
defined by a schema as having a typed value (<quote>hatsize</quote>
being the canonical example).  In any case our XML-friendly language
will of course allow arbitrary expressions yielding arbitrary
values.</para>
</sect1>

<sect1 id="Patterns">
<title>Patterns</title>
<para>
In XQuery a parameter list is a tuple of parameters, each of which
may be bound to a sequence.
An alternative model is to make the entire parameter list be a sequence.
This makes it easier for functions to have a variable number of
parameters - essentially they have a single sequence parameter.
Thus there is no need for Scheme's separate <literal>apply</literal>
or <literal>call-with-values</literal> procedures.
On the other hand, you cannot have two parameters both of
which take a sequence.</para>
<para>
If there is logically only a single parameter, then the function
definition needs a way to split the sequence up.
ML-style pattern matching is an elegant solution.
Extending these to nested regular patterns,
as in some XML-oriented functional languages like CDuce <xref linkend="CDuce"/>,
is very elegant and powerful.  The syntax of such patterns
is an open question, especially in a Lisp-like language,
but here is one possibility:</para>
<programlisting>
(define (map-body fun
                  (xhtml:html
                   (xhtml:head h)
                   (xhtml:body b)))
  (xhtml:html (xhtml:head h)
              (xhtml:body (fun b))))
</programlisting>
<para>
This assumes that the <literal>xhtml:</literal> prefix
has been declared such that functions in that namespace
are element constructors.
The function <literal>map-body</literal> is defined to match
against a sequence of two values, where the first value is
a function that gets bound to the variable <literal>fun</literal>.
An element constructor in a pattern matches against an actual parameter
value constructed using that constructor, so the
second argument value must match an <literal>html</literal> element
that contains a <literal>head</literal> child followed by a
<literal>body</literal> child.  The formal parameter variables
<literal>h</literal> and <literal>b</literal> are matched against
the contents of those <literal>head</literal> and
<literal>body</literal> elements.
The body of the function applies the function <literal>fun</literal>
to the <literal>b</literal> value, and constructs modified
<literal>head</literal>, <literal>body</literal>, and <literal>html</literal>
elements.
</para>
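<para>
For comparison, here is roughly what such a pattern would have to do,
written out by hand in portable Scheme.
To keep the sketch self-contained, elements are modeled as plain lists
of the form <literal>(tag child ...)</literal> without namespaces,
which is a simplification and not the typed node representation argued
for earlier; <literal>element-of?</literal> is a hypothetical helper.</para>
<programlisting>
(import (scheme base))

(define (element-of? tag x)
  (and (pair? x) (eq? (car x) tag)))

(define (map-body fun doc)
  ;; Match: an html element with exactly
  ;; a head child followed by a body child.
  (if (and (element-of? 'html doc)
           (= (length doc) 3)
           (element-of? 'head (list-ref doc 1))
           (element-of? 'body (list-ref doc 2)))
      ;; Bind h and b to the children of
      ;; head and body, then rebuild.
      (let ((h (cdr (list-ref doc 1)))
            (b (cdr (list-ref doc 2))))
        (list 'html
              (cons 'head h)
              (cons 'body (fun b))))
      (error "map-body: no match" doc)))

(map-body reverse
  '(html (head "t") (body "a" "b")))
;; =&gt; (html (head "t") (body "b" "a"))
</programlisting>
<para>
The explicit tests and selectors are exactly the noise that a pattern
in the parameter list would remove.</para>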
<para>
Fitting keyword parameters into this matching model can be done in different ways.
We could follow Common Lisp in treating a keyword parameter
as a two-element sequence consisting of a keyword and a value.
However, taking apart a parameter list using a pattern is
probably easier if we treat a keyword-value-pair as a
combined <quote>attribute value</quote>.
This allows a keyword parameter to be a sequence.
In this model:
</para>
<programlisting>
(foo font: "Helvetica"
     style: (values 'bold 'italic))
</programlisting>
<para>
is syntactic sugar for:
</para>
<programlisting>
(foo (attribute 'font <if-twocolumn>
                </if-twocolumn>"Helvetica")
     (attribute 'style <if-twocolumn>
                </if-twocolumn>(values 'bold 'italic)))
</programlisting>
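<para>
A sketch of the desugared form in portable Scheme: an attribute value
is a record, so a callee can separate attributes from other arguments
by type.
The record definition and <literal>split-arguments</literal> are
assumptions for illustration, and the sequence value is modeled here
as a list.</para>
<programlisting>
(import (scheme base))

;; A combined keyword-value pair.
(define-record-type attribute-type
  (attribute name value)
  attribute?
  (name attribute-name)
  (value attribute-value))

;; Return two values: the attributes and the
;; remaining arguments, each in call order.
(define (split-arguments args)
  (let loop ((args args) (attrs '()) (rest '()))
    (cond ((null? args)
           (values (reverse attrs)
                   (reverse rest)))
          ((attribute? (car args))
           (loop (cdr args)
                 (cons (car args) attrs)
                 rest))
          (else
           (loop (cdr args)
                 attrs
                 (cons (car args) rest))))))

;; Example callee: list the attribute names.
(define (foo . args)
  (call-with-values
      (lambda () (split-arguments args))
    (lambda (attrs children)
      (map attribute-name attrs))))

(foo (attribute 'font "Helvetica")
     (attribute 'style '(bold italic))
     "text")
;; =&gt; (font style)
</programlisting>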
</sect1>

<sect1 id="Graphics">
<title>Graphics: Models and Views</title>
<para>
Mozilla's XUL and Microsoft's XAML languages
are convenient ways to describe the graphical layout and structure
of a GUI window as a hierarchical structure, using XML syntax similar
to expressing a web page in HTML.</para>
<programlisting>
&lt;button label="Yes"
        image="yes-image.png"
        oncommand="yes-action" /&gt;
</programlisting>
<para>
The behavior of the application has to be expressed using a different
programming language, for example JavaScript.
A better integrated language as described above could describe
both display and behavior more conveniently:</para>
<programlisting>
(button label: "Yes"
        image: "yes-image.png"
        oncommand: <if-twocolumn>
          </if-twocolumn>(lambda () <if-twocolumn>
            </if-twocolumn>(format #t <if-twocolumn>
              </if-twocolumn>"Yes button pressed!~%~!")))
</programlisting>
<para>
Both XUL and XAML describe the <quote>view</quote> aspect of an application,
but they don't support model-view separation.
Using a real programming language with variables and functions can do that.
The UI library defines two classes of <quote>GUI objects</quote>:
A <firstterm>model value</firstterm> is a collection of data.
It may have a default way it is displayed, but it can also be
transformed by an affine transform, and it may be displayed
multiple times at once.
A <firstterm>view value</firstterm> represents actual <quote>screen real estate</quote>:
an actual window or sub-window.  Views may be nested inside other views,
but any given view only appears once.
A model constructor is a function that returns a model,
while a view constructor is a function that returns a view.
The parameters to a view constructor may be other (usually nested) views,
model values (to be displayed in the view), or other values.
If a parameter of a view constructor is a model where a view is
expected, the model may be converted to a view using a default view constructor.
</para>
<para>
Here is a simple example, where an image (a model) is used twice,
once transformed, in a taskbar:
</para>
<programlisting>
(define left-arrow <if-twocolumn>
        </if-twocolumn>(image "left-arrow.png"))
(define right-arrow <if-twocolumn>
        </if-twocolumn>(flip-vertically left-arrow))
(taskbar
  (button label: "Back"
          image: left-arrow
          oncommand: back-command)
  (button label: "Next"
          image: right-arrow
          oncommand: next-command))
</programlisting>
<para>
Notice how various optional properties are specified using
keyword parameters.  Typesetting an article like this can
also use keyword parameters:</para>
<programlisting>
(paragraph slant: 'italic
  "This is important!")
</programlisting>
<para>
Alternatively one can use functions:
</para>
<programlisting>
(paragraph
  (italic "This")
  " is important!")
</programlisting>
<para>
The function <literal>italic</literal> returns
an <quote>italic version</quote> of the argument,
while the function <literal>color</literal> takes a color
followed by one or more arguments to be displayed in that color.
All of these work on models, and so the results are values
that can be displayed many times.
What does this mean?  Consider:</para>
<programlisting>
(define blue-this
  (color 'blue "This "))
(define warning
  (color 'red
    blue-this
    "is important!"))
blue-this
(italic warning)
</programlisting>
<para>
The result should be a blue non-italic <literal>This</literal>
followed by a blue italic <literal>This</literal>
followed by a red italic <literal>is important!</literal>.
That means a function like <literal>color</literal>
should change the default color, but it should not change the
color property of any characters that already have a color
property.  (There might be a separate <literal>force-color</literal>
function that does override any color properties in the arguments.)
An easy way to implement <literal>color</literal> is that it just
creates a data structure referencing the arguments.  When that value is typeset
or displayed using a <quote>graphics context</quote>,
we save the graphic context's current color, change the color,
display/typeset the arguments, and then restore the color.
Displaying or typesetting the arguments may involve nested
color or font changes.</para>
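<para>
The save, change, and restore scheme can be sketched in portable
Scheme.
In this sketch the graphics context is reduced to one global variable,
and drawing records <literal>(color . text)</literal> runs in a list
instead of painting; <literal>draw</literal>,
<literal>current-color</literal>, and <literal>runs</literal> are
hypothetical names, and <literal>italic</literal> is left out.</para>
<programlisting>
(import (scheme base))

;; Stand-in for the graphics context.
(define current-color 'black)
;; Recorded output, most recent first.
(define runs '())

;; Draw text in the current color; a nested
;; color value draws itself.
(define (draw item)
  (if (procedure? item)
      (item)
      (set! runs
            (cons (cons current-color item)
                  runs))))

;; The model just references its arguments;
;; the color changes only while drawing.
(define (color c . args)
  (lambda ()
    (let ((saved current-color))
      (set! current-color c)
      (for-each draw args)
      (set! current-color saved))))

(define blue-this (color 'blue "This "))
(define warning
  (color 'red blue-this "is important!"))

(draw blue-this)
(draw warning)
(reverse runs)
;; =&gt; ((blue . "This ")
;;     (blue . "This ")
;;     (red . "is important!"))
</programlisting>
<para>
Because the inner <literal>color</literal> sets its own color while it
is drawn, the enclosing red acts only as a default, which is the
behavior described above.</para>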
</sect1>

<sect1 id="Lexing">
<title>Lexical structure</title>
<para>
Most of these ideas are compatible with different lexical syntaxes,
though above I've assumed a Scheme-like syntax.
A C/Java-like syntax is also possible, with a few more changes, including
adding keyword function arguments.
I have also explored a more Haskell-like syntax, which uses juxtaposition
for function calls, and structure using white space and indentation.
It is also appealing to use juxtaposition for tuple concatenation,
though using juxtaposition for both application and concatenation
might be confusing.
</para>
</sect1>

<sect1 id="more">
<title>Links and more information</title>
<para>
The name <quote>Q2</quote> refers to the language and implementation
where I'm trying out these and other ideas.
For more on Q2 see <ulink url="http://gnu.org/software/kawa/q2"/>.
The implementation, such as it is, is included
in the <ulink url="http://gnu.org/software/kawa">Kawa</ulink>
source tree.</para>
</sect1>

<bibliography>
<title>Bibliography</title>

<biblioentry id="CDuce">
<abbrev>CDuce</abbrev>
<authorgroup>
<author><surname>Benzaken</surname></author>
<author><surname>Castagna</surname></author>
<author><surname>Frisch</surname></author>
</authorgroup>
<title>CDuce: an XML-Centric General-Purpose Language</title>
<bibliomisc>ICFP SIGPLAN 38(9)</bibliomisc>
<pubdate>2003</pubdate>
<bibliomisc><ulink url="http://www.cduce.org/"/></bibliomisc>
</biblioentry>

<biblioentry id="Kawa">
<abbrev>Kawa</abbrev>
<authorgroup>
<author><firstname>Per</firstname> <surname>Bothner</surname></author>
</authorgroup>
<title>Kawa: Compiling Scheme to Java</title>
<bibliomisc>Lisp Users Conference (Berkeley)</bibliomisc>
<pubdate>1998</pubdate>
<bibliomisc><ulink url="http://www.gnu.org/software/kawa/"/></bibliomisc>
</biblioentry>

<biblioentry id="XAML">
<abbrev>XAML</abbrev>
<corpauthor>Microsoft</corpauthor>
<title><quote>Longhorn</quote> Markup Language (code-named <quote>XAML</quote>) Overview</title>
<bibliomisc><if-html><ulink url="http://longhorn.msdn.microsoft.com/lhsdk/core/overviews/about%20xaml.aspx"/></if-html><if-tex><ulink url="http://&#x200B;longhorn.msdn.microsoft.com/&#x200B;lhsdk/&#x200B;core/&#x200B;overviews/&#x200B;about%20xaml.aspx"/></if-tex></bibliomisc>
</biblioentry>

<biblioentry id="XQuery">
<abbrev>XQuery</abbrev>
<title>XQuery 1.0: An XML Query Language</title>
<bibliomisc><ulink url="http://www.w3c.org/XML/Query"/></bibliomisc>
</biblioentry>

<biblioentry id="XUL">
<abbrev>XUL</abbrev>
<corpauthor>Mozilla</corpauthor>
<title>XML User Interface Language (XUL)</title>
<bibliomisc><if-html><ulink url="http://www.mozilla.org/projects/xul/"/></if-html><if-tex><ulink url="http://www.mozilla.org/&#x200B;projects/&#x200B;xul/"/></if-tex></bibliomisc>
</biblioentry>

</bibliography>

</article>
