Science With the Virtual Observatory
2006 Summer School

Introduction to XML: eXtensible Markup Language

Gretchen Greene (STScI) and Simon Krughoff (Univ. Pitt)

Overview

This lesson is a brief introduction to the Extensible Markup Language (XML) as a common tool for data manipulation and transmission. We cover the components that make up an XML document, how to form correct XML documents, and how to describe your own data types, with high-level examples of XML used throughout the VO framework. Finally, we examine an example XML document. The student exercise involves constructing a correct XML document.

========================================================================================

What is XML?

W3C Definition: Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML (ISO 8879). Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere.

XML is a text-based markup language.
XML goes beyond HTML: HTML deals only with presentation and carries no information about what the data means.

For example, in HTML the markup

<b>M31</b>
<i>2900</i>
<i>3.4</i>

could mean anything.

In XML we represent the same data as:

<Source>
    <Galaxy>
        <Name>M31</Name>
        <Distance> 2900 </Distance>
        <Brightness>3.4</Brightness>
    </Galaxy>
</Source>

 

This clearly expresses that M31 is the name of a galaxy with a distance of 2900 and a brightness of 3.4. We can go on to add more markup to the XML to include data types and so on.
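As a minimal sketch of why the named tags matter, the fragment below reads the Galaxy record and retrieves each value by its tag name. Python and its standard xml.etree.ElementTree module are our choice for illustration only; they are not part of the summer school software.

```python
import xml.etree.ElementTree as ET

# The Galaxy record from the text above, held in a string.
SOURCE_XML = """
<Source>
    <Galaxy>
        <Name>M31</Name>
        <Distance> 2900 </Distance>
        <Brightness>3.4</Brightness>
    </Galaxy>
</Source>
"""

root = ET.fromstring(SOURCE_XML)   # root is the <Source> element
galaxy = root.find("Galaxy")       # first <Galaxy> child of <Source>

# Each value is looked up by tag name instead of by position on a page.
name = galaxy.findtext("Name").strip()
distance = float(galaxy.findtext("Distance"))      # float() ignores the padding spaces
brightness = float(galaxy.findtext("Brightness"))

print(name, distance, brightness)
```

The same lookup is not possible with the HTML version, where nothing distinguishes the two italic numbers from each other.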

 

HTML has a limited, fixed set of tags, but XML can be extended: we can create our own tags.

Advantages of XML

Readability
   An XML document is plain text and human readable. Any simple text editor will suffice to view or edit XML documents.

Hierarchical
   An XML document has a tree structure that is powerful enough to express complex data and simple enough to understand.

Language Independent
   XML documents are language neutral. For example, a Java program can generate an XML document that is then parsed by a program written in C++ or Perl. In fact, this is the basis for the web services you will see throughout the NVOSS.

OS Independent
    XML files are Operating System independent.

Uses of XML

Messaging
   Applications or organizations exchange data with one another.

Database
   Data extracted from a database can be preserved along with its original descriptive information and used by more than one application in different ways. One application might simply display the data while another performs some complex calculation on it.

Service Oriented Architecture (SOA)
   To simplify the reuse of program components, individual services need to send and receive data in general formats. The neutral, generalized format of XML is ideal for this kind of data exchange.

XML in the VO

Throughout the remainder of this summer school you will see that this kind of structured data exchange has been widely adopted by the VO community. It is a key ingredient in the development of data models and standard formats for such things as the cone service, SIA (Simple Image Access), VO Registries, and more.

The IVOA Twiki is the central document repository for the standards. Below you will find several areas of VO framework development that use XML as the basis for data format and exchange. These specific XML models are currently implemented in several of the key VO applications that you will be using throughout the summer school.

Standard Schemas

Anatomy of an XML Document

The XML declaration

XML documents should start with the XML declaration (sometimes called the xml directive). This line states which version of XML is used in the remainder of the document. The first line in Listing 1 is the XML declaration.

Elements

XML, like HTML, is constructed using tags that define the elements of the document. Each portion of the document is set apart by a beginning and an ending tag. Listing 1 shows a short XML document with 5 separate elements. An element can have element content, mixed content, simple content, or empty content.

·  Element content elements contain only other elements. Example: <FAMILYTREE>

·  Mixed content elements contain both text and other elements. Example: <FAMILY>

·  Simple content elements contain only text. Example: <MOTHER>

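·  Empty content elements contain nothing between the beginning and ending tags, although they may carry attributes. Example: <CHILDREN>, which can equivalently be written in the short form <CHILDREN progeny="yes"/>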

<?xml version="1.0"?>

<FAMILYTREE>
        <FAMILY> Krughoff Family
               <MOTHER> Noell </MOTHER>
               <FATHER> Tom </FATHER>
               <CHILDREN progeny="yes"></CHILDREN>
        </FAMILY>
</FAMILYTREE>
 
Listing 1
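The four content types can be checked mechanically. The sketch below, written in Python with the standard xml.etree.ElementTree module (our choice for illustration), classifies every element of Listing 1. The helper names has_text and content_kind are hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET

LISTING_1 = """<?xml version="1.0"?>
<FAMILYTREE>
        <FAMILY> Krughoff Family
               <MOTHER> Noell </MOTHER>
               <FATHER> Tom </FATHER>
               <CHILDREN progeny="yes"></CHILDREN>
        </FAMILY>
</FAMILYTREE>
"""

def has_text(elem):
    """True if non-whitespace text appears directly inside elem."""
    # ElementTree stores text before the first child in elem.text and
    # text following each child in that child's tail.
    chunks = [elem.text] + [child.tail for child in elem]
    return any(chunk and chunk.strip() for chunk in chunks)

def content_kind(elem):
    """Classify elem using the four content types described above."""
    has_children = len(elem) > 0
    if has_children and has_text(elem):
        return "mixed"
    if has_children:
        return "element"
    if has_text(elem):
        return "simple"
    return "empty"

root = ET.fromstring(LISTING_1)
kinds = {elem.tag: content_kind(elem) for elem in root.iter()}
for tag, kind in kinds.items():
    print(tag, kind)
```

Whitespace used only for indentation is ignored here, which is why FAMILYTREE counts as element content even though line breaks separate its tags.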
          

Attributes

XML elements can also have attributes associated with them. Attributes are name-value pairs written inside the element's beginning tag rather than in its content, and the value must always be quoted. In Listing 1 the <CHILDREN> element has an attribute called 'progeny' with the value 'yes'. An element may have multiple attributes, as long as each attribute name appears only once on that element.

Attributes are frequently used in HTML. In XML, however, it is a good idea to avoid overusing attributes. A rule of thumb is: If the attribute contains data, use a child element instead. Attributes in XML can be useful for storing identifier numbers in documents where many instances of the same element occur.
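The rule of thumb can be illustrated with a short Python sketch using the standard xml.etree.ElementTree module. The record below is hypothetical: the id attribute is an identifier added for this example, while the actual data sits in child elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the id attribute identifies the element,
# while the data lives in child elements.
DOC = """
<FAMILY id="1">
    <MOTHER>Noell</MOTHER>
    <FATHER>Tom</FATHER>
    <CHILDREN progeny="yes"></CHILDREN>
</FAMILY>
"""

family = ET.fromstring(DOC)

family_id = family.get("id")                      # attribute values are always strings
progeny = family.find("CHILDREN").get("progeny")  # attribute on a child element
surname = family.get("surname", "unknown")        # default for an absent attribute
mother = family.findtext("MOTHER")                # data held in a child element

print(family_id, progeny, surname, mother)
```

Because an attribute can hold only a single string, data that may later need structure of its own (for example several given names) is easier to extend when it is stored as a child element.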

Comments

Comments in XML documents are written the same way as in HTML documents: <!-- begins the comment and --> ends it. The string -- may not appear inside a comment.

Special Characters

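Similar to HTML, XML treats some characters as special. The characters < and & cannot appear literally in the text of elements or in attribute values, and a quote character cannot appear literally inside an attribute value delimited by that same quote. These characters must be replaced by the appropriate entity reference. XML predefines five: &lt; for <, &gt; for >, &amp; for &, &quot; for ", and &apos; for '. For example, & must be written as &amp;.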
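Most XML libraries provide helpers so that the translation need not be done by hand. As a minimal sketch, Python's standard xml.sax.saxutils module offers escape for element text and quoteattr for attribute values. The NOTE element and its size attribute are made up for this example.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

raw_text = "R&D <draft>"
raw_attr = '5" scope'

escaped = escape(raw_text)      # replaces &, < and > with entity references
quoted = quoteattr(raw_attr)    # escapes the value and adds suitable quotes

# Build a small document by hand from the escaped pieces.
doc = "<NOTE size={}>{}</NOTE>".format(quoted, escaped)
print(doc)

# A parser translates the entity references back to the original characters.
note = ET.fromstring(doc)
print(note.text, note.get("size"))
```

Parsing the result recovers the original strings, which is a quick way to confirm that nothing was lost in the escaping.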

Namespaces and Grammars

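In order to modularize and categorize XML documents, XML documents may contain references to a namespace. A namespace is identified by a URI and qualifies the element and attribute names that belong to it, so that identically named elements from different vocabularies can be told apart within one document. A technical description of namespaces can be found in the W3C Recommendation "Namespaces in XML".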
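A small sketch may help here. The document below mixes two hypothetical vocabularies that both define a NAME element; the example.org URIs and the gal and obs prefixes are placeholders invented for this illustration, not real VO namespaces. Python's standard xml.etree.ElementTree module is used to read it.

```python
import xml.etree.ElementTree as ET

# Two hypothetical vocabularies that both define a NAME element.
DOC = """
<CATALOG xmlns:gal="http://example.org/galaxy"
         xmlns:obs="http://example.org/observer">
    <gal:NAME>M31</gal:NAME>
    <obs:NAME>Noell</obs:NAME>
</CATALOG>
"""

root = ET.fromstring(DOC)

# Map prefixes to namespace URIs for use in searches. The prefixes are
# local shorthand; only the URIs have to match the document.
ns = {"gal": "http://example.org/galaxy", "obs": "http://example.org/observer"}

galaxy_name = root.findtext("gal:NAME", namespaces=ns)
observer_name = root.findtext("obs:NAME", namespaces=ns)

# Internally each tag is stored together with its namespace URI.
tags = [child.tag for child in root]
print(galaxy_name, observer_name)
print(tags)
```

The two NAME elements never collide because the parser treats the namespace URI as part of the element name.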

Just as XML documents may be, but do not have to be, associated with a namespace, they may also have an associated document which describes the acceptable blocks within the document. There are two options for document description.

Document Type Definition (DTD): A DTD describes the structure of an XML document, but is itself written in a separate syntax that is not XML.

XML Schema (XSD): The XML Schema document is written in XML and serves the same purpose as the DTD, but is richer and extensible. The VO is currently implementing this form of description for the standard metadata models.

Both the DTD and XML Schema describe the form that a valid XML document should take. For example, they indicate which elements are valid children, which element is the root node, and even the datatype associated with an element. Schemas and DTDs are discussed further in the next session.

Features of XML

Well Formed, Parseable

Unlike HTML, XML must be strictly well formed. As an example, most HTML parsers will tolerate ending tags that are left off of one-line elements. This is not true of XML: every beginning tag must have an associated ending tag. The following is a list of several of the most important aspects of well-formedness.

  • Every start-tag must have a matching end-tag
  • Tags can't overlap
  • There can be only one root element
  • Names must obey the XML naming conventions
  • XML is case sensitive
  • Whitespace is preserved

Listing 2 shows a few examples of common gotchas associated with XML well-formedness.

Case sensitive:
<TAG> This is incorrect </tag>
<TAG> This is correct </TAG>

Overlapping tags:
<TAG1>
        <TAG2>
               This is incorrect
        </TAG1>
</TAG2>
<TAG1>
        <TAG2>
               This is correct
        </TAG2>
</TAG1>

Single root node:
Incorrect:
<?xml version="1.0"?>
<FAMILY> Krughoff Family
        <MOTHER> Noell </MOTHER>
        <FATHER> Tom </FATHER>
        <CHILDREN progeny="yes"></CHILDREN>
</FAMILY>
<FAMILY> Worland Family
        <MOTHER> Wilhelmina </MOTHER>
        <FATHER> Vincent </FATHER>
        <CHILDREN progeny="yes"></CHILDREN>
</FAMILY>
Correct:
<?xml version="1.0"?>
<FAMILYTREE>
        <FAMILY> Krughoff Family
               <MOTHER> Noell </MOTHER>
               <FATHER> Tom </FATHER>
               <CHILDREN progeny="yes"></CHILDREN>
        </FAMILY>
        <FAMILY> Worland Family
               <MOTHER> Wilhelmina </MOTHER>
               <FATHER> Vincent </FATHER>
               <CHILDREN progeny="yes"></CHILDREN>
        </FAMILY>
</FAMILYTREE>
 
Listing 2
          

Well formedness has the primary benefit of making XML easily human readable and easy to parse by machine.
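Any XML parser will report these mistakes. The sketch below uses Python's standard xml.etree.ElementTree module to test shortened versions of the gotchas from Listing 2. The helper name is_well_formed is hypothetical and defined only for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

# Shortened versions of the gotchas from Listing 2.
samples = {
    "case mismatch": "<TAG> text </tag>",
    "overlapping tags": "<TAG1><TAG2> text </TAG1></TAG2>",
    "two root elements": "<FAMILY></FAMILY><FAMILY></FAMILY>",
    "correct nesting": "<TAG1><TAG2> text </TAG2></TAG1>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(label, "->", "well formed" if ok else "not well formed")
```

Note that this checks well-formedness only. It says nothing about whether the document follows a particular DTD or XML Schema.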

Extensible

There are no predefined elements in XML. This allows the user to define all the elements in use. It also makes it easy to extend XML documents to handle complex datatypes as they come into use. We touched on this briefly when mentioning the schema descriptions being developed in the IVOA.

Hierarchical

XML is inherently hierarchical. Every element is a parent, a child, or both. The root element is the only element that has no parent. The hierarchical nature of XML makes it directly applicable to hierarchical data structures such as trees or tables.

Plain Text

XML is a plain text format. Thus, by nature, XML is human readable, easy to send by e-mail, and easy to edit. XML documents are also easy to create from a program by way of text substitution.
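Text substitution can be sketched in a few lines of Python using the standard string.Template class. The template and the make_family helper are hypothetical names for this example. The values are escaped before substitution so that special characters cannot break the document.

```python
from string import Template
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# A fixed XML skeleton with placeholders for the variable parts.
FAMILY_TEMPLATE = Template(
    "<FAMILY>\n"
    "    <MOTHER>$mother</MOTHER>\n"
    "    <FATHER>$father</FATHER>\n"
    "</FAMILY>"
)

def make_family(mother, father):
    """Fill the template, escaping special characters in the values."""
    return FAMILY_TEMPLATE.substitute(mother=escape(mother), father=escape(father))

family_xml = make_family("Noell", "Tom & Co.")
print(family_xml)

# Parsing the result confirms that the generated text is well formed.
parsed = ET.fromstring(family_xml)
print(parsed.findtext("FATHER"))
```

Substitution like this is convenient for small, fixed layouts. For larger or nested documents a library that builds the tree and serializes it is usually less error prone.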

An Example XML Document: The Family Tree

The XML Hierarchy

<TREE>
   <FAMILY>
      <MOTHER> Billy </MOTHER>
      <FATHER> Vincent </FATHER>
      <CHILDREN>
         <SON> Peter </SON>
         <DAUGHTER> Sue </DAUGHTER>
         <FAMILY>
            <MOTHER progeny="true"> Noell </MOTHER>
            <FATHER progeny="false"> Tom </FATHER>
            <CHILDREN>
               <SON> Simon </SON>
               <DAUGHTER> Laura </DAUGHTER>
               <SON> Stephen </SON>
               <FAMILY>
                  <MOTHER progeny="true"> Emily </MOTHER>
                  <FATHER progeny="false"> Jarrod </FATHER>
                  <CHILDREN>
                     <SON> Henry </SON>
                  </CHILDREN>
               </FAMILY>
            </CHILDREN>
         </FAMILY>
      </CHILDREN>
   </FAMILY>
</TREE>
 
Listing 3
                                 

We will use the family tree to exemplify the hierarchy inherent in XML. Figure 1 shows a tree representation of the same XML document.

[Figure 1: tree representation of the family tree in Listing 3]

The family tree is a good metaphor for the XML hierarchy, partly because of similar terminology. Nodes closer to the root node are referred to as parent nodes to those directly below them. The lower nodes are known as child or progeny nodes. Nodes with the same parent node are siblings.
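The parent, child, and sibling relationships can be made visible by walking the tree. The sketch below uses Python's standard xml.etree.ElementTree module on a shortened version of the family tree. The walk function is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET

# A shortened version of the family tree.
TREE = """
<TREE>
   <FAMILY>
      <MOTHER> Noell </MOTHER>
      <FATHER> Tom </FATHER>
      <CHILDREN>
         <SON> Simon </SON>
         <DAUGHTER> Laura </DAUGHTER>
      </CHILDREN>
   </FAMILY>
</TREE>
"""

def walk(elem, depth=0):
    """Yield (depth, tag) for elem and all of its descendants."""
    yield depth, elem.tag
    for child in elem:              # iterating an element visits its children
        yield from walk(child, depth + 1)

root = ET.fromstring(TREE)
outline = list(walk(root))
for depth, tag in outline:
    print("   " * depth + tag)      # indent each tag by its depth

# Elements that share a parent are siblings.
children = root.find("FAMILY/CHILDREN")
siblings = [child.tag for child in children]
print(siblings)
```

The printed outline reproduces the indentation of the XML listing, with the root node at depth 0.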

XML Parsing

One of the main challenges in working with XML is establishing the methods for reading and writing it. Even when the formats come with schemas and well-described content, we still need to create documents in the proper format and digest them through APIs. Fortunately several packages are available, including the JAVOT and SAVOT Java parsers, which you will see throughout the NVOSS software exercises. The commercial software industry also provides several libraries for handling XML (parsers, serializers, etc.), and they largely follow the two methods we discuss here.

DOM (Document Object Model) parser - Tree Structure Based API:
    The DOM parser implements the DOM API and creates a DOM tree in memory for an XML document: node types...

[Figure: DOM tree]
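The tree-based approach can be sketched with Python's standard xml.dom.minidom module, which implements a subset of the DOM API. This is an illustration only; the summer school examples themselves use Java parsers. The small FAMILY document is made up for the example.

```python
from xml.dom.minidom import parseString

# The whole document is parsed into an in-memory tree of nodes.
doc = parseString("<FAMILY><MOTHER>Noell</MOTHER><FATHER>Tom</FATHER></FAMILY>")
family = doc.documentElement

# Read: the text of an element is itself a child node of type text.
mother = family.getElementsByTagName("MOTHER")[0]
mother_name = mother.firstChild.data

# Modify: create a new element and attach it to the tree.
son = doc.createElement("SON")
son.appendChild(doc.createTextNode("Simon"))
family.appendChild(son)

# Traverse backwards: step from the new node to its previous sibling.
previous_tag = son.previousSibling.tagName

# Serialize the modified tree back to text.
result = family.toxml()
print(mother_name, previous_tag)
print(result)
```

Reading, modifying, and moving in any direction are all possible because every node stays in memory, which is also why the approach becomes costly for very large files.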

SAX (Simple API for XML) parser - Event Based API:
    The SAX parser implements the SAX API, which is an event-driven interface. As it parses, it invokes callback methods on handlers such as StartElement and EndElement for the beginning and ending tags...
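The event-driven style can be sketched with Python's standard xml.sax module, where the callbacks are named startElement and endElement. The TagCounter handler and the sample document are hypothetical and written only for this illustration.

```python
import xml.sax

class TagCounter(xml.sax.ContentHandler):
    """Count how often each element occurs and track the nesting depth."""

    def __init__(self):
        super().__init__()
        self.counts = {}
        self.max_depth = 0
        self._depth = 0

    def startElement(self, name, attrs):
        # Called by the parser each time a beginning tag is read.
        self._depth += 1
        self.max_depth = max(self.max_depth, self._depth)
        self.counts[name] = self.counts.get(name, 0) + 1

    def endElement(self, name):
        # Called by the parser each time an ending tag is read.
        self._depth -= 1

DOC = b"<FAMILY><MOTHER>Noell</MOTHER><SON>Simon</SON><SON>Stephen</SON></FAMILY>"

handler = TagCounter()
xml.sax.parseString(DOC, handler)
print(handler.counts, handler.max_depth)
```

The handler keeps only the running totals it needs. The document itself is never stored, so memory use does not grow with file size.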

When to use DOM parser

  • Manipulate the document
  • Traverse the document back and forth
  • Small XML files

Drawbacks of DOM parser

  • Consumes a lot of memory, since the whole document is held in memory as a tree

When to use SAX parser

  • No structural modification
  • Huge XML files, which makes SAX the most practical choice when working with large datasets

Validating And Non Validating
DOM and SAX parsers can each be either validating or non-validating.
A validating parser checks the XML file against the rules imposed by a DTD or XML Schema.
A non-validating parser does not check the XML file against a DTD or XML Schema.
Both validating and non-validating parsers check that the XML document is well formed.

NVOSS Parser code examples

A couple of examples have been provided for DOM and SAX parsing of a VOTable. Go to the following directory in the NVOSS software package. You can look at the code in your editor to compare the different methods of the Java APIs.

nvoss2006\java\src\XMLparse

To run the examples you can use the ant build script. For example, to run and test the parsers, type one of the following at the prompt:

ant testSAX
ant testDOM

Example File

A more complete version of this family tree model, with additional tags and attributes, is available in the nvoss xmlparse directory (simongen.xml). You may use it as a starting point for completing the student exercise.

Student Exercise

Write an XML description of your own family tree. You do not have to use the same hierarchy that is used in the example. Feel free to use the extensibility of XML to create a unique set of elements.

Useful Links:

http://www.w3.org/XML/

XML Checker URL based

XML Checker file test option

XML tutorial

 

The NVO Summer School is made possible through the support of the National Science Foundation and the National Aeronautics and Space Administration.