Science With the Virtual Observatory
Gretchen Greene (STScI) and Simon Krughoff (Univ. Pitt)
This lesson is a brief introduction to the EXtensible Markup Language (XML) as a common tool for data manipulation and transmission. We will go over the basic description formats of XML documents, explore the components that make up an XML document and how to form correct XML documents, explain how to describe your own data types, and provide high-level examples of XML used throughout the VO framework. Finally, we will examine an example XML document. The student exercise involves construction of a correct XML document.
We shall look at :
========================================================================================
W3C Definition: EXtensible Markup Language (XML) is a simple, very flexible text format derived from SGML (ISO 8879). Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere.
XML is a text-based markup language.
XML is more than HTML because HTML only deals with presentation and carries no information about the data itself.
In HTML, for example,
<b>M31</b>
<i>2900</i>
<i>3.4</i>
could mean anything.
In XML we represent it as:
<Source>
<Galaxy>
<Name>M31</Name>
<Distance> 2900 </Distance>
<Brightness>3.4</Brightness>
</Galaxy>
</Source>
This clearly expresses that M31 is the name of a galaxy with a distance of 2900 and a brightness of 3.4. We can go on to add more markup to the XML to include data types and so on.
HTML has a limited set of tags, but XML can be extended (i.e. we can create our own tags).
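Because the tags name the data they hold, a program can pull values out by name rather than by position. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the function name read_galaxy and the choice to read the numbers as floats are assumptions made for illustration, since the example document states no data types or units.

```python
import xml.etree.ElementTree as ET

# The Source/Galaxy document from the text, with well-formed closing tags.
SOURCE_XML = """<Source>
  <Galaxy>
    <Name>M31</Name>
    <Distance> 2900 </Distance>
    <Brightness>3.4</Brightness>
  </Galaxy>
</Source>"""


def read_galaxy(xml_text):
    """Return the galaxy's name, distance and brightness looked up by tag name."""
    root = ET.fromstring(xml_text)
    galaxy = root.find("Galaxy")
    return {
        "name": galaxy.findtext("Name").strip(),
        # Assumption: both values are plain numbers; float() ignores the
        # surrounding whitespace in " 2900 ".
        "distance": float(galaxy.findtext("Distance")),
        "brightness": float(galaxy.findtext("Brightness")),
    }


galaxy = read_galaxy(SOURCE_XML)
print(galaxy)
```

The same lookup against the HTML version would have to guess which <i> element holds which quantity.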
Readability
An XML document is plain text and human readable. Any simple text editor will suffice to edit or view XML documents.
Hierarchical
An XML document has a tree structure, which is powerful enough to express complex data and simple enough to understand.
Language Independent
XML documents are language neutral. For example, a Java program can generate an XML document which can be parsed by a program written in C++ or Perl. In fact, this is the basis for the web services you will see throughout the NVOSS.
OS Independent
XML files are Operating System independent.
Messaging
Where applications or organizations exchange data between them (e.g. Astronomical Data Models).
Database
The data extracted from the database can be preserved with its original information and used by more than one application in different ways. One application might just display the data, while another might perform some complex calculation on it.
Service Oriented Architecture (SOA)
A neutral and generalized format is ideal for data exchange: to simplify the reuse of program components, the individual services need to send and receive data in general formats.
Throughout the remainder of this summer school you will see this structured data exchange has been widely adopted throughout the VO community and is a key ingredient to the development of data models and standard formats for such things as the cone service, SIA (Simple Image Access), VO Registries, and more.
The IVOA Twiki is the central document repository for the standards. Below you will find several areas of VO framework development which use XML as the basis of data format and exchange. These specific XML models are currently implemented in several of the key VO applications which you will be using throughout the summer school.
Standard Schemas
XML documents should start with the XML declaration, referred to here as the xml directive. This line states which version of XML is used in the remainder of the document and, optionally, the character encoding. The first line in Listing 1 is the xml directive.
Examples of Directives:
<?xml version="1.0"?>
<?xml version="1.0" encoding="UTF-8"?>
· Element content elements contain only other elements. Example: <FAMILYTREE>
· Mixed content elements contain both text and other elements. Example: <FAMILY>
· Simple content elements contain only text. Example: <MOTHER>
· Empty content elements contain no information within the tag block. Example: <CHILDREN>
<?xml version="1.0"?>
<FAMILYTREE>
<FAMILY> Krughoff Family
<MOTHER> Noell </MOTHER>
<FATHER> Tom </FATHER>
<CHILDREN progeny="yes"></CHILDREN>
</FAMILY>
</FAMILYTREE>
Listing 1
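The four content types can be told apart mechanically. The sketch below parses Listing 1 with Python's standard xml.etree.ElementTree module and labels each element; the helper name content_type is hypothetical, and treating whitespace-only text as layout rather than content is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

LISTING_1 = """<?xml version="1.0"?>
<FAMILYTREE>
  <FAMILY> Krughoff Family
    <MOTHER> Noell </MOTHER>
    <FATHER> Tom </FATHER>
    <CHILDREN progeny="yes"></CHILDREN>
  </FAMILY>
</FAMILYTREE>"""


def content_type(element):
    """Classify an element as element, mixed, simple or empty content."""
    children = list(element)
    # Text can sit before the first child (element.text) or after any child
    # (child.tail). Whitespace used only for layout is ignored.
    pieces = [element.text] + [child.tail for child in children]
    has_text = any(piece and piece.strip() for piece in pieces)
    if children and has_text:
        return "mixed"
    if children:
        return "element"
    if has_text:
        return "simple"
    return "empty"


root = ET.fromstring(LISTING_1)
kinds = {element.tag: content_type(element) for element in root.iter()}
for tag, kind in kinds.items():
    print(f"{tag}: {kind}")
```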
XML elements can also have attributes associated with them. Attributes are name-value pairs written inside the element's start tag rather than within the tag block. In Listing 1 the <CHILDREN> element has an attribute called 'progeny' with the value 'yes'. Each element may have multiple attributes.
Example of a VOTable FIELD element with multiple attributes:
<FIELD ID="identifier" datatype="char" name="identifier" ucd="ID_MAIN" arraysize="*"/>
Attributes are frequently used in HTML. In XML, however, it is a good idea to avoid overusing attributes.
A rule of thumb is: If the attribute contains data, use a child element instead. Attributes in XML can be useful for storing identifier numbers in documents where many instances of the same element occur.
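Most parsers expose attributes as a simple name-to-value mapping. As a small sketch, again with Python's standard xml.etree.ElementTree module, here is the FIELD element from the example above being read; the 'unit' attribute queried at the end is hypothetical and included only to show what happens when an attribute is absent.

```python
import xml.etree.ElementTree as ET

FIELD_XML = (
    '<FIELD ID="identifier" datatype="char" name="identifier" '
    'ucd="ID_MAIN" arraysize="*"/>'
)

field = ET.fromstring(FIELD_XML)

# All attributes of the element, as a dictionary of strings.
print(field.attrib)

# A single attribute by name.
ucd = field.get("ucd")
print(ucd)

# A missing attribute yields None unless a default is supplied.
unit = field.get("unit")
print(unit)
```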
Comments in XML documents are the same as in HTML documents: <!-- begins the comment and --> ends the comment.
Similar to HTML, XML treats some characters as special. The characters <, >, &, ", and ' are among those with special meaning in markup; in particular, < and & cannot appear literally in the text of elements. They must be translated to the appropriate entity reference. For example: & must be written as &amp;, and < as &lt;.
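Escaping rarely needs to be done by hand. As an illustration, Python's standard xml.sax.saxutils module provides escape and unescape functions; the sample strings below are made up for this sketch.

```python
from xml.sax.saxutils import escape, unescape

raw = "Tom & Noell <parents>"

# By default escape() replaces &, < and >.
escaped = escape(raw)
print(escaped)  # Tom &amp; Noell &lt;parents&gt;

# unescape() reverses the translation.
restored = unescape(escaped)
print(restored)

# Quotes are only replaced when the extra entities are passed in explicitly,
# which is useful for text destined for an attribute value.
quoted = escape('say "hi"', {'"': "&quot;", "'": "&apos;"})
print(quoted)  # say &quot;hi&quot;
```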
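In order to modularize and categorize XML documents, XML documents may contain references to a namespace. A namespace is identified by a URI and qualifies the element and attribute names in the document, so that names drawn from different vocabularies do not collide. A technical description of namespaces can be found in the W3C Namespaces in XML recommendation.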
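To see what a namespace does to element names, consider the sketch below, written with Python's standard xml.etree.ElementTree module. The namespace URI http://example.org/familytree and the prefix ft are hypothetical and chosen only for illustration; they are not part of any VO standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, used only for illustration.
EXAMPLE_NS = "http://example.org/familytree"

DOC = f"""<ft:FAMILYTREE xmlns:ft="{EXAMPLE_NS}">
  <ft:FAMILY>
    <ft:MOTHER>Noell</ft:MOTHER>
  </ft:FAMILY>
  <FAMILY>not in the namespace</FAMILY>
</ft:FAMILYTREE>"""

root = ET.fromstring(DOC)

# The parser expands the prefix: the full name is the URI plus the local name.
print(root.tag)  # {http://example.org/familytree}FAMILYTREE

# Searches must say which namespace they mean; the prefix used in the query
# is mapped to the URI and need not match the prefix used in the document.
namespaces = {"ft": EXAMPLE_NS}
qualified = root.findall("ft:FAMILY", namespaces)
unqualified = root.findall("FAMILY")
mother = root.find("ft:FAMILY/ft:MOTHER", namespaces).text
print(len(qualified), len(unqualified), mother)
```

The two FAMILY elements share a local name but are distinct elements as far as the parser is concerned, which is the point of the mechanism.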
Just as XML documents may be, but do not have to be, associated with a namespace, they may also have an associated document which describes the acceptable blocks within the document. There are two options for document description.
Document Type Definition (DTD): This is a document which is not strict XML but describes an XML document.
XML Schema (XSD): The XML Schema document is written in XML and serves the same purpose as the DTD, but is richer and extensible. The VO is currently implementing this form of description for the standard metadata models.
Both the DTD and XML Schema describe the form that a valid XML document should take. For example, they indicate which elements are valid children, which element is the root node, and even the datatype associated with an element. More discussion of Schema and DTDs will be carried out in the next session.
Unlike HTML, XML must be strictly well formed. As an example, most HTML parsers will ignore ending tags if they are left off of one line elements. This is not true of XML. All beginning tags must have an associated ending tag. Following is a list of several of the most important aspects of well formedness.
The following listing shows a few examples of common gotchas associated with XML well formedness.
Case sensitive:
<TAG> This is incorrect </tag>
<TAG> This is correct </TAG>
Overlapping tags:
<TAG1>
<TAG2>
This is incorrect </TAG1> </TAG2>
<TAG1>
<TAG2>
This is correct </TAG2> </TAG1>
Single root node:
Incorrect:
<?xml version="1.0"?>
<FAMILY> Krughoff Family
<MOTHER> Noell </MOTHER>
<FATHER> Tom </FATHER>
<CHILDREN progeny="yes"></CHILDREN>
</FAMILY>
<FAMILY> Worland Family
<MOTHER> Wilhelmina </MOTHER>
<FATHER> Vincent </FATHER>
<CHILDREN progeny="yes"></CHILDREN>
</FAMILY>
Correct:
<?xml version="1.0"?>
<FAMILYTREE>
<FAMILY> Krughoff Family
<MOTHER> Noell </MOTHER>
<FATHER> Tom </FATHER>
<CHILDREN progeny="yes"></CHILDREN>
</FAMILY>
<FAMILY> Worland Family
<MOTHER> Wilhelmina </MOTHER>
<FATHER> Vincent </FATHER>
<CHILDREN progeny="yes"></CHILDREN>
</FAMILY>
</FAMILYTREE>
Well formedness has the primary benefit of making XML easily human readable and easy to parse by machine.
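Any XML parser will reject a document that is not well formed, so the gotchas above are easy to check by machine. The sketch below uses Python's standard xml.etree.ElementTree module; the helper name is_well_formed is hypothetical, and the family examples are shortened versions of the ones in the listing.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


examples = {
    "case mismatch": "<TAG> This is incorrect </tag>",
    "matching case": "<TAG> This is correct </TAG>",
    "overlapping tags": "<TAG1><TAG2> This is incorrect </TAG1></TAG2>",
    "nested tags": "<TAG1><TAG2> This is correct </TAG2></TAG1>",
    "two root nodes": "<FAMILY>Krughoff</FAMILY><FAMILY>Worland</FAMILY>",
    "single root node": (
        "<FAMILYTREE><FAMILY>Krughoff</FAMILY>"
        "<FAMILY>Worland</FAMILY></FAMILYTREE>"
    ),
}

results = {label: is_well_formed(text) for label, text in examples.items()}
for label, ok in results.items():
    print(f"{label}: {'well formed' if ok else 'NOT well formed'}")
```

Note that this only tests well formedness; it says nothing about whether the document is valid against a DTD or XML Schema.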
There are no predefined elements in XML. This allows the user to define all the elements in use. It also makes it easy to extend XML documents to handle complex datatypes as they come into use. We touched on this briefly when mentioning the schema descriptions being developed in the IVOA.
XML is inherently hierarchical. Each element is either a parent or a child element or both. The only element which is only a parent element is the root element. The hierarchical nature of XML makes it directly applicable to hierarchical datatypes like trees or tables.
XML is a plain text protocol. Thus, by nature, XML is human readable, mailable, and easily editable. XML documents are also easy to create from a program by way of text substitution.
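As a sketch of creating XML by text substitution, the following fills a small template using Python's standard string.Template class. The template and the function name make_family are hypothetical. The values are escaped before substitution, because raw substitution of text containing & or < would produce a document that is not well formed.

```python
from string import Template
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# A fixed skeleton with named placeholders for the variable text.
FAMILY_TEMPLATE = Template(
    "<FAMILY>\n"
    "  <MOTHER>$mother</MOTHER>\n"
    "  <FATHER>$father</FATHER>\n"
    "</FAMILY>"
)


def make_family(mother, father):
    """Fill the template, escaping special characters in the values first."""
    return FAMILY_TEMPLATE.substitute(
        mother=escape(mother), father=escape(father)
    )


family_xml = make_family("Noell", "Tom & Co.")
print(family_xml)

# Parsing the result confirms it is well formed and the text round-trips.
parsed = ET.fromstring(family_xml)
print(parsed.findtext("FATHER"))
```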
<TREE>
  <FAMILY>
    <MOTHER> Billy </MOTHER>
    <FATHER> Vincent </FATHER>
    <CHILDREN>
      <SON> Peter </SON>
      <DAUGHTER> Sue </DAUGHTER>
      <FAMILY>
        <MOTHER progeny="true"> Noell </MOTHER>
        <FATHER progeny="false"> Tom </FATHER>
        <CHILDREN>
          <SON> Simon </SON>
          <DAUGHTER> Laura </DAUGHTER>
          <SON> Stephen </SON>
          <FAMILY>
            <MOTHER progeny="true"> Emily </MOTHER>
            <FATHER progeny="false"> Jarrod </FATHER>
            <CHILDREN>
              <SON> Henry </SON>
            </CHILDREN>
          </FAMILY>
        </CHILDREN>
      </FAMILY>
    </CHILDREN>
  </FAMILY>
</TREE>
We will use the family tree to exemplify the hierarchy inherent in XML. Figure 1 shows a tree representation of the same XML document.

The family tree is a good metaphor for the XML hierarchy, partly because of similar terminology. Nodes closer to the root node are referred to as parent nodes to those directly below them. The lower nodes are known as child or progeny nodes. Nodes with the same parent node are siblings.
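The parent, child and sibling relationships can be explored directly once a document is parsed. The sketch below uses Python's standard xml.etree.ElementTree module on an abbreviated copy of the family tree (the first two generations only); the function name outline and the parent_of lookup table are constructs of this sketch, not part of the lesson's software.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the family tree document: two generations only.
FAMILY_TREE = """<TREE>
<FAMILY>
<MOTHER> Billy </MOTHER>
<FATHER> Vincent </FATHER>
<CHILDREN>
<SON> Peter </SON>
<DAUGHTER> Sue </DAUGHTER>
<FAMILY>
<MOTHER progeny="true"> Noell </MOTHER>
<FATHER progeny="false"> Tom </FATHER>
</FAMILY>
</CHILDREN>
</FAMILY>
</TREE>"""


def outline(element, depth=0):
    """Return one line per element, indented two spaces per level of depth."""
    text = (element.text or "").strip()
    lines = ["  " * depth + element.tag + (f": {text}" if text else "")]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(FAMILY_TREE)
lines = outline(root)
print("\n".join(lines))

# ElementTree elements hold no link to their parent, so build a lookup table.
parent_of = {child: parent for parent in root.iter() for child in parent}

peter = root.find(".//SON")  # first SON in document order
siblings = [node.tag for node in parent_of[peter] if node is not peter]
print(parent_of[peter].tag, siblings)
```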
In working with XML, one of the main challenges is establishing the methods for reading and writing. While the formats include schemas and well-described content, we still need to create documents in the proper format and digest them in the APIs. Fortunately, several packages are available, including the JAVOT and SAVOT Java parsers which you will see throughout the NVOSS software exercises. The commercial software industry also provides several libraries for handling XML (parsers, serializers, etc.), and they largely follow the two methods we discuss here.
DOM (Document Object Model) parser - Tree Structure based API:
The DOM parser implements the DOM API and creates a DOM tree in memory for an XML document: node types...
SAX (Simple API for XML) parser - Event Based API:
The SAX parser implements the SAX API, which is an event-driven interface. As it parses, it invokes callback methods and uses handlers for StartElement, EndElement, the beginning and ending tags....
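The difference between the two styles is easiest to see side by side. The sketch below uses Python's standard xml.dom.minidom and xml.sax modules as stand-ins for the Java parsers mentioned above, since both follow the same DOM and SAX interfaces; it reads Listing 1 rather than a VOTable, and the class name ElementLogger is hypothetical.

```python
import xml.dom.minidom
import xml.sax

LISTING_1 = """<?xml version="1.0"?>
<FAMILYTREE>
  <FAMILY> Krughoff Family
    <MOTHER> Noell </MOTHER>
    <FATHER> Tom </FATHER>
    <CHILDREN progeny="yes"></CHILDREN>
  </FAMILY>
</FAMILYTREE>"""

# DOM: the whole document is first loaded into an in-memory tree,
# and the program then navigates that tree in any order it likes.
document = xml.dom.minidom.parseString(LISTING_1)
mother_node = document.getElementsByTagName("MOTHER")[0]
dom_mother = mother_node.firstChild.data.strip()
children_node = document.getElementsByTagName("CHILDREN")[0]
dom_progeny = children_node.getAttribute("progeny")
print(dom_mother, dom_progeny)


class ElementLogger(xml.sax.ContentHandler):
    """SAX: the parser calls these methods as it meets each tag in turn."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))


# SAX: no tree is built; the handler sees a stream of events.
handler = ElementLogger()
xml.sax.parseString(LISTING_1.encode("utf-8"), handler)
for event in handler.events:
    print(event)
```

In the DOM half the program asks the tree for what it wants after parsing is complete; in the SAX half the program must record what it needs as the events arrive, because nothing is retained afterwards.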
When to use DOM parser
When to use SAX parser
Validating And Non Validating
DOM and SAX parsers can each be either validating or non-validating.
A validating parser checks the XML file against the rules imposed by a DTD or XML Schema.
A non-validating parser doesn't validate the XML file against a DTD or XML Schema.
Both validating and non-validating parsers check the well formedness of the XML document.
A couple of examples have been provided for DOM and SAX parsing of a VOTable. We will be looking more closely at some XML exercises later in the NVOSS in the "Working with XML" lecture. You can look at the code in your editor to compare the different methods for the Java APIs.
NVOSS_HOME\java\src\XMLparse
The NVO Summer School is made possible through the support of the National Science Foundation and the National Aeronautics and Space Administration.