The National Virtual Observatory
|
Building a SkyNode Server
Tom McGlynn (NASA/GSFC)
Introduction.
This talk describes the issues involved in setting up a SkyNode server in the Virtual Observatory. SkyNode's allows outside users sophisticated access to data tables. Currently users will not normally interact directly with a SkyNode. SkyNodes are accessed through SkyPortals. SkyPortals can relay a request to the appropriate SkyNode for simple queries, and can break down a complex request that involves multiple nodes into a series of requests against these nodes.

This diagram shows how the SkyPortal may initiate a request to several
SkyNodes. As part of the process of deciding how to handle
the full query, current
SkyPortals will send metadata and cost requests directly to each
of the nodes to assess the best order for the query may be
done.
SkyNodes are defined as Web Services, and their interface is defined through a WSDL file that gives the details of the specific services. Currently the WSDL describes SOAP as well CGI POST and GET interfaces to the SkyNode. These differ little in functionality: the differences occur in how the data is transported to the responding task and how the data is wrapped. We shall only consider the SOAP interfaces in the following discussion.
This presentation discusses the general issues of building a SkyNode. These include the needed pre-requisite capabilities, the functions a SkyNode needs to support, conversion of the query inputs to something the local database can use and handling of the special ADQL concepts. We then describe the FullSkyNode package included in the NVOSS2006 software distribution. This package enables users to rapidly set up SkyNodes if a few basic criteria are met.
SkyNode Prerequisites
There are really only two substantial prerequisites for building a SkyNode: a queryable dataset, and a Web server.The data provider needs to have some data to provide! It is likely that the provider has a relational database system running, but this is not explicitly required. It is typically far easier to put data into one of the free relational databases, than to build the translations that would be required to support queries of a table in some native FITS or other file format. The underlying organization of the data is largely invisible to the outside world.
Current SkyNodes all include tables with positions on the sky. This is not required by the protocol and it is perfectly feasible to create a non-positional node (though perhaps calling it a SkyNodes would be a bit of a misnomer then). A SkyNode may support one or more tables, but it is assumed that one table is the primary positional table. Special positional constraints can be applied to this table, and positional matches with other SkyNodes are possible. For current SkyNodes this table is normally a table of object positions. It is perfectly possible to have a table of observations as the 'primary' table, but this is not currently done. Many SkyNodes have both object and observation tables included, but the object table has always been set as the primary table.
Since SkyNodes run as Web services, the provider must be able to modify the configuration of some Web server to run their SkyNode. The Web server need not be on the same machine as the database. Most databases are easily accessible via network links. A typical configuration may have the Web server on some publicly visible machine (perhaps in a DMZ on sites with more sophisticated security postures), and the database server behind a firewall, visible to the Web server, but not to the general public.
SkyNode Functionality
The current (v1.01) documentation for SkyNodes describes functionality at the requirements level. A more useful to developers, though far less friendly, description of the interface is given by the WSDL returned from SkyNodes. We treat the SkyNode services developed at JHU and STScI as reference nodes. These were the first SkyNodes available are are still likely the most commonly used. E.g., the WSDL returned from the GSC2.2 SkyNode can be be retrieved at the URL http://galex-dev.stsci.edu/GSC23Skynode/nodeb.asmx?wsdl. Even this is not sufficient to build a functioning SkyNode. The discussion later in this document and in the final section describes specific issues that must be addressed by the SkyNode developer that are not discussed in the documentation nor indicated in the WSDL.The following metadata functions are to be supported by SkyNodes:
- Return WSDL describing the node
- As with most Web services this is done by appending with ?wsdl to the base URL of the Web Service.
- Tables
- Return metadata on all of the tables in the SkyNode. The metadata includes the name, description and number of rows in the table.
- Table table_name
- Return metadata on a specific table.
- Columns table_name
- Return metadata on all columns of a specified table. The metadata may include the name, description, type and UCD of the column.
- Column table_name column_name
- Return metadata on a specific column of specific table.
- Formats
- Return the formats supported by the SkyNode. SkyNodes are required to support VOTables, and the software we describe later uses only this format.
- Functions
- Return the metafunctions supported by the SkyNode. No use of this function is known.
- FootPrint region_specification
- Return the overlap between the input footprint and the region specified. If a node does not wish to implement this it can simply return the input region specification (implying all-sky coverage).
- GetAvailability
- Return a description of when the service is available. No use of this function is known.
- PerformQuery adql_query format
- Execute a query on the local machine and return the results in the requested format.
- QueryCost adql_query
- Estimate the cost of this query. Normally the cost is an estimate of the number of rows the query will return. (Executing the query is a perfectly valid approach). This method will be used by SkyPortals in estimating how to organize a query most efficiently.
- ExecutePlan exec_plan
- If appropriate relay the plan to a downline node and ingest the results. Run the query on the local machines and return the results.
A SkyNode that supports only the PerformQuery function is sometimes called a 'basic' SkyNode. A 'full' SkyNode supports all of three of the major services. The software described below supports full SkyNode capabilities. There is an implicit requirement on full SkyNodes that the a request can generate a temporary table in the database. If this is not possible, then a provider may wish to disable the ExecutePlan functionality. Most databases allow for temporary tables which are automatically deleted when the database session in which they were created terminates even for users who cannot create permanent tables.
Handling the metadata functions are relatively straightforward. A data provider may choose to put this in the code, read from configuration files, or query the database (especially for the column and table information). Some of the metadata functions are not currently used.
Handling the last three methods is more complex and we deal with the issues they raise in the following sections. PerformQuery and ExecutePlan run queries on the SkyNode and return results as VOTables or possibly in proprietary formats. The QueryCost function may use some heuristic in estimating the query cost, but the simplest implementation is to execute the query and count the results.
Handling an Execution Plan
The execution plan is a relatively simple structure. Basically it has a set of rows, one for each node that participates in the query. Each row has a node identifier, a URL for the node, and a query to be executed on this node. [There's a little bit more information for the execution plan as a whole, but we don't need to worry about this here.] The SkyPortal sends the execution plan to the URL specified in the first row. That SkyNode is expected to recognize that it is the current node by looking at the node identifier for the node, and see that the id in the execution plan matches the node id.It is not the role of the SkyNode to do anything with the execution plan. That's fully the purview of the SkyPortal.
Once a node locates itself in the ExecutePlan, the node needs to check to see if there are subsequent nodes. If there are, then the execution plan is sent unaltered to the node specified in the next row. The syntax allows for fallback URLs if the primary URL fails, e.g., if a given SkyNode has multiple mirrors. It is unclear if this have ever been used.
The downline SkyNode should return a table as output. The current SkyNode is expected to ingest this table into the local database. To avoid tables cluttering the database it is best to use temporary tables which will be deleted automatically when the current session terminates.
The node then executes its own query. The local query will only refer to local tables, and to the one uploaded table.
Once the SkyNode is finished with the local query it simply prints out the results (in the agreed format) on the standard output. The upline node, or portal, will be waiting for them.
Handling the Query
The queries included in requests to the SkyNode are sent as an XML document essentially giving a fully parsed version of the ADQL query. Translating this query into something that the local database can handle requires a number of transformations.Automated tools can fairly quickly regenerate an SQL-like ADQL query. The SkyNode implementation below uses an XSL transformation. The ADQL received by SkyNodes is somewhat simpler than that described in the ADQL documentation. SkyNodes should never see node designators in the ADQL. E.g., a user might enter
Also, any XMATCH seen at a SkyNode will have exactly two tables involved. Multiple XMATCH's are broken up into a series of queries at different nodes.
The user may need to address the two ADQL special features: REGION and XMATCH. We discuss these below showing how these can be translated into more or less standard SQL. For very large databases more specialized and optimized solutions are likely to be desirable, but the approach here illustrates issues that any SkyNode needs to address.
Regions
A region may have been specified using a REGION constraint in the WHERE clause of the database. Only circular regions are currently used. The ADQL will have an element in the WHERE clause looking like: REGION('CIRCLE J2000 12.3 15.6 10') This indicates that we are only to search within 10 arcminutes of the J2000 position (RA,Dec)=(12.3,15.6) where the coordinate values are in decimal degrees.Note that the REGION does not indicate the table or parameters that are involved in this selection. This is up to the SkyNode provider. (Hopefully this will be more explicit in future versions of SkyNodes.)
Assuming RA and Dec columns in decimal degrees this can be replaced using the Haversine formula for the distance between two points of a sphere.
Following the example above sin(10'/2) = .0014..., cos(15.6) = .963... and there are 0.017... radians per degree so the constraint can be expressed in SQL as
There are a couple of subtleties here. If the circle wraps over the 360 to 0 transition in RA, then the between on the RA should be replaced with a not between. E.g., if we want the data between RA=359 and 1 then the maximum is 359 and the minimum is 1 and we wish to exclude everything between 1 and 359.
It is fairly easy to compute the min and max Dec values, but getting the min and max RA is a little less obvious. Just assuming that the delta we wish to apply in rights ascension is just Dra=Ddec/cos(dec) turns out to be a bit naive. The maximum range in longitude for a circle of radius r drawn around a point at latitude b0 occurs at a latitude b such that sin(b) = sin(b0)/cos(r) This implies that the maximum offset in RA is
Some tables may represent coordinate positions with unit vectors in addition to or in place of standard coordinates.
Handling XMATCH
Handling XMATCH's requires more work, since the SkyPortal makes many implicit assumptions. What needs to be done depends upon where the current SkyNode is in the set of nodes. XMATCH's will only be found when we are processing an ExecutePlan since they always involve multiple nodes.A brief review of XMATCH
The XMATCH algorithm is described in detail at the OpenSkyQuery portal site. Here we give a short, but perhaps more intuitive description. Consider a set of unit vectors on the unit sphere. Take the average -- in three space -- of those vectors. The average vector will always lie within the sphere. As the points are spread further over the sphere the average vector will get shorter and shorter.The XMATCH algorithm takes advantage of this and accepts matches from multiple catalogs when the average of the points from all the catalogs being joined is longer than some criterion. The math in the reference above suggest an appropriate value for that criterion. When a catalog has very precise positions, it gets weighted highly in building the average. Lower precision catalogs are weighted less. The weight is essentially the inverse square of the positional accuracy expressed in radians. Since we only deal with averages, not two way joins, the algorithm is clearly symmetric with regard to any number of participating tables. Because of the way the variances add, it turns out to be possible to do the calculation as a series of two-way joins without losing this symmetry.
Adding in the XMATCH columns
An XMATCH is always between the current node and the results uploaded from the immediately downline node. Thus the bottom node never executes an XMATCH. However the bottom node and all nodes except the top node must return six special parameters as columns in the results. The columns must have the following names.- xmatch_a
- The sum of the weights of the unit vectors so far
- xmatch_ax, xmatch_ay, xmatch_az
- The weighted sums of the unit vectors.
- pk
- An integer (normally intended to be distinct)
- id
- Another integer (normally intended to be distinct)
For the bottom node if we do not keep the positions as unit vectors, we need to compute the positions from the coordinates (as described in the Region specification) and multiply by the table weight. Note that the table weight is typically a large number, of order 1010 for a table with a positional accuracy of a few arcseconds. For intermediate nodes, we need to add the weighted values from the current node to the values we read from the previous node. (Note the exception in anticorrelations below).
Current SkyPortals do not normally request the unit vectors and weights from the top node, so these need not be supplied there.
Two integer fields are also needed. These PK and ID fields are used to identify rows in the cross-match at some nodes, but the values do not seem to be crucial.
The SkyPortal normally requests an XMATCH_CHISQ value from the top node. This is simply the difference between the magnitude of the vector represented by (xmatch_ax,xmatch_ay,xmatch_az) and the sum of the weights, xmatch_a (see below in Perform the XMATCH for how we might compute this).
Performing the XMATCH
For all nodes except the bottom node, there will be something like XMATCH(a,up) < 3.5 in the query. To replace this with standard SQL is a bit tedious. It is a bit easier if the table has unit vector columns, but basically we need to replace this bit of the ADQL with something like:Anticorrelations
One of the useful capabilities of an XMATCH is to do anticorrelations. When a SkyNode sees an anticorrelation it will always be the local table that will be correlated against, e.g., the node will see XMATCH(up,!a), never XMATCH(a, !up).An anticorrelation is most conveniently done in two steps. First do the standard XMATCH, and then return all the uploaded results that were not matched in the first step. This can be done by saving the first result as a temporary table, or by using a subselect.
In the initial query we replace the entire selection list with a unique key for the uploaded table information. Presumably this could be the mysterious PK or ID fields, but it is probably best to generate a key as one uploads these data directly to be sure of uniqueness. So
These results (i.e., the column of matched keys) can be stored in a temporary table (using syntax that depends upon the local database). Then we execute a second query
Note that all the the remaining criteria in the WHERE clause are dropped from the second query. They should have already been applied. Also, there should be no columns that reference the local table in the selection list. It is not clear how such columns could be filled. Anticorrelations are filters that drop rows from the database but never add anything.
Note that this means that when we do an anticorrelation we do not (since we cannot) add information from the current table to the xmatch_a, xmatch_ax, ... columns. The values in the uploaded table are passed on unchanged if there is a higher level node.
The FullSkyNode Software
A Java-based SkyNode package is supplied in the Summer School software distribution in the $NVOSS_HOME/java/src/fullskynode directory. This package supports all of the major functions of the SkyNode: it can query the local database or handle an execution plan sent from another SkyNode. This section discusses how to adapt this software to easily bring up new SkyNodes.This software implements all major functions for SkyNodes and is intended to replace all the FullSkyNode package released last year, and the BasicSkyNode package released in the first summer school. It does not use any Java environment outside of what is provided with the Java 1.5 JRE. Neither AXIS nor any other Web services framework is required. Java 1.5 has substantially reduced the time needed to start up applications and running as a CGI rather a servlet does not significantly slow this application.
The SkyNode specification makes a distinction between basic and full nodes. This package supports the full SkyNode implementation.
Additional Prerequisites
The software package requires the following to be able to work:- A Java 1.5 installation visible from the Web environment.
- A JDBC compatible database and the JDBC JAR for that database
- Perl. A simple Perl wrapper, sn.pl, initiates the CGI in the standard release.
- A table with RA and Dec columns (of any name) with the J2000 or ICRS coordinates in decimal degrees.
Organization of the software installation
The FullSkyNode package requires the installation of 5 files. We describe each in turn. The ANT script provided in the source directory is set to install a SkyNode that points to the MySQL database built elsewhere in the Summer School. For a production installation a provider may wish to modify this script to do the installation. The notes indicate how users might tweak the provided files for different circumstances.- The Perl script which actually runs the CGI request [sn.pl]. On Unix systems the first line of this script may need to be modified to point to the correct version of perl (e.g. #!/usr/bin/perl). In Windows the environment variable SystemRoot needs to be set to root directory of the Windows installation. As delivered the code assumes this is c:\windows. If the provider wishes to initiate the Java command in some other way, the Perl script can be discarded entirely.
- The Java JAR file with the SkyNode libraries [skynode.jar]. This is normally created using ANT. Users may wish to modify the software as discussed below in which case the JAR needs to be rebuilt.
- An XSL transformation file that is used to transform between the XML and SQL-like representations of the query [xml2sql.xsl]. This file can be modified to adapt to the local SQL dialect but in the current software these adaptations are done in the DBQuery class instead.
- The WSDL description of the service [skynode.wsdl]. This file needs to be updated with the URL pointing to the SkyNode. No other modifications are needed.
- The run-time configuration file [vo.settings]. This file is used to configure the SkyNode. It is read as the SkyNode begins to process a request. The SkyNode software has been designed such that almost all changes you need to do to install a new SkyNode are made to this file.
The ANT script and the vo.settings file supplied with the distribution can be modified to adapt the default settings to a given SkyNode. A USAGE file indicates how a user can set up a SkyNode with the MySQL SPOCS database set up earlier in the Summer School. Note that the ANT install script will transform the vo.settings, skynode.wsdl and sn.pl found in the source directory as it installs them into the CGI directory. Fields of the form $XXXX$ are replaced. The XSL file is not currently edited by the ANT script.
Run Time Configuration
The Full SkyNode uses the fields of the SkyNode class to control its actions. Users can override the defaults at run time using a vo.settings files which is found in the same directory as the CGI script. It can be edited there, or they they can edit the vo.settings file found in the FullSkyNode source directory before the SkyNode is installed using ANT.Each line of the settings file is either blank, a comment (begins with #) or overrides a single setting. A line looks like:
Key settings include:
- WSDLFile
- The WSDL service description file. This is the path to the WSDL file from directory in which the CGI script is run. This file is only used when a request comes for the WSDL description (i.e, http://some.node/../sn.pl?wsdl). It is not used within the program at all, nor are any classes built from it.
- XSLFile
- The XSL file used to translate from the XMLified query to an ADQL string. Users can modify this file to adapt to the local SQL dialect, or they may prefer (as is done in the current software) to manipulate the ADQL string version.
- MaxString
- The Maximum length of a string. This is used when ingesting string values from a downline node. Strings will be truncated at this length.
- AdqlURI
- The URI used to identify the very of ADQL used. This may be needed in future versions if the ADQL is manipulated before being sent to downline nodes. It is not currently used.
- VOTableURI
- The URI used to identify the version of the VOTable used. This is used when the VOTable result is sent upline. Currently it indicates the version 1.1 of the VOTable standard is being used.
- jdbcURL
- The JDBC URL that identifies the database to connect to.
The syntax of this URL varies among databases. Here are some
of the variants (note that a given system may support
more than one variant).
- Oracle: jdbc:oracle:thin:@<SERVERNAME>:<PORT>:<DBNAME>
- Sybase: jdbc:sybase:Tds:<SERVERNAME>:<PORT>/<DBNAME>
- SQL Server: jdbc:microsoft:sqlserver://<SERVERNAME>:<PORT>
- MySQL: jdbc:mysql://<SERVERNAME>:<PORT>/<DBNAME>
- Postgres: jdbc:postgresql://<SERVERNAME>:<PORT>/<DBNAME>
- JDBCDriver
- The class name of the JDBC Driver class. This is
specific to a given database.
- SQL Server: com.microsoft.jdbc.sqlserver.SQLServerDriver
- MySQL: org.gjt.mm.mysql.Driver
- Oracle 8i: oracle.jdbc.driver.OracleDriver
- Oracle 9i: oracle.jdbc.OracleDriver
- Postgres: org.postgresql.Driver
- Sybase com.sybase.jdbc2.jdbc.SybDriver
- jdbcName
- The account name for the database account used in queries. Note that this is the database account name, not the system account name.
- jdbcPwd
- The password for the account used in queries. Note that this is the password for the database account, not the system account.
- TempTableName
- The name used in creating temporary tables. Special strings may be needed. Some database systems use "create temporary table" to create a temporary table. Others use special names. The software tries the "create temporary table" syntax first and if that fails assumes that a name beginning with '#' is to be a temporary table.
- TargetNode
- The Identifier of the SkyNode. This is given as the ShortName in the registry entry for the SkyNode. It must be unique. Note that the SkyPortal will normally send this as uppercase.
- raName
- The name of the RA column in the database. This column is assumed to have the decimal J2000 (or ICRS) right ascension for the given row. This is the RA column that will be used in resolving REGION and XMATCH constraints.
- decName
- The name of the Dec column in the database. This column is assumed to have the decimal J2000 (or ICRS) declination of the table. This is the declination column that will be used in resolving REGION and XMATCH constraints.
- TableName
- An array of table names that are to be part of the SkyNode. If more than one table name is to be specified, separate them with commas.
- TableDescription
- An array of table descriptions that are to be part of the SkyNode If more than one table is to be specified, separate the values with commas. Note that a comma cannot be escaped to be part of the description.
- TableSize
- An array giving the size of the tables. This is often specified as -1 for each table.
- PositUncert
- The positional uncertainty of objects in the SkyNode. In arcseconds. This value is used in a XMATCH to allow simpler, index friendly, search criteria to be added to the XMATCH. It's value should be coupled with the TableWeight. Note that this is given in arcseconds, but the TableWeight is the inverse square of this in radians. 1 radian=200000 arc seconds.
- TableWeight
- The weight of the SkyNode in an XMATCH. This is equal to 1/PositUncert^2 when the PositUncert is expressed in radians. Increasing the TableWeight has the effect of decreasing the match radius.
- LogFile
- The name of a file in which details (and some debugging information) will be stored. By default this file "sn.log". If you are having problems with the SkyNode look at the log to see where it might be having difficulties. The code will run fine if the log file cannot be created, but you will have a more difficult time resolving problems.
Configuring Multiple SkyNodes
Given the constraints on how tables are used in SkyNodes, a data provider may wish to set up multiple nodes pointing to different tables in the database. The easiest way to accomplish this with the software provided is to create a separate directory in the CGI area for each SkyNode. The CGI script, settings and WSDL file will need to be customized for each node, but the XSL file and the JAR file can be shared by all of them. The settings files need to indicate which tables each node is to use and may have different table weights if the tables have different positional accuracies. The WSDL files need the appropriate URL.Modifying the code.
The FullSkyNode software may be adapted to address specific issues in a given installation. The Java code is available and may be used and modified without any restrictions. Note that there are likely errors in the system and a few features have yet to be implemented. The Java classes used in the library are in the net.ivoa.skynode package with a couple of utilities from the net.ivoa.util package. The only other Java code needed not found in the system libraries in Java 1.5 is the JDBC JAR file for the database that is to be queried.The local Java classes (in the net.ivoa.skynode package) are listed below.
- SkyNode
- This class is primarily definitions of constants used throughout the SkyNode software. There are also static methods that update the compiled settings with values from the vo.settings file.
- FullNode
- This is the main class for the SkyNode. It initiates the process of parsing and responding to a request. It handles the request for the SkyNode WSDL and provides a central logging facility.
- SoapParser
- This class parses the SOAP request and passes the elements to be processed by other class.
- SoapWrapper
- This class provides the appropriate SOAP wrappers for results.
- DBQuery
- This class handles the low level communications with the database using JDBC. It also handles database dependent features. Currently this includes handling temporary tables and how a query should be formatted to return only a limited number of rows.
- Query
- This class is responsible for translating the query from the input ADQL into SQL that the database can handle. It translates the REGION and XMATCH constraints.
- MetaQuery
- This class handles the metadata queries of the database.
- Ingester
- This class reads a VOTable from a downline node and ingests it into a table in the local database.
- Encoder
- This class writes a VOTable given the results of the current query.
- DOMUtil
- This class provides a few utility functions that for handling XML representations of data.
- XSLTransform
- This class invokes an XSL transformation on an XML document.
- BreakError
- This class is used to terminate SAX parsing of an XML document.
- FootPrintQuery
- This class responds to the FootPrint request. The current implementation simply returns the input region.
- PlanExecuter
- This class processes an Execution Plan.
- QueryCost
- This class returns the cost of a query -- currently just the number of rows in the query.
Some changes can be relatively trivial. E.g., if the Right Ascension is stored in decimal hours rather than decimal degrees, then a factor of 15 needs to be added several places in the code.
If the positions are stored in galactic coordinates, then the changes required are rather more substantial. While it's very easy to handle the REGIONS, since we can simply translated the J2000 input to Galactic coordinates, doing an XMATCH is rather more complex. Users may note that the database is perfectly happy doing the complex arithmetic transformations, so it would be perfectly feasible to write the SQL that transforms the L and B in the RA and Dec (or perhaps does this for the unit vectors). Alternatively the data provider may choose to add RA and Dec columns to the table.
If positions are already stored in unit vector fashion, then the handling of REGION and XMATCH can be substantially simplified.
The code uses XSL transformations in Java to transform from an XML to a string representation for the query. A user can update this XSL to more closely mimic the local database. E.g., the supplied code might translated something at select top 10 ... but this XSL file can be modified to translate that to the MySQL compatible syntax of select ... limit 10
Future Developments In SkyNodes
The SkyNode and ADQL standards are the subject of intense interest currently in the both the NVO and the IVOA generally. A number of suggestions have been made as to how to improve and generalize the SkyNodes. These include new ADQL functions (e.g., a function to make it easy to specify the angular distance between two points), sophisticated region handling, more complex query topologies than the simple linear topology of the current query plans, and changes to the way the query is encoded. These should greatly enhance the power of SkyNodes.It is unlikely that there will be substantial changes before the end of the year, and we intend that should there be changes, new versions of the FullSkyNode code will be available to adapt to these changes with minimal effort on the data providers part. E.g., a SkyNode may need to update the JAR file used.