Science with the Virtual Observatory
These notes are companions to the presentation OpenSkyQuery (PPT).
Open SkyQuery is a Web application that allows querying and cross-matching distributed astronomical services known as SkyNodes. It is also a place to upload small personal catalogs (~ 5k rows) for cross-matching with the big public SkyNodes.
A number of astronomy survey projects produce databases containing source catalogs. Some of these databases contain a billion or more entries, and for each entry, tens or even hundreds of columns may be stored. A VO SkyNode is a standard interface to such databases. SkyNodes are needed because these databases can be in different systems, such as SQL Server, Oracle, Sybase, MySQL, etc. To avoid requiring users to learn different database systems, or to get database access accounts for many servers, these databases are provided to the VO through the SkyNode interface.
SkyNodes are classified as BASIC or FULL. The main distinction between them is that FULL SkyNodes are able to perform cross-matches.
ADQL is the Astronomy Data Query Language. It is the query language understood by SkyNodes. It looks very much like SQL, the standard Structured Query Language used by many database systems. ADQL does not support all features of SQL, but it adds several important capabilities specialized for astronomy. These include a cross-match function, XMATCH, that allows users to find objects in different databases that are spatially coincident to within some tolerance, and a REGION function that allows users to constrain their queries to a particular region on the sky.
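To make the two ideas above concrete, here is a minimal Python sketch of what "spatially coincident to within some tolerance" and "constrained to a region on the sky" can mean. It is not how XMATCH or REGION are implemented in any SkyNode: the function names, the tuple layout, and the use of a plain angular-distance threshold are all assumptions made for illustration, and the real XMATCH works with a confidence level rather than a fixed radius.

```python
import math

def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds; inputs are in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    # Haversine formula, numerically stable for small separations.
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a)))) * 3600.0

def in_circle(ra, dec, center_ra, center_dec, radius_arcmin):
    """Hypothetical stand-in for a circular region constraint."""
    sep = angular_separation_arcsec(ra, dec, center_ra, center_dec)
    return sep <= radius_arcmin * 60.0

def naive_crossmatch(cat_a, cat_b, tolerance_arcsec):
    """Brute-force match of two catalogs of (id, ra_deg, dec_deg) tuples.

    Assumption: a pair matches when its separation is within the tolerance.
    """
    matches = []
    for id_a, ra_a, dec_a in cat_a:
        for id_b, ra_b, dec_b in cat_b:
            sep = angular_separation_arcsec(ra_a, dec_a, ra_b, dec_b)
            if sep <= tolerance_arcsec:
                matches.append((id_a, id_b))
    return matches

# Made-up positions, for illustration only.
cat_a = [("a1", 180.0, 0.0), ("a2", 180.1, 0.0)]
cat_b = [("b1", 180.0, 0.0005)]  # 1.8 arcsec north of a1
print(naive_crossmatch(cat_a, cat_b, tolerance_arcsec=3.0))  # [('a1', 'b1')]
```

The brute-force loop compares every pair, which is only workable for small catalogs; it is meant to show the matching criterion, not an efficient method.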
When the portal receives a query, it is parsed and sent to the SkyNodes according to an optimized query execution plan. At the moment, the optimization is very basic: the portal first counts how many objects meet the WHERE constraint at each node and then organizes the execution plan so that the node with the smallest number of matching objects executes first. The execution plan is transmitted from one node to the next in descending order of cardinality. Results are pushed up from one node to the next in ascending order of cardinality. The last node to receive the plan does not need to execute XMATCH, because there is nothing to cross-match against yet. The cross-matching happens in subsequent nodes.
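The ordering described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the portal's actual code: the function name, the node names, and the object counts are all hypothetical.

```python
def build_execution_plan(counts):
    """Order nodes from per-node object counts.

    counts maps a node name to the number of objects that meet the
    WHERE constraint at that node.
    """
    # The plan travels from the largest node down to the smallest.
    plan_order = sorted(counts, key=counts.get, reverse=True)
    # Execution, and the flow of results, runs the other way.
    execution_order = list(reversed(plan_order))
    return plan_order, execution_order

# Made-up counts for three illustrative nodes.
counts = {"SDSS": 4200, "GSC2": 900, "MyData": 150}
plan_order, execution_order = build_execution_plan(counts)

print(plan_order)       # ['SDSS', 'GSC2', 'MyData']
print(execution_order)  # ['MyData', 'GSC2', 'SDSS']

# The first node to execute only selects its own objects (no XMATCH);
# every later node cross-matches against the rows it receives.
for step, node in enumerate(execution_order):
    action = "select only" if step == 0 else "select + cross-match"
    print(node, "->", action)
```

Starting with the smallest node keeps the intermediate result sets small as they are passed along to the larger nodes.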
Limitations: SkyNodes are currently synchronous. This means the caller waits for the answer from the node, which may lead to performance problems if no limit is applied. Currently, we restrict the return from the nodes to 5k rows. This number is an arbitrary cut, but some limit is necessary when running synchronously.
How does this limitation affect queries? First, single SkyNode queries will return a maximum of 5k rows. Second, and most important, cross-match queries between sets that contain about 5k rows are likely to be incomplete. If a SkyNode has more than 5k objects meeting the WHERE condition, the cut will be applied. This usually happens when the REGION constraint covers a big area with a lot of objects, and/or other conditions in the WHERE clause are not restrictive enough. An additional cut may happen after the cross-match is performed and the results are sent to the next node. During the cross-match process, each object from the prior node is compared to the objects in the current node. If the prior node provided about 5k rows, the match table might end up with more than 5k rows even when a one-to-one match is expected, depending on how restrictive the confidence level in the XMATCH is and on the catalog precision.
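The two cuts described above can be illustrated with a toy model in Python. Everything here is an assumption made for illustration: the function names are hypothetical, objects are plain integers, and each object is assumed to produce a fixed whole number of matches at the second node, which real catalogs do not do.

```python
ROW_LIMIT = 5000  # the 5k-row cap described above

def truncate(rows, limit=ROW_LIMIT):
    """Return at most `limit` rows and a flag saying whether rows were dropped."""
    return rows[:limit], len(rows) > limit

def simulate_two_node_match(n_first, matches_per_object, limit=ROW_LIMIT):
    """Toy model of a two-node cross-match under a row limit."""
    # First node: objects meeting the WHERE clause, cut to the limit.
    first_rows, cut_first = truncate(list(range(n_first)), limit)
    # Second node: each incoming object yields a fixed number of matches.
    match_rows = [(obj, k) for obj in first_rows
                  for k in range(matches_per_object)]
    # The match table is cut again before it is sent on.
    returned, cut_second = truncate(match_rows, limit)
    return {
        "expected": n_first * matches_per_object,
        "returned": len(returned),
        "cut_at_first_node": cut_first,
        "cut_after_match": cut_second,
    }

# Too many objects meet the WHERE clause: the first cut loses matches.
print(simulate_two_node_match(8000, 1))
# Under the limit at the first node, but the match table overflows.
print(simulate_two_node_match(4000, 2))
# Small enough everywhere: the result is complete.
print(simulate_two_node_match(1000, 2))
```

In the first two cases only 5000 of the 8000 expected rows come back, which is why keeping the number of selected objects well below the limit matters.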
Conclusion: if you expect to have a big overlap between catalogs, use constraints (PARAMETER ranges or REGIONS) that keep the number of objects small. How small depends on how many objects you expect to cross-match per object. We are working on a parallel platform to provide large-scale querying and cross-matching. Thank you for your patience!
Problem statement: JWST is planning to select guide stars from the Guide Star Catalog 2 (GSC2). The field of view of the JWST Fine Guidance Sensor (FGS) is only about 11 square arcmin, so it will be necessary to select guide stars near the faint limit of GSC2. GSC2 contains both unresolved point sources (“stars”) and extended objects (“galaxies”). The extended objects may not be good for guiding, so the guide star selection system may avoid objects classified as extended in GSC2. Classifying objects as point or extended sources is difficult at the faint limit of GSC2, so it is interesting and useful to explore where these classifications start to break down in the current version of GSC2. One way to see if there is a problem is to compare classifications from GSC2 and the Sloan Digital Sky Survey (SDSS) using Open SkyQuery. Where they agree, the classifications are probably good. Where they start to disagree, further work is probably needed in one or both catalogs. Below are detailed instructions for using Open SkyQuery to make this comparison.