Suggested changes and clarifications for the SkyNode.

Suggested changes are first described briefly. A discussion of the justification for the suggestions follows. These changes are limited to the SkyNode protocol and do not address concerns with the Java libraries.

Changes:

1. Explicitly define all columns that are returned when a cross-match is done. This includes the name, type and definition of these columns:

      xmatch_a,         double,  summed weights for each source
      xmatch_a[x|y|z],  double,  weighted summed unit vectors

   The names should all have a common prefix (xmatch_).

2. Do not require id information columns from a cross-match (or, if they remain required, make sure the names begin with xmatch_).

3. Include XMATCH(a) as an explicit qualifier for bottom nodes that are involved in a cross-match, i.e.,

      select a.ra as sdss_ra, a.dec as sdss_dec from photoprimary a where xmatch(a)

4. Include the tables for which the region is specified in the region qualifier. E.g.,

      select a.ra, a.dec from table1 a, table2 b where REGION(regionspec, a)

5. Describe any limits/changes between portal and SkyNode ADQL. Changes that I might include:
   - SkyNode XMATCH only includes 1 or 2 tables (only 2 now, but I'm suggesting adding 1).
   - SkyNode tables do not include node qualifications, i.e., SkyNodes need not handle queries of the form

        select ... from SDSS:photoprimary a where ...

     since the portal will have resolved the prefixes. Note that the SkyNode document explicitly includes the prefixes (e.g., Q1-11) and should be updated.

6. The syntax of the ExecPlan should explicitly indicate the element that the current node is addressing. Alternatives as to how this might be done are discussed below.

7. Move the format, upload table name and upload prefix elements from the ExecPlan as a whole to within the PlanElements.

8. Do not require specification of a PortalURL in the ExecPlan. Make it possible for a plan to suppress logging.

9.
   The SkyNode should have an additional method or methods which return metadata regarding the node. This metadata includes:
   - The table or tables which are XMATCHable.
   - The precision/weight for the XMATCH for each XMATCHable table.
   - The table or tables to which REGION applies.
   - The prefix associated with this node in portal ADQL (i.e., in select ... from PREFIX:table a).

   Most of this information could be added to the TableData metadata or to a new call.

Justification:

1. To build a full SkyNode it is essential to know which columns to create and which to expect when doing queries.

2. The id columns are not required to do a cross-match. If they derive from the original table then they are not guaranteed to be unique. If the node reading a set of results wants a row number, it can easily generate it itself.

3. A node should only have to look at the element that it is handling and not at any other element in the plan. The XMATCH(a) signals that the XMATCH columns will be required. There should not be 'conventions' that span multiple elements. Each query should be completely described in its own element.

   Currently, although there is no indication of the XMATCH columns in the bottom node, they are explicitly indicated in the intermediate nodes. It would be reasonable either to always include the columns explicitly, even in the bottom node

      [select xup.xmatch_a, .... from table a where xmatch(a) ...]

   or to always include them implicitly when there is an XMATCH in the where clause. In the latter case they would be dropped from the intermediate nodes.

   XMATCH columns are marked using an undefined table prefix, xup. This corresponds to the JHU implementation of XMATCH, but a temporary table is not generally required. Some consistent marking of the XMATCH columns would be desirable. One possible syntax might be

      select a.ra,..., xmatch(a,b) from table a, table b where xmatch(a,b) < ...

   which includes xmatch() in the selection list as well as the where clause.
   Then bottom nodes would be

      select a.ra,... xmatch(a) from table a where xmatch(a) < ...

4. The REGION qualifier is magic, but by constraining it to specify the tables that are affected we allow it to be used much more powerfully than if the node needs to make the decision. E.g., suppose a node has a table of observations and a table of objects -- this is common enough. We might well want to use REGION for either or both of those tables, but that is currently precluded.

   The current required behavior is not at all clear to me. E.g., if we have

      select a.ra,a.dec,b.ra,b.dec from observations a, objects b where REGION(spec)

   does the region constraint currently apply to the observations, the objects, or both independently? All three options should be supported and could be written as REGION(spec, a), REGION(spec, b) and REGION(spec, a, b).

5. The node writer need not deal with all of the complexities of the full portal ADQL. Either we need to define an action that a node should take given these constructs, or we should indicate that it is an error for a node to see them.

   Generally the SkyNode document needs to be updated to be a full specification of the interface, including the full protocol specification for the methods supported, the description of all metadata returned, the ADQL syntax supported, and a detailed specification of the structure of the returned VOTable.

6. By requiring an analysis of the ExecPlan to determine the 'current' element, the current system makes the ExecPlan much more rigid, since all nodes need to agree upon how to do this analysis. By explicitly specifying the 'current' element and making sure that the current node only needs to worry about the current element (see #3), we decouple the plan elements and make it much clearer what each node needs to do.

   Here are a few possible ways we might indicate the 'current' element:
   - The first element is always the current element, and is discarded from the ExecPlan when it is passed to the next element.
     This is what I thought happened and I still think it is the most elegant. It makes it very clear that the lower nodes are not dependent upon the higher nodes, since they don't even see those plan elements. This one lends itself most naturally to growth to more sophisticated plan patterns.
   - An index is passed in the ExecPlan structure and incremented as the plan is passed along.
   - As each element is processed a flag is set for that plan element, so that a lower node can recognize that the plan element is being handled by a higher node.
   - Each element has an id, and the current ID is included as a new element.

7. If the format element is useful, that implies multiple formats are available. If SkyNodes support multiple formats, then we may wish to use different formats from different nodes when the receiving node can handle both formats. If not all formats are supported by all nodes, then we may need different formats to be able to get to all nodes. The format for the ExecPlan itself would be superseded by the format in the first plan element. If all nodes are required to use VOTables for communication and the FORMAT is only intended for the final output, then it isn't really a SkyNode issue, but essentially a portal one.

   Putting the upload table name and prefix in the ExecPlan means that only a linear query structure is supported. Allowing different names and prefixes enables one node to simultaneously query multiple subtables. We should retain that flexibility even if we don't currently use it.

8. Logging should be optional. This doesn't mean that a node cannot keep internal logs, simply that it should not require a query to use the SkyNode-specific logging interface.

9. This information is needed by portals to construct query plans. Even if it were available from the registry (and I cannot see where it is), the node should be self-descriptive, since we don't want to mandate coupling the nodes and the registry.
   If this information is to be placed in the registry, the registry (or a registry populator service) should probably be given the node URL and then use these new interfaces to retrieve the information.
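
As an illustration of the first alternative under justification 6 (the first element is always the current element and is discarded before the plan is passed on), here is a minimal Python sketch. It is not the SkyNode ExecPlan schema: the PlanElement class, its fields, the node names and the placeholder queries are all hypothetical, chosen only to show that a lower node never sees the elements handled by higher nodes. The per-element format field follows suggestion 7.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PlanElement:
    """Hypothetical plan element: one node's query plus per-element settings."""
    node: str                # illustrative name of the node handling this element
    query: str               # placeholder for the ADQL addressed to that node only
    format: str = "VOTable"  # per-element format, as proposed in change 7


def take_current(plan: List[PlanElement]) -> Tuple[PlanElement, List[PlanElement]]:
    """Return (current element, remainder to pass to the next node).

    The input list is not modified; the remainder is a new list that
    no longer contains the current element.
    """
    if not plan:
        raise ValueError("empty ExecPlan: no element for this node")
    return plan[0], plan[1:]


def run_plan(plan: List[PlanElement]) -> List[Tuple[str, int]]:
    """Simulate passing a plan down a chain of nodes.

    Returns, for each node in turn, its name and the number of
    elements it forwards to the next node.
    """
    seen = []
    while plan:
        current, plan = take_current(plan)
        seen.append((current.node, len(plan)))
    return seen


if __name__ == "__main__":
    # Assumed three-node chain; names and queries are placeholders.
    example = [
        PlanElement("node1", "query for node1"),
        PlanElement("node2", "query for node2"),
        PlanElement("node3", "query for node3"),
    ]
    for node, remaining in run_plan(example):
        print(f"{node} handles its element and forwards {remaining} element(s)")
```

Under this scheme no index, flag or id is needed, because position alone identifies the current element; the other three alternatives would instead add a field to the plan or to each element.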