ADQL, SkyNode, SkyQuery, Java Class issues... - We are doing an Xmatch of N nodes where n=0 is the portal and N is the node where no cross-match is done. N-1 is the node doing the first cross-match, and the last XMatch is done at node 1. Suppose the weight of each node is w_n. For the operation to be guaranteed commutative, then the chi-squares need to have X2_n = X2_1*Sum(AllWeights)/Sum(WeightsOf the nodes m>=n) This suggests doing XMATCHes in decreasing weight, but that's likely to be counter to the desire for the first queries to return the fewest results. (E.g., there are a lot fewer ROSAT objects in a region than SDSS). - Absent the previous, it would be very nice to be able to have a weight that is a field in each row, i.e., some positions in a table may be better than others. This would be easy to do if we didn't have to worry about commutivity. Wouldn't even require a change to the interface, it could be implemented by nodes as appropriate. - If we are doing a Xmatch of N nodes, and the early nodes are underweighted compared to later nodes, then not only will there be extra matches at these early nodes, there will be extra matches in the later nodes. Here underweighting refers to the weight relative to the actual positional precision, not the absolute weighting. In a situation where the early nodes and later nodes are both highly weighted, there might be serious problems in the middle nodes. - The XMatch X2 value does not depend upon the density of objects in the survey. This is counterintuitive. Perhaps the weighting should depend upon both the precision and the density of objects weight ~ d/e^2 where d is the density of objects and e is the typical error in the position in a survey. - When the XMatch is used in large numbers of surveys, it will eventually begin to include objects where there are occasional dropouts. The nearest non-match will be included, but its big error contribution to the average position of the unit-vector will be washed out by the large number of correct vectors. [This may be desirable in some contexts and undesirable in others.] - XMATCH of large numbers of surveys may arise in the analysis of temporal datasets, e.g., hundreds of LSST frames of the same region. - What columns are essential to the cross match? Presumably only the sum of the weights and the weighted sum of the unit vectors. Nolan indicates that two id columns are also required, but these seem to be conveniences for a particular implementation of the cross-match. By the by it would seem possible that there would be no original column associated with a given row in a cross-match, i.e., WGACAT may have mulitple observations of the same object so one might wnat to do something like (syntax TBD) select ... from WGACAT where ... group by XMATCH(...) which does an internal cross-match. Or more conventionally... select target_name,avg(*) from WGACAT group by target_name where the unit vectors are averaged over the original rows where multiple rows have the same target_name. (The avg(*) may need to be avg(xmatch_a), avg(xmatch_ax), ... - SQL precedence has OR as the lowest precedence. Thus where a.ra>3 or a.ra<2 and Region('...') and Xmatch(...) either breaks the precedence rules or doesn't do what is expected. I suspect this will trap unwary users. - In many cases one may wish to query a remote catalog without doing a cross-match, e.g., any non-positional query. Right now this isn't really possible and it limits the kind of queries that can be sent (since it must be possible to somehow generate the XMATCH columns). It might be nicer if the XMATCH criterion were sent even to the last node in an XMATCH chain, and only if that criterion is present is an XMATCH expected to be done. E.g., the last node in an XMATCH would get XMATCH([!]a) where a is the table prefix. If no such field were present, then the final node would recognize that no xmatch query parameters were required. As far as I can tell, the query plans do not currently tell the last node that they are involved in an XMATCH chain. It is simply assumed. - The use of XMATCH is limited by how it has been made magic. In the ClassX project we did a lot of work where we correlated tables against positions near but offset from the real positions. This cannot be done with the current cross-match, but with a less magical syntax it could be. E.g., suppose the XMATCH explictly specified the RA and Dec (or unit vector) columns in its syntax. Then one could do something like... Xmatch(a.ra,a.dec,xup.ra+rand()/(60*sin(xup.dec)), xup.dec+rand()/60) Since Xmatch doesn't allow the user to specify the columns to be used that isn't possible. The syntax may let us get away with specifying a temporary table and using that for the XMATCH, but I'm not sure that our execution plans can be sophisticated enough to support that. (i.e., it requires more than one select statement or subselects). Should the execution plan allow multiple selects including some into temporary tables? Then I could do select into #temp1 a.ra + rand()....; select ... from table a, #temp1 b, upload up where ... XMATCH(b,up) Since XMATCH identifies the tables matched there are no syntax problems, although discussions indicate that the current XMATCH has only one XMATCHable table per node. - Auto XMATCH should be supported. No problems with the syntax here I think, but it's an illustration that the XMATCH may join two tables from the same archive. E.g., if I have a table of object detections in multiple observations of the same region, then select ... from MyTable a, MyTable b where Xmatch(a,b) and a.mag < b.mag-2 finds all the objects that vary by more than 2 magnitudes between observations. [Note that MyTable is the same for a and b]. - The link between the XMATCH query and the query columns that it produces which are indicated by the table xup, seems overly magical. The output indicator would better be specified in the query. This will likely be essential if we get sophisticated and allow sub-queries and such. - The parameters of the various magical incantations used in XMATCH and REGION should be accessible through standard protocols. This includes at least the names of the primary position table used in the REGION, the position columns, and the weights associated with a table in a XMATCH. E.g., we need a metaNode interface which returns a set of tables for which the XMATCH and REGION operators are effective (I do think this will often be more than one), give the position columns and coordinate frame for these (either lat,lon or unit vector or both), and specify the weight of the table in an XMATCH or the column where the weight is given if the weight varies by row. - The Java ADQL classes for functions do not seem to provide for functions with more than one argument (or less in the case of Pi and Rand), even though some such are included in the lists. (I.e. the get/putArg methods assume a scalar). - Why do the ADQL objects distinguish between Math and Trig functions? - The REGION operator is much more magical than XMATCH in what I think are undesirable ways. It would be much less of a problem if the REGION operator indicated which tables it was acting on. (It would be a little better still if it indicated which fields of which table, but I can live without that.) So rather than REGION("region specification in highly formatted string") it would be much nicer to have REGION("region specification in highly formatted string", a, b, c) where a, b and c were the tables it was to be applied to. This would make REGION and XMATCH compatible. Even better REGION(Circle(10,10,10),a,b,c) where we expose the syntax of the region specification. Note that the latter immediately allows for something like: select b.id from chandra a, hst b where region(Circle(a.ra,a.dec,10), b) as the specification for all Chandra observations within 10' of HST observations. If the field of view of a particular instrumment is defined as a parametrized region then we might get something like:.. select a.id from chandra a, hst_acs b where region(hst_acs_fov(b.center_ra, b.center_dec, b.position_angle), a) - Only a linear query topology is supported by the current executation plan. It would be easy to make it able to support abitrary topologies which may be desirable in some cases, e.g., to support parallelism in the query. The only substantial impact on a SkyNode is that it may have to ingest more than one table. Right now we have the structure ExecPlan SomeData about plan Query[] where a single ExecPlan includes some constant information about the plan plus an array of queries of various nodes. Each node deletes the top query from the plan before sending it to the next node. A more general structure would be ExecPlan SomeData about Plan QueryForThisNode ExecPlan[] where an ExecPlan includes an array of sub-plans which could be run in parallel or sequentially. This would also allow for the possibility of using different formats at different nodes (where there are receiving nodes that can handle multiple formats). This would be a little more elegant in my opinion, since rather than manipulating the current plan before sending it on, each node just sends out the sub-plan. The current topology would be represented as ExecPlan ExecPlan ExecPlan ... but one could also have ExecPlan ExecPlan ExecPlan ExecPlan ExecPlan where the second node fanned out a query to three separate nodes. - Re the previous point... We should keep the distinction between SkyNodes and SkyQuery in mind. It seems to me that SkyNode should be fairly generic and built without the specific SkyQuery implementation (however clever that might be) in mind. The linear topology, the assumption that all nodes are doing a XMATCH arise from SkyQuery. We should be building SkyNodes with more generic applications in mind. - In the portal ADQL there are Archive tables with a syntax: Archive:table alias How is Archive resolved to a SkyNode? Is this intended to be Portal independent? If not, then perhaps this syntax needn't be part of ADQL. I.e., I don't think that syntax that only appears in the portal needs to be part of ADQL. Only syntax that goes over the wire. - In the vein of the previous... The requirement that all ADQL tables have aliases makes simple queries look complex. In building a query within a portal we should make it clear that the portal can use whatever simple constructs it wants to represent the query to the user. It's only when the query goes to execute at other nodes that we need to worry about compliance with the ADQL spec. - Why are there no string functions in ADQL? These are essential in many real life situations. These vary a lot from vendor to vendor but a substring, and case and index functions will suffice. Being able to write these in a standard way would be a real boon. - The formal descriptions of ADQL given at the end of the ADQL document are abstruse enough to be useless to me. Maybe others find them helpful. A BNF description of the syntax is nicer (imho). I can make one with Javacc if you like. - The Java classes associated with the WSDL are very clumsy. E.g., there are separate classes for arrays of objects, rather than a straightforward use of the natural array syntax. In some cases getting and setting appropriate parameters requires several very confusing layers of classes (getting or setting the RA and Dec of the center of a CIRCLE region object is the most egregious example I've seen). This libraries are not ready for prime time. - To what extent is this Java class library fixed or can it be modified in ways consistent with the WSDL? Can we rewrite the WSDL consistent with the current .NET implementation in ways that give more rational Java classes? - Packages have version numbers embedded in them. This seems a no-no since these package names are needed in user code. - Packages should conform to standard Java usage (don't start with caps). - Packages need some rational organization. - Do we need to write a layer on top of the auto-generated classes that is what the user should see? - SkyNodes should not require the query portal to respond to logging requests. Simplest approach seems to be to allow the Portal URL in the query plan to be null. Do we want to support different levels/atomicity of logging? If so how does the portal tell the node what levels are supported. - Consider Xmatch(a,!b,!c). If b and c are the bottom two nodes then we get xmatch(!b, !c) for the first XMATCH. What does this mean? This needs to be an outer join of b and c rather than a match, but what do we do about the points where b and c potentially could match? Do we return only a single position there or a position for both tables? How do we ensure commutivity? Might be easier to mandate that the first table involved in an XMatch has to be a positive match. I.e., in this query a would have to be the bottom node. - Need to think about the implications of commutivity with regard to !matches. - JHU skynode implementation uses temporary tables for xmatch. For positive matches we don't need that since we can just return tables which meet the xmatch criterion. For matches with a single ! we probably need to use an 'exists' qualifier and subselect e.g., where not exists(select a.ra,a.dec where (weight+up.xmatch_a)^2 - (sin(a.ra)*cos(a.dec)+up.xmatch_ax)^2 - ... < X2 Is the exists qualifier standard SQL. [I think so]. This has the database create the temporary table. In my experience the database may not be very clever at this. It would be good to compare the two approaches (as a function of the underlying db's). - !matches affect the count of unit vectors that is used to normalize the X2. It's probably best if the nodes don't worry about this at all and just let the portals take care of it. However this would mean that nodes need to be careful if they think about reorganizing the query (some piece of documentation suggests that a node can re-order a query to suit itself). - Just reiterating... There are three things that affect the X2 value that an actual XMATCH uses: the user desired limit, the number of unit vectors that have been averaged into current vectors, and need to accommodate potentially low errors in subsequent queries to assure commutivity. - Java classes use two different types for match tables and drop tables. This seems strange. The underlying tables aren't different. It's how they are used in the XMATCH. - XMATCH syntax is reasonable for SQL in that it is given as function < value However Java classes don't work that way since the < value is incorporated into the XMATCH object rather than being a standard relationship between two values. Why not just match XMATCH a real valued function? It would still be a bit magical in how we compute it for a given row, but one could imagine it being used in much more interesting ways. E.g., we might have XMATCH(a,b) < a.criterion to do an XMATCH where the criterion was a function of the row. - REGION (on the other hand) is boolean valued, even though SQL doesn't have standard boolean variables. It would be less magical if it had a standard type. REGION < 1 means inside the region, for example. - What is sigma in the plan element? Is this the weight? I assume the weight is supposed to be 1/sigma^2. Would probably be more natural to give the weight. - The Java objects representing the SQL query should probably be available as an abstract tree. Right now there is no easy way to search the tree. E.g., suppose I want to know if there is an XMATCH. In principle I need a special parser, because there is no way to search the Where claause. E.g., the XMATCH may be hidden many levels down inside IntersectionSearchTypes and ClosedSearchTypes. Each node could represent the elements of the node as elements of tree leaves which could be searched in a standard way that didn't depend upon the type of the node. Then I could easily descend the tree to search for XMATCHes, or REGIONs or whatever I needed to know about the query. Without this, it is easier (for me at least) to do regular expression searches on the SQL rendering of the query than to manipulate the object tree. I.e., I'd do better just to get the SQL string directly since all I do is build it and then manipulate it -- which is easier than the supposedly preparsed object tree. - Need to track database peculiarities. Sybase power(a,exp) versus pow(a,exp) Datatype is 'double precision' not 'double' set rowlimit n select ... where ... set rowlimit 0 versus select top n from ... where ... - In Class AllOrDistinctType the case is peculiar... All must be specified as "All" but Distinct is specified as "DISTINCT". Probably should be case sensitive, but if it's going to be it needs to be consistent. - Why isn't XMATCH serializable? [Because of bug in EnumSerializer. Is our fix generally correct or have we broken serialization of something else.] - Why does xmatch have minoccurs for sigma as 0 in WSDL? - Looks like SDSS node adds xmatch columns on output if a region qualifier is included. Is this mandatory? Nolan had indicated these were added by an analysis looking for XMATCHes on higher level nodes... - The portal node need not be the same as the node the data goes to... Should not need to have a portal node respond to queries. Do we want to be able to specify more than one portal? - Mixed us of both single and double quotes in region qualifier (for string and arcminutes or arcseconds) is extremely nasty. This breaks standard SQL rules and parsing. Should just make the field use arcminutes (or add another field which gives the units. Why we don't have: REGION(CIRCLE('J2000', ra, dec, radius, 'arcminutes')) which fits in nicely with other SQL.