The National Virtual Observatory
|
Introduction to Structured Query Language (SQL)
Tom McGlynn (NASA/GSFC) and Maria Nieto-Santisteban (JHU)
- Introduction
- SQL Overview
- Using MySQL and the SPOCS database
- Using On-line Astronomical databases
- SQL in the Virtual Observatory
- Useful links
Introduction.
Structured Query Language (SQL, sometimes pronounced as "sequel") is a language used for manipulating and querying relational databases. Much of the most interesting information in the VO is in tables large and small. An understanding of how these databases are built and queried is essential to understanding the kinds of science that can be with these resources.The origins of relational databases.
The roots of SQL go back about 30 years to the introduction of the relational model for databases and the release of database systems based on it. (Or at least partially based. The inventor of the relational model E.F. Codd was no friend to many of the relational databases systems he inspired, nor of SQL itself.) At that time there were two other popular models for databases.In hierarchical databases, information was stored in a hierarchy of levels. The associations between different kinds of data were reflected in the hierarchical structure.
In the network data model the associations were carried by explicit pointers from one piece of data to another. In network data models anything can point to anything else. This provides enormous flexibility in the database, but it can be hard to understand the underlying structure.
The relational model separates two concepts that are intertwined in the other models. The tables used in relational databases are very simple, non-hierarchical and have no pointers. The associations we wish to make between tables are given as sets of rules that can be used to join when querying them. These rules are implicit and dynamic, rather than the explicit and static and static structures in hierachical and network databases, A consequence of this dynamic binding is that it is natural in relational databases to look for new associations between tables in addition to any built-in structure. Relational databases can be very powerful tools in the kind of data mining that we wish to do in the Virtual Observatory.
The combination of flexibility and structure in the relational database model proved in the 1980's to be the most efficient way of storing large amounts of tabular information. The vast majority of databases today use this model. However the hierarchical and network models are still very much alive. XML underlies much of what we do in the Virtual Observatory and is a clear inheritor of the hierarchical model. The WorldWideWeb, arguably the most important invention of the last two decades, is essentially just a network database.
Once it became clear that relational databases would dominate, attention
turned to developing standards for them. In 1986 the first SQL standard
was published and several updates have been made since then, the latest
in 2003. This talk will first describe SQL and then go through
exercises using local and remote databases.
Overview of SQL
What is a table?
The fundamental unit of a relational database is the table. We are all familiar with tables and relational databases has a very simple model of what these are. A table comprises a name, and data which we can visualize as being organized in columns or in rows. Columns have a name and a data type (float, double, integer, various kinds of strings, etc). Column specifications define the content of the database. Rows (or records) are that content.In tables there is often a column or columns that can be used to identify that row. E.g., in a table of people the name of the person might be this primary key. If the number of people in the table is so large that the name might not be unique, then it may be supplemented by a number that ensures that each row refers to a different, specific individual -- e.g., the Social Security Number for US taxpayers. While tables are not required to have such keys, well-designed tables will do so. The ability to identify rows using these primary keys makes it possible for one table to point to a row in another just as if this were a network database.
In SQL, the standard column types are familiar computer types like int's and float's and double. Strings (char and varchar types) are another basic type. The exact types supported varies among the database vendors but the number, string, and occasionally date/time types are the ones primarily of interest in astronomical applications.
The descriptions of the tables in the database comprise the database schema. Learning at least a bit of the schema is the first step in accessing a database.
There are at least three very distinct topics that a full discussion of SQL and relational databases needs to include: the semantics of creating and querying databases, ensuring the integrity of databases, and making databases efficient. This discussion will largely be confined to the first of these with only occasional references to the last two.
The next few sections highlight the commands to build, populate and query tables. There are some differences between the syntax used in different database systems (e.g., how commands are terminated). All the examples shown here will use the MySQL syntax. Virtuall all the commands have both SQL standard and proprietary options that are not described here.
Databases within the a Database system
Since there may be many different users and uses of a given relational database system installation, most systems support a way to group related tables. A group of tables is called a 'database'. While this is rather confusing terminology, these databases are essentially just independent names spaces. A user of a database system selects the current default database using a system dependent command. [In the MySQL CLI the USE command selects the current database. When accessing a database through a JDBC connection, the JDBC URL indicates the chosen database.]Generally tables in other databases can still be accessed and used, they just have to have the database name prefixed to the table name.
Creating a table
The command to create a table is simplyExample:
To get rid of table
Adding rows
The INSERT command is used to add a row to a database. This simplest syntax for this command isOne can also add a row and only specify some of the columns (in any order). E.g., the second insert above could have been specified as
The DELETE command is used to delete rows from database, but we'll defer the discussion of that till after the select.
Querying a table: SELECT
By far the most important SQL statements/command, is the SELECT statement. A SELECT statement takes zero or more input tables and generates an ouput table. The basic syntax of the SELECT isThe selection list.
The selection list (along with the the initial SELECT keyword) is the only part of the SELECT statement that is not optional. The selection list is a list of comma separated expressions. Each expression can be a constant (e.g., 1 or 'aStringConstant'), a column in the table (as shown in the example) or an expression (e.g., amountInPennies/100).Expressions can be quite complex, they can involve any number of constants and columns. Most databases support a variety of mathematical (sin, cos, sqrt, exp, ...) and string functions that can be used. The string functions tend to vary from database to database, but normally include functions for getting substrings and changing the case of the string and various kinds of formatting.
If you wish to specify specific column names for the output table this can be done by adding an 'as colName' to an element of the selection list. This can be done to supply a column name where one might be missing (e.g., for a constant or expression), or to override the column name when the element is just a column in the original table.
There are two special features in the selection list. Rather than specifying an explicit list of columns the query can just use '*' to say that you want to get all columns. E.g.,
The other special feature is the 'distinct' keyword. If we try
The table list.
The table list follows the FROM keyword and indicates one or more tables that are be used in the query. Multitable queries are described below.The WHERE clause.
The WHERE clause is optional. If specified it is a boolean expression which indicates if a given row should be included in the output. Standard operators like '>', '<' and '=' have their usual mathematical meaning and can be combined using AND and OR operators. Any of the mathematical expressions that might be used in the selection list can also be used in the where clause. Parenthesis can be used to ensure the proper order of operations. E.g.,Strings can be searched either for exact matches, comment='Delete me' or more flexibly using the LIKE operator. LIKE matches strings that may contain two special characters '%' matches any string of any length (including 0) while '_' matches exactly one character. So
If no WHERE clause is specified then the number of output rows is the same as the number of input rows. The columns in the output table are determined by the selection list.
The ORDER By clause.
The order by clause specifies the order in which the results will be displayed. Unless this clause is present the order of results is undefined and is not guaranteed to be the same even for an identical query. One or more fields may be specified (expressions can also be used) and the results are ordered by the first elemement. If this is equal for two rows then the second element is used to break the tie and so forth. The keywords "ASC" or "DESC" can be appended to an element to change specify the direction of the sort. ASC is the default. E.g.,Deleting rows from a database
Now that we've seen the WHERE clause used in the select statement we can show the delete statement. This is simplyMultiple tables and joins
The real power of the relational database comes when we deal with more than one table at a time. Suppose we have three tables, A, B and C with columns x rows of 3x3, 2x3 and 3x2 respectively.In practice, databases usually find substantial shortcuts to creating the full cross-product table. If there is a constraint that involves only one table, then we may be able to immediately limit the rows used from that table. However the database actual performs the query this model defines what the results should be.
Joins
Let's add two more tables in the same database as our bank transfer table.
The primary key of our bankTransfer table is the transaction number, the primary key of the user table is the userID while the primary key of the account table is the acctNo. We join the tables by seeing if the primary key field of one table matches a 'foreign key' field in the other table. E.g., in account, the userNo is a foreign key field. It picks out an associated row in another table. Similarly the srcAcct and dstAcct fields in the bankTransfer table match the acctNo field in the account table.
When multiple tables are involved, there can be ambiguities about which table a given column belongs to. We can explicitly write out the table (as above) but this tends to look messy. In the table list after the FROM keyword Each table can be associated with a prefix and the prefix used to identify the table. We rewrite the query as
Note that there is no requirement that all tables be used in the output, or that the sort order be determined by a visible column. The account table plays a key role in this query, but no fields from this table are displayed.
Kinds of Joins
The structural joins illustrated above are often divided into one-one, one-to-many and many-to-many relationships. A one to one relationship indicates that for each row in one table there is at most one row in the other and vice versa. In essence these two tables could have been combined to generate one largest table, but its often most convenient and efficient to split tables this way. Smaller tables are more efficient. E.g., in addition to our user table we might have a userAddress table. This also has the primary key of userID, but includes the mailing address for the user.The joins we did above are one-to-many relationships. For each transaction we have only one fromAccount, and only one toAccount. However if we go the other direction, we can find many transactions from a given account. Similarly there is only one user for each account, but there may be many accounts for a given user.
Many-to-many relationships are ubiquitous once we start thinking about the real world. E.g., suppose accounts are allowed to have multiple users. Joint accounts are nothing new. Now each account has (potentially) multiple users and each user has (likely) multiple accounts. We could duplicate the account record for every user of that account, but then whenever we update the account we need to be careful to update every record. The best solution to the need for a many-to-many association is usually to create a new very small table which has only two columns: the keys of the two tables being joined. In our example we would create userAccount table with the two columns userID and acctNo, and we get rid of the userID in the curren account table.
For astronomers, we often want to use non-structural joins, we want to join fields based upon similar physical characteristics, e.g., we might be looking for variables with
This kind of join has the potential to be a many-many relationship. The user has to decide how to deal with the multiplicity of hits. Could all the matches be real (one table has better resolution than another)? Do we want to find a 'best' match and if so what is the criterion.
Outer Joins
There is an alternative syntax for joinsAggregate functions
In additional to standard functions on individual columns, SQL supports a limited number (count, avg, min, max) of aggregate functions. These functions work on a group of rows. By default the entire table is the group. SoThe GROUP BY attribute in the SELECT statement can be used to break a table into multiple groups. E.g.,
If you want to make a qualification on the results based upon an aggregate function, then you may also need a HAVING clause. While the WHERE clause eliminates rows in the initial set, the HAVING clause works on rows after the grouping has been achieved. E.g., if we are looking for suspicious activity in a bank account, say one where there are a lot of large transactions, we might have:
Our previous check for overdrawn balances didn't consider that someone might have several checks which put them over the limit. A more sophisticated approach would be
If we include aggregate functions then we can summarize the the actions of the SELECT as:
- Create the cross-product of all the tables in the FROM clause.
- Eliminate any rows that do not meet the criteria in the WHERE clause
- If there are aggregate functions in the selection list, sort the table by GROUP BY parameters, determine the groups, and transform the table into one with just the group parameters and the results of the aggregate functions.
- If there is a HAVING clause eliminate rows that do not meet the HAVING criterion.
- Sort the table according to the elements in the SORT clause.
- Create the output columns according to the SELECT clause.
Database Integrity
There are many features in the database systems that are intended to assure the integrity of the data. All serious database systems have sophisticated permissions systems where only some users are permitted to do particular operations. Generally users should only have the minimum permissions required for their own tasks. The GRANT and REVOKE commands control who has which privileges.The creation and updating of the various user accounts tends to be rather database specific. The precise privileges supported, and the granularity on which they apply differs greatly. Typically users can get permissions based upon the database or table. In some case permissions may also be granted on an individual row or field basis.
Appropriate use of keys can ensure that fields that are supposed to be unique really are. Triggers can be created that occur during specified database events. E.g., a trigger may check that before it accepts a bank transaction, both the source and destination accounts might exist.
A complex feature supported by commercial databases is transactional integrity. For dynamic databases it is often unacceptable for a group of statements to fail halfway through. E.g., when doing a transfer we need to make sure that we both debit one account and credit another. If we can't do both we want the whole process to fail. This ensures transactional integrity in the database. A transaction that fails can use the ROLLBACK command to go back to the last valid point. Once a group of related changes succeeds the user can COMMIT them.
Efficiency
Tuning databases and queries to run efficiently is an art form on which many volumes have been written, but there are a few key features that users may keep in mind.- Joins are powerful. Keep in mind that the potential number of rows goes as the product of all the tables involved. If more than two or three tables are involved in a join try to make sure that there are criteria in the query that enable the system to limit the number of rows needed.
- Make indices on fields used in joins. The CREATE INDEX command analyzes the distribution of a given field or fields within a table such that queries can run most efficiently. A table is often physically organized according to the primary key so that it can very efficiently access a given value or range of values in that key.
- Large database queries are much more likely to be disk than CPU bound. Typically we can do hundreds of operations in the time it takes to read a single byte.
- Small tables are more efficient that large tables. Usually the database system will have to read the entire row even if many of the columns are not needed. You can split a table into a few key columns and join as needed with a secondary table.
Building and Querying the SPOCS Database
In the following examples we will see how we can create, modify, and query a database using MySQL. We will be using SPOCS, which you will find in the SummerSchool package under the SQL directory. SPOCS has a very basic schema. It is composed of three tables SPOCS, SPOCS_EXT and SANTOS. SPOCS and SPOCS_EXT are related through the ID column. SPOCS and SANTOS are related through the name and HD columns respectively. The data and content descriptions are included in the SPOCS data directory.Getting Started
the examples assume that you have set up the MySQL database and can start the MySQL CLI on using the MySQL root account. E.g.,The schema.sql file defines the structure of the tables. It looks like:
While the schema.sql file defines the tables, the three table_build.sql files fill them up with data. They are simply sets of insert statements.
Ok we're ready to build the database...
We're not quite done... The positional information in the SPOCS table is encoded in sexigesimal form. We're going to use these tables as a SkyNode in a few days. It's much more convenient to be able to manipulate real numbers, so we're going to add RA and Dec columns to the SPOCS database. The addcolumns.sql and fillcolumns.sql files do this.
If we run these:
The ALTER TABLE command modifies and existing table. The syntax and capabilities change differ among database systems. Often it's better to just delete the database and build it right the first time!
The UPDATE command changes the values of fields. It fills in the blank RA and Dec fields that we created with the alter table. Note that we get the information from the spocs_ext table, but we created the new columns on the spocs table. SQL can be quite powerful if you have the ability to change the database. It's also easy to destroy the database. Most of the time want users have only read access to the database.
Note: There is an error in the SPOCS database that we have built here. You may be able to deduce it purely from the schema. We'll discuss--and show how to fix--this error in the SkyNode server session, but bonus points to anyone who can figure out what it is before then.
Querying SPOCS.
- Query 1: Find all solar analogs.
-
The Sun has a temperature of Teff=5770 K and a gravity of logg=4.4 and
a metallicity of [M/H]=0.0. Stars like the Sun (analogs) will have
values of these three parameters similar to the solar values. In this
example, we look for stars with temperatures within 50 K of the Sun and
similar gravities and metallicities.
SELECT s.ID, s.Name, s.Teff, s.Logg, s.Metal FROM spocs s WHERE s.Teff BETWEEN 5720 AND 5820 AND s.Logg BETWEEN 4.34 AND 4.54 AND s.Metal BETWEEN -0.1 AND +0.1 ORDER BY s.Teff;
- Query 1b: Computes the average values for Teff, Logg and Fe.
- Demonstrates the use of the avg function applied to full columns
SELECT avg(s.Teff), avg(s.Logg), avg(s.Fe) FROM spocs s
- Query 2: Putting your results into a file.
-
Extract information for an HR diagram with data from SPOCS and SPOCS_EXT Demonstrates a join between two tables where the same ID corresponds to the same object. This query shows how to copy the query results into a file. Note that this syntax tends to be both system and database dependent. You need to edit this query to get it to work since the path for the output depends upon your system. On Windows systems backslashes in the file path must be escaped. This syntax only works if you are on the server.
The first column is the effective temperature of the star while the second is the logarithm of the luminosity. If you plot these results, you need to make the temperature decrease to the right to get the traditional orientation of the diagram.
SELECT s.Teff, t.logL INTO outfile '/a/full/directory/path/testout.sql' FIELDS terminated by ',' FROM spocs s, spocs_ext t WHERE s.ID = t.ID; - Query 3. Look for systematic trends vs temperatures.
-
Making a star a little bit cooler, reducing the gravity a little bit,
or increasing the metallicity a little bit all tend to make spectral
lines a little bit deeper. This unfortunate reality makes it hard to get
the temperature, gravity, and metallicity exactly correct.
When astronomers measure gravity, any error the temperature or
metallicity will cause a compensating error in gravity. Similarly, when
measuring metallicity, any error in temperature and gravity will cause a
compensating error in metallicity. All three parameters are interlinked.*
Detecting and characterizing systematic errors can be difficult. Here
we compare stellar parameters from two different spectroscopists, using
different analysis techniques. Ordering the differences in gravity or
metallicity as a function of increasing temperature shows trends. The
size of these trends are an indication of the minimum systematic error
in one or both of the results from the two analyses.
The next step would be to plot the trends, using Excel or any other plotting package.
SELECT s.ID, s.Name, s.Teff, (s.Teff - e.Teff) as dTeff, s.Logg, format(s.Logg - e.Logg,2) as dLogg, s.Fe, format(s.Fe - e.Fe,2) as dFe FROM spocs s, santos e WHERE substring(s.Name,3) = trim(e.HD) ORDER BY s.Teff; - Query 4: Count how many times each star appears in the Santos catalog
-
Notice that some stars in the Santos catalog have measurements for
multiple observations. Frequently, we want to combine multiple
measurements of the same quantity into a single value. This is a very
likely scenario in the world of VO, which draws upon many different
catalogs. In this simple case, the multiple measurements are in the same
catalog.
SELECT s.ID, s.Name, count(e.HD) c FROM spocs s, santos e WHERE substring(s.Name, 3) = trim(e.HD) GROUP BY s.ID, s.Name ORDER BY c desc;Query 5: Combine multiple entries of the same star from SANTOS.
- Here we average together multiple observations of the same star in the **Santos** catalog. For stars with multiple observations, averaging improves the precision of the result. Combining multiple values for each star also provides a single number to use when comparing results from the two catalogs. (The comparison in Q3 compared each measurement from the **Santos** catalog separately.)* This prints the avg value of two measurements and demonstrates how to use group by to get averages per group of rows. The format function converts the numbers into nicely formatted strings. We also discard rows that have already been averaged.
SELECT s.ID, s.Name, count(s.ID) n, format(avg(e.Teff), 1) as mean_Teff, format(avg(s.Teff - e.Teff), 1) as mean_dTeff, format(avg(s.Logg - e.Logg), 2) as mean_dLogg, format(avg(s.Fe - e.Fe), 2) as mean_dFe FROM spocs s, santos e WHERE substring(s.Name, 3) = trim(e.HD) AND e.Inst != 'avg' GROUP BY s.ID ORDER BY mean_Teff; - Here we average together multiple observations of the same star in the **Santos** catalog. For stars with multiple observations, averaging improves the precision of the result. Combining multiple values for each star also provides a single number to use when comparing results from the two catalogs. (The comparison in Q3 compared each measurement from the **Santos** catalog separately.)* This prints the avg value of two measurements and demonstrates how to use group by to get averages per group of rows. The format function converts the numbers into nicely formatted strings. We also discard rows that have already been averaged.
- Query 6: Get rid of the AVG rows.
- In the previous query we had to handle the 'avg' rows
specially. This query just removes them, so we don't
need to worry about them in the future. It's also
possible on most database systems to create a view, a kind of
virtual table, in which these troublesome rows wouldn't show up.
DELETE FROM santos WHERE Inst = 'avg';
- Query 7: Adds a column estimating the error in the effective temperature.
-
The spectra used to create the SPOCS catalog were all obtained with
very similar spectrographs. Exposure times were adjusted so that each
spectrum had the same high signal to noise ratio. Thus, the authors
claim that the error in temperature, gravity, metallicity, etc. are the
same for any individual spectrum. These error values for individual
spectra are given in a small table in the SPOCS paper. For example, the
error in Teff is 44 K for a measurement based on one spectrum.
When the authors average results for N spectra of the same star, they
claim the error in the result should get smaller, going from 44 K down
to 44/sqrt(N) K.
The SPOCS spectra come from three observatories: Keck Observatory in Hawaii, the Anglo-Australian Telescope (AAT) in Australia (duh!), and Lick Observatory in California. The total number of spectra is just the sum of the number of spectra from each of these three observatories. This 1/sqrt(N) adjustment to the error is certainly optimistic, but illustrates how we can dynamically add scientific information to the table. Real-world errors don't get smaller quite so fast
ALTER TABLE spocs ADD e_Teff smallint after Teff; UPDATE spocs s SET e_Teff = round(44.0/sqrt(s.NKeck + s.NAAT + s.NLick)); - Query 8: Calculate the significance of differences in temperature between the SPOCS catalog and the Santos catalogs.
-
Now we are going to do some error analysis. We'll compare the
temperature from Santos for each star with the temperature from
SPOCS. Of course the two values differ, but the question is whether the
difference is reasonable, given the errors (actually given the
"uncertainties") provided by the authors.
What is the expected uncertainty (e_Diff) in this difference? The typical error in "Diff" is given by the following formula, which combines the uncertainty in the SPOCS value (e_SPOCS) and the error in the Santos value (e_Santos)
e_Diff2 = e_SPOCS2 + e_Santos2
The square of the error/uncertainty squared is the variance. For uncorrelated errors, when we add (or subtract) measurements, we can just add the variance to get the anticipated error in the combined measurement.Anyway, back to our goal of deciding whether the differences between SPOCS and Santos are significant. The calculations above give us Diff and e_Diff. If all differences are just due to Gaussian noise (which people often assume), then about 2/3 of the Diff values should be between -e_Diff and +e_Diff. In words, one would say, "If difference are just due to Gaussian noise, then two-thirds of our calculated Diffs should be between minus one-sigma and plus one-sigma."
If we divide Diff by e_Diff, then 2/3 of our calculate Diff/e_Diff values should be between -1 and +1. This is easy to test in a table, especially if sorted by Diff.
SELECT s.ID, s.Name, s.Teff, s.e_Teff, e.Teff, e.e_Teff, (s.Teff - e.Teff) as dTeff, round(sqrt((s.e_Teff * s.e_Teff) + (e.e_Teff* e.e_Teff))) as e_dTeff, (s.Teff - e.Teff) /sqrt((s.e_Teff * s.e_Teff) + (e.e_Teff* e.e_Teff)) as sigma FROM spocs s, santos e WHERE substring(s.Name, 3) = trim(e.HD) ORDER BY sigma; - Query 9: Three way joins.
SELECT s.ID, s.Name, e.HD as santosHD, t.Mass, (t.Mass - e.Mass) as dMass, s.Teff, (s.Teff - e.Teff) as dTeff, s.Logg, format(s.Logg - e.Logg,2) as dLogg, s.Fe, format(s.Fe - e.Fe,2) as dFe FROM spocs s, santos e, spocs_ext t WHERE substring(s.Name, 3) = trim(e.HD) AND s.ID = t.ID ORDER BY s.Teff;
- Query 10: Is there a systematic difference between the 2 catalogs?
- In Query 8 we saw that individual measurements from SPOCS and Santos
seemed to be consistent to within the predicted errors. But, there could
still be small systematic offsets between the two catalogs that are too
small to see by looking at measurements for only individual stars.
What happens if try the average for all the matched stars?
SELECT format(avg (s.Teff - e.Teff), 1) as mean_dTeff, format(avg(s.Logg - e.Logg), 3) as mean_dLogg, format(avg(s.Fe - e.Fe), 3) as mean_dFe FROM spocs s, santos e WHERE substring(s.Name, 3) = trim(e.HD);
Online Astronomical Databases
As noted below in the links section there are a number of astronomical databases currently on-line that support direct SQL queries. Additionally there are many more that use the VO dialect of SQL, ADQL (or Astronomical Data Query Language). We can try out some queries on these databases as well.First, we will need a database. The examples below use the Sloan Digital Sky Survey catalog on-line databases from any of the SkyServer Web sites (DR2, DR3, or DR4):
- http://skyserver.sdss.org/dr2/en/tools/search/sql.asp
- http://cas.sdss.org/dr3/en/tools/search/sql.asp
- http://cas.sdss.org/dr4/en/tools/search/sql.asp
- Query 1: What are the u and r magnitudes for object X? In what field is the object located?*
-
SELECT p.objid, p.modelMag_u, p.modelMag_r, p.fieldID FROM PhotoTag p WHERE p.objid = 587726014001185324
- Query 2: What do we know about object X from the PhotoTag table?
SELECT p.* FROM PhotoTag p= WHERE p.objid = 587726014001185324
- Query 2b: Tell me all you know about X objects in the PhotoTag table.
-
SELECT TOP 10 p.* FROM PhotoTag pTOP is not supported by all (or even most) systems. There are many different syntaxes that are used by different databases for only getting the first few rows returned from a query. Most databases support it using some syntax. When using TOP, there is no guarantee that this will bring the same results unless it is used with an ORDER BY. Using TOP (or whatever its equivalent is in the current database) is a "quick and dirty" way to know what columns are available in a table and get a subsample of its content. The ways a user can limit a query for the most common databases are described at the A Gentle Introduction to SQL web pages.
- Query 3: Tell me the identifier of all observation runs without repetitions.
SELECT DISTINCT f.run FROM Field f
- Query 4: Find objects contained in fields X and Y with u magnitude between M and N.
- We are interested only in those with u magnitude fainter than r by F
units.
SELECT p.objid, p.modelMag_u, p.modelMag_r, p.modelMag_u - p.modelMag_r as modelMag_ur, p.fieldID FROM PhotoTag p WHERE p.modelMag_u between 27.5 AND 28 AND p.modelMag_u - p.modelMag_r > 0.5 AND (p.fieldID = 587726014001184768 OR p.fieldID = 587726014001250304)
- Query 5: Find objects contained in fields X and Y with u magnitude between M and N.
- We are interested only on those with u magnitude fainter than r by F
units. Show them in descending order of u magnitude.
SELECT p.objid, p.modelMag_u, p.modelMag_r, p.modelMag_u - modelMag_r as modelMag_ur, p.fieldID FROM PhotoTag p WHERE p.modelMag_u between 27 AND 28 AND p.modelMag_u - p.modelMag_r > 0.5 AND (p.fieldID = 587726014001184768 OR p.fieldID = 587726014001250304) ORDER BY p.modelMag_u DESC
- Query 6: What are the average u and r magnitudes for galaxies in field X?
-
We can use aggregate functions in our queries:
SELECT avg(g.modelMag_u) as avg_modelMag_u, avg(g.modelMag_r) avg_modelMag_r FROM Galaxy g WHERE g.fieldID = 587726014001184768
- Query 7: What are the average u and r magnitudes for galaxies observed during run X, rerun Y, with camcol Z grouped by field.
-
SELECT g.fieldID, avg(g.modelMag_u) as avg_modelMag_u, avg(g.modelMag_r) avg_modelMag_r FROM Galaxy g WHERE g.run = 1458 AND g.rerun = 40 AND g.camcol = 1 GROUP BY g.fieldID ORDER BY g.fieldID
- Query 8: Find the average u and r magnitudes for galaxies observed during run X, rerun Y, with camcol Z grouped and ordered by field.
-
Here we are only only interested in average r magnitudes higher than T.
SELECT g.fieldID, avg(g.modelMag_u) as avg_modelMag_u, avg(g.modelMag_r) avg_modelMag_r FROM Galaxy g WHERE g.run = 1458 AND g.rerun = 40 AND g.camcol = 1 AND avg(g.modelMag_r) > 21.53 GROUP BY g.fieldID ORDER BY g.fieldIDThis query fails. Aggregate functions are not allowed in the WHERE clause. We need the HAVING keyword which specifies a search condition for a group or an aggregate. HAVING is usually used with the GROUP BY clause. When GROUP BY is not used, HAVING behaves like a WHERE clause.SELECT g.fieldID, avg(g.modelMag_u) as avg_modelMag_u, avg(g.modelMag_r) avg_modelMag_r FROM Galaxy g WHERE g.run = 1458 AND g.rerun = 40 AND g.camcol = 1 GROUP BY g.fieldID HAVING avg(g.modelMag_r) > 21.53 ORDER BY g.fieldID
- Query 9: What are the redshift and i magnitude of objects contained in MyPhotoObj?
-
We want to all photometry objects whether they
have spectra or not.
SELECT p.objid, p.i, s.specobjid, s.z FROM PhotoObj p LEFT OUTER JOIN SpecObj s ON p.objid = s.bestobjid WHERE p.objid IN (587726014001185022, 587726014001315956, 587726014001315970, 587726031180792552)
SQL in the VO
One of the primary service types supported by the Virtual Observatory is the SkyNode. SkyNodes support queries using a dialect of SQL known as ADQL. When properly invoked by a sufficiently intelligent SkyPortal, they can also handle queries of distributed databases.The primary limitations of SQL with regard to the VO are lack of support for spatial queries and issues with distributed queries for the very large astronomical data. Given the growth of GIS systems, most database systems now have extensions to support such queries relatively efficiently, but standards are still evolving.
SQL and ADQL will be further discussed in a number of other courses
in this summer school, including discussions of the SkyPortal services,
ADQL, and the use and deployment of SkyNodes.
Links of interest
Some astronomical databases on-line using SQL interfaces
- The Sloan Digital Sky Survey - SDSS
- http://skyserver.sdss.org/
- Galaxy Evolution Explorer - Galex
- http://galex.stsci.edu/
- SuperCOSMOS Science Archive - SSA
- http://surveys.roe.ac.uk/
General SQL Tutorials
- A Gentle Introduction to SQL
- http://sqlzoo.com
This reference shows how some queries and operations differ among popular databases systems. - SQL Course.com
- http://sqlcourse.com/
http://www.sqlcourse2.com/ - SQL Tutorial
- http://www.w3schools.com/sql/
- FirstSQL
- http://www.firstsql.com/tutor.htm
- SDSS SQL Tutorial
http://cas.sdss.org/dr4/en/help/docs/sql_help.asp
SampleQueries