The National Virtual Observatory
2006 Summer School

Introduction to Structured Query Language (SQL)

Tom McGlynn (NASA/GSFC) and Maria Nieto-Santisteban (JHU)

Introduction.

Structured Query Language (SQL, sometimes pronounced as "sequel") is a language used for manipulating and querying relational databases. Much of the most interesting information in the VO is in tables large and small. An understanding of how these databases are built and queried is essential to understanding the kinds of science that can be with these resources.

The origins of relational databases.

The roots of SQL go back about 30 years to the introduction of the relational model for databases and the release of database systems based on it. (Or at least partially based. The inventor of the relational model E.F. Codd was no friend to many of the relational databases systems he inspired, nor of SQL itself.) At that time there were two other popular models for databases.

In hierarchical databases, information was stored in a hierarchy of levels. The associations between different kinds of data were reflected in the hierarchical structure.

In the network data model the associations were carried by explicit pointers from one piece of data to another. In network data models anything can point to anything else. This provides enormous flexibility in the database, but it can be hard to understand the underlying structure.

The relational model separates two concepts that are intertwined in the other models. The tables used in relational databases are very simple, non-hierarchical and have no pointers. The associations we wish to make between tables are given as sets of rules that can be used to join when querying them. These rules are implicit and dynamic, rather than the explicit and static and static structures in hierachical and network databases, A consequence of this dynamic binding is that it is natural in relational databases to look for new associations between tables in addition to any built-in structure. Relational databases can be very powerful tools in the kind of data mining that we wish to do in the Virtual Observatory.

The combination of flexibility and structure in the relational database model proved in the 1980's to be the most efficient way of storing large amounts of tabular information. The vast majority of databases today use this model. However the hierarchical and network models are still very much alive. XML underlies much of what we do in the Virtual Observatory and is a clear inheritor of the hierarchical model. The WorldWideWeb, arguably the most important invention of the last two decades, is essentially just a network database.

Once it became clear that relational databases would dominate, attention turned to developing standards for them. In 1986 the first SQL standard was published and several updates have been made since then, the latest in 2003. This talk will first describe SQL and then go through exercises using local and remote databases.

Overview of SQL

What is a table?

The fundamental unit of a relational database is the table. We are all familiar with tables and relational databases has a very simple model of what these are. A table comprises a name, and data which we can visualize as being organized in columns or in rows. Columns have a name and a data type (float, double, integer, various kinds of strings, etc). Column specifications define the content of the database. Rows (or records) are that content.

In tables there is often a column or columns that can be used to identify that row. E.g., in a table of people the name of the person might be this primary key. If the number of people in the table is so large that the name might not be unique, then it may be supplemented by a number that ensures that each row refers to a different, specific individual -- e.g., the Social Security Number for US taxpayers. While tables are not required to have such keys, well-designed tables will do so. The ability to identify rows using these primary keys makes it possible for one table to point to a row in another just as if this were a network database.

In SQL, the standard column types are familiar computer types like int's and float's and double. Strings (char and varchar types) are another basic type. The exact types supported varies among the database vendors but the number, string, and occasionally date/time types are the ones primarily of interest in astronomical applications.

The descriptions of the tables in the database comprise the database schema. Learning at least a bit of the schema is the first step in accessing a database.

There are at least three very distinct topics that a full discussion of SQL and relational databases needs to include: the semantics of creating and querying databases, ensuring the integrity of databases, and making databases efficient. This discussion will largely be confined to the first of these with only occasional references to the last two.

The next few sections highlight the commands to build, populate and query tables. There are some differences between the syntax used in different database systems (e.g., how commands are terminated). All the examples shown here will use the MySQL syntax. Virtuall all the commands have both SQL standard and proprietary options that are not described here.

Databases within the a Database system

Since there may be many different users and uses of a given relational database system installation, most systems support a way to group related tables. A group of tables is called a 'database'. While this is rather confusing terminology, these databases are essentially just independent names spaces. A user of a database system selects the current default database using a system dependent command. [In the MySQL CLI the USE command selects the current database. When accessing a database through a JDBC connection, the JDBC URL indicates the chosen database.]

Generally tables in other databases can still be accessed and used, they just have to have the database name prefixed to the table name.

CREATE DATABASE database_name
creates a database with the given name. Normally database names must be unique within the database system. Example:
create database SPOCS;
To get rid of a database just
DROP DATABASE database_name

Creating a table

The command to create a table is simply
CREATE TABLE tablename (name1 type1, name2 type2, ...)
Example:
create table star(id varchar(20), vmag float, ra float, dec float);
In addition to specifying whether a the type of a field, the column definition can indicate whether this column must be specified. SQL supports a special value, 'null', that indicates that this field has not been filled for this row. A column that allows null values is 'nullable'. It may not make sense for a column to allow nulls and the phrase 'not null' can be added if that is so.

Example:

create table bankTransfer (transaction long not null, srcAccount int not null, dstAccount int not null, amountInPennies int not null, comment varchar64);
This definition ensures that no essential data is missing in any bank transfer, but allows an optional comment.

To get rid of table

DROP TABLE tableName

Adding rows

The INSERT command is used to add a row to a database. This simplest syntax for this command is
INSERT INTO tableName VALUES(val1, val2, val3 ...)
where val1, val2, ... are the values for the row in the in same order as specified in the create table command. If a string is being inserted, it should be enclosed in single quotes. E.g.,
insert into bankTransfer values(1111002, 10030001, 10030002, 390000, 'Transfer checking to savings'); insert into bankTransfer values(1111003, 96300206, 10030002, 141117, null);
adds two rows to the bankTransfer table.

One can also add a row and only specify some of the columns (in any order). E.g., the second insert above could have been specified as

insert into bankTransfer (transaction, dstAccount, srcAccount, amountInPennies) values (1111003, 10030002, 96300206, 141117);
The following should give an error if we try it:
insert into bankTransfer(toAccount, srcAccount, amountInPennies) values(10030002, 96300206, 141117);
We have to specify all the non-nullable fields in the insert.

The DELETE command is used to delete rows from database, but we'll defer the discussion of that till after the select.

Querying a table: SELECT

By far the most important SQL statements/command, is the SELECT statement. A SELECT statement takes zero or more input tables and generates an ouput table. The basic syntax of the SELECT is
SELECT selectionList FROM tablelist WHERE condition ORDER BY criteria
For the moment we'll confine ourselves to single table selects. There is still a lot the user can do. E.g.,
select transaction, toAccount amountInPennies from bankTransfer where amountInPennies > 100000 order by toAccount,amountInPennies
This select statement will generate an output table including all the transfers > $1,000. The output will be ordered first by account number and secondly by the amount of the transaction.

The selection list.

The selection list (along with the the initial SELECT keyword) is the only part of the SELECT statement that is not optional. The selection list is a list of comma separated expressions. Each expression can be a constant (e.g., 1 or 'aStringConstant'), a column in the table (as shown in the example) or an expression (e.g., amountInPennies/100).

Expressions can be quite complex, they can involve any number of constants and columns. Most databases support a variety of mathematical (sin, cos, sqrt, exp, ...) and string functions that can be used. The string functions tend to vary from database to database, but normally include functions for getting substrings and changing the case of the string and various kinds of formatting.

If you wish to specify specific column names for the output table this can be done by adding an 'as colName' to an element of the selection list. This can be done to supply a column name where one might be missing (e.g., for a constant or expression), or to override the column name when the element is just a column in the original table.

select transaction as trans, amountInPennies/100 as amountInDollars, 'FQZZ18943' as requestCode from bankTransfer where amountInPennies > 100000;
This example will create an output table with three columns. The requestCode column will always have the value 'FQZZ18943'.

There are two special features in the selection list. Rather than specifying an explicit list of columns the query can just use '*' to say that you want to get all columns. E.g.,

select * from bankTransfer
is equivalent to
select transaction, srcAccount, dstAccount, amountInPennies, comment from bankTransfer;
With limited exceptions you cannot mix '*' and explicit columns in the selection list.

The other special feature is the 'distinct' keyword. If we try

select srcAccount from transaction;
we will get the same number of input and output rows. If the same account is the source of multiple transactions, it will be listed multiple times.
select distinct srcAccount from transaction;
will ensure that only distinct rows are returned. A row is suppressed only if all output fields are identical to a previous row.

The table list.

The table list follows the FROM keyword and indicates one or more tables that are be used in the query. Multitable queries are described below.

The WHERE clause.

The WHERE clause is optional. If specified it is a boolean expression which indicates if a given row should be included in the output. Standard operators like '>', '<' and '=' have their usual mathematical meaning and can be combined using AND and OR operators. Any of the mathematical expressions that might be used in the selection list can also be used in the where clause. Parenthesis can be used to ensure the proper order of operations. E.g.,
select transaction as suspicious_transaction from bankTransfer where amountInPennies < 0 or amountInPennies > 100000000 or srcAccount=dstAccount or comment='Delete me'

Strings can be searched either for exact matches, comment='Delete me' or more flexibly using the LIKE operator. LIKE matches strings that may contain two special characters '%' matches any string of any length (including 0) while '_' matches exactly one character. So

select * from observation where observer like '%Fred_%King%'
will match the rows where the observer is 'Fred King', or 'Fred R. King' or 'xxxxFredxxxxxxxxKingxxx' but not 'FredKing' (since the '_' must match a character).

If no WHERE clause is specified then the number of output rows is the same as the number of input rows. The columns in the output table are determined by the selection list.

The ORDER By clause.

The order by clause specifies the order in which the results will be displayed. Unless this clause is present the order of results is undefined and is not guaranteed to be the same even for an identical query. One or more fields may be specified (expressions can also be used) and the results are ordered by the first elemement. If this is equal for two rows then the second element is used to break the tie and so forth. The keywords "ASC" or "DESC" can be appended to an element to change specify the direction of the sort. ASC is the default. E.g.,
select transaction, amountInPennies/100 from bankTransfer where amountInPennies > 10000000 order by amountInPennies desc;
to get the largest transfers starting with the very largest.

Deleting rows from a database

Now that we've seen the WHERE clause used in the select statement we can show the delete statement. This is simply
DELETE FROM table WHERE condition
Example:
delete from bankTransfer where (amountInPennies < 0 or srcAccount < 0 or dstAccount < 0) and srcAccount != -1

Multiple tables and joins

The real power of the relational database comes when we deal with more than one table at a time. Suppose we have three tables, A, B and C with columns x rows of 3x3, 2x3 and 3x2 respectively.
a11,a12,a13 a21,a22,a23 a31,a32,a33 b11,b12 b21,b22 b31,b23 and c11,c12,c13 c21,c22,c23
Suppose we put all three of these tables in the table list, i.e., select ... from A,B,C .... Conceptually the very first thing the relational database system does is create a cross-product of all of the tables, i.e., a table like:
a11,a12,a13,b11,b12,c11,c12,c13 a11,a12,a13,b11,b12,c21,c22,c23 a11,a12,a13,b21,b22,c11,c12,c13 a11,a12,a13,b21,b22,c21,c22,c23 a11,a12,a13,b31,b32,c11,c12,c13 a11,a12,a13,b31,b32,c21,c22,c23 a21,a22,a23,b11,b12,c11,c12,c13 a21,a22,a23,b12,b12,c21,c22,c23 ... 9 rows skipped a31,a32,a33,b32,b32,c21,c22,c23
This table has a number of rows equal to the product of all of the rows in the participating tables, 18 in our example here. Once this cross-product table has been created, the query works exactly as before for a single table query. The where clause is used to decide which rows to keep, the selection list determines which columns to display, and the order by is used to specify the order of the output.

In practice, databases usually find substantial shortcuts to creating the full cross-product table. If there is a constraint that involves only one table, then we may be able to immediately limit the rows used from that table. However the database actual performs the query this model defines what the results should be.

Joins

Let's add two more tables in the same database as our bank transfer table.
create table account ( userID int, acctNo int, balanceInPennies int } create table user ( userID int, name varchar(64), email varchar(64) );
Suppose we want to find all the transactions that might cause the negative balance... Let's find the bounced checks.

select name,email,amountInPennies,transaction from account,user,bankTransfer where srcAcct = acctNo and account.userID = user.userID and balanceInPennies < amountInPennies; order by acctNo;
There are three conditions in the where clause. The first makes sure that for each account, we only consider the transfers that are associated with that account. The second gets us only the row in the user table that has the name and address of the user who wrote the bad check. These first two conditions are traditional 'joins' between the tables. The database will have been designed in the expectation of tables being associated this way.

The primary key of our bankTransfer table is the transaction number, the primary key of the user table is the userID while the primary key of the account table is the acctNo. We join the tables by seeing if the primary key field of one table matches a 'foreign key' field in the other table. E.g., in account, the userNo is a foreign key field. It picks out an associated row in another table. Similarly the srcAcct and dstAcct fields in the bankTransfer table match the acctNo field in the account table.

When multiple tables are involved, there can be ambiguities about which table a given column belongs to. We can explicitly write out the table (as above) but this tends to look messy. In the table list after the FROM keyword Each table can be associated with a prefix and the prefix used to identify the table. We rewrite the query as

select u.name,u.email,t.amountInPennies,t.transaction from account a, user u, bankTransfer t where t.srcAcct = a.acctNo and a.userID = u.userID and a.balanceInPennies < t.amountInPennies; order by a.acctNo;
The ADQL that is used within the virtual observatory usually requires these prefixes to be used.

Note that there is no requirement that all tables be used in the output, or that the sort order be determined by a visible column. The account table plays a key role in this query, but no fields from this table are displayed.

Kinds of Joins

The structural joins illustrated above are often divided into one-one, one-to-many and many-to-many relationships. A one to one relationship indicates that for each row in one table there is at most one row in the other and vice versa. In essence these two tables could have been combined to generate one largest table, but its often most convenient and efficient to split tables this way. Smaller tables are more efficient. E.g., in addition to our user table we might have a userAddress table. This also has the primary key of userID, but includes the mailing address for the user.

The joins we did above are one-to-many relationships. For each transaction we have only one fromAccount, and only one toAccount. However if we go the other direction, we can find many transactions from a given account. Similarly there is only one user for each account, but there may be many accounts for a given user.

Many-to-many relationships are ubiquitous once we start thinking about the real world. E.g., suppose accounts are allowed to have multiple users. Joint accounts are nothing new. Now each account has (potentially) multiple users and each user has (likely) multiple accounts. We could duplicate the account record for every user of that account, but then whenever we update the account we need to be careful to update every record. The best solution to the need for a many-to-many association is usually to create a new very small table which has only two columns: the keys of the two tables being joined. In our example we would create userAccount table with the two columns userID and acctNo, and we get rid of the userID in the curren account table.

select u.name,u.email,t.amountInPennies,t.transaction from account a, user u, bankTransfer t, userAccount ua where u.userID = ua.userID and ua.acctNo = a.acctNo and t.srcAcct = a.acctNo and a.balanceInPennies < t.amountInPennies; order by a.acctNo;
The link between the user and account tables now uses the userAccount table.

For astronomers, we often want to use non-structural joins, we want to join fields based upon similar physical characteristics, e.g., we might be looking for variables with

select a.name,b.name from table1 a, table2 b where a.ra between b.ra-1 and b.ra+1 and a.dec between a.dec-1 and b.dec+1 and abs(a.vmag-b.vmag) > 1
This query illustrates the ineptitude of SQL in doing nice spatial proximity searches. A cone match user the Haversine formula would work much better but it's rather intimidating written out (though it's not a problem computationally for most databases). How to do this will be shown in the SkyNode server talk.

This kind of join has the potential to be a many-many relationship. The user has to decide how to deal with the multiplicity of hits. Could all the matches be real (one table has better resolution than another)? Do we want to find a 'best' match and if so what is the criterion.

If we include aggregate functions then we can summarize the the actions of the SELECT as:

  1. Create the cross-product of all the tables in the FROM clause.
  2. Eliminate any rows that do not meet the criteria in the WHERE clause
  3. If there are aggregate functions in the selection list, sort the table by GROUP BY parameters, determine the groups, and transform the table into one with just the group parameters and the results of the aggregate functions.
  4. If there is a HAVING clause eliminate rows that do not meet the HAVING criterion.
  5. Sort the table according to the elements in the SORT clause.
  6. Create the output columns according to the SELECT clause.

Building and Querying the SPOCS Database

In the following examples we will see how we can create, modify, and query a database using MySQL. We will be using SPOCS, which you will find in the SummerSchool package under the SQL directory. SPOCS has a very basic schema. It is composed of three tables SPOCS, SPOCS_EXT and SANTOS. SPOCS and SPOCS_EXT are related through the ID column. SPOCS and SANTOS are related through the name and HD columns respectively. The data and content descriptions are included in the SPOCS data directory.

Getting Started

the examples assume that you have set up the MySQL database and can start the MySQL CLI on using the MySQL root account. E.g.,
mysql -u root -ppwd
where pwd is the password you have set for the root account. Remember not to put a space between the -p and the password. Once you start mysql you should see a
mysql>
prompt. You can see if you've got things set up by trying to run the test SQL script as shown here...
mysql> \. test.sql Database changed Query OK, 0 rows affected (1.09 sec) Query OK, 1 row affected (0.11 sec) Query OK, 1 row affected (0.07 sec) +--------+ | avg(y) | +--------+ | 1.5000 | +--------+ 1 row in set (0.07 sec) Query OK, 0 rows affected (0.15 sec) mysql>
To build the database we run the build_all.sql script. This creates the tables, and then inserts all of the data into them. Let's take a look:
-- Create the database we are going to work in. create database SPOCS; -- Tell MySQL that this is the database we are going to use. use SPOCS; -- Create the tables. \. schema.sql -- Build the SPOCS table \. spocs_build.sql -- Build the extension table \. spocs_ext_build.sql -- Build the Santos table \. santos_build.sql
This script just loads a series of other scripts that define and populate the tables.

The schema.sql file defines the structure of the tables. It looks like:

DROP DATABASE IF EXISTS SPOCS; CREATE DATABASE SPOCS; USE SPOCS; CREATE TABLE spocs ( ID smallint not null, Name varchar(20) not null, Teff smallint DEFAULT null, logg double DEFAULT null, Metal double DEFAULT null, Na double DEFAULT null, Si double DEFAULT null, Ti double DEFAULT null, Fe double DEFAULT null, Ni double DEFAULT null, Vsini double DEFAULT null, Vrad double DEFAULT null, CRMS double DEFAULT null, LRMS double DEFAULT null, NKeck tinyint DEFAULT null, NAAT tinyint DEFAULT null, NLick tinyint DEFAULT null, CONSTRAINT PRIMARY KEY (id) ); CREATE TABLE spocs_ext ( ID smallint not null, Name varchar(20) not null, RAh smallint DEFAULT null, RAm smallint DEFAULT null, RAs double DEFAULT null, DEd smallint DEFAULT null, DEm smallint DEFAULT null, DEs double DEFAULT null, Vmag double DEFAULT null, Dis double DEFAULT null, e_Dis double DEFAULT null, logL double DEFAULT null, e_logL double DEFAULT null, Rad double DEFAULT null, e_Rad double DEFAULT null, Mass double DEFAULT null, e_Mass double DEFAULT null, Miso double DEFAULT null, min_Miso double DEFAULT null, max_Miso double DEFAULT null, giso double DEFAULT null, Age double DEFAULT null, min_Age double DEFAULT null, max_Age double DEFAULT null, CONSTRAINT PRIMARY KEY (id) ); CREATE TABLE santos ( RAJ2000 varchar(10) DEFAULT null, DEJ2000 varchar (9) DEFAULT null, cmp varchar(1) DEFAULT null, HD varchar(6) DEFAULT null, m_HD varchar(1) DEFAULT null, n_HD varchar(1) DEFAULT null, Teff smallint DEFAULT null, e_Teff smallint DEFAULT null, logg double DEFAULT null, e_logg double DEFAULT null, Fe double DEFAULT null, e_Fe double DEFAULT null, Inst varchar(3) DEFAULT null, Mass double DEFAULT null, RA double DEFAULT null, DE double DEFAULT null, recno bigint DEFAULT null, Simbad varchar(6) DEFAULT null );
These use a few additional SQL constraints that we haven't talked about. The 'DEFAULT null's indicate the the default values for most fields are just null. The "IF EXISTS" syntax on the DROP DATABASE is not supported by all databases. The same for the CONSTRAINT PRIMARY KEY. The effect of that is to make sure that ID column is unique for every row. It also affects how the database is stored internally.

While the schema.sql file defines the tables, the three table_build.sql files fill them up with data. They are simply sets of insert statements.

insert into spocs values(0,'Sun',5770,4.44,0.00,0.00,0.00,0.00,0.00,0.00,1.7,-0.1,1.80,1.09,6,0,0); insert into spocs values(1,'HD225261',5265,4.59,-0.31,-0.35,-0.25,-0.28,-0.44,-0.43,0.0,7.7,0.57,1.12,1,0,0); insert into spocs values(2,'HD105',6126,4.65,-0.02,-0.20,-0.02,0.05,0.08,-0.06,14.5,1.7,0.48,0.93,1,0,0); insert into spocs values(3,'HD142',6249,4.19,0.08,0.07,0.10,0.07,0.10,0.04,10.4,6.0,0.53,0.99,0,6,0); ...
Pretty boring!

Ok we're ready to build the database...

mysql> \. build_all.sql
When we run the build_all.sql we'll get an acknowledgement each time we add a row. It will take a minute or two to complete.

We're not quite done... The positional information in the SPOCS table is encoded in sexigesimal form. We're going to use these tables as a SkyNode in a few days. It's much more convenient to be able to manipulate real numbers, so we're going to add RA and Dec columns to the SPOCS database. The addcolumns.sql and fillcolumns.sql files do this.

use SPOCS; alter table spocs add ra double precision; alter table spocs add `dec` double precision;
and
use SPOCS; update spocs s, spocs_ext e set s.ra = 15*(e.RAh+e.RAm/60.+e.RAs/3600.) where s.ID = e.ID; update spocs s, spocs_ext e set s.dec = e.DEd + e.DEm/60. + e.DEs/3600. where e.DEd >= 0 and s.ID = e.ID; update spocs s, spocs_ext e set s.dec = e.DEd - e.DEm/60. - e.DEs/3600. where e.DEd < 0 and s.ID = e.ID;
Note the backticks around the dec field. MySQL treats dec by itself as a reserved word (short for decimal, a data type). The backticks allow us to use such reserved words as column names. There is no problem referring to dec as a column when it is preceded by a table prefix.

If we run these:

mysql> \. addcolumns.sql Database changed Query OK, 1040 rows affected (0.52 sec) Records: 1040 Duplicates: 0 Warnings: 0 Query OK, 1040 rows affected (0.50 sec) Records: 1040 Duplicates: 0 Warnings: 0 mysql> \. fillcolumns.sql Database changed Query OK, 1039 rows affected (0.82 sec) Rows matched: 1040 Changed: 1039 Warnings: 0 Query OK, 541 rows affected (0.21 sec) Rows matched: 541 Changed: 541 Warnings: 0 Query OK, 498 rows affected (0.30 sec) Rows matched: 498 Changed: 498 Warnings: 0
Only get 1039 rows changed. The first row, the Sun, doesn't have positional information.

The ALTER TABLE command modifies and existing table. The syntax and capabilities change differ among database systems. Often it's better to just delete the database and build it right the first time!

The UPDATE command changes the values of fields. It fills in the blank RA and Dec fields that we created with the alter table. Note that we get the information from the spocs_ext table, but we created the new columns on the spocs table. SQL can be quite powerful if you have the ability to change the database. It's also easy to destroy the database. Most of the time want users have only read access to the database.

Note: There is an error in the SPOCS database that we have built here. You may be able to deduce it purely from the schema. We'll discuss--and show how to fix--this error in the SkyNode server session, but bonus points to anyone who can figure out what it is before then.

Querying SPOCS.

Query 1: Find all solar analogs.
The Sun has a temperature of Teff=5770 K and a gravity of logg=4.4 and a metallicity of [M/H]=0.0. Stars like the Sun (analogs) will have values of these three parameters similar to the solar values. In this example, we look for stars with temperatures within 50 K of the Sun and similar gravities and metallicities.
SELECT s.ID, s.Name, s.Teff, s.Logg, s.Metal FROM spocs s WHERE s.Teff BETWEEN 5720 AND 5820 AND s.Logg BETWEEN 4.34 AND 4.54 AND s.Metal BETWEEN -0.1 AND +0.1 ORDER BY s.Teff;
Query 1b: Computes the average values for Teff, Logg and Fe.
Demonstrates the use of the avg function applied to full columns
SELECT avg(s.Teff), avg(s.Logg), avg(s.Fe) FROM spocs s
Query 2: Putting your results into a file.
Query 3. Look for systematic trends vs temperatures.
Making a star a little bit cooler, reducing the gravity a little bit, or increasing the metallicity a little bit all tend to make spectral lines a little bit deeper. This unfortunate reality makes it hard to get the temperature, gravity, and metallicity exactly correct. When astronomers measure gravity, any error the temperature or metallicity will cause a compensating error in gravity. Similarly, when measuring metallicity, any error in temperature and gravity will cause a compensating error in metallicity. All three parameters are interlinked.* Detecting and characterizing systematic errors can be difficult. Here we compare stellar parameters from two different spectroscopists, using different analysis techniques. Ordering the differences in gravity or metallicity as a function of increasing temperature shows trends. The size of these trends are an indication of the minimum systematic error in one or both of the results from the two analyses.

The next step would be to plot the trends, using Excel or any other plotting package.

SELECT s.ID, s.Name, s.Teff, (s.Teff - e.Teff) as dTeff, s.Logg, format(s.Logg - e.Logg,2) as dLogg, s.Fe, format(s.Fe - e.Fe,2) as dFe FROM spocs s, santos e WHERE substring(s.Name,3) = trim(e.HD) ORDER BY s.Teff;
Query 4: Count how many times each star appears in the Santos catalog
Notice that some stars in the Santos catalog have measurements for multiple observations. Frequently, we want to combine multiple measurements of the same quantity into a single value. This is a very likely scenario in the world of VO, which draws upon many different catalogs. In this simple case, the multiple measurements are in the same catalog.
SELECT s.ID, s.Name, count(e.HD) c FROM spocs s, santos e WHERE substring(s.Name, 3) = trim(e.HD) GROUP BY s.ID, s.Name ORDER BY c desc; Query 5: Combine multiple entries of the same star from SANTOS.
Here we average together multiple observations of the same star in the **Santos** catalog. For stars with multiple observations, averaging improves the precision of the result. Combining multiple values for each star also provides a single number to use when comparing results from the two catalogs. (The comparison in Q3 compared each measurement from the **Santos** catalog separately.)* This prints the avg value of two measurements and demonstrates how to use group by to get averages per group of rows. The format function converts the numbers into nicely formatted strings. We also discard rows that have already been averaged.
SELECT s.ID, s.Name, count(s.ID) n, format(avg(e.Teff), 1) as mean_Teff, format(avg(s.Teff - e.Teff), 1) as mean_dTeff, format(avg(s.Logg - e.Logg), 2) as mean_dLogg, format(avg(s.Fe - e.Fe), 2) as mean_dFe FROM spocs s, santos e WHERE substring(s.Name, 3) = trim(e.HD) AND e.Inst != 'avg' GROUP BY s.ID ORDER BY mean_Teff;
Query 6: Get rid of the AVG rows.
In the previous query we had to handle the 'avg' rows specially. This query just removes them, so we don't need to worry about them in the future. It's also possible on most database systems to create a view, a kind of virtual table, in which these troublesome rows wouldn't show up.
DELETE FROM santos WHERE Inst = 'avg';
Query 7: Adds a column estimating the error in the effective temperature.
The spectra used to create the SPOCS catalog were all obtained with very similar spectrographs. Exposure times were adjusted so that each spectrum had the same high signal to noise ratio. Thus, the authors claim that the error in temperature, gravity, metallicity, etc. are the same for any individual spectrum. These error values for individual spectra are given in a small table in the SPOCS paper. For example, the error in Teff is 44 K for a measurement based on one spectrum. When the authors average results for N spectra of the same star, they claim the error in the result should get smaller, going from 44 K down to 44/sqrt(N) K.

The SPOCS spectra come from three observatories: Keck Observatory in Hawaii, the Anglo-Australian Telescope (AAT) in Australia (duh!), and Lick Observatory in California. The total number of spectra is just the sum of the number of spectra from each of these three observatories. This 1/sqrt(N) adjustment to the error is certainly optimistic, but illustrates how we can dynamically add scientific information to the table. Real-world errors don't get smaller quite so fast

ALTER TABLE spocs ADD e_Teff smallint after Teff; UPDATE spocs s SET e_Teff = round(44.0/sqrt(s.NKeck + s.NAAT + s.NLick));
Query 8: Calculate the significance of differences in temperature between the SPOCS catalog and the Santos catalogs.
Now we are going to do some error analysis. We'll compare the temperature from Santos for each star with the temperature from SPOCS. Of course the two values differ, but the question is whether the difference is reasonable, given the errors (actually given the "uncertainties") provided by the authors.

What is the expected uncertainty (e_Diff) in this difference? The typical error in "Diff" is given by the following formula, which combines the uncertainty in the SPOCS value (e_SPOCS) and the error in the Santos value (e_Santos)
e_Diff2 = e_SPOCS2 + e_Santos2
The square of the error/uncertainty squared is the variance. For uncorrelated errors, when we add (or subtract) measurements, we can just add the variance to get the anticipated error in the combined measurement.

Anyway, back to our goal of deciding whether the differences between SPOCS and Santos are significant. The calculations above give us Diff and e_Diff. If all differences are just due to Gaussian noise (which people often assume), then about 2/3 of the Diff values should be between -e_Diff and +e_Diff. In words, one would say, "If difference are just due to Gaussian noise, then two-thirds of our calculated Diffs should be between minus one-sigma and plus one-sigma."

If we divide Diff by e_Diff, then 2/3 of our calculate Diff/e_Diff values should be between -1 and +1. This is easy to test in a table, especially if sorted by Diff.

SELECT s.ID, s.Name, s.Teff, s.e_Teff, e.Teff, e.e_Teff, (s.Teff - e.Teff) as dTeff, round(sqrt((s.e_Teff * s.e_Teff) + (e.e_Teff* e.e_Teff))) as e_dTeff, (s.Teff - e.Teff) /sqrt((s.e_Teff * s.e_Teff) + (e.e_Teff* e.e_Teff)) as sigma FROM spocs s, santos e WHERE substring(s.Name, 3) = trim(e.HD) ORDER BY sigma;
Query 9: Three way joins.
SELECT s.ID, s.Name, e.HD as santosHD, t.Mass, (t.Mass - e.Mass) as dMass, s.Teff, (s.Teff - e.Teff) as dTeff, s.Logg, format(s.Logg - e.Logg,2) as dLogg, s.Fe, format(s.Fe - e.Fe,2) as dFe FROM spocs s, santos e, spocs_ext t WHERE substring(s.Name, 3) = trim(e.HD) AND s.ID = t.ID ORDER BY s.Teff;
Query 10: Is there a systematic difference between the 2 catalogs?
In Query 8 we saw that individual measurements from SPOCS and Santos seemed to be consistent to within the predicted errors. But, there could still be small systematic offsets between the two catalogs that are too small to see by looking at measurements for only individual stars. What happens if try the average for all the matched stars?
SELECT format(avg (s.Teff - e.Teff), 1) as mean_dTeff, format(avg(s.Logg - e.Logg), 3) as mean_dLogg, format(avg(s.Fe - e.Fe), 3) as mean_dFe FROM spocs s, santos e WHERE substring(s.Name, 3) = trim(e.HD);

Online Astronomical Databases

As noted below in the links section there are a number of astronomical databases currently on-line that support direct SQL queries. Additionally there are many more that use the VO dialect of SQL, ADQL (or Astronomical Data Query Language). We can try out some queries on these databases as well.

First, we will need a database. The examples below use the Sloan Digital Sky Survey catalog on-line databases from any of the SkyServer Web sites (DR2, DR3, or DR4):

You can get familiar with the database schema by giving a look at its picture and by using the *Schema Browser* located on the top menu. Open the browser in a new window. It will help you to formulate your queries.
Query 1: What are the u and r magnitudes for object X? In what field is the object located?*
SELECT p.objid, p.modelMag_u, p.modelMag_r, p.fieldID FROM PhotoTag p WHERE p.objid = 587726014001185324
Query 2: What do we know about object X from the PhotoTag table?
SELECT p.* FROM PhotoTag p= WHERE p.objid = 587726014001185324
Query 2b: Tell me all you know about X objects in the PhotoTag table.
SELECT TOP 10 p.* FROM PhotoTag p
TOP is not supported by all (or even most) systems. There are many different syntaxes that are used by different databases for only getting the first few rows returned from a query. Most databases support it using some syntax. When using TOP, there is no guarantee that this will bring the same results unless it is used with an ORDER BY. Using TOP (or whatever its equivalent is in the current database) is a "quick and dirty" way to know what columns are available in a table and get a subsample of its content. The ways a user can limit a query for the most common databases are described at the A Gentle Introduction to SQL web pages.
Query 3: Tell me the identifier of all observation runs without repetitions.
SELECT DISTINCT f.run FROM Field f
Query 4: Find objects contained in fields X and Y with u magnitude between M and N.
We are interested only in those with u magnitude fainter than r by F units.
SELECT p.objid, p.modelMag_u, p.modelMag_r, p.modelMag_u - p.modelMag_r as modelMag_ur, p.fieldID FROM PhotoTag p WHERE p.modelMag_u between 27.5 AND 28 AND p.modelMag_u - p.modelMag_r > 0.5 AND (p.fieldID = 587726014001184768 OR p.fieldID = 587726014001250304)
Query 5: Find objects contained in fields X and Y with u magnitude between M and N.
We are interested only on those with u magnitude fainter than r by F units. Show them in descending order of u magnitude.
SELECT p.objid, p.modelMag_u, p.modelMag_r, p.modelMag_u - modelMag_r as modelMag_ur, p.fieldID FROM PhotoTag p WHERE p.modelMag_u between 27 AND 28 AND p.modelMag_u - p.modelMag_r > 0.5 AND (p.fieldID = 587726014001184768 OR p.fieldID = 587726014001250304) ORDER BY p.modelMag_u DESC
Query 6: What are the average u and r magnitudes for galaxies in field X?
We can use aggregate functions in our queries:
SELECT avg(g.modelMag_u) as avg_modelMag_u, avg(g.modelMag_r) avg_modelMag_r FROM Galaxy g WHERE g.fieldID = 587726014001184768
Query 7: What are the average u and r magnitudes for galaxies observed during run X, rerun Y, with camcol Z grouped by field.
SELECT g.fieldID, avg(g.modelMag_u) as avg_modelMag_u, avg(g.modelMag_r) avg_modelMag_r FROM Galaxy g WHERE g.run = 1458 AND g.rerun = 40 AND g.camcol = 1 GROUP BY g.fieldID ORDER BY g.fieldID
Query 8: Find the average u and r magnitudes for galaxies observed during run X, rerun Y, with camcol Z grouped and ordered by field.
Here we are only only interested in average r magnitudes higher than T.
SELECT g.fieldID, avg(g.modelMag_u) as avg_modelMag_u, avg(g.modelMag_r) avg_modelMag_r FROM Galaxy g WHERE g.run = 1458 AND g.rerun = 40 AND g.camcol = 1 AND avg(g.modelMag_r) > 21.53 GROUP BY g.fieldID ORDER BY g.fieldID
This query fails. Aggregate functions are not allowed in the WHERE clause. We need the HAVING keyword which specifies a search condition for a group or an aggregate. HAVING is usually used with the GROUP BY clause. When GROUP BY is not used, HAVING behaves like a WHERE clause.
SELECT g.fieldID, avg(g.modelMag_u) as avg_modelMag_u, avg(g.modelMag_r) avg_modelMag_r FROM Galaxy g WHERE g.run = 1458 AND g.rerun = 40 AND g.camcol = 1 GROUP BY g.fieldID HAVING avg(g.modelMag_r) > 21.53 ORDER BY g.fieldID
Query 9: What are the redshift and i magnitude of objects contained in MyPhotoObj?
We want to all photometry objects whether they have spectra or not.
SELECT p.objid, p.i, s.specobjid, s.z FROM PhotoObj p LEFT OUTER JOIN SpecObj s ON p.objid = s.bestobjid WHERE p.objid IN (587726014001185022, 587726014001315956, 587726014001315970, 587726031180792552)

SQL in the VO

One of the primary service types supported by the Virtual Observatory is the SkyNode. SkyNodes support queries using a dialect of SQL known as ADQL. When properly invoked by a sufficiently intelligent SkyPortal, they can also handle queries of distributed databases.

The primary limitations of SQL with regard to the VO are lack of support for spatial queries and issues with distributed queries for the very large astronomical data. Given the growth of GIS systems, most database systems now have extensions to support such queries relatively efficiently, but standards are still evolving.

SQL and ADQL will be further discussed in a number of other courses in this summer school, including discussions of the SkyPortal services, ADQL, and the use and deployment of SkyNodes.

Links of interest

Some astronomical databases on-line using SQL interfaces

The Sloan Digital Sky Survey - SDSS
http://skyserver.sdss.org/
Galaxy Evolution Explorer - Galex
http://galex.stsci.edu/
SuperCOSMOS Science Archive - SSA
http://surveys.roe.ac.uk/

General SQL Tutorials

A Gentle Introduction to SQL
http://sqlzoo.com
This reference shows how some queries and operations differ among popular databases systems.
SQL Course.com
http://sqlcourse.com/
http://www.sqlcourse2.com/
SQL Tutorial
http://www.w3schools.com/sql/
FirstSQL
http://www.firstsql.com/tutor.htm
SDSS SQL Tutorial http://cas.sdss.org/dr4/en/help/docs/sql_help.asp
SampleQueries

Some Vendor Links