SQL
Course Description
In the era of Tera and Peta byte astronomical datasets, Database Management Systems (DBMS) are becoming essential tools to access and mine the data. This class will cover basic concepts related to Relational Database Management Systems (RDBMS) and the Structured Query Language (SQL). The student will learn how to formulate a diversity of queries varying in complexity and will experiment with current on-line astronomical databases. Performance and query optimization will be discussed as well.
1. Introduction
SQL (Structured Query Language) is the language used for manipulating and querying relational databases.
A relational database is a group of related (sets) tables containing data. Tables are composed of one or more columns and rows. Columns have a name and a data type (float, double, integer, varchar, etc). Columns define the content of the database. Rows are the content of the database. Rows are usually also referred as records. Tables in the database are related to each other using keys. A primary key is a column (or group of columns, composed key) with a unique value for each row. A gn key is a column (or group of columns) which is a primary key in another table. Keys are used to bind objects between tables. In order to enable fast access to the data, tables are indexed. Indexes can be built on one or more columns (covering index). Tables are usually indexed on their primary key by default and often times on foreign keys as well to speed up joins. Some times, different type of users need to see subsets of the table content. To avoid to repeat the content we define views. Views, from the user's point of view, are equivalent to tables.
The description of tables (columns, keys, and indexes) and their relationships conform the database schema. In order to work with a database it is important to know its schema.
2. Getting Started
First, we will need a database. The examples below will be using:
The Sloan Digital Sky Survey catalog on-line databases from any of the SkyServer Web sites (DR2, DR3, or DR4):
- http://skyserver.sdss.org/dr2/en/tools/search/sql.asp
- http://cas.sdss.org/dr3/en/tools/search/sql.asp
- http://cas.sdss.org/dr4/en/tools/search/sql.asp
You can get familiar with the database schema by giving a look at its picture and by using the Schema Browser located on the top menu. Open the browser in a new window. It will help you to formulate your queries.
3. SQL Statements
There are three basic types of SQL statements.
- Data statements to access an modify the content of the database (SELECT, INSERT, UPDATE, DELETE)
- Schema statements to create and modify the database schema (CREATE, DROP, GRANT, REVOKE)
- Transaction statements to control concurrent database access and recovery in case of failure (COMMIT, ROLLBACK)
We will be covering mostly Data access statements and some for the Schema definition.
4. Accessing Data
The most basic statement in SQL is SELECT which is composed of
- A SELECT clause, which specifies the columns you wish to retrieve
- A FROM clause, which specifies the database tables you want to extract the data from
- A WHERE clause, which specifies the limitations/predicates you want to place on the extracted data
Basic syntax:
SELECT [ALL | DISTINCT] column1[,column2]
FROM table1[,table2]
[WHERE conditions]
[GROUP BY column-list]
[HAVING conditions]
[ORDER BY column-list [ASC | DESC] ]
SELECT s.name, s.age
FROM Students s
WHERE s.age > 16 and s.age < 20
<
For example, using SkyServer
Q1: What are the u and r magnitudes for object X? In what field is the object located?
SELECT p.objid, p.modelMag_u, p.modelMag_r, p.fieldID
FROM PhotoTag p
WHERE p.objid = 587726014001185324
SELECT p.objid, p.modelMag_u, p.modelMag_r, p.fieldID FROM PhotoTag p WHERE p.objid = 587726014001185324
| objid | modelMag_u | modelMag_r | fieldID |
| 587726014001185324 | 24.49 | 21.261 | 587726014001184768 |
If you are interested in all columns for a given record you can use the *
Q2: What do you know about object X from the PhotoTag table?SELECT p.*
FROM PhotoTag p
WHERE p.objid = 587726014001185324
Q2b: Tell me all you know about X objects in the PhotoTag table.
SELECT TOP 10 p.*
FROM PhotoTag p
TOP is not supported by all systems. It is important to note that when using TOP, there is no guarantee that this will bring the same results always unless is used with an ORDER BY (This is true for any SELECT statemet in general). Often times, the TOP keyword is a "quick and dirty" way to know what columns are available in a table and get a subsample of its content. (In MySQL, instead of TOP is called LIMIT and goes at the end of the SELECT stament after FROM, WHERE, ...)
DISTINCT is useful when you want to know how many unique values you have in a given column or combination of columns.
Q3: Tell me the identifier of all observation runs without repetitions.SELECT DISTINCT f.run
FROM Field f
| run |
| 745 |
| 752 |
| ... |
In the WHERE clause we specify filters or constraints to our search. The following operators can be used combined with AND and OR:
| Operator | Description |
| = | Equal |
| <> | Not equal |
| > | Greater than |
| < | Less than |
| >= | Greater than or equal |
| <= | Less than or equal |
| BETWEEN | Between an inclusive range |
| LIKE | Search for a pattern |
Note: In some versions of SQL the <> operator may be written as ! =
Q4: Find objects contained in fields X and Y with u magnitude between M and N. I interested only on those with u magnitude fainter than r by F units.SELECT p.objid, p.modelMag_u, p.modelMag_r,
p.modelMag_u - p.modelMag_r as modelMag_ur, p.fieldID
FROM PhotoTag p
WHERE p.modelMag_u between 27.5 AND 28
AND p.modelMag_u - p.modelMag_r > 0.5
AND (p.fieldID = 587726014001184768 OR p.fieldID = 587726014001250304)
| objid | modelMag_u | modelMag_r | modelMag_ur | fieldID |
| 587726014001250343 | 27.967 | 19.904 | 8.064 | 587726014001250304 |
The ORDER BY keyword sorts the results in ascending order by default or in descending order if DESC is specified. More than one column may be included.
Q5: Find objects contained in fields X and Y with u magnitude between M and N. I interested only on those with u magnitude fainter than r by F units. Show them in descending order of u magnitude.SELECT p.objid, p.modelMag_u, p.modelMag_r,
p.modelMag_u - modelMag_r as modelMag_ur, p.fieldID
FROM PhotoTag p
WHERE p.modelMag_u between 27 AND 28
AND p.modelMag_u - p.modelMag_r > 0.5
AND (p.fieldID = 587726014001184768 OR p.fieldID = 587726014001250304)
ORDER BY p.modelMag_u DESC
| objid | modelMag_u | modelMag_r | modelMag_ur | fieldID |
| 587726014001250343 | 27.967 | 19.904 | 8.064 | 587726014001250304 |
| 587726014001250816 | 27.146 | 19.374 | 7.771 | 587726014001250304 |
| 587726014001184814 | 27 | 20.788 | 6.213 | 587726014001184768 |
The SELECT clause can use agregate functions like COUNT, SUM, AVG, MIN, and MAX that summarize column values
>Q6: What are the average u and r magnitudes for galaxies in field X?SELECT avg(g.modelMag_u) as avg_modelMag_u, avg(g.modelMag_r) avg_modelMag_r
FROM Galaxy g
WHERE g.fieldID = 587726014001184768
| avg_modelMag_u | avg_modelMag_r |
| 23.3848 | 21.732277 |
Aggregate functions are usually used along with the GROUP BY keyword
Q7: What are the average u and r magnitudes for galaxies observed during run X, rerun Y, with camcol Z grouped and ordered by field?SELECT g.fieldID, avg(g.modelMag_u) as avg_modelMag_u, avg(g.modelMag_r) avg_modelMag_r
FROM Galaxy g
WHERE g.run = 1458 AND g.rerun = 40 AND g.camcol = 1
GROUP BY g.fieldID
ORDER BY g.fieldID
| fieldID | avg_modelMag_u | avg_modelMag_r |
| 587726013994565632 | 23.396385 | 21.529083 |
| 587726013994631168 | 23.358104 | 21.549631 |
| ... | ... | ... |
SELECT g.fieldID, avg(g.modelMag_u) as avg_modelMag_u, avg(g.modelMag_r) avg_modelMag_r
FROM Galaxy g
WHERE g.run = 1458 AND g.rerun = 40 AND g.camcol = 1 AND avg(g.modelMag_r) > 21.53
GROUP BY g.fieldID
ORDER BY g.fieldID
This query fails. Aggregate functions are not allowed in the WHERE clause.
We need the HAVING keyword which specifies a search condition for a group or an aggregate. HAVING is usually used with the GROUP BY clause. When GROUP BY is not used, HAVING behaves like a WHERE clause.
Q8b: Tell me the average u and r magnitudes for galaxies observed during run X, rerun Y, with camcol Z grouped and ordered by field. I'm only interested in average r magnitudes higher than T.SELECT g.fieldID, avg(g.modelMag_u) as avg_modelMag_u, avg(g.modelMag_r) avg_modelMag_r
FROM Galaxy g
WHERE g.run = 1458 AND g.rerun = 40 AND g.camcol = 1
GROUP BY g.fieldID
HAVING avg(g.modelMag_r) > 21.53
ORDER BY g.fieldID
| fieldID | avg_modelMag_u | avg_modelMag_r |
| 587726013994631168 | 23.358104 | 21.549631 |
| 587726013995155456 | 23.348868 | 21.740538 |
| ... | ... | ... |
SELECT g.fieldID, count(g.objid) cnt_obj
FROM Galaxy g
WHERE g.run = 1458 AND g.rerun = 40 AND g.camcol = 1
GROUP BY g.fieldID
HAVING avg(g.modelMag_r) BETWEEN 21.5 AND 21.75
ORDER BY g.fieldID
| fieldID | cnt_obj |
| 587726013994565632 | 428 |
| 587726013994631168 | 425 |
| ... | ... |
JOINS: Using more than one table
Often times we want to retrieve data from more than one table. To do so we use JOINS. There are three types of JOINS: INNER, OUTER, CROSS
MyPhotoObj:
SELECT p.objid, p.ra, p.dec, p.i
FROM PhotoObj p
WHERE p.objid IN (587726014001185022, 587726014001315956, 587726014001315970,587726031180792552)
| objid | ra | dec | i |
| 587726014001185022 | 193.751615 | 1.548215 | 16.976 |
| 587726014001315956 | 193.97825 | 1.467137 | 17.599 |
| 587726014001315970 | 194.022063 | 1.535068 | 16.159 |
| 587726031180792552 | 193.750587 | 1.257379 | 22.06 |
MySpecObj:
SELECT s.specobjid, s.ra, s.dec, s.z, s.bestobjid
FROM SpecObj s
WHERE s.specobjid IN (147434863106981888, 147434863341862912, 147434863358640128, 147434863694184448)
| specobjid | ra | dec | z | bestobjid |
| 147434863106981888 | 194.992 | 2.049279 | 2.9E-3 | 0 |
| 147434863341862912 | 193.97826 | 1.467157 | 0.267 | 587726014001315956 |
| 147434863358640128 | 194.02205 | 1.535082 | 0.071 | 587726014001315970 |
| 147434863694184448 | 193.75164 | 1.548211 | 0.124 | 587726014001185022 |
Q9: Given
the tables above, what are the redshift and i magnitude of
objects contained in MySpecObj?
In the
following example, we only return those objects which are matched between the
multiple tables. We need to perform an INNER JOIN.
The old style but still very common
way of writting this query would be:
SELECT p.objid, p.i, s.specobjid, s.z
FROM MyPhotoObj p, MySpecObj s
WHERE p.objid = s.bestobjid
The ANSI, SQL-92 syntax:
SELECT p.objid, p.i, s.specobjid, s.z
FROM MyPhotoObj p INNER JOIN MySpecObj s
ON p.objid = s.bestobjid
objid
i
specobjid
z
587726014001185022
16.976
147434863694184448
0.124
587726014001315956
17.599
147434863341862912
0.267
587726014001315970
16.159
147434863358640128
0.071
SELECT p.objid, p.i, s.specobjid, s.z FROM MyPhotoObj p, MySpecObj s WHERE p.objid = s.bestobjid
SELECT p.objid, p.i, s.specobjid, s.z
FROM MyPhotoObj p INNER JOIN MySpecObj s
ON p.objid = s.bestobjid
Q9b: To run in SkyServer
SELECT p.objid, p.i, s.specobjid, s.z
FROM PhotoObj p INNER JOIN SpecObj s
ON p.objid = s.bestobjid
WHERE p.objid IN (587726014001185022, 587726014001315956, 587726014001315970, 587726031180792552)
SELECT p.objid, p.i, s.specobjid, s.z
FROM PhotoObj p INNER JOIN SpecObj s
ON p.objid = s.bestobjid
WHERE p.objid IN (587726014001185022, 587726014001315956, 587726014001315970, 587726031180792552)
<** the keyword INNER is optional. By
default a JOIN is a INNER JOIN
If we want to include all rows of one of the tables, regardless of whether or not they are matched to another table, we must perform an OUTER JOIN. There three types of OUTER JOIN: LEFT, RIGHT, and FULL
In the example below, we perform a LEFT OUTER JOIN, which means that we will get all entries (regardless of matching) from the table on the left side of the join.
Q10: What are the redshift and i magnitude of objects contained in MyPhotoObj? I want to see all photometry objects whether they have spectra or not.SELECT p.objid, p.i, s.specobjid, s.z
FROM MyPhotoObj p LEFT OUTER JOIN MySpecObj s
ON p.objid = s.bestobjid
| objid | i | specobjid | z |
| 587726014001185022 | 16.976 | 147434863694184448 | 0.124 |
| 587726014001315956 | 17.599 | 147434863341862912 | 0.267 |
| 587726014001315970 | 16.159 | 147434863358640128 | 0.071 |
| 587726031180792552 | 22.06 | null | null |
Q10b: To run in SkyServer
SELECT p.objid, p.i, s.specobjid, s.z
FROM PhotoObj p LEFT OUTER JOIN SpecObj s
ON p.objid = s.bestobjid WHERE p.objid IN (587726014001185022, 587726014001315956, 587726014001315970, 587726031180792552)
SELECT p.objid, p.i, s.specobjid, s.z
FROM PhotoObj p LEFT OUTER JOIN SpecObj s
ON p.objid = s.bestobjid WHERE p.objid IN (587726014001185022, 587726014001315956, 587726014001315970, 587726031180792552)
In the example below, we perform a RIGHT OUTER JOIN, which means that we will get all entries (regardless of matching) from the table on the RIGHT side of the join.
Q11: What are the redshift and i magnitude of objects contained in MySpecObj? I want to see all spec objects whether they have photometry or not.SELECT p.objid, p.i, s.specobjid, s.z
FROM MyPhotoObj p RIGHT OUTER JOIN MySpecObj s
ON p.objid = s.bestobjid
| objid | i | specobjid | z |
| null | null | 147434863106981888 | 2.9E-3 |
| 587726014001315956 | 17.599 | 147434863341862912 | 0.267 |
| 587726014001315970 | 16.159 | 147434863358640128 | 0.071 |
| 587726014001185022 | 16.976 | 147434863694184448 | 0.124 |
Q11b: To run in SkyServer
SELECT p.objid, p.i, s.specobjid, s.z
FROM PhotoObj p RIGHT OUTER JOIN SpecObj s
ON p.objid = s.bestobjid
WHERE s.specobjid IN (147434863106981888, 147434863341862912, 147434863358640128, 147434863694184448)
SELECT p.objid, p.i, s.specobjid, s.z
FROM PhotoObj p RIGHT OUTER JOIN SpecObj s
ON p.objid = s.bestobjid
WHERE s.specobjid IN (147434863106981888, 147434863341862912, 147434863358640128, 147434863694184448)
Finally, we can also perform a FULL OUTER JOIN, which will show all entries regardeless of matching
Q12: What are the redshift and i magnitude of objects contained in MySpecObj and myPhotoObj? I want to see all spec and phot objects.SELECT p.objid, p.i, s.specobjid, s.z
FROM MyPhotoObj p FULL OUTER JOIN MySpecObj s
ON p.objid = s.bestobjid
| objid | i | specobjid | z |
| null | null | 147434863106981888 | 2.9E-3 |
| 587726014001185022 | 16.976 | 147434863694184448 | 0.124 |
| 587726014001315956 | 17.599 | 147434863341862912 | 0.267 |
| 587726014001315970 | 16.159 | 147434863358640128 | 0.071 |
| 587726031180792552 | 22.06 | null | null |
Q12b: To emulate in SkyServer
SELECT p.objid, p.i, s.specobjid, s.z
FROM PhotoObj p LEFT OUTER JOIN SpecObj s
ON p.objid = s.bestobjid
WHERE p.objid IN (587726014001185022, 587726014001315956, 587726014001315970, 587726031180792552)
UNION
SELECT p.objid, p.i, s.specobjid, s.z
FROM PhotoObj p RIGHT OUTER JOIN SpecObj s
ON p.objid = s.bestobjid
WHERE s.specobjid IN (147434863106981888, 147434863341862912, 147434863358640128, 147434863694184448)
5. Creating, modifying, and querying a database
SELECT p.objid, p.i, s.specobjid, s.z
FROM PhotoObj p LEFT OUTER JOIN SpecObj s
ON p.objid = s.bestobjid
WHERE p.objid IN (587726014001185022, 587726014001315956, 587726014001315970, 587726031180792552)
UNION
SELECT p.objid, p.i, s.specobjid, s.z
FROM PhotoObj p RIGHT OUTER JOIN SpecObj s
ON p.objid = s.bestobjid
WHERE s.specobjid IN (147434863106981888, 147434863341862912, 147434863358640128, 147434863694184448)
In the following examples we will see how we can create, modify, and query a database using MySQL. All examples have also been implemented using SQL Server 2000.
We will be using SPOCS, which you will find in the SummerSchool package under the SQL directory. SPOCS has a very basic schema. It is composed of three tables SPOCS, SPOCS_EXT and SANTOS. SPOCS and SPOCS_EXT are related through the ID column. SPOCS* and SANTOS are related through the name and HD columns respectively. The data and content descriptions are included in the SPOCS\data directory.
In the SPOCS_Schema.sql file you will find first how to CREATE the database
DROP DATABASE IF EXISTS SPOCS;
CREATE DATABASE SPOCS;
and three CREATE TABLE statements like the following
CREATE TABLE SPOCS (
ID smallint not null,
Name nvarchar(20) not null,
Teff smallint DEFAULT null,
logg double DEFAULT null,
Metal double DEFAULT null,
Na double DEFAULT null,
Si double DEFAULT null,
Ti double DEFAULT null,
Fe double DEFAULT null,
Ni double DEFAULT null,
Vsini double DEFAULT null,
Vrad double DEFAULT null,
CRMS double DEFAULT null,
LRMS double DEFAULT null,
NKeck tinyint DEFAULT null,
NAAT tinyint DEFAULT null,
NLick tinyint DEFAULT null,
CONSTRAINT PRIMARY KEY (id)
);
The nullability of a column determines if the rows in the table can contain a null value for that column. A null value, or NULL, is not the same as zero (0), blank, or a zero-length character string such as ""; NULL means that no entry has been made. The presence NULL usually implies that the value is either unknown or undefined.
After you have created your database you will need to load the data. SPOCS_Load.sql includes statements like
LOAD DATA INFILE ''c:\\NVOSS2005\\sql\\SPOCS\\data\\SPOCSWH.csv''
INTO TABLE SPOCS
FIELDS TERMINATED BY '',''
IGNORE 1 LINES; -- Ignores the header
To create and load the SPOCS database go to the SQL\SPOCS\mySQL: and start up mysql
C:\NVOSS2005\sql\SPOCS\mySQL> mysql
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 6 to server version: 4.1.12a-nt
Type ''help;'' or ''\h'' for help. Type ''\c'' to clear the buffer.
mysql > \. SPOCS_Schema.sql
mysql > \. SPOCS_Load.sql -- if you are working with Windows
mysql > \. SPOCS_Load4LM.sql -- if you are working with Linux or Mac
There is additional information about how to run SQL commands with MySQL in SQL\SPOCS\mySQL\HowTo.txt
From now on, we will be running one by one each query SPOCS_Query*.sql from inside mySQL.
SPOCS_Query1
mysql > \. SPOCS_Query1.sql
-- Q1: Find all solar analogs
-- Demonstrates selection of objects based on filtering criteria
-- WARNING: MySQL doesn''t work properly when comparing floats and constants.
-- This query wouldn''t work properly if Metal is declared as a float.
-- If you are going to do this type of comparisons use doubles.
SELECT s.ID, s.Name, s.Teff, s.Logg, s.Metal
FROM SPOCS s
WHERE s.Teff BETWEEN 5720 AND 5820
AND s.Logg BETWEEN 4.34 AND 4.54
AND s.Metal BETWEEN -0.1 AND +0.1
ORDER BY s.Teff;
Q1: "Find solar analogs"
The Sun has a temperature of Teff=5770 K and a gravity of logg=4.4 and a metallicity of [M/H]=0.0. Stars like the Sun (analogs) will have values of these three parameters similar to the solar values. In this example, we look for stars with temperatures within 50 K of the Sun and gravities within 0.1 dex of the Sun and metallicities within 0.1 dex. (Here "0.1 dex" means a change of 0.1 in the logarithm of some quantity. This term is commonly used in astronomy.)
SPOCS_Query2 series show various ways of putting your results into a file
mysql > \. SPOCS_Query2.sql
-- Q2: Create an HR diagram with data from SPOCS and SPOCS_EXT
-- Demonstrates a INNER JOIN between two tables where the
-- same ID corresponds to the same object.
-- Shows how to copy the query results into a file.
-- This syntax only works if you are on the server.
-- if your %NVOSS_HOME% is other than c:\\NVOSS2005 you will have to change
-- c:\\NVOSS2005 for the value of your own installation
-- in unix change ''\\'' to ''/''
-- For clients see SPOCS_Query2b.sql
SELECT s.Teff, t.LogL
INTO outfile ''c:\\NVOSS2005\\sql\\SPOCS\\data\\HR.csv''
FIELDS terminated by '',''
FROM SPOCS s
JOIN SPOCS_EXT t
ON s.ID = t.ID;
-- nr = 1040
Q2: "Create an HR Diagram"
HR Diagram is short for Hertzsprung-Russell Diagram, named after two famous and now deceased astronomers.
The x-axis is either the color or in this case temperature of the stars. The y-axis is the either absolute magnitude or in this case logarithm of the luminosity of the stars, relative to the Sun. By wacky astronomical convention, temperatures on the x-axis are reversed so that hotter stars are on the left and cooler stars are on the right.
The y-axis is log(L/Lsun). When a star has the same instrinsic brightness (luminosity) as the Sun (L=Lsun), then log(L/Lsun)=0. When a star has 1/10 the luminosity of the Sun, log(L/Lsun)=0.1. When a star has 10 times the luminosity of the Sun, log(L/Lsun)=10.
(The luminosity of a star is a measure of how much total energy the star is emitting every second. Two stars with the same luminosity can have very different apparent brightnesses, if the two stars are at very different distances (near -> bright, far -> faint).
The SPOCS catalog has calculated luminosity from the observed V band magnitude, the Hipparcos distance, and a "bolometric" correction that accounts for extra light from the star outside the measured V band.
SPOCS_Query3
mysql > \. SPOCS_Query3.sql
-- Q3: Calculates differences between entries in two catalogs
-- Looks for systematic trends vs temperatures
-- Demonstrantes a INNER JOIN between two tables using the HD identifier
-- and how to use some string functions
SELECT
s.ID, s.Name,
s.Teff, (s.Teff - e.Teff) as dTeff,
s.Logg, format(s.Logg - e.Logg,2) as dLogg,
s.Fe, format(s.Fe - e.Fe,2) as dFe
FROM SPOCS s
JOIN SANTOS e ON substr(s.Name, 3) = trim(e.HD)
ORDER BY s.Teff;
-- nr = 86
mysql > \. SPOCS_Query3.sql
-- Q3: Calculates differences between entries in two catalogs -- Looks for systematic trends vs temperatures -- Demonstrantes a INNER JOIN between two tables using the HD identifier -- and how to use some string functions SELECT s.ID, s.Name, s.Teff, (s.Teff - e.Teff) as dTeff, s.Logg, format(s.Logg - e.Logg,2) as dLogg, s.Fe, format(s.Fe - e.Fe,2) as dFe FROM SPOCS s JOIN SANTOS e ON substr(s.Name, 3) = trim(e.HD) ORDER BY s.Teff; -- nr = 86
Q3: "Look for systematic trends in gravity vs. temperature and also metallicity vs. temperature."
Making a star a little bit cooler, reducing the gravity a little bit, or increasing the metallicity a little bit all tend to make spectral lines a little bit deeper. This unfortunate reality makes it hard to get the temperature, gravity, and metallicity exactly correct.
When astronomers measure gravity, any error the temperature or metallicity will cause a compensating error in gravity. Similarly, when measuring metallicity, any error in temperature and gravity will cause a compensating error in metallicity. All three parameters are interlinked.
Detecting and characterizing systematic errors can be difficult. Here we compare stellar parameters from two different spectroscopists, using different analysis techniques. Ordering the differences in gravity or metallicity as a function of increasing temperature shows trends. The size of these trends are an indication of the minimum systematic error in one or both of the results from the two analyses.
The next step would be to plot the trends, using Excel or any other plotting package.
-----
But notice that some stars in the Santos catalog have measurements for multiple observations. Frequently, we want to combine multiple measurements of the same quantity into a single value. This is a very likely scenario in the world of VO, which draws upon many different catalogs. In this simple case, the multiple measurements are in the same catalog.
SPOCS_Query4
mysql > \. SPOCS_Query4.sql
-- Q4: The same star appears more than once in SANTOS catalog
-- Shows how to use count, group by, and descending order
SELECT s.ID, s.Name, count(e.HD) c
FROM SPOCS s
JOIN SANTOS e ON substr(s.Name, 3) = trim(e.HD)
GROUP BY s.ID, s.Name
ORDER BY c desc;
-- nr = 70
mysql > \. SPOCS_Query4.sql
Q4: Count how many times each star appears in the Santos catalog
The SPOCS catalog presents averages for stars with more than one observation. The Santos catalog present results for each individual observation.
This is an SQL exercise that simply counts how many times each star appears in the Santos catalog. The same exercise for the SPOCS catalog would yield "1" for every star, since each star appears only once in the SPOCS catalog.
SPOCS_Query5
mysql > \. SPOCS_Query5.sql
mysql > \. SPOCS_Query5.sql
-- Q5: Combines multiple entries of the same star from SANTOS
-- and prints the avg value of two measurements.
-- Demonstrates how to use group by to get averages per group of rows
-- Shows how to format a decimal point numbers into strings for nice looking.
-- Discards the rows with Inst = avg
SELECT
s.ID, s.Name, count(s.ID) n,
format(avg(e.Teff), 1) as mean_Teff, format(avg(s.Teff - e.Teff), 1) as mean_dTeff,
format(avg(s.Logg - e.Logg), 2) as mean_dLogg,
format(avg(s.Fe - e.Fe), 2) as mean_dFe
FROM SPOCS s
JOIN SANTOS e ON substr(s.Name, 3) = trim(e.HD)
WHERE e.Inst != ''avg''
GROUP BY s.ID
ORDER BY mean_Teff
-- nr = 70
-- Q5: Combines multiple entries of the same star from SANTOS -- and prints the avg value of two measurements. -- Demonstrates how to use group by to get averages per group of rows -- Shows how to format a decimal point numbers into strings for nice looking. -- Discards the rows with Inst = avg SELECT s.ID, s.Name, count(s.ID) n, format(avg(e.Teff), 1) as mean_Teff, format(avg(s.Teff - e.Teff), 1) as mean_dTeff, format(avg(s.Logg - e.Logg), 2) as mean_dLogg, format(avg(s.Fe - e.Fe), 2) as mean_dFe FROM SPOCS s JOIN SANTOS e ON substr(s.Name, 3) = trim(e.HD) WHERE e.Inst != ''avg'' GROUP BY s.ID ORDER BY mean_Teff -- nr = 70
Q5: Average multiple measurements for the same star in the Santos catalog.
Building on the previous example, here we average together multiple observations of the same star in the Santos catalog. For stars with multiple observations, averaging improves the precision of the result. Combining multiple values for each star also provides a single number to use when comparing results from the two catalogs. (The comparison in Q3 compared each measurement from the Santos catalog separately.)
-----
But notice that some of the rows in the Santos catalog are already averages for the star! These rows are labelled "avg".
SPOCS_Query6
mysql > \. SPOCS_Query6.sql
mysql > \. SPOCS_Query6.sql
-- Q6: To avoid problems with the repeated instances with the avg we decided
-- to get rid of rows with Inst = ''avg''
-- We could create a view or remove the rows.
-- WARNING: CREATE VIEW doesn''t work
-- for MySQL 4.1.12 but works for MySQL 5.0
-- Demonstrates how to use DELETE and CREATE a VIEW
-- CREATE VIEW SANTOSM AS SELECT * FROM SANTOS WHERE (Inst != ''avg'')
-- if you chose to create a view instead of deleting objects remember you need to
-- use the name of the VIEW (SANTOS_M) instead of the original table name (SANTOS)
DELETE FROM SANTOS
WHERE Inst = ''avg'';
-- nr = 7 rows deleted
-- Q6: To avoid problems with the repeated instances with the avg we decided -- to get rid of rows with Inst = ''avg'' -- We could create a view or remove the rows. -- WARNING: CREATE VIEW doesn''t work -- for MySQL 4.1.12 but works for MySQL 5.0 -- Demonstrates how to use DELETE and CREATE a VIEW -- CREATE VIEW SANTOSM AS SELECT * FROM SANTOS WHERE (Inst != ''avg'') -- if you chose to create a view instead of deleting objects remember you need to -- use the name of the VIEW (SANTOS_M) instead of the original table name (SANTOS) DELETE FROM SANTOS WHERE Inst = ''avg''; -- nr = 7 rows deleted
Q6: Ignore or delete rows in the Santos catalog that contain the average (avg) of multiple measurements for that star.
Building on the previous example, here we delete rows in the Santos catalog that are averages of all measurements for a star. This is an SQL example that demonstrates how to ignore or delete rows. Here we delete the "avg" rows to leave only actual observations, which we can then combine (as in Q5). This is the VO-like situation.
Alternatively, we could also delete the individual observations and keep the averages, but this is only useful if somebody has already done the work of calculating averages for us.
SPOCS_Query7
mysql > \. SPOCS_Query7.sql
mysql > \. SPOCS_Query7.sql
-- Q7: Adds the error in Teff
-- Demonstrates ALTER and UPDATE
ALTER TABLE SPOCS ADD e_Teff smallint after Teff;
UPDATE SPOCS s
SET e_Teff = round(44.0/sqrt(s.NKeck + s.NAAT + s.NLick));
-- nr = 1040
-- Q7: Adds the error in Teff -- Demonstrates ALTER and UPDATE ALTER TABLE SPOCS ADD e_Teff smallint after Teff; UPDATE SPOCS s SET e_Teff = round(44.0/sqrt(s.NKeck + s.NAAT + s.NLick)); -- nr = 1040
Q7: Add the error in Teff
The spectra used to create the SPOCS catalog were all obtained with very similar spectrographs. Exposure times were adjusted so that each spectrum had the same high signal to noise ratio. Thus, the authors claim that the error in temperature, gravity, metallicity, etc. are the same for any individual spectrum. These error values for individual spectra are given in a small table in the SPOCS paper. For example, the error in Teff is 44 K for a measurement based on one spectrum.
When the authors average results for N spectra of the same star, they claim the error in the result should get smaller, going from 44 K down to 44/sqrt(N) K.
The SPOCS spectra come from three observatories: Keck Observatory in Hawaii, the Anglo-Australian Telescope ("AAT") in Australia (duh!), and Lick Observatory in California. The total number of spectra is just the sum of the number of spectra from each of these three observatories.
This 1/sqrt(N) adjustment to the error is probably optimistic. Often real-world errors don't get smaller quite so fast. Sometimes they don't get smaller at all! Also, cooler stars probably have larger errors than warmer stars because their spectra are more complicated (lots more spectral lines). Thus, the assumption that the error in Teff is 44 K for all stars is an oversimplification. However, there really is no good way to assess errors for individual stars. Formal error analysis is way too optimistic and there aren't enough stars with multiple observations to get better empirical results. So this way of handling errors is about the best one can do for these data. (That's the way it is)
SPOCS_Query8
mysql > \. SPOCS_Query8.sql
mysql > \. SPOCS_Query8.sql
-- Q8: Calculates how significant the differences in temperature are
-- for individual stars
-- Demonstrates the use of a computed column as sorting criteria
SELECT
s.ID, s.Name,
s.Teff, s.e_Teff,
e.Teff, e.e_Teff,
(s.Teff - e.Teff) as dTeff,
round(sqrt((s.e_Teff * s.e_Teff) + (e.e_Teff* e.e_Teff))) as e_dTeff,
-- format((s.Teff - e.Teff) /sqrt((s.e_Teff * s.e_Teff) + (e.e_Teff* e.e_Teff)),1) as sigma
(s.Teff - e.Teff) /sqrt((s.e_Teff * s.e_Teff) + (e.e_Teff* e.e_Teff)) as sigma
FROM SPOCS s
JOIN SANTOS e ON substr(s.Name, 3) = trim(e.HD)
ORDER BY sigma; -- if you use format for sigma then the order is alphabetical and looks funny
Q8: Calculate how significant differences in temperature are between the SPOCS catalog and the Santos catalog.
OK, now we are going to do some error analysis. We'll compare the temperature from Santos for each star with the temperature from SPOCS. Of course the two values differ, but the question is whether the difference is reasonable, given the errors (actually given the "uncertainties") provided by the authors.
Here we calculate the difference between the two measures of temperature:
Diff = SPOCS - Santos
This difference should be zero if the two catalogs agree.
What is the expected uncertainty (e_Diff) in this difference? The typical error in "Diff" is given by the following formula, which combines the uncertainty in the SPOCS value (e_SPOCS) and the error in the Santos value (e_Santos):
e_Diff^2 = e_SPOCS^2 + e_Santos^2
Some jargon: the uncertainty squared is called the "variance". So in words the error propagation formula above says when you subtract two quantities, the variance of the difference is just the sum of the variances of the two things you subtracted. For fun, I note that you get the exact same error formula when you add two qunatities:
Sum = A + B
e_Sum^2 = e_A^2 + e_B^2
Fun, eh?
Anyway, back to our goal of deciding whether the differences between SPOCS and Santos are significant. The calculations above give us Diff and e_Diff. Because of the way errors statistics work, if all differences are just due to Gaussian noise (which people often assume), then about 2/3 of the Diff values should be between -e_Diff and +e_Diff. In words, one would say, "If difference are just due to Gaussian noise, then two-thirds of our calculated Diffs should be between minus one-sigma and plus one-sigma." In this sentence, "one-sigma" is just a more general name for e_Diff.
If we divide Diff by e_Diff, then 2/3 of our calculate Diff/e_Diff values should be between -1 and +1. This is easy to test in a table, especially if sorted by Diff.
SPOCS_Query9
mysql > \. SPOCS_Query9.sql
mysql > \. SPOCS_Query9.sql
-- Q9: Compares measurements from 2 catalogs using columns from
-- three tables
-- Demonstrates the use of a 3-way JOIN
SELECT
s.ID, s.Name, e.HD as santosHD, t.Mass, (t.Mass - e.Mass) as dMass,
s.Teff, (s.Teff - e.Teff) as dTeff,
s.Logg, format(s.Logg - e.Logg,2) as dLogg,
s.Fe, format(s.Fe - e.Fe,2) as dFe
FROM SPOCS s
JOIN SANTOS e ON substr(s.Name, 3) = trim(e.HD)
JOIN SPOCS_EXT t ON s.ID = t.ID
ORDER BY s.Teff;
-- Q9: Compares measurements from 2 catalogs using columns from -- three tables -- Demonstrates the use of a 3-way JOIN SELECT s.ID, s.Name, e.HD as santosHD, t.Mass, (t.Mass - e.Mass) as dMass, s.Teff, (s.Teff - e.Teff) as dTeff, s.Logg, format(s.Logg - e.Logg,2) as dLogg, s.Fe, format(s.Fe - e.Fe,2) as dFe FROM SPOCS s JOIN SANTOS e ON substr(s.Name, 3) = trim(e.HD) JOIN SPOCS_EXT t ON s.ID = t.ID ORDER BY s.Teff;
SPOCS_Query10
mysql > \. SPOCS_Query10.sql
-- Q10: Is there a systematic difference between the 2 catalogs?
SELECT
format(avg (s.Teff - e.Teff), 1) as mean_dTeff,
format(avg(s.Logg - e.Logg), 3) as mean_dLogg,
format(avg(s.Fe - e.Fe), 3) as mean_dFe
FROM SPOCS s
JOIN SANTOS e ON substr(s.Name, 3) = trim(e.HD);
-- nr = 1
Q10: Is there a systematic difference between SPOCS and Santos.
OK, in Q8 we saw that individual measurements from SPOCS and Santos seemed to be consistent to within the predicted errors. But, there could still be small systematic offsets between the two catalogs that are too small to see by looking at measurments for only one stars. So, here we look at the average offset for all stars in common.
6. Links of interest
Astronomical databases on-line using SQL interfaces
- The Sloan Digital Sky Survey - SDSS
- Galaxy Evolution Explorer - Galex
- SuperCOSMOS Science Archive - SSA
General SQL Tutorials
- SQL Course.com
- SQL Tutorial
- FirstSQL
SDSS SQL Tutorial
- SQL Introduction
- Sample Queries
Others
- MySQL
- SQL Server site
