SQL

Course Description

In the era of Tera and Peta byte astronomical datasets, Database Management Systems (DBMS) are becoming essential tools to access and mine the data. This class will cover basic concepts related to Relational Database Management Systems (RDBMS) and the Structured Query Language (SQL). The student will learn how to formulate a diversity of queries varying in complexity and will experiment with current on-line astronomical databases. Performance and query optimization will be discussed as well.

1. Introduction

SQL (Structured Query Language) is the language used for manipulating and querying relational databases.

A relational database is a group of related (sets)  tables containing data. Tables are composed of one or more columns and rows. Columns have a name and a data type (float, double, integer, varchar, etc). Columns define the content of the database. Rows are the content of the database. Rows are usually also referred as records. Tables in the database are related to each other using keys. A primary key is a column (or group of columns, composed key) with a unique value for each row. A gn key is a column (or group of columns) which is a primary key in another table. Keys are used to bind objects between tables. In order to enable fast access to the data, tables are indexed. Indexes can be built on one or more columns (covering index). Tables are usually indexed on their primary key by default and often times on foreign keys as well to speed up joins. Some times, different type of users need to see subsets of the table content. To avoid to repeat the content we define views. Views, from the user's point of view, are equivalent to tables.

The description of  tables (columns, keys, and indexes) and their relationships conform the database schema. In order to work with a database it is important to know its schema.

 

2. Getting Started

First, we will need a database. The examples below will be using:  

The Sloan Digital Sky Survey catalog on-line databases from any of the SkyServer Web sites (DR2, DR3, or DR4):

You can get familiar with the database schema by giving a look at its picture  and by using the Schema Browser located on the top menu. Open the browser in a new window. It will help you to formulate your queries.

 

3. SQL Statements

There are three basic types of SQL statements.

We will be covering mostly Data access statements and some for the Schema definition.

 

4. Accessing Data

The most basic statement in SQL is SELECT which is composed of

Basic syntax:

SELECT [ALL | DISTINCT] column1[,column2]
FROM table1[,table2]
[WHERE conditions]
[GROUP BY column-list]
[HAVING conditions]
[ORDER BY column-list [ASC | DESC] ]

SELECT s.name, s.age 
FROM Students s
WHERE s.age > 16 and s.age < 20

For example, using SkyServer

Q1: What are the u and r magnitudes for object X?   In what field is the object located?

SELECT p.objid, p.modelMag_u, p.modelMag_r, p.fieldID 
FROM PhotoTag p
WHERE p.objid = 587726014001185324

objid modelMag_u modelMag_r fieldID
587726014001185324 24.49 21.261 587726014001184768

If you are interested in all columns for a given record you can use the *

Q2: What do you know about object X from the PhotoTag table?

SELECT p.* 
FROM PhotoTag p
WHERE p.objid = 587726014001185324

Q2b: Tell me all you know about X objects in the PhotoTag table.

SELECT TOP 10 p.*
FROM PhotoTag p

TOP is not supported by all systems. It is important to note that when using TOP, there is no guarantee that this will bring the same results always unless is used with an ORDER BY (This is true for any SELECT statemet in general). Often times, the TOP keyword is a "quick and dirty" way to know what columns are available in a table and get a subsample of its content. (In MySQL, instead of TOP is called LIMIT and goes at the end of the SELECT stament after FROM, WHERE, ...)

DISTINCT is useful when you want to know how many unique values you have in a given column or combination of columns.

Q3: Tell me the identifier of all observation runs without repetitions.

SELECT DISTINCT f.run 
FROM Field f

run
745
752
...

In the WHERE clause we specify filters or constraints to our search. The following operators can be used combined with AND and OR:

Operator Description
= Equal
<> Not equal
> Greater than
< Less than
>= Greater than or equal
<= Less than or equal
BETWEEN Between an inclusive range
LIKE Search for a pattern

Note: In some versions of SQL the <> operator may be written as ! =  

Q4: Find objects contained in fields X and Y with u magnitude between M and N. I interested only on those with u magnitude fainter than r by F units.

SELECT p.objid, p.modelMag_u, p.modelMag_r,
       p.modelMag_u - p.modelMag_r as modelMag_ur, p.fieldID
FROM   PhotoTag p
WHERE  p.modelMag_u between 27.5 AND 28
  AND  p.modelMag_u - p.modelMag_r > 0.5
  AND (p.fieldID = 587726014001184768 OR p.fieldID = 587726014001250304)		


objid modelMag_u modelMag_r modelMag_ur fieldID
587726014001250343 27.967 19.904 8.064 587726014001250304

The ORDER BY keyword sorts the results in ascending order by default or in descending order if DESC is specified. More than one column may be included. 

Q5: Find objects contained in fields X and Y with u magnitude between M and N. I interested only on those with u magnitude fainter than r by F units. Show them in descending order of u magnitude.

SELECT p.objid, p.modelMag_u, p.modelMag_r, 
       p.modelMag_u - modelMag_r as modelMag_ur, p.fieldID
FROM   PhotoTag p
WHERE  p.modelMag_u between 27 AND 28
  AND  p.modelMag_u - p.modelMag_r > 0.5
  AND (p.fieldID = 587726014001184768 OR p.fieldID = 587726014001250304) 
ORDER BY p.modelMag_u DESC


objid modelMag_u modelMag_r modelMag_ur fieldID
587726014001250343 27.967 19.904 8.064 587726014001250304
587726014001250816 27.146 19.374 7.771 587726014001250304
587726014001184814 27 20.788 6.213 587726014001184768

The SELECT clause can use agregate functions like COUNT, SUM, AVG, MIN, and MAX that summarize column values 

>Q6: What are the average u and r magnitudes for galaxies in field X?

SELECT avg(g.modelMag_u) as avg_modelMag_u, avg(g.modelMag_r) avg_modelMag_r 
FROM Galaxy g
WHERE g.fieldID = 587726014001184768


avg_modelMag_u avg_modelMag_r
23.3848 21.732277

Aggregate functions are usually used along with the GROUP BY keyword

Q7: What are the average u and r magnitudes for galaxies observed during run X, rerun Y, with camcol Z grouped and ordered by field?

SELECT g.fieldID, avg(g.modelMag_u) as avg_modelMag_u, avg(g.modelMag_r) avg_modelMag_r 
FROM  Galaxy g
WHERE g.run = 1458 AND g.rerun = 40 AND g.camcol = 1
GROUP BY g.fieldID
ORDER BY g.fieldID

fieldID avg_modelMag_u avg_modelMag_r
587726013994565632 23.396385 21.529083
587726013994631168 23.358104 21.549631
... ... ...

Q8: Tell me the average u and r magnitudes for galaxies observed during run X, rerun Y, with camcol Z grouped and ordered by field. I'm only interested in average r magnitudes higher than T.

SELECT g.fieldID, avg(g.modelMag_u) as avg_modelMag_u, avg(g.modelMag_r) avg_modelMag_r 
FROM  Galaxy g
WHERE g.run = 1458 AND g.rerun = 40 AND g.camcol = 1 AND avg(g.modelMag_r) > 21.53
GROUP BY g.fieldID
ORDER BY g.fieldID

This query fails. Aggregate functions are not allowed in the WHERE clause.

We need the HAVING keyword which specifies a search condition for a group or an aggregate. HAVING is usually used with the GROUP BY clause. When GROUP BY is not used, HAVING behaves like a WHERE clause.

Q8b: Tell me the average u and r magnitudes for galaxies observed during run X, rerun Y, with camcol Z grouped and ordered by field. I'm only interested in average r magnitudes higher than T.

SELECT g.fieldID, avg(g.modelMag_u) as avg_modelMag_u, avg(g.modelMag_r) avg_modelMag_r 
FROM  Galaxy g
WHERE g.run = 1458 AND g.rerun = 40 AND g.camcol = 1
GROUP BY g.fieldID
HAVING avg(g.modelMag_r) > 21.53
ORDER BY g.fieldID

fieldID avg_modelMag_u avg_modelMag_r
587726013994631168 23.358104 21.549631
587726013995155456 23.348868 21.740538
... ... ...

Q8c: Tell me how many galaxies have been observed during run X, rerun Y, with camcol Z grouped and ordered by field. I'm only interested in fields with average r magnitudes between S and T.

SELECT g.fieldID, count(g.objid) cnt_obj
FROM Galaxy g
WHERE g.run = 1458 AND g.rerun = 40 AND g.camcol = 1
GROUP BY g.fieldID
HAVING avg(g.modelMag_r) BETWEEN 21.5 AND 21.75
ORDER BY g.fieldID

fieldID cnt_obj
587726013994565632 428
587726013994631168 425
... ...

 

JOINS: Using more than one table

Often times we want to retrieve data from more than one table. To do so we use JOINS. There are three types of JOINS: INNER, OUTER, CROSS

MyPhotoObj:    

SELECT p.objid, p.ra, p.dec, p.i 
FROM PhotoObj p 
WHERE p.objid IN (587726014001185022, 587726014001315956, 587726014001315970,587726031180792552)
			

objid ra dec i
587726014001185022 193.751615 1.548215 16.976
587726014001315956 193.97825 1.467137 17.599
587726014001315970 194.022063 1.535068 16.159
587726031180792552 193.750587 1.257379 22.06

MySpecObj: 

SELECT s.specobjid, s.ra, s.dec, s.z, s.bestobjid 
FROM SpecObj s
WHERE s.specobjid IN (147434863106981888, 147434863341862912, 147434863358640128, 147434863694184448)

specobjid ra dec z bestobjid
147434863106981888 194.992 2.049279 2.9E-3 0
147434863341862912 193.97826 1.467157 0.267 587726014001315956
147434863358640128 194.02205 1.535082 0.071 587726014001315970
147434863694184448 193.75164 1.548211 0.124 587726014001185022

Q9: Given the tables above, what are the redshift and i magnitude of objects contained in MySpecObj? In the following example, we only return those objects which are matched between the multiple tables. We need to perform an INNER JOIN. The old style but still very common way of writting this query would be:

SELECT p.objid, p.i, s.specobjid, s.z 
FROM MyPhotoObj p, MySpecObj s
WHERE p.objid = s.bestobjid

The ANSI, SQL-92 syntax: 

SELECT p.objid, p.i, s.specobjid, s.z
FROM MyPhotoObj p INNER JOIN MySpecObj s 
     ON p.objid = s.bestobjid		

objid i specobjid z
587726014001185022 16.976 147434863694184448 0.124
587726014001315956 17.599 147434863341862912 0.267
587726014001315970 16.159 147434863358640128 0.071

Q9b: To run in SkyServer 

SELECT p.objid, p.i, s.specobjid, s.z
FROM PhotoObj p INNER JOIN SpecObj s 
     ON p.objid = s.bestobjid
WHERE p.objid IN (587726014001185022, 587726014001315956, 587726014001315970, 587726031180792552)		

<** the keyword INNER is optional. By default a JOIN is a INNER JOIN

If we want to include all rows of one of the tables, regardless of whether or not they are matched to another table, we must perform an OUTER JOIN. There three types of OUTER JOIN: LEFT, RIGHT, and FULL

In the example below, we perform a LEFT OUTER JOIN, which means that we will get all entries (regardless of matching) from the table on the left side of the join.   

Q10: What are the redshift and i magnitude of objects contained in MyPhotoObj?  I want to see all photometry objects whether they have spectra or not.

SELECT p.objid, p.i, s.specobjid, s.z
FROM MyPhotoObj p LEFT OUTER JOIN MySpecObj s 
     ON p.objid = s.bestobjid

objid i specobjid z
587726014001185022 16.976 147434863694184448 0.124
587726014001315956 17.599 147434863341862912 0.267
587726014001315970 16.159 147434863358640128 0.071
587726031180792552 22.06 null null

Q10b: To run in SkyServer   

SELECT p.objid, p.i, s.specobjid, s.z 
FROM PhotoObj p LEFT OUTER JOIN SpecObj s 
     ON p.objid = s.bestobjid WHERE p.objid IN (587726014001185022, 587726014001315956, 587726014001315970, 587726031180792552)

In the example below, we perform a RIGHT OUTER JOIN, which means that we will get all entries (regardless of matching) from the table on the RIGHT side of the join.   

Q11: What are the redshift and i magnitude of objects contained in MySpecObj?  I want to see all spec objects whether they have photometry or not.

SELECT p.objid, p.i, s.specobjid, s.z
FROM MyPhotoObj p RIGHT OUTER JOIN MySpecObj s 
     ON p.objid = s.bestobjid

objid i specobjid z
null null 147434863106981888 2.9E-3
587726014001315956 17.599 147434863341862912 0.267
587726014001315970 16.159 147434863358640128 0.071
587726014001185022 16.976 147434863694184448 0.124

Q11b: To run in SkyServer 

SELECT p.objid, p.i, s.specobjid, s.z
FROM PhotoObj p RIGHT OUTER JOIN SpecObj s 
     ON p.objid = s.bestobjid
WHERE s.specobjid IN (147434863106981888, 147434863341862912, 147434863358640128, 147434863694184448)	

Finally, we can also perform a FULL OUTER JOIN, which will show all entries regardeless of matching

Q12: What are the redshift and i magnitude of objects contained in MySpecObj and myPhotoObj?  I want to see all spec and phot objects.

SELECT p.objid, p.i, s.specobjid, s.z
FROM MyPhotoObj p FULL OUTER JOIN MySpecObj s 
     ON p.objid = s.bestobjid

objid i specobjid z
null null 147434863106981888 2.9E-3
587726014001185022 16.976 147434863694184448 0.124
587726014001315956 17.599 147434863341862912 0.267
587726014001315970 16.159 147434863358640128 0.071
587726031180792552 22.06 null null

Q12b: To emulate in SkyServer   

SELECT p.objid, p.i, s.specobjid, s.z
FROM PhotoObj p LEFT OUTER JOIN SpecObj s 
     ON p.objid = s.bestobjid
WHERE p.objid IN (587726014001185022, 587726014001315956, 587726014001315970, 587726031180792552)
UNION 
SELECT p.objid, p.i, s.specobjid, s.z
FROM PhotoObj p RIGHT OUTER JOIN SpecObj s 
     ON p.objid = s.bestobjid
WHERE s.specobjid IN (147434863106981888, 147434863341862912, 147434863358640128, 147434863694184448)

 

5. Creating, modifying, and querying a database

In the following examples we will see how we can create, modify, and query a database using MySQL. All examples have also been implemented using SQL Server 2000.

We will be using SPOCS, which you will find in the SummerSchool package under the SQL directory. SPOCS has a very basic schema. It is composed of three tables SPOCS, SPOCS_EXT and SANTOS. SPOCS and SPOCS_EXT are related through the ID column. SPOCS* and SANTOS are related through the name and HD columns respectively.  The data and content descriptions are included in the SPOCS\data directory. 

In the SPOCS_Schema.sql file you will find first how to CREATE the database

DROP DATABASE IF EXISTS SPOCS; 
CREATE DATABASE SPOCS;

and three CREATE TABLE statements like the following 

CREATE TABLE SPOCS ( 
  ID 		smallint not null,
  Name		nvarchar(20) not null, 
  Teff		smallint	DEFAULT null,
  logg		double		DEFAULT null,
  Metal		double		DEFAULT null,
  Na		double		DEFAULT null,
  Si		double		DEFAULT null,
  Ti		double		DEFAULT null,
  Fe		double		DEFAULT null,
  Ni		double		DEFAULT null,
  Vsini		double		DEFAULT null,
  Vrad		double		DEFAULT null,
  CRMS		double		DEFAULT null,
  LRMS		double		DEFAULT null,
  NKeck		tinyint		DEFAULT null,
  NAAT  	tinyint		DEFAULT null,
  NLick 	tinyint		DEFAULT null,
  CONSTRAINT PRIMARY KEY (id)
);

The nullability of a column determines if the rows in the table can contain a null value for that column. A null value, or NULL, is not the same as zero (0), blank, or a zero-length character string such as ""; NULL means that no entry has been made. The presence NULL usually implies that the value is either unknown or undefined.

After you have created your database you will need to load the data. SPOCS_Load.sql includes statements like 

LOAD DATA INFILE ''c:\\NVOSS2005\\sql\\SPOCS\\data\\SPOCSWH.csv'' 
INTO TABLE SPOCS
	 FIELDS TERMINATED BY '',''
IGNORE 1 LINES; --  Ignores the header

To create and load the SPOCS database go to the SQL\SPOCS\mySQL: and start up mysql

C:\NVOSS2005\sql\SPOCS\mySQL> mysql
	Welcome to the MySQL monitor.  Commands end with ; or \g.
	Your MySQL connection id is 6 to server version: 4.1.12a-nt

	Type ''help;'' or ''\h'' for help. Type ''\c'' to clear the buffer.

mysql > \. SPOCS_Schema.sql

mysql > \. SPOCS_Load.sql  -- if you are working with Windows

mysql > \. SPOCS_Load4LM.sql -- if you are working with Linux or Mac

There is additional information about how to run SQL commands with MySQL in SQL\SPOCS\mySQL\HowTo.txt

From now on, we will be running one by one each query SPOCS_Query*.sql from inside mySQL. 

SPOCS_Query1 

mysql > \. SPOCS_Query1.sql
-- Q1: Find all solar analogs
-- Demonstrates selection of objects based on filtering criteria
-- WARNING: MySQL doesn''t work properly when comparing floats and constants.
--			This query wouldn''t work properly if Metal is declared as a float.
--			If you are going to do this type of comparisons use doubles.		
SELECT s.ID, s.Name, s.Teff, s.Logg, s.Metal
FROM SPOCS s
WHERE	s.Teff  BETWEEN 5720 AND 5820
	AND	s.Logg  BETWEEN 4.34 AND 4.54
	AND	s.Metal BETWEEN -0.1 AND +0.1
ORDER BY s.Teff;

Q1: "Find solar analogs"

The Sun has a temperature of Teff=5770 K and a gravity of logg=4.4 and a metallicity of [M/H]=0.0. Stars like the Sun (analogs) will have values of these three parameters similar to the solar values. In this example, we look for stars with temperatures within 50 K of the Sun and gravities within 0.1 dex of the Sun and metallicities within 0.1 dex. (Here "0.1 dex" means a change of 0.1 in the logarithm of some quantity. This term is commonly used in astronomy.)

 

SPOCS_Query2 series show various ways of putting your results into a file

mysql > \. SPOCS_Query2.sql
-- Q2: Create an HR diagram with data from SPOCS and SPOCS_EXT
-- Demonstrates a INNER JOIN between two tables where the 
-- same ID corresponds to the same object.

-- Shows how to copy the query results into a file. 
-- This syntax only works if you are on the server. 

-- if your %NVOSS_HOME% is other than c:\\NVOSS2005 you will have to change
-- c:\\NVOSS2005 for the value of your own installation
-- in unix change ''\\'' to ''/'' 

-- For clients see SPOCS_Query2b.sql
SELECT s.Teff, t.LogL
  INTO outfile ''c:\\NVOSS2005\\sql\\SPOCS\\data\\HR.csv'' 
  FIELDS terminated by '',''
FROM SPOCS s 
  JOIN SPOCS_EXT t
  ON s.ID = t.ID;
--  nr = 1040

Q2: "Create an HR Diagram"

HR Diagram is short for Hertzsprung-Russell Diagram, named after two famous and now deceased astronomers.

The x-axis is either the color or in this case temperature of the stars. The y-axis is the either absolute magnitude or in this case logarithm of the luminosity of the stars, relative to the Sun. By wacky astronomical convention, temperatures on the x-axis are reversed so that hotter stars are on the left and cooler stars are on the right.

The y-axis is log(L/Lsun). When a star has the same instrinsic brightness (luminosity) as the Sun (L=Lsun), then log(L/Lsun)=0. When a star has 1/10 the luminosity of the Sun, log(L/Lsun)=0.1. When a star has 10 times the luminosity of the Sun, log(L/Lsun)=10.

(The luminosity of a star is a measure of how much total energy the star is emitting every second. Two stars with the same luminosity can have very different apparent brightnesses, if the two stars are at very different distances (near -> bright, far -> faint).

The SPOCS catalog has calculated luminosity from the observed V band magnitude, the Hipparcos distance, and a "bolometric" correction that accounts for extra light from the star outside the measured V band.

 

SPOCS_Query3 

mysql > \. SPOCS_Query3.sql
-- Q3: Calculates differences between entries in two catalogs
-- Looks for systematic trends vs temperatures
-- Demonstrantes a INNER JOIN  between two tables using the HD identifier
-- and how to use some string functions
SELECT 
	s.ID, s.Name, 
	s.Teff, (s.Teff - e.Teff) as dTeff,  
	s.Logg, format(s.Logg - e.Logg,2) as dLogg,
	s.Fe, format(s.Fe - e.Fe,2) as dFe
FROM SPOCS s
  JOIN SANTOS e ON substr(s.Name, 3) = trim(e.HD)
ORDER BY s.Teff;
-- nr = 86

Q3: "Look for systematic trends in gravity vs. temperature and also metallicity vs. temperature."

Making a star a little bit cooler, reducing the gravity a little bit, or increasing the metallicity a little bit all tend to make spectral lines a little bit deeper. This unfortunate reality makes it hard to get the temperature, gravity, and metallicity exactly correct.

When astronomers measure gravity, any error the temperature or metallicity will cause a compensating error in gravity. Similarly, when measuring metallicity, any error in temperature and gravity will cause a compensating error in metallicity. All three parameters are interlinked.

Detecting and characterizing systematic errors can be difficult. Here we compare stellar parameters from two different spectroscopists, using different analysis techniques. Ordering the differences in gravity or metallicity as a function of increasing temperature shows trends. The size of these trends are an indication of the minimum systematic error in one or both of the results from the two analyses.

The next step would be to plot the trends, using Excel or any other plotting package.

-----

But notice that some stars in the Santos catalog have measurements for multiple observations. Frequently, we want to combine multiple measurements of the same quantity into a single value. This is a very likely scenario in the world of VO, which draws upon many different catalogs. In this simple case, the multiple measurements are in the same catalog.

 

SPOCS_Query4 

mysql > \. SPOCS_Query4.sql
-- Q4: The same star appears more than once in SANTOS catalog
-- Shows how to use count, group by, and  descending order
SELECT s.ID, s.Name, count(e.HD) c
FROM SPOCS s
  JOIN SANTOS e ON substr(s.Name, 3) = trim(e.HD)
GROUP BY s.ID, s.Name
ORDER BY c desc;
-- nr = 70

Q4: Count how many times each star appears in the Santos catalog

The SPOCS catalog presents averages for stars with more than one observation. The Santos catalog present results for each individual observation.

This is an SQL exercise that simply counts how many times each star appears in the Santos catalog. The same exercise for the SPOCS catalog would yield "1" for every star, since each star appears only once in the SPOCS catalog.

 

SPOCS_Query5 

mysql > \. SPOCS_Query5.sql  

-- Q5: Combines multiple entries of the same star from SANTOS
-- and prints the avg value of two measurements. 
-- Demonstrates how to use group by to get averages per group of rows
-- Shows how to format a decimal point numbers into strings for nice looking.
-- Discards the rows with Inst = avg
SELECT 
	s.ID, s.Name, count(s.ID) n,
	format(avg(e.Teff), 1) as mean_Teff, format(avg(s.Teff - e.Teff), 1) as mean_dTeff, 
	format(avg(s.Logg - e.Logg), 2) as mean_dLogg,
	format(avg(s.Fe - e.Fe), 2) as mean_dFe
FROM SPOCS s
  JOIN SANTOS e ON substr(s.Name, 3) = trim(e.HD)
WHERE e.Inst != ''avg'' 
GROUP BY s.ID
ORDER BY mean_Teff
-- nr = 70

Q5: Average multiple measurements for the same star in the Santos catalog.

Building on the previous example, here we average together multiple observations of the same star in the Santos catalog. For stars with multiple observations, averaging improves the precision of the result. Combining multiple values for each star also provides a single number to use when comparing results from the two catalogs. (The comparison in Q3 compared each measurement from the Santos catalog separately.)

-----

But notice that some of the rows in the Santos catalog are already averages for the star! These rows are labelled "avg".

 

SPOCS_Query6

mysql > \. SPOCS_Query6.sql   

-- Q6: To avoid problems with the repeated instances with the avg we decided
-- to get rid of rows with Inst = ''avg''
-- We could create a view or remove the rows. 
-- WARNING:  CREATE VIEW doesn''t work 
-- for MySQL 4.1.12 but works for MySQL 5.0 

-- Demonstrates how to use DELETE and CREATE a VIEW 

-- CREATE VIEW SANTOSM AS SELECT * FROM SANTOS WHERE (Inst != ''avg'') 
-- if you chose to create a view instead of deleting objects remember you need to
-- use the name of the VIEW (SANTOS_M) instead of the original table name (SANTOS)

DELETE FROM SANTOS
  WHERE Inst = ''avg'';
-- nr = 7 rows deleted		

Q6: Ignore or delete rows in the Santos catalog that contain the average (avg) of multiple measurements for that star.

Building on the previous example, here we delete rows in the Santos catalog that are averages of all measurements for a star. This is an SQL example that demonstrates how to ignore or delete rows. Here we delete the "avg" rows to leave only actual observations, which we can then combine (as in Q5). This is the VO-like situation.

Alternatively, we could also delete the individual observations and keep the averages, but this is only useful if somebody has already done the work of calculating averages for us.

 

SPOCS_Query7

mysql > \. SPOCS_Query7.sql     

-- Q7:  Adds the error in Teff
-- Demonstrates ALTER and UPDATE

ALTER TABLE SPOCS ADD e_Teff smallint after Teff;
UPDATE SPOCS s
 SET e_Teff = round(44.0/sqrt(s.NKeck + s.NAAT + s.NLick));
-- nr = 1040

Q7: Add the error in Teff

The spectra used to create the SPOCS catalog were all obtained with very similar spectrographs. Exposure times were adjusted so that each spectrum had the same high signal to noise ratio. Thus, the authors claim that the error in temperature, gravity, metallicity, etc. are the same for any individual spectrum. These error values for individual spectra are given in a small table in the SPOCS paper. For example, the error in Teff is 44 K for a measurement based on one spectrum.

When the authors average results for N spectra of the same star, they claim the error in the result should get smaller, going from 44 K down to 44/sqrt(N) K.

The SPOCS spectra come from three observatories: Keck Observatory in Hawaii, the Anglo-Australian Telescope ("AAT") in Australia (duh!), and Lick Observatory in California. The total number of spectra is just the sum of the number of spectra from each of these three observatories.

This 1/sqrt(N) adjustment to the error is probably optimistic. Often real-world errors don't get smaller quite so fast. Sometimes they don't get smaller at all! Also, cooler stars probably have larger errors than warmer stars because their spectra are more complicated (lots more spectral lines). Thus, the assumption that the error in Teff is 44 K for all stars is an oversimplification. However, there really is no good way to assess errors for individual stars. Formal error analysis is way too optimistic and there aren't enough stars with multiple observations to get better empirical results. So this way of handling errors is about the best one can do for these data. (That's the way it is)

 

SPOCS_Query8

mysql > \. SPOCS_Query8.sql     

-- Q8:  Calculates how significant the differences in temperature are
-- for individual stars
-- Demonstrates the use of a computed column as sorting criteria
SELECT 
	s.ID, s.Name,  
	s.Teff, s.e_Teff,
	e.Teff, e.e_Teff,
	(s.Teff - e.Teff) as dTeff,	
	round(sqrt((s.e_Teff * s.e_Teff) + (e.e_Teff* e.e_Teff))) as e_dTeff,
	-- format((s.Teff - e.Teff) /sqrt((s.e_Teff * s.e_Teff) + (e.e_Teff* e.e_Teff)),1) as sigma 
    (s.Teff - e.Teff) /sqrt((s.e_Teff * s.e_Teff) + (e.e_Teff* e.e_Teff)) as sigma
FROM SPOCS s
  JOIN SANTOS e ON substr(s.Name, 3) = trim(e.HD)
ORDER BY sigma; -- if you use format for sigma then the order is alphabetical and looks funny

Q8: Calculate how significant differences in temperature are between the SPOCS catalog and the Santos catalog.

OK, now we are going to do some error analysis. We'll compare the temperature from Santos for each star with the temperature from SPOCS. Of course the two values differ, but the question is whether the difference is reasonable, given the errors (actually given the "uncertainties") provided by the authors.

Here we calculate the difference between the two measures of temperature:

Diff = SPOCS - Santos

This difference should be zero if the two catalogs agree.

What is the expected uncertainty (e_Diff) in this difference? The typical error in "Diff" is given by the following formula, which combines the uncertainty in the SPOCS value (e_SPOCS) and the error in the Santos value (e_Santos):

e_Diff^2 = e_SPOCS^2 + e_Santos^2

Some jargon: the uncertainty squared is called the "variance". So in words the error propagation formula above says when you subtract two quantities, the variance of the difference is just the sum of the variances of the two things you subtracted. For fun, I note that you get the exact same error formula when you add two qunatities:

Sum = A + B

e_Sum^2 = e_A^2 + e_B^2

Fun, eh?

Anyway, back to our goal of deciding whether the differences between SPOCS and Santos are significant. The calculations above give us Diff and e_Diff. Because of the way errors statistics work, if all differences are just due to Gaussian noise (which people often assume), then about 2/3 of the Diff values should be between -e_Diff and +e_Diff. In words, one would say, "If difference are just due to Gaussian noise, then two-thirds of our calculated Diffs should be between minus one-sigma and plus one-sigma." In this sentence, "one-sigma" is just a more general name for e_Diff.

If we divide Diff by e_Diff, then 2/3 of our calculate Diff/e_Diff values should be between -1 and +1. This is easy to test in a table, especially if sorted by Diff.

 

SPOCS_Query9

mysql > \. SPOCS_Query9.sql       

-- Q9: Compares measurements from 2 catalogs using columns from 
-- three tables 
-- Demonstrates the use of a 3-way JOIN
SELECT 
	s.ID, s.Name, e.HD as santosHD, t.Mass, (t.Mass - e.Mass) as dMass,
	s.Teff, (s.Teff - e.Teff) as dTeff, 
	s.Logg, format(s.Logg - e.Logg,2) as dLogg,
	s.Fe, format(s.Fe - e.Fe,2) as dFe
FROM SPOCS s
  JOIN SANTOS e ON substr(s.Name, 3) = trim(e.HD)
  JOIN SPOCS_EXT t ON s.ID = t.ID
ORDER BY s.Teff;

SPOCS_Query10

mysql > \. SPOCS_Query10.sql       

-- Q10: Is there a systematic difference between the 2 catalogs?
SELECT 
	format(avg (s.Teff - e.Teff), 1) as mean_dTeff, 
	format(avg(s.Logg - e.Logg), 3) as mean_dLogg,
	format(avg(s.Fe - e.Fe), 3) as mean_dFe
FROM SPOCS s
  JOIN SANTOS e ON substr(s.Name, 3) = trim(e.HD);
-- nr = 1

Q10: Is there a systematic difference between SPOCS and Santos.

OK, in Q8 we saw that individual measurements from SPOCS and Santos seemed to be consistent to within the predicted errors. But, there could still be small systematic offsets between the two catalogs that are too small to see by looking at measurments for only one stars. So, here we look at the average offset for all stars in common.

 

6. Links of interest

Astronomical databases on-line using SQL interfaces

General SQL Tutorials

SDSS SQL Tutorial

Others