Learning SQL the Badass Way (3)

Do you really want to “learn SQL the badass way”? I outlined the scope of the blog in the first blog entry where I explained why I set the scope to R users and people with some programming languages. You may not find what you are looking for if you are not familiar with R or if you do not have any other experience to work with data, because I focus on the data management part in this blog. Thus, I hope you find some helpful ressources to learn SQL, but I focus on data wrangling aspects, without explaining basic concepts to handle data.

In this session we learn how to use string patterns and ranges to search data. We will learn how to sort and group data to display result. Moreover, we practice composing nested queries and execute select statements to access data from multiple tables. For this reason I created a simple table that contains some attributes (y, z, id) about countries:

SELECT * FROM df ;
Table 1: 4 records
country y z id
Germany 2 3.6 1
Austria 1 8.5 2
Brazil 4 2.5 3
Brazil 3 3.5 3

String values, ranges and set of values

We can use strings and the WHERE clause to search for specific observations. For instance, WHERE countryname LIKE 'A%' means that we search for country name column that start with the corresponding string. And we can use % as a wildcard character:

SELECT * FROM df WHERE country LIKE 'A%';
Table 2: 1 records
country y z id
Austria 1 8.5 2

Use a range to select entries that depending on some criteria (> and <). In SQL, we specify WHERE values are between 200 and 300. Keep in mind that values are inclusive within the range. For instance, we can use the mtcars dataset and restrict the table with cars that have a horsepower (hp) between 100 and 200, we can even use an AND to restrict to cars with a manual transmission (AM = 1)

select * from mtcars
where (hp BETWEEN 100 and 200) and AM = 1 ;
Table 3: 5 records
mpg cyl disp hp drat wt qsec vs am gear carb
21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

Another option gives us the IN operator. We can select columns by providing a list and the IN operator. As the next example shows, we select only those observations that match the provided list of the the IN operator:

SELECT * FROM df WHERE country IN ('Brazil');
Table 4: 2 records
country y z id
Brazil 4 2.5 3
Brazil 3 3.5 3

Sorting Result Sets

Sometimes we need to sort the entries alphabetically and we can do that with the ORDER BY clause. For instance, ORDER BY country:

SELECT * FROM df ORDER BY country;
Table 5: 4 records
country y z id
Austria 1 8.5 2
Brazil 4 2.5 3
Brazil 3 3.5 3
Germany 2 3.6 1
  • By default, the entries are ordered in an ascending order, but we sort in a descending with DESC option as well:
SELECT * FROM df ORDER BY country DESC;
Table 6: 4 records
country y z id
Germany 2 3.6 1
Brazil 4 2.5 3
Brazil 3 3.5 3
Austria 1 8.5 2

Sometimes we have several observations per unit or any kind of structural order, which is why we may want to order a specific variable. We can sort the data by providing the number of the column we want to use a sort. As the next example shows, we can use y (or 2) to sort the data:

SELECT * FROM df ORDER BY y;
Table 7: 4 records
country y z id
Austria 1 8.5 2
Germany 2 3.6 1
Brazil 3 3.5 3
Brazil 4 2.5 3

Grouping result sets

To work with data, we have to get rid of duplicates and often it is much more easier if we restrict result set (data frame). To exclude duplicates we can use the : distinct() comman, which returns only distinct countries in our example:

SELECT distinct(country) FROM df ;
Table 8: 3 records
country
Germany
Austria
Brazil

In a similar fashion, maybe we have to clarify, how many observations do we have per group? Or in our case, how many entries come from the same country and how often appears each level? In such a case we can count the county column and use the group by clause:

SELECT country, count (country) from df group by country;
Table 9: 3 records
country count (country)
Austria 1
Brazil 2
Germany 1

As the last output showed, the count functions literally counts, but SQL does not give it a name, it simply displays what it does. We can change this ugly behaviour by providing a variable name and tell SQL how the column should be listed.

SELECT country, count (country) AS Count_Variable from df group by country;
Table 10: 3 records
country Count_Variable
Austria 1
Brazil 2
Germany 1

Certainly, counting is not the only function. We can estimate the mean average with AVG() displays the average. And now the average, little SQL monkey!

SELECT country, AVG(z) as Mean from df group by country;
Table 11: 3 records
country Mean
Austria 8.5
Brazil 3.0
Germany 3.6

We can set a further condition with a grouped by clause and add the HAVING option. As the next output shows, the group by country HAVING count (country) > 1 returns only countries with more than one observation counted:

SELECT country, count (country) AS Count from df group by country having count (country) > 1;
Table 12: 1 records
country Count
Brazil 2

Let us try to remember that the WHEREclause is for entire result set; while HAVING works only for the GROUPED BY clause.

Congratulations! At this point you are able to:

  • Use the WHERE clause to refine your query results
  • Use the wildcard character (%) as a substitute for unknown characters in a pattern
  • Use BETWEENAND to specify a range of numbers
  • We can sort query results into ascending or descending order, by using the ORDER BY clause
  • And we can group query results by using the GROUP BY clause.

Built-in Database Functions

We saw in the last section that we can aggregate (count, avg) data and use column functions. Most of the basic statistics functions (sum, avg, min, max) are available and we can specify further conditions, for instance, if we want to summarize groups:

select sum(mpg) as sum_mpg from mtcars where hp > 100
Table 13: 1 records
sum_mpg
401.4

Or we may use the scalar function and round to the nearest integer:

select round(drat, 1) as round_drat from mtcars
Table 14: Displaying records 1 - 10
round_drat
3.9
3.9
3.9
3.1
3.2
2.8
3.2
3.7
3.9
3.9

In SQL, there is a class of scalar function which are called string functions. For instance, we can calculate the length of a string:

select length(country) from df
Table 15: 4 records
length(country)
7
7
6
6

Depending the SQL database you use, in db2 you can use the upper (UCASE) and lower case (LCASE) function for strings.

select upper(country) from df
Table 16: 4 records
upper(country)
GERMANY
AUSTRIA
BRAZIL
BRAZIL

In case of Oracle the functions are called lower and upper.

Date and Time Built-in Functions

Talking about SQL databases, there are three different possibilities to work with date and time DB2.

  • Date: YYYYMMDD (Year/Month/Day)

  • Time: HHMMSS (Hours/Min/Sec)

  • Timestamp: YYYYMMDDHHMMSSZZZZZZ (Date/Time/Microseconds)

Depending on what you are up to do, there are functions to extract the day, month, day of month, day of week, day of year, week, hour, minute, and second. You can also extract the current_date and the current_time. Unfortunately, this does not work in Oracle the same way as in DB2, but to give you an example how to extract the day:

select day(date) from df where country = 'Germany'

Sub-Queries and Nested Selects

Sub-queries or sub selects are like regular queries but placed within parentheses and nested inside another query. This allows you to form more powerful queries than would have been otherwise possible. An example:

select avg(mpg) from mtcars
Table 17: 1 records
avg(mpg)
20.09062

Let’s say we want to select only the observations with higher values than the average of mpg:

select * from mtcars where mpg > avg(mpg)

This would produce the following error: misuse of aggregate function avg() Failed to execute SQL chunk. One of the limitations of built in aggregate functions, like avg(), is that they cannot be evaluated in the WHERE clause always. Thus, we have to use sub-queries.

select * from mtcars where mpg >
  (select avg(mpg) from mtcars);
Table 18: Displaying records 1 - 10
mpg cyl disp hp drat wt qsec vs am gear carb
21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1

Colum expressions help to set sub queries as a list of columns. Say we select variable Z:

select country, z from df
Table 19: 4 records
country z
Germany 3.6
Austria 8.5
Brazil 2.5
Brazil 3.5

And in the next step we add the average of all countries:

select country, z, avg(z) as avg_Z from df
Table 20: 1 records
country z avg_Z
Germany 3.6 4.525

This is obviously wrong. We cannot calculate on micro and macro level the same time, but we could use a sub-query (also called table expressions) to achieve it:

select country, z, (select avg(z) from df group by country) as avg_Z from df
Table 21: 4 records
country z avg_Z
Germany 3.6 8.5
Austria 8.5 8.5
Brazil 2.5 8.5
Brazil 3.5 8.5

Sub-queries can also be applied in the from clause. They are called derived tables or table expressions, because the outer query uses the results of the sub-query as a data source

select * from (select hp, vs from mtcars);
Table 22: Displaying records 1 - 10
hp vs
110 0
110 0
93 1
110 1
175 0
105 1
245 0
62 1
95 1
123 1

Working with Multiple Tables

There are several ways to access multiple tables in the same query. Namely, using sub-queries, implicit join, and join operators, such as INNER JOIN and OUTER JOIN. For instance:

select * from df2;
Table 23: 2 records
country valid id
Germany 1 1
Austria 0 2

Let’s say we want only observations from df that are listed in df2. In such a situation we can use a sub-queries:

select * from df
  where country in
  (select country from df2)
Table 24: 2 records
country y z id
Germany 2 3.6 1
Austria 1 8.5 2

Of course, you could add also information of the second table and include only countries with a certain value:

select * from df
  where country in
  (select country from df2 where valid = 1)
Table 25: 1 records
country y z id
Germany 2 3.6 1

Implicit joins implies that we can access multiple tables by specifying them in the FROM clause of the query. This leads to a CROSS JOIN (also known as Cartesian Join).

select * from df, df2
Table 26: 8 records
country y z id country valid id
Germany 2 3.6 1 Germany 1 1
Germany 2 3.6 1 Austria 0 2
Austria 1 8.5 2 Germany 1 1
Austria 1 8.5 2 Austria 0 2
Brazil 4 2.5 3 Germany 1 1
Brazil 4 2.5 3 Austria 0 2
Brazil 3 3.5 3 Germany 1 1
Brazil 3 3.5 3 Austria 0 2

In DBL2 we can use the where clause to match data (see code); in Oracle there are other matching operaters

select * from df, df2 where df.id = df.id;

In case of long names, we can use shorter aliases for table names (or use column names with aliases in the SELECT clause):

select * from df A, df2 B where A.id = B.id;

Summary

  • Most databases come with built-in functions that you can use in SQL statements to perform operations on data within the database itself.

  • When you work with large datasets, you may save time by using built-in functions rather than first retrieving the data into your application and then executing functions on the retrieved data.

  • Use sub-queries to form more powerful queries.

  • A sub-select expression helps to evaluate some built-in aggregate functions like the average function.

  • Derived tables or table expressions are sub-queries where the outer query uses the results of the sub-query as a data source.

Edgar Treischl
Edgar Treischl
Senior Research Fellow

My research interests include quantitative methods, evaluation, causality and so much more. matter.

Related