Learning SQL the Badass Way (3)
Do you really want to “learn SQL the badass way”? I outlined the scope of the blog in the first blog entry where I explained why I set the scope to R users and people with some programming languages. You may not find what you are looking for if you are not familiar with R or if you do not have any other experience to work with data, because I focus on the data management part in this blog. Thus, I hope you find some helpful ressources to learn SQL, but I focus on data wrangling aspects, without explaining basic concepts to handle data.
In this session we learn how to use string patterns and ranges to search data. We will learn how to sort and group data to display result. Moreover, we practice composing nested queries and execute select statements to access data from multiple tables. For this reason I created a simple table that contains some attributes (y, z, id) about countries:
SELECT * FROM df ;
String values, ranges and set of values
We can use strings and the WHERE clause to search for specific observations. For instance,
WHERE countryname LIKE 'A%' means that we search for country name column that start with the corresponding string. And we can use
% as a wildcard character:
SELECT * FROM df WHERE country LIKE 'A%';
Use a range to select entries that depending on some criteria (> and <). In SQL, we specify
WHERE values are between 200 and 300. Keep in mind that values are inclusive within the range. For instance, we can use the mtcars dataset and restrict the table with cars that have a horsepower (hp) between 100 and 200, we can even use an
AND to restrict to cars with a manual transmission (AM = 1)
select * from mtcars where (hp BETWEEN 100 and 200) and AM = 1 ;
Another option gives us the
IN operator. We can select columns by providing a list and the
IN operator. As the next example shows, we select only those observations that match the provided list of the the
SELECT * FROM df WHERE country IN ('Brazil');
Sorting Result Sets
Sometimes we need to sort the entries alphabetically and we can do that with the
ORDER BY clause. For instance,
ORDER BY country:
SELECT * FROM df ORDER BY country;
- By default, the entries are ordered in an ascending order, but we sort in a descending with
DESCoption as well:
SELECT * FROM df ORDER BY country DESC;
Sometimes we have several observations per unit or any kind of structural order, which is why we may want to order a specific variable. We can sort the data by providing the number of the column we want to use a sort. As the next example shows, we can use y (or 2) to sort the data:
SELECT * FROM df ORDER BY y;
Grouping result sets
To work with data, we have to get rid of duplicates and often it is much more easier if we restrict result set (data frame). To exclude duplicates we can use the :
distinct() comman, which returns only distinct countries in our example:
SELECT distinct(country) FROM df ;
In a similar fashion, maybe we have to clarify, how many observations do we have per group? Or in our case, how many entries come from the same country and how often appears each level?
In such a case we can count the county column and use the
group by clause:
SELECT country, count (country) from df group by country;
As the last output showed, the count functions literally counts, but SQL does not give it a name, it simply displays what it does. We can change this ugly behaviour by providing a variable name and tell SQL how the column should be listed.
SELECT country, count (country) AS Count_Variable from df group by country;
Certainly, counting is not the only function. We can estimate the mean average with
AVG() displays the average. And now the average, little SQL monkey!
SELECT country, AVG(z) as Mean from df group by country;
We can set a further condition with a grouped by clause and add the
HAVING option. As the next output shows, the
group by country HAVING count (country) > 1 returns only countries with more than one observation counted:
SELECT country, count (country) AS Count from df group by country having count (country) > 1;
Let us try to remember that the
WHEREclause is for entire result set; while
HAVING works only for the
GROUPED BY clause.
Congratulations! At this point you are able to:
- Use the
WHEREclause to refine your query results
- Use the wildcard character (
%) as a substitute for unknown characters in a pattern
ANDto specify a range of numbers
- We can sort query results into ascending or descending order, by using the
- And we can group query results by using the
Built-in Database Functions
We saw in the last section that we can aggregate (count, avg) data and use column functions. Most of the basic statistics functions (sum, avg, min, max) are available and we can specify further conditions, for instance, if we want to summarize groups:
select sum(mpg) as sum_mpg from mtcars where hp > 100
Or we may use the scalar function and round to the nearest integer:
select round(drat, 1) as round_drat from mtcars
In SQL, there is a class of scalar function which are called string functions. For instance, we can calculate the length of a string:
select length(country) from df
Depending the SQL database you use, in db2 you can use the upper (UCASE) and lower case (LCASE) function for strings.
select upper(country) from df
In case of Oracle the functions are called lower and upper.
Date and Time Built-in Functions
Talking about SQL databases, there are three different possibilities to work with date and time DB2.
Date: YYYYMMDD (Year/Month/Day)
Time: HHMMSS (Hours/Min/Sec)
Timestamp: YYYYMMDDHHMMSSZZZZZZ (Date/Time/Microseconds)
Depending on what you are up to do, there are functions to extract the day, month, day of month, day of week, day of year, week, hour, minute, and second. You can also extract the
current_date and the
current_time. Unfortunately, this does not work in Oracle the same way as in DB2, but to give you an example how to extract the day:
select day(date) from df where country = 'Germany'
Sub-Queries and Nested Selects
Sub-queries or sub selects are like regular queries but placed within parentheses and nested inside another query. This allows you to form more powerful queries than would have been otherwise possible. An example:
select avg(mpg) from mtcars
Let’s say we want to select only the observations with higher values than the average of mpg:
select * from mtcars where mpg > avg(mpg)
This would produce the following error:
misuse of aggregate function avg() Failed to execute SQL chunk. One of the limitations of built in aggregate functions, like avg(), is that they cannot be evaluated in the WHERE clause always. Thus, we have to use sub-queries.
select * from mtcars where mpg > (select avg(mpg) from mtcars);
Colum expressions help to set sub queries as a list of columns. Say we select variable Z:
select country, z from df
And in the next step we add the average of all countries:
select country, z, avg(z) as avg_Z from df
This is obviously wrong. We cannot calculate on micro and macro level the same time, but we could use a sub-query (also called table expressions) to achieve it:
select country, z, (select avg(z) from df group by country) as avg_Z from df
Sub-queries can also be applied in the from clause. They are called derived tables or table expressions, because the outer query uses the results of the sub-query as a data source
select * from (select hp, vs from mtcars);
Working with Multiple Tables
There are several ways to access multiple tables in the same query. Namely, using sub-queries, implicit join, and join operators, such as
INNER JOIN and
OUTER JOIN. For instance:
select * from df2;
Let’s say we want only observations from df that are listed in df2. In such a situation we can use a sub-queries:
select * from df where country in (select country from df2)
Of course, you could add also information of the second table and include only countries with a certain value:
select * from df where country in (select country from df2 where valid = 1)
Implicit joins implies that we can access multiple tables by specifying them in the
FROM clause of the query. This leads to a CROSS JOIN (also known as Cartesian Join).
select * from df, df2
In DBL2 we can use the where clause to match data (see code); in Oracle there are other matching operaters
select * from df, df2 where df.id = df.id;
In case of long names, we can use shorter aliases for table names (or use column names with aliases in the SELECT clause):
select * from df A, df2 B where A.id = B.id;
Most databases come with built-in functions that you can use in SQL statements to perform operations on data within the database itself.
When you work with large datasets, you may save time by using built-in functions rather than first retrieving the data into your application and then executing functions on the retrieved data.
Use sub-queries to form more powerful queries.
A sub-select expression helps to evaluate some built-in aggregate functions like the average function.
Derived tables or table expressions are sub-queries where the outer query uses the results of the sub-query as a data source.