As a data scientist or a data analyst, we usually need to work with SQL when we need to pull data from SQL servers or to combine and manipulate data from several tables, therefore, it is good if we can spot and avoid common mistakes when working with it. Or even a beginner or someone who does not get used to SQL much, you might got some error running queries. Here, I summarize some common mistakes with examples, so we could easily aviod it and we are not a noob SQL anymore! Fingercross if you do not need to work with SQL, you have Pandas!!!
Import libraries and connect to databases
Through out this work, we will run SQL commands in Jupyter Notebook (with ipython-sql) and work on IMDB and WORLD data badabases, therefore, we need to install and config libraries as below,
Import libraries :
import numpy as np
import pandas as pd
%matplotlib inline
%load_ext sql
%config SqlMagic.displaylimit = 20
%config SqlMagic.autolimit = 30
Read credential information from a credential.json file under /data
import json
import urllib.parse
with open('data/credentials.json') as f:
login = json.load(f)
username = login['user']
password = urllib.parse.quote(login['password'])
host = login['host']
port = login['port']
Connect to the database (One at a time)
%sql postgresql://{username}:{password}@{host}:{port}/imdb
%sql postgresql://{username}:{password}@{host}:{port}/world
#1 : Miss up clause orders and execution orders
If you try to run a query below to see the result of movies that have a number of votes more than 1000, you will find an error indicating that it could not find the column "NumberOfVotes" that we aliased in SELECT statement.
%%sql
SELECT
title AS Title,
start_year AS Year,
runtime AS Duration,
rating AS Rating
nvotes AS NumberOfVotes
FROM
movies
WHERE
NumberOfVotes > 1000
;
Why is that?
This is because the execution orders and the clause orders are different!
In SQL, there are two orders that we need to be careful. First, the clause order, it is the order of clause statement that we need to write in this particular orders, it is the same as the syntax of SQL. But when we actually execute the query, SQL will execute the clauses with different orders since it wants to optimize the query to run faster and more efficient. Two orders are listed as below.
Order of clauses in a statement:
SELECT
|
FROM
|
JOIN
|
WHERE
|
GROUP BY
|
HAVING
|
ORDER BY
|
LIMIT
Order of execution/processing:
FROM and JOIN
|
WHERE
|
GROUP BY
|
HAVING
|
SELECT
|
DISTINCT
|
ORDER BY
|
LIMIT
Therefore, in the query example above, the SQL will execute FROM first, then WHERE, and it will throw an error since it does not know "NumberOfVotes" column we derived it in SELECT clause. The SELECT clause is executed later than WHERE.
Therefore, do keep in mind with the difference of these two orders because you might get struck a long time with this tiny mistake, or even you got the wrong result and you don't know it!
#2 : Not beware of the logical statement
Many times, we need to filter some conditions to get the result we want, but we sometimes have a weird or wrong query result even though the statement looks good. Take a look at the query below that we want to have a result of movies that was published during 2010 and 2020 with a rating > 8 and number of votes > 1000,
%%sql
SELECT
*
FROM
movies
WHERE
start_year = 2010
OR
start_year = 2020
AND
rating > 8
AND
nvotes > 1000
LIMIT 15
;
Again, why there are some movies with rating < 8 and nvotes < 1000?
yeah, it is because we do not beware while putting several condition in WHERE clause.
Same as other programming languages, there is an order of logical operators as well, in this case OR got executed later than AND. Therefore, it is better if we put some parenthesis for each condition, so it runs in the correct orders as we want. Check the code below!
%%sql
SELECT
*
FROM
movies
WHERE
(start_year = 2010
OR
start_year = 2020)
AND
rating > 8
AND
nvotes > 1000
LIMIT 15
;
BOOM.. we got what we want!
#3 : Combine aggregration to non-aggregration data
Again, sometimes we might want to run a simple average value of a certain column, so we use aggregation functions, and yeah it look great as below,
%%sql
SELECT
AVG(population)
FROM
country
;
And yeah, why don't we just find the average population for a certain continent and show the name of that continent together, here we go ..
%%sql
SELECT
AVG(population), name
FROM
country
WHERE
continent = 'North America'
;
WHATTTTT? Why it got an error?
Ohh yeah, in SQL, we can not have aggregation columns together with normal columns because SQL does not know what aspects it should group and apply aggregations, but we can solve this by using GROUP BY clause.
%%sql
SELECT
AVG(population), name
FROM
country
WHERE
continent = 'North America'
GROUP BY
name
LIMIT 10
;
By this way, SQL knows that it should group countries in North America and apply AVG for each country and show the result, so we have the correct result as we want.
One tiny thing to remember about aggregations is that we can not put it in WHERE clause because SQL processes rows by WHERE before aggregations.
%%sql
SELECT
name
FROM
country
WHERE
population > AVG(population)
;
Some more examples using aggregations with GROUP BY and HAVING BY.
%%sql
SELECT
countrycode, AVG(population), MAX(population)
FROM
city
WHERE
countrycode IN ('CAN', 'USA', 'THA')
GROUP BY
countrycode
ORDER BY
countrycode
;
%%sql
SELECT
countrycode,
AVG(population)::int,
MAX(population)::int,
COUNT(population) AS city_count
FROM
city
GROUP BY
countrycode
HAVING
COUNT(*) > 10
ORDER BY
city_count DESC
;
%%sql
SELECT
continent, region, AVG(population)::INT
FROM
country
GROUP BY
continent, region
ORDER BY
continent, region
;
These three common mistakes seem to be easy and naive, but believe me, it will come up a lot if you are not familiar or do not use SQL much. Try to keep this in mind so you know and aviod these common mistakes, save your time, and have the query result you want!
Cheers.
!!
Comments