Mastering Common SQL Patterns in Data Analysis

by | SQL

Table of Contents

Setting Up Your Environment

System Requirements

  1. Operating System: Ensure you are using a compatible OS (Windows, macOS, or Linux).
  2. Hardware: At least 4GB of RAM, 10GB of free disk space.

Software Installation

1. Install Database Server (MySQL)

  1. Download:
  2. Install: Follow the installer instructions.
  3. Setup:
    • Set root password.
    • Enable MySQL to start on boot.

2. Install SQL Client (MySQL Workbench)

  1. Download:
  2. Install: Follow the installer instructions.

3. Configure Environment

  1. Create User:
    CREATE USER 'data_analyst'@'localhost' IDENTIFIED BY 'password123';

  2. Grant Permissions:
    GRANT ALL PRIVILEGES ON *.* TO 'data_analyst'@'localhost' WITH GRANT OPTION;
    FLUSH PRIVILEGES;

  3. Create Database:
    CREATE DATABASE data_analysis;

Testing the Setup

  1. Login to MySQL:
    mysql -u data_analyst -p

  2. Verify Database:
    SHOW DATABASES;
    USE data_analysis;

Directory Structure

  1. Create Main Project Directory:
    mkdir SQL_Data_Analysis_Project
    cd SQL_Data_Analysis_Project

  2. Subdirectories:
    mkdir -p scripts data outputs

Access and Connectivity

  1. Connection Script: Save as connect_db.sh
    #!/bin/bash
    mysql -u data_analyst -p'password123' -h localhost data_analysis

Make connect_db.sh executable:

chmod +x connect_db.sh
  1. Run Script:
    ./connect_db.sh

Verification

Run a sample query in the SQL client or via terminal:

SELECT DATABASE();  -- Should return 'data_analysis'

Conclusion

Your environment is now set up and ready for applying SQL techniques in data analysis.

Basic SQL Queries Recap

Selecting Data

Select all columns

SELECT * FROM table_name;

Select specific columns

SELECT column1, column2 FROM table_name;

Filtering Data

Simple condition

SELECT * FROM table_name
WHERE condition;

Multiple conditions

SELECT * FROM table_name
WHERE condition1 AND condition2;

Using LIKE for pattern matching

SELECT * FROM table_name
WHERE column LIKE 'pattern%';

Sorting Data

SELECT * FROM table_name
ORDER BY column ASC;  -- or DESC

Aggregating Data

COUNT

SELECT COUNT(*) FROM table_name;

SUM

SELECT SUM(column) FROM table_name;

AVG

SELECT AVG(column) FROM table_name;

MIN and MAX

SELECT MIN(column), MAX(column) FROM table_name;

Grouping Data

SELECT column, COUNT(*)
FROM table_name
GROUP BY column;

Joining Tables

Inner Join

SELECT a.column1, b.column2
FROM table1 a
INNER JOIN table2 b ON a.common_column = b.common_column;

Left Join

SELECT a.column1, b.column2
FROM table1 a
LEFT JOIN table2 b ON a.common_column = b.common_column;

Right Join

SELECT a.column1, b.column2
FROM table1 a
RIGHT JOIN table2 b ON a.common_column = b.common_column;

Subqueries

In SELECT clause

SELECT column1,
       (SELECT column2 FROM table2 WHERE table2.reference_column = table1.reference_column) as alias_column
FROM table1;

In WHERE clause

SELECT column1 FROM table1
WHERE column2 IN (SELECT column2 FROM table2 WHERE condition);

Updating Data

UPDATE table_name
SET column1 = value1, column2 = value2
WHERE condition;

Deleting Data

DELETE FROM table_name
WHERE condition;

Inserting Data

Single row

INSERT INTO table_name (column1, column2)
VALUES (value1, value2);

Multiple rows

INSERT INTO table_name (column1, column2)
VALUES
(value1, value2),
(value3, value4);

Summary

This list of basic SQL queries should cover most common use cases in daily data analysis work. Copy and modify these snippets as required for your specific tables and use cases.

3. Aggregation and Grouping Patterns

3.1 Aggregation Functions

-- Calculate the total sales
SELECT SUM(sales_amount) AS total_sales
FROM sales_data;

-- Average sales amount
SELECT AVG(sales_amount) AS average_sales
FROM sales_data;

-- Maximum and minimum sales amount
SELECT MAX(sales_amount) AS max_sales, MIN(sales_amount) AS min_sales
FROM sales_data;

-- Count the number of transactions
SELECT COUNT(*) AS number_of_transactions
FROM sales_data;

3.2 Grouping Data

-- Total sales by product
SELECT product_id, SUM(sales_amount) AS total_sales
FROM sales_data
GROUP BY product_id;

-- Average sales by region
SELECT region, AVG(sales_amount) AS average_sales
FROM sales_data
GROUP BY region;

-- Maximum sales by salesperson
SELECT salesperson_id, MAX(sales_amount) AS max_sales
FROM sales_data
GROUP BY salesperson_id;

3.3 Using HAVING for Filtering Groups

-- Total sales by product where total sales exceed 1000
SELECT product_id, SUM(sales_amount) AS total_sales
FROM sales_data
GROUP BY product_id
HAVING SUM(sales_amount) > 1000;

-- Count of transactions by region where the count is more than 50
SELECT region, COUNT(*) AS number_of_transactions
FROM sales_data
GROUP BY region
HAVING COUNT(*) > 50;

3.4 Combining Aggregation with JOIN

-- Total sales by product category
SELECT c.category_name, SUM(s.sales_amount) AS total_sales
FROM sales_data s
JOIN products p ON s.product_id = p.product_id
JOIN categories c ON p.category_id = c.category_id
GROUP BY c.category_name;

-- Average sales by region and product category
SELECT r.region_name, c.category_name, AVG(s.sales_amount) AS average_sales
FROM sales_data s
JOIN regions r ON s.region_id = r.region_id
JOIN products p ON s.product_id = p.product_id
JOIN categories c ON p.category_id = c.category_id
GROUP BY r.region_name, c.category_name;

3.5 Window Functions for Advanced Aggregation

-- Running total of sales over time
SELECT sales_date, sales_amount,
       SUM(sales_amount) OVER (ORDER BY sales_date) AS running_total
FROM sales_data;

-- Average sales per day, partitioned by product
SELECT sales_date, product_id, sales_amount,
       AVG(sales_amount) OVER (PARTITION BY product_id ORDER BY sales_date) AS avg_daily_sales
FROM sales_data;

-- Sales rank by product
SELECT product_id, sales_amount,
       RANK() OVER (PARTITION BY product_id ORDER BY sales_amount DESC) AS sales_rank
FROM sales_data;

Focus on applying these patterns directly to your data as needed, adjusting column and table names to fit your schema.

Subqueries and Derived Tables in SQL

1. Subqueries

Example: Find students who scored above the average score in a test.

SELECT student_id, student_name
FROM students
WHERE score > (SELECT AVG(score) FROM students);

Example: Retrieve employees who work in the same department as ‘John Doe’.

SELECT employee_id, employee_name
FROM employees
WHERE department_id = (SELECT department_id FROM employees WHERE employee_name = 'John Doe');

2. Derived Tables

Example: Calculate the total sales by department and list departments with sales greater than $10,000.

SELECT department_name, total_sales
FROM 
  (SELECT department_id, SUM(sales_amount) AS total_sales
   FROM sales
   GROUP BY department_id) AS dept_sales
JOIN departments ON dept_sales.department_id = departments.department_id
WHERE total_sales > 10000;

Example: Find the name of the most expensive product in each category.

SELECT category_name, product_name, max_price
FROM 
  (SELECT category_id, MAX(price) AS max_price
   FROM products
   GROUP BY category_id) AS max_prices
JOIN products ON max_prices.category_id = products.category_id AND max_prices.max_price = products.price
JOIN categories ON products.category_id = categories.category_id;

3. Combining Subqueries and Derived Tables

Example: List students along with their respective class averages.

SELECT students.student_id, students.student_name, class_avgs.class_avg
FROM students
JOIN
  (SELECT class_id, AVG(score) AS class_avg
   FROM students
   GROUP BY class_id) AS class_avgs
ON students.class_id = class_avgs.class_id;

Example: Get the top-selling product in each department.

SELECT departments.department_name, products.product_name, max_sales.max_sale
FROM
  (SELECT department_id, MAX(sales_amount) AS max_sale
   FROM sales
   GROUP BY department_id) AS max_sales
JOIN products ON sales.product_id = products.product_id AND sales.sales_amount = max_sales.max_sale
JOIN departments ON sales.department_id = departments.department_id;

You can directly use these SQL practices for real-life data analysis tasks, incorporating subqueries and derived tables into your SQL queries.

Window Functions and Advanced Analytical Queries

Ranking Functions

-- Rank employees by salary within each department
SELECT 
    employee_id, 
    department_id, 
    salary, 
    RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
FROM 
    employees;

Window Aggregate Functions

-- Calculate the moving average of sales for the past 5 days
SELECT 
    sale_date, 
    sales_amount,
    AVG(sales_amount) OVER (ORDER BY sale_date ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS moving_avg
FROM 
    sales;

Cumulative Sums

-- Calculate the cumulative sum of sales for each employee
SELECT
    employee_id,
    sale_date,
    sales_amount,
    SUM(sales_amount) OVER (PARTITION BY employee_id ORDER BY sale_date) AS cumulative_sales
FROM
    sales;

Percentile Ranking

-- Calculate the percentile rank of each student's score
SELECT
    student_id,
    score,
    PERCENT_RANK() OVER (ORDER BY score DESC) AS percentile_rank
FROM
    student_scores;

Advanced Analytical Queries

Lead and Lag Functions

-- Compare this month's sales to the previous month's sales
SELECT
    sale_date,
    sales_amount,
    LAG(sales_amount, 1) OVER (ORDER BY sale_date) AS previous_month_sales
FROM
    monthly_sales;

First and Last Functions

-- Find the first and last sale amount for each product
SELECT
    product_id,
    FIRST_VALUE(sales_amount) OVER (PARTITION BY product_id ORDER BY sale_date) AS first_sale,
    LAST_VALUE(sales_amount) OVER (PARTITION BY product_id ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_sale
FROM
    sales;

NTILE Function

-- Divide employees into quartiles based on their salary
SELECT
    employee_id,
    salary,
    NTILE(4) OVER (ORDER BY salary DESC) AS salary_quartile
FROM
    employees;

Example of Combining Window Functions

-- Calculate rolling 7-day sales total and rank salespeople within each department
SELECT
    employee_id,
    department_id,
    sale_date,
    sales_amount,
    SUM(sales_amount) OVER (PARTITION BY department_id ORDER BY sale_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_week_sales,
    RANK() OVER (PARTITION BY department_id ORDER BY sales_amount DESC) AS sales_rank
FROM
    sales;

These queries can be executed in any SQL-compliant database to perform advanced data analysis tasks using window functions.

Part 6: Common Table Expressions (CTEs)

Example 1: Simple CTE

SQL Code

WITH SalesCTE AS (
    SELECT SalesID, CustomerID, Amount
    FROM Sales
    WHERE Amount > 500
)
SELECT * FROM SalesCTE;

Example 2: Recursive CTE

SQL Code

WITH RECURSIVE EmployeeHierarchy AS (
    SELECT EmployeeID, ManagerID, FirstName, LastName
    FROM Employees
    WHERE ManagerID IS NULL
    UNION ALL
    SELECT e.EmployeeID, e.ManagerID, e.FirstName, e.LastName
    FROM Employees e
    INNER JOIN EmployeeHierarchy eh ON e.ManagerID = eh.EmployeeID
)
SELECT * FROM EmployeeHierarchy;

Example 3: CTE with Aggregation

SQL Code

WITH DepartmentSales AS (
    SELECT 
        d.DepartmentID,
        d.DepartmentName,
        SUM(s.Amount) AS TotalSales
    FROM Sales s
    JOIN Departments d ON s.DepartmentID = d.DepartmentID
    GROUP BY d.DepartmentID, d.DepartmentName
)
SELECT * FROM DepartmentSales;

Example 4: CTE for Complex Joins

SQL Code

WITH CustomerOrders AS (
    SELECT 
        c.CustomerID, 
        c.CustomerName, 
        o.OrderID, 
        o.OrderDate
    FROM Customers c
    JOIN Orders o ON c.CustomerID = o.CustomerID
),
OrderDetails AS (
    SELECT 
        co.CustomerID, 
        co.CustomerName, 
        co.OrderID, 
        co.OrderDate, 
        od.ProductID, 
        od.Quantity, 
        od.Price
    FROM CustomerOrders co
    JOIN OrderDetails od ON co.OrderID = od.OrderID
)
SELECT * FROM OrderDetails;

Example 5: Nested CTEs

SQL Code

WITH HighValueOrders AS (
    SELECT 
        OrderID,
        CustomerID,
        TotalAmount
    FROM Orders
    WHERE TotalAmount > 1000
),
RecentHighValueOrders AS (
    SELECT * 
    FROM HighValueOrders
    WHERE OrderDate >= '2023-01-01'
)
SELECT * 
FROM RecentHighValueOrders;

Final Note

These implementations of CTEs can be applied to various practical scenarios in SQL for data analysis. Adjust the table and column names according to your specific database schema.

Practical Implementation of Performance Optimization and Indexing Tips

Indexing

Create Index

-- Create an index on a single column
CREATE INDEX idx_column_name ON table_name(column_name);

-- Create a composite index on multiple columns
CREATE INDEX idx_multiple_columns ON table_name(column1, column2);

Drop Index

-- Drop an index
DROP INDEX idx_column_name;

Query Optimization

Use Indexes

-- Make sure the query utilizes the index
EXPLAIN SELECT column_name FROM table_name WHERE column_name = 'value';

Avoid SELECT *

-- Instead of using SELECT *, specify only the required columns
SELECT column1, column2 FROM table_name WHERE column_name = 'value';

Limit Dataset

-- Fetch only the rows you need using LIMIT
SELECT column1, column2 
FROM table_name 
WHERE column_name = 'value' 
LIMIT 100;

Efficient Joins

Index Join Columns

-- Ensure indexes exist on columns used in JOINs
CREATE INDEX idx_join_column1 ON table_name1(column_name);
CREATE INDEX idx_join_column2 ON table_name2(column_name);

Use INNER JOIN over OUTER JOIN when possible

-- Inner join example
SELECT a.column1, b.column2
FROM table_name1 a
INNER JOIN table_name2 b ON a.join_column = b.join_column
WHERE a.column_name = 'value';

Avoid Unnecessary Sorting

Use ORDER BY only when necessary

-- Order by specific columns only if required
SELECT column1, column2 
FROM table_name 
WHERE column_name = 'value' 
ORDER BY column1;

Optimizing Subqueries and CTEs

Materialize Intermediate Results with CTEs

-- Use CTE to optimize complex subqueries
WITH cte_name AS (
    SELECT column1, column2 
    FROM table_name 
    WHERE column_name = 'value'
)
SELECT column1 
FROM cte_name;

Partitioning

Partition Large Tables

-- Create partition on large table to enhance performance
CREATE TABLE partitioned_table (
    id INT,
    name VARCHAR(50),
    created_date TIMESTAMP
) PARTITION BY RANGE (created_date);

Example of Range Partition

-- Partition by date ranges
CREATE TABLE orders (
    order_id INT,
    order_date DATE
) PARTITION BY RANGE (order_date) (
    PARTITION p0 VALUES LESS THAN ('2023-01-01'),
    PARTITION p1 VALUES LESS THAN ('2024-01-01')
);

Updating Statistics

Regularly Update Statistics

-- Update statistics to help the optimizer
ANALYZE TABLE table_name;

Automatic Statistics Update

-- Enable automatic statistics update
SET GLOBAL innodb_stats_auto_update = ON;

Apply these practical implementations to optimize performance and indexing in your SQL data analysis projects.

Related Posts