Setting Up Your Environment
System Requirements
- Operating System: Ensure you are using a compatible OS (Windows, macOS, or Linux).
- Hardware: At least 4GB of RAM, 10GB of free disk space.
Software Installation
1. Install Database Server (MySQL)
- Download:
- Install: Follow the installer instructions.
- Setup:
- Set
root
password. - Enable
MySQL
to start on boot.
- Set
2. Install SQL Client (MySQL Workbench)
- Download:
- Install: Follow the installer instructions.
3. Configure Environment
- Create User:
CREATE USER 'data_analyst'@'localhost' IDENTIFIED BY 'password123';
- Grant Permissions:
GRANT ALL PRIVILEGES ON *.* TO 'data_analyst'@'localhost' WITH GRANT OPTION;
FLUSH PRIVILEGES; - Create Database:
CREATE DATABASE data_analysis;
Testing the Setup
- Login to MySQL:
mysql -u data_analyst -p
- Verify Database:
SHOW DATABASES;
USE data_analysis;
Directory Structure
- Create Main Project Directory:
mkdir SQL_Data_Analysis_Project
cd SQL_Data_Analysis_Project - Subdirectories:
mkdir -p scripts data outputs
Access and Connectivity
- Connection Script: Save as
connect_db.sh
#!/bin/bash
mysql -u data_analyst -p'password123' -h localhost data_analysis
Make connect_db.sh
executable:
chmod +x connect_db.sh
- Run Script:
./connect_db.sh
Verification
Run a sample query in the SQL client or via terminal:
SELECT DATABASE(); -- Should return 'data_analysis'
Conclusion
Your environment is now set up and ready for applying SQL techniques in data analysis.
Basic SQL Queries Recap
Selecting Data
Select all columns
SELECT * FROM table_name;
Select specific columns
SELECT column1, column2 FROM table_name;
Filtering Data
Simple condition
SELECT * FROM table_name
WHERE condition;
Multiple conditions
SELECT * FROM table_name
WHERE condition1 AND condition2;
Using LIKE for pattern matching
SELECT * FROM table_name
WHERE column LIKE 'pattern%';
Sorting Data
SELECT * FROM table_name
ORDER BY column ASC; -- or DESC
Aggregating Data
COUNT
SELECT COUNT(*) FROM table_name;
SUM
SELECT SUM(column) FROM table_name;
AVG
SELECT AVG(column) FROM table_name;
MIN and MAX
SELECT MIN(column), MAX(column) FROM table_name;
Grouping Data
SELECT column, COUNT(*)
FROM table_name
GROUP BY column;
Joining Tables
Inner Join
SELECT a.column1, b.column2
FROM table1 a
INNER JOIN table2 b ON a.common_column = b.common_column;
Left Join
SELECT a.column1, b.column2
FROM table1 a
LEFT JOIN table2 b ON a.common_column = b.common_column;
Right Join
SELECT a.column1, b.column2
FROM table1 a
RIGHT JOIN table2 b ON a.common_column = b.common_column;
Subqueries
In SELECT clause
SELECT column1,
(SELECT column2 FROM table2 WHERE table2.reference_column = table1.reference_column) as alias_column
FROM table1;
In WHERE clause
SELECT column1 FROM table1
WHERE column2 IN (SELECT column2 FROM table2 WHERE condition);
Updating Data
UPDATE table_name
SET column1 = value1, column2 = value2
WHERE condition;
Deleting Data
DELETE FROM table_name
WHERE condition;
Inserting Data
Single row
INSERT INTO table_name (column1, column2)
VALUES (value1, value2);
Multiple rows
INSERT INTO table_name (column1, column2)
VALUES
(value1, value2),
(value3, value4);
Summary
This list of basic SQL queries should cover most common use cases in daily data analysis work. Copy and modify these snippets as required for your specific tables and use cases.
3. Aggregation and Grouping Patterns
3.1 Aggregation Functions
-- Calculate the total sales
SELECT SUM(sales_amount) AS total_sales
FROM sales_data;
-- Average sales amount
SELECT AVG(sales_amount) AS average_sales
FROM sales_data;
-- Maximum and minimum sales amount
SELECT MAX(sales_amount) AS max_sales, MIN(sales_amount) AS min_sales
FROM sales_data;
-- Count the number of transactions
SELECT COUNT(*) AS number_of_transactions
FROM sales_data;
3.2 Grouping Data
-- Total sales by product
SELECT product_id, SUM(sales_amount) AS total_sales
FROM sales_data
GROUP BY product_id;
-- Average sales by region
SELECT region, AVG(sales_amount) AS average_sales
FROM sales_data
GROUP BY region;
-- Maximum sales by salesperson
SELECT salesperson_id, MAX(sales_amount) AS max_sales
FROM sales_data
GROUP BY salesperson_id;
3.3 Using HAVING
for Filtering Groups
-- Total sales by product where total sales exceed 1000
SELECT product_id, SUM(sales_amount) AS total_sales
FROM sales_data
GROUP BY product_id
HAVING SUM(sales_amount) > 1000;
-- Count of transactions by region where the count is more than 50
SELECT region, COUNT(*) AS number_of_transactions
FROM sales_data
GROUP BY region
HAVING COUNT(*) > 50;
3.4 Combining Aggregation with JOIN
-- Total sales by product category
SELECT c.category_name, SUM(s.sales_amount) AS total_sales
FROM sales_data s
JOIN products p ON s.product_id = p.product_id
JOIN categories c ON p.category_id = c.category_id
GROUP BY c.category_name;
-- Average sales by region and product category
SELECT r.region_name, c.category_name, AVG(s.sales_amount) AS average_sales
FROM sales_data s
JOIN regions r ON s.region_id = r.region_id
JOIN products p ON s.product_id = p.product_id
JOIN categories c ON p.category_id = c.category_id
GROUP BY r.region_name, c.category_name;
3.5 Window Functions for Advanced Aggregation
-- Running total of sales over time
SELECT sales_date, sales_amount,
SUM(sales_amount) OVER (ORDER BY sales_date) AS running_total
FROM sales_data;
-- Average sales per day, partitioned by product
SELECT sales_date, product_id, sales_amount,
AVG(sales_amount) OVER (PARTITION BY product_id ORDER BY sales_date) AS avg_daily_sales
FROM sales_data;
-- Sales rank by product
SELECT product_id, sales_amount,
RANK() OVER (PARTITION BY product_id ORDER BY sales_amount DESC) AS sales_rank
FROM sales_data;
Focus on applying these patterns directly to your data as needed, adjusting column and table names to fit your schema.
Subqueries and Derived Tables in SQL
1. Subqueries
Example: Find students who scored above the average score in a test.
SELECT student_id, student_name
FROM students
WHERE score > (SELECT AVG(score) FROM students);
Example: Retrieve employees who work in the same department as ‘John Doe’.
SELECT employee_id, employee_name
FROM employees
WHERE department_id = (SELECT department_id FROM employees WHERE employee_name = 'John Doe');
2. Derived Tables
Example: Calculate the total sales by department and list departments with sales greater than $10,000.
SELECT department_name, total_sales
FROM
(SELECT department_id, SUM(sales_amount) AS total_sales
FROM sales
GROUP BY department_id) AS dept_sales
JOIN departments ON dept_sales.department_id = departments.department_id
WHERE total_sales > 10000;
Example: Find the name of the most expensive product in each category.
SELECT category_name, product_name, max_price
FROM
(SELECT category_id, MAX(price) AS max_price
FROM products
GROUP BY category_id) AS max_prices
JOIN products ON max_prices.category_id = products.category_id AND max_prices.max_price = products.price
JOIN categories ON products.category_id = categories.category_id;
3. Combining Subqueries and Derived Tables
Example: List students along with their respective class averages.
SELECT students.student_id, students.student_name, class_avgs.class_avg
FROM students
JOIN
(SELECT class_id, AVG(score) AS class_avg
FROM students
GROUP BY class_id) AS class_avgs
ON students.class_id = class_avgs.class_id;
Example: Get the top-selling product in each department.
SELECT departments.department_name, products.product_name, max_sales.max_sale
FROM
(SELECT department_id, MAX(sales_amount) AS max_sale
FROM sales
GROUP BY department_id) AS max_sales
JOIN products ON sales.product_id = products.product_id AND sales.sales_amount = max_sales.max_sale
JOIN departments ON sales.department_id = departments.department_id;
You can directly use these SQL practices for real-life data analysis tasks, incorporating subqueries and derived tables into your SQL queries.
Window Functions and Advanced Analytical Queries
Ranking Functions
-- Rank employees by salary within each department
SELECT
employee_id,
department_id,
salary,
RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
FROM
employees;
Window Aggregate Functions
-- Calculate the moving average of sales for the past 5 days
SELECT
sale_date,
sales_amount,
AVG(sales_amount) OVER (ORDER BY sale_date ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS moving_avg
FROM
sales;
Cumulative Sums
-- Calculate the cumulative sum of sales for each employee
SELECT
employee_id,
sale_date,
sales_amount,
SUM(sales_amount) OVER (PARTITION BY employee_id ORDER BY sale_date) AS cumulative_sales
FROM
sales;
Percentile Ranking
-- Calculate the percentile rank of each student's score
SELECT
student_id,
score,
PERCENT_RANK() OVER (ORDER BY score DESC) AS percentile_rank
FROM
student_scores;
Advanced Analytical Queries
Lead and Lag Functions
-- Compare this month's sales to the previous month's sales
SELECT
sale_date,
sales_amount,
LAG(sales_amount, 1) OVER (ORDER BY sale_date) AS previous_month_sales
FROM
monthly_sales;
First and Last Functions
-- Find the first and last sale amount for each product
SELECT
product_id,
FIRST_VALUE(sales_amount) OVER (PARTITION BY product_id ORDER BY sale_date) AS first_sale,
LAST_VALUE(sales_amount) OVER (PARTITION BY product_id ORDER BY sale_date ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_sale
FROM
sales;
NTILE Function
-- Divide employees into quartiles based on their salary
SELECT
employee_id,
salary,
NTILE(4) OVER (ORDER BY salary DESC) AS salary_quartile
FROM
employees;
Example of Combining Window Functions
-- Calculate rolling 7-day sales total and rank salespeople within each department
SELECT
employee_id,
department_id,
sale_date,
sales_amount,
SUM(sales_amount) OVER (PARTITION BY department_id ORDER BY sale_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_week_sales,
RANK() OVER (PARTITION BY department_id ORDER BY sales_amount DESC) AS sales_rank
FROM
sales;
These queries can be executed in any SQL-compliant database to perform advanced data analysis tasks using window functions.
Part 6: Common Table Expressions (CTEs)
Example 1: Simple CTE
SQL Code
WITH SalesCTE AS (
SELECT SalesID, CustomerID, Amount
FROM Sales
WHERE Amount > 500
)
SELECT * FROM SalesCTE;
Example 2: Recursive CTE
SQL Code
WITH RECURSIVE EmployeeHierarchy AS (
SELECT EmployeeID, ManagerID, FirstName, LastName
FROM Employees
WHERE ManagerID IS NULL
UNION ALL
SELECT e.EmployeeID, e.ManagerID, e.FirstName, e.LastName
FROM Employees e
INNER JOIN EmployeeHierarchy eh ON e.ManagerID = eh.EmployeeID
)
SELECT * FROM EmployeeHierarchy;
Example 3: CTE with Aggregation
SQL Code
WITH DepartmentSales AS (
SELECT
d.DepartmentID,
d.DepartmentName,
SUM(s.Amount) AS TotalSales
FROM Sales s
JOIN Departments d ON s.DepartmentID = d.DepartmentID
GROUP BY d.DepartmentID, d.DepartmentName
)
SELECT * FROM DepartmentSales;
Example 4: CTE for Complex Joins
SQL Code
WITH CustomerOrders AS (
SELECT
c.CustomerID,
c.CustomerName,
o.OrderID,
o.OrderDate
FROM Customers c
JOIN Orders o ON c.CustomerID = o.CustomerID
),
OrderDetails AS (
SELECT
co.CustomerID,
co.CustomerName,
co.OrderID,
co.OrderDate,
od.ProductID,
od.Quantity,
od.Price
FROM CustomerOrders co
JOIN OrderDetails od ON co.OrderID = od.OrderID
)
SELECT * FROM OrderDetails;
Example 5: Nested CTEs
SQL Code
WITH HighValueOrders AS (
SELECT
OrderID,
CustomerID,
TotalAmount
FROM Orders
WHERE TotalAmount > 1000
),
RecentHighValueOrders AS (
SELECT *
FROM HighValueOrders
WHERE OrderDate >= '2023-01-01'
)
SELECT *
FROM RecentHighValueOrders;
Final Note
These implementations of CTEs can be applied to various practical scenarios in SQL for data analysis. Adjust the table and column names according to your specific database schema.
Practical Implementation of Performance Optimization and Indexing Tips
Indexing
Create Index
-- Create an index on a single column
CREATE INDEX idx_column_name ON table_name(column_name);
-- Create a composite index on multiple columns
CREATE INDEX idx_multiple_columns ON table_name(column1, column2);
Drop Index
-- Drop an index
DROP INDEX idx_column_name;
Query Optimization
Use Indexes
-- Make sure the query utilizes the index
EXPLAIN SELECT column_name FROM table_name WHERE column_name = 'value';
Avoid SELECT *
-- Instead of using SELECT *, specify only the required columns
SELECT column1, column2 FROM table_name WHERE column_name = 'value';
Limit Dataset
-- Fetch only the rows you need using LIMIT
SELECT column1, column2
FROM table_name
WHERE column_name = 'value'
LIMIT 100;
Efficient Joins
Index Join Columns
-- Ensure indexes exist on columns used in JOINs
CREATE INDEX idx_join_column1 ON table_name1(column_name);
CREATE INDEX idx_join_column2 ON table_name2(column_name);
Use INNER JOIN over OUTER JOIN when possible
-- Inner join example
SELECT a.column1, b.column2
FROM table_name1 a
INNER JOIN table_name2 b ON a.join_column = b.join_column
WHERE a.column_name = 'value';
Avoid Unnecessary Sorting
Use ORDER BY only when necessary
-- Order by specific columns only if required
SELECT column1, column2
FROM table_name
WHERE column_name = 'value'
ORDER BY column1;
Optimizing Subqueries and CTEs
Materialize Intermediate Results with CTEs
-- Use CTE to optimize complex subqueries
WITH cte_name AS (
SELECT column1, column2
FROM table_name
WHERE column_name = 'value'
)
SELECT column1
FROM cte_name;
Partitioning
Partition Large Tables
-- Create partition on large table to enhance performance
CREATE TABLE partitioned_table (
id INT,
name VARCHAR(50),
created_date TIMESTAMP
) PARTITION BY RANGE (created_date);
Example of Range Partition
-- Partition by date ranges
CREATE TABLE orders (
order_id INT,
order_date DATE
) PARTITION BY RANGE (order_date) (
PARTITION p0 VALUES LESS THAN ('2023-01-01'),
PARTITION p1 VALUES LESS THAN ('2024-01-01')
);
Updating Statistics
Regularly Update Statistics
-- Update statistics to help the optimizer
ANALYZE TABLE table_name;
Automatic Statistics Update
-- Enable automatic statistics update
SET GLOBAL innodb_stats_auto_update = ON;
Apply these practical implementations to optimize performance and indexing in your SQL data analysis projects.