SQL for Data Science: Essential Techniques for Preprocessing and Analysis

by | SQL

Table of Contents

Introduction to SQL and Data Science

Overview

SQL (Structured Query Language) is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS). In data science, SQL is indispensable for manipulating and querying data efficiently. The following sections will guide you through foundational SQL concepts and their application in data science.

Setup Instructions

1. Installation

MySQL

For practical implementation, you will need a relational database. Follow the steps below to install MySQL, a popular RDBMS.

  1. Download MySQL:

    • Go to the MySQL website.
    • Download the MySQL Installer suitable for your operating system.
  2. Install MySQL:

    • Run the downloaded installer.
    • Follow the setup wizard instructions. Choose the “Developer Default” setup type for a typical data science environment.
    • Set root password and note it down for future use.
    • Complete the installation.

Alternative: SQLite

SQLite is a self-contained, serverless SQL database engine. It’s easier to set up than MySQL and perfect for beginners.

  1. Download SQLite:

  2. Install SQLite:

    • For Windows, unzip the binary and put it in a directory in your PATH.
    • For macOS, you can use Homebrew: brew install sqlite.

2. Tools

MySQL Workbench

Download and install MySQL Workbench, a graphical tool for MySQL.

SQLite Browser

Install DB Browser for SQLite, which provides a visual interface to interact with SQLite databases.

Basic SQL Operations

1. Creating a Database

MySQL

CREATE DATABASE data_science_db;
USE data_science_db;

SQLite

-- In your SQLite shell or DB Browser, simply open a new database file
.open data_science_db.sqlite

2. Creating Tables and Inserting Data

Create Table

CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name VARCHAR(50),
    email VARCHAR(50),
    signup_date DATE
);

Insert Data

INSERT INTO customers (customer_id, name, email, signup_date)
VALUES 
(1, 'John Doe', 'john.doe@example.com', '2020-01-01'),
(2, 'Jane Smith', 'jane.smith@example.com', '2020-02-01');

3. Querying Data

Select Query

SELECT * FROM customers;

Conditional Query

SELECT name, email FROM customers WHERE signup_date > '2020-01-31';

4. Updating and Deleting Data

Update Data

UPDATE customers SET email = 'jane.new@example.com' WHERE customer_id = 2;

Delete Data

DELETE FROM customers WHERE customer_id = 1;

Data Science Applications of SQL

Aggregation and Group By

Total Customers by Signup Month

SELECT COUNT(*) AS total_customers, MONTH(signup_date) AS signup_month
FROM customers
GROUP BY signup_month;

Joining Tables

Suppose we have another table, orders:

CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_id INT,
    order_amount DECIMAL(10, 2),
    order_date DATE,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);

Insert some data into orders:

INSERT INTO orders (order_id, customer_id, order_amount, order_date)
VALUES 
(101, 1, 100.00, '2020-01-15'),
(102, 2, 200.00, '2020-02-20');

Query with Join

SELECT customers.name, orders.order_id, orders.order_amount
FROM customers
JOIN orders ON customers.customer_id = orders.customer_id;

Advanced SQL: Subqueries and CTEs

Subquery for Average Order Amount

SELECT name, (
    SELECT AVG(order_amount) 
    FROM orders 
    WHERE orders.customer_id = customers.customer_id
) AS avg_order_amount
FROM customers;

Common Table Expressions (CTEs)

WITH OrderTotals AS (
    SELECT customer_id, SUM(order_amount) AS total_amount
    FROM orders
    GROUP BY customer_id
)
SELECT customers.name, OrderTotals.total_amount
FROM customers
JOIN OrderTotals ON customers.customer_id = OrderTotals.customer_id;

By following through this structured approach, you will gain a foundational understanding of SQL and its applications in data science.

Working with Databases and Tables in SQL

This segment focuses on practical implementations of how to efficiently create, modify, and interact with databases and tables within SQL, specifically tailored for data science applications.

Creating a Database

CREATE DATABASE DataScienceDB;

Selecting a Database

USE DataScienceDB;

Creating Tables

Example: Creating a Table for Storing User Data

CREATE TABLE Users (
    UserID INT PRIMARY KEY AUTO_INCREMENT,
    UserName VARCHAR(50) NOT NULL,
    Email VARCHAR(100) UNIQUE NOT NULL,
    JoinDate DATE NOT NULL
);

Example: Creating a Table for Storing Transaction Data

CREATE TABLE Transactions (
    TransactionID INT PRIMARY KEY AUTO_INCREMENT,
    UserID INT,
    TransactionAmount DECIMAL(10, 2),
    TransactionDate DATE,
    FOREIGN KEY (UserID) REFERENCES Users(UserID)
);

Inserting Data into Tables

Insert Data into Users Table

INSERT INTO Users (UserName, Email, JoinDate)
VALUES ('JohnDoe', 'john.doe@example.com', '2023-01-15');

Insert Data into Transactions Table

INSERT INTO Transactions (UserID, TransactionAmount, TransactionDate)
VALUES (1, 150.75, '2023-03-10');

Querying Data from Tables

Simple SELECT Query

SELECT * FROM Users;

Filtering Data with WHERE Clause

SELECT * FROM Transactions
WHERE TransactionAmount > 100.00;

Joining Tables

SELECT Users.UserName, Transactions.TransactionAmount, Transactions.TransactionDate
FROM Users
JOIN Transactions ON Users.UserID = Transactions.UserID;

Updating Data in Tables

Update a User’s Email

UPDATE Users
SET Email = 'new.email@example.com'
WHERE UserName = 'JohnDoe';

Deleting Data from Tables

Delete a User

DELETE FROM Users
WHERE UserName = 'JohnDoe';

Altering Tables

Adding a New Column

ALTER TABLE Users
ADD COLUMN LastLoginDate DATETIME;

Modifying an Existing Column

ALTER TABLE Users
MODIFY COLUMN Email VARCHAR(150) NOT NULL;

Dropping a Column

ALTER TABLE Users
DROP COLUMN LastLoginDate;

Aggregating Data

Counting the Number of Users

SELECT COUNT(*) FROM Users;

Summing Transaction Amounts

SELECT SUM(TransactionAmount) FROM Transactions;

Grouping Data

Group Transactions by User

SELECT UserID, SUM(TransactionAmount) as TotalSpent
FROM Transactions
GROUP BY UserID;

Indexing for Performance

Creating an Index on UserName

CREATE INDEX idx_user_name ON Users(UserName);

Practical Use Case: Summary of Daily Transactions

Creating a Summary Table

CREATE TABLE DailyTransactionSummary (
    TransactionDate DATE PRIMARY KEY,
    TotalTransactions INT,
    TotalAmount DECIMAL(10, 2)
);

Populating the Summary Table

INSERT INTO DailyTransactionSummary (TransactionDate, TotalTransactions, TotalAmount)
SELECT TransactionDate, COUNT(*), SUM(TransactionAmount)
FROM Transactions
GROUP BY TransactionDate;

This practical implementation covers the essential SQL techniques needed for managing databases and performing key operations on tables, fostering the ability to handle large data sets and derive meaningful insights in data science contexts.

Data Selection and Filtering Techniques in SQL

1. SELECT Statements

Basic SELECT

-- Select all columns from a table
SELECT * 
FROM employees;

Selecting Specific Columns

-- Select specific columns from a table
SELECT first_name, last_name, department
FROM employees;

2. Filtering Data with WHERE

Basic Filtering with WHERE

-- Select employees from the 'Sales' department
SELECT * 
FROM employees
WHERE department = 'Sales';

Using Comparison Operators

-- Select employees with a salary greater than 50000
SELECT * 
FROM employees
WHERE salary > 50000;

Using Multiple Conditions with AND/OR

-- Select employees from 'Sales' department with salary more than 50000
SELECT * 
FROM employees
WHERE department = 'Sales' 
AND salary > 50000;

-- Select employees from 'Sales' or 'Marketing' department
SELECT * 
FROM employees
WHERE department = 'Sales' 
   OR department = 'Marketing';

3. Advanced Filtering

Using IN

-- Select employees from the 'Sales', 'Marketing' or 'HR' departments
SELECT * 
FROM employees
WHERE department IN ('Sales', 'Marketing', 'HR');

Using BETWEEN

-- Select employees with a salary between 40000 and 60000
SELECT * 
FROM employees
WHERE salary BETWEEN 40000 AND 60000;

Using LIKE for Pattern Matching

-- Select employees whose names start with 'J'
SELECT * 
FROM employees
WHERE first_name LIKE 'J%';

-- Select employees whose names contain 'son'
SELECT * 
FROM employees
WHERE last_name LIKE '%son%';

4. NULL Handling

IS NULL and IS NOT NULL

-- Select employees where the email is not provided
SELECT * 
FROM employees
WHERE email IS NULL;

-- Select employees where the phone number is provided
SELECT * 
FROM employees
WHERE phone_number IS NOT NULL;

5. Combining Conditions with ORDER BY

Basic Sorting

-- Select all employees sorted by last name
SELECT * 
FROM employees
ORDER BY last_name;

Sorting with Multiple Columns

-- Select all employees sorted by department and then by salary (descending)
SELECT * 
FROM employees
ORDER BY department, salary DESC;

6. LIMIT for Sampling Data

Limiting Results

-- Select the first 10 employees
SELECT * 
FROM employees
LIMIT 10;

Pagination with OFFSET

-- Select employees for the second page (assuming 10 records per page)
SELECT * 
FROM employees
ORDER BY last_name
LIMIT 10 OFFSET 10;

These SQL techniques are essential for data selection and filtering, enabling efficient data retrieval and manipulation for data science tasks. They form a core part of querying databases, allowing you to extract relevant datasets for analysis.

Advanced Data Aggregation and Grouping in SQL

In this section, we’ll cover advanced SQL techniques for data aggregation and grouping, tailored for data science applications. These techniques are crucial for summarizing large datasets and uncovering insights.

Grouping Data: GROUP BY

The GROUP BY clause is used to arrange identical data into groups. This is often combined with aggregate functions like SUM, COUNT, AVG, MIN, and MAX.

Example: Basic Grouping

Let’s say we have a table sales:

SELECT product_id, SUM(sales_amount) as total_sales
FROM sales
GROUP BY product_id;

This query calculates the total sales for each product.

Filtering Groups: HAVING

The HAVING clause filters groups based on a condition (similar to WHERE, but for groups).

Example: Filtering Groups

SELECT product_id, SUM(sales_amount) as total_sales
FROM sales
GROUP BY product_id
HAVING SUM(sales_amount) > 1000;

This query returns only products with total sales greater than 1000.

Using Window Functions for Advanced Aggregation

Window functions allow computation across a set of table rows that are related to the current row. They don’t collapse rows into groups.

Example: Calculating a Rolling Average

Consider we have a sales table with columns order_date, product_id, and sales_amount. To calculate a rolling average of sales over the last 7 days:

SELECT 
    order_date,
    product_id,
    sales_amount,
    AVG(sales_amount) OVER (
        PARTITION BY product_id
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) as rolling_avg
FROM sales;

This query computes a 7-day rolling average of sales for each product.

Using CTEs for Complex Aggregations

Common Table Expressions (CTEs) allow for intermediate results which simplify complex queries.

Example: Cumulative Sum

WITH product_sales AS (
    SELECT 
        product_id,
        order_date,
        sales_amount,
        SUM(sales_amount) OVER (
            PARTITION BY product_id
            ORDER BY order_date
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) as cumulative_sales
    FROM sales
)
SELECT * FROM product_sales;

This query computes a cumulative sum of sales for each product.

Combining Multiple Aggregations

Sometimes, multiple aggregations are needed in a single result set.

Example: Multiple Aggregations

SELECT 
    product_id,
    COUNT(*) as total_orders,
    SUM(sales_amount) as total_sales,
    AVG(sales_amount) as avg_sales,
    MAX(sales_amount) as max_sales,
    MIN(sales_amount) as min_sales
FROM sales
GROUP BY product_id;

This query calculates various aggregations for each product.

Conclusion

Advanced data aggregation and grouping techniques in SQL are powerful tools for data scientists. By leveraging functions like GROUP BY, HAVING, window functions, CTEs, and combining multiple aggregations, one can perform complex and insightful data analysis efficiently. These examples should provide a solid foundation for applying these techniques in real-life scenarios.

Data Cleaning and Preparation in SQL

To effectively clean and prepare your data in SQL, follow these practical steps using SQL queries. Each step is crucial in ensuring the quality and usability of your dataset for downstream data science applications.

1. Handling Missing Values

Identify Missing Values

SELECT *
FROM your_table
WHERE column_name IS NULL;

Fill Missing Values

To replace NULL values with a default value:

UPDATE your_table
SET column_name = 'default_value'
WHERE column_name IS NULL;

Alternatively, to fill NULL values for numerical columns:

UPDATE your_table
SET column_name = 0
WHERE column_name IS NULL;

2. Removing Duplicates

Identify Duplicates

SELECT column_name, COUNT(*)
FROM your_table
GROUP BY column_name
HAVING COUNT(*) > 1;

Remove Duplicates

To remove exact duplicate rows:

DELETE FROM your_table
WHERE rowid NOT IN (
    SELECT MIN(rowid)
    FROM your_table
    GROUP BY column_name1, column_name2, ...);

3. Standardizing Data Formats

Standardize Text Case

To convert a text column to lowercase:

UPDATE your_table
SET text_column = LOWER(text_column);

To convert a text column to uppercase:

UPDATE your_table
SET text_column = UPPER(text_column);

Standardize Date Formats

To convert dates to a standard format (e.g., ‘YYYY-MM-DD’):

UPDATE your_table
SET date_column = STRFTIME('%Y-%m-%d', date_column);

4. Removing Outliers

Identify Outliers

For a numerical column, identify outliers using standard deviation:

SELECT *
FROM your_table
WHERE column_name > (SELECT AVG(column_name) + 3 * STDDEV(column_name)
                     FROM your_table)
   OR column_name < (SELECT AVG(column_name) - 3 * STDDEV(column_name)
                     FROM your_table);

Remove Outliers

DELETE FROM your_table
WHERE column_name > (SELECT AVG(column_name) + 3 * STDDEV(column_name)
                     FROM your_table)
   OR column_name < (SELECT AVG(column_name) - 3 * STDDEV(column_name)
                     FROM your_table);

5. Data Type Conversion

Convert Data Types

To convert a text column to an integer:

ALTER TABLE your_table
ADD COLUMN new_int_column INTEGER;

UPDATE your_table
SET new_int_column = CAST(old_text_column AS INTEGER);

To convert a text column to a date:

ALTER TABLE your_table
ADD COLUMN new_date_column DATE;

UPDATE your_table
SET new_date_column = CAST(old_text_column AS DATE);

Conclusion

These SQL queries cover practical steps for data cleaning and preparation, essential for ensuring that your data is in the best shape for analysis. Implement these steps directly in your SQL environment to achieve clean, standardized, and ready-to-use datasets.

Joining and Combining Datasets

In SQL, joining and combining datasets is fundamental for enhancing the power of data analysis. Here’s an implementation guide for various types of joins and combinations.

Inner Join

An Inner Join combines rows from two tables where there is a match in the columns specified in the JOIN condition.

SELECT A.column1, A.column2, B.column3
FROM TableA A
INNER JOIN TableB B ON A.common_column = B.common_column;

Left Join

A Left Join returns all rows from the left table and matching rows from the right table. If no match is found, NULL values are returned for columns from the right table.

SELECT A.column1, A.column2, B.column3
FROM TableA A
LEFT JOIN TableB B ON A.common_column = B.common_column;

Right Join

A Right Join returns all rows from the right table and matching rows from the left table. If no match is found, NULL values are returned for columns from the left table.

SELECT A.column1, A.column2, B.column3
FROM TableA A
RIGHT JOIN TableB B ON A.common_column = B.common_column;

Full Outer Join

A Full Outer Join returns rows when there is a match in one of the tables. If there is no match, the result is NULL on the matching column positions.

SELECT A.column1, A.column2, B.column3
FROM TableA A
FULL OUTER JOIN TableB B ON A.common_column = B.common_column;

Cross Join

A Cross Join returns the Cartesian product of the two tables, i.e., it combines each row of the first table with each row of the second table.

SELECT A.column1, B.column3
FROM TableA A
CROSS JOIN TableB B;

Union

Union combines the result-set of two or more SELECT statements. Each SELECT statement within the UNION must have the same number of columns in the same order with similar data types.

SELECT column1, column2 FROM TableA
UNION
SELECT column1, column2 FROM TableB;

Union All

Union All is similar to Union but it includes duplicate rows from both tables.

SELECT column1, column2 FROM TableA
UNION ALL
SELECT column1, column2 FROM TableB;

Combining Aggregated Data

Combining aggregated data can also be useful:

SELECT A.department, A.total_sales, B.total_expenses
FROM 
  (SELECT department, SUM(sales) AS total_sales FROM Sales GROUP BY department) A
INNER JOIN 
  (SELECT department, SUM(expenses) AS total_expenses FROM Expenses GROUP BY department) B
ON A.department = B.department;

Implement these join and combination techniques in your SQL queries to leverage powerful data manipulation capabilities essential for advanced data analysis in data science applications.

Subqueries and Nested Queries in SQL

Subqueries and nested queries are pivotal in SQL for performing complex operations and retrieving advanced data insights. This section covers various practical implementations of subqueries and nested queries.

Subqueries in the SELECT Clause

Example: Select Employees with Salaries Higher Than the Average Salary

SELECT employee_id, employee_name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

Subqueries in the FROM Clause

Example: List Department Name and Average Salary for each Department

SELECT departments.department_name, avg_dept_salary.avg_salary
FROM departments
JOIN (
    SELECT department_id, AVG(salary) as avg_salary
    FROM employees
    GROUP BY department_id
) as avg_dept_salary
ON departments.department_id = avg_dept_salary.department_id;

Subqueries in the WHERE Clause

Example: Find Employees Who Belong to Departments Located in a Specific City

SELECT employee_id, employee_name
FROM employees
WHERE department_id IN (
    SELECT department_id
    FROM departments
    WHERE location = 'New York'
);

Correlated Subqueries

Example: Find Employees whose Salary is Greater than the Average Salary in Their Department

SELECT employee_id, employee_name, salary
FROM employees e1
WHERE salary > (
    SELECT AVG(salary)
    FROM employees e2
    WHERE e1.department_id = e2.department_id
);

Subqueries with EXISTS

Example: Determine Departments with No Employees

SELECT department_name
FROM departments d
WHERE NOT EXISTS (
    SELECT 1
    FROM employees e
    WHERE e.department_id = d.department_id
);

Subqueries with UPDATE Statements

Example: Increase Salary by 10% for Employees in Departments Located in ‘Chicago’

UPDATE employees
SET salary = salary * 1.10
WHERE department_id IN (
    SELECT department_id
    FROM departments
    WHERE location = 'Chicago'
);

Subqueries and nested queries are powerful features in SQL that enable sophisticated data extraction and manipulation techniques. These examples illustrate their practical applications tailored to data science needs. Each can be directly implemented to gain invaluable insights and execute complex database operations.

Implementing Statistical Functions in SQL

For data science applications, it’s essential to perform various statistical calculations directly within SQL databases. Below are practical implementations of common statistical functions such as mean, median, mode, variance, and standard deviation using SQL.

Mean (Average)

SELECT AVG(column_name) AS mean_value
FROM table_name;

Median

Obtaining the median can be more complex since SQL does not have a built-in median function. The following example demonstrates how to calculate the median using a subquery and window functions.

SELECT
    AVG(column_value) AS median_value
FROM (
    SELECT
        column_name AS column_value,
        ROW_NUMBER() OVER (ORDER BY column_name) AS row_num,
        COUNT(*) OVER () AS total_rows
    FROM table_name
) AS subquery
WHERE row_num IN ((total_rows + 1) / 2, (total_rows + 2) / 2);

Mode

Mode can also require more complex querying since it involves finding the most frequent value(s) in a column.

SELECT column_name AS mode_value
FROM table_name
GROUP BY column_name
ORDER BY COUNT(column_name) DESC
LIMIT 1;

Variance

Variance can be calculated using built-in aggregate functions for variation.

SELECT VARIANCE(column_name) AS variance_value
FROM table_name;

Standard Deviation

Standard deviation is often required in data science for understanding data dispersion and can be calculated as follows:

SELECT STDDEV(column_name) AS std_dev_value
FROM table_name;

Example Table and Column

Here are the SQL implementation examples using a hypothetical table sales_data with a column named sale_amount.

Mean

SELECT AVG(sale_amount) AS mean_value
FROM sales_data;

Median

SELECT
    AVG(column_value) AS median_value
FROM (
    SELECT
        sale_amount AS column_value,
        ROW_NUMBER() OVER (ORDER BY sale_amount) AS row_num,
        COUNT(*) OVER () AS total_rows
    FROM sales_data
) AS subquery
WHERE row_num IN ((total_rows + 1) / 2, (total_rows + 2) / 2);

Mode

SELECT sale_amount AS mode_value
FROM sales_data
GROUP BY sale_amount
ORDER BY COUNT(sale_amount) DESC
LIMIT 1;

Variance

SELECT VARIANCE(sale_amount) AS variance_value
FROM sales_data;

Standard Deviation

SELECT STDDEV(sale_amount) AS std_dev_value
FROM sales_data;

Utilize these SQL queries to implement statistical functions directly within your data science projects. These calculations are fundamental for understanding and analyzing your dataset effectively.

Part 9: Performance Optimization and Best Practices

Indexing

Creating indexes on columns that are frequently used in WHERE, JOIN, and ORDER BY clauses can greatly improve query performance.

CREATE INDEX idx_customer_id ON orders (customer_id);
CREATE INDEX idx_order_date ON orders (order_date);

Avoiding Select *

Selecting only the necessary columns reduces the amount of data processed and transferred.

-- Inefficient
SELECT * FROM customers WHERE city = 'New York';

-- Efficient
SELECT customer_id, first_name, last_name, email FROM customers WHERE city = 'New York';

Use EXPLAIN for Query Analysis

Using EXPLAIN allows you to understand how SQL queries are executed and identify bottlenecks.

EXPLAIN SELECT customer_id, first_name, last_name FROM customers WHERE city = 'New York';

Proper Use of Joins

Ensure joins are done on indexed columns and use the appropriate join type to match the specific use case.

SELECT 
    a.order_id, 
    a.order_date, 
    b.customer_name 
FROM 
    orders a
JOIN 
    customers b ON a.customer_id = b.customer_id;

Limiting Data with LIMIT

Limit the number of rows returned when you only need a subset of the data.

SELECT customer_id, first_name, last_name FROM customers WHERE city = 'New York' LIMIT 100;

Using WHERE Clauses Efficiently

Use WHERE clauses to filter data as early as possible in the query execution process.

-- Inefficient: No WHERE clause
SELECT * FROM orders;

-- Efficient: Filtered with WHERE clause
SELECT order_id, order_date FROM orders WHERE order_date > '2023-01-01';

Avoiding Complex Expressions in WHERE Clauses

Avoid using complex expressions that prevent the use of indexes.

-- Inefficient: Complex expression
SELECT * FROM orders WHERE YEAR(order_date) = 2023;

-- Efficient: Use a range
SELECT * FROM orders WHERE order_date BETWEEN '2023-01-01' AND '2023-12-31';

Use of INNER JOIN vs. OUTER JOIN

Use INNER JOIN when you need only the matching rows and avoid OUTER JOIN unless necessary, as it can be more resource-intensive.

-- Inner Join
SELECT orders.order_id, customers.customer_name 
FROM orders 
INNER JOIN customers ON orders.customer_id = customers.customer_id;

-- Outer Join (use only if necessary)
SELECT orders.order_id, customers.customer_name 
FROM orders 
LEFT JOIN customers ON orders.customer_id = customers.customer_id;

Optimizing GROUP BY and ORDER BY

Ensure indexed columns are used in GROUP BY and ORDER BY clauses.

-- Inefficient
SELECT customer_id, COUNT(*) 
FROM orders 
GROUP BY customer_id 
ORDER BY COUNT(*) DESC;

-- Efficient
CREATE INDEX idx_customer_id_count ON orders (customer_id);
SELECT customer_id, COUNT(*) 
FROM orders 
GROUP BY customer_id 
ORDER BY COUNT(*) DESC;

Avoiding Unnecessary DISTINCT

Only use DISTINCT when necessary as it requires additional computation.

-- Inefficient: Unnecessary DISTINCT
SELECT DISTINCT customer_id FROM orders WHERE order_date > '2023-01-01';

-- Efficient: If DISTINCT is necessary
SELECT customer_id FROM orders WHERE order_date > '2023-01-01' GROUP BY customer_id;

By following these practices and optimizing your SQL queries, you can significantly enhance the performance and efficiency of your data science applications.

Case Studies and Real-World Applications

Case Study 1: Sales Analytics

Objective:

Analyze a company’s sales performance and identify trends and patterns to improve decision-making.

Real-World Application:

-- Calculate the total sales for each product category.
SELECT 
    category,
    SUM(sales_amount) AS total_sales
FROM 
    sales
JOIN 
    products ON sales.product_id = products.product_id
GROUP BY 
    category;

-- Identify the top 5 products with the highest sales in the last quarter.
SELECT 
    p.product_name,
    SUM(s.sales_amount) AS total_sales
FROM 
    sales s
JOIN 
    products p ON s.product_id = p.product_id
WHERE 
    s.sale_date BETWEEN '2023-07-01' AND '2023-09-30'
GROUP BY 
    p.product_name
ORDER BY 
    total_sales DESC
LIMIT 
    5;

-- Calculate the average monthly sales for each region.
SELECT 
    region,
    AVG(monthly_sales) AS avg_monthly_sales
FROM (
    SELECT 
        region,
        DATE_TRUNC('month', sale_date) AS month,
        SUM(sales_amount) AS monthly_sales
    FROM 
        sales
    JOIN 
        stores ON sales.store_id = stores.store_id
    GROUP BY 
        region, month
) AS subquery
GROUP BY 
    region;

Case Study 2: Customer Segmentation

Objective:

Perform customer segmentation to help target marketing campaigns more effectively.

Real-World Application:

-- Segment customers based on their total spending.
SELECT 
    customer_id,
    CASE 
        WHEN total_spending < 1000 THEN 'Bronze'
        WHEN total_spending BETWEEN 1000 AND 5000 THEN 'Silver'
        ELSE 'Gold'
    END AS customer_segment
FROM (
    SELECT 
        customer_id,
        SUM(sales_amount) AS total_spending
    FROM 
        sales
    GROUP BY 
        customer_id
) AS subquery;

-- Calculate the average purchase frequency per segment.
SELECT 
    customer_segment,
    AVG(purchase_count) AS avg_purchase_frequency
FROM (
    SELECT 
        customer_id,
        COUNT(*) AS purchase_count,
        CASE 
            WHEN SUM(sales_amount) < 1000 THEN 'Bronze'
            WHEN SUM(sales_amount) BETWEEN 1000 AND 5000 THEN 'Silver'
            ELSE 'Gold'
        END AS customer_segment
    FROM 
        sales
    GROUP BY 
        customer_id
) AS subquery
GROUP BY 
    customer_segment;

Case Study 3: Inventory Management

Objective:

Optimize inventory levels by analyzing stock turnover and product demand.

Real-World Application:

-- Calculate stock turnover ratio for each product.
SELECT 
    product_id,
    SUM(sales_quantity) / AVG(stock_quantity) AS stock_turnover_ratio
FROM 
    sales
JOIN 
    inventory ON sales.product_id = inventory.product_id
GROUP BY 
    product_id;

-- Identify fast-moving and slow-moving products.
SELECT 
    product_id,
    stock_turnover_ratio,
    CASE 
        WHEN stock_turnover_ratio > 5 THEN 'Fast-Moving'
        ELSE 'Slow-Moving'
    END AS product_movement_category
FROM (
    SELECT 
        product_id,
        SUM(sales_quantity) / AVG(stock_quantity) AS stock_turnover_ratio
    FROM 
        sales
    JOIN 
        inventory ON sales.product_id = inventory.product_id
    GROUP BY 
        product_id
) AS subquery;

-- Forecast future inventory needs based on historical sales data.
SELECT 
    product_id,
    AVG(monthly_sales) AS forecasted_monthly_sales,
    AVG(monthly_sales) * 3 AS forecasted_quarterly_sales
FROM (
    SELECT 
        product_id,
        DATE_TRUNC('month', sale_date) AS month,
        SUM(sales_quantity) AS monthly_sales
    FROM 
        sales
    GROUP BY 
        product_id, month
) AS subquery
GROUP BY 
    product_id;

These SQL queries can be integrated into your educational journey to provide practical and real-world examples of how SQL can be used in data science applications. By working through these case studies, students will gain hands-on experience with SQL techniques designed to solve actual business problems.

Related Posts