Practical SQL Exercises: Analyzing Real-World Datasets


Accessing and Understanding Real-World Datasets

Prerequisites

  1. Ensure you have a coding environment set up (Jupyter Notebook or any other IDE)
  2. Install necessary libraries (if applicable)
    • For Python: pandas, matplotlib, seaborn
pip install pandas matplotlib seaborn

Step 1: Accessing the Dataset


  1. Download the Dataset: Obtain a real-world dataset from a source such as Kaggle or the UCI Machine Learning Repository.


    Example URL: https://example.com/dataset.csv



  2. Loading the Dataset:


    import pandas as pd

    # Load dataset from a local file or a remote URL
    dataset_url = 'https://example.com/dataset.csv'
    df = pd.read_csv(dataset_url)

Step 2: Understanding the Dataset


  1. Display the First Few Rows:


    print(df.head())  # Display the first 5 rows


  2. Summary Statistics:


    print(df.describe())  # Summary statistics of numerical columns


  3. Data Types and Missing Values:


    print(df.info())  # Info on data types and missing values


  4. Check for Null Values:


    print(df.isnull().sum())  # Count of null values in each column

Step 3: Data Cleaning


  1. Handle Missing Values:


    # Example: fill missing values in numeric columns with the column mean
    df.fillna(df.mean(numeric_only=True), inplace=True)


  2. Remove Duplicates:


    df.drop_duplicates(inplace=True)


  3. Convert Data Types (if necessary):


    df['column_name'] = df['column_name'].astype(expected_type)

Step 4: Data Analysis


  1. Correlation Matrix:


    print(df.corr(numeric_only=True))  # Correlations between numerical columns


  2. Grouping and Aggregation:


    grouped_df = df.groupby('category_column').agg({'value_column': 'sum'})
    print(grouped_df)

Step 5: Data Visualization


  1. Basic Plotting with Matplotlib:


    import matplotlib.pyplot as plt

    # Histogram
    df['column_name'].hist()
    plt.show()

    # Scatter Plot
    plt.scatter(df['x_column'], df['y_column'])
    plt.show()


  2. Advanced Plotting with Seaborn:


    import seaborn as sns

    # Heatmap of Correlation Matrix
    sns.heatmap(df.corr(numeric_only=True), annot=True)
    plt.show()

    # Box Plot
    sns.boxplot(x='category_column', y='value_column', data=df)
    plt.show()


With these steps, you can access, understand, clean, analyze, and visualize real-world datasets. Adapt the code snippets according to your specific dataset and project needs.

Data Cleaning and Preparation

Requirements

  1. Remove missing values.
  2. Normalize numerical data.
  3. Handle duplicate entries.
  4. Encode categorical variables.
  5. Adjust for outliers.

Implementation Steps


  1. Remove Missing Values


    FOR each column IN dataset:
        IF column HAS missing values:
            REMOVE rows WITH missing values


  2. Normalize Numerical Data


    FOR each numerical_column IN dataset:
        mean = MEAN(numerical_column)
        std_dev = STD_DEV(numerical_column)
        dataset[numerical_column] = (dataset[numerical_column] - mean) / std_dev


  3. Handle Duplicate Entries


    dataset = REMOVE_DUPLICATES(dataset)


  4. Encode Categorical Variables


    FOR each column IN dataset:
        IF column IS categorical:
            unique_values = UNIQUE(column)
            encoding_dictionary = CREATE_DICTIONARY(unique_values)
            dataset[column] = APPLY_ENCODING(column, encoding_dictionary)


  5. Adjust for Outliers


    FOR each numerical_column IN dataset:
        q1 = QUANTILE(numerical_column, 0.25)
        q3 = QUANTILE(numerical_column, 0.75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr
        dataset = REMOVE_OUTSIDE_BOUND(numerical_column, lower_bound, upper_bound)
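
A minimal pandas sketch of these five steps, assuming the DataFrame loaded earlier; the helper name clean_and_prepare, the integer encoding, and the row-dropping outlier strategy are illustrative choices rather than the only valid ones:

import pandas as pd

def clean_and_prepare(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # 1. Remove rows with missing values
    df = df.dropna()

    # 2. Normalize numerical columns (z-score)
    num_cols = df.select_dtypes(include='number').columns
    df[num_cols] = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()

    # 3. Handle duplicate entries
    df = df.drop_duplicates()

    # 4. Encode categorical variables as integer codes
    cat_cols = df.select_dtypes(include='object').columns
    for col in cat_cols:
        df[col] = df[col].astype('category').cat.codes

    # 5. Drop rows outside the 1.5 * IQR bounds of each numerical column
    for col in num_cols:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        df = df[(df[col] >= q1 - 1.5 * iqr) & (df[col] <= q3 + 1.5 * iqr)]

    return df

cleaned = clean_and_prepare(df)

One-hot encoding (pd.get_dummies, used later in the EDA section) is often preferable to integer codes when the categories have no natural order.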

Final Output

  • dataset is now cleaned and prepared for analysis.

Exploratory Data Analysis (EDA)

Load the Dataset

import pandas as pd

# Assume 'data.csv' is our dataset
df = pd.read_csv("data.csv")

Display Basic Information

# Shape of the dataset
print("Shape of dataset:", df.shape)

# Data type of each column
print("Data types:n", df.dtypes)

# First few rows of the dataset
print("First few rows:n", df.head())

# Basic statistics of numerical columns
print("Descriptive statistics:n", df.describe())

Univariate Analysis

import matplotlib.pyplot as plt
import seaborn as sns

# Histograms for numerical columns
df.hist(figsize=(10, 8))
plt.tight_layout()
plt.show()

# Count plot for categorical columns
cat_columns = df.select_dtypes(include=['object']).columns
for col in cat_columns:
    sns.countplot(x=col, data=df)
    plt.show()

Bivariate Analysis

# Correlation matrix
corr_matrix = df.corr(numeric_only=True)
print("Correlation matrix:\n", corr_matrix)

# Heatmap of the correlation matrix
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()

# Scatterplot for pairs of numerical features
num_columns = df.select_dtypes(include=['int64', 'float64']).columns
sns.pairplot(df[num_columns])
plt.show()

Outlier Detection and Handling

# Boxplots to detect outliers
for col in num_columns:
    sns.boxplot(x=df[col])
    plt.title(f"Boxplot of {col}")
    plt.show()

# Example of handling outliers (capping)
for col in num_columns:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    df[col] = df[col].clip(lower=lower_bound, upper=upper_bound)

Missing Data Analysis

# Check for missing values
print("Missing values:n", df.isnull().sum())

# Visualize missing values
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.show()

# Handling missing values (example: fill with median)
for col in num_columns:
    df[col] = df[col].fillna(df[col].median())

Feature Engineering (if applicable)

# Example: Creating a new feature based on existing ones
df['new_feature'] = df['feature1'] / df['feature2']

# Convert categorical features to numerical using one-hot encoding
df = pd.get_dummies(df)

Summary of EDA

# Summary statistics after EDA
print("Updated dataset shape:", df.shape)
print("Updated descriptive statistics:n", df.describe())

Save Cleaned Dataset

# Save the cleaned dataset for further analysis
df.to_csv("cleaned_data.csv", index=False)

Test and validate each section independently to ensure correctness.

Data Visualization Techniques

1. Import Necessary Libraries

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

2. Load Data (Pandas)

df = pd.read_csv('your_dataset.csv')

3. Line Plot

plt.figure(figsize=(10, 5))
plt.plot(df['date'], df['value'], marker='o')
plt.title('Line Plot Example')
plt.xlabel('Date')
plt.ylabel('Value')
plt.grid(True)
plt.show()

4. Bar Plot

plt.figure(figsize=(10, 5))
plt.bar(df['category'], df['value'])
plt.title('Bar Plot Example')
plt.xlabel('Category')
plt.ylabel('Value')
plt.xticks(rotation=90)
plt.show()

5. Histogram

plt.figure(figsize=(10, 5))
plt.hist(df['value'], bins=20, edgecolor='black')
plt.title('Histogram Example')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()

6. Scatter Plot

plt.figure(figsize=(10, 5))
plt.scatter(df['variable1'], df['variable2'])
plt.title('Scatter Plot Example')
plt.xlabel('Variable 1')
plt.ylabel('Variable 2')
plt.show()

7. Box Plot

plt.figure(figsize=(10, 5))
sns.boxplot(x='category', y='value', data=df)
plt.title('Box Plot Example')
plt.xlabel('Category')
plt.ylabel('Value')
plt.xticks(rotation=90)
plt.show()

8. Heatmap

plt.figure(figsize=(10, 5))
correlation_matrix = df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Heatmap Example')
plt.show()

9. Pair Plot

pair_plot = sns.pairplot(df)  # pairplot creates its own figure grid
pair_plot.fig.suptitle('Pair Plot Example', y=1.02)
plt.show()

Adapt the column names in these df calls to match your actual dataset before running the visualizations.

Advanced Data Analysis Methods

Clustering Analysis with K-Means

from sklearn.cluster import KMeans

# Assuming 'data' is preprocessed and ready for clustering
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(data)

# Append the clusters to the dataset
data['Cluster'] = clusters

Principal Component Analysis (PCA)

from sklearn.decomposition import PCA

# Standardize the data before applying PCA
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# Perform PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(scaled_data)
data['PC1'], data['PC2'] = principal_components[:, 0], principal_components[:, 1]

Feature Engineering using Polynomial Features

from sklearn.preprocessing import PolynomialFeatures

# Generate polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(data)

# Convert to DataFrame for ease of use
poly_data = pd.DataFrame(poly_features, columns=poly.get_feature_names_out(data.columns))

Time Series Analysis – ARIMA Model

from statsmodels.tsa.arima.model import ARIMA

# Assuming 'time_series_data' is your preprocessed time series data
model = ARIMA(time_series_data['value'], order=(1, 1, 1))
fitted_model = model.fit()

# Forecasting next 10 steps
forecast = fitted_model.forecast(steps=10)

Association Rule Mining using Apriori

from mlxtend.frequent_patterns import apriori, association_rules

# Assuming 'transactions' is the preprocessed transactional data
frequent_itemsets = apriori(transactions, min_support=0.1, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)

# Filter rules for high confidence
high_confidence_rules = rules[rules['confidence'] > 0.75]

Anomaly Detection with Isolation Forest

from sklearn.ensemble import IsolationForest

# Setup and fit the Isolation Forest model
isolation_forest = IsolationForest(contamination=0.1, random_state=42)
data['anomaly_score'] = isolation_forest.fit_predict(data)

# Filter anomalies based on score
anomalies = data[data['anomaly_score'] == -1]

Network Analysis with Centrality Measures

import networkx as nx

# Assuming 'edges' is a list of tuples containing edge information
G = nx.Graph()
G.add_edges_from(edges)

# Calculate centrality
centrality = nx.degree_centrality(G)

# Append centrality measures to node attributes
nx.set_node_attributes(G, centrality, 'centrality')

Apply these advanced methods as appropriate to your project to draw deeper insights from the data.

Reporting and Presenting Findings

Sections in the Report

1. Introduction

  • Objective of the Analysis
  • Description of dataset used
  • Brief overview of methodology

2. Data Summary

  • Key statistics
  • Visual summary (charts, graphs)

3. Analysis Insights

  • Results from data cleaning and preparation
  • Insights from exploratory data analysis (EDA)
  • Key findings from advanced data analysis

4. Conclusion

  • Summary of key findings
  • Implications of the study
  • Suggestions for further research

Report Template

Introduction

# Report on [Project Title]

## Introduction
The objective of this analysis is to [state objective]. The dataset used is [brief description]. The methodology followed includes [brief overview].

Table of Contents:
1. Data Summary
2. Analysis Insights
3. Conclusion

Data Summary

## Data Summary

### Key Statistics
- Number of observations: [number]
- Number of variables: [number]
- Mean: [mean of key variable]
- Median: [median of key variable]
- Standard Deviation: [std dev of key variable]

### Visual Summary
Insert line charts, bar charts, histograms here.
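
The Key Statistics placeholders can be filled directly from the cleaned dataset; a minimal sketch, assuming the cleaned_data.csv saved earlier and a hypothetical key variable column named 'value':

import pandas as pd

df = pd.read_csv("cleaned_data.csv")
key_var = "value"  # hypothetical key variable; replace with your own column

print(f"- Number of observations: {len(df)}")
print(f"- Number of variables: {df.shape[1]}")
print(f"- Mean: {df[key_var].mean():.2f}")
print(f"- Median: {df[key_var].median():.2f}")
print(f"- Standard Deviation: {df[key_var].std():.2f}")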

Analysis Insights

## Analysis Insights

### Data Cleaning and Preparation
List key steps taken:
- Removed missing values
- Normalized data
- Feature engineering

### EDA Insights
Key findings from initial data analysis:
- Trend 1: [description]
- Trend 2: [description]

Visual representation of trends:
Insert scatter plots, pie charts, etc.

### Advanced Analysis Results
Advanced insights:
- Model 1: [accuracy, precision, recall]
- Model 2: [accuracy, precision, recall]

Comparison of models:
Include comparative tables or charts.
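
If your analysis includes supervised models, the metric placeholders and the comparison table can be assembled with scikit-learn and pandas; a minimal sketch, assuming hypothetical y_test labels and per-model predictions produced elsewhere:

import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

# y_test, y_pred_model1, and y_pred_model2 are hypothetical outputs of models trained elsewhere
results = {}
for name, y_pred in {"Model 1": y_pred_model1, "Model 2": y_pred_model2}.items():
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average="weighted"),
        "recall": recall_score(y_test, y_pred, average="weighted"),
    }

comparison = pd.DataFrame(results).T  # one row per model
print(comparison)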

Conclusion

## Conclusion

### Summary of Key Findings
- Insight 1: [summary]
- Insight 2: [summary]

### Implications
- Implication 1: [impact]

### Suggestions for Further Research
- Suggestion 1: [future work]
- Suggestion 2: [improvements]

_End of Report_

Following this structured approach will ensure that findings are reported clearly and conclusions are easy to understand.
