Project: Comprehensive Business Intelligence – Unit 1: Data Import and Cleaning
Introduction
This section will cover the first part of our project – importing and cleaning the retail sales data. We will use Python and Google Colab. By the end, our data should be ready for analysis and visualization.
Setup Instructions
- Sign in to your Google account and open Google Colab.
- Create a new notebook.
Data Import and Cleaning
Step 1: Import Libraries
import pandas as pd
import numpy as np
Step 2: Mount Google Drive
Make sure your data file is uploaded to your Google Drive.
from google.colab import drive
drive.mount('/content/drive')
Step 3: Load the Dataset
Replace 'path_to_your_file.csv'
with the actual path to your CSV file in your Google Drive.
file_path = '/content/drive/My Drive/path_to_your_file.csv'
data = pd.read_csv(file_path)
Step 4: Inspect the Dataset
# Display the first few rows
data.head()
# Display data types of each column
data.info()
# Get summary statistics
data.describe()
Step 5: Handle Missing Values
# Check for missing values
missing_values = data.isnull().sum()
print(missing_values)
# Drop rows with missing values (if necessary)
data_cleaned = data.dropna()
# Alternatively, fill missing values
# data_cleaned = data.fillna({
# 'column1': 'default_value1',
# 'column2': 'default_value2'
# })
Step 6: Remove Duplicates
# Check for duplicates
duplicate_rows = data_cleaned[data_cleaned.duplicated()]
print(duplicate_rows)
# Drop duplicates
data_cleaned = data_cleaned.drop_duplicates()
Step 7: Data Type Conversion
Convert columns to appropriate data types if necessary.
data_cleaned['date_column'] = pd.to_datetime(data_cleaned['date_column'])
data_cleaned['numeric_column'] = pd.to_numeric(data_cleaned['numeric_column'])
Step 8: Feature Engineering
Create additional features if needed.
# Example: Extract year and month from date
data_cleaned['year'] = data_cleaned['date_column'].dt.year
data_cleaned['month'] = data_cleaned['date_column'].dt.month
Conclusion
By this point, you should have a cleaned dataset ready for analysis and visualization. This sets the foundation for subsequent steps in our comprehensive business intelligence project.
Data Exploration and Preliminary Insights
Once data import and cleaning are completed, the next step is to explore the dataset and derive some preliminary insights. Below is a practical implementation in Python:
# Import required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Assume df is your cleaned DataFrame
# 1. Basic DataFrame Information
print("DataFrame Information:")
print(df.info())
print("nSummary Statistics:")
print(df.describe())
# 2. Missing Values
print("nMissing values per column:")
print(df.isnull().sum())
# 3. Distribution of Numerical Features
numerical_features = df.select_dtypes(include=['float64', 'int64']).columns
print("nNumerical Features Distribution:")
df[numerical_features].hist(figsize=(14, 10))
plt.show()
# 4. Distribution of Categorical Features
categorical_features = df.select_dtypes(include=['object']).columns
print("nCategorical Features Distribution:")
for feature in categorical_features:
plt.figure(figsize=(10, 5))
sns.countplot(data=df, x=feature)
plt.title(f'Distribution of {feature}')
plt.show()
# 5. Correlation Analysis
print("nCorrelation Matrix:")
correlation_matrix = df.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
# 6. Sales Trend Analysis
# Example: Assuming you have 'Date' and 'Sales' columns
df['Date'] = pd.to_datetime(df['Date'])
sales_trend = df.groupby(df['Date'].dt.to_period('M')).sum()['Sales']
print("nSales Trend Over Time:")
sales_trend.plot(figsize=(14, 7))
plt.title('Sales Trend Over Time')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.show()
# 7. Top Products
# Example: Assuming you have 'Product' and 'Sales' columns
top_products = df.groupby('Product').sum()['Sales'].sort_values(ascending=False).head(10)
print("nTop 10 Products by Sales:")
print(top_products)
plt.figure(figsize=(12, 6))
sns.barplot(x=top_products.index, y=top_products.values)
plt.title('Top 10 Products by Sales')
plt.xlabel('Product')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.show()
# 8. Customer Segmentation
# Example: Assuming you have 'CustomerID' and 'Sales' columns
customer_sales = df.groupby('CustomerID').sum()['Sales']
print("nCustomer Sales Distribution:")
plt.figure(figsize=(12, 6))
sns.histplot(customer_sales, bins=30, kde=True)
plt.title('Customer Sales Distribution')
plt.xlabel('Total Sales')
plt.ylabel('Number of Customers')
plt.show()
Ensure the DataFrame df
is pre-loaded in your Google Colab environment, based on your previous steps (Data Import and Cleaning). The code will provide:
- Overview of the DataFrame.
- Detailed summary statistics.
- Visualization of missing data.
- Histogram distributions for numerical features.
- Count plots for categorical features.
- Correlation heatmap.
- Time series analysis of sales.
- Visualization of top products by sales.
- Distribution of customer sales.
Customize column names as per your dataset if they differ. This implementation can be applied directly to explore and derive initial insights from the cleaned retail sales data.
Analyzing Sales Trends Over Time
Loading Libraries and Data
We are focusing on time-series analysis to evaluate sales trends.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Assuming the data has been cleaned and imported already, stored in df
# Make sure the 'date' column is in datetime format:
df['date'] = pd.to_datetime(df['date'])
Resampling Data for Time Series Analysis
Aggregating Monthly Sales
# Assuming 'sales' is the column name for sales amounts
monthly_sales = df.resample('M', on='date').sum()
Visualizing Sales Trends
Line Plot of Monthly Sales
plt.figure(figsize=(12, 6))
plt.plot(monthly_sales.index, monthly_sales['sales'], marker='o', linestyle='-')
plt.title('Monthly Sales Trend')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.grid(True)
plt.show()
Visualizing Yearly Sales Trends
# Extracting year and month for grouping
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
# Grouping by year and month for heatmap
year_month_sales = df.pivot_table(values='sales', index='year', columns='month', aggfunc='sum')
plt.figure(figsize=(12, 8))
sns.heatmap(year_month_sales, annot=True, fmt=".1f", linewidths=.5, cmap="YlGnBu")
plt.title('Monthly Sales Heatmap Over Years')
plt.xlabel('Month')
plt.ylabel('Year')
plt.show()
Rolling Mean for Trend Analysis
Calculating and Plotting Rolling Mean
# Set rolling window by months, e.g., 3-month window
df.set_index('date', inplace=True)
rolling_sales = df['sales'].rolling(window=3).mean()
plt.figure(figsize=(12, 6))
plt.plot(df.index, df['sales'], marker='.', linestyle='-', label='Original Sales')
plt.plot(rolling_sales.index, rolling_sales, marker='o', color='red', linestyle='-', label='3-Month Rolling Mean')
plt.title('Sales Trend with 3-Month Rolling Mean')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.grid(True)
plt.show()
Seasonal Decomposition
Using Statsmodels for Seasonal Decomposition
import statsmodels.api as sm
# Decompose the time series using seasonal decomposition
decomposition = sm.tsa.seasonal_decompose(df['sales'], model='additive')
# Plotting the decomposition
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(15, 12))
decomposition.observed.plot(ax=ax1, title='Observed')
decomposition.trend.plot(ax=ax2, title='Trend')
decomposition.seasonal.plot(ax=ax3, title='Seasonal')
decomposition.resid.plot(ax=ax4, title='Residual')
plt.tight_layout()
plt.show()
This implementation covers the analysis of sales trends over time and provides visual insights using line plots, heatmaps, rolling means, and seasonal decomposition. The code should be directly executable in Google Colab, given that the data is pre-cleaned and properly formatted.
Category-Level Sales Analysis in Python
Here’s a step-by-step practical implementation of category-level sales analysis using Python, assuming you’re using a Jupyter environment such as Google Colab.
Import Necessary Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Load Cleaned Data
Here, it is assumed that you have a cleaned DataFrame named sales_data
loaded from your previous steps.
Group Sales Data by Category
category_sales = sales_data.groupby('Category')['Sales'].sum().reset_index()
Plot Sales by Category
plt.figure(figsize=(12, 6))
sns.barplot(x='Category', y='Sales', data=category_sales, palette='viridis')
plt.title('Total Sales by Category')
plt.xlabel('Category')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.show()
Analyze Category with Highest and Lowest Sales
highest_sales_category = category_sales.loc[category_sales['Sales'].idxmax()]
lowest_sales_category = category_sales.loc[category_sales['Sales'].idxmin()]
print(f"The category with the highest sales is: {highest_sales_category['Category']} with total sales of {highest_sales_category['Sales']:.2f}")
print(f"The category with the lowest sales is: {lowest_sales_category['Category']} with total sales of {lowest_sales_category['Sales']:.2f}")
Sales Distribution by Category Over Time
plt.figure(figsize=(14, 7))
category_sales_over_time = sales_data.groupby(['Category', 'Order Date'])['Sales'].sum().unstack('Category').fillna(0)
category_sales_over_time.plot(ax=plt.gca(), linewidth=2)
plt.title('Sales Distribution by Category Over Time')
plt.xlabel('Order Date')
plt.ylabel('Sales')
plt.legend(title='Category', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
Analysis of Average Sales per Transaction by Category
average_sales_by_category = sales_data.groupby('Category')['Sales'].mean().reset_index()
plt.figure(figsize=(12, 6))
sns.barplot(x='Category', y='Sales', data=average_sales_by_category, palette='viridis')
plt.title('Average Sales per Transaction by Category')
plt.xlabel('Category')
plt.ylabel('Average Sales')
plt.xticks(rotation=45)
plt.show()
print("Average Sales per Transaction by Category:")
print(average_sales_by_category)
Summary
With this implementation, you’ve performed an analysis on the sales data at the category level, visualized the data, and identified key insights such as highest and lowest sales categories, sales distribution over time, and average sales per transaction for each category. This provides a comprehensive understanding of how different product categories are performing in terms of sales.
Apply this implementation in your comprehensive BI project to enhance your insights on retail sales data.
Regional Sales Analysis
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Assuming `df_sales` is the DataFrame containing cleaned sales data
# and the DataFrame has columns: 'Region', 'Sales', 'Date', 'Category', etc.
# Aggregate sales data by region
region_sales = df_sales.groupby('Region')['Sales'].sum().reset_index()
# Sorting the sales data by region for better visualization
region_sales = region_sales.sort_values(by='Sales', ascending=False)
# Visualization of sales data by region using a bar plot
plt.figure(figsize=(12, 6))
sns.barplot(x='Region', y='Sales', data=region_sales, palette='viridis')
plt.title('Total Sales by Region')
plt.xlabel('Region')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.show()
# Further analysis: Sales Trends by Region Over Time
# Aggregating sales data by region and date
region_date_sales = df_sales.groupby(['Region', 'Date'])['Sales'].sum().reset_index()
# Creating a line plot to visualize sales trends for each region over time
plt.figure(figsize=(14, 8))
sns.lineplot(x='Date', y='Sales', hue='Region', data=region_date_sales, palette='tab10')
plt.title('Sales Trends by Region Over Time')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend(title='Region', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.xticks(rotation=45)
plt.show()
# More granular insights: Regional Sales by Category
region_category_sales = df_sales.groupby(['Region', 'Category'])['Sales'].sum().reset_index()
# Pivoting data to have a better view for plotting
pivot_data = region_category_sales.pivot(index='Region', columns='Category', values='Sales')
# Heatmap for visualizing sales distribution across categories in different regions
plt.figure(figsize=(14, 8))
sns.heatmap(pivot_data, annot=True, fmt=".1f", cmap='YlGnBu', linewidths=.5)
plt.title('Sales by Category and Region')
plt.xlabel('Category')
plt.ylabel('Region')
plt.show()
This implementation covers:
- Aggregating and visualizing total sales by region.
- Analyzing sales trends over time per region.
- Comparing regional sales by category using a heatmap for more nuanced insights.
This approach provides clear, actionable insights into regional performance and potential areas for business improvement.
6. Customer Segmentation
Introduction
Customer segmentation involves dividing customers into distinct groups based on specific characteristics. For this section, we’ll use clustering algorithms from the sklearn
library to segment customers based on their purchasing behavior. Specifically, we will use the K-Means clustering algorithm.
Implementation
Import Necessary Libraries
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import matplotlib.pyplot as plt
Load Data
Assuming the cleaned data is stored in a DataFrame called df
. The DataFrame should have columns: CustomerID
, InvoiceNo
, InvoiceDate
, Quantity
, and UnitPrice
.
# Sample structure of DataFrame, adjust column names if different
df = pd.read_csv('cleaned_retail_sales_data.csv')
# Creating monetary value column
df['TotalSpent'] = df['Quantity'] * df['UnitPrice']
# Remove any negative values or returns (if not already done)
df = df[df['TotalSpent'] > 0]
RFM (Recency, Frequency, Monetary)
Calculate Recency
# Assuming the most recent transaction date is the cutoff date
import datetime as dt
cutoff_date = df['InvoiceDate'].max() + dt.timedelta(1)
recency_df = df.groupby('CustomerID').InvoiceDate.max().reset_index()
recency_df.columns = ['CustomerID', 'LastPurchaseDate']
recency_df['Recency'] = (cutoff_date - recency_df['LastPurchaseDate']).dt.days
Calculate Frequency
frequency_df = df.groupby('CustomerID').InvoiceNo.nunique().reset_index()
frequency_df.columns = ['CustomerID', 'Frequency']
Calculate Monetary
monetary_df = df.groupby('CustomerID').TotalSpent.sum().reset_index()
monetary_df.columns = ['CustomerID', 'Monetary']
Combine RFM
rfm_df = recency_df.merge(frequency_df, on='CustomerID').merge(monetary_df, on='CustomerID')
rfm_df.set_index('CustomerID', inplace=True)
Standardize the Values
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm_df)
K-Means Clustering
# Determine optimal number of clusters using Elbow Method
sse = []
for k in range(1, 11):
kmeans = KMeans(n_clusters=k)
kmeans.fit(rfm_scaled)
sse.append(kmeans.inertia_)
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), sse, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.title('Elbow Method For Optimal k')
plt.show()
# Apply KMeans with chosen number of clusters (example uses k=4)
kmeans = KMeans(n_clusters=4)
rfm_df['Cluster'] = kmeans.fit_predict(rfm_scaled)
Analyze Cluster Profiles
cluster_profiles = rfm_df.groupby('Cluster').mean()
print(cluster_profiles)
# Visualize the clusters
sns.pairplot(rfm_df, hue='Cluster', palette='tab10')
plt.show()
Conclusion
With the above steps, we successfully segmented customers based on their purchasing behavior. These segments can further be analyzed for targeted marketing strategies or personalized customer service.
Sales Forecasting with Time Series
Below is the practical implementation of Sales Forecasting with Time Series in Python. We use the statsmodels
library to build an ARIMA model for the sales forecasting.
Step 1: Import Necessary Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_absolute_error
import warnings
# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')
Step 2: Prepare Time Series Data
Assumption: Your cleaned sales data is contained in a DataFrame named sales_data
with columns date
and sales
.
# Ensure the date column is in datetime format
sales_data['date'] = pd.to_datetime(sales_data['date'])
# Set the date column as the index
sales_data.set_index('date', inplace=True)
# Resample the data by day, month, etc. Modify as per your requirement. Example: Monthly resampling
monthly_sales = sales_data.resample('M').sum()
Step 3: Split Data into Training and Testing Set
# Define the split point
train_size = int(len(monthly_sales) * 0.8)
# Split data
train, test = monthly_sales[:train_size], monthly_sales[train_size:]
Step 4: Build and Train the ARIMA Model
# Define the ARIMA model
model = SARIMAX(train['sales'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
# Fit the model
model_fit = model.fit(disp=False)
Step 5: Make Forecast
# Forecast the sales
forecast = model_fit.forecast(steps=len(test))
# Convert forecast to DataFrame for easy comparison
forecast = pd.DataFrame(forecast, index=test.index, columns=['forecast'])
Step 6: Evaluate the Model
# Calculate Mean Absolute Error
mae = mean_absolute_error(test['sales'], forecast['forecast'])
print(f'Mean Absolute Error: {mae}')
Step 7: Visualize the Forecast
# Plot the results
plt.figure(figsize=(12, 6))
plt.plot(train.index, train['sales'], label='Train')
plt.plot(test.index, test['sales'], label='Test')
plt.plot(forecast.index, forecast['forecast'], label='Forecast', linestyle='dashed')
plt.title('Sales Forecast')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.show()
Conclusion
Now you have implemented the sales forecasting using a time series approach in Python. This can be integrated into your existing project in Google Colab as part of the comprehensive business intelligence project.
Visualizing Insights Using Dashboards
In this section, we’ll create interactive dashboards to visualize our insights. We’ll utilize the Plotly and Dash libraries in Python, which are well-suited for creating interactive visualizations. The following code provides a practical implementation for setting up a dashboard to visualize retail sales data.
Step 1: Install Required Libraries
Ensure you have Plotly and Dash installed in your Google Colab environment:
!pip install plotly dash
Step 2: Import Necessary Libraries
import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import plotly.express as px
import pandas as pd
# Assuming the cleaned and analyzed data is already stored in a DataFrame named 'df'
# df = ... (Your processed DataFrame)
Step 3: Setup the Dash Application
app = dash.Dash(__name__)
Step 4: Define Layout of the Dashboard
app.layout = html.Div([
html.H1("Retail Sales Dashboard"),
# Dropdown for selecting the type of analysis
html.Label("Choose Analysis"),
dcc.Dropdown(
id='analysis-type',
options=[
{'label': 'Sales Trends Over Time', 'value': 'time'},
{'label': 'Category-Level Sales', 'value': 'category'},
{'label': 'Regional Sales', 'value': 'region'},
{'label': 'Customer Segmentation', 'value': 'customer_seg'}
],
value='time' # Default value
),
# Graph to display the selected analysis
dcc.Graph(id='main-graph')
])
Step 5: Define Callback to Update Graph Based on Dropdown Selection
@app.callback(
Output('main-graph', 'figure'),
Input('analysis-type', 'value')
)
def update_graph(analysis_type):
if analysis_type == 'time':
fig = px.line(df, x='date', y='sales', title='Sales Trends Over Time')
elif analysis_type == 'category':
fig = px.bar(df, x='category', y='sales', title='Category-Level Sales')
elif analysis_type == 'region':
fig = px.choropleth(df, locations='region', locationmode='country names',
color='sales', title='Regional Sales',
hover_name='region', color_continuous_scale=px.colors.sequential.Plasma)
elif analysis_type == 'customer_seg':
fig = px.pie(df, names='customer_segment', values='sales', title='Customer Segmentation')
return fig
Step 6: Run the Dash Application
if __name__ == '__main__':
app.run_server(debug=True, use_reloader=False)
Summary
Copy and paste the code segments into your Google Colab. This implementation provides an interactive dashboard for visualizing various aspects of the retail sales data. Modify the DataFrame variable df
to use your processed data. Run the Dash application to visualize insights interactively on your browser.