In this blog we will walk through a detailed example analyzing retail sales data using Python code.
You can work through this project and showcase your results here – Data Science Workout 4
1. Introduction
In this section, we will load and understand a retail sales dataset. This involves setting up the Python environment, loading the dataset, and performing initial exploratory data analysis (EDA).
2. Setup Instructions
Before we start loading the dataset, make sure you have the necessary Python libraries installed. You can install them using pip if they are not already available.
pip install pandas matplotlib seaborn
3. Loading the Dataset
We’ll use the pandas
library to load the dataset. Assume the dataset is a CSV file named retail_sales.csv
.
You can download a demo dataset here for this particularly project. Please note – the file name will likely be different to the above.
import pandas as pd
# Loading the dataset
file_path = 'retail_sales.csv'
sales_data = pd.read_csv(file_path)
# Display the first few rows of the dataset
print(sales_data.head())
4. Understanding the Dataset
4.1 Basic Information
Let’s understand the structure of the dataset by checking its basic information such as data types, missing values, and basic statistics.
# Checking the structure of the dataset
print(sales_data.info())
# Summary statistics
print(sales_data.describe())
4.2 Checking for Missing Values
Identify columns with missing values to understand the extent of missing data.
# Checking for missing values
missing_values = sales_data.isnull().sum()
print(missing_values)
4.3 Exploratory Data Analysis (EDA)
4.3.1 Distribution of Sales
Visualize the distribution of sales using histograms and box plots.
import matplotlib.pyplot as plt
import seaborn as sns
# Histogram of sales data
plt.figure(figsize=(10, 6))
sns.histplot(sales_data['sales_amount'], kde=True)
plt.title('distribution of sales')
plt.xlabel('Sales Amount')
plt.ylabel('Frequency')
plt.show()
# Box plot of sales data
plt.figure(figsize=(10, 6))
sns.boxplot(x=sales_data['sales_amount'])
plt.title('Box plot of Sales Amount')
plt.xlabel('Sales Amount')
plt.show()
4.3.2 Time Series Analysis
If the dataset includes a time component, we can analyze sales over time.
# Convert the date column to datetime if it's not already
sales_data['date'] = pd.to_datetime(sales_data['date'])
# Set the date column as the index
sales_data.set_index('date', inplace=True)
# Plotting sales over time
plt.figure(figsize=(14, 8))
sales_data['sales_amount'].plot()
plt.title('Sales Over Time')
plt.xlabel('Date')
plt.ylabel('Sales Amount')
plt.show()
Conclusion
This concludes the first part of our project – loading and understanding the retail sales data. In the next steps, we will proceed with data cleaning, feature engineering, and forecasting sales using advanced data science techniques.
Exploratory Data Analysis for Key Trends and Patterns
Here is a practical implementation of an exploratory data analysis (EDA) to identify key trends and patterns using Python. We will use pandas
, numpy
, matplotlib
, and seaborn
libraries.
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
2. Load the Data (Assuming this is done in the previous steps)
# data = pd.read_csv('retail_sales_data.csv')
# Ensure the above line is already handled in the earlier steps
# Here, I'll just proceed assuming `data` is your DataFrame
3. Basic Data Cleaning
# Handling missing values
data.dropna(inplace=True)
# Convert date column to datetime if not already done
data['Date'] = pd.to_datetime(data['Date'])
4. Descriptive Statistics
# General statistics
print(data.describe())
# Checking the data types
print(data.info())
5. Univariate Analysis
Histograms
# Plot histograms for numerical variables
data.hist(bins=30, figsize=(15, 10))
plt.tight_layout()
plt.show()
Box Plots
# Plot boxplots to detect outliers in numerical columns
plt.figure(figsize=(10, 6))
sns.boxplot(data=data.select_dtypes(include=np.number))
plt.xticks(rotation=90)
plt.show()
6. Bivariate Analysis
Scatter Plots
# Scatter plot to see the relationship between two numerical variables
plt.figure(figsize=(10, 6))
plt.scatter(data['Sales'], data['Profit'])
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.title('Sales vs Profit')
plt.show()
Correlation Matrix
# Correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
7. Time Series Analysis
# Setting the date column as the index
data.set_index('Date', inplace=True)
# Resampling the data to monthly frequency
monthly_data = data.resample('M').sum()
# Plotting time series
plt.figure(figsize=(12, 6))
plt.plot(monthly_data['Sales'], label='Monthly Sales')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.title('Monthly Sales Over Time')
plt.legend()
plt.show()
8. Seasonality and Trend
Decomposition
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(monthly_data['Sales'], model='additive')
result.plot()
plt.show()
9. Segment Analysis
Sales by Category
# Assuming there is a 'Category' column
plt.figure(figsize=(12, 6))
category_sales = data.groupby('Category')['Sales'].sum().sort_values()
category_sales.plot(kind='bar')
plt.xlabel('Category')
plt.ylabel('Sales')
plt.title('Total Sales by Category')
plt.show()
Sales by Region
# Assuming there is a 'Region' column
plt.figure(figsize=(12, 6))
region_sales = data.groupby('Region')['Sales'].sum().sort_values()
region_sales.plot(kind='bar')
plt.xlabel('Region')
plt.ylabel('Sales')
plt.title('Total Sales by Region')
plt.show()
10. Key Insights Summary
At the conclusion of your EDA, print or log a summary of key insights:
print("Key Insights:")
print(f"Total Sales: {data['Sales'].sum()}")
print(f"Total Profit: {data['Profit'].sum()}")
print(f"Top Category by Sales: {category_sales.idxmax()} with {category_sales.max()} sales")
print(f"Top Region by Sales: {region_sales.idxmax()} with {region_sales.max()} sales")
This code summarizes and visualizes the key trends and patterns of your retail sales data, arming you with critical insights for further analysis or forecasting tasks.
Feature Engineering for Sales Forecasting
In this part of the project, we will perform feature engineering to create meaningful features that can improve the accuracy of our sales forecasting model.
Step 1: Date Features
Extract useful features from the date column in the dataset such as year, month, day of the week, and whether it is a weekend or holiday.
# Assuming the DataFrame is named df and the date column is 'date'
# Import necessary libraries
import pandas as pd
import numpy as np
# Ensure the 'date' column is in datetime format
df['date'] = pd.to_datetime(df['date'])
# Create new date-related features
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek
df['is_weekend'] = df['dayofweek'].apply(lambda x: 1 if x >= 5 else 0)
# Function to check if a date is a holiday
# (You may need to define your own holidays as per your country's calendar)
def is_holiday(date):
holidays = [
"2022-01-01", "2022-12-25", "2022-07-04", # add more holidays
]
return 1 if date.strftime('%Y-%m-%d') in holidays else 0
df['is_holiday'] = df['date'].apply(is_holiday)
Step 2: Lag Features
Create lag features to capture the temporal patterns in sales.
# Create lag features
# Assuming the sales data is recorded daily and the 'sales' column contains the sales figures
for lag in [1, 7, 14, 30]: # For daily data, common lags are 1 day, 7 days (weekly), 14 days, and 30 days (monthly)
df[f'sales_lag_{lag}'] = df['sales'].shift(lag)
# Fill NaN values with 0 (or any other appropriate value or imputation method)
df.fillna(0, inplace=True)
Step 3: Rolling Window Features
Calculate rolling statistics such as mean and standard deviation to capture trends and seasonality.
# Create rolling statistics features
for window in [7, 14, 30]: # Windows of 1 week, 2 weeks, and 1 month
df[f'sales_roll_mean_{window}'] = df['sales'].rolling(window=window).mean()
df[f'sales_roll_std_{window}'] = df['sales'].rolling(window=window).std()
# Again, fill NaN values with 0
df.fillna(0, inplace=True)
Step 4: Dealing with Categorical Variables
Convert categorical variables into numeric form (e.g. using one-hot encoding).
# Example categorical variable: 'store_type'
# Assuming the column 'store_type' exists in the DataFrame
# Convert categorical feature 'store_type' into numeric form using one-hot encoding
df = pd.get_dummies(df, columns=['store_type'])
# Check the DataFrame to ensure all changes are applied
print(df.head())
This completes the feature engineering step for your sales forecasting project. You should now have a DataFrame with enriched features that can be used to train a predictive model.
Data Splitting: Preparing Training and Test Sets
To prepare training and test sets in Python, you can use the train_test_split
function from the sklearn.model_selection
module. This function splits the data into training and test subsets, which will allow you to evaluate the performance of your predictive models.
Here’s a practical implementation:
import pandas as pd
from sklearn.model_selection import train_test_split
# Load your dataset
data = pd.read_csv('retail_sales_data.csv')
# Assume 'sales' is the target variable and the rest are features
X = data.drop(columns='sales')
y = data['sales']
# Split the data into 80% training and 20% test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Display shapes to verify splits
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
Explanation
Importing Libraries: The necessary libraries are imported first.
pandas
is used for data manipulation andtrain_test_split
for splitting the data.Loading Dataset: The dataset is loaded using
pd.read_csv
. Ensure thatretail_sales_data.csv
is in your working directory or provide the correct path to the file.Preparing Features and Target Variable: The feature set
X
is prepared by dropping the ‘sales’ column from the dataset. The target variabley
is set as the ‘sales’ column.Splitting the Data: The
train_test_split
function is used to splitX
andy
into training and test sets. Here, 80% of the data is used for training (X_train
,y_train
) and 20% for testing (X_test
,y_test
). Therandom_state
parameter is set to ensure reproducibility.Verification: The shapes of the resulting splits are printed to verify that the splitting was successful. The training set should be 80% and the test set should be 20% of the original dataset.
You can now proceed to build and evaluate your models using X_train
, y_train
for training, and X_test
, y_test
for testing.
Part 5: Introduction to Linear Regression for Sales Prediction
Objective
In this part, we will use linear regression to predict future sales based on the historical retail sales data you’ve already prepared. Linear regression models the relationship between a dependent variable (sales) and one or more independent variables (features).
Steps
- Import Necessary Libraries
- Load the Preprocessed Data
- Instantiate and Train the Linear Regression Model
- Evaluate the Model
- Predict Future Sales
# Step 1: Import Necessary Libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Step 2: Load the Preprocessed Data
# Assuming the preprocessed data is stored in 'train_data.csv' and 'test_data.csv'
train_data = pd.read_csv('train_data.csv')
test_data = pd.read_csv('test_data.csv')
# Separate features (X) and target variable (y)
X_train = train_data.drop(columns='sales')
y_train = train_data['sales']
X_test = test_data.drop(columns='sales')
y_test = test_data['sales']
# Step 3: Instantiate and Train the Linear Regression Model
model = LinearRegression()
model.fit(X_train, y_train)
# Step 4: Evaluate the Model
# Predict on training data
train_predictions = model.predict(X_train)
# Predict on test data
test_predictions = model.predict(X_test)
# Calculate Mean Squared Error and R^2 Score for training data
train_mse = mean_squared_error(y_train, train_predictions)
train_r2 = r2_score(y_train, train_predictions)
# Calculate Mean Squared Error and R^2 Score for test data
test_mse = mean_squared_error(y_test, test_predictions)
test_r2 = r2_score(y_test, test_predictions)
print(f'Training Data: MSE = {train_mse}, R^2 = {train_r2}')
print(f'Test Data: MSE = {test_mse}, R^2 = {test_r2}')
# Step 5: Predict Future Sales
# Assuming 'future_data.csv' contains the future dates and their corresponding features
future_data = pd.read_csv('future_data.csv')
future_sales_predictions = model.predict(future_data)
# Save the predictions into a CSV file
future_data['Predicted_Sales'] = future_sales_predictions
future_data.to_csv('future_sales_predictions.csv', index=False)
print("Future sales predictions have been saved to 'future_sales_predictions.csv'")
Explanation
- Importing Libraries: The necessary libraries for data handling, implementation of linear regression, and error metrics are imported.
- Loading Data: Preprocessed training and test data are loaded. These data should already have features engineered and are split into training and testing sets.
- Model Training: A linear regression model is instantiated and fitted using the training data.
- Evaluation: The model’s performance is evaluated on both the training and test data using Mean Squared Error (MSE) and R² Score.
- Prediction: The model predicts future sales based on new input data, and the predictions are saved to a CSV file.
This implementation allows you to apply linear regression to predict sales based on historical data, evaluate the performance of the model, and use it to make future sales predictions suitable for real-life applications.
Evaluating Model Performance with MAE and RMSE
In this section, we’ll compute two key metrics, Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), to evaluate the performance of the linear regression model you trained for retail sales forecasting.
Mean Absolute Error (MAE)
MAE represents the average of absolute differences between predictions and actual observations. It gives an idea of how wrong the predictions were, without considering their direction.
Root Mean Squared Error (RMSE)
RMSE is the square root of the average of squared differences between predictions and actual observations. It penalizes larger errors more significantly due to the squaring part.
Below is the complete implementation to calculate MAE and RMSE using Python:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Assuming y_test is the actual test set values and y_pred is the predicted values from the model
y_test = np.array([actual values]) # replace with actual test set values
y_pred = np.array([predicted values]) # replace with predicted values from your model
# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae}")
# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Root Mean Squared Error (RMSE): {rmse}")
Explanation
Import Required Libraries: We need
numpy
for numerical operations andmean_absolute_error
,mean_squared_error
fromsklearn.metrics
to compute MAE and RMSE.Prepare Data:
y_test
holds the actual values of the test set, andy_pred
holds the predicted values from your linear regression model.Calculate MAE:
mean_absolute_error(y_test, y_pred)
computes the MAE by averaging the absolute differences between actual values and predicted values.Calculate RMSE:
mean_squared_error(y_test, y_pred)
first computes the MSE by averaging the squared differences, thennp.sqrt
computes the square root of MSE to obtain RMSE.
Apply in Real Life
Replace the placeholder arrays for y_test
and y_pred
with your actual test dataset and predicted values from your sales forecasting model. The printed MAE and RMSE will give you concrete numbers to evaluate how well your model performs.
# Example with real data
y_test = np.array([100, 150, 200, 250, 300])
y_pred = np.array([110, 145, 210, 240, 310])
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error (MAE): {mae}")
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Root Mean Squared Error (RMSE): {rmse}")
By running this fragment with your specific data, you can effectively gauge the prediction accuracy and performance of your retail sales forecasting model.
Visualizing Actual vs. Predicted Sales
To visualize the actual vs. predicted sales, we will use the matplotlib
library to produce the comparison plot. This visualization helps in understanding how well our model is performing in predicting sales compared to actual sales data.
Here is the complete implementation:
Import Necessary Libraries: Ensure you have the required libraries installed and imported.
Load the Data: Make sure your dataset including actual and predicted sales is ready.
Create Visualization: Use
matplotlib
to plot the actual vs. predicted sales.
Implementation
import matplotlib.pyplot as plt
import pandas as pd
# Assuming you have your sales predictions and actual sales in a DataFrame:
# sales_data should contain columns like 'date', 'actual_sales', 'predicted_sales'
# Example DataFrame creation
data = {
'date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05'],
'actual_sales': [100, 150, 200, 250, 300],
'predicted_sales': [110, 160, 195, 240, 310]
}
sales_data = pd.DataFrame(data)
# Convert 'date' column to datetime if not already in that format
sales_data['date'] = pd.to_datetime(sales_data['date'])
# Plotting the Data
plt.figure(figsize=(10, 6))
plt.plot(sales_data['date'], sales_data['actual_sales'], marker='o', linestyle='-', color='b', label='Actual Sales')
plt.plot(sales_data['date'], sales_data['predicted_sales'], marker='x', linestyle='--', color='r', label='Predicted Sales')
# Add titles and labels
plt.title('Actual vs. Predicted Sales')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.grid(True)
# Show the plot
plt.show()
Explanation
- Loading the Libraries: Matplotlib is used for plotting, and Pandas is used to manage the data.
- Creating Dummy Data: For demonstration purposes, a sample DataFrame
sales_data
is created. In practice, you should replace this with your actual dataset. - Data Preparation: Ensure the date column is in datetime format.
- Plotting Actual vs. Predicted Sales:
- The actual sales are plotted with a solid line and round markers.
- The predicted sales are plotted with a dashed line and ‘x’ markers.
- Labels and Grid: Titles, labels, and grid are added for better readability.
This code will generate a plot comparing the actual and predicted sales values over time, allowing for a straightforward visual analysis of your model’s performance.
Insights and Reporting
Once we have built our sales prediction model and evaluated its performance, it is crucial to generate insights and report them effectively. Below is a step-by-step guide to create insights and reporting using Python.
Step 1: Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np
Step 2: Load Data
Assume sales_test.csv
contains the test set with actual sales and predicted sales columns.
df = pd.read_csv("sales_test.csv")
Step 3: Calculate Performance Metrics
# Calculating Mean Absolute Error (MAE)
mae = mean_absolute_error(df['actual_sales'], df['predicted_sales'])
# Calculating Root Mean Squared Error (RMSE)
rmse = np.sqrt(mean_squared_error(df['actual_sales'], df['predicted_sales']))
Step 4: Generate Summary Statistics
summary_statistics = df.describe()
# Calculating additional statistics if necessary
mean_actual_sales = df['actual_sales'].mean()
mean_predicted_sales = df['predicted_sales'].mean()
# Computing summary insights
summary_insights = {
"MAE": mae,
"RMSE": rmse,
"Mean Actual Sales": mean_actual_sales,
"Mean Predicted Sales": mean_predicted_sales
}
for key, value in summary_insights.items():
print(f"{key}: {value:.2f}")
Step 5: Visualize the Results
Plot 1: Actual vs. Predicted Sales
plt.figure(figsize=(10, 6))
sns.lineplot(x=df.index, y=df['actual_sales'], label='Actual Sales')
sns.lineplot(x=df.index, y=df['predicted_sales'], label='Predicted Sales')
plt.title('Actual vs Predicted Sales')
plt.xlabel('Time')
plt.ylabel('Sales')
plt.legend()
plt.show()
Plot 2: Error Distribution
df['error'] = df['actual_sales'] - df['predicted_sales']
plt.figure(figsize=(10, 6))
sns.histplot(df['error'], kde=True)
plt.title('Error Distribution (Actual - Predicted)')
plt.xlabel('Error')
plt.ylabel('Frequency')
plt.show()
Step 6: Generate a Report
Below is the code to generate a simple text-based report about the metrics and key insights.
report = f"""
Sales Prediction Model Report
Performance Metrics:
---------------------
Mean Absolute Error (MAE): {mae:.2f}
Root Mean Squared Error (RMSE): {rmse:.2f}
Summary Statistics:
--------------------
Mean Actual Sales: {mean_actual_sales:.2f}
Mean Predicted Sales: {mean_predicted_sales:.2f}
Error Analysis:
---------------
The error distribution suggests that the model is [describe the bias or any anomalies if applicable].
Visualization:
--------------
1. The line plot of 'Actual vs Predicted Sales' demonstrates the model's effectiveness in following the sales trend.
2. The histogram of 'Error Distribution' helps in understanding the discrepancy between actual and predicted values.
"""
print(report)
Conclusion
Once you have this code integrated, you will not only be able to gather critical insights but also provide a comprehensive report that can be used for further analysis and decision-making. This implementation should fit seamlessly into your project, providing the necessary insights and reporting capabilities for your retail sales forecast.