Introduction to Sales Forecasting and Data Science
Overview
This project aims to predict future sales for a retail chain using the Random Forest algorithm implemented in R. The resulting forecasts will support informed supply chain and inventory management decisions.
Setup Instructions
Step 1: Install Required Packages
Before proceeding, ensure that you have R and the necessary packages installed. Run the following commands to install the required packages:
install.packages("tidyverse")
install.packages("randomForest")
install.packages("caret")
install.packages("lubridate")
Step 2: Load Libraries
Load the libraries to be used in this project:
library(tidyverse)
library(randomForest)
library(caret)
library(lubridate)
Data Preparation
Step 1: Load Data
Load the sales data into R for processing. Replace file_path with the actual path to the dataset.
sales_data <- read_csv("file_path/sales_data.csv")
Step 2: Data Cleaning
Clean and preprocess the data for analysis. This typically involves handling missing values, encoding categorical variables, and converting data types.
# Handle missing values
sales_data <- na.omit(sales_data)
# Convert date to Date type
sales_data$date <- ymd(sales_data$date)
# Encode categorical variables if necessary
sales_data$store_id <- as.factor(sales_data$store_id)
sales_data$product_id <- as.factor(sales_data$product_id)
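As a quick sanity check after cleaning (a sketch assuming the column names used above), confirm that no missing values remain and that the type conversions took effect:
# Verify cleaning results: no NAs remain, and types are as expected
stopifnot(sum(is.na(sales_data)) == 0)
str(sales_data) # date should be Date; store_id and product_id should be factors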
Step 3: Feature Engineering
Create additional features that may help the model, for example the day of the week, month, and year extracted from the date.
sales_data <- sales_data %>%
mutate(day_of_week = wday(date, label = TRUE),
month = month(date),
year = year(date))
Model Implementation
Step 1: Data Splitting
Split the data into training and testing sets.
set.seed(123) # For reproducibility
train_index <- createDataPartition(sales_data$sales, p = 0.8, list = FALSE)
train_data <- sales_data[train_index,]
test_data <- sales_data[-train_index,]
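A quick check that the partition matches the intended 80/20 split:
# Verify the split proportions (createDataPartition may round slightly)
nrow(train_data) / nrow(sales_data) # approximately 0.8
nrow(test_data) / nrow(sales_data)  # approximately 0.2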
Step 2: Model Training
Train the Random Forest model on the training data.
set.seed(123) # For reproducibility
# Note: randomForest supports factor predictors with at most 53 levels;
# a high-cardinality ID like product_id may need grouping or encoding first
rf_model <- randomForest(sales ~ store_id + product_id + day_of_week + month + year,
data = train_data,
ntree = 100)
Step 3: Model Evaluation
Evaluate the model on the test data.
predictions <- predict(rf_model, test_data)
# Sales is a continuous outcome, so evaluate with regression error metrics
# rather than a confusion matrix (which applies to classification)
rmse <- sqrt(mean((predictions - test_data$sales)^2))
cat("Test RMSE:", rmse, "\n")
Conclusion
In this introductory unit, we set up our R environment, loaded and cleaned the dataset, performed feature engineering, and implemented a Random Forest model to predict future sales. This serves as a foundational step for more in-depth analysis and modeling in subsequent units.
Data Collection and Management in R
Data Collection
# Load necessary libraries
library(readr)
# Define file paths for the datasets
sales_data_path <- "path/to/sales_data.csv"
inventory_data_path <- "path/to/inventory_data.csv"
store_info_path <- "path/to/store_info.csv"
# Read datasets
sales_data <- read_csv(sales_data_path)
inventory_data <- read_csv(inventory_data_path)
store_info <- read_csv(store_info_path)
Data Management
# Load necessary libraries
library(dplyr)
# Merge datasets
merged_data <- sales_data %>%
inner_join(inventory_data, by = "product_id") %>%
inner_join(store_info, by = "store_id")
# Handle missing values (note: this fills NAs in every column with 0;
# restrict it to numeric columns if 0 is not a sensible fill for your data)
merged_data <- merged_data %>%
mutate(across(everything(),
~ ifelse(is.na(.), 0, .)))
# Feature Engineering
merged_data <- merged_data %>%
mutate(
day_of_week = weekdays(as.Date(date)),
month = format(as.Date(date), "%m")
)
# Remove irrelevant columns if any, e.g., 'transaction_id'
cleaned_data <- merged_data %>%
select(-transaction_id)
Data Storage
# Save the cleaned data for future use
cleaned_data_path <- "path/to/cleaned_data.csv"
write_csv(cleaned_data, cleaned_data_path)
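CSV files are convenient but drop column types such as factors and dates. If preserving types matters, an RDS file is a simple alternative:
# Alternative: RDS keeps factor and Date types intact
saveRDS(cleaned_data, "path/to/cleaned_data.rds")
# cleaned_data <- readRDS("path/to/cleaned_data.rds")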
Summary of Key Points
- Data Loading: Use the readr package to load the sales, inventory, and store info datasets.
- Data Merging: Merge the datasets on common keys such as product_id and store_id using the dplyr package.
- Missing Values Handling: Replace missing values with 0.
- Feature Engineering: Create new features like day_of_week and month.
- Cleaning: Remove irrelevant columns.
- Storing: Save the cleaned data to a CSV file for future use.
You can now use cleaned_data for building and training your Random Forest model in the next phases of your project.
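As a recap, the steps above can be chained into a single helper. This is a minimal sketch that assumes the file paths, join keys, and column names used in this unit:
library(readr)
library(dplyr)
# Load, merge, fill, engineer, and clean in one pipeline
prepare_data <- function(sales_path, inventory_path, store_path) {
  read_csv(sales_path) %>%
    inner_join(read_csv(inventory_path), by = "product_id") %>%
    inner_join(read_csv(store_path), by = "store_id") %>%
    mutate(across(everything(), ~ ifelse(is.na(.), 0, .))) %>%
    mutate(day_of_week = weekdays(as.Date(date)),
           month = format(as.Date(date), "%m")) %>%
    select(-transaction_id)
}
cleaned_data <- prepare_data(sales_data_path, inventory_data_path, store_info_path)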
Data Preprocessing and Cleaning in R for Future Sales Prediction
Load Necessary Libraries
library(dplyr)
library(tidyr)
library(lubridate)
Set Working Directory and Load Data
setwd("path/to/your/directory")
sales_data <- read.csv("sales_data.csv")
Convert Date Columns to Date Format
sales_data$date <- as.Date(sales_data$date, format = "%Y-%m-%d")
Handle Missing Values
Step 1: Identify Missing Values
summary(sales_data)
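A per-column count of missing values is often easier to scan than the full summary:
# Count missing values per column
colSums(is.na(sales_data))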
Step 2: Impute or Remove Missing Values
# Impute missing values with mean (for numeric columns)
sales_data <- sales_data %>%
mutate_if(is.numeric, ~ifelse(is.na(.), mean(., na.rm = TRUE), .))
# Remove rows with missing values (for non-numeric columns)
sales_data <- sales_data %>%
drop_na()
Handle Outliers
# Function to cap values lying beyond 1.5 * IQR of the quartiles at the 5th/95th percentiles
cap_outliers <- function(x) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = TRUE)
caps <- quantile(x, probs=c(.05, .95), na.rm = TRUE)
H <- 1.5 * IQR(x, na.rm = TRUE)
x[x < (qnt[1] - H)] <- caps[1]
x[x > (qnt[2] + H)] <- caps[2]
return(x)
}
# Apply the function to appropriate columns
sales_data <- sales_data %>%
mutate_if(is.numeric, cap_outliers)
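A toy illustration of the capping behavior (synthetic vector, not the sales data): values beyond 1.5 times the IQR from the quartiles are pulled in to the 5th or 95th percentile.
# The extreme value 100 lies far above the upper IQR fence and is capped
x <- c(1, 2, 3, 4, 5, 100)
cap_outliers(x) # 100 becomes the 95th percentile of x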
Feature Engineering
Step 1: Extract Temporal Features
sales_data <- sales_data %>%
mutate(year = year(date),
month = month(date),
day = day(date),
weekday = wday(date, label = TRUE))
Step 2: Create Lagged Features (e.g., lagged sales for the past week)
sales_data <- sales_data %>%
arrange(date) %>%
group_by(store) %>%
mutate(lagged_sales = lag(sales, 7))
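Note that lag(sales, 7) has nothing to reference for the first seven rows of each store's series, so those rows are NA; a quick count:
# How many rows received an NA lag
sum(is.na(sales_data$lagged_sales))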
Encode Categorical Variables
# Encoding categorical variables as integer labels (label encoding, not one-hot);
# note that randomForest() can also handle factor columns directly
sales_data <- sales_data %>%
mutate_if(is.factor, as.character) %>%
mutate_at(vars(starts_with("category_")), list(~ as.integer(factor(.))))
Divide Data into Training and Testing Sets
set.seed(123)
train_indices <- sample(1:nrow(sales_data), 0.8 * nrow(sales_data))
train_data <- sales_data[train_indices, ]
test_data <- sales_data[-train_indices, ]
Resulting Data Overview
summary(train_data)
summary(test_data)
That’s it! You’ve prepared your data for predicting future sales with Random Forest algorithms in R. The steps covered data loading, type conversion, missing-value handling, outlier treatment, feature engineering, encoding, and dataset splitting. Apply these preprocessing steps carefully to ensure robust model performance.
Exploratory Data Analysis (EDA)
1. Load Required Libraries
library(ggplot2)
library(dplyr)
2. Load the Dataset
Assuming you have already preprocessed and cleaned your dataset.
sales_data <- read.csv("cleaned_sales_data.csv")
3. Summary Statistics
summary(sales_data)
4. Check for Missing Values
sum(is.na(sales_data))
5. Distribution of Sales
ggplot(sales_data, aes(x = Sales)) +
geom_histogram(binwidth = 50, fill = "blue", color = "black") +
labs(title = "Distribution of Sales", x = "Sales", y = "Frequency")
6. Sales Over Time
sales_data$Date <- as.Date(sales_data$Date)
ggplot(sales_data, aes(x = Date, y = Sales)) +
geom_line(color = "blue") +
labs(title = "Sales Over Time", x = "Date", y = "Sales")
7. Sales by Product Category
ggplot(sales_data, aes(x = ProductCategory, y = Sales)) +
geom_boxplot(fill = "orange", color = "black") +
labs(title = "Sales by Product Category", x = "Product Category", y = "Sales") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
8. Correlation Matrix
# corrplot is needed for the matrix visualization; install it once with install.packages("corrplot")
library(corrplot)
numeric_columns <- sales_data %>% select(where(is.numeric))
cor_matrix <- cor(numeric_columns)
corrplot(cor_matrix, method = "circle")
9. Sales by Store
ggplot(sales_data, aes(x = StoreID, y = Sales)) +
geom_boxplot(fill = "green", color = "black") +
labs(title = "Sales by Store", x = "Store ID", y = "Sales") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
10. Feature Relationships
pairs(~ Sales + Feature1 + Feature2 + Feature3, data = sales_data,
main = "Scatterplot Matrix")
11. Conclusion of EDA
cat("EDA completed. Summary statistics, visualizations, and correlation matrix have been generated.")
Make sure to replace Feature1, Feature2, and Feature3 with actual column names in your dataset.
By following these steps, you will be able to comprehensively understand the distribution, trends, and relationships in your sales data, which will help in building an effective predictive model.
Feature Engineering for Sales Forecasting
In this section, we will create new features from the existing data that can improve the predictive power of our Random Forest model.
Step 1: Load Required Libraries
library(dplyr)
library(lubridate)
Step 2: Generate Time-Based Features
Extract Date Components
We’ll extract year, month, day, and day of the week from the sales date.
sales_data <- sales_data %>%
mutate(Year = year(SalesDate),
Month = month(SalesDate),
Day = day(SalesDate),
DayOfWeek = wday(SalesDate, label = TRUE))
Generate Holiday Features
Assuming we have a dataset holidays that includes holiday dates, we’ll create a feature indicating whether a given day is a holiday.
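For illustration, the assumed holidays table could be a simple one-column data frame of dates (placeholder values, not real holiday data):
# Hypothetical holidays table with a Date column, as assumed below
holidays <- data.frame(Date = as.Date(c("2024-01-01", "2024-07-04", "2024-12-25")))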
sales_data <- sales_data %>%
mutate(IsHoliday = ifelse(SalesDate %in% holidays$Date, 1, 0))
Step 3: Create Lag Features
Lag features help capture the sales trend from previous days.
sales_data <- sales_data %>%
arrange(Store, SalesDate) %>%
group_by(Store) %>%
mutate(Lag1 = lag(Sales, 1),
Lag7 = lag(Sales, 7),
Lag14 = lag(Sales, 14))
Step 4: Rolling Average Features
Calculate rolling averages to smooth out daily fluctuations.
library(zoo) # rollmean() comes from the zoo package
sales_data <- sales_data %>%
arrange(Store, SalesDate) %>%
group_by(Store) %>%
mutate(RollingMean7 = rollmean(Sales, 7, fill = NA, align = "right"),
RollingMean28 = rollmean(Sales, 28, fill = NA, align = "right"))
Step 5: Interaction Features
Create interaction terms between features that show interaction effects.
sales_data <- sales_data %>%
mutate(PromoAndHoliday = Promo * IsHoliday)
Step 6: Store-Specific Features
If you have other store-specific features like store size, location, etc., you can merge them with the sales data.
sales_data <- sales_data %>%
  left_join(store_info, by = "StoreID") # adjust the key if your data uses Store rather than StoreID
Step 7: Handle Missing Values
Lag and rolling-mean computations leave NA values at the start of each store's series; replace them before modeling.
library(tidyr) # replace_na() comes from the tidyr package
sales_data <- sales_data %>%
mutate(across(c(Lag1, Lag7, Lag14, RollingMean7, RollingMean28), ~replace_na(., 0)))
Step 8: Final Data Preparation
Select the features we want to use for modeling.
features <- sales_data %>%
select(Store, SalesDate, Year, Month, Day, DayOfWeek, IsHoliday, Lag1, Lag7, Lag14, RollingMean7, RollingMean28, PromoAndHoliday, StoreSize, StoreLocation)
Now the dataset features is ready for use in training your Random Forest model. This concludes the feature engineering section of your sales forecasting project.
Introduction to Random Forest Algorithm
Overview
The Random Forest algorithm is a widely used machine learning method for classification and regression tasks. It constructs many decision trees during training and outputs the mean prediction of the individual trees (regression) or their majority-vote class (classification).
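As a toy illustration of the averaging idea (synthetic data, not the sales dataset), the forest's regression prediction is literally the mean of its trees' predictions:
library(randomForest)
set.seed(1)
# Synthetic regression data: y depends linearly on x plus noise
toy <- data.frame(x = runif(200))
toy$y <- 3 * toy$x + rnorm(200, sd = 0.3)
fit <- randomForest(y ~ x, data = toy, ntree = 50)
# predict.all returns each tree's prediction; their row means equal the aggregate
p <- predict(fit, toy, predict.all = TRUE)
all.equal(rowMeans(p$individual), p$aggregate) # TRUE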
Implementation in R for Sales Forecasting
Loading Libraries
First, ensure you have the necessary libraries loaded in your R environment.
library(randomForest)
library(dplyr)
Data Preparation
Load your preprocessed dataset. Assume you have a dataset sales_data that is already cleaned and preprocessed as per your previous steps.
# Assuming sales_data is already loaded and preprocessed
# sales_data <- read.csv("path_to_preprocessed_data.csv")
Splitting Data into Training and Testing Sets
For this example, we’ll split the data into training and testing sets.
set.seed(42) # For reproducibility
sample_indices <- sample(1:nrow(sales_data), size = 0.7 * nrow(sales_data))
train_data <- sales_data[sample_indices, ]
test_data <- sales_data[-sample_indices, ]
Building the Random Forest Model
We’ll train the Random Forest model on the training data, assuming Sales is the target variable.
# Training the Random Forest model
# Note: mtry = sqrt(p) is the classification default; randomForest's regression default is p/3
rf_model <- randomForest(Sales ~ ., data = train_data, ntree = 500, mtry = floor(sqrt(ncol(train_data) - 1)), importance = TRUE)
Predictions
We’ll then use the trained model to predict on the test set.
# Making predictions on the test set
predicted_sales <- predict(rf_model, test_data)
Model Evaluation
Evaluating the performance using metrics such as Mean Absolute Error (MAE).
mae <- mean(abs(predicted_sales - test_data$Sales))
cat("Mean Absolute Error (MAE):", mae, "\n")
Feature Importance
Understanding which features are most important to the model.
# Get importance
importance_values <- importance(rf_model)
# Plot importance
varImpPlot(rf_model)
# Display importance values
print(importance_values)
Conclusion
This implementation provides an overview of using Random Forest for sales forecasting in R. The model can now help inform supply chain and inventory management decisions based on future sales predictions.
Building the Random Forest Model in R
In this section, we will implement the Random Forest model to predict future sales for a retail chain.
Load Required Libraries
library(randomForest)
library(caret) # For splitting data and evaluating the model
Load and Prepare the Data
Assume we have a dataframe named sales_data containing the preprocessed and cleaned data.
# Load sales data
sales_data <- read.csv("path_to_your_sales_data.csv")
# Split the data into training and testing sets
set.seed(123) # For reproducibility
index <- createDataPartition(sales_data$sales, p = 0.8, list = FALSE)
train_data <- sales_data[index, ]
test_data <- sales_data[-index, ]
Building and Training the Model
# Define the formula for the random forest model
formula <- sales ~ .
# Train the Random Forest model
rf_model <- randomForest(formula, data = train_data, ntree=500, mtry=3, importance=TRUE)
# Print the model summary
print(rf_model)
Model Evaluation
Evaluate the model performance on the test data.
# Predict on test data
predictions <- predict(rf_model, test_data)
# Calculate performance metrics
mse <- mean((predictions - test_data$sales)^2)
rmse <- sqrt(mse)
cat("Mean Squared Error: ", mse, "\n")
cat("Root Mean Squared Error: ", rmse, "\n")
# For more detailed evaluation
postResample(pred = predictions, obs = test_data$sales)
Feature Importance
Evaluate the importance of features.
# Get feature importance
importance <- importance(rf_model)
varImpPlot(rf_model)
# Print feature importance
print(importance)
Save the Model
Save the trained model for future use.
# Save the model to a file
save(rf_model, file = "random_forest_model.RData")
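An alternative is saveRDS()/readRDS(), which stores a single object and lets you choose the variable name at load time:
# Alternative: saveRDS stores one object; readRDS restores it under any name
saveRDS(rf_model, file = "random_forest_model.rds")
rf_model_loaded <- readRDS("random_forest_model.rds")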
Conclusion
With this implementation, the Random Forest model has been built and trained effectively on sales data. You can now use this model to make predictions and support inventory management decisions.
This practical implementation gives you a complete workflow for predicting future sales for a retail chain using Random Forest in R. Save the random_forest_model.RData file, and you can load it later to make predictions on new data.
Model Validation and Performance Metrics
Model Validation
Model validation is crucial to ensure that the Random Forest model performs well on unseen data. The common practice is to split the dataset into training and testing sets; in this example, we’ll use the caret package for splitting the data.
# Load necessary libraries
library(caret)
library(randomForest)
# Assume 'sales_data' is your dataset and 'sales' is the target variable
set.seed(123)
trainIndex <- createDataPartition(sales_data$sales, p = .8,
list = FALSE,
times = 1)
trainData <- sales_data[trainIndex,]
testData <- sales_data[-trainIndex,]
# Train the Random Forest model
rf_model <- randomForest(sales ~ ., data = trainData, ntree = 100)
# Make predictions on the test set
predictions <- predict(rf_model, newdata = testData)
Performance Metrics
To evaluate the Random Forest model, we will use the following performance metrics:
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
These metrics give a sense of how well the model’s predictions match the actual sales data.
# Function to calculate performance metrics
performance_metrics <- function(actual, predicted) {
mae <- mean(abs(actual - predicted))
mse <- mean((actual - predicted)^2)
rmse <- sqrt(mse)
return(list(MAE = mae, MSE = mse, RMSE = rmse))
}
# Calculate the performance metrics
actual_sales <- testData$sales
performance <- performance_metrics(actual_sales, predictions)
# Print the performance metrics
print(performance)
# A more comprehensive evaluation using caret's built-in functions
# postResample() returns RMSE, R-squared, and MAE, in that order
eval_metrics <- postResample(pred = predictions, obs = actual_sales)
cat("Mean Absolute Error (MAE): ", eval_metrics["MAE"], "\n")
cat("Root Mean Squared Error (RMSE): ", eval_metrics["RMSE"], "\n")
cat("Mean Squared Error (MSE): ", eval_metrics["RMSE"]^2, "\n")
K-Fold Cross-Validation
To get a more robust estimate of model performance, K-fold cross-validation can be applied. Here, we’ll perform 10-fold cross-validation.
# Perform 10-fold cross-validation
set.seed(123)
control <- trainControl(method="cv", number=10)
rf_cv_model <- train(sales ~ ., data=sales_data, method="rf", trControl=control)
# Print cross-validation results
print(rf_cv_model)
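By default, train() tries only a few mtry values. To control the search explicitly, a sketch passing a grid (for method "rf", mtry is the only tunable parameter; ntree is passed through):
# Cross-validate over an explicit set of mtry values
rf_grid <- expand.grid(mtry = c(2, 4, 6))
rf_cv_model <- train(sales ~ ., data = sales_data, method = "rf",
                     trControl = control, tuneGrid = rf_grid, ntree = 100)
print(rf_cv_model)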
Conclusion
By properly validating the model and leveraging performance metrics, we ensure that our Random Forest model is generalizable and accurate in predicting future sales. The steps provided above can directly be used in real-life applications to evaluate the performance of the Random Forest model in R.
Hyperparameter Tuning and Optimization for Random Forest in R
In this section, we will focus on optimizing the hyperparameters of the Random Forest model to improve its performance. We will use the caret package, which provides a streamlined interface for hyperparameter tuning.
Step 1: Load Required Libraries and Data
# Load necessary libraries
library(caret)
library(randomForest)
# Assuming the data has already been preprocessed and split into training and test sets
# train_data and test_data should be your preprocessed datasets
train_data <- read.csv("path_to_your_train_data.csv")
test_data <- read.csv("path_to_your_test_data.csv")
Step 2: Define the Model Training Control and Grid
# Define training control
train_control <- trainControl(method = "cv", number = 5, search = "grid")
# Define the grid for hyperparameter tuning
hyper_grid <- expand.grid(
mtry = c(2, 4, 6, 8), # Number of variables available for splitting at each tree node
splitrule = "variance",
min.node.size = c(1, 5, 10) # Minimum size of terminal nodes
)
Step 3: Train the Model with Hyperparameter Tuning
# Train the model
rf_model <- train(
x = train_data[, -ncol(train_data)], # Features (assumes the response is the last column)
y = train_data[, ncol(train_data)], # Response column (e.g., sales)
method = "ranger", # Using 'ranger' as it supports tuning
trControl = train_control,
tuneGrid = hyper_grid
)
# Print best model parameters
print(rf_model$bestTune)
Step 4: Evaluate the Best Model
# Make predictions on the test dataset
predictions <- predict(rf_model, newdata = test_data[, -ncol(test_data)])
# Calculate performance metrics (e.g., RMSE)
actuals <- test_data[, ncol(test_data)]
rmse <- sqrt(mean((predictions - actuals)^2))
# Print RMSE
print(paste("Test RMSE:", rmse))
Step 5: Save the Optimized Model
# Save the trained model
saveRDS(rf_model, file = "optimized_rf_model.rds")
# Load the model for future use
# loaded_model <- readRDS("optimized_rf_model.rds")
This code will allow you to optimize your Random Forest model and evaluate its performance efficiently, providing a well-tuned model for predicting future sales.
Deployment and Reporting Results
Deployment
Save the Model:
Save the trained Random Forest model so it can be reused without retraining.
# Save the model to an RDS file
saveRDS(rf_model, file = "random_forest_model.rds")
Load the Model:
When redeploying or reusing the model, load it from the saved RDS file.
# Load the model
rf_model <- readRDS("random_forest_model.rds")
Predict Sales Using New Data:
To generate forecasts, apply the loaded model to new data.
# Assuming `new_data` is a data frame with the same structure as the training data
new_data <- read.csv("new_sales_data.csv")
# Predict using the loaded model
predicted_sales <- predict(rf_model, new_data)
# Append the predictions to the new_data data frame
new_data$Predicted_Sales <- predicted_sales
# Save predictions to a new CSV
write.csv(new_data, "predicted_sales.csv", row.names = FALSE)
Reporting Results
Generate Summary Report:
Create a summary report of the predictions.
# Load necessary libraries
library(dplyr)
# Create summary statistics
summary_stats <- new_data %>%
summarise(
total_actual_sales = sum(Actual_Sales, na.rm = TRUE),
total_predicted_sales = sum(Predicted_Sales, na.rm = TRUE),
error = total_actual_sales - total_predicted_sales,
mean_actual_sales = mean(Actual_Sales, na.rm = TRUE),
mean_predicted_sales = mean(Predicted_Sales, na.rm = TRUE),
rmse = sqrt(mean((Actual_Sales - Predicted_Sales)^2, na.rm = TRUE))
)
# Print summary statistics
print(summary_stats)
Visualize Results:
Create visualizations to compare actual vs. predicted sales.
# Load necessary libraries
library(ggplot2)
# Plot actual vs predicted sales
ggplot(new_data, aes(x = Actual_Sales, y = Predicted_Sales)) +
geom_point(color = 'blue') +
geom_abline(slope = 1, intercept = 0, color = 'red', linetype = 'dashed') +
labs(title = "Actual vs Predicted Sales",
x = "Actual Sales",
y = "Predicted Sales") +
theme_minimal()
# Save the plot
ggsave("actual_vs_predicted_sales.png")Generate and Send Report:
Compile the results and send the report.# Create a report document using RMarkdown
rmarkdown::render("sales_forecasting_report.Rmd", output_format = "pdf_document")
# Sending email from R requires an add-on package; emailR below is a placeholder,
# so substitute your preferred mail package (e.g., blastula or mailR) and its API
library(emailR)
# Sending the report via email
send_mail(to = "stakeholder@example.com",
          subject = "Sales Forecasting Report",
          body = "Please find the attached sales forecast report.",
          attachments = "sales_forecasting_report.pdf")
Example of sales_forecasting_report.Rmd (RMarkdown file):
---
title: "Sales Forecasting Report"
author: "Your Name"
date: "`r Sys.Date()`"
output: pdf_document
---
# Sales Forecasting Report
## Summary Statistics
```{r echo=FALSE}
print(summary_stats)
```
## Description
This report includes the actual vs. predicted sales comparison along with key performance metrics. The model used for this prediction is the Random Forest algorithm implemented in R.
This implementation ensures the model can be deployed for predictions on new data and comprehensive results are generated and communicated effectively.