Customer Data Analysis Project
1. Introduction to the Project
Welcome to the Customer Data Analysis Project. The objective of this project is to analyze customer data and derive actionable insights using Python. In this introductory section, we will set up our environment and prepare to explore the dataset.
Project Overview
This project is divided into several units, each focusing on a different aspect of data analysis. The project will be implemented in Google Colab, leveraging Python's powerful data analysis libraries.
Setting Up the Environment
To ensure smooth execution, follow the steps below to set up your environment in Google Colab.
Step 1: Import Required Libraries
To start, we need to import essential Python libraries that will assist us throughout our analysis. Below is a practical implementation of importing these libraries.
# Importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Setting up visualization styles
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)
# Display inline plots in Google Colab
%matplotlib inline
Step 2: Loading the Dataset
Next, load the customer dataset into a Pandas DataFrame. This dataset will be the primary focus of our analysis.
# Assuming the dataset is stored in a CSV file in Google Drive or directly uploaded to Colab
# Upload the file manually (if not done through Google Drive)
from google.colab import files
uploaded = files.upload()
# Read the uploaded dataset
df = pd.read_csv(next(iter(uploaded.keys())))
# Display the first few rows of the dataset to verify the load
df.head()
Step 3: Initial Data Exploration
Perform a preliminary exploration of the dataset to understand its structure and content.
# Checking the data types and non-null counts
df.info()
# Describing the statistical properties of the dataset
df.describe()
# Checking for null values in the dataset
df.isnull().sum()
With these steps, you have successfully set up your environment and performed a preliminary exploration of the customer dataset. Proceeding with this foundational understanding will enable you to derive meaningful insights in the subsequent units of this project.
Conclusion
This concludes the introduction to the Customer Data Analysis Project. You now have a functional environment in Google Colab, equipped with the necessary libraries and an initial understanding of the dataset. In the next unit, we will dive deeper into data cleaning and preprocessing.
Stay tuned for the next section!
Setting Up Google Colab Environment
Uploading Data to Google Colab
Before diving into the analysis, you need to make the customer data available to Colab. Follow these steps to ensure your environment is properly set up for loading it:
Mount Google Drive
First, mount Google Drive to access the necessary datasets easily.
from google.colab import drive
drive.mount('/content/drive')
Verify Data Access
Ensure that the data file, e.g., customer_data.csv, is in your Google Drive. You can list directory contents to confirm:
!ls /content/drive/MyDrive/
Import Required Libraries
Next, import the essential libraries required for your analysis:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Load Data
Now load the dataset into a DataFrame for examination and preprocessing:
data_path = '/content/drive/MyDrive/customer_data.csv'
df = pd.read_csv(data_path)
Data Exploration
Perform initial data exploration to understand the structure and content of the dataset:
# Show first few rows
df.head()
# Summary statistics
df.describe()
# Check for missing values
df.isnull().sum()
Data Preprocessing
Clean and preprocess the data to prepare it for analysis:
- Handling Missing Values:
# Drop rows with missing values (example)
df.dropna(inplace=True)
# Or fill them with mean/median/mode (example for filling with mean)
df.fillna(df.mean(numeric_only=True), inplace=True)  # mean() is only defined for numeric columns
- Convert Categorical Features:
# Convert categorical columns to numerical (example using pd.get_dummies)
df = pd.get_dummies(df, drop_first=True)
Data Visualization
Create visualizations to get further insights:
# Set visual aesthetics
sns.set(style="whitegrid")
# Plot a histogram for a numerical column
plt.figure(figsize=(10, 6))
sns.histplot(df['age'], bins=30, kde=True)
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
Correlation Analysis
Analyze correlations between numerical features:
# Computing correlation matrix
corr_matrix = df.corr()
# Plotting the heatmap
plt.figure(figsize=(14, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
Save Processed Data
Save the cleaned and preprocessed data for future use:
processed_data_path = '/content/drive/MyDrive/processed_customer_data.csv'
df.to_csv(processed_data_path, index=False)
Next Steps
Now that your environment is set up and your data is loaded and preprocessed, you can proceed to implement various analytical models and derive actionable insights from the customer data.
Remember to always document your analysis and findings thoroughly to provide a clear narrative on how you derived your insights. Happy analyzing!
Part 3: Uploading and Previewing the Dataset
In this section, we will walk through the process of uploading a customer data file to Google Colab and previewing the dataset to understand its structure and contents.
Step 1: Uploading the Dataset to Google Colab
Google Colab provides a convenient way to upload files for analysis. Use the following code snippet to upload your customer data file:
from google.colab import files
# Prompt the user to upload a file
uploaded = files.upload()
Step 2: Loading the Dataset into a DataFrame
Once the file is uploaded, we can load it into a Pandas DataFrame for easy manipulation and analysis:
import pandas as pd
# Assuming the uploaded file is a CSV
filename = list(uploaded.keys())[0]
df = pd.read_csv(filename)
Step 3: Previewing the Dataset
To understand the structure and contents of the dataset, you should preview it using the following methods:
- First Few Rows: Display the first 5 rows of the dataset:
df.head()
- Dataset Information: Get a summary of the dataset, including the number of non-null entries and data types:
df.info()
- Descriptive Statistics: Generate descriptive statistics that summarize the central tendency, dispersion, and shape of the dataset's distribution:
df.describe()
Full Implementation
Here is the full implementation of uploading and previewing the dataset in Google Colab:
# Part 3: Uploading and Previewing the Dataset
from google.colab import files
import pandas as pd
# Prompt the user to upload a file
uploaded = files.upload()
# Load the uploaded file into a DataFrame
filename = list(uploaded.keys())[0]
df = pd.read_csv(filename)
# Preview the first few rows of the dataset
print("First five rows of the dataset:")
print(df.head())
# Summarize the dataset
print("\nDataset Information:")
df.info()  # info() prints its summary directly, so no print() wrapper is needed
# Generate descriptive statistics
print("\nDescriptive Statistics:")
print(df.describe())
This implementation lets you upload a dataset, load it into a DataFrame, and perform basic preview steps to understand its structure and contents, which is critical for any further analysis.
Data Cleaning and Preprocessing
Libraries and Initial Setup
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
# Assuming df is the DataFrame that has been uploaded and previewed
# df = pd.read_csv('your_dataset.csv')
Handling Missing Values
# Check for missing values
missing_values = df.isnull().sum()
# Impute missing values: For numerical columns, fill with mean; for categorical columns, fill with mode
for column in df.columns:
    if pd.api.types.is_numeric_dtype(df[column]):
        df[column] = df[column].fillna(df[column].mean())
    else:
        df[column] = df[column].fillna(df[column].mode()[0])
Removing Duplicates
# Removing duplicates
df = df.drop_duplicates()
Encoding Categorical Variables
# Encoding categorical variables using one-hot encoding
df = pd.get_dummies(df, drop_first=True)
Scaling Numerical Features
# Identify numerical features
numerical_features = df.select_dtypes(include=[np.number]).columns
# Scaling numerical features using Standard Scaler
scaler = StandardScaler()
df[numerical_features] = scaler.fit_transform(df[numerical_features])
Date-Time Processing (if applicable)
# Example: Converting a 'date' column to datetime and extracting features
if 'date' in df.columns:
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['day_of_week'] = df['date'].dt.dayofweek
# Dropping the original date column
df = df.drop('date', axis=1)
Final DataFrame Overview
# Final check on the cleaned and preprocessed DataFrame
df.info()
print(df.describe())
Saving the Cleaned DataFrame
# Save cleaned DataFrame to a new CSV file
df.to_csv('cleaned_customer_data.csv', index=False)
The above steps provide a comprehensive procedure to clean and preprocess your dataset. Ensure that the columns and types fit your specific dataset when applying the solution.
Exploratory Data Analysis (EDA)
In this section, we will conduct an Exploratory Data Analysis (EDA) on the customer dataset to understand its underlying structure and extract useful insights. We'll use Python and several libraries for data analysis and visualization.
Import Libraries
# Import necessary libraries for EDA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Ensure plots show up in the notebook
%matplotlib inline
Load the Dataset
Assuming the dataset has already been uploaded to Google Colab and loaded into a DataFrame:
# Load dataset into a DataFrame
df = pd.read_csv('path_to_your_dataset.csv')
Display Basic Information
# Display the first few rows of the dataset
print(df.head())
# Display a summary of the dataset
df.info()
print(df.describe())
Univariate Analysis
Let's start by examining the distribution of individual features.
Numerical Features
# Histograms for numerical features
numerical_features = df.select_dtypes(include=[np.number]).columns.tolist()
df[numerical_features].hist(figsize=(15, 15), bins=15)
plt.suptitle('Histograms of Numerical Features')
plt.show()
# Kernel Density Estimate (KDE) plots for numerical features
for feature in numerical_features:
    plt.figure(figsize=(10, 6))
    sns.kdeplot(df[feature], fill=True)
    plt.title(f'KDE for {feature}')
    plt.show()
Categorical Features
# Bar charts for categorical features
categorical_features = df.select_dtypes(include=[object]).columns.tolist()
for feature in categorical_features:
    plt.figure(figsize=(10, 6))
    sns.countplot(y=df[feature], order=df[feature].value_counts().index)
    plt.title(f'Distribution of {feature}')
    plt.show()
Bivariate Analysis
Next, we explore the relationships between pairs of features.
Numerical vs Numerical
# Pairplot for numerical features
sns.pairplot(df[numerical_features])
plt.suptitle('Pairplot of Numerical Features')
plt.show()
Numerical vs Categorical
# Boxplots for numerical vs categorical features
for feature in numerical_features:
    for cat_feature in categorical_features:
        plt.figure(figsize=(10, 6))
        sns.boxplot(x=df[cat_feature], y=df[feature])
        plt.title(f'{feature} vs {cat_feature}')
        plt.show()
Categorical vs Categorical
# Heatmap of count plot for categorical vs categorical features
for i in range(len(categorical_features)):
for j in range(i + 1, len(categorical_features)):
ct = pd.crosstab(df[categorical_features[i]], df[categorical_features[j]])
sns.heatmap(ct, annot=True, fmt='d')
plt.title(f'Heatmap of {categorical_features[i]} vs {categorical_features[j]}')
plt.show()
Correlation Analysis
To understand the linear relationships between numerical features.
# Correlation heatmap
plt.figure(figsize=(12, 8))
corr_matrix = df[numerical_features].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix Heatmap')
plt.show()
# Display highly correlated pairs
corr_pairs = corr_matrix.unstack()
sorted_pairs = corr_pairs.sort_values(kind="quicksort")
high_corr_pairs = sorted_pairs[(abs(sorted_pairs) > 0.8) & (abs(sorted_pairs) < 1.0)]  # strong pairs, excluding self-correlations
print("Highly Correlated Pairs:")
print(high_corr_pairs)
Summary of Findings
After conducting the EDA, summarize the key findings in a structured format:
### Summary of Findings
1. **Data Distribution**
- Numerical features such as `feature1`, `feature2`, etc., show approximately normal distributions, while `feature3` is skewed.
- Categorical feature `category1` is heavily imbalanced.
2. **Relationships**
- Strong positive correlation observed between `feature1` and `feature2`.
- `feature3` differs significantly across levels of `category1`.
3. **Potential Outliers**
- Outliers observed in `feature3` which may need further investigation.
4. **Conclusions**
- These insights will inform the next steps in data modeling and feature engineering.
This completes the Exploratory Data Analysis (EDA) section of the project. The next steps will involve feature engineering, model training, and evaluation based on these initial insights.
Remember to replace placeholder feature names (feature1, feature2, etc.) with actual names from your dataset.
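If you want to back the distribution and imbalance statements in your summary with numbers rather than reading them off the plots, you can compute skewness for the numerical features and class shares for the categorical ones. A minimal sketch, reusing the numerical_features and categorical_features lists defined earlier in this section:
# Skewness of each numerical feature (values far from 0 indicate skew)
print("Skewness per numerical feature:")
print(df[numerical_features].skew().sort_values(ascending=False))
# Class shares of each categorical feature (large gaps indicate imbalance)
for feature in categorical_features:
    print(f"\nValue share for {feature}:")
    print(df[feature].value_counts(normalize=True))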
Visualizing the Data
In this section, we will create visualizations to better understand our customer data and derive actionable insights.
Import Required Libraries
Before we start creating visualizations, ensure that you have the necessary libraries imported:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Load the Cleaned Dataset
Assuming you have a cleaned dataset from the previous step:
# Assuming the dataset is already cleaned and available
# This example assumes the DataFrame is named 'customer_data'
customer_data = pd.read_csv('cleaned_customer_data.csv')
Distribution of Customer Ages
Let's create a histogram to visualize the age distribution of our customers.
plt.figure(figsize=(10,6))
sns.histplot(customer_data['Age'], kde=True, bins=30, color='blue')
plt.title('Distribution of Customer Ages')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
Customer Segmentation by Category
If our data includes customer segments or categories, we can visualize it using a bar plot:
plt.figure(figsize=(10,6))
customer_segment_counts = customer_data['Segment'].value_counts()
sns.barplot(x=customer_segment_counts.index, y=customer_segment_counts.values, palette='viridis')
plt.title('Customer Segmentation')
plt.xlabel('Segment')
plt.ylabel('Number of Customers')
plt.show()
Monthly Revenue Analysis
We can visualize the monthly revenue to understand the trend over time, assuming we have Date and Revenue columns in our dataset:
# Convert Date column to datetime if not already done
customer_data['Date'] = pd.to_datetime(customer_data['Date'])
# Create a new column for months
customer_data['Month'] = customer_data['Date'].dt.to_period('M')
# Group by Month and sum the Revenue
monthly_revenue = customer_data.groupby('Month').agg({'Revenue': 'sum'}).reset_index()
# Convert the Period values back to timestamps so they can be plotted on the x-axis
monthly_revenue['Month'] = monthly_revenue['Month'].dt.to_timestamp()
plt.figure(figsize=(12,6))
sns.lineplot(x='Month', y='Revenue', data=monthly_revenue, marker='o')
plt.title('Monthly Revenue Trend')
plt.xlabel('Month')
plt.ylabel('Revenue')
plt.xticks(rotation=45)
plt.show()
Heatmap of Correlations
To understand the relationships between numerical variables in the dataset, we can create a heatmap of the correlation matrix:
# Compute the correlation matrix
corr_matrix = customer_data.corr(numeric_only=True)
plt.figure(figsize=(12,8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=.5)
plt.title('Correlation Heatmap')
plt.show()
Customer Lifetime Value (CLTV) Distribution
Assuming we have computed a CLTV column in the customer data, let's visualize its distribution:
plt.figure(figsize=(10,6))
sns.histplot(customer_data['CLTV'], kde=True, bins=30, color='green')
plt.title('Customer Lifetime Value Distribution')
plt.xlabel('CLTV')
plt.ylabel('Frequency')
plt.show()
These visualizations should help you gain significant insights into your customer data. Make sure to interpret these visualizations in the context of your business problem and use them to drive actionable steps.
Customer Segmentation Analysis
In this section, we will perform customer segmentation using the K-Means clustering algorithm. The goal is to group customers based on their behaviors and characteristics to derive actionable insights.
Import Necessary Libraries
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
Data Preparation
Ensure your data is cleaned and preprocessed, with relevant features extracted during the previous steps.
# Assuming the cleaned and preprocessed dataframe is named `df`
# Select relevant features for segmentation
features = df[['feature1', 'feature2', 'feature3', 'feature4']]
Scaling the Data
Standardize the features to ensure equal weighting.
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
Finding the Optimal Number of Clusters
We will use the Elbow Method to determine the optimal number of clusters.
wcss = [] # within-cluster sum of squares
for i in range(1, 11):
kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=42)
kmeans.fit(scaled_features)
wcss.append(kmeans.inertia_)
plt.figure(figsize=(10, 8))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()
Select the number of clusters at the 'elbow' point of the plot, where the rate of decrease in WCSS slows sharply.
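If you prefer a programmatic check rather than reading the elbow off the plot by eye, one simple heuristic is to keep increasing the cluster count while each extra cluster still reduces the WCSS by a meaningful fraction. A minimal sketch, reusing the wcss list computed above; the 10% threshold is an assumption you can tune:
# Heuristic elbow pick: increase k while each additional cluster still cuts
# WCSS by at least `threshold`; stop at the first k that no longer does.
threshold = 0.10  # assumed cut-off for a "meaningful" improvement
suggested_k = 1
for k in range(2, len(wcss) + 1):
    gain = (wcss[k - 2] - wcss[k - 1]) / wcss[k - 2]
    if gain < threshold:
        break
    suggested_k = k
print("Suggested number of clusters:", suggested_k)
Treat the result as a starting point and confirm it against the plot before fixing n_clusters below.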
Applying K-Means Clustering
Based on the Elbow Method, let's assume the optimal number of clusters is n_clusters (replace it with the number you choose).
n_clusters = 3 # replace this with the optimal number of clusters determined
kmeans = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=300, n_init=10, random_state=42)
clusters = kmeans.fit_predict(scaled_features)
# Add the cluster labels to the original dataframe
df['Cluster'] = clusters
Analyzing the Clusters
Analyze the characteristics of each cluster by aggregating data.
cluster_summary = df.groupby('Cluster').mean(numeric_only=True)
print(cluster_summary)
Visualizing the Clusters
Visualize the clusters using two of the most significant features.
plt.figure(figsize=(10, 8))
sns.scatterplot(x='feature1', y='feature2', hue='Cluster', data=df, palette='viridis')
plt.title('Customer Segmentation')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
Ensure that the features chosen for visualization are the most significant ones identified during the exploratory data analysis.
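If no single pair of raw features separates the groups cleanly, a common alternative (not part of the steps above) is to project the scaled features onto their first two principal components and plot the clusters in that space. A minimal sketch, assuming scaled_features and the Cluster column created earlier:
from sklearn.decomposition import PCA

# Project the scaled features onto the first two principal components
pca = PCA(n_components=2, random_state=42)
components = pca.fit_transform(scaled_features)

plt.figure(figsize=(10, 8))
sns.scatterplot(x=components[:, 0], y=components[:, 1], hue=df['Cluster'], palette='viridis')
plt.title('Customer Segments in PCA Space')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()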
Conclusion
In this section, K-Means clustering has been utilized to segment customers into different groups. The model's findings should now enable you to derive actionable insights for each identified customer segment.
Analyzing Purchase History
Import Libraries
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
Load the Dataset
Ensure the dataset is already cleaned and preprocessed as per previous units.
df = pd.read_csv('purchase_history_cleaned.csv')
Feature Engineering
Calculate Days Since Last Purchase
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
# Calculate days since last purchase
df['days_since_last_purchase'] = (df['purchase_date'].max() - df['purchase_date']).dt.days
Calculate Monetary Value
# Sum of all purchases per customer
monetary = df.groupby('customer_id')['purchase_amount'].sum().reset_index()
monetary.columns = ['customer_id', 'total_amount_spent']
Calculate Recency
recency = df.groupby('customer_id')['days_since_last_purchase'].min().reset_index()
recency.columns = ['customer_id', 'recency']
Calculate Frequency
frequency = df.groupby('customer_id')['purchase_id'].count().reset_index()
frequency.columns = ['customer_id', 'frequency']
Combine all Features
rfm = pd.merge(recency, frequency, on='customer_id')
rfm = pd.merge(rfm, monetary, on='customer_id')
Add RFM Segmentation
Scoring
# Define the bins and use qcut to assign R, F, and M scores
rfm['R_score'] = pd.qcut(rfm['recency'], 4, ['4','3', '2', '1'])
rfm['F_score'] = pd.qcut(rfm['frequency'].rank(method='first'), 4, ['1','2', '3', '4'])
rfm['M_score'] = pd.qcut(rfm['total_amount_spent'].rank(method='first'), 4, ['1','2', '3', '4'])
# Combine RFM score
rfm['RFM_Segment'] = rfm['R_score'].astype(str) + rfm['F_score'].astype(str) + rfm['M_score'].astype(str)
# Scores are stored as string labels, so cast them to integers before summing
rfm['RFM_Score'] = rfm[['R_score', 'F_score', 'M_score']].astype(int).sum(axis=1)
Analyze RFM Segments
Summary
rfm_summary = rfm.groupby('RFM_Segment').size().reset_index()
rfm_summary.columns = ['RFM_Segment', 'Count']
rfm_summary = rfm_summary.sort_values(by='Count', ascending=False)
print(rfm_summary)
Visualize RFM Segments
plt.figure(figsize=(12,8))
sns.barplot(x='RFM_Segment', y='Count', data=rfm_summary)
plt.title('RFM Segments Distribution')
plt.xlabel('RFM Segment')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()
Identify Top Customers
Top 10% Customers Based on RFM Score
top_customers = rfm.nlargest(int(0.1 * len(rfm)), 'RFM_Score')
print(top_customers.head())
Export Top Customers to CSV
top_customers.to_csv('top_customers.csv', index=False)
Summary
This implementation calculates RFM (Recency, Frequency, Monetary) scores for each customer, analyzes the RFM segments, and identifies the top 10% of customers based on their RFM scores. The results are then exported to a CSV file for further use or targeted marketing strategies.
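As a follow-up, you may want to translate the numeric RFM_Score into a few named tiers that are easier to communicate. A minimal sketch; the tier names and cut-offs are assumptions you should adapt to your own scoring scale (with the quartile scoring above, scores range from 3 to 12):
# Map the combined RFM_Score to named tiers.
# The bin edges and labels are illustrative assumptions.
rfm['Tier'] = pd.cut(
    rfm['RFM_Score'],
    bins=[2, 5, 8, 12],
    labels=['Low value', 'Mid value', 'High value']
)
print(rfm['Tier'].value_counts())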
Customer Feedback Analysis
Load Required Libraries
import pandas as pd
import numpy as np
from textblob import TextBlob
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import seaborn as sns
Load and Preview the Dataset
# Assuming the data is already loaded in a DataFrame named `feedback_df`
feedback_df.head()
Sentiment Analysis
Define Functions for Sentiment Analysis
def get_sentiment(text):
analysis = TextBlob(text)
if analysis.sentiment.polarity > 0:
return 'Positive'
elif analysis.sentiment.polarity == 0:
return 'Neutral'
else:
return 'Negative'
feedback_df['Sentiment'] = feedback_df['Feedback'].apply(get_sentiment)
feedback_df['Polarity'] = feedback_df['Feedback'].apply(lambda x: TextBlob(x).sentiment.polarity)
Visualize Sentiment Distribution
plt.figure(figsize=(10, 6))
sns.countplot(data=feedback_df, x='Sentiment', palette='viridis')
plt.title('Sentiment Distribution')
plt.xlabel('Sentiment')
plt.ylabel('Frequency')
plt.show()
Word Cloud for Feedback
Generate Word Cloud for Positive Feedback
positive_feedback = " ".join(feedback for feedback in feedback_df[feedback_df['Sentiment'] == 'Positive']['Feedback'])
wordcloud_positive = WordCloud(width=800, height=400, background_color='white').generate(positive_feedback)
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud_positive, interpolation='bilinear')
plt.title('Word Cloud for Positive Feedback')
plt.axis('off')
plt.show()
Generate Word Cloud for Negative Feedback
negative_feedback = " ".join(feedback for feedback in feedback_df[feedback_df['Sentiment'] == 'Negative']['Feedback'])
wordcloud_negative = WordCloud(width=800, height=400, background_color='white').generate(negative_feedback)
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud_negative, interpolation='bilinear')
plt.title('Word Cloud for Negative Feedback')
plt.axis('off')
plt.show()
Most Common Words Analysis
Function to Extract Most Common Words
from collections import Counter
import re
def get_most_common_words(text, num_words=20):
words = re.findall(r'\w+', text.lower())
common_words = Counter(words).most_common(num_words)
return common_words
# Combine all feedback into one string
all_feedback = " ".join(feedback for feedback in feedback_df['Feedback'])
common_words = get_most_common_words(all_feedback)
Visualize Most Common Words
words_df = pd.DataFrame(common_words, columns=['Word', 'Frequency'])
plt.figure(figsize=(12, 8))
sns.barplot(data=words_df, x='Frequency', y='Word', palette='viridis')
plt.title('Most Common Words in Customer Feedback')
plt.xlabel('Frequency')
plt.ylabel('Words')
plt.show()
Insights Extraction
Summary of Insights
insights = {
"Total Feedback Count": len(feedback_df),
"Positive Feedback Count": len(feedback_df[feedback_df['Sentiment'] == 'Positive']),
"Negative Feedback Count": len(feedback_df[feedback_df['Sentiment'] == 'Negative']),
"Neutral Feedback Count": len(feedback_df[feedback_df['Sentiment'] == 'Neutral']),
"Most Common Positive Words": get_most_common_words(positive_feedback),
"Most Common Negative Words": get_most_common_words(negative_feedback)
}
for key, value in insights.items():
print(f"{key}: {value}")
This code provides a complete practical implementation for Customer Feedback Analysis, focusing on sentiment analysis, visualization of sentiments, and extraction of the most common words from the feedback to derive actionable insights. Each section is self-contained and intended for execution within a Google Colab environment.
Predictive Modeling for Customer Insights
Below is the comprehensive implementation of predictive modeling for customer insights using Python. This section assumes you've already completed data preprocessing and exploratory data analysis.
Step 1: Import Necessary Libraries
Ensure you have all required libraries.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
Step 2: Load Your Preprocessed Dataset
Load the preprocessed dataset, assuming you've named it cleaned_customer_data.csv.
df = pd.read_csv('cleaned_customer_data.csv')
Step 3: Feature Selection
Choose relevant features for modeling and the target variable (e.g., Customer_Lifetime_Value, Churn).
features = df[['feature1', 'feature2', 'feature3', 'feature4']]
target = df['target_variable']
Step 4: Train-Test Split
Split your data into training and test sets for validation purposes.
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
Step 5: Data Scaling
Scale your data if necessary for some algorithms.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Step 6: Model Training and Evaluation
Train and evaluate multiple models to choose the best performing one.
Logistic Regression
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
y_pred_log_reg = log_reg.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_log_reg))
print("Logistic Regression Report:\n", classification_report(y_test, y_pred_log_reg))
Decision Tree
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
y_pred_tree = tree.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_tree))
print("Decision Tree Report:\n", classification_report(y_test, y_pred_tree))
Random Forest
forest = RandomForestClassifier()
forest.fit(X_train, y_train)
y_pred_forest = forest.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_forest))
print("Random Forest Report:\n", classification_report(y_test, y_pred_forest))
Step 7: Confusion Matrix
Use the confusion matrix to get a better understanding of the model performance.
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
conf_matrix = confusion_matrix(y_test, y_pred_forest)
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix - Random Forest')
plt.show()
Step 8: Interpret Results
Discuss which model performed best based on the accuracy and classification reports. Focus on the key metrics such as precision, recall, and F1-score.
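To make this comparison concrete, it helps to gather the headline metrics for all three models into a single table before deciding. A minimal sketch, reusing the predictions from Step 6; the weighted averaging is an assumption that also covers multi-class targets:
from sklearn.metrics import f1_score, precision_score, recall_score

# Collect headline metrics per model into one comparison table
predictions = {
    'Logistic Regression': y_pred_log_reg,
    'Decision Tree': y_pred_tree,
    'Random Forest': y_pred_forest,
}
comparison = pd.DataFrame({
    name: {
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred, average='weighted'),
        'Recall': recall_score(y_test, y_pred, average='weighted'),
        'F1-score': f1_score(y_test, y_pred, average='weighted'),
    }
    for name, y_pred in predictions.items()
}).T
print(comparison)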
Conclusion
The implementation provided will equip you to create and evaluate predictive models for customer insights. Choose the best performing model based on your project's goals and the metrics of importance.
This should seamlessly follow your previous units and enable you to derive actionable insights from your customer data.
Part 11: Deriving Business Strategies from Analytics
This section focuses on taking the insights we've gained from data analysis and converting them into actionable business strategies.
Step 1: Synthesize Insights
First, we'll summarize key insights from our data analysis and predictive models. Define any significant findings that could impact business strategy.
# Import necessary libraries
import pandas as pd
# Example: Summarized insights from previous analysis:
data_summary = {
'High-Value Segments': ['Segment_3', 'Segment_5'],
'Low Retention Segments': ['Segment_1', 'Segment_6'],
'Frequent Product Returns': ['Product_45', 'Product_98'],
'High Customer Satisfaction': {'Product_23': 4.8, 'Product_17': 4.7},
'Low Customer Satisfaction': {'Product_10': 2.1, 'Product_33': 2.4}
}
# Mixed list/dict values display more cleanly as plain strings
df_summary = pd.DataFrame({
    'Insight': list(data_summary.keys()),
    'Details': [str(value) for value in data_summary.values()]
})
print(df_summary)
Step 2: Develop Business Strategies
Based on synthesized insights, formulate business strategies aimed at addressing specific issues or capitalizing on opportunities.
Example Strategies:
- Customer Retention: Offer loyalty programs or special discounts to high-value but low-retention segments.
- Product Improvement: Investigate and improve products that receive frequent returns or low satisfaction scores.
- Promote High Satisfaction Products: Increase marketing efforts for products with high customer satisfaction to boost sales.
- Segmentation-Based Marketing: Tailor marketing efforts to different customer segments based on their purchase behaviors and preferences.
- Feedback-Based Adaptation: Regularly incorporate customer feedback to adapt and improve product offerings.
Implementation in Code:
# Hypothetical function to generate business strategies based on analyzed data
def generate_business_strategies(summary):
strategies = []
# Strategy for customer retention
high_value_segments = summary.get('High-Value Segments', [])
low_retention_segments = summary.get('Low Retention Segments', [])
for segment in high_value_segments:
if segment in low_retention_segments:
strategies.append(f"Implement loyalty programs and special discounts for {segment}")
# Strategy for product improvement
low_customer_satisfaction = summary.get('Low Customer Satisfaction', {})
for product, score in low_customer_satisfaction.items():
strategies.append(f"Investigate and improve quality of {product} (Satisfaction Score: {score})")
# Strategy for promoting high satisfaction products
high_customer_satisfaction = summary.get('High Customer Satisfaction', {})
for product, score in high_customer_satisfaction.items():
strategies.append(f"Increase marketing efforts for {product} (Satisfaction Score: {score})")
return strategies
# Generate strategies based on summary insights
business_strategies = generate_business_strategies(data_summary)
# Print out the strategies
for strategy in business_strategies:
print(strategy)
Step 3: Business Strategy Documentation
Document the derived strategies clearly to communicate them to stakeholders or team members. An example markdown format:
Business Strategy Documentation Example
1. Customer Retention
- Target Segments: Segment_3, Segment_5
- Plan: Implement loyalty programs and offer special discounts to increase retention rates.
2. Product Improvement
- Target Products: Product_10, Product_33
- Plan: Investigate reasons for low satisfaction and improve product quality.
3. Promote High Satisfaction Products
- Target Products: Product_23 (Satisfaction Score: 4.8), Product_17 (Satisfaction Score: 4.7)
- Plan: Increase marketing efforts to boost sales of highly rated products.
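If you want to produce this documentation directly from the notebook, you can write the generated strategies to a markdown file and share it with stakeholders. A minimal sketch, reusing the business_strategies list from Step 2; the output filename is an assumption:
# Write the generated strategies to a markdown file for sharing
doc_lines = ["# Business Strategy Documentation", ""]
doc_lines += [f"{i}. {strategy}" for i, strategy in enumerate(business_strategies, start=1)]

with open('business_strategies.md', 'w') as f:
    f.write("\n".join(doc_lines))

print("Strategy documentation saved to business_strategies.md")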
By following this structured approach, you can effectively derive, implement, and document business strategies based on your data analysis, making them actionable and impactful for your organization.
This marks the end of the practical steps for deriving business strategies from analytics within your project using Python in Google Colab.
Final Project and Future Directions
Final Project
To conclude this project, compile all the work we have done into a cohesive report and presentation. Summarize key findings and actionable insights derived from the customer data analysis. The following Python code demonstrates how to compile the results into a final report and visualization:
import pandas as pd
import matplotlib.pyplot as plt
from fpdf import FPDF
# Load processed data
df = pd.read_csv('processed_customer_data.csv')
# Summarize EDA insights
summary_stats = df.describe()
print("Summary Statistics:\n", summary_stats)
# Visualize Customer Segmentation
fig, ax = plt.subplots()
ax.scatter(df['Segment1'], df['Segment2'], c=df['Segment_Label'])
ax.set_title('Customer Segmentation')
ax.set_xlabel('Segment1')
ax.set_ylabel('Segment2')
plt.savefig('customer_segmentation.png')
plt.show()
# Exporting Final Report as PDF
class PDF(FPDF):
def header(self):
self.set_font('Arial', 'B', 12)
self.cell(0, 10, 'Customer Data Analysis Final Report', 0, 1, 'C')
def footer(self):
self.set_y(-15)
self.set_font('Arial', 'I', 8)
self.cell(0, 10, 'Page %s' % self.page_no(), 0, 0, 'C')
def chapter_title(self, title):
self.set_font('Arial', 'B', 12)
self.cell(0, 10, title, 0, 1, 'L')
def chapter_body(self, body):
self.set_font('Arial', '', 12)
self.multi_cell(0, 10, body)
pdf = PDF()
pdf.add_page()
# Adding EDA Summary
pdf.chapter_title('Summary Statistics')
pdf.chapter_body(str(summary_stats))
# Adding Customer Segmentation
pdf.chapter_title('Customer Segmentation')
pdf.image('customer_segmentation.png', x=10, y=pdf.get_y() + 10, w=0, h=100)
pdf.output('Final_Report.pdf')
print("Final Report Generated: 'Final_Report.pdf'")
Future Directions
To further enhance the insights and value derived from the customer data, consider the following directions:
- Real-Time Data Integration: Implement real-time data processing pipelines, for example using tools like Apache Kafka and Spark. This allows for immediate insights and action based on the latest data.
- Advanced Predictive Analytics: Incorporate advanced machine learning algorithms, such as random forests, gradient boosting machines, or deep learning, to improve predictive accuracy and derive more nuanced insights.
- Customer Lifetime Value (CLV): Calculate the Customer Lifetime Value to understand the long-term value of customers and tailor strategies accordingly (a minimal sketch follows this list).
- Recommendation Systems: Implement recommendation systems to personalize product suggestions for customers based on their purchase history and segmentation data.
- A/B Testing and Experimentation: Conduct A/B testing on different marketing strategies, UI changes, or new products to measure the impact and optimize business decisions.
- Advanced Visualization Dashboards: Develop dynamic dashboards using tools like Tableau or Power BI to continuously monitor key metrics and visualize data insights interactively.
- Customer Journey Analysis: Map and analyze the entire customer journey to identify critical touchpoints and optimize the customer experience.
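For the CLV direction mentioned above, one simple historical formulation multiplies average order value by purchase rate and an assumed customer lifespan. A minimal sketch on top of the purchase-level data used in the RFM unit (customer_id, purchase_amount, purchase_date columns); the two-year horizon is an assumption:
import pandas as pd

# Simple historical CLV: average order value x orders per year x assumed 2-year horizon
purchases = pd.read_csv('purchase_history_cleaned.csv', parse_dates=['purchase_date'])

# Overall time span covered by the data, in years (guard against very short spans)
span_years = (purchases['purchase_date'].max() - purchases['purchase_date'].min()).days / 365.25
span_years = max(span_years, 1)

per_customer = purchases.groupby('customer_id')['purchase_amount'].agg(['mean', 'count'])
per_customer['orders_per_year'] = per_customer['count'] / span_years
per_customer['clv'] = per_customer['mean'] * per_customer['orders_per_year'] * 2  # assumed 2-year horizon

print(per_customer[['clv']].sort_values('clv', ascending=False).head())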
By incorporating these advanced techniques and tools, you can significantly enhance the value and insights derived from your customer data analysis.