Data Loading and Cleaning in Python
In this blog, we will work through a detailed example of customer segmentation analysis using Python as our data analysis language.
To create a simple dataset for this example, you can use either ChatGPT or EDNA Chat within Data Mentor.
Make sure to upload and use the correct file name in your Python code.
If you need any of the code to be explained further, just click the ‘Code Explainer’ button within any code block.
Setup Instructions
Install Libraries: Ensure you have the necessary libraries installed. You can install them using pip if needed.
pip install pandas numpy scikit-learn matplotlib seaborn scipy
Implementation
1. Loading the Data
import pandas as pd
# Load the CSV data into a DataFrame
file_path = 'path_to_your_file.csv'
df = pd.read_csv(file_path)
# Display the first few rows of the DataFrame
print(df.head())
2. Exploring and Cleaning the Data
# Exploring basic information about the DataFrame
print(df.info())
print(df.describe())
# Check for missing values
missing_values = df.isnull().sum()
print("Missing Values:\n", missing_values)
# Handling missing values (choose one of the following options)
# Option 1: Drop rows with missing values
df_cleaned = df.dropna()
# Option 2: Fill missing values with the column mean (numerical columns only)
df_cleaned = df.fillna(df.mean(numeric_only=True))
# Handling categorical variables (if any)
# Convert categorical variables to dummy/indicator variables
df_cleaned = pd.get_dummies(df_cleaned, drop_first=True)
# Output cleaned data information
print(df_cleaned.info())
3. Normalize the Data (Optional)
from sklearn.preprocessing import StandardScaler
# Initialize the scaler
scaler = StandardScaler()
# Assume 'cols_to_scale' contains the names of the columns to be scaled
cols_to_scale = ['Column1', 'Column2', 'Column3']
df_cleaned[cols_to_scale] = scaler.fit_transform(df_cleaned[cols_to_scale])
# Display the first few rows of the cleaned and scaled DataFrame
print(df_cleaned.head())
This implementation includes loading data from a CSV file, exploring the data, cleaning it by handling missing values and categorical variables, and optionally normalizing numerical features. This prepares the data for subsequent analysis and machine learning tasks.
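If you expect to repeat these steps on new extracts, it can help to wrap them into a single helper function. Below is a minimal sketch combining the steps above, assuming you go with the mean-imputation option; the function name and docstring are our own:
import pandas as pd

def load_and_clean(file_path):
    """Load a CSV, mean-impute numeric missing values, and one-hot encode categoricals."""
    df = pd.read_csv(file_path)
    df = df.fillna(df.mean(numeric_only=True))  # impute numeric columns with the column mean
    df = pd.get_dummies(df, drop_first=True)    # encode categorical columns as indicators
    return df

# Usage
# df_cleaned = load_and_clean('path_to_your_file.csv')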
Exploratory Data Analysis (EDA) and Clustering Implementation
1. Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
2. Load Clean Data
# Assuming data is loaded and cleaned in df
# df = pd.read_csv('cleaned_customer_data.csv') # Example of how you might load the data
# Display first few rows to inspect the dataframe
print(df.head())
3. Statistical Summary
# Summary statistics of the dataset
print(df.describe())
print(df.info())
4. Visualize Data Distributions
# Visualizing distributions of numerical features
num_columns = df.select_dtypes(include=['float64', 'int64']).columns
df[num_columns].hist(bins=15, figsize=(15, 10), layout=(5, 3))
plt.tight_layout()
plt.show()
5. Visualize Relationships
# Pairplot of numerical features
sns.pairplot(df[num_columns])
plt.show()
# Correlation matrix of the numerical features
corr_matrix = df[num_columns].corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()
6. Preprocessing for Clustering
# Scaling the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df[num_columns])
# Visualizing scaled data distributions
scaled_df = pd.DataFrame(scaled_data, columns=num_columns)
scaled_df.hist(bins=15, figsize=(15, 10), layout=(5, 3))
plt.tight_layout()
plt.show()
7. Determine Optimal Number of Clusters
# Using the elbow method to find the optimal number of clusters
sse = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(scaled_data)
    sse.append(kmeans.inertia_)
plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), sse, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.title('Elbow Method For Optimal k')
plt.show()
# Using silhouette score to validate
sil_scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(scaled_data)
    clusters = kmeans.predict(scaled_data)
    sil_scores.append(silhouette_score(scaled_data, clusters))
plt.figure(figsize=(8, 5))
plt.plot(range(2, 11), sil_scores, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Scores For Optimal k')
plt.show()
8. Fit K-Means and Assign Cluster Labels
# Fit the KMeans algorithm based on the optimal number of clusters found
optimal_clusters = 4 # Assume we found 4 as the optimal number from previous steps
kmeans = KMeans(n_clusters=optimal_clusters, random_state=42)
df['Cluster'] = kmeans.fit_predict(scaled_data)
# Inspect the cluster assignments
print(df['Cluster'].value_counts())
9. Derive Actionable Insights
# Visualize cluster centers of the numerical features
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)
cluster_df = pd.DataFrame(cluster_centers, columns=num_columns)
print(cluster_df)
# Visualizing Cluster Distributions
for column in num_columns:
    plt.figure(figsize=(8, 4))
    sns.boxplot(x='Cluster', y=column, data=df)
    plt.title(f'Cluster vs {column}')
    plt.show()
10. Conclusion
With the clusters derived, analyze the characteristics of each cluster to identify patterns in purchasing behavior, demographics, and engagement levels. Use these insights to tailor marketing strategies, product offerings, and customer engagement plans.
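For example, once the profiles are clear, you can attach human-readable segment names to the cluster IDs. A minimal sketch with hypothetical labels (yours should come from your own cluster profiles):
# Hypothetical segment names; derive yours from the cluster profiles above
segment_names = {
    0: 'High-value loyalists',
    1: 'Occasional bargain hunters',
    2: 'New or low-engagement customers',
    3: 'At-risk former regulars',
}
df['Segment'] = df['Cluster'].map(segment_names)
print(df['Segment'].value_counts())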
This practical implementation provides a complete EDA and clustering analysis process using Python. You can now apply this to your cleaned dataset to gain actionable insights.
Handling Missing Data and Outliers
Import Required Libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
Example Data Loading (Assuming DataFrame df is already loaded)
# Sample loading code if needed
# df = pd.read_csv('your_data.csv')
Handling Missing Data
# Identify missing values
missing_data = df.isnull().sum()
print("Missing data per column:\n", missing_data)
# Drop columns with a high percentage of missing values if needed
threshold = 0.5 # example threshold
df = df[df.columns[df.isnull().mean() < threshold]]
# Impute missing values
# Numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
# Categorical columns
categorical_cols = df.select_dtypes(include=[object]).columns
df[categorical_cols] = df[categorical_cols].fillna(df[categorical_cols].mode().iloc[0])
Identifying and Handling Outliers
# Detect outliers using the IQR method
Q1 = df[numeric_cols].quantile(0.25)
Q3 = df[numeric_cols].quantile(0.75)
IQR = Q3 - Q1
outliers = ((df[numeric_cols] < (Q1 - 1.5 * IQR)) | (df[numeric_cols] > (Q3 + 1.5 * IQR))).any(axis=1)
# Remove outliers
df = df[~outliers]
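Dropping outlier rows can discard a lot of data. An alternative worth considering is to cap (winsorize) values at the IQR fences instead; a quick sketch reusing Q1, Q3, and IQR from above:
# Cap extreme values at the IQR fences instead of removing the rows
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR
df[numeric_cols] = df[numeric_cols].clip(lower=lower_fence, upper=upper_fence, axis=1)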
Data Normalization
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
K-Means Clustering
# Choose the number of clusters
k = 5
kmeans = KMeans(n_clusters=k, random_state=42)
df['Cluster'] = kmeans.fit_predict(df[numeric_cols])
Visualization of Clusters
# Example visualization; replace 'feature_x' and 'feature_y' with two numeric columns from your data
plt.figure(figsize=(10, 6))
sns.scatterplot(x='feature_x', y='feature_y', hue='Cluster', data=df, palette='Set1', legend='full')
plt.title('Customer Segmentation')
plt.show()
Deriving Actionable Insights
# Example: calculating the average value of each numeric feature per cluster
cluster_insights = df.groupby('Cluster').mean(numeric_only=True)
print(cluster_insights)
This implementation handles missing data by imputing values, removes outliers using the IQR method, normalizes the data, and segments customers using K-Means clustering, providing a visualization of the results along with some basic cluster insights.
Dimensionality Reduction with PCA
# Importing necessary libraries
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Assuming `data` is your cleaned dataframe containing purchasing behavior, demographics, and engagement levels
# Separate the features from the target variable, if you have one
features = data.drop(columns=['target'], errors='ignore') # Drop target column if it exists
# Standardizing the features
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)
# Applying PCA
pca = PCA(n_components=2) # Reducing to 2 dimensions for visualization purposes, modify as needed
principal_components = pca.fit_transform(scaled_features)
# Creating a DataFrame with the principal components
pca_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])
# Adding the target variable back to the dataframe, if it exists
if 'target' in data.columns:
    pca_df = pd.concat([pca_df, data['target'].reset_index(drop=True)], axis=1)
# Proceed to clustering
from sklearn.cluster import KMeans
# Defining the KMeans model
kmeans = KMeans(n_clusters=3, random_state=42) # Adjust the number of clusters as needed; random_state keeps results reproducible
kmeans.fit(pca_df[['PC1', 'PC2']])
# Adding cluster labels to the PCA dataframe
pca_df['Cluster'] = kmeans.labels_
# Deriving actionable insights via cluster analysis
# You can group by clusters and analyze the means or other statistics of each original feature
cluster_insights = data.copy()
cluster_insights['Cluster'] = kmeans.labels_
cluster_summary = cluster_insights.groupby('Cluster').mean(numeric_only=True)
# Displaying the cluster summary for actionable insights
print(cluster_summary)
- Feature Scaling: Standardize the features, a necessary step before applying PCA.
- PCA: Perform Principal Component Analysis to reduce the number of dimensions.
- Clustering: Use KMeans clustering on the principal components to segment the customers.
- Insights: Group data by clusters and compute summary statistics for actionable insights.
Make sure to adjust the number of principal components and clusters according to your specific project needs.
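If you are unsure how many components to keep, the cumulative explained variance ratio is a quick guide. A short sketch, assuming scaled_features from the code above:
import numpy as np
from sklearn.decomposition import PCA

# Fit PCA with all components to see how much variance each one explains
pca_full = PCA().fit(scaled_features)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
for i, var in enumerate(cumulative, start=1):
    print(f"{i} component(s): {var:.1%} of variance explained")
# A common rule of thumb is to keep enough components to cover roughly 80-90%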
Part 5: Introduction to Clustering Algorithms
Imports and Data Preparation
Ensure you have the essential libraries imported and data ready for clustering analysis. Below is a sample data preparation step assuming the data has been cleaned and transformed adequately as per previous parts.
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Load the cleaned and pre-processed dataset
data = pd.read_csv('cleaned_data.csv')
# Select the features for clustering (example column names; replace with your own)
features = ['purchasing_behavior', 'demographics', 'engagement_levels']
# Feature scaling
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data[features])
K-Means Clustering
Implement K-Means Algorithm
K-Means is one of the most commonly used clustering algorithms. Below is the implementation using the K-Means method from the scikit-learn library.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Define the number of clusters
num_clusters = 5
# Create KMeans model
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
# Fit the model to the scaled features
kmeans.fit(scaled_features)
# Predict the clusters
clusters = kmeans.predict(scaled_features)
# Add cluster labels to the original dataset
data['Cluster'] = clusters
Optimal Number of Clusters: Elbow Method
Use the Elbow Method to determine the optimal number of clusters by plotting the within-cluster sum of squares (WCSS).
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=42)
    kmeans.fit(scaled_features)
    wcss.append(kmeans.inertia_)
# Plot the Elbow graph
plt.figure(figsize=(10, 5))
plt.plot(range(1, 11), wcss, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
Hierarchical Clustering
Dendrogram for Hierarchical Clustering
Hierarchical clustering can work well for smaller datasets. We use scipy’s dendrogram to determine the number of clusters.
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(10, 7))
plt.title("Customer Dendrograms")
dend = shc.dendrogram(shc.linkage(scaled_features, method='ward'))
plt.axhline(y=6, color='r', linestyle='--')  # Example cut-off height; adjust based on your dendrogram
plt.show()
Applying Hierarchical Clustering
Using the Agglomerative Clustering from scikit-learn to determine the clusters.
from sklearn.cluster import AgglomerativeClustering
# Create the model
# Note: in scikit-learn >= 1.2 the 'affinity' parameter is named 'metric'
hc = AgglomerativeClustering(n_clusters=num_clusters, metric='euclidean', linkage='ward')
# Fit the model to the scaled features
labels = hc.fit_predict(scaled_features)
# Add cluster labels to the original dataset
data['Cluster_HC'] = labels
Analyzing and Interpreting Cluster Results
Summarize the clusters to gain actionable insights.
# Display the first few rows of the dataset with cluster labels
print(data.head())
# Calculate the mean values of each feature for each cluster
cluster_means = data.groupby('Cluster').mean(numeric_only=True)
print(cluster_means)
# If needed, visualize the cluster distribution
import seaborn as sns
# Scatter plot for visualizing clusters (example with two dimensions)
sns.scatterplot(x='purchasing_behavior', y='engagement_levels', hue='Cluster', data=data, palette='viridis')
plt.title('Cluster Analysis')
plt.show()
This implementation segments customers into distinct groups based on their purchasing behavior, demographics, and engagement levels. The clustering insights can inform targeted marketing strategies or personalized customer experiences.
Customer Segmentation with K-Means
Below is the practical implementation of customer segmentation using K-Means clustering in Python. This assumes you have already completed data loading and cleaning, exploratory data analysis, handling missing data and outliers, and dimensionality reduction with PCA.
Step 1: Import Necessary Libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Standardize the Data
Ensure your dataset (let's call it data) is scaled, since K-Means clustering is affected by the scale of the features.
# Assuming `data` is your DataFrame after PCA or other preprocessing
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
Step 3: Determine the Optimal Number of Clusters
Use the Elbow method and Silhouette score to determine the optimal number of clusters. The silhouette score compares each point's average distance to its own cluster (a) with its average distance to the nearest other cluster (b), computed as (b - a) / max(a, b); values closer to 1 indicate better-separated clusters.
# Elbow method
sse = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(scaled_data)
    sse.append(kmeans.inertia_)
plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), sse, marker='o')
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('SSE')
plt.show()
# Silhouette score
silhouette_scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(scaled_data)
    silhouette_scores.append(silhouette_score(scaled_data, labels))
plt.figure(figsize=(8, 5))
plt.plot(range(2, 11), silhouette_scores, marker='o')
plt.title('Silhouette Scores')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.show()
Step 4: Apply K-Means Clustering
Choose the optimal number of clusters (let's say k=4, based on the Elbow and Silhouette results).
optimal_clusters = 4
kmeans = KMeans(n_clusters=optimal_clusters, random_state=42)
cluster_labels = kmeans.fit_predict(scaled_data)
# Add cluster labels to the original data
data['Cluster'] = cluster_labels
Step 5: Analyze the Clusters
Derive insights from the clustered data.
# Visualizing the clusters
sns.pairplot(data, hue='Cluster', palette='viridis')
plt.show()
# Summary Statistics of clusters
cluster_summary = data.groupby('Cluster').mean(numeric_only=True)
print(cluster_summary)
Step 6: Actionable Insights
Translate the clustering findings into actionable insights.
# Suppose we have demographic features like 'Age' and 'Annual Income'
cluster_insights = data.groupby('Cluster').agg({
    'Age': ['mean', 'median'],
    'Annual Income': ['mean', 'median'],
    # Include other relevant features
})
print(cluster_insights)
Conclusion
By following these steps, you can segment customers based on their purchasing behavior, demographics, and engagement levels and derive actionable insights from the clustering analysis. Incorporate these insights into improving targeted marketing strategies, customer retention programs, and overall business decision-making.
Remember to investigate your clusters deeply to understand the specific characteristics and needs of each group.
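One simple way to dig deeper is to compare each cluster's feature means against the overall means; ratios far from 1 flag what makes a cluster distinctive. A small sketch under the same assumptions as above (numeric features in data, labels in the Cluster column):
# Ratio of cluster means to overall means; values far from 1 mark distinctive traits
numeric_data = data.select_dtypes(include='number').drop(columns=['Cluster'])
overall_means = numeric_data.mean()
relative_profile = data.groupby('Cluster')[numeric_data.columns].mean() / overall_means
print(relative_profile.round(2))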
Customer Segmentation with Hierarchical Clustering
Step 1: Import Necessary Libraries
import pandas as pd
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Load and Prepare Data
Note: As mentioned, data loading, cleaning, and preprocessing are assumed to be completed in earlier sections of your project.
# Assume df is the DataFrame after preprocessing steps
df = pd.read_csv('processed_customer_data.csv')
# Selecting relevant features for clustering
features = ['PurchaseAmount', 'Age', 'EngagementScore']
X = df[features].values
Step 3: Perform Hierarchical Clustering
# Computing the hierarchical clustering using Ward's method
Z = linkage(X, method='ward')
Step 4: Plot Dendrogram
plt.figure(figsize=(10, 7))
plt.title("Customer Dendrogram")
dendrogram(Z)
plt.xlabel('Customer')
plt.ylabel('Euclidean distances')
plt.show()
Step 5: Determine the Optimal Number of Clusters
# Determine the number of clusters by setting a distance threshold
max_d = 50
clusters = fcluster(Z, max_d, criterion='distance')
# Alternatively, specifying the number of clusters directly
k = 4
clusters_k = fcluster(Z, k, criterion='maxclust')
# Add cluster labels to the original DataFrame
df['Cluster'] = clusters_k
Step 6: Analyze and Visualize the Segmentation
# Grouping data by clusters to interpret the results
cluster_summary = df.groupby('Cluster').mean(numeric_only=True)
# Visualizing the clusters
cluster_summary.plot(kind='bar', figsize=(10, 6))
plt.title("Cluster Summary")
plt.ylabel("Average Value")
plt.xlabel("Cluster")
plt.show()
# Visualizing clusters in a scatter plot of two features
sns.scatterplot(data=df, x='PurchaseAmount', y='Age', hue='Cluster', palette='Set2')
plt.title("Customer Segments")
plt.show()
Step 7: Derive Actionable Insights
# Assuming we output cluster summaries for business interpretation
print(cluster_summary)
# Example insights could be derived from mean values of each cluster
for cluster in cluster_summary.index:
    print(f"Cluster {cluster}:")
    print(cluster_summary.loc[cluster])
    print("\n")
Step 8: Save the Segmented Data
# Save the DataFrame with cluster labels
df.to_csv('customer_segments.csv', index=False)
This implementation clusters customers into segments based on their purchasing behavior, demographics, and engagement levels using Hierarchical Clustering, then visualizes and analyzes the resulting segments for actionable insights.
Interpreting and Visualizing Clusters
This section focuses on interpreting and visualizing clusters after applying clustering techniques such as K-Means or Hierarchical Clustering. This implementation will help derive actionable insights based on customer segmentation.
Step 1: Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
Step 2: Fit the Clustering Algorithm
If the clustering model (e.g., K-Means) has not been fitted in an earlier step, fit it now:
# Assuming `data` is your preprocessed dataframe and `kmeans` is the KMeans object
kmeans = KMeans(n_clusters=5, random_state=42)
data['Cluster'] = kmeans.fit_predict(data)
Alternatively, if Hierarchical Clustering was used, you can fit and predict similarly using the appropriate sklearn object.
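For reference, a minimal sketch of that hierarchical alternative with scikit-learn's AgglomerativeClustering, under the same assumption that data is a fully numeric, preprocessed DataFrame:
from sklearn.cluster import AgglomerativeClustering

# Ward linkage minimizes within-cluster variance, similar in spirit to K-Means
hc = AgglomerativeClustering(n_clusters=5, linkage='ward')
data['Cluster'] = hc.fit_predict(data.drop(columns=['Cluster'], errors='ignore'))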
Step 3: Interpret Clusters – Profiling
# Calculate mean values of features for each cluster
cluster_profile = data.groupby('Cluster').mean()
# Display the cluster profile
print(cluster_profile)
Step 4: Visualize Clusters
Using PCA for Dimensionality Reduction
pca = PCA(n_components=2)
data_pca = pca.fit_transform(data.drop('Cluster', axis=1))
# Create a DataFrame for visualization
pca_df = pd.DataFrame(data_pca, columns=['PCA1', 'PCA2'])
pca_df['Cluster'] = data['Cluster'].values  # .values avoids index-alignment issues
# Visualize using Seaborn
plt.figure(figsize=(10, 8))
sns.scatterplot(x='PCA1', y='PCA2', hue='Cluster', data=pca_df, palette='viridis')
plt.title('Customer Segments Visualization with PCA')
plt.show()
Visualizing Feature Importance for Each Cluster
# Melt the dataframe for better visualization
cluster_profile_melted = cluster_profile.reset_index().melt(id_vars='Cluster', var_name='Feature', value_name='Mean')
plt.figure(figsize=(15, 8))
sns.barplot(data=cluster_profile_melted, x='Feature', y='Mean', hue='Cluster', palette='viridis')
plt.title('Feature Importance Across Clusters')
plt.xticks(rotation=45)
plt.show()
Visualizing Clusters Using Pairplot
# This can be computationally expensive with many features
# Using `hue` to visualize clusters in feature pairs
sns.pairplot(data, hue='Cluster', palette='viridis')
plt.suptitle('Cluster Pairplot', y=1.02)
plt.show()
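If the pairplot is slow on a large dataset, plotting a random sample of rows is usually enough to see the cluster structure:
# Pairplot on a random sample to keep rendering time manageable
sample = data.sample(n=min(len(data), 500), random_state=42)
sns.pairplot(sample, hue='Cluster', palette='viridis')
plt.show()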
By following these steps, you can effectively interpret and visualize the clusters of your customer segmentation analysis, allowing you to derive actionable insights regarding purchasing behavior, demographics, and engagement levels.