Customer Segmentation Analysis Using Python



Data Loading and Cleaning in Python

In this blog, we will work through a detailed example of customer segmentation analysis, using Python as our data analysis language.

To create a simple dataset for this example, you can use either ChatGPT or EDNA Chat within Data Mentor.

Make sure to upload the file and reference the correct file name in your Python code.
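
If you would rather generate the sample dataset locally, the short sketch below creates a synthetic customer CSV with pandas and NumPy. The column names (Age, AnnualIncome, SpendingScore, EngagementScore) and the file name customer_data.csv are illustrative assumptions; adjust them to match the rest of your project.

import numpy as np
import pandas as pd

# Reproducible random data for a hypothetical customer base
rng = np.random.default_rng(42)
n_customers = 200

df = pd.DataFrame({
    'CustomerID': range(1, n_customers + 1),
    'Age': rng.integers(18, 70, n_customers),
    'AnnualIncome': rng.normal(50000, 15000, n_customers).round(2),
    'SpendingScore': rng.integers(1, 101, n_customers),
    'EngagementScore': rng.uniform(0, 1, n_customers).round(3),
})

# Save to CSV; point file_path in the loading step below at this file
df.to_csv('customer_data.csv', index=False)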

If you need any of the code to be explained further, just click the ‘Code Explainer’ button within any code block.

Setup Instructions

Install Libraries: Ensure you have the necessary libraries installed. You can install them using pip if needed.

pip install pandas numpy scikit-learn matplotlib seaborn scipy

    Implementation

    1. Loading the Data

    import pandas as pd
    
    # Load the CSV data into a DataFrame
    file_path = 'path_to_your_file.csv'
    df = pd.read_csv(file_path)
    
    # Display the first few rows of the DataFrame
    print(df.head())
    

    2. Exploring and Cleaning the Data

    # Exploring basic information about the DataFrame
    df.info()  # info() prints its output directly; no print() wrapper needed
    print(df.describe())
    
    # Check for missing values
    missing_values = df.isnull().sum()
    print("Missing Values:\n", missing_values)
    
    # Handling missing values
    # Option 1: Drop rows with missing values
    df_cleaned = df.dropna()
    
    # Option 2: Fill missing values with the column mean (numerical columns only)
    # Note: use one option or the other; as written, Option 2 overwrites Option 1
    df_cleaned = df.fillna(df.mean(numeric_only=True))
    
    # Handling categorical variables (if any)
    # Convert categorical variables to dummy/indicator variables
    df_cleaned = pd.get_dummies(df_cleaned, drop_first=True)
    
    # Output cleaned data information
    df_cleaned.info()  # info() prints directly; no print() needed
    

    3. Normalize the Data (Optional)

    from sklearn.preprocessing import StandardScaler
    
    # Initialize the scaler
    scaler = StandardScaler()
    
    # Assume 'cols_to_scale' contains the names of the columns to be scaled
    cols_to_scale = ['Column1', 'Column2', 'Column3']
    df_cleaned[cols_to_scale] = scaler.fit_transform(df_cleaned[cols_to_scale])
    
    # Display the first few rows of the cleaned and scaled DataFrame
    print(df_cleaned.head())
    

    This implementation includes loading data from a CSV file, exploring the data, cleaning it by handling missing values and categorical variables, and optionally normalizing numerical features. This prepares the data for subsequent analysis and machine learning tasks.
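
    To make these steps easy to reuse on future extracts, they can be collected into one helper function. The sketch below is just one possible arrangement of the steps above, assuming mean imputation is the chosen missing-value strategy:

    import pandas as pd

    def load_and_clean(file_path):
        """Load a CSV and apply the cleaning steps from this section."""
        df = pd.read_csv(file_path)

        # Impute numeric columns with the column mean
        numeric_cols = df.select_dtypes(include='number').columns
        df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

        # One-hot encode any remaining categorical columns
        df = pd.get_dummies(df, drop_first=True)
        return df

    df_cleaned = load_and_clean('path_to_your_file.csv')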

    Exploratory Data Analysis (EDA) and Clustering Implementation

    1. Import Libraries

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    

    2. Load the Cleaned Data

    # Assuming data is loaded and cleaned in df
    # df = pd.read_csv('cleaned_customer_data.csv')  # Example of how you might load the data
    
    # Display first few rows to inspect the dataframe
    print(df.head())
    

    3. Statistical Summary

    # Summary statistics of the dataset
    print(df.describe())
    df.info()  # info() prints directly; no print() needed
    

    4. Visualize Data Distributions

    # Visualizing distributions of numerical features
    num_columns = df.select_dtypes(include=['float64', 'int64']).columns
    df[num_columns].hist(bins=15, figsize=(15, 10), layout=(5, 3))
    plt.tight_layout()
    plt.show()
    

    5. Visualize Relationships

    # Pairplot of numerical features
    sns.pairplot(df[num_columns])
    plt.show()
    
    # Correlation matrix
    corr_matrix = df[num_columns].corr()  # restrict to numeric columns
    sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
    plt.show()
    

    6. Preprocessing for Clustering

    # Scaling the data
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(df[num_columns])
    
    # Visualizing scaled data distributions
    scaled_df = pd.DataFrame(scaled_data, columns=num_columns)
    scaled_df.hist(bins=15, figsize=(15, 10), layout=(5, 3))
    plt.tight_layout()
    plt.show()
    

    7. Determine Optimal Number of Clusters

    # Using the elbow method to find the optimal number of clusters
    sse = []
    for k in range(1, 11):
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(scaled_data)
        sse.append(kmeans.inertia_)
    
    plt.figure(figsize=(8, 5))
    plt.plot(range(1, 11), sse, marker='o')
    plt.xlabel('Number of clusters')
    plt.ylabel('SSE')
    plt.title('Elbow Method For Optimal k')
    plt.show()
    
    # Using silhouette score to validate
    sil_scores = []
    for k in range(2, 11):
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(scaled_data)
        clusters = kmeans.predict(scaled_data)
        sil_scores.append(silhouette_score(scaled_data, clusters))
    
    plt.figure(figsize=(8, 5))
    plt.plot(range(2, 11), sil_scores, marker='o')
    plt.xlabel('Number of clusters')
    plt.ylabel('Silhouette Score')
    plt.title('Silhouette Scores For Optimal k')
    plt.show()
    

    8. Fit K-Means and Assign Cluster Labels

    # Fit the KMeans algorithm based on the optimal number of clusters found
    optimal_clusters = 4  # Assume we found 4 as the optimal number from previous steps
    
    kmeans = KMeans(n_clusters=optimal_clusters, random_state=42)
    df['Cluster'] = kmeans.fit_predict(scaled_data)
    
    # Inspect the cluster assignments
    print(df['Cluster'].value_counts())
    

    9. Derive Actionable Insights

    # Visualize cluster centers of the numerical features
    cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)
    cluster_df = pd.DataFrame(cluster_centers, columns=num_columns)
    print(cluster_df)
    
    # Visualizing Cluster Distributions
    for column in num_columns:
        plt.figure(figsize=(8, 4))
        sns.boxplot(x='Cluster', y=column, data=df)
        plt.title(f'Cluster vs {column}')
        plt.show()
    

    10. Conclusion

    With the clusters derived, analyze the characteristics of each cluster to identify patterns in purchasing behavior, demographics, and engagement levels. Use these insights to tailor marketing strategies, product offerings, and customer engagement plans.
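
    As a sketch of how these insights might be operationalized, you can map cluster IDs to descriptive persona labels once you have inspected the cluster profiles. The labels below are illustrative assumptions, not output of the analysis:

    # Hypothetical persona labels chosen after inspecting cluster_df;
    # replace them with names that match your own cluster profiles
    persona_map = {
        0: 'High-value frequent buyers',
        1: 'Price-sensitive occasional buyers',
        2: 'New or low-engagement customers',
        3: 'Loyal mid-spend customers',
    }
    df['Persona'] = df['Cluster'].map(persona_map)

    # Persona sizes as a share of the customer base
    print(df['Persona'].value_counts(normalize=True).round(2))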

    This practical implementation provides a complete EDA and clustering analysis process using Python. You can now apply this to your cleaned dataset to gain actionable insights.

    Handling Missing Data and Outliers

    Import Required Libraries

    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt
    import seaborn as sns
    

    Example Data Loading (Assuming DataFrame df is already loaded)

    # Sample loading code if needed
    # df = pd.read_csv('your_data.csv')
    

    Handling Missing Data

    # Identify missing values
    missing_data = df.isnull().sum()
    print("Missing data per column:\n", missing_data)
    
    # Drop columns with a high percentage of missing values if needed
    threshold = 0.5  # example threshold
    df = df[df.columns[df.isnull().mean() < threshold]]
    
    # Impute missing values
    # Numeric columns
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
    
    # Categorical columns
    categorical_cols = df.select_dtypes(include=[object]).columns
    df[categorical_cols] = df[categorical_cols].fillna(df[categorical_cols].mode().iloc[0])
    

    Identifying and Handling Outliers

    # Detect outliers using the IQR method
    Q1 = df[numeric_cols].quantile(0.25)
    Q3 = df[numeric_cols].quantile(0.75)
    IQR = Q3 - Q1
    
    outliers = ((df[numeric_cols] < (Q1 - 1.5 * IQR)) | (df[numeric_cols] > (Q3 + 1.5 * IQR))).any(axis=1)
    
    # Remove outliers
    df = df[~outliers]
    

    Data Normalization

    scaler = StandardScaler()
    df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
    

    K-Means Clustering

    # Choose the number of clusters
    k = 5
    
    kmeans = KMeans(n_clusters=k, random_state=42)
    df['Cluster'] = kmeans.fit_predict(df[numeric_cols])
    

    Visualization of Clusters

    # Example visualization (replace 'feature_x' and 'feature_y' with two of your numeric columns)
    plt.figure(figsize=(10, 6))
    sns.scatterplot(x='feature_x', y='feature_y', hue='Cluster', data=df, palette='Set1', legend='full')
    plt.title('Customer Segmentation')
    plt.show()
    

    Deriving Actionable Insights

    # Example: Calculating average values in each cluster
    cluster_insights = df.groupby('Cluster').mean(numeric_only=True)
    print(cluster_insights)
    

    This implementation handles missing data by imputing values, removes outliers using the IQR method, normalizes the data, and segments customers using K-Means clustering, providing a visualization of the results along with some basic cluster insights.

    Dimensionality Reduction with PCA

    This part reduces the feature space with Principal Component Analysis (PCA) before clustering.

    # Importing necessary libraries
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    import pandas as pd
    
    # Assuming `data` is your cleaned dataframe containing purchasing behavior, demographics, and engagement levels
    # Separate the features from the target variable, if you have one
    features = data.drop(columns=['target'], errors='ignore')  # Drop target column if it exists
    
    # Standardizing the features
    scaler = StandardScaler()
    scaled_features = scaler.fit_transform(features)
    
    # Applying PCA
    pca = PCA(n_components=2)  # Reducing to 2 dimensions for visualization purposes, modify as needed
    principal_components = pca.fit_transform(scaled_features)
    
    # Creating a DataFrame with the principal components
    pca_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])
    
    # Adding the target variable back to the dataframe, if it exists
    if 'target' in data.columns:
        pca_df = pd.concat([pca_df, data['target'].reset_index(drop=True)], axis=1)
    
    # Proceed to clustering
    from sklearn.cluster import KMeans
    
    # Defining the KMeans model
    kmeans = KMeans(n_clusters=3, random_state=42)  # Adjust the number of clusters as needed
    kmeans.fit(pca_df[['PC1', 'PC2']])
    
    # Adding cluster labels to the PCA dataframe
    pca_df['Cluster'] = kmeans.labels_
    
    # Deriving actionable insights via cluster analysis
    # You can group by clusters and analyze the means or other statistics of each original feature
    cluster_insights = data.copy()
    cluster_insights['Cluster'] = kmeans.labels_
    cluster_summary = cluster_insights.groupby('Cluster').mean(numeric_only=True)
    
    # Displaying the cluster summary for actionable insights
    print(cluster_summary)
    
    • Feature Scaling: Standardize the features, which is a necessary step before applying PCA.
    • PCA: Perform Principal Component Analysis to reduce the number of dimensions.
    • Clustering: Use KMeans clustering on the principal components to segment the customers.
    • Insights: Group data by clusters and compute summary statistics for actionable insights.

    Make sure to adjust the number of principal components and clusters according to your specific project needs.
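
    One common way to choose the number of components is to inspect the cumulative explained variance ratio. A minimal sketch, assuming the scaled_features array from the block above:

    import numpy as np
    from sklearn.decomposition import PCA

    # Fit PCA without limiting components to inspect the variance profile
    pca_full = PCA().fit(scaled_features)
    cumulative = np.cumsum(pca_full.explained_variance_ratio_)

    # Smallest number of components explaining at least 90% of the variance
    n_components = int(np.argmax(cumulative >= 0.90)) + 1
    print(f"Components needed for 90% variance: {n_components}")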

    Part 5: Introduction to Clustering Algorithms

    Imports and Data Preparation

    Ensure you have the essential libraries imported and data ready for clustering analysis. Below is a sample data preparation step assuming the data has been cleaned and transformed adequately as per previous parts.

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    
    # Load the cleaned and pre-processed dataset
    data = pd.read_csv('cleaned_data.csv')
    
    # Select the features for clustering
    features = ['purchasing_behavior', 'demographics', 'engagement_levels']
    
    # Feature scaling
    scaler = StandardScaler()
    scaled_features = scaler.fit_transform(data[features])
    

    K-Means Clustering

    Implement K-Means Algorithm

    K-Means is one of the most commonly used clustering algorithms. Below is the implementation using the K-Means method from the scikit-learn library.

    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt
    
    # Define the number of clusters
    num_clusters = 5
    
    # Create KMeans model
    kmeans = KMeans(n_clusters=num_clusters, random_state=42)
    
    # Fit the model to the scaled features
    kmeans.fit(scaled_features)
    
    # Predict the clusters
    clusters = kmeans.predict(scaled_features)
    
    # Add cluster labels to the original dataset
    data['Cluster'] = clusters
    

    Optimal Number of Clusters: Elbow Method

    Use the Elbow Method to determine the optimal number of clusters by plotting the within-cluster sum of squares (WCSS).

    wcss = []
    for i in range(1, 11):
        kmeans = KMeans(n_clusters=i, random_state=42)
        kmeans.fit(scaled_features)
        wcss.append(kmeans.inertia_)
    
    # Plot the Elbow graph
    plt.figure(figsize=(10, 5))
    plt.plot(range(1, 11), wcss)
    plt.title('Elbow Method')
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS')
    plt.show()
    

    Hierarchical Clustering

    Dendrogram for Hierarchical Clustering

    Hierarchical clustering can work well for smaller datasets. We use scipy’s dendrogram to determine the number of clusters.

    import scipy.cluster.hierarchy as shc
    
    plt.figure(figsize=(10, 7))
    plt.title("Customer Dendrograms")
    dend = shc.dendrogram(shc.linkage(scaled_features, method='ward'))
    
    plt.axhline(y=6, color='r', linestyle='--')  # example cut-off; set by inspecting the dendrogram
    plt.show()
    

    Applying Hierarchical Clustering

    Use AgglomerativeClustering from scikit-learn to assign the clusters.

    from sklearn.cluster import AgglomerativeClustering
    
    # Create the model (the 'affinity' argument was removed in newer scikit-learn;
    # Ward linkage uses Euclidean distance by default)
    hc = AgglomerativeClustering(n_clusters=num_clusters, linkage='ward')
    
    # Fit the model to the scaled features
    labels = hc.fit_predict(scaled_features)
    
    # Add cluster labels to the original dataset
    data['Cluster_HC'] = labels
    

    Analyzing and Interpreting Cluster Results

    Summarize the clusters to gain actionable insights.

    # Display the first few rows of the dataset with cluster labels
    print(data.head())
    
    # Calculate the mean values of each feature for each cluster
    cluster_means = data.groupby('Cluster').mean(numeric_only=True)
    print(cluster_means)
    
    # If needed, visualize the cluster distribution
    import seaborn as sns
    
    # Scatter plot for visualizing clusters (example with two dimensions)
    sns.scatterplot(x='purchasing_behavior', y='engagement_levels', hue='Cluster', data=data, palette='viridis')
    plt.title('Cluster Analysis')
    plt.show()
    

    This implementation segments customers into distinct groups based on their purchasing behavior, demographics, and engagement levels. The clustering insights can inform targeted marketing strategies or personalized customer experiences.
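
    Since this part produced both K-Means labels (Cluster) and hierarchical labels (Cluster_HC), it can be worth checking how closely the two segmentations agree. A quick sketch using scikit-learn's adjusted Rand index, where 1.0 means identical partitions and values near 0 mean little agreement:

    from sklearn.metrics import adjusted_rand_score

    # Compare the two cluster assignments produced above
    agreement = adjusted_rand_score(data['Cluster'], data['Cluster_HC'])
    print(f"Adjusted Rand index (K-Means vs hierarchical): {agreement:.3f}")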

    Customer Segmentation with K-Means

    Below is the practical implementation of customer segmentation using K-Means clustering in Python. This assumes you have already completed data loading and cleaning, exploratory data analysis, handling missing data and outliers, and dimensionality reduction with PCA.

    Step 1: Import Necessary Libraries

    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score, pairwise_distances_argmin_min
    import matplotlib.pyplot as plt
    import seaborn as sns
    

    Step 2: Standardize the Data

    Ensure your dataset (let’s call it data) is scaled since K-Means clustering is affected by the scale of the features.

    # Assuming `data` is your DataFrame after PCA or other preprocessing
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(data)
    

    Step 3: Determine the Optimal Number of Clusters

    Use the Elbow method and Silhouette score to determine the optimal number of clusters.

    # Elbow method
    sse = []
    for k in range(1, 11):
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(scaled_data)
        sse.append(kmeans.inertia_)
    
    plt.figure(figsize=(8, 5))
    plt.plot(range(1, 11), sse, marker='o')
    plt.title('Elbow Method')
    plt.xlabel('Number of clusters')
    plt.ylabel('SSE')
    plt.show()
    
    # Silhouette score
    silhouette_scores = []
    for k in range(2, 11):
        kmeans = KMeans(n_clusters=k, random_state=42)
        labels = kmeans.fit_predict(scaled_data)
        silhouette_scores.append(silhouette_score(scaled_data, labels))
    
    plt.figure(figsize=(8, 5))
    plt.plot(range(2, 11), silhouette_scores, marker='o')
    plt.title('Silhouette Scores')
    plt.xlabel('Number of clusters')
    plt.ylabel('Silhouette Score')
    plt.show()
    

    Step 4: Apply K-Means Clustering

    Choose the optimal number of clusters (let’s say k=4 from the Elbow and Silhouette method results).

    optimal_clusters = 4
    kmeans = KMeans(n_clusters=optimal_clusters, random_state=42)
    cluster_labels = kmeans.fit_predict(scaled_data)
    
    # Add cluster labels to the original data
    data['Cluster'] = cluster_labels
    

    Step 5: Analyze the Clusters

    Derive insights from the clustered data.

    # Visualizing the clusters
    sns.pairplot(data, hue='Cluster', palette='viridis')
    plt.show()
    
    # Summary Statistics of clusters
    cluster_summary = data.groupby('Cluster').mean(numeric_only=True)
    print(cluster_summary)
    

    Step 6: Actionable Insights

    Translate the clustering findings into actionable insights.

    # Suppose we have demographic features like 'Age' and 'Annual Income'
    cluster_insights = data.groupby('Cluster').agg({
        'Age': ['mean', 'median'],
        'Annual Income': ['mean', 'median'],
        # Include other relevant features
    })
    print(cluster_insights)
    

    Conclusion

    By following these steps, you can segment customers based on their purchasing behavior, demographics, and engagement levels and derive actionable insights from the clustering analysis. Incorporate these insights into improving targeted marketing strategies, customer retention programs, and overall business decision-making.

    Remember to investigate your clusters deeply to understand the specific characteristics and needs of each group.
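
    One way to investigate clusters more deeply is to compare each cluster's feature means against the overall mean, which highlights what makes a segment distinctive. A minimal sketch, assuming the numeric data DataFrame with its Cluster column from above:

    # Mean of each numeric feature per cluster, relative to the overall mean
    overall_mean = data.drop(columns='Cluster').mean(numeric_only=True)
    cluster_means = data.groupby('Cluster').mean(numeric_only=True)

    # Values above 1 indicate a cluster is above average on that feature
    relative_profile = cluster_means / overall_mean
    print(relative_profile.round(2))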

    Customer Segmentation with Hierarchical Clustering

    Step 1: Import Necessary Libraries

    import pandas as pd
    import numpy as np
    from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
    import matplotlib.pyplot as plt
    import seaborn as sns
    

    Step 2: Load and Prepare Data

    Note: As mentioned, data loading, cleaning, and preprocessing are assumed to be completed in earlier sections of your project.

    # Assume df is the DataFrame after preprocessing steps
    df = pd.read_csv('processed_customer_data.csv')
    
    # Selecting relevant features for clustering
    features = ['PurchaseAmount', 'Age', 'EngagementScore']
    X = df[features].values
    

    Step 3: Perform Hierarchical Clustering

    # Computing the hierarchical clustering using Ward's method
    Z = linkage(X, method='ward')
    

    Step 4: Plot Dendrogram

    plt.figure(figsize=(10, 7))
    plt.title("Customer Dendrogram")
    dendrogram(Z)
    plt.xlabel('Customer')
    plt.ylabel('Euclidean distances')
    plt.show()
    

    Step 5: Determine the Optimal Number of Clusters

    # Determine the number of clusters by setting a distance threshold
    max_d = 50
    clusters = fcluster(Z, max_d, criterion='distance')
    
    # Alternatively, specifying the number of clusters directly
    k = 4
    clusters_k = fcluster(Z, k, criterion='maxclust')
    
    # Add cluster labels to the original DataFrame
    df['Cluster'] = clusters_k
    

    Step 6: Analyze and Visualize the Segmentation

    # Grouping data by clusters to interpret the results
    cluster_summary = df.groupby('Cluster').mean(numeric_only=True)
    
    # Visualizing the clusters
    cluster_summary.plot(kind='bar', figsize=(10, 6))
    plt.title("Cluster Summary")
    plt.ylabel("Average Value")
    plt.xlabel("Cluster")
    plt.show()
    
    # Visualizing clusters in a scatter plot of two features
    sns.scatterplot(data=df, x='PurchaseAmount', y='Age', hue='Cluster', palette='Set2')
    plt.title("Customer Segments")
    plt.show()
    

    Step 7: Derive Actionable Insights

    # Assuming we output cluster summaries for business interpretation
    print(cluster_summary)
    
    # Example insights could be derived from mean values of each cluster
    for cluster in cluster_summary.index:
        print(f"Cluster {cluster}:")
        print(cluster_summary.loc[cluster])
        print("\n")
    

    Step 8: Save the Segmented Data

    # Save the DataFrame with cluster labels
    df.to_csv('customer_segments.csv', index=False)
    

    This implementation clusters customers into segments based on their purchasing behavior, demographics, and engagement levels using Hierarchical Clustering, then visualizes and analyzes the resulting segments for actionable insights.

    Interpreting and Visualizing Clusters

    This section focuses on interpreting and visualizing clusters after applying clustering techniques such as K-Means or Hierarchical Clustering. This implementation will help derive actionable insights based on customer segmentation.

    Step 1: Import Libraries

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    

    Step 2: Fit the Clustering Algorithm

    Fit the clustering model (e.g., K-Means) and assign cluster labels:

    # Assuming `data` is your fully numeric, preprocessed dataframe
    kmeans = KMeans(n_clusters=5, random_state=42)
    data['Cluster'] = kmeans.fit_predict(data)
    

    Alternatively, hierarchical clustering can be used; scikit-learn's AgglomerativeClustering assigns labels in a single step, as in the sketch below.
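
    A minimal sketch, assuming the same preprocessed data DataFrame (the cluster count of 5 is an assumption chosen to mirror the K-Means example above):

    from sklearn.cluster import AgglomerativeClustering

    # AgglomerativeClustering has no separate predict step; fit_predict returns labels
    hc = AgglomerativeClustering(n_clusters=5, linkage='ward')
    data['Cluster'] = hc.fit_predict(data.drop(columns='Cluster', errors='ignore'))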

    Step 3: Interpret Clusters – Profiling

    # Calculate mean values of features for each cluster
    cluster_profile = data.groupby('Cluster').mean(numeric_only=True)
    
    # Display the cluster profile
    print(cluster_profile)
    

    Step 4: Visualize Clusters

    Using PCA for Dimensionality Reduction

    pca = PCA(n_components=2)
    data_pca = pca.fit_transform(data.drop('Cluster', axis=1))
    
    # Create a DataFrame for visualization
    pca_df = pd.DataFrame(data_pca, columns=['PCA1', 'PCA2'])
    pca_df['Cluster'] = data['Cluster']
    
    # Visualize using Seaborn
    plt.figure(figsize=(10, 8))
    sns.scatterplot(x='PCA1', y='PCA2', hue='Cluster', data=pca_df, palette='viridis')
    plt.title('Customer Segments Visualization with PCA')
    plt.show()
    

    Visualizing Feature Importance for Each Cluster

    # Melt the dataframe for better visualization
    cluster_profile_melted = cluster_profile.reset_index().melt(id_vars='Cluster', var_name='Feature', value_name='Mean')
    
    plt.figure(figsize=(15, 8))
    sns.barplot(data=cluster_profile_melted, x='Feature', y='Mean', hue='Cluster', palette='viridis')
    plt.title('Feature Importance Across Clusters')
    plt.xticks(rotation=45)
    plt.show()
    

    Visualizing Clusters Using Pairplot

    # This can be computationally expensive with many features
    # Using `hue` to visualize clusters in feature pairs
    
    sns.pairplot(data, hue='Cluster', palette='viridis')
    plt.suptitle('Cluster Pairplot', y=1.02)
    plt.show()
    

    By following these steps, you can effectively interpret and visualize the clusters of your customer segmentation analysis, allowing you to derive actionable insights regarding purchasing behavior, demographics, and engagement levels.
