Mastering Hierarchical Clustering with R: Dendrograms and Cluster Trees in Action

Introduction to Hierarchical Clustering

Overview

Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. Unlike k-means clustering, it does not require the user to pre-specify the number of clusters. There are two types of hierarchical clustering:

  1. Agglomerative (bottom-up): starts with each object in its own cluster and iteratively merges the two closest clusters.
  2. Divisive (top-down): starts with all objects in a single cluster and iteratively splits the least coherent cluster.

Algorithm Steps

Agglomerative Clustering

  1. Initialize: place each data point in its own cluster and compute the distance matrix between all points.
  2. Merge: find the two closest clusters and merge them.
  3. Update the distance matrix: recompute distances between the new cluster and all remaining clusters, according to the chosen linkage.
  4. Repeat steps 2-3 until all points belong to a single cluster (a from-scratch sketch follows below).
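
As a from-scratch illustration of these steps, here is a minimal single-linkage sketch on a small iris sample. It is for intuition only; in practice you would call hclust(), as shown later in this guide.

x <- as.matrix(iris[1:10, -5])          # small sample for illustration
d <- as.matrix(dist(x))                 # step 1: pairwise distance matrix
clusters <- as.list(seq_len(nrow(x)))   # each point starts in its own cluster

while (length(clusters) > 1) {
  # step 2: find the closest pair of clusters
  # (single linkage: minimum pointwise distance between clusters)
  best <- c(1, 2); best_d <- Inf
  for (i in 2:length(clusters)) {
    for (j in 1:(i - 1)) {
      dij <- min(d[clusters[[i]], clusters[[j]]])
      if (dij < best_d) { best_d <- dij; best <- c(j, i) }
    }
  }
  cat("Merging clusters", best[1], "and", best[2], "at height", best_d, "\n")
  # step 3: merge the pair; with single linkage, the pointwise distance matrix
  # already determines the merged cluster's distances, so no explicit update is needed
  clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
  clusters[[best[2]]] <- NULL
}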

Divisive Clustering

  1. Start with all data in one cluster.
  2. Choose a cluster to split: often the one with the highest within-cluster SSE (sum of squared errors).
  3. Create two child clusters: split it using an algorithm such as k-means with k = 2.
  4. Repeat steps 2-3 until each cluster contains a single data point or a stopping criterion is met (a sketch of one split follows below).
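
The snippet below sketches a single divisive split under these rules: it picks the cluster with the highest within-cluster SSE and splits it with k-means (k = 2). The names x and assign are illustrative, not part of any package.

x <- as.matrix(iris[, -5])
assign <- rep(1L, nrow(x))  # start with all points in one cluster

# Within-cluster SSE: sum of squared deviations from the cluster centroid
sse <- function(rows) sum(scale(x[rows, , drop = FALSE], scale = FALSE)^2)

# Step 2: choose the cluster with the highest SSE
worst <- as.integer(names(which.max(tapply(seq_len(nrow(x)), assign, sse))))
rows  <- which(assign == worst)

# Step 3: split it into two child clusters with k-means
km <- kmeans(x[rows, ], centers = 2)
assign[rows[km$cluster == 2]] <- max(assign) + 1L
table(assign)  # repeat the choose-and-split steps until a stopping criterion is met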

Practical Implementation in R

Step 1: Install and Load Necessary Libraries

# Install necessary libraries (if not already installed)
if (!require(cluster)) install.packages("cluster")
if (!require(dendextend)) install.packages("dendextend")

# Load libraries
library(cluster)
library(dendextend)

Step 2: Prepare Data

# Sample data: Iris dataset
data <- iris[, -5]  # Removing the species column

Step 3: Compute Distance Matrix

# Compute the Euclidean distance matrix
distance_matrix <- dist(data, method = "euclidean")

Step 4: Perform Hierarchical Clustering

# Perform agglomerative hierarchical clustering using the complete linkage method
hc <- hclust(distance_matrix, method = "complete")

Step 5: Plot Dendrogram

# Plot the dendrogram
plot(as.dendrogram(hc), main = "Hierarchical Clustering Dendrogram", xlab = "Samples", ylab = "Height")

Step 6: Cut Dendrogram to Form Clusters

# Cut the dendrogram to obtain a fixed number of clusters
# (alternatively, cut at a specific height with the h argument)
clusters <- cutree(hc, k = 3)  # example: k = 3 clusters

# Add clusters to the original data
data$cluster <- as.factor(clusters)

Step 7: Visualize Clusters

# Load ggplot2 for visualization
if (!require(ggplot2)) install.packages("ggplot2")
library(ggplot2)

# Visualize clusters
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = cluster)) +
  geom_point(size = 3) +
  labs(title = "Hierarchical Clustering of Iris Data", x = "Sepal Length", y = "Sepal Width") +
  theme_minimal()

Conclusion

In this practical guide, you have learned the basic steps of performing hierarchical clustering in R: preparing the data, computing the distance matrix, running the clustering, plotting the dendrogram, cutting it to form clusters, and visualizing the results. These skills are essential for advanced analytical tasks, helping you discover and understand the structure within your data.

Setting Up Your R Environment

This guide will walk you through the process of setting up your R environment to effectively manage hierarchical clustering projects.

1. Install R

Ensure you have R installed on your system. If you haven’t installed R yet, download it from CRAN.

2. Install RStudio

RStudio is a powerful IDE for R. To install it, download the installer appropriate for your operating system from RStudio’s official website.

3. Install Required Packages

You’ll need several R packages to perform hierarchical clustering and visualization. The following script installs these packages:

# Install essential packages for hierarchical clustering
# (stats ships with base R and does not need to be installed)
install.packages(c("factoextra", "dendextend"))

# Load the packages
library(stats)
library(factoextra)
library(dendextend)

4. Verify Package Installation

Run the following to ensure all packages are correctly installed:

# Check if the packages are loaded correctly
if(!require(stats)) stop("stats package not loaded")
if(!require(factoextra)) stop("factoextra package not loaded")
if(!require(dendextend)) stop("dendextend package not loaded")

print("All packages loaded successfully!")

5. Set Working Directory

Setting your working directory ensures that all files read or written are directed to a specified folder. Adjust the path according to your project directory.

# Set your working directory to the folder where project files are stored
setwd("/path/to/your/project/folder")

# Verify the working directory
print(getwd())

6. Load Data

Load your dataset into R for hierarchical clustering. The example below assumes a CSV file named data.csv.

# Load the dataset
data <- read.csv("data.csv", header = TRUE, sep = ",")

# Display the first few rows of the dataset
head(data)

7. Basic Data Preprocessing

Preprocessing ensures that your data is clean and ready for clustering.

# Handle missing values by removing rows with any NA values
data_clean <- na.omit(data)

# Scale the data so each variable has zero mean and unit variance
# (this assumes all remaining columns are numeric)
data_scaled <- scale(data_clean)

# Preview the cleaned and scaled data
head(data_scaled)

8. Performing Hierarchical Clustering

Apply hierarchical clustering using the hclust function.

# Compute the distance matrix
dist_matrix <- dist(data_scaled, method = "euclidean")

# Perform hierarchical clustering
hc <- hclust(dist_matrix, method = "ward.D2")

# Plot the dendrogram
plot(hc, cex = 0.6, hang = -1)

9. Visualizing Clusters

Enhance the visualization of your dendrogram with colored clusters using the factoextra package.

# Visualize the dendrogram with colored clusters
fviz_dend(hc, k = 4,               # Cut into 4 clusters
          cex = 0.5,               # Label size
          k_colors = rainbow(4),   # Cluster colors
          rect = TRUE,             # Add rectangle around clusters
          rect_border = rainbow(4),# Rectangle border colors
          rect_fill = TRUE)        # Fill the rectangles

10. Save the Dendrogram Plot

To save your dendrogram plot as an image file:

# Save the dendrogram plot as a PNG file
png("dendrogram.png", width = 800, height = 600)
fviz_dend(hc, k = 4, cex = 0.5, k_colors = rainbow(4), rect = TRUE, rect_border = rainbow(4), rect_fill = TRUE)
dev.off()

This completes the setup for performing hierarchical clustering in R within your project. Now your environment is ready for advanced hierarchical clustering analysis and the implementation of practical examples.

Understanding Dendrograms

Overview

A dendrogram is a tree-like diagram that records the sequences of merges or splits in hierarchical clustering. It’s an efficient way to visualize the arrangement of the clusters produced by hierarchical clustering algorithms.

Key Concepts

Nodes and Height

  • Leaf Nodes: Represent individual data points or objects.
  • Internal Nodes: Represent the clusters formed at various levels.
  • Height: Indicates the distance or dissimilarity at which the clusters are merged. The y-axis usually represents the height.
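
To make the height concept concrete, an hclust object stores both the merge order and the heights; this small sketch inspects them on a six-row sample of iris:

hc_small <- hclust(dist(iris[1:6, -5]))
hc_small$merge   # which clusters merged at each step (negative = original observation)
hc_small$height  # the dissimilarity (height) at which each merge occurred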

Types of Linkage

  • Single Linkage: Minimum distance between points in the clusters.
  • Complete Linkage: Maximum distance between points in the clusters.
  • Average Linkage: Average distance between points in the clusters.
  • Ward’s Method: Minimizes the total variance within clusters.
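
The linkage choice changes the heights at which merges occur (and often the resulting cluster shapes). As a quick comparison on the iris measurements:

d <- dist(iris[, -5])
sapply(c("single", "complete", "average", "ward.D2"),
       function(m) max(hclust(d, method = m)$height))  # tallest merge per linkage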

Practical Implementation in R

Step 1: Load Necessary Libraries

Make sure you load the required libraries:

library(datasets)
library(cluster)
library(dendextend)

Step 2: Load Data

For simplicity, let’s use the iris dataset:

data(iris)
iris_data <- iris[, -5]  # Remove species column for clustering

Step 3: Compute Distance Matrix

Calculate the distance between the data points:

distance_matrix <- dist(iris_data)

Step 4: Apply Hierarchical Clustering

Perform hierarchical clustering using different linkage methods. Here we use Ward’s method:

hc <- hclust(distance_matrix, method = "ward.D2")

Step 5: Plot the Dendrogram

Visualize the hierarchical clustering as a dendrogram:

plot(hc, main = "Dendrogram of Iris Data", xlab = "", sub = "", cex = 0.9, hang = -1)

Step 6: Cut the Dendrogram

Cut the dendrogram to form clusters:

groups <- cutree(hc, k = 3)  # Assuming you want 3 clusters
rect.hclust(hc, k = 3, border = 2:4)  # Optionally highlight clusters in red/green/blue

Step 7: Interpret the Clusters

Interpret the resulting clusters:

table(groups, iris$Species)

This table compares clusters with actual species to assess clustering quality.

Important Considerations

  • Interpretation: Analyze the heights at which clusters are merged to understand cluster separability (a quick sketch follows below).
  • Validation: Compare with true labels (if available) to validate clusters.
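
As a rough illustration of the interpretation point above, large gaps between successive merge heights often hint at a natural number of clusters. This is a heuristic sketch, using the hc object fitted earlier:

h <- sort(hc$height, decreasing = TRUE)
head(-diff(h))  # large gaps between successive merge heights suggest a cut point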

Conclusion

Understanding and interpreting dendrograms is crucial in hierarchical clustering. This hands-on guide provides a detailed implementation that helps in grasping the practical aspects of using dendrograms in R. You can now leverage this technique for your own datasets and clustering needs.

Data Preprocessing and Normalization

Data Preprocessing

Preprocessing is a crucial step in hierarchical clustering. It involves cleaning, transforming, and organizing your data. Below are the steps that you would typically take in R to preprocess data for hierarchical clustering.

Load Data

Assuming your dataset is stored in a CSV file:

# Load necessary library
library(tidyverse)

# Read the dataset
data <- read.csv("your_dataset.csv")

# Display the first few rows of the dataset
head(data)

Handling Missing Values

Handling missing values is vital as they can skew the results of the clustering process.

# Check for missing values
sum(is.na(data))

# Option 1: Remove rows with any missing values
data_clean <- na.omit(data)

# Option 2: Fill missing values in numeric columns with the column mean
# (other imputation methods can also be used)
data_clean <- data %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))

Removing Non-numerical Columns

Hierarchical clustering primarily works with numerical data, so it’s prudent to remove or transform non-numerical columns.

# If you have categorical data, you might need to convert it to numerical form or remove it. 
# For simplicity, removing non-numerical columns here:
data_numeric <- data_clean %>%
  select(where(is.numeric))

Data Normalization

Normalization (or standardization) is important to ensure that each feature contributes equally to the computation of distances in clustering.

# Normalize the numerical data
data_normalized <- scale(data_numeric)

# Confirm normalization
summary(data_normalized)

Example: Applying Preprocessing and Normalization

Here’s a script that performs these tasks sequentially:

library(tidyverse)

# Load data
data <- read.csv("your_dataset.csv")

# Remove rows with any NA values
data_clean <- na.omit(data)

# Or alternatively, fill NA in numeric columns with the column mean
data_clean <- data %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))

# Retain only numerical columns
data_numeric <- data_clean %>%
  select(where(is.numeric))

# Normalize the numerical data
data_normalized <- scale(data_numeric)

# Display summary and head of the cleaned and normalized data
summary(data_normalized)
head(data_normalized)

This script cleans, transforms, and normalizes your data, making it ready for hierarchical clustering analysis in R. Use this pipeline to prepare your data before moving on to the hierarchical clustering stages.

Exploring Clustering Algorithms in R

This section demonstrates the practical implementation of hierarchical clustering using R with a real-world dataset. The focus is on applying hierarchical clustering, plotting dendrograms, and interpreting the clusters.

Loading Required Libraries

# Load necessary libraries
library(cluster)
library(factoextra)

Dataset Example: Iris Dataset

The Iris dataset is commonly used for clustering and classification tasks. It contains 150 samples of iris flowers with four features: Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.

# Load the Iris dataset
data(iris)
iris_data <- iris[, -5]  # Remove the species column for clustering

Compute Distance Matrix

Hierarchical clustering requires a distance matrix. We will use Euclidean distance.

# Compute the distance matrix
distance_matrix <- dist(iris_data, method = "euclidean")

Perform Hierarchical Clustering

Using the calculated distance matrix, we can perform hierarchical clustering with different linkage methods (e.g., complete, average, single).

# Perform hierarchical clustering using complete linkage
hc_complete <- hclust(distance_matrix, method = "complete")

# Perform hierarchical clustering using average linkage
hc_average <- hclust(distance_matrix, method = "average")

# Perform hierarchical clustering using single linkage
hc_single <- hclust(distance_matrix, method = "single")

Plot Dendrograms

Visualize the hierarchical clustering results with dendrograms.

# Plot dendrogram for complete linkage
plot(hc_complete, main = "Complete Linkage Dendrogram", xlab = "", sub = "", cex = 0.9)

# Plot dendrogram for average linkage
plot(hc_average, main = "Average Linkage Dendrogram", xlab = "", sub = "", cex = 0.9)

# Plot dendrogram for single linkage
plot(hc_single, main = "Single Linkage Dendrogram", xlab = "", sub = "", cex = 0.9)

Cut Dendrogram and Create Clusters

Choose a number of clusters and cut the dendrogram to define cluster memberships.

# Cut tree to create 3 clusters for complete linkage
clusters_complete <- cutree(hc_complete, k = 3)

# Cut tree to create 3 clusters for average linkage
clusters_average <- cutree(hc_average, k = 3)

# Cut tree to create 3 clusters for single linkage
clusters_single <- cutree(hc_single, k = 3)

Visualize Clusters

Use the fviz_cluster function from the factoextra library to visualize the clusters.

# Visualize clusters for complete linkage
fviz_cluster(list(data = iris_data, cluster = clusters_complete), geom = "point", ellipse.type = "norm", ggtheme = theme_minimal(), main = "Complete Linkage Clusters")

# Visualize clusters for average linkage
fviz_cluster(list(data = iris_data, cluster = clusters_average), geom = "point", ellipse.type = "norm", ggtheme = theme_minimal(), main = "Average Linkage Clusters")

# Visualize clusters for single linkage
fviz_cluster(list(data = iris_data, cluster = clusters_single), geom = "point", ellipse.type = "norm", ggtheme = theme_minimal(), main = "Single Linkage Clusters")

These steps demonstrate how hierarchical clustering can be performed in R using real-world data. The chosen methods illustrate the entire workflow from loading data and calculating distances to performing clustering and visualizing the results.

Creating Dendrograms from Scratch

Objective

In this section, we will create dendrograms from scratch using R. A dendrogram is a tree-like diagram that records the sequence of merges or splits in hierarchical clustering.

Step-by-Step Implementation

We will use the hclust function, which performs hierarchical clustering, and then use the plot function to create the dendrogram.

1. Load and Prepare Your Data

# Load necessary libraries
library(datasets)

# Use a built-in dataset, for example, the Iris dataset
data <- iris[, -5]  # Exclude the species column

# Inspect the data
head(data)

2. Calculate the Distance Matrix

# Calculate the Euclidean distance matrix
dist_matrix <- dist(data, method = "euclidean")

3. Perform Hierarchical Clustering

# Perform hierarchical clustering using the "complete" linkage method
hc <- hclust(dist_matrix, method = "complete")

4. Plot the Dendrogram

# Plot the dendrogram
plot(hc, main="Dendrogram for Iris Data", xlab="", sub="", cex=0.9)

5. Cut the Dendrogram for a Specific Number of Clusters (Optional)

To visualize specific clusters, you can cut the dendrogram.

# Cut the dendrogram to form 3 clusters
clusters <- cutree(hc, k=3)

# Plot the cut dendrogram
plot(hc)
rect.hclust(hc, k=3, border="red")

6. Assign Cluster Labels back to the Data (Optional)

# Assign cluster labels to the original data
iris$Cluster <- as.factor(clusters)

# Inspect the first few rows of the data with cluster labels
head(iris)

Conclusion

You have successfully created a dendrogram from scratch using hierarchical clustering in R. This method leverages the built-in functions hclust and plot to visualize the hierarchical clustering process, making it easier to understand the structure of your data.

Visualizing Cluster Trees in R

To visualize cluster trees in R, you can use hierarchical clustering techniques and then plot the dendrogram. Here is a step-by-step practical implementation:

Step 1: Load Necessary Libraries

library(stats)  # For hierarchical clustering functions
library(dendextend)  # For enhanced dendrogram functionalities

Step 2: Prepare Data

Suppose you have a dataset data_matrix that is already preprocessed and normalized.
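
If you need a stand-in to follow along, the scaled mtcars measurements work as a hypothetical data_matrix:

# A hypothetical, already-normalized stand-in for data_matrix
data_matrix <- scale(mtcars)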

Step 3: Perform Hierarchical Clustering

Here, we use the hclust() function for hierarchical clustering.

# Calculate the distance matrix
distance_matrix <- dist(data_matrix)

# Perform hierarchical clustering using complete linkage
cluster_tree <- hclust(distance_matrix, method = "complete")

Step 4: Plot the Dendrogram

The base plotting tools or specialized dendrogram packages can be used.

Base R Plot

# Plot basic dendrogram
plot(cluster_tree, main = "Cluster Dendrogram", xlab = "Samples", sub = "", cex = 0.9, hang = -1)

Enhanced Plot using dendextend

# Convert hclust object to a dendrogram object
dend <- as.dendrogram(cluster_tree)

# Customize the dendrogram (optional: color branches and labels, resize text)
dend <- dend %>%
  set("branches_k_color", k = 4) %>%  # color branches by 4 clusters
  set("labels_col", k = 4) %>%        # color labels to match the clusters
  set("labels_cex", 0.7)              # shrink the label text

# Plot the dendrogram
plot(dend, main = "Enhanced Cluster Dendrogram")

Step 5: Cutting the Dendrogram for Cluster Assignments

Determine the number of clusters and cut the tree.

# Cut the tree into 4 clusters, for example
clusters <- cutree(cluster_tree, k = 4)

# Re-plot the base dendrogram, then draw rectangles around the clusters
# (rect.hclust draws on the current plot, so a plot must be open)
plot(cluster_tree, cex = 0.6, hang = -1)
rect.hclust(cluster_tree, k = 4, border = 2:5)

Step 6: Visualize Clusters on the Original Data (Optional)

This step helps in understanding how data points are clustered visually, typically done using PCA for dimensionality reduction.

# Perform Principal Component Analysis for 2D plotting
pca_result <- prcomp(data_matrix, scale. = TRUE)

# Plot PCA with clusters
plot(pca_result$x[,1], pca_result$x[,2], col = clusters, 
     main = "PCA of Clusters", xlab = "PC1", ylab = "PC2", pch = 19)

# Optionally add cluster centroids
centroids <- aggregate(pca_result$x[,1:2], list(clusters), mean)
points(centroids[,2:3], pch = 8, col = 1:nlevels(as.factor(clusters)), cex = 2)

Conclusion

This implementation provides a complete example of visualizing cluster trees in R, from hierarchical clustering to dendrogram plotting and cluster visualization on the original data. It integrates basic and enhanced visualization techniques to help you clearly understand and present your hierarchical clustering results.

Evaluating Clustering Results in R

Prerequisites

Assuming you have already performed hierarchical clustering and generated a dendrogram, we’ll go through the practical steps to evaluate the clustering results.

Internal Evaluation Metrics

  1. Silhouette Width
    • Measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation).

# Required Libraries
library(cluster)

# Example Data
set.seed(123)
data <- scale(mtcars)  # scale so no single variable dominates the distances

# Perform Hierarchical Clustering
d <- dist(data)  # Euclidean Distance
hc <- hclust(d, method = "ward.D2")

# Cut Tree to Create Clusters
clusters <- cutree(hc, k = 4)

# Calculate Silhouette Width
silhouette_info <- silhouette(clusters, d)
summary(silhouette_info)

  2. Dunn Index
    • Ratio of the minimum inter-cluster distance to the maximum intra-cluster distance.

# Dunn Index requires the 'clValid' package
library(clValid)

# Calculate Dunn Index
dunn_index <- dunn(clusters = clusters, Data = data)
print(dunn_index)

External Evaluation Metrics

  1. Adjusted Rand Index (ARI)
    • Compares the similarity between the clusters and the true labels.
# Example: Comparing clusters to ground truth using the 'mclust' package
library(mclust)

# Assuming you have true labels in true_labels vector
true_labels <- mtcars$cyl  # For example, true labels based on the number of cylinders

# Calculate ARI
ari_value <- adjustedRandIndex(true_labels, clusters)
print(ari_value)

  2. Cluster Purity
    • Measures the extent to which clusters contain a single class.

# Calculate Cluster Purity
calculate_purity <- function(clusters, labels) {
  contingency_table <- table(clusters, labels)
  sum(apply(contingency_table, 1, max)) / length(labels)
}

# Assuming true_labels vector exists
purity <- calculate_purity(clusters, true_labels)
print(purity)

Visual Evaluation Metrics

  1. Cluster Plot
    • A visual examination of the clusters can provide insights; you can use the factoextra package for visualization:

# Required Library
library(factoextra)

# Visualize Clusters
fviz_cluster(list(data = data, cluster = clusters), geom = "point")

  2. Dendrogram with Colored Labels
    • Color the dendrogram branches by cluster for better visualization.

# Required Library
library(dendextend)

# Color Dendrogram
dend <- as.dendrogram(hc)
dend <- color_branches(dend, k = 4)
plot(dend)

Interpretation of Results

  • Silhouette Width: Values close to 1 imply well-clustered data, values close to 0 mean overlapping clusters, and negative values imply misclassified points.
  • Dunn Index: Higher values indicate better clustering.
  • Adjusted Rand Index (ARI): Values close to 1 indicate high similarity, while values close to 0 indicate no agreement.
  • Cluster Purity: Higher purity indicates better clustering based on true labels.
  • Visual Inspection: Helps in understanding the nature and separation of clusters.

These metrics and visualizations will provide a comprehensive evaluation of your hierarchical clustering efforts. Apply these steps directly to your dataset in R to gauge the quality of your clustering.

Real-World Applications of Hierarchical Clustering

1. Customer Segmentation

Hierarchical clustering can be used to segment customers based on their purchasing behavior. For instance, an e-commerce company can analyze purchase history, the frequency of purchases, and total spending to identify distinct customer segments.

Practical Implementation

# Load necessary libraries
library(dplyr)
library(cluster)

# Assume `customer_data` is your dataset with relevant features
# Normalize the data
customer_data_scaled <- scale(customer_data)

# Compute the distance matrix
dist_matrix <- dist(customer_data_scaled, method = "euclidean")

# Perform hierarchical clustering
customer_hclust <- hclust(dist_matrix, method = "ward.D2")

# Cut the dendrogram to create clusters (let's assume 5 clusters)
customer_clusters <- cutree(customer_hclust, k = 5)

# Add cluster assignments to the original dataset
customer_data <- customer_data %>%
  mutate(cluster = customer_clusters)

# Analyze cluster characteristics
cluster_summary <- customer_data %>%
  group_by(cluster) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))

print(cluster_summary)

2. Document Clustering

Hierarchical clustering can be applied to group similar documents, which is useful in organizing large text corpora. For instance, news articles can be clustered based on their content to identify topics.

Practical Implementation

# Load necessary libraries
library(tm)
library(SnowballC)

# Assume `documents` is a character vector of text documents
# Create a document-term matrix (rows = documents) so that distances,
# and therefore clusters, are computed between documents rather than terms
corpus <- Corpus(VectorSource(documents))
dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(3, Inf)))

# Convert the DTM to a dense matrix, drop zero-variance terms, and scale
dtm_matrix <- as.matrix(dtm)
dtm_matrix <- dtm_matrix[, apply(dtm_matrix, 2, sd) > 0]
dtm_matrix_scaled <- scale(dtm_matrix)

# Compute the distance matrix between documents
dist_matrix <- dist(dtm_matrix_scaled, method = "euclidean")

# Perform hierarchical clustering
doc_hclust <- hclust(dist_matrix, method = "ward.D2")

# Cut the dendrogram to create clusters (let's assume 3 clusters)
document_clusters <- cutree(doc_hclust, k = 3)

# Add cluster assignments to the original dataset
documents_data <- data.frame(documents, cluster = document_clusters)

print(documents_data)

3. Gene Expression Data Analysis

Hierarchical clustering is widely used in bioinformatics for analyzing gene expression data. Genes with similar expression patterns are grouped together, which can give insights into gene function and regulatory mechanisms.

Practical Implementation

# Load necessary libraries
library(gplots)
library(dplyr)  # for mutate() and the cluster summary below

# Assume `gene_expression_data` is a data frame with rows as genes and columns as samples
# Normalize the data
gene_expression_data_scaled <- scale(gene_expression_data)

# Compute the distance matrix
dist_matrix <- dist(gene_expression_data_scaled, method = "euclidean")

# Perform hierarchical clustering
gene_hclust <- hclust(dist_matrix, method = "ward.D2")

# Cut the dendrogram to create clusters (let's assume 4 clusters)
gene_clusters <- cutree(gene_hclust, k = 4)

# Add cluster assignments to the original dataset
gene_expression_data <- gene_expression_data %>%
  mutate(cluster = gene_clusters)

# Plot a heatmap with the gene (row) dendrogram attached; gene_hclust
# clustered the rows, so it belongs on the row axis
heatmap.2(as.matrix(gene_expression_data[, -ncol(gene_expression_data)]),
          Rowv = as.dendrogram(gene_hclust), Colv = FALSE, trace = "none",
          dendrogram = "row", col = bluered(100),
          density.info = "none", scale = "row")

# Optionally, analyze cluster characteristics (e.g., mean expression level per cluster)
cluster_summary <- gene_expression_data %>%
  group_by(cluster) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))

print(cluster_summary)

These practical implementations demonstrate how to apply hierarchical clustering to real-world problems in R, enhancing your analytical skills in various domains including customer segmentation, document clustering, and gene expression analysis.

Advanced Clustering Techniques and Customizations

Agglomerative and Divisive Hierarchical Clustering in R

Hierarchical clustering can be performed using either agglomerative (bottom-up) or divisive (top-down) methods. Here, we will focus on the practical aspects using the hclust and diana methods from the cluster package in R.

Agglomerative Clustering

Step 1: Load Necessary Libraries

library(cluster)  # For the `diana` function
library(dendextend)  # For dendrogram customizations

Step 2: Load and Preprocess Data

Assume data is already preprocessed as per your project’s criteria:

# Replace 'your_data' with an actual dataset
# (e.g., data <- scale(mtcars) gives a quick numeric test case)
data <- your_data

Step 3: Compute the Distance Matrix

Using the Euclidean distance, but this can be customized to other metrics as well:

dist_matrix <- dist(data, method = "euclidean")

Step 4: Perform Agglomerative Clustering

Use hclust:

hc <- hclust(dist_matrix, method = "ward.D2")  # or "complete", "single", etc.
# ("ward.D2" implements Ward's criterion for unsquared Euclidean distances)

Step 5: Plot the Dendrogram

plot(hc, main = "Agglomerative Hierarchical Clustering Dendrogram")

Divisive Clustering

Step 1: Perform Divisive Clustering

Use diana function:

dc <- diana(data, metric = "euclidean")  # or other distance metrics

Step 2: Convert to Dendrogram

dc_dendrogram <- as.dendrogram(dc)

Step 3: Plot the Dendrogram

plot(dc_dendrogram, main = "Divisive Hierarchical Clustering Dendrogram")

Customizing Dendrograms for Better Interpretation

Step 1: Customize Leaves and Branches

dend_colored <- color_branches(hc, k = 4)  # Coloring by clusters
dend_colored <- color_labels(dend_colored, k = 4)

Step 2: Customize Layout

dend_colored <- set(dend_colored, "labels_cex", 0.7)  # Adjust label size
dend_colored <- set(dend_colored, "branches_lwd", 2)  # Adjust branch width
dend_colored <- set(dend_colored, "branches_k_color", c("blue", "green", "red", "purple"), k = 4)

Step 3: Plot Customized Dendrogram

plot(dend_colored, main = "Customized Agglomerative Hierarchical Clustering Dendrogram")

Cutting the Dendrogram to Form Clusters

Step 1: Choose the Number of Clusters

k <- 4  # Number of clusters
clusters <- cutree(hc, k)

Step 2: Assign Clusters to the Data

cluster_assignment <- data.frame(data, cluster = clusters)

Step 3: Visualize Clusters

library(ggplot2)
# Assuming `data` has two numeric columns named Component1 and Component2
# (hypothetical names, e.g. the first two principal components)
ggplot(cluster_assignment, aes(x = Component1, y = Component2, color = as.factor(cluster))) +
    geom_point() +
    labs(title = "Clustered Data", color = "Cluster")

By following this implementation, you will be able to perform advanced hierarchical clustering techniques and customize dendrograms extensively to derive meaningful insights from your data.
