Introduction to Hierarchical Clustering
Overview
Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. Unlike k-means clustering, it does not require the user to pre-specify the number of clusters. There are two types of hierarchical clustering:
- Agglomerative (bottom-up): Starts with each object in its own cluster and merges the nearest clusters iteratively.
- Divisive (top-down): Starts with all objects in a single cluster and splits the least coherent clusters iteratively.
Algorithm Steps
Agglomerative Clustering
- Initialize: Place each data point in its own cluster.
- Compute the distance matrix: Calculate the pairwise distances between all clusters (initially, the individual data points).
- Find the closest two clusters and merge them.
- Update the distance matrix: Recalculate the distances between the new cluster and all remaining clusters.
- Repeat the merge and update steps until all points are in a single cluster.
Divisive Clustering
- Start with all data in one cluster.
- Choose the cluster to split: Often the one with the highest SSE (Sum of Squared Errors).
- Create two child clusters: By applying an algorithm such as k-means with k = 2 (see the sketch after this list).
- Repeat the choose-and-split steps until each cluster contains a single data point or a stopping criterion is met.
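To make the splitting step concrete, here is a minimal sketch of a single divisive split; the built-in iris measurements are used purely as an assumed example dataset:
# One divisive step: split the current cluster into two children with k-means (k = 2)
x <- scale(iris[, -5])          # numeric features only, standardized
km2 <- kmeans(x, centers = 2)   # k = 2 produces the two child clusters
table(km2$cluster)              # sizes of the two children
# A full divisive algorithm would apply this split recursively,
# each time to the cluster with the highest SSE.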
Practical Implementation in R
Step 1: Install and Load Necessary Libraries
# Install necessary libraries (if not already installed)
if (!require(cluster)) install.packages("cluster")
if (!require(dendextend)) install.packages("dendextend")
# Load libraries
library(cluster)
library(dendextend)
Step 2: Prepare Data
# Sample data: Iris dataset
data <- iris[, -5] # Removing the species column
Step 3: Compute Distance Matrix
# Compute the Euclidean distance matrix
distance_matrix <- dist(data, method = "euclidean")
Step 4: Perform Hierarchical Clustering
# Perform agglomerative hierarchical clustering using the complete linkage method
hc <- hclust(distance_matrix, method = "complete")
Step 5: Plot Dendrogram
# Plot the dendrogram
plot(as.dendrogram(hc), main = "Hierarchical Clustering Dendrogram", xlab = "Samples", ylab = "Height")
Step 6: Cut Dendrogram to Form Clusters
# Cut the dendrogram at a specified height to obtain clusters
clusters <- cutree(hc, k = 3) # k is the number of clusters, example k=3
# Add clusters to the original data
data$cluster <- as.factor(clusters)
Step 7: Visualize Clusters
# Load ggplot2 for visualization
if (!require(ggplot2)) install.packages("ggplot2")
library(ggplot2)
# Visualize clusters
ggplot(data, aes(x = Sepal.Length, y = Sepal.Width, color = cluster)) +
geom_point(size = 3) +
labs(title = "Hierarchical Clustering of Iris Data", x = "Sepal Length", y = "Sepal Width") +
theme_minimal()
Conclusion
In this practical guide, you have learned the basic steps of performing hierarchical clustering in R: preparing the data, computing the distance matrix, performing the clustering, plotting the dendrogram, cutting it to form clusters, and visualizing the results. This knowledge is crucial for advanced analytical tasks, helping you discover and understand the structure within your data.
Setting Up Your R Environment
This guide will walk you through the process of setting up your R environment to effectively manage hierarchical clustering projects.
1. Install R
Ensure you have R installed on your system. If you haven’t installed R yet, download it from CRAN.
2. Install RStudio
RStudio is a powerful IDE for R. To install it, download the installer appropriate for your operating system from RStudio’s official website.
3. Install Required Packages
You’ll need several R packages to perform hierarchical clustering and visualization. The stats package ships with base R, so only the others need to be installed:
# Install essential packages for hierarchical clustering
install.packages(c("factoextra", "dendextend"))
# Load the packages
library(stats)
library(factoextra)
library(dendextend)
4. Verify Package Installation
Run the following to ensure all packages are correctly installed:
# Check if the packages are loaded correctly
if(!require(stats)) stop("stats package not loaded")
if(!require(factoextra)) stop("factoextra package not loaded")
if(!require(dendextend)) stop("dendextend package not loaded")
print("All packages loaded successfully!")
5. Set Working Directory
Setting your working directory ensures that all files read or written are directed to a specified folder. Adjust the path according to your project directory.
# Set your working directory to the folder where project files are stored
setwd("/path/to/your/project/folder")
# Verify the working directory
print(getwd())
6. Load Data
Load your dataset into R for hierarchical clustering. This assumes you have a CSV file named data.csv.
# Load the dataset
data <- read.csv("data.csv", header = TRUE, sep = ",")
# Display the first few rows of the dataset
head(data)
7. Basic Data Preprocessing
Preprocessing ensures that your data is clean and ready for clustering.
# Handle missing values by removing rows with any NA values
data_clean <- na.omit(data)
# Scale the data (standardize variables to have zero mean and unit variance)
data_scaled <- scale(data_clean)
# Preview the cleaned and scaled data
head(data_scaled)
8. Performing Hierarchical Clustering
Apply hierarchical clustering using the hclust function.
# Compute the distance matrix
dist_matrix <- dist(data_scaled, method = "euclidean")
# Perform hierarchical clustering
hc <- hclust(dist_matrix, method = "ward.D2")
# Plot the dendrogram
plot(hc, cex = 0.6, hang = -1)
9. Visualizing Clusters
Enhance the visualization of your dendrogram with colored clusters using the factoextra package.
# Visualize the dendrogram with colored clusters
fviz_dend(hc, k = 4, # Cut into 4 clusters
cex = 0.5, # Label size
k_colors = rainbow(4), # Cluster colors
rect = TRUE, # Add rectangle around clusters
rect_border = rainbow(4),# Rectangle border colors
rect_fill = TRUE) # Fill the rectangles
10. Save the Dendrogram Plot
To save your dendrogram plot as an image file:
# Save the dendrogram plot as a PNG file (wrap in print() so the ggplot-based figure renders to the device)
png("dendrogram.png", width = 800, height = 600)
print(fviz_dend(hc, k = 4, cex = 0.5, k_colors = rainbow(4), rect = TRUE, rect_border = rainbow(4), rect_fill = TRUE))
dev.off()
This completes the setup for performing hierarchical clustering in R within your project. Now your environment is ready for advanced hierarchical clustering analysis and the implementation of practical examples.
Understanding Dendrograms
Overview
A dendrogram is a tree-like diagram that records the sequences of merges or splits in hierarchical clustering. It’s an efficient way to visualize the arrangement of the clusters produced by hierarchical clustering algorithms.
Key Concepts
Nodes and Height
- Leaf Nodes: Represent individual data points or objects.
- Internal Nodes: Represent the clusters formed at various levels.
- Height: Indicates the distance or dissimilarity at which the clusters are merged. The y-axis usually represents the height.
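Because the height axis records the dissimilarity at which merges occur, a dendrogram can also be cut at a chosen height rather than at a chosen number of clusters. A minimal sketch, using the scaled iris measurements as an assumed example:
x <- scale(iris[, -5])
hc_tmp <- hclust(dist(x), method = "complete")
groups_at_h <- cutree(hc_tmp, h = 4)   # undo every merge that occurred above height 4
table(groups_at_h)                     # cluster sizes implied by that cut height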
Types of Linkage
- Single Linkage: Minimum distance between points in the clusters.
- Complete Linkage: Maximum distance between points in the clusters.
- Average Linkage: Average distance between points in the clusters.
- Ward’s Method: Minimizes the total variance within clusters.
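Each of these linkage rules is selected through the method argument of hclust; a brief sketch, again using the scaled iris measurements as an assumed example:
x <- scale(iris[, -5])
d <- dist(x)
hc_single   <- hclust(d, method = "single")    # minimum pairwise distance between clusters
hc_complete <- hclust(d, method = "complete")  # maximum pairwise distance between clusters
hc_average  <- hclust(d, method = "average")   # mean pairwise distance between clusters
hc_ward     <- hclust(d, method = "ward.D2")   # Ward's minimum-variance criterion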
Practical Implementation in R
Step 1: Load Necessary Libraries
Make sure you load the required libraries:
library(datasets)
library(cluster)
library(dendextend)
Step 2: Load Data
For simplicity, let’s use the iris dataset:
data(iris)
iris_data <- iris[, -5] # Remove species column for clustering
Step 3: Compute Distance Matrix
Calculate the distance between the data points:
distance_matrix <- dist(iris_data)
Step 4: Apply Hierarchical Clustering
Perform hierarchical clustering using different linkage methods. Here we use Ward’s method:
hc <- hclust(distance_matrix, method = "ward.D2")
Step 5: Plot the Dendrogram
Visualize the hierarchical clustering as a dendrogram:
plot(hc, main = "Dendrogram of Iris Data", xlab = "", sub = "", cex = 0.9, hang = -1)
Step 6: Cut the Dendrogram
Cut the dendrogram to form clusters:
groups <- cutree(hc, k = 3) # Assuming you want 3 clusters
rect.hclust(hc, k = 3, border = 2:4) # Optionally highlight clusters in red/green/blue
Step 7: Interpret the Clusters
Interpret the resulting clusters:
table(groups, iris$Species)
This table compares clusters with actual species to assess clustering quality.
Important Considerations
- Interpretation: Analyze the heights at which clusters are merged to understand cluster separability.
- Validation: Compare with true labels (if available) to validate clusters.
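One quick way to analyze those merge heights is to inspect the last few values of hc$height; a large jump between successive merges suggests a natural number of clusters. A short sketch, reusing the hc object fitted above in this section:
tail(hc$height, 10)   # heights of the 10 most recent (largest) merges
barplot(tail(hc$height, 10), main = "Heights of the last 10 merges", ylab = "Merge height")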
Conclusion
Understanding and interpreting dendrograms is crucial in hierarchical clustering. This hands-on guide provides a detailed implementation that helps in grasping the practical aspects of using dendrograms in R. You can now leverage this technique for your own datasets and clustering needs.
Data Preprocessing and Normalization
Data Preprocessing
Preprocessing is a crucial step in hierarchical clustering. It involves cleaning, transforming, and organizing your data. Below are the steps that you would typically take in R to preprocess data for hierarchical clustering.
Load Data
Assuming your dataset is stored in a CSV file:
# Load necessary library
library(tidyverse)
# Read the dataset
data <- read.csv("your_dataset.csv")
# Display the first few rows of the dataset
head(data)
Handling Missing Values
Handling missing values is vital as they can skew the results of the clustering process.
# Check for missing values
sum(is.na(data))
# Option 1: Remove rows with any missing values
data_clean <- na.omit(data)
# Option 2: Fill missing values with column mean (other imputation methods can also be used)
data_clean <- data %>%
mutate(across(everything(), ~ifelse(is.na(.), mean(., na.rm = TRUE), .)))
Removing Non-numerical Columns
Hierarchical clustering primarily works with numerical data, so it’s prudent to remove or transform non-numerical columns.
# If you have categorical data, you might need to convert it to numerical form or remove it.
# For simplicity, removing non-numerical columns here:
data_numeric <- data_clean %>%
select(where(is.numeric))
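If a categorical column carries information you want to keep, one common option is to convert it to 0/1 dummy columns before clustering; a minimal sketch with a small hypothetical data frame (the column names here are illustrative only):
# Hypothetical example: turn a factor column into numeric dummy columns
example_df <- data.frame(x = rnorm(6),
                         group = factor(c("a", "b", "a", "c", "b", "c")))
dummies <- model.matrix(~ group - 1, data = example_df)   # one 0/1 column per factor level
example_numeric <- cbind(example_df["x"], as.data.frame(dummies))
head(example_numeric)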
Data Normalization
Normalization (or standardization) is important to ensure that each feature contributes equally to the computation of distances in clustering.
# Normalize the numerical data
data_normalized <- scale(data_numeric)
# Confirm normalization
summary(data_normalized)
Example: Applying Preprocessing and Normalization
Here’s a script that performs these tasks sequentially:
library(tidyverse)
# Load data
data <- read.csv("your_dataset.csv")
# Handle missing values: pick ONE of the two options below
# Option 1: Remove rows with any NA values
data_clean <- na.omit(data)
# Option 2 (alternative): fill NA with the column mean instead of dropping rows
# data_clean <- data %>%
#   mutate(across(everything(), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))
# Retain only numerical columns
data_numeric <- data_clean %>%
select(where(is.numeric))
# Normalize the numerical data
data_normalized <- scale(data_numeric)
# Display summary and head of the cleaned and normalized data
summary(data_normalized)
head(data_normalized)
This script cleans, transforms, and normalizes your data, making it ready for hierarchical clustering analysis in R. Use this pipeline to prepare your data before moving on to the hierarchical clustering stages.
Exploring Clustering Algorithms in R
This section demonstrates the practical implementation of hierarchical clustering using R with a real-world dataset. The focus is on applying hierarchical clustering, plotting dendrograms, and interpreting the clusters.
Loading Required Libraries
# Load necessary libraries
library(cluster)
library(factoextra)
Dataset Example: Iris Dataset
The Iris dataset is commonly used for clustering and classification tasks. It contains 150 samples of iris flowers with four features: Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.
# Load the Iris dataset
data(iris)
iris_data <- iris[, -5] # Remove the species column for clustering
Compute Distance Matrix
Hierarchical clustering requires a distance matrix. We will use Euclidean distance.
# Compute the distance matrix
distance_matrix <- dist(iris_data, method = "euclidean")
Perform Hierarchical Clustering
Using the calculated distance matrix, we can perform hierarchical clustering with different linkage methods (e.g., complete, average, single).
# Perform hierarchical clustering using complete linkage
hc_complete <- hclust(distance_matrix, method = "complete")
# Perform hierarchical clustering using average linkage
hc_average <- hclust(distance_matrix, method = "average")
# Perform hierarchical clustering using single linkage
hc_single <- hclust(distance_matrix, method = "single")
Plot Dendrograms
Visualize the hierarchical clustering results with dendrograms.
# Plot dendrogram for complete linkage
plot(hc_complete, main = "Complete Linkage Dendrogram", xlab = "", sub = "", cex = 0.9)
# Plot dendrogram for average linkage
plot(hc_average, main = "Average Linkage Dendrogram", xlab = "", sub = "", cex = 0.9)
# Plot dendrogram for single linkage
plot(hc_single, main = "Single Linkage Dendrogram", xlab = "", sub = "", cex = 0.9)
Cut Dendrogram and Create Clusters
Choose a number of clusters and cut the dendrogram to define cluster memberships.
# Cut tree to create 3 clusters for complete linkage
clusters_complete <- cutree(hc_complete, k = 3)
# Cut tree to create 3 clusters for average linkage
clusters_average <- cutree(hc_average, k = 3)
# Cut tree to create 3 clusters for single linkage
clusters_single <- cutree(hc_single, k = 3)
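To see how the linkage choice changes the resulting partition, each 3-cluster solution can be cross-tabulated against the known species labels (used here only as an external reference, not as part of the clustering):
# Compare each linkage's clusters with the actual species
table(Complete = clusters_complete, Species = iris$Species)
table(Average = clusters_average, Species = iris$Species)
table(Single = clusters_single, Species = iris$Species)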
Visualize Clusters
Use the fviz_cluster function from the factoextra library to visualize the clusters.
# Visualize clusters for complete linkage
fviz_cluster(list(data = iris_data, cluster = clusters_complete), geom = "point", ellipse.type = "norm", ggtheme = theme_minimal(), main = "Complete Linkage Clusters")
# Visualize clusters for average linkage
fviz_cluster(list(data = iris_data, cluster = clusters_average), geom = "point", ellipse.type = "norm", ggtheme = theme_minimal(), main = "Average Linkage Clusters")
# Visualize clusters for single linkage
fviz_cluster(list(data = iris_data, cluster = clusters_single), geom = "point", ellipse.type = "norm", ggtheme = theme_minimal(), main = "Single Linkage Clusters")
These steps demonstrate how hierarchical clustering can be performed in R using real-world data. The chosen methods illustrate the entire workflow from loading data and calculating distances to performing clustering and visualizing the results.
Creating Dendrograms from Scratch
Objective
In this section, we will create dendrograms from scratch using R. A dendrogram is a tree-like diagram that records the sequence of merges or splits produced by hierarchical clustering.
Step-by-Step Implementation
We will use the hclust function, which performs hierarchical clustering, and then use the plot function to create the dendrogram.
1. Load and Prepare Your Data
# Load necessary libraries
library(datasets)
# Use a built-in dataset, for example, the Iris dataset
data <- iris[, -5] # Exclude the species column
# Inspect the data
head(data)
2. Calculate the Distance Matrix
# Calculate the Euclidean distance matrix
dist_matrix <- dist(data, method = "euclidean")
3. Perform Hierarchical Clustering
# Perform hierarchical clustering using the "complete" linkage method
hc <- hclust(dist_matrix, method = "complete")
4. Plot the Dendrogram
# Plot the dendrogram
plot(hc, main="Dendrogram for Iris Data", xlab="", sub="", cex=0.9)
5. Cut the Dendrogram for a Specific Number of Clusters (Optional)
To visualize specific clusters, you can cut the dendrogram.
# Cut the dendrogram to form 3 clusters
clusters <- cutree(hc, k=3)
# Plot the cut dendrogram
plot(hc)
rect.hclust(hc, k=3, border="red")
6. Assign Cluster Labels back to the Data (Optional)
# Assign cluster labels to the original data
iris$Cluster <- as.factor(clusters)
# Inspect the first few rows of the data with cluster labels
head(iris)
Conclusion
You have successfully created a dendrogram from scratch using hierarchical clustering in R. This method leverages the built-in functions hclust and plot to visualize the hierarchical clustering process, making it easier to understand the structure of your data.
Visualizing Cluster Trees in R
To visualize cluster trees in R, you can use hierarchical clustering techniques and then plot the dendrogram. Here is a step-by-step practical implementation:
Step 1: Load Necessary Libraries
library(stats) # For hierarchical clustering functions
library(dendextend) # For enhanced dendrogram functionalities
Step 2: Prepare Data
Suppose you have a dataset data_matrix that is already preprocessed and normalized.
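So that the steps below can be run end to end, any numeric matrix can stand in for data_matrix; the scaled built-in USArrests data is used here purely as an assumed placeholder:
# Hypothetical placeholder for a preprocessed dataset; replace with your own matrix
data_matrix <- scale(USArrests)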
Step 3: Perform Hierarchical Clustering
Here, we use the hclust() function for hierarchical clustering.
# Calculate the distance matrix
distance_matrix <- dist(data_matrix)
# Perform hierarchical clustering using complete linkage
cluster_tree <- hclust(distance_matrix, method = "complete")
Step 4: Plot the Dendrogram
The base plotting tools or specialized dendrogram packages can be used.
Base R Plot
# Plot basic dendrogram
plot(cluster_tree, main = "Cluster Dendrogram", xlab = "Samples", sub = "", cex = 0.9, hang = -1)
Enhanced Plot using dendextend
# Convert hclust object to a dendrogram object
dend <- as.dendrogram(cluster_tree)
# Customize the dendrogram (optional: color branches, labels etc.)
dend <- dend %>%
  set("branches_k_color", k = 4) %>%   # Color branches by 4 groups
  color_labels(k = 4) %>%              # Color labels by the same 4 groups
  set("labels_cex", 0.7)               # Shrink label text
# Plot the dendrogram
plot(dend, main = "Enhanced Cluster Dendrogram")
Step 5: Cutting the Dendrogram for Cluster Assignments
Determine the number of clusters and cut the tree.
# Cut tree into 4 clusters for example
clusters <- cutree(cluster_tree, k = 4)
# Re-plot the dendrogram, then draw rectangles around the 4 clusters
plot(cluster_tree, main = "Cluster Dendrogram", cex = 0.9, hang = -1)
rect.hclust(cluster_tree, k = 4, border = 2:5)
Step 6: Visualize Clusters on the Original Data (Optional)
This step helps in understanding how data points are clustered visually, typically done using PCA for dimensionality reduction.
# Perform Principal Component Analysis for 2D plotting
pca_result <- prcomp(data_matrix, scale. = TRUE)
# Plot PCA with clusters
plot(pca_result$x[,1], pca_result$x[,2], col = clusters,
main = "PCA of Clusters", xlab = "PC1", ylab = "PC2", pch = 19)
# Optionally add cluster centroids
centroids <- aggregate(pca_result$x[,1:2], list(clusters), mean)
points(centroids[,2:3], pch = 8, col = 1:nlevels(as.factor(clusters)), cex = 2)
Conclusion
This implementation provides a complete example of visualizing cluster trees in R, from hierarchical clustering to dendrogram plotting and cluster visualization on the original data. It integrates basic and enhanced visualization techniques to help you clearly understand and present your hierarchical clustering results.
Evaluating Clustering Results in R
Prerequisites
Assuming you have already performed hierarchical clustering and generated a dendrogram, we’ll go through the practical steps to evaluate the clustering results.
Internal Evaluation Metrics
- Silhouette Width: Measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation).
# Required Libraries
library(cluster)
# Example Data
set.seed(123)
data <- mtcars
# Perform Hierarchical Clustering
d <- dist(data) # Euclidean Distance
hc <- hclust(d, method = "ward.D2")
# Cut Tree to Create Clusters
clusters <- cutree(hc, k = 4)
# Calculate Silhouette Width
silhouette_info <- silhouette(clusters, d)
summary(silhouette_info)
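The overall average silhouette width is often reported as a single summary number; it can be pulled directly from the silhouette object:
# Mean silhouette width across all points (closer to 1 indicates better-separated clusters)
mean(silhouette_info[, "sil_width"])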
- Dunn Index: Ratio of the minimum inter-cluster distance to the maximum intra-cluster distance.
# Dunn Index requires the 'clValid' package
library(clValid)
# Calculate Dunn Index
dunn_index <- dunn(clusters = clusters, Data = data)
print(dunn_index)
External Evaluation Metrics
- Adjusted Rand Index (ARI): Compares the similarity between the clusters and the true labels.
# Example: Comparing clusters to ground truth using the 'mclust' package
library(mclust)
# Assuming you have true labels in true_labels vector
true_labels <- mtcars$cyl # For example, true labels based on the number of cylinders
# Calculate ARI
ari_value <- adjustedRandIndex(true_labels, clusters)
print(ari_value)
- Cluster Purity: Measures the extent to which clusters contain a single class.
# Calculate Cluster Purity
calculate_purity <- function(clusters, labels) {
contingency_table <- table(clusters, labels)
sum(apply(contingency_table, 1, max)) / length(labels)
}
# Assuming true_labels vector exists
purity <- calculate_purity(clusters, true_labels)
print(purity)
Visual Evaluation Metrics
- Cluster Plot: Visual examination of the clusters can provide insights. You can use the factoextra package for visualization:
# Required Library
library(factoextra)
# Visualize Clusters
fviz_cluster(list(data = data, cluster = clusters), geom = "point")
- Dendrogram with Colored Labels: Color the dendrogram based on the clusters for better visualization.
# Required Library
library(dendextend)
# Color Dendrogram
dend <- as.dendrogram(hc)
dend <- color_branches(dend, k = 4)
plot(dend)
Interpretation of Results
- Silhouette Width: Values close to 1 imply well-clustered data, values close to 0 mean overlapping clusters, and negative values imply misclassified points.
- Dunn Index: Higher values indicate better clustering.
- Adjusted Rand Index (ARI): Values close to 1 indicate high similarity, while values close to 0 indicate no agreement.
- Cluster Purity: Higher purity indicates better clustering based on true labels.
- Visual Inspection: Helps in understanding the nature and separation of clusters.
These metrics and visualizations will provide a comprehensive evaluation of your hierarchical clustering efforts. Apply these steps directly to your dataset in R to gauge the quality of your clustering.
Real-World Applications of Hierarchical Clustering
1. Customer Segmentation
Hierarchical clustering can be used to segment customers based on their purchasing behavior. For instance, an e-commerce company can analyze purchase history, the frequency of purchases, and total spending to identify distinct customer segments.
Practical Implementation
# Load necessary libraries
library(dplyr)
library(cluster)
# Assume `customer_data` is your dataset with relevant features
# Normalize the data
customer_data_scaled <- scale(customer_data)
# Compute the distance matrix
dist_matrix <- dist(customer_data_scaled, method = "euclidean")
# Perform hierarchical clustering
customer_hclust <- hclust(dist_matrix, method = "ward.D2")
# Cut the dendrogram to create clusters (let's assume 5 clusters)
customer_clusters <- cutree(customer_hclust, k = 5)
# Add cluster assignments to the original dataset
customer_data <- customer_data %>%
mutate(cluster = customer_clusters)
# Analyze cluster characteristics
cluster_summary <- customer_data %>%
group_by(cluster) %>%
summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
print(cluster_summary)
2. Document Clustering
Hierarchical clustering can be applied to group similar documents, which is useful in organizing large text corpora. For instance, news articles can be clustered based on their content to identify topics.
Practical Implementation
# Load necessary libraries
library(tm)
library(SnowballC)
# Assume `documents` is a character vector of text documents
# Create a Document-Term Matrix (rows = documents, columns = terms) so that documents are clustered
corpus <- Corpus(VectorSource(documents))
dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(3, Inf)))
# Convert the DTM to a matrix and scale it
dtm_matrix <- as.matrix(dtm)
dtm_matrix_scaled <- scale(dtm_matrix)
# Compute the distance matrix between documents
dist_matrix <- dist(dtm_matrix_scaled, method = "euclidean")
# Perform hierarchical clustering
doc_hclust <- hclust(dist_matrix, method = "ward.D2")
# Cut the dendrogram to create clusters (let's assume 3 clusters)
document_clusters <- cutree(doc_hclust, k = 3)
# Add cluster assignments to the original dataset
documents_data <- data.frame(documents, cluster = document_clusters)
print(documents_data)
3. Gene Expression Data Analysis
Hierarchical clustering is widely used in bioinformatics for analyzing gene expression data. Genes with similar expression patterns are grouped together, which can give insights into gene function and regulatory mechanisms.
Practical Implementation
# Load necessary libraries
library(gplots)
library(dplyr) # for the mutate()/group_by() pipeline used below
# Assume `gene_expression_data` is a data frame with rows as genes and columns as samples
# Normalize the data
gene_expression_data_scaled <- scale(gene_expression_data)
# Compute the distance matrix
dist_matrix <- dist(gene_expression_data_scaled, method = "euclidean")
# Perform hierarchical clustering
gene_hclust <- hclust(dist_matrix, method = "ward.D2")
# Cut the dendrogram to create clusters (let's assume 4 clusters)
gene_clusters <- cutree(gene_hclust, k = 4)
# Add cluster assignments to the original dataset
gene_expression_data <- gene_expression_data %>%
mutate(cluster = gene_clusters)
# Plot heatmap for visualization
heatmap.2(as.matrix(gene_expression_data[, -ncol(gene_expression_data)]),
Rowv = as.dendrogram(gene_hclust), Colv = FALSE, trace = "none",
dendrogram = "row", col = bluered(100),
density.info = "none", scale = "row")
# Optionally, analyze cluster characteristics (e.g., mean expression level per cluster)
cluster_summary <- gene_expression_data %>%
group_by(cluster) %>%
summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
print(cluster_summary)
These practical implementations demonstrate how to apply hierarchical clustering to real-world problems in R, enhancing your analytical skills in various domains including customer segmentation, document clustering, and gene expression analysis.
Advanced Clustering Techniques and Customizations
Agglomerative and Divisive Hierarchical Clustering in R
Hierarchical clustering can be performed using either agglomerative (bottom-up) or divisive (top-down) methods. Here, we will focus on the practical aspects using the hclust function (from base R’s stats package) and the diana function (from the cluster package).
Agglomerative Clustering
Step 1: Load Necessary Libraries
library(cluster) # For the `diana` function
library(dendextend) # For dendrogram customizations
Step 2: Load and Preprocess Data
Assume data is already preprocessed as per your project’s criteria:
# Replace 'your_data' with an actual dataset
data <- your_data
Step 3: Compute the Distance Matrix
Using the Euclidean distance, but this can be customized to other metrics as well:
dist_matrix <- dist(data, method = "euclidean")
Step 4: Perform Agglomerative Clustering
Use hclust:
hc <- hclust(dist_matrix, method = "ward.D") # or "complete", "single", etc.
Step 5: Plot the Dendrogram
plot(hc, main = "Agglomerative Hierarchical Clustering Dendrogram")
Divisive Clustering
Step 1: Perform Divisive Clustering
Use the diana function:
dc <- diana(data, metric = "euclidean") # or other distance metrics
Step 2: Convert to Dendrogram
# Convert the diana result to an hclust object first, then to a dendrogram
dc_dendrogram <- as.dendrogram(as.hclust(dc))
Step 3: Plot the Dendrogram
plot(dc_dendrogram, main = "Divisive Hierarchical Clustering Dendrogram")
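diana also stores a divisive coefficient, and its tree can be cut into flat clusters once converted to an hclust object; a short, self-contained sketch using the scaled iris measurements as an assumed example:
library(cluster)
x <- scale(iris[, -5])
dc_example <- diana(x, metric = "euclidean")
dc_example$dc                                       # divisive coefficient (closer to 1 = stronger structure)
dc_groups <- cutree(as.hclust(dc_example), k = 3)   # flat clusters from the divisive tree
table(dc_groups)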
Customizing Dendrograms for Better Interpretation
Step 1: Customize Leaves and Branches
dend_colored <- color_branches(hc, k = 4) # Coloring by clusters
dend_colored <- color_labels(dend_colored, k = 4)
Step 2: Customize Layout
dend_colored <- set(dend_colored, "labels_cex", 0.7) # Adjust label size
dend_colored <- set(dend_colored, "branches_lwd", 2) # Adjust branch width
dend_colored <- set(dend_colored, "branches_k_color", c("blue", "green", "red", "purple"), k = 4)
Step 3: Plot Customized Dendrogram
plot(dend_colored, main = "Customized Agglomerative Hierarchical Clustering Dendrogram")
Cutting the Dendrogram to Form Clusters
Step 1: Choose the Number of Clusters
k <- 4 # Number of clusters
clusters <- cutree(hc, k)
Step 2: Assign Clusters to the Data
cluster_assignment <- data.frame(data, cluster = clusters)
Step 3: Visualize Clusters
library(ggplot2)
# Assuming data contains at least two components for simple 2D visualization
ggplot(cluster_assignment, aes(x = Component1, y = Component2, color = as.factor(cluster))) +
geom_point() +
labs(title = "Clustered Data", color = "Cluster")
By following this implementation, you will be able to perform advanced hierarchical clustering techniques and customize dendrograms extensively to derive meaningful insights from your data.