Deciphering Unsupervised ML Algorithms and the Mathematical Theories #MLAlgorithms

Anju Reddy K

Hello everyone, I am Anju Reddy, and I have experience in computer vision and supervised machine learning algorithms, along with neural networks. In this blog we will dive deep into unsupervised machine learning algorithms and the mathematical intuition behind them. Stay with me until the end if you want to grasp everything about unsupervised machine learning algorithms.

Pre-requisites
Basics of machine learning

Topics Covered

  1. K Means Clustering
  2. Silhouette Performance Metric
  3. Hierarchical Clustering
  4. DBSCAN

1. K-Means Clustering

Imagine you have a variety of vegetables, and you want to group them into different categories based on their size and color. You don’t know in advance how many categories (clusters) there are, so you decide to try different numbers of groups and see which works best.

How Does K-Means Clustering Work?

i) Choose the number of groups (K):

  • You decide how many groups (K) you want to use to sort the vegetables. Each group represents a cluster.

ii) Place Group Centroids (Initial Centroids):

  • You randomly place K group centers (centroids) among the vegetables.

iii) Sort Vegetables into groups (Assign Clusters):

  • You assign each vegetable to the nearest group (centroid) based on its size and color. The distance from each vegetable to each centroid is calculated using a distance metric (usually Euclidean distance).

The distance between a vegetable and a centroid is given by

$$\text{distance} = \sqrt{(\text{size}_{\text{vegetable}} - \text{size}_{\text{centroid}})^2 + (\text{color}_{\text{vegetable}} - \text{color}_{\text{centroid}})^2}$$

where $\text{size}_{\text{vegetable}}$ and $\text{color}_{\text{vegetable}}$ are the size and color of the vegetable, and $\text{size}_{\text{centroid}}$ and $\text{color}_{\text{centroid}}$ are those of the centroid.
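As a quick sketch of that calculation in Python (the carrot's values here are hypothetical, chosen only for illustration):

```python
import math

def distance(size_veg, color_veg, size_cent, color_cent):
    """Euclidean distance between a vegetable and a centroid
    in (size, color) feature space."""
    return math.sqrt((size_veg - size_cent) ** 2 +
                     (color_veg - color_cent) ** 2)

# Hypothetical carrot (size 5 cm, color 7) vs. Centroid 1 (size 6, color 6)
print(distance(5, 7, 6, 6))  # -> 1.4142...
```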

iv) Compute Centroid of Each Group:

  • Once all vegetables are assigned, you compute the average (mean) size and color of the vegetables in each group. This average becomes the new centroid for each group.

v) Repeat Until Convergence:

  • Steps 3 and 4 are repeated until the vegetables stop switching groups (convergence).
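Putting the five steps together, here is a minimal from-scratch sketch in Python, assuming the data is a NumPy array with one (size, color) row per vegetable:

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Minimal K-Means over an (n_samples, n_features) array X."""
    rng = np.random.default_rng(seed)
    # Step ii) pick k distinct data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step iii) assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step iv) recompute each centroid as the mean of its points
        # (for brevity, this sketch ignores the empty-cluster edge case)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step v) stop once the centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```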

Step-by-Step Process with Example:

Let’s say you have five vegetables, Carrot, Pepper, Tomato, Broccoli, and Cabbage, each described by its size (in cm) and a numeric color value.

i) Choose K (Number of Groups):

  • You decide to start with K = 2 groups (clusters).

ii) Place Initial Centroids:

  • Randomly place two initial centroids (group centers) among the vegetables.

Let’s say initial centroids are:

  • Centroid 1: Size = 6 cm, Color = 6
  • Centroid 2: Size = 10 cm, Color = 7

iii) Assign Groups (Sort Vegetables):

  • Calculate the distance from each vegetable to each centroid and assign each vegetable to the nearest centroid.

Example:

  • Calculate distances for Carrot to both centroids and assign it to the nearest.

iv) Compute New Centroids:

  • After sorting all vegetables, compute the average size and color for each group to find the new centroids.

Example

  • New centroid for Group 1 (Centroid 1) might be Size = 6.5 cm, Color = 6.5
  • New centroid for Group 2 (Centroid 2) might be Size = 11 cm, Color = 6

v) Repeat Steps 3 and 4:

  • Repeat until the vegetables stop switching groups and centroids stabilize.
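To run the same loop end to end with a library, scikit-learn's KMeans can be used; the (size, color) values below are hypothetical stand-ins for the five vegetables:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (size in cm, color value) rows for
# Carrot, Pepper, Tomato, Broccoli, Cabbage
X = np.array([[5, 7], [6, 6], [7, 4], [7, 9], [10, 9]], dtype=float)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # group assignment per vegetable
print(km.cluster_centers_)  # final centroids after convergence
```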

Where K-Means Clustering is Used:

  • Market Segmentation: Grouping customers based on buying behavior.
  • Image Segmentation: Grouping pixels in an image based on color.
  • Anomaly Detection: Identifying unusual patterns in data.

Elbow Method and Within-Cluster Sum of Squares (WCSS):

  • Elbow Method: Helps to choose the optimal number of clusters (K).
  • Within-Cluster Sum of Squares (WCSS): Measures the compactness of clusters.

Elbow Method:

  • You try different numbers of groups (K) and measure WCSS.
  • Plot WCSS against K and look for a point where the decrease slows down (elbow point).

The quantity plotted is

$$\text{WCSS} = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

where:

  • $C_i$ is the set of points assigned to cluster $i$.
  • $\mu_i$ is the centroid of cluster $i$.
  • $\lVert x - \mu_i \rVert$ is the Euclidean distance between point $x$ and centroid $\mu_i$.
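A short sketch of the elbow method with scikit-learn, which exposes WCSS as the inertia_ attribute (same hypothetical data as above):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[5, 7], [6, 6], [7, 4], [7, 9], [10, 9]], dtype=float)  # hypothetical

# WCSS for each K; scikit-learn stores it as `inertia_`
for k in range(1, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))
# Plot these values against K and look for the "elbow" where the drop flattens.
```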

Summary:

  • K-means Clustering sorts data into K clusters based on similarity.
  • Centroids are average points of clusters.
  • Distance Metrics measure similarity.
  • Elbow Method helps choose K.
  • WCSS measures cluster tightness.

2. Silhouette Score (Performance Metric):

The silhouette score measures how similar each vegetable (or any data point) is to its own group (cluster) compared to other groups. It ranges from -1 to 1:

  • A score close to 1 means the vegetable is well-matched to its own group and poorly matched to neighboring groups.
  • A score close to 0 means the vegetable is on or very close to the decision boundary between two neighboring clusters.
  • A score close to -1 means the vegetable might have been assigned to the wrong group.

How to Calculate the Silhouette Score?

For each vegetable:

i) Calculate the average distance to all other vegetables in the same group (a).

ii) Calculate the average distance to all vegetables in the nearest different group (b).

iii) Compute the silhouette score for the vegetable using the formula:

$$s = \frac{b - a}{\max(a, b)}$$

where:

  • $a$ is the average distance to other vegetables in the same group.
  • $b$ is the average distance to vegetables in the nearest different group.

Step-by-Step Process With Example:

Calculating the Silhouette Score for Carrot (Group 1):

i) Calculate a (the average distance to other vegetables in the same group):

  • a = (3.61 + 1.41) / 2 = 2.51

ii) Calculate b (the average distance to vegetables in the nearest different group):

  • b = (5.66 + 2.83) / 2 ≈ 4.25

iii) Compute the silhouette score for Carrot:

  • s = (b − a) / max(a, b) = (4.25 − 2.51) / 4.25 ≈ 0.41

This score indicates that the carrot is reasonably well clustered in Group 1.
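As a quick check, the same arithmetic in Python:

```python
a, b = 2.51, 4.25        # values computed above for the carrot
s = (b - a) / max(a, b)  # silhouette formula
print(round(s, 2))       # -> 0.41
```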

Use of Silhouette Score:

The silhouette score helps you understand:

  • How well each vegetable fits within its assigned group.
  • Whether the number of groups (K) you’ve chosen is appropriate.

How to Use the Silhouette Score:

  • Compute the silhouette score for each vegetable.
  • Calculate the average silhouette score for all vegetables. This gives you an overall measure of how well the clustering has been performed.
  • Try different values of K (number of groups) and compute the average silhouette score for each K. The value of K with the highest average silhouette score is generally the best choice.
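A sketch of that K-selection loop using scikit-learn's silhouette_score, again on the hypothetical vegetable data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[5, 7], [6, 6], [7, 4], [7, 9], [10, 9]], dtype=float)  # hypothetical

# silhouette_score requires 2 <= K <= n_samples - 1
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
# Choose the K with the highest average silhouette score.
```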

Summary:

  • The silhouette score measures how well each vegetable is clustered.
  • It ranges from -1 to 1, with higher scores indicating better clustering.
  • Use it to determine the best number of clusters (K) by choosing the K with the highest average silhouette score.

3. Hierarchical Clustering:

There are two main types of hierarchical clustering:

i) Agglomerative (bottom-up): Starts with each vegetable in its own cluster and merges the closest clusters iteratively.

ii) Divisive (top-down): Starts with all vegetables in one cluster and splits them iteratively.

We’ll focus on the agglomerative approach, which is more common.

Step-by-Step Process with Example:

i) Calculate the Distance Between Each Pair of Vegetables:

  • Use a distance metric, usually Euclidean distance.

Example distances: Carrot-Pepper = 1.41, Pepper-Tomato = 2.24, Carrot-Broccoli = 2.83, Pepper-Broccoli = 3.16, Carrot-Tomato = 3.61, Carrot-Cabbage = 5.66, Pepper-Cabbage = 6.32.

ii) Create a Distance Matrix:

  • This matrix shows the distance between each pair of vegetables.

iii) Merge the Closest Clusters:

  • Start with each vegetable as its own cluster.
  • Find the pair of clusters with the smallest distance and merge them.

Example:

  • Merge Carrot and Pepper (distance = 1.41).

iv) Update the Distance Matrix:

  • Recalculate distances between the new cluster and all other clusters.

Example:

New cluster (Carrot-Pepper) distances:

  • To Tomato: average of (Carrot-Tomato and Pepper-Tomato) = (3.61 + 2.24) / 2 = 2.93
  • To Cabbage: average of (Carrot-Cabbage and Pepper-Cabbage) = (5.66 + 6.32) / 2 = 5.99
  • To Broccoli: average of (Carrot-Broccoli and Pepper-Broccoli) = (2.83 + 3.16) / 2 = 2.995

v) Repeat Until All Vegetables are Merged into a Single Cluster:

Continue merging the closest clusters and updating the distance matrix until all vegetables are in one cluster.
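In practice, SciPy implements this whole merge loop. A brief sketch on the hypothetical vegetable data, using average linkage to match the distance-averaging in step iv):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical (size, color) rows for the five vegetables
X = np.array([[5, 7], [6, 6], [7, 4], [7, 9], [10, 9]], dtype=float)
names = ["Carrot", "Pepper", "Tomato", "Broccoli", "Cabbage"]

Z = linkage(X, method="average", metric="euclidean")  # agglomerative merges

dendrogram(Z, labels=names)  # visualize the merge tree (see next section)
plt.ylabel("Merge distance")
plt.show()
```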

Dendrogram

A dendrogram is a tree-like diagram that shows the arrangement of the clusters produced by hierarchical clustering. It illustrates the process of merging clusters and helps decide the number of clusters by cutting the tree at a certain level.

How to Interpret a Dendrogram

  • Each leaf (bottom of the tree) represents a single vegetable.
  • As you move up the tree, vegetables are merged into clusters.
  • The height of the branches indicates the distance (or dissimilarity) between clusters.
  • You can “cut” the dendrogram at a chosen height to determine the number of clusters.
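That cut can be made programmatically with SciPy's fcluster, reusing the linkage matrix Z from the sketch above; the threshold of 3.0 is illustrative only:

```python
from scipy.cluster.hierarchy import fcluster

# Undo every merge above height 3.0 and return flat cluster labels,
# one id per vegetable (Z is the linkage matrix computed earlier).
labels = fcluster(Z, t=3.0, criterion="distance")
print(labels)
```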

Example Dendrogram

For our example, the dendrogram might look like this:

Carrot ─┐
Pepper ─┴──┐
Tomato ────┴──┐
Broccoli ─────┴──┐
Cabbage ─────────┴── (all vegetables)

Carrot and Pepper merge first (distance 1.41), Tomato joins next, then Broccoli, and finally Cabbage.

Why use Hierarchical Clustering?

  • No need to pre-specify the number of clusters.
  • Produces a hierarchy of clusters that can be interpreted at different levels of granularity.
  • Useful for exploratory data analysis to understand the structure of the data.

Mathematical Formula:

  • The primary mathematical component in hierarchical clustering is the distance calculation. The Euclidean distance between two points $p$ and $q$ is used to measure similarity:

$$d(p, q) = \sqrt{\sum_{i} (p_i - q_i)^2}$$

Summary:

  • Hierarchical Clustering: Groups data based on similarity, building a tree-like structure of clusters.
  • Agglomerative Approach: Starts with each item in its own cluster and merges the closest clusters iteratively.
  • Distance Matrix: Shows the distances between each pair of items.
  • Dendrogram: A tree diagram that illustrates the merging process and helps determine the number of clusters.

Source link: https://medium.com/@anju75061/understanding-unsupervised-ml-algorithms-and-math-behind-it-be655b8c3412?source=rss——artificial_intelligence-5
