Clustering Algorithm: Affinity Propagation

Praneethsanthosh
4 min read · Jul 2, 2021


Clustering is an unsupervised machine learning technique, widely used in sectors such as IT, health care, and the automobile industry. It groups data points based on characteristics such as similarity, shape, size, and behavior, and is usually used to organize data into structures that are easy to understand and manipulate. It can also be defined as dividing a dataset into a certain number of clusters in such a manner that the data points within each cluster share similar characteristics.

What are the types of clustering?

  1. K Means
  2. DBSCAN (Density-based spatial clustering of applications with noise)
  3. OPTICS (Ordering points to identify the clustering structure)
  4. HDBSCAN (Hierarchical density-based spatial clustering of applications with noise)
  5. Hierarchical clustering
  6. Fuzzy clustering
  7. Partitioning clustering
  8. PAM (Partitioning around medoids)
  9. Grid based clustering.

A good clustering algorithm should satisfy the following requirements:

  1. Scalability.
  2. Dealing with different types of attributes.
  3. Discovering clusters with arbitrary shape.
  4. Minimum requirement of domain knowledge to determine input parameters.
  5. Ability to deal with noise and outliers.
  6. Insensitivity to the order of input records.
  7. Ability to handle high dimensionality.
  8. Interpretability and usability.

What is affinity propagation?

Affinity propagation is a clustering algorithm based on the concept of “message passing” between data points. Unlike clustering algorithms such as k-means or k-medoids, affinity propagation does not require the number of clusters to be determined or estimated before running the algorithm. Like k-medoids, affinity propagation finds “exemplars”: members of the input set that are representative of clusters.

It is particularly well suited to problems where we don’t know the optimal number of clusters in advance.

Concept/Theory description:

A. Works directly on matrices; because these are n × n matrices, memory and runtime grow quadratically with the number of data points.

B. Uses four matrices to cluster: the similarity matrix, responsibility matrix, availability matrix, and criterion matrix.

C. Selects “exemplars”: objects that end up with the same exemplar are clustered together.

What is the similarity matrix?

  1. Contains values that correspond to how similar two objects are.
  2. Diagonal values (the “preferences”): fill them in with the lowest number among all the cells.
  3. Non-diagonal values: the negative of the sum of the squared differences between the two points’ features.
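As a sketch of the rules above (the points and values here are hypothetical), the similarity matrix can be built in NumPy like this:

```python
import numpy as np

# Hypothetical 2-D points
X = np.array([[1., 2.], [2., 3.], [8., 9.]])

# Non-diagonal values: negative sum of squared differences between points
diff = X[:, None, :] - X[None, :, :]
S = -(diff ** 2).sum(axis=-1)

# Diagonal values ("preferences"): the lowest number among all the cells
np.fill_diagonal(S, S.min())
print(S)
```

Lower (more negative) diagonal preferences make points less willing to be exemplars, which yields fewer clusters.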

What is the responsibility matrix?

  1. Contains values that correspond to how well suited one object is to serve as the exemplar for another.
  2. Example: take the cell that expresses the responsibility of column k to row i. On the first iteration it is calculated as the similarity of row i to column k minus the maximum of the remaining similarities in row i.
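The rule above can be sketched in NumPy on a hypothetical similarity matrix (availabilities start at zero, so the first-iteration responsibility reduces to the similarity minus the best remaining similarity in the row):

```python
import numpy as np

# Hypothetical similarity matrix for three points
S = np.array([[-22.,  -7.,  -6.],
              [ -7., -22., -17.],
              [ -6., -17., -22.]])

# First-iteration responsibility:
# r(i, k) = s(i, k) - max of the other similarities in row i
R = np.zeros_like(S)
for i in range(S.shape[0]):
    for k in range(S.shape[1]):
        others = np.delete(S[i], k)  # row i without column k
        R[i, k] = S[i, k] - others.max()
print(R)
```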

What is the availability matrix?

  1. Contains values that correspond to how available one object is to serve as an exemplar for another object.
  2. Diagonal values: sum of all positive responsibilities in the column, excluding the object’s self-responsibility.
  3. Non-diagonal values: take the column/row example again. The availability of column k to row i is column k’s self-responsibility plus the sum of the remaining positive responsibilities in column k, excluding the responsibility of column k to row i.
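These rules can be sketched as follows. The responsibility values are hypothetical; note that in Frey and Dueck’s formulation the non-diagonal availabilities are additionally capped at zero, which the sketch includes:

```python
import numpy as np

# Hypothetical responsibility matrix for three points
R = np.array([[-10.,  5., -3.],
              [  4., -9., -6.],
              [  2., -8., -1.]])

n = R.shape[0]
A = np.zeros_like(R)
for k in range(n):
    pos = np.maximum(R[:, k], 0)  # positive responsibilities in column k
    for i in range(n):
        if i == k:
            # Diagonal: sum of positive responsibilities in the column,
            # excluding the object's self-responsibility
            A[k, k] = pos.sum() - pos[k]
        else:
            # Non-diagonal: self-responsibility of k plus the remaining
            # positive responsibilities (excluding rows i and k), capped at 0
            A[i, k] = min(0.0, R[k, k] + pos.sum() - pos[i] - pos[k])
print(A)
```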

What is the criterion matrix?

  1. Each cell is simply the sum of the availability matrix and the responsibility matrix at that location.
  2. The column holding the highest criterion value in each row designates that row’s exemplar.
  3. Rows that share the same exemplar end up in the same cluster.
  4. Number of exemplars = number of clusters.
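A minimal sketch of this last step, using hypothetical responsibility and availability matrices for four points:

```python
import numpy as np

# Hypothetical responsibility and availability matrices for four points
R = np.array([[3., 1., 0., 0.],
              [2., 1., 1., 0.],
              [0., 0., 1., 3.],
              [0., 1., 1., 4.]])
A = np.array([[2., 0., 0., 0.],
              [2., 1., 0., 0.],
              [0., 0., 0., 3.],
              [0., 0., 1., 3.]])

C = R + A                      # criterion matrix
exemplars = C.argmax(axis=1)   # column with the highest criterion per row
n_clusters = len(set(exemplars.tolist()))
print(exemplars, n_clusters)
```

Here rows 0 and 1 share exemplar 0 and rows 2 and 3 share exemplar 3, so the four points form two clusters.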

Why do we use affinity propagation?

The inventors of affinity propagation showed that it performs better than k-means on certain computer vision and computational biology tasks, e.g. clustering pictures of human faces and identifying regulated transcripts, even when k-means was allowed many random restarts and initialized using PCA.

[Figure: scatter plot illustrating the principle behind affinity propagation. Source: Google]

Demo code (scikit-learn):

from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Generate sample data around three centers
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=300, centers=centers,
                            cluster_std=0.5, random_state=0)

# Compute affinity propagation; the preference controls how many
# exemplars (and hence clusters) emerge
af = AffinityPropagation(preference=-50).fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_

n_clusters_ = len(cluster_centers_indices)
print("Estimated number of clusters:", n_clusters_)

Source: scikit-learn documentation

Advantages:

Affinity propagation (AP) clustering does not need the number of clusters to be set in advance, and has advantages in efficiency and accuracy, but it is not suitable for large-scale data clustering. A common workaround is to first divide the dataset into several subsets, each of which can be efficiently clustered by the AP algorithm.

Drawbacks:

  1. It is hard to know which value of the preference parameter will yield an optimal clustering solution.
  2. When oscillations occur, AP cannot eliminate them automatically.
