Mind-Let Loose: Clustering

I was reminded of this concept at a presentation recently; I first learnt it as a subject at university. It was a rather confusing topic for me in those days, especially with all the algorithms involved. However, I think I now know it well enough to explain the concept in a simple way that most people can grasp.

Clustering groups objects so that the members of each group are similar to one another and dissimilar to the members of other groups. The objects can be any kind of data.

So why should data be clustered like this? It's simple: clustering helps reveal important patterns in seemingly unimportant data. Such patterns are invaluable to organizations such as business firms and research firms. Some example applications:

Marketing: finding groups of customers with similar behavior in a large database of customer properties and past buying records
Biology: classifying plants and animals based on their features
Libraries: book ordering
Insurance: identifying groups of motor insurance policy holders with a high average claim cost, and detecting fraud
WWW: classifying documents, and clustering weblog data to discover groups of similar access patterns

Clustering algorithms can be classified into four types:

Exclusive Clustering - Data are grouped in an exclusive way: if a datum belongs to one cluster, it cannot be included in any other cluster
Overlapping Clustering - Uses fuzzy sets to cluster data: each point may belong to two or more clusters, with a different degree of membership in each
Hierarchical Clustering - Is based on the union of the two nearest clusters. The starting condition is that every datum is its own cluster. The two nearest clusters are then merged, and this is repeated until the desired final clusters are reached
Probabilistic Clustering - Uses a completely probabilistic approach: the data are modeled as coming from a mixture of probability distributions, and each point belongs to each cluster with some probability
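
To make the hierarchical description above more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the function names, the sample points, and the choice of "single linkage" (the distance between two clusters is the distance between their closest pair of points) are mine, and other ways of measuring the distance between clusters are just as valid.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def single_link(a, b):
    """Distance between two clusters: their closest pair of points (an assumed choice)."""
    return min(dist(p, q) for p in a for q in b)


def agglomerate(points, k):
    """Merge the two nearest clusters repeatedly until only k clusters remain."""
    clusters = [[p] for p in points]  # starting condition: every datum is a cluster
    while len(clusters) > k:
        # Find the pair of clusters (i < j) that are nearest to each other.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters[j]  # union of the two nearest clusters
        del clusters[j]             # j > i, so index i is not disturbed
    return clusters


# Hypothetical sample data: three points near (1, 1) and two near (8, 8).
points = [(1.0, 1.0), (1.5, 1.2), (8.0, 8.0), (8.3, 7.9), (1.2, 0.8)]
result = agglomerate(points, 2)
print(result)
```

With this sample data the three points near (1, 1) end up in one cluster and the two points near (8, 8) in the other. The sketch recomputes every distance on each pass, so it is fine for a handful of points but far too slow for real data sets.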

The four most used clustering algorithms are as follows:

K-means
Fuzzy C-means
Hierarchical clustering
Mixture of Gaussians

K-means is an exclusive clustering algorithm, whereas Fuzzy C-means is an overlapping clustering algorithm. Hierarchical clustering obviously belongs to the hierarchical type. Mixture of Gaussians is classified under probabilistic clustering.
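
The difference between exclusive and overlapping clustering is easier to see side by side, so here is a rough sketch in Python. The K-means part follows the usual assign-then-average loop. The fuzzy part is only the membership step of Fuzzy C-means (with the common fuzzifier m = 2), not the full algorithm. All names, the sample points, and the starting centers are my own assumptions for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def assign_exclusive(points, centers):
    """K-means style: every point gets exactly one cluster index (the nearest center)."""
    return [min(range(len(centers)), key=lambda c: dist(p, centers[c])) for p in points]


def update_centers(points, labels, centers):
    """Move each center to the mean of its members; keep it in place if it has none."""
    new_centers = []
    for c, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            new_centers.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers


def kmeans(points, centers, max_iter=100):
    """Alternate assignment and averaging until the centers stop moving (max_iter >= 1)."""
    for _ in range(max_iter):
        labels = assign_exclusive(points, centers)
        new_centers = update_centers(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def fuzzy_memberships(point, centers, m=2.0):
    """Fuzzy C-means membership step: one degree of membership per cluster, summing to 1."""
    d = [dist(point, c) for c in centers]
    if 0.0 in d:  # the point sits exactly on a center
        zeros = d.count(0.0)
        return [1.0 / zeros if x == 0.0 else 0.0 for x in d]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** power for dk in d) for di in d]


# Hypothetical sample data and starting centers.
points = [(1.0, 1.0), (1.5, 1.2), (1.2, 0.8), (8.0, 8.0), (8.3, 7.9)]
labels, centers = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(labels)  # one cluster index per point

# A point lying between the two clusters belongs partly to both.
u = fuzzy_memberships((4.5, 4.5), centers)
print(u)
```

K-means gives each point a single label, so a point is either in a cluster or not. The fuzzy membership step instead returns one number per cluster, so a point lying between two clusters is shared between them rather than forced into one.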

Sunday, October 31, 2010
