In this era of big-data, pre-processing the raw data is one necessary step. Using the right mechanism, we can find some interesting patterns in our datasets. This time, we will learn about “point clustering” using Python and Plotly.
By definition, clustering is a task of grouping a set of objects in a way that objects in a particular group are more similar to each other rather than the objects in the other groups. It has multiple applications in almost every field. You can even segment your customers into different groups based on their purchase patterns.
This is a Python script demonstrating the basic clustering algorithm, “k-means”. Also, it will plot the clusters using Plotly API. It uses sample data points for now, but you can easily feed in your dataset.
How does k-means works?
We need to determine the number of clusters proactively, the value of “k”. This is also considered as a limitation of the “k-means” algorithm.
The algorithm has two steps, and it keeps executing these steps in a loop.
- AssignmentIn this step, every data points is assigned to the nearest ‘centroid’. The Euclidean Distance is used as a distance metric here.
- UpdateFollowing that, in the “Update Step” the position of the centroid is recalculated for every cluster.
These two steps are repeated until the converge state, which is when there is no change after the assignment step.
In our implementation, we choose random “k” points from our dataset as initial (means) centroids. That’s called the “initialization step”.
Python script for point clustering
You can get it from this gist. It comes integrated with Plotly API.
These are the features:
- Clusters the points using “k-means” algorithm.
- Facilitates plotting the clusters using the Plotly API.
- Can plot the intermediate steps of the algorithm.
- Saves the intermediate clusters as images using Plotly’s static image export.
$ python clustering.py [--allow-multiple]
Using the –allow-multiple flag is optional, it will generate the plots for all the intermediate cluster stages of the algorithm.
That’s the plot of the final clusters, the algorithm converged after 6 iterations. The bigger point represents the centroid of the cluster.
Here is an interactive execution of the “k-means” algorithm. You can see the “assignment step” where the points are changing their respective cluster (color) after every iteration.
It was created using imagemagic. Once you have the intermediate cluster images, you can convert them into a GIF.
$ sudo apt-get install imagemagic
$ convert -delay 100 -loop 0 *.png clustering.gif