K-Means is a popular unsupervised machine learning algorithm used for clustering. The main purpose of this algorithm is to categorize data points into well-defined, non-overlapping clusters, ensuring each point is assigned to the cluster with the closest mean.
In this tutorial, we'll learn how to cluster data with the K-Means algorithm using the KMeans class of scikit-learn in Python. The tutorial covers:
- Understanding K-Means algorithm
- Preparing the data
- Clustering with KMeans
- Source code listing
Understanding K-Means algorithm
The objective of this algorithm is to partition a dataset into K distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group. The algorithm accomplishes this by iteratively assigning data points to clusters based on the mean (centroid) of points in the cluster.
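For reference, this objective is commonly written as minimizing the within-cluster sum of squared distances. The symbols below are notation introduced here for illustration and do not come from the tutorial: C_k is the set of points in cluster k, and \mu_k is its centroid.

```latex
\[
\min_{C_1, \dots, C_K} \; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
\]
```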
The K-Means algorithm involves the following steps:
1. Initialization:
- Specifies the desired number of clusters, denoted as K.
- Initializes centroids randomly; these serve as the initial approximations for the cluster centers.
2. Assignment:
- Associates each data point with the cluster whose centroid is the nearest. Typically, Euclidean distance is employed, although alternative metrics are applicable.
3. Update:
- Reassesses the centroid of each cluster based on the currently assigned data points. The updated centroid is determined as the mean (average) of all data points in that cluster.
4. Iteration:
- Repeats the assignment and update steps until convergence, that is, until the centroids (and therefore the cluster assignments) stop changing.
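The four steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the tiny `points` array are all hypothetical, and initialization here simply picks K random data points as the starting centroids.

```python
import numpy as np


def kmeans_sketch(x, k, max_iter=100, seed=0):
    """Plain NumPy sketch of the four K-Means steps (illustrative only)."""
    rng = np.random.default_rng(seed)

    # 1. Initialization: pick k distinct data points as the starting centroids.
    centroids = x[rng.choice(len(x), size=k, replace=False)].astype(float)

    for _ in range(max_iter):
        # 2. Assignment: Euclidean distance from every point to every centroid,
        #    then take the index of the nearest centroid.
        distances = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # 3. Update: each centroid becomes the mean of its assigned points.
        #    An empty cluster keeps its previous centroid.
        new_centroids = centroids.copy()
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                new_centroids[j] = members.mean(axis=0)

        # 4. Iteration: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, labels


# Two well-separated groups of 2-D points (made-up data for illustration).
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
centers, labels = kmeans_sketch(points, k=2)
print(centers)
print(labels)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful, not the label values.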
Now, let's implement K-Means clustering in Python.
Preparing the data
We'll start by loading the required packages.
We'll generate sample data for this tutorial and visualize it in a plot.
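A minimal sketch of this step might look like the following. The tutorial does not state how the data was generated, so the use of `make_blobs` and every argument passed to it (sample count, number of centers, spread, seed) are assumptions for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Generate 2-D sample data. make_blobs and all of its arguments here are
# assumptions for illustration; the tutorial does not state how x was created.
x, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.8, random_state=1)

# Scatter plot of the unlabeled points.
plt.scatter(x[:, 0], x[:, 1], s=15)
plt.title("Sample data")
plt.show()
```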
Clustering with KMeans
Next, we'll define the model using the KMeans class, set the n_clusters parameter to 5, and fit the model to the x data.
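As a hedged sketch of this step: the data below is a stand-in generated with assumed `make_blobs` settings, and `n_init` and `random_state` are added only to make runs repeatable; the tutorial itself only mentions setting the number of clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in for the tutorial's x data (the make_blobs settings are assumed).
x, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.8, random_state=1)

# n_init and random_state are set only so that runs are repeatable across
# scikit-learn versions; the tutorial itself only sets the number of clusters.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
kmeans.fit(x)

# Print the model's parameters; the exact keys and defaults depend on the
# installed scikit-learn version.
print(kmeans.get_params())
```

The printed dictionary varies between scikit-learn versions, so it may not match the tutorial's output exactly.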
{'algorithm': 'auto', 'copy_x': True, 'init': 'k-means++', 'max_iter': 300, 'n_clusters': 5,
We can extract the cluster centers and the label of each data point from the model's attributes (cluster_centers_ and labels_).
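The following sketch reads both attributes from a fitted model and plots them. The stand-in data, the `n_init` and `random_state` values, and the plot styling (colormap, marker, sizes) are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in data and model (settings are assumed, not from the tutorial).
x, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.8, random_state=1)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(x)

# Fitted attributes: one centroid per cluster and one label per data point.
centers = kmeans.cluster_centers_   # array of shape (5, 2)
labels = kmeans.labels_             # array of shape (300,), values 0 to 4

# Color each point by its cluster label and mark the centroids.
plt.scatter(x[:, 0], x[:, 1], c=labels, s=15, cmap="viridis")
plt.scatter(centers[:, 0], centers[:, 1], c="red", marker="x", s=100,
            label="centroids")
plt.legend()
plt.show()
```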
Here, the model clustered the data into 5 clusters. Using the model outputs, we highlighted the clusters with different colors and plotted the centroid of each cluster.
We can also fit the model without setting the number of clusters. Here, we'll define the model with default parameters and fit it again, then visualize the resulting clusters.
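A sketch of the default-parameter run, again on stand-in data with assumed `make_blobs` settings and assumed plot styling:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Stand-in for the tutorial's x data (the make_blobs settings are assumed).
x, _ = make_blobs(n_samples=300, centers=5, cluster_std=0.8, random_state=1)

# No n_clusters argument: scikit-learn falls back to its default of 8.
kmeans_default = KMeans().fit(x)
print(kmeans_default.n_clusters)

# Plot the points colored by label, with the centroids marked.
centers = kmeans_default.cluster_centers_
plt.scatter(x[:, 0], x[:, 1], c=kmeans_default.labels_, s=15, cmap="viridis")
plt.scatter(centers[:, 0], centers[:, 1], c="red", marker="x", s=100)
plt.show()
```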
With default parameters, the model divided the data into eight clusters, because the default value of n_clusters is 8.
In this tutorial, we've briefly learned how to cluster data with the KMeans class of scikit-learn in Python. The K-Means clustering algorithm is a simple and widely used approach to unsupervised machine learning. After initializing the centroids, it repeats the assignment and update steps until the dataset is divided into distinct clusters.
The full source code is listed below.
Source code listing