K-means Clustering Example in Python

    K-Means is a popular unsupervised machine learning algorithm used for clustering. Its goal is to partition data points into well-defined, non-overlapping clusters, assigning each point to the cluster with the nearest mean (centroid).

    In this tutorial, we'll learn how to cluster data with the K-Means algorithm using the KMeans class of scikit-learn in Python. The tutorial covers:

  1. Understanding K-Means algorithm
  2. Preparing the data
  3. Clustering with KMeans
  4. Source code listing


Understanding K-Means algorithm

    The objective of this algorithm is to partition a dataset into K distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group. The algorithm accomplishes this by iteratively assigning data points to clusters based on the mean (centroid) of points in the cluster.

     The K-Means algorithm involves the following steps:

1. Initialization:

  • Specifies the desired number of clusters, denoted as K.
  • Initializes K centroids randomly; these serve as the initial approximations for the cluster centers.

2. Assignment:

  • Associates each data point with the cluster whose centroid is the nearest. Typically, Euclidean distance is employed, although alternative metrics are applicable.

3. Update:

  • Reassesses the centroid of each cluster based on the currently assigned data points. The updated centroid is determined as the mean (average) of all data points in that cluster.

4. Iteration:

  • Repeats the assignment and update steps until convergence, typically when the cluster assignments (or centroids) no longer change, or a maximum number of iterations is reached. A minimal from-scratch sketch of these steps follows this list.
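
A minimal from-scratch sketch of these four steps, written with NumPy, is shown below. It is only meant to illustrate the algorithm: the purely random initialization, the fixed iteration cap, and the simple convergence check are deliberate simplifications compared to scikit-learn's implementation.


import numpy as np

def kmeans_sketch(x, k, n_iters=100, seed=1):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k data points at random as the initial centroids
    centroids = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assignment: attach each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update: recompute each centroid as the mean of its assigned points
        #    (keep the old centroid if a cluster happens to be empty)
        new_centroids = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Iteration: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


In the rest of the tutorial we rely on scikit-learn's KMeans instead, which adds smarter initialization (k-means++), multiple restarts via n_init, and a tolerance-based stopping rule.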

Now, let's dive into the Python code for K-Means clustering with scikit-learn.
 

Preparing the data

We'll start by loading the required packages.

 
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
  

We'll generate sample data for this tutorial and visualize it in a plot.


# Generate synthetic data with make_blobs
x, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.1, random_state=1)

# Visualize the generated data
plt.scatter(x[:, 0], x[:, 1])
plt.title("Generated Data")
plt.show()
 




Clustering with KMeans

    Next, we'll define the model using the KMeans class, set the n_clusters parameter to 5, and fit the model to the x data.


# Perform KMeans clustering with default parameters
kmeans = KMeans(n_clusters=5).fit(x)
 
# Access and print entire attribute dictionary
print("All parameters:")
print(kmeans.get_params())
 
All parameters:
{'algorithm': 'auto', 'copy_x': True, 'init': 'k-means++', 'max_iter': 300, 'n_clusters': 5, 
'n_init': 10, 'n_jobs': 'deprecated', 'precompute_distances': 'deprecated', 
'random_state': None, 'tol': 0.0001, 'verbose': 0} 

Note that the exact parameter names and default values printed here depend on the installed scikit-learn version. We can extract the cluster centers and the label of each data point from the fitted model's attributes.

 
# Get cluster centers and labels
centers = kmeans.cluster_centers_
labels = kmeans.labels_

# Visualize the clustered data and centroids
plt.scatter(x[:, 0], x[:, 1], c=labels)
plt.scatter(centers[:, 0], centers[:, 1], marker='*', color="r", s=80)
plt.show()

 

    Here, the model grouped the data into 5 clusters. Using the model outputs, we colored the points by their cluster labels and plotted the centroid of each cluster.
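
Once fitted, the model can also assign new, unseen points to the learned clusters with its predict method, and its inertia_ attribute reports the within-cluster sum of squared distances on the training data. The snippet below is a small sketch that reuses the kmeans model fitted above; the two sample points are made up purely for illustration.


import numpy as np

# Hypothetical new observations (made up for this example)
new_points = np.array([[0.0, 0.0], [-5.0, 5.0]])

# Each new point is assigned to the nearest learned centroid
print("Predicted clusters:", kmeans.predict(new_points))

# Within-cluster sum of squared distances of the fitted model
print("Inertia:", kmeans.inertia_)
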

    We can also fit the model without specifying the number of clusters. Here, we'll define the model with its default parameters, fit it to the same data, and visualize the resulting clusters.

 
# Perform KMeans clustering with default parameters
kmeans = KMeans().fit(x)

# Get cluster centers and labels
centers = kmeans.cluster_centers_
labels = kmeans.labels_

# Visualize the clustered data and centroids
plt.scatter(x[:, 0], x[:, 1], c=labels)
plt.scatter(centers[:, 0], centers[:, 1], marker='*', color="r", s=80)
plt.show()
 


With the default parameters, the model divided the data into eight clusters, since n_clusters defaults to 8.
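Because the synthetic data was generated with five centers, the default of eight clusters over-segments it. A simple sketch of the elbow method is shown below, reusing the x data and imports from above: it refits KMeans for a range of candidate cluster counts and compares the inertia_ (within-cluster sum of squared distances) of each fit. The range of 1 to 10 and the fixed random_state are arbitrary choices for illustration.


# Compare inertia for several candidate cluster counts (elbow method sketch)
inertias = []
k_values = range(1, 11)
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(x)
    inertias.append(model.inertia_)

# Plot inertia against the number of clusters
plt.plot(list(k_values), inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow Method")
plt.show()


The inertia always decreases as k grows, so we look for the "elbow" where the improvement levels off rather than for the minimum value itself.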

   In this tutorial, we've briefly learned how to cluster data with the KMeans class in Python. The K-Means clustering algorithm offers a robust and widely used approach to unsupervised machine learning. Through iterative initialization, assignment, and update steps, K-Means efficiently divides a dataset into distinct clusters.
    The full source code is listed below.


Source code listing

 
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Generate synthetic data with make_blobs
x, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.1, random_state=1)

# Visualize the generated data
plt.scatter(x[:, 0], x[:, 1])
plt.title("Generated Data")
plt.show()

# Perform KMeans clustering with a specified number of clusters (5)
kmeans = KMeans(n_clusters=5).fit(x)
# Access and print entire attribute dictionary
print("All parameters:")
print(kmeans.get_params())

# Get cluster centers and labels
centers = kmeans.cluster_centers_
labels = kmeans.labels_

# Visualize the clustered data and centroids
plt.scatter(x[:, 0], x[:, 1], c=labels)
plt.scatter(centers[:, 0], centers[:, 1], marker='*', color="r", s=80)
plt.title("KMeans Clustering (3 Clusters)")
plt.show()

# Perform KMeans clustering with default parameters
kmeans = KMeans().fit(x)
print("KMeans Model with Default Parameters:")
print(kmeans.get_params())

# Get cluster centers and labels
centers = kmeans.cluster_centers_
labels = kmeans.labels_

# Visualize the clustered data and centroids
plt.scatter(x[:, 0], x[:, 1], c=labels)
plt.scatter(centers[:, 0], centers[:, 1], marker='*', color="r", s=80)
plt.title("KMeans Clustering (Default Parameters)")
plt.show()
 

 

 
