Week 9 | Session 3: K-means Clustering — Python Implementation (Google Colab)

Course: Supply Chain Digitization — Module 3: Analytics in SCM



  • Previous Sessions: Concept — what K-means is, how the algorithm works, WCSS, Elbow Method.
  • This Session: Implementation — write Python code in Google Colab, reproduce the cluster output.
  • Dataset: customer_location.csv — 811 rows, 3 columns (serial no., latitude, longitude).
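Since the Elbow Method rests on WCSS, it may help to see that quantity computed by hand. The sketch below uses four made-up points, not the course dataset, and the helper name `wcss` is hypothetical. For each cluster it takes the mean of the members as the centroid and sums the squared distances to it. This is the quantity scikit-learn reports as `inertia_` on a fitted model.

```python
import numpy as np

def wcss(points, labels):
    """Within-cluster sum of squares for a given assignment (hypothetical helper)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = points[labels == k]          # rows assigned to cluster k
        centroid = members.mean(axis=0)        # centroid = mean of the members
        total += ((members - centroid) ** 2).sum()
    return total

# Made-up toy data: two tight pairs of points, far apart
toy_points = [[0, 0], [0, 2], [10, 0], [10, 2]]
toy_labels = [0, 0, 1, 1]

toy_wcss = wcss(toy_points, toy_labels)
print(toy_wcss)                          # 4.0 with two clusters
print(wcss(toy_points, [0, 0, 0, 0]))    # 104.0 with everything in one cluster
```

Going from one cluster to two drops the WCSS from 104 to 4 on this toy data. A sharp fall of that kind, followed by much smaller gains, is what the elbow diagram is meant to reveal.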

| Library (alias) | Purpose |
|---|---|
| pandas (pd) | Data manipulation and analysis: read CSV files, create DataFrames. |
| numpy (np) | Numerical Python: mathematical and logical operations on arrays. |
| matplotlib.pyplot (plt) | Plotting library for Python: line charts, scatter plots. |
| seaborn (sn) | High-level visualization library: cluster plots with color coding. |
| sklearn.cluster.KMeans | K-means clustering implementation: fit clusters, get labels and centroids. |

  1. Import Data: Load customer_location.csv using pandas.
  2. Plot Raw Data: Visualize all 811 points on a lat/long scatter plot (seaborn).
  3. Select Features: Drop serial number column — keep only latitude & longitude.
  4. Find Optimal K: Loop K=1 to 9, compute WCSS, plot Elbow Diagram.
  5. Form Clusters: Run KMeans(n_clusters=4).fit(). Assign IDs.
  6. Plot Clusters: Color-coded scatter plot.
  7. Get Centroids: Extract cluster_centers_ (proposed DC locations).
  8. Plot Centroids: Superimpose markers (‘x’) on the cluster plot.

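# Step 1: load the customer locations into a DataFrame
import pandas as pd
df = pd.read_csv('customer_location.csv')
df.head()

# Step 2: scatter plot of the raw points (fit_reg=False suppresses the regression line)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
sn.lmplot(x='latitude', y='longitude', data=df, fit_reg=False, height=4)
plt.title('Customer Locations')
plt.show()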

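Remove irrelevant columns: the serial number carries no location information, so keep only latitude and longitude as the clustering features.

# .copy() gives an independent DataFrame, so adding a column later cannot affect df
new_df = df[['latitude', 'longitude']].copy()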
from sklearn.cluster import KMeans
cluster_range = range(1, 10)  # K = 1 to 9
cluster_errors = []           # WCSS value for each K
for num_clusters in cluster_range:
    clusters = KMeans(n_clusters=num_clusters)
    clusters.fit(new_df)
    cluster_errors.append(clusters.inertia_)  # inertia_ = WCSS
plt.figure(figsize=(6, 4))
plt.plot(cluster_range, cluster_errors, marker='o')
plt.title('Elbow Diagram')
plt.xlabel('Number of Clusters')
plt.ylabel('Sum of Squared Error')
plt.show()
clusters_new = KMeans(n_clusters=4)  # K = 4, read off the elbow diagram
clusters_new.fit(new_df)
# Add cluster ID as a new column (labels_ holds one ID per row)
new_df.insert(loc=2, column='cluster_id', value=clusters_new.labels_)
sn.lmplot(x='latitude', y='longitude', data=new_df,
          hue='cluster_id', fit_reg=False, height=4)
plt.show()
centers = np.array(clusters_new.cluster_centers_)  # one row per cluster: [latitude, longitude]
print(centers)
# Example output, row for cluster 0:
# [27.68, 80.90]
sn.lmplot(x='latitude', y='longitude', data=new_df,
          hue='cluster_id', fit_reg=False, height=4)
# Overlay the centroids: column 0 is latitude (x-axis), column 1 is longitude (y-axis)
plt.scatter(centers[:, 0], centers[:, 1], marker='x', s=100, c='black')
plt.show()
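K-means starts from randomly chosen initial centroids, so the cluster IDs (and hence the plot colors) can differ from one run to the next. The sketch below shows one way to make a run repeatable. It assumes synthetic data (four made-up blobs) in place of customer_location.csv. `random_state` and `n_init` are existing `KMeans` parameters, and the values 42 and 10 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the customer data: four made-up blobs (assumption)
rng = np.random.default_rng(0)
blob_centers = np.array([[27.3, 80.6], [27.3, 81.1], [27.7, 80.6], [27.7, 81.1]])
points = np.vstack([c + 0.03 * rng.standard_normal((50, 2)) for c in blob_centers])

# Fixing random_state fixes the initialisation, so both runs give the same labels
run_a = KMeans(n_clusters=4, n_init=10, random_state=42).fit(points)
run_b = KMeans(n_clusters=4, n_init=10, random_state=42).fit(points)

same_labels = np.array_equal(run_a.labels_, run_b.labels_)
print(same_labels)
```

Without a fixed `random_state` the grouping of points is usually the same on well-separated data, but which group is called 0, 1, 2 or 3 is not guaranteed.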

5. Final Output — Centroid (DC) Locations

After running the full code, K-means returns 4 centroids — the proposed DC locations:

| Cluster ID | Centroid Lat | Centroid Long | Proposed DC serves… |
|---|---|---|---|
| 0 | 27.68 | 80.90 | Blue cluster customers |
| 1 | 27.42 | 81.15 | Orange cluster customers |
| 2 | 27.31 | 80.83 | Green cluster customers |
| 3 | 27.56 | 80.57 | Red cluster customers |
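A table like this can be rebuilt from a labelled DataFrame, because each centroid is the mean position of the customers in its cluster once the fit has converged. The sketch below assumes a DataFrame shaped like new_df (latitude, longitude, cluster_id). The six rows and the names `toy_df` and `summary` are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for the labelled customer data
toy_df = pd.DataFrame({
    'latitude':   [27.60, 27.70, 27.40, 27.44, 27.30, 27.32],
    'longitude':  [80.80, 81.00, 81.10, 81.20, 80.80, 80.86],
    'cluster_id': [0, 0, 1, 1, 2, 2],
})

# Per cluster: mean position (the centroid) and number of customers served
summary = toy_df.groupby('cluster_id').agg(
    centroid_lat=('latitude', 'mean'),
    centroid_long=('longitude', 'mean'),
    customers=('latitude', 'size'),
)
print(summary.round(2))
```

The `customers` column gives a first indication of how demand would be split across the proposed DCs.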

  • Pipeline: Import → Plot raw → Select features → Elbow diagram → Fit K=4 → Plot clusters → Get centroids → Plot centroids.
  • Key output: 4 centroid coordinates = proposed DC locations; 811 cluster IDs = customer-DC mapping.
  • Tool: Google Colab.
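The customer-DC mapping can also be applied to a location that was not in the dataset. The sketch below takes the four centroids from the Final Output table and finds the nearest one to a hypothetical new customer. It uses the haversine formula to express the distance in kilometres, which is a swapped-in measure: K-means itself clusters on plain Euclidean distance in degrees. The customer coordinates, the helper `haversine_km` and the Earth radius of 6371 km are assumptions for illustration.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (approximation)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Centroids from the Final Output table, as [latitude, longitude]
dc_locations = np.array([[27.68, 80.90],
                         [27.42, 81.15],
                         [27.31, 80.83],
                         [27.56, 80.57]])

# Hypothetical new customer (latitude, longitude)
customer = (27.50, 81.10)

# Distance from the customer to every DC, then pick the closest
distances = haversine_km(customer[0], customer[1],
                         dc_locations[:, 0], dc_locations[:, 1])
nearest_dc = int(np.argmin(distances))
print(nearest_dc, round(float(distances[nearest_dc]), 1))
```

Over an area this small the nearest DC by haversine distance will usually match the K-means assignment, but the kilometre figure is easier to relate to transport cost.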