K-means Clustering: An Interactive Textbook

This interactive textbook explores K-means clustering with theory, mathematics, and live demonstrations. For questions, further information, or to report errors or typos, please contact me at riccardo_colletti@berkeley.edu.

Lecture Notes — Lecture 5: CS289A: Introduction to Machine Learning (Academic Year 2025/2026), UC Berkeley.

K-means Clustering

Theory, Mathematics, and Analysis

Interactive Visualization

[Interactive demo: live status readout showing the current iteration, WCSS, BCSS, TSS, and run state.]

Introduction: The Clustering Problem

In the vast landscape of machine learning, few problems are as fundamental as that of finding structure in unlabeled data. The K-means algorithm, developed by Stuart Lloyd in 1957 (though not published until 1982), represents one of the most elegant solutions to the clustering problem—a problem that asks us to partition a dataset into meaningful groups without any prior knowledge of group membership.

At its core, K-means is an algorithm of remarkable simplicity. Given a set of observations, it seeks to partition them into K clusters such that each observation belongs to the cluster with the nearest mean. This mean serves as a prototype for the cluster, and the algorithm iteratively refines these prototypes until a stable configuration emerges.

What makes K-means particularly appealing is not just its conceptual elegance, but its efficiency. Despite the combinatorial nature of the clustering problem—there are exponentially many ways to partition N points into K groups—K-means converges to a solution in polynomial time. However, as we shall see in later chapters, this efficiency comes at a cost: the algorithm is sensitive to initialization and can converge to suboptimal local minima.

Formal Definition: K-means clustering is an unsupervised learning algorithm that partitions a dataset \(\mathcal{X} = \{x_1, x_2, \ldots, x_n\}\) into \(K\) disjoint clusters \(\mathcal{C} = \{C_1, C_2, \ldots, C_K\}\), where each cluster is represented by its centroid (mean) \(\mu_k\), with the objective of minimizing the within-cluster sum of squares.

The algorithm is characterized by four essential properties that distinguish it from other clustering approaches. First, it is unsupervised—no labeled training examples are required. Second, it is partitional—each data point belongs to exactly one cluster. Third, it is iterative—the solution improves progressively through successive refinements. Fourth, it is centroid-based—each cluster is represented by the arithmetic mean of its constituent points.

Mathematical Foundations

The Objective Function

The mathematical heart of K-means lies in its objective function. To understand the algorithm’s behavior, we must first understand what it is trying to optimize. The goal is to minimize the within-cluster sum of squares (WCSS), also known as the inertia or distortion.

Let us denote our dataset as \(\mathcal{X} = \{x_1, x_2, \ldots, x_n\}\) where each \(x_i \in \mathbb{R}^d\). We seek a partition into \(K\) clusters \(C_1, C_2, \ldots, C_K\), where each cluster \(C_k\) has a centroid \(\mu_k\). The within-cluster sum of squares is formally defined as:

$$\text{WCSS} = \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \mu_k\|^2$$

This expression captures the total squared distance between each point and its assigned cluster centroid. By minimizing this quantity, we ensure that points within each cluster are as close to their centroid as possible—in other words, the clusters are “tight” or “compact.”
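The WCSS above can be computed directly from the data, the assignments, and the centroids. Below is a minimal NumPy sketch (the helper name `wcss` and the toy data are ours, chosen purely for illustration):

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-cluster sum of squares: total squared distance
    from each point to the centroid of its assigned cluster."""
    return float(np.sum((X - centroids[labels]) ** 2))

# Toy example: four points, two clusters, assignments fixed by hand.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])
print(wcss(X, labels, centroids))  # each point is distance 1 from its centroid -> 4.0
```

The indexing trick `centroids[labels]` expands the centroid array so that row \(i\) holds the centroid assigned to point \(i\), which lets the whole sum be written without an explicit loop.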

The Variance Decomposition Identity

One of the most illuminating results in K-means theory is the decomposition of total variance. This identity reveals a deep connection between within-cluster and between-cluster variation, and explains why minimizing WCSS is equivalent to maximizing cluster separation.

Let \(\bar{x}\) denote the global mean of all data points: \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\). The total sum of squares (TSS) measures the total variance in the dataset:

$$\text{TSS} = \sum_{i=1}^{n} \|x_i - \bar{x}\|^2$$

Similarly, the between-cluster sum of squares (BCSS) quantifies how far the cluster centroids are from the global mean:

$$\text{BCSS} = \sum_{k=1}^{K} n_k \|\mu_k - \bar{x}\|^2$$

where \(n_k = |C_k|\) is the number of points in cluster \(k\).

The fundamental decomposition states that:

Variance Decomposition Identity:

$$\text{TSS} = \text{WCSS} + \text{BCSS}$$

This identity has profound implications. Since TSS is constant for any given dataset (it depends only on the data, not on the clustering), minimizing WCSS is mathematically equivalent to maximizing BCSS. In other words, creating tight clusters automatically means creating well-separated clusters. This dual perspective—minimizing within-cluster variance versus maximizing between-cluster variance—provides two complementary ways of thinking about the clustering objective.
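The identity can be verified numerically for any assignment of points to clusters, as long as each centroid is the mean of its cluster. A short NumPy check (random data and random labels, purely illustrative):

```python
import numpy as np

# Arbitrary data and an arbitrary assignment into K = 3 clusters.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
labels = rng.integers(0, 3, size=100)

global_mean = X.mean(axis=0)
tss = float(np.sum((X - global_mean) ** 2))

wcss = 0.0
bcss = 0.0
for k in range(3):
    Ck = X[labels == k]
    mu_k = Ck.mean(axis=0)  # the centroid must be the cluster mean
    wcss += float(np.sum((Ck - mu_k) ** 2))
    bcss += len(Ck) * float(np.sum((mu_k - global_mean) ** 2))

print(np.isclose(tss, wcss + bcss))  # True
```

The cross terms vanish because \(\sum_{i \in C_k}(x_i - \mu_k) = 0\) by definition of the mean, which is exactly why the decomposition holds for every partition, not just the optimal one.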

Lloyd’s Algorithm

The Iterative Procedure

The K-means algorithm, in its most common form, is known as Lloyd’s algorithm. It is an iterative procedure that alternates between two steps: assigning points to their nearest centroid, and updating centroids based on current assignments. This alternation guarantees monotonic improvement in the objective function.

The algorithm begins with an initialization of \(K\) centroids, which can be chosen randomly or through more sophisticated methods (we shall discuss K-means++ initialization in Chapter 6). Then it proceeds as follows:

E-Step (Assignment Step): Assign each point to its nearest centroid:

$$c_i \leftarrow \arg\min_{k \in \{1,\ldots,K\}} \|x_i - \mu_k\|^2$$

M-Step (Update Step): Recompute each centroid as the mean of its assigned points:

$$\mu_k \leftarrow \frac{1}{n_k} \sum_{i \in C_k} x_i$$

These two steps are repeated until convergence, which occurs when assignments no longer change (or equivalently, when centroids no longer move). The nomenclature “E-step” and “M-step” comes from the algorithm’s connection to the Expectation-Maximization (EM) framework, though K-means actually predates the general EM algorithm.
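The two alternating steps can be sketched in a few lines of NumPy. This is a minimal illustrative implementation, not a production one (no tolerance parameter, naive handling of empty clusters, random initialization rather than K-means++):

```python
import numpy as np

def lloyd(X, K, max_iter=100, seed=0):
    """Minimal Lloyd's algorithm: random data points as initial centroids,
    then alternate the assignment (E) and update (M) steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # E-step: squared distance from every point to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # M-step: each centroid moves to the mean of its assigned points
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Usage: two well-separated blobs are recovered as two clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])
labels, centroids = lloyd(X, K=2)
```

The broadcasted subtraction `X[:, None, :] - centroids[None, :, :]` builds an \((n, K, d)\) array of differences, so the assignment step runs in the \(O(nKd)\) time analyzed later without any Python-level loops over points.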

Why Lloyd’s Algorithm Works

The elegance of Lloyd’s algorithm lies in the fact that each step is guaranteed to decrease (or maintain) the WCSS objective. During the assignment step, we choose the cluster assignment that minimizes the distance to centroids for each point individually. This clearly cannot increase WCSS. During the update step, we compute the mean of each cluster—and it is a well-known result that the mean is the point that minimizes the sum of squared distances to a set of points. Thus, this step also cannot increase WCSS.

Since WCSS is bounded below by zero and decreases monotonically, the algorithm must converge. However—and this is crucial—it converges to a local minimum, not necessarily a global one. The final clustering depends heavily on the initial placement of centroids.

Convergence Theory

Monotonic Decrease and Finite Convergence

We now present the formal convergence theorem for Lloyd’s algorithm. This result guarantees that the algorithm will terminate in finite time with a stable clustering.

Theorem (Convergence of Lloyd’s Algorithm): The within-cluster sum of squares (WCSS) decreases monotonically at each iteration of Lloyd’s algorithm, and the algorithm converges to a local minimum in a finite number of steps.

Proof sketch: There are only finitely many ways to partition \(n\) points into \(K\) clusters (at most \(K^n\)). Each iteration either strictly decreases WCSS or leaves the partition unchanged, in which case the algorithm has converged. Because WCSS strictly decreases otherwise, no partition can ever be revisited, so the algorithm must terminate in finitely many steps.

While this theorem guarantees convergence, it says nothing about the quality of the solution. In practice, K-means often converges to a local minimum that is far from the global optimum, especially with poor initialization.

Computational Complexity

Time Complexity Analysis

Understanding the computational cost of K-means is essential for practical applications, especially when dealing with large datasets. Let us analyze the complexity of each step.

In the assignment step, we must compute the distance from each of the \(n\) points to each of the \(K\) centroids. For data in \(d\) dimensions, each distance computation requires \(O(d)\) operations. Thus, the assignment step has complexity \(O(nKd)\).

In the update step, we must compute the mean of each cluster. This requires summing all points in each cluster and dividing by the cluster size, which takes \(O(nd)\) operations in total.

If the algorithm converges in \(I\) iterations, the total time complexity is:

$$\mathcal{O}(I \cdot n \cdot K \cdot d)$$

In practice, \(I\) is often small (typically less than 100 iterations), and \(K\) and \(d\) are also usually modest. This makes K-means remarkably efficient, even for datasets with millions of points. This efficiency, combined with its simplicity, explains the algorithm’s enduring popularity.

K-means++ Initialization

The Problem with Random Initialization

As we have seen, Lloyd’s algorithm is guaranteed to converge, but only to a local minimum. The quality of this local minimum depends critically on the initial placement of centroids. Random initialization—choosing \(K\) points uniformly at random—often leads to poor results, especially when clusters are of different sizes or densities.

The K-means++ algorithm, introduced by Arthur and Vassilvitskii in 2007, provides a principled approach to initialization that balances computational efficiency with solution quality. The key idea is to choose initial centroids that are far apart from each other, thereby increasing the probability of capturing the true cluster structure.

The K-means++ Algorithm

The initialization procedure is as follows:

K-means++ Initialization:

  1. Choose the first centroid uniformly at random from the data points
  2. For each point \(x\), compute \(D(x)\), its distance to the nearest centroid already chosen
  3. Choose the next centroid with probability proportional to \(D(x)^2\)
  4. Repeat steps 2-3 until \(K\) centroids have been chosen

The squared distance weighting in step 3 is crucial: points that are far from existing centroids have a higher probability of being selected. This naturally spreads out the initial centroids.
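The four steps can be written down directly. A minimal NumPy sketch (the function name `kmeans_pp_init` is ours):

```python
import numpy as np

def kmeans_pp_init(X, K, rng=None):
    """K-means++ seeding: first centroid uniform at random, each subsequent
    centroid drawn with probability proportional to D(x)^2."""
    rng = np.random.default_rng() if rng is None else rng
    centroids = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # D(x)^2: squared distance to the nearest centroid chosen so far
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Usage: three seeds drawn from standard-normal toy data.
X = np.random.default_rng(0).normal(size=(60, 2))
C = kmeans_pp_init(X, K=3, rng=np.random.default_rng(42))
```

Note that a point already chosen as a centroid has \(D(x)^2 = 0\) and therefore zero probability of being selected again, so the seeds are automatically distinct (assuming no duplicate points in the data).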

Theoretical Guarantee

The remarkable property of K-means++ is that it comes with a provable approximation guarantee:

Theorem (Arthur & Vassilvitskii, 2007): The expected cost of the clustering produced by K-means++ is at most \(O(\log K)\) times the optimal clustering cost:

$$\mathbb{E}[\text{WCSS}_{\text{K-means++}}] \leq 8(\ln K + 2) \cdot \text{OPT}$$

This means that K-means++ provides a good starting point in expectation. In practice, running K-means with K-means++ initialization multiple times (10-50 runs) and selecting the clustering with the lowest WCSS often yields excellent results.

Pathological Cases: When K-means Fails

Despite its elegance and widespread use, K-means has fundamental limitations that cannot be overcome through better implementation or parameter tuning. In this chapter, we explore three essential failure modes that reveal the algorithm’s theoretical constraints. Understanding these limitations is crucial for knowing when to apply K-means and when to consider alternative clustering methods.

Case 1: Non-Convex and Complex Geometries

K-means makes a fundamental geometric assumption: clusters should be roughly spherical and convex. This limitation stems directly from how the algorithm partitions space.

Theoretical Problem: The algorithm partitions space using Voronoi cells—each point is assigned to its nearest centroid. These Voronoi cells are always convex regions bounded by hyperplanes (the perpendicular bisectors between centroids). Therefore, K-means cannot represent non-convex cluster shapes. When confronted with concentric rings, half-moons, spirals, or other complex geometries, K-means partitions them into convex “pieces” that bear no relation to the true structure. The algorithm is geometrically incapable of discovering curved, elongated, or interleaved patterns—it can only find regions separated by straight boundaries.

🔬 Experiment 1a: Concentric Rings

Scenario: Two ring-shaped clusters—an inner ring and an outer ring forming concentric circles. The true structure is based on distance from center, not on any linear separation.

Expected behavior: K-means cannot represent circular structures. Instead, it partitions the rings into radial “pie slices” defined by straight lines emanating from the centroids. The resulting clusters cut across both rings, completely missing their circular geometry.


Key Observation: The clustering bears no relation to the ring structure. Points on opposite sides of each ring are grouped together, while points on the same ring are split apart.

🔬 Experiment 1b: Two Half-Moons

Scenario: Two crescent-shaped clusters that “embrace” each other. This is a classic benchmark for testing clustering algorithms’ ability to handle non-linear separations.

Expected behavior: K-means draws a straight decision boundary between centroids, cutting through both crescents rather than separating them. The convex hull assumption forces linear separation.


Key Observation: The straight decision boundary misclassifies many points. Each resulting cluster contains parts of both moons. This is a fundamental limitation—K-means simply cannot discover non-linear boundaries.

Why this happens: K-means uses Euclidean distance and computes cluster centers as arithmetic means. This inherently assumes that “closeness” in Euclidean space corresponds to cluster membership. For curved or complex structures, this assumption breaks down. Algorithms like DBSCAN (density-based) or spectral clustering (which can capture manifold structure) are more appropriate for such data.

Case 2: Sensitivity to Outliers

K-means minimizes the sum of squared distances, which makes it highly sensitive to outliers. Because the cluster center is computed as the arithmetic mean, a single distant point can have disproportionate influence.

Theoretical Problem: The arithmetic mean minimizes squared deviations, but this also means that outliers—which have large squared distances—exert enormous pull on cluster centers. When an outlier is assigned to a cluster, it shifts that cluster’s centroid away from the true center of mass of the main points. This distortion affects the Voronoi cell boundaries, potentially causing misclassification of points near the boundary. The effect can cascade: as the centroid moves, the decision boundary shifts, causing points that should belong to one cluster to be assigned to another. Since K-means uses \(L_2\) distance (squared), the influence grows quadratically with distance—a point twice as far has four times the impact.

🔬 Experiment 2: Outliers Distort Cluster Centers

Scenario: Two well-separated, compact clusters with several outlier points scattered randomly in the space. The outliers don’t belong to any natural cluster but must be assigned somewhere.

Expected behavior: Outliers pull centroids away from the true cluster centers. The shifted centroids lead to distorted decision boundaries, potentially misclassifying legitimate cluster members near the boundary.


Key Observation: Watch how the centroids shift toward outliers, distorting the natural cluster structure. The clusters appear “pulled” toward the noise points, and the decision boundary between clusters becomes skewed.

Why this happens: The mean is not a robust statistic—it’s highly influenced by extreme values. Alternative approaches include: (1) K-medoids, which uses actual data points as centers (the medoid is the point minimizing sum of distances, not squared distances); (2) preprocessing to remove outliers; (3) using robust distance metrics; or (4) trimmed K-means variants that iteratively remove points with highest contribution to WCSS.
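The non-robustness of the mean is easy to see numerically. A one-dimensional toy illustration (data of our own choosing):

```python
import numpy as np

pts = np.array([1.0, 2.0, 3.0, 100.0])  # three nearby points plus one extreme outlier

print(pts.mean())      # 26.5 -- the outlier drags the mean far from the bulk
print(np.median(pts))  # 2.5  -- the median barely moves
```

This is why medoid-based or trimmed variants, which avoid the squared-error mean, degrade far more gracefully in the presence of noise points.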

Case 3: Poor Initialization with Close Centroids

K-means is a local optimization algorithm that converges to the nearest local minimum from its starting point. When multiple centroids initialize close together, they become “trapped” in the same region, leaving other parts of the data space unexplored.

Theoretical Problem: When two or more centroids start near each other in the same region, they partition that region’s points between them during the assignment step. During the update step, both centroids move to the means of their respective subsets—but both subsets are drawn from the same local area. The centroids may separate slightly, but they remain trapped in that region. Meanwhile, distant clusters have no centroid nearby to “claim” their points. These points get assigned to the nearest available centroid (which may be quite far away), but they’re too few to pull that centroid toward them against the inertia of the local majority. The algorithm settles into a local minimum where one natural cluster is artificially subdivided while other natural clusters are merged or ignored.

🔬 Experiment 3: Centroids Too Close Together

Scenario: Four natural clusters exist—two large clusters at the bottom-left and bottom-right, and two small clusters very close together at the top. With \(K=3\), we cannot capture all four structures, but a reasonable solution would separate the two large bottom clusters and merge the two small top clusters.

Poor initialization: All three centroids initialize near the top region where the two small clusters are located. Two centroids land on the small top clusters, and one centroid sits between the large bottom clusters.

Expected behavior: The centroid positioned between the bottom clusters captures both large clusters (giving them the same label—incorrect!), while the two small top clusters get separated. This is backwards: we’re splitting the small nearby clusters while merging the large distant ones.


Key Observation: With poor initialization, the two large bottom clusters (which should be separated) get merged under a single centroid, while computational resources are “wasted” splitting the two small top clusters. With good initialization, the bottom clusters are properly separated. The clustering quality depends entirely on whether centroids are spread across the data space or clustered together.

Why this happens: K-means has no global exploration mechanism—it greedily optimizes from wherever it starts. The WCSS objective function is non-convex with many local minima. Once centroids are trapped in a local region, the update rule (computing means) keeps them there. Solutions include: (1) K-means++ initialization, which probabilistically spreads initial centroids apart; (2) running K-means multiple times (10-50 runs) with different random seeds and selecting the best result (lowest WCSS); (3) hierarchical initialization methods that first roughly partition the space.

Best Practices and Recommendations

Having explored the theoretical foundations and pathological failure modes of K-means, we now turn to practical recommendations for applying the algorithm successfully. These guidelines represent accumulated wisdom from decades of experience in the machine learning community.

Practical Guidelines:

  • Use K-means++ initialization whenever possible. The logarithmic approximation guarantee significantly improves reliability.
  • Run multiple trials (10-50) with different random seeds and select the clustering with the lowest WCSS. This simple strategy dramatically reduces the impact of poor initialization.
  • Normalize your features before clustering. Features with large magnitudes will dominate the distance calculations. Standardization (zero mean, unit variance) is often appropriate.
  • Remove or downweight obvious outliers if domain knowledge suggests they are artifacts. Alternatively, consider robust variants like K-medoids.
  • Use the elbow method or silhouette analysis to choose \(K\). Plot WCSS versus \(K\) and look for an “elbow” where additional clusters provide diminishing returns.
  • Validate your clusters using external criteria when available, or internal metrics like silhouette coefficient when not.
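Several of these guidelines can be combined in one short sketch: standardize the features, run multiple seeded restarts keeping the lowest-WCSS result, and tabulate WCSS against \(K\) for an elbow inspection. This is a self-contained NumPy illustration on synthetic data (the bare-bones `kmeans` helper is ours):

```python
import numpy as np

def kmeans(X, K, seed):
    """Bare-bones Lloyd's algorithm; returns final labels and WCSS."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), K, replace=False)]
    for _ in range(100):
        d2 = ((X[:, None] - C[None]) ** 2).sum(-1)
        labels = d2.argmin(1)
        newC = np.array([X[labels == k].mean(0) if (labels == k).any() else C[k]
                         for k in range(K)])
        if np.allclose(newC, C):
            break
        C = newC
    return labels, float(((X - C[labels]) ** 2).sum())

# Synthetic data: three separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 1.0, (40, 2)) for m in (0, 8, 16)])
X = (X - X.mean(0)) / X.std(0)  # standardize: zero mean, unit variance

# Multiple restarts: keep the run with the lowest WCSS.
best = min((kmeans(X, 3, seed=s) for s in range(10)), key=lambda r: r[1])
print(f"best WCSS over 10 runs: {best[1]:.3f}")

# Elbow method: WCSS vs. K (look for the bend where gains flatten out).
for K in range(1, 6):
    w = min(kmeans(X, K, seed=s)[1] for s in range(10))
    print(K, round(w, 2))
```

On data like this the printed table drops sharply up to \(K = 3\) and flattens afterward, which is exactly the "elbow" the guideline describes.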

Alternative Clustering Algorithms

K-means is not a universal solution. When its assumptions are violated—non-spherical clusters, presence of outliers, unknown number of clusters—other algorithms may be more appropriate. We briefly survey several important alternatives.

K-medoids (PAM)

K-medoids replaces the mean with the medoid—the point in each cluster that minimizes total distance to all other points. By using actual data points as cluster representatives, K-medoids is more robust to outliers. The price is increased computational cost: \(O(K(n-K)^2)\) per iteration.
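The medoid computation for a single cluster can be sketched as follows (the helper `medoid` is illustrative only; the full PAM algorithm also includes a swap phase, not shown here):

```python
import numpy as np

def medoid(C):
    """Return the member of C with the smallest total distance to all others."""
    D = np.linalg.norm(C[:, None] - C[None], axis=-1)  # pairwise distance matrix
    return C[D.sum(axis=1).argmin()]

C = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])  # one point far to the right

print(medoid(C))   # [2. 0.] -- an actual data point near the bulk
print(C.mean(0))   # [4. 0.] -- the mean, pulled toward the distant point
```

Because the medoid minimizes a sum of plain (not squared) distances and must be an actual data point, it stays anchored in the dense part of the cluster while the mean drifts toward extremes.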

DBSCAN

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) can discover clusters of arbitrary shape and automatically identifies outliers. It requires two parameters: \(\epsilon\) (neighborhood radius) and \(\text{minPts}\) (minimum points to form a dense region). DBSCAN is particularly effective for spatial data with varying densities.

Gaussian Mixture Models (GMM)

GMMs generalize K-means by modeling each cluster as a Gaussian distribution with its own covariance matrix. This allows for elliptical clusters with different orientations and shapes. The Expectation-Maximization (EM) algorithm is used for inference. GMMs also provide soft cluster assignments (probabilities) rather than hard assignments.

Hierarchical Clustering

Hierarchical methods build a tree of clusters (dendrogram) that can be cut at different levels to obtain different numbers of clusters. Agglomerative (bottom-up) approaches are most common. These methods are particularly useful when the data has nested structure or when you want to explore clusterings at multiple scales.