Table of Contents
- Chapter I: Introduction: What are Gaussian Mixture Models?
- Chapter II: Decision Boundaries in GMM: The General Case
- Chapter III: Case 1: Isotropic Covariances — Recovering K-Means
- Chapter IV: Case 2: Equal Isotropic Covariances
- Chapter V: Case 3: Equal Non-Isotropic Covariances
- Chapter VI: Case 4: Different Covariances — QDA
- Chapter VII: Connection to Logistic Regression
- Chapter VIII: The Complete Picture: A Unified View
- Chapter IX: Conclusion
Introduction: What are Gaussian Mixture Models?
The Machine Learning Context
In machine learning, we often encounter two fundamental types of models: discriminative and generative models. Discriminative models (like logistic regression or support vector machines) directly learn the decision boundary between classes—they answer the question "Given the features \(\mathbf{x}\), what is the probability of class \(y\)?" Formally, they model \(P(y \mid \mathbf{x})\).
Generative models, on the other hand, take a fundamentally different approach. They model the joint distribution \(P(\mathbf{x}, y)\) by learning how the data is generated. A generative model answers: "What is the probability of observing these features \(\mathbf{x}\) together with class \(y\)?" By modeling this joint distribution, we can then use Bayes' rule to perform classification:
$$P(y \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y)\,P(y)}{P(\mathbf{x})} = \frac{P(\mathbf{x} \mid y)\,P(y)}{\sum_{y'} P(\mathbf{x} \mid y')\,P(y')}$$
Gaussian Mixture Models (GMMs) are a powerful class of generative models. The fundamental assumption is that our data is generated from a mixture of several Gaussian (normal) distributions. Each Gaussian represents a different "cluster" or "component" in the data.
The Core Assumption
Let us consider a dataset in which each point is a vector \(\mathbf{x} = (x_1, x_2)^\top \in \mathbb{R}^2\). The GMM assumes that each data point is generated through the following process:
- First, we randomly select one of \(K\) components (clusters) according to probabilities \(\pi_1, \pi_2, \ldots, \pi_K\) where \(\sum_{k=1}^K \pi_k = 1\).
- Then, given that we selected component \(k\), we sample the point from a Gaussian distribution \(\mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\) with mean \(\boldsymbol{\mu}_k\) and covariance matrix \(\boldsymbol{\Sigma}_k\).
Mathematically, if we introduce a latent variable \(z \in \{1, 2, \ldots, K\}\) that indicates which component generated the data point, we have:
$$P(z = k) = \pi_k, \qquad p(\mathbf{x} \mid z = k) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
Since we don't observe \(z\) (it's latent/hidden), we marginalize over it to get the probability of observing \(\mathbf{x}\):
$$p(\mathbf{x}) = \sum_{k=1}^K \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
This is the fundamental equation of the Gaussian Mixture Model. The Gaussian density is:
$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
where \(d\) is the dimension of \(\mathbf{x}\).
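To make this concrete, here is a minimal NumPy sketch that evaluates the mixture density for a two-component model. The function names and the toy parameters are illustrative choices for this guide, not part of any particular library.

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def gmm_density(x, pis, mus, Sigmas):
    """Mixture density p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi * gaussian_density(x, mu, S) for pi, mu, S in zip(pis, mus, Sigmas))

# Toy two-component mixture in R^2 (hypothetical parameters).
pis = [0.6, 0.4]
mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 2.0]])]
print(gmm_density(np.array([1.0, 0.5]), pis, mus, Sigmas))
```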
Why GMMs Matter: A Unifying Framework
The beauty of GMMs lies in their generality. By imposing different constraints on the covariance matrices \(\boldsymbol{\Sigma}_k\) and examining limiting behaviors, we recover many classical machine learning algorithms as special cases:
Special Cases of GMM:
- K-Means Clustering: GMM with isotropic covariances \(\boldsymbol{\Sigma}_k = \sigma^2 \mathbf{I}\) where \(\sigma^2 \to 0\)
- Linear Discriminant Analysis (LDA): GMM with equal covariances across all components \(\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \cdots = \boldsymbol{\Sigma}_K\)
- Logistic Regression: Two-class LDA viewed discriminatively
- Quadratic Discriminant Analysis (QDA): GMM with different covariances for each component
In this guide, we will systematically explore these connections by starting from the general GMM formula and deriving what happens under different assumptions.
Decision Boundaries in GMM: The General Case
Before examining special cases, let us understand how GMM makes decisions. In a classification context with \(K\) classes, we assign a point \(\mathbf{x}\) to the class with the highest posterior probability:
$$\hat{z} = \arg\max_k \; P(z = k \mid \mathbf{x})$$
Using Bayes' rule:
$$P(z = k \mid \mathbf{x}) = \frac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^K \pi_j \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$
For binary classification (\(K=2\)), the decision boundary is the set of points where \(P(z = 1 \mid \mathbf{x}) = P(z = 2 \mid \mathbf{x})\), which occurs when:
$$\pi_1 \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) = \pi_2 \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$$
Taking logarithms (which is monotonic and preserves the equality):
Derivation of the General Decision Boundary:
$$\log \pi_1 + \log \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) = \log \pi_2 + \log \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$$
Expanding the Gaussian densities:
$$\log \pi_1 - \frac{1}{2}\log|\boldsymbol{\Sigma}_1| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)^\top \boldsymbol{\Sigma}_1^{-1}(\mathbf{x} - \boldsymbol{\mu}_1)$$ $$= \log \pi_2 - \frac{1}{2}\log|\boldsymbol{\Sigma}_2| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_2)^\top \boldsymbol{\Sigma}_2^{-1}(\mathbf{x} - \boldsymbol{\mu}_2)$$
Rearranging:
$$\mathbf{x}^\top \boldsymbol{\Sigma}_1^{-1}\mathbf{x} - \mathbf{x}^\top \boldsymbol{\Sigma}_2^{-1}\mathbf{x} - 2\boldsymbol{\mu}_1^\top \boldsymbol{\Sigma}_1^{-1}\mathbf{x} + 2\boldsymbol{\mu}_2^\top \boldsymbol{\Sigma}_2^{-1}\mathbf{x} + \text{const} = 0$$
This is a quadratic equation in \(\mathbf{x}\). The decision boundary is, in general, a conic section (ellipse, parabola, or hyperbola). The key insight is that the shape of the boundary depends entirely on the covariance matrices.
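The derivation above is easy to check numerically. The sketch below (plain NumPy, hypothetical parameters) evaluates the two log posteriors up to their shared constant and classifies a point by comparing them; the decision boundary is wherever the difference of the two scores is zero.

```python
import numpy as np

def log_discriminant(x, pi, mu, Sigma):
    """log pi_k + log N(x | mu_k, Sigma_k), dropping the shared -d/2 * log(2*pi) term."""
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    _, logdet = np.linalg.slogdet(Sigma)
    return np.log(pi) - 0.5 * logdet - 0.5 * quad

def classify(x, params):
    """Assign x to the component with the largest posterior (shared constants cancel)."""
    scores = [log_discriminant(x, *p) for p in params]
    return int(np.argmax(scores))

# Hypothetical two-component setup; the boundary is where the two scores are equal.
params = [
    (0.5, np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]])),
    (0.5, np.array([3.0, 0.0]), np.array([[2.0, 0.8], [0.8, 1.0]])),
]
x = np.array([1.5, 0.2])
print(classify(x, params), log_discriminant(x, *params[0]) - log_discriminant(x, *params[1]))
```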
Case 1: Isotropic Covariances with \(\sigma^2 \to 0\) — Recovering K-Means
The Setup
Let us now examine our first special case. Suppose all components have isotropic (spherical) covariances with the same variance:
$$\boldsymbol{\Sigma}_k = \sigma^2 \mathbf{I} \quad \text{for all } k$$
where \(\mathbf{I}\) is the identity matrix. This means each cluster is circular (in 2D) or spherical (in higher dimensions) with the same "spread" \(\sigma^2\).
What Happens to the Posterior?
With this constraint, the posterior probability for component \(k\) takes a particularly simple form.
Derivation:
The Gaussian density simplifies to:
$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \sigma^2\mathbf{I}) = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_k\|^2}{2\sigma^2}\right)$$
Therefore, the posterior is:
$$P(z = k \mid \mathbf{x}) = \frac{\pi_k \exp\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_k\|^2}{2\sigma^2}\right)}{\sum_{j=1}^K \pi_j \exp\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_j\|^2}{2\sigma^2}\right)}$$
The Limit \(\sigma^2 \to 0\)
Now comes the crucial insight. What happens as \(\sigma^2 \to 0\)? The exponential terms become extremely sharp. Write \(d_k = \|\mathbf{x} - \boldsymbol{\mu}_k\|^2\) and let \(k^* = \arg\min_j d_j\) denote the closest component. Dividing the numerator and denominator of the posterior by \(\exp\left(-\frac{d_{k^*}}{2\sigma^2}\right)\), each term becomes \(\pi_j \exp\left(-\frac{d_j - d_{k^*}}{2\sigma^2}\right)\):
- For the closest component \(j = k^*\), the exponent is zero, so the term stays at \(\pi_{k^*}\)
- For every other component, \(d_j - d_{k^*} > 0\), so the term \(\exp\left(-\frac{d_j - d_{k^*}}{2\sigma^2}\right) \to 0\) exponentially fast
Key Insight: As \(\sigma^2 \to 0\), the posterior probability becomes:
$$P(z = k \mid \mathbf{x}) \to \begin{cases} 1 & \text{if } k = \arg\min_j \|\mathbf{x} - \boldsymbol{\mu}_j\|^2 \\ 0 & \text{otherwise} \end{cases}$$
This is exactly the K-Means assignment rule: assign each point to the nearest centroid!
From Soft to Hard Assignment
GMM normally performs soft assignment—each point has some probability of belonging to each cluster. But as \(\sigma^2 \to 0\), we transition to hard assignment—each point belongs definitively to exactly one cluster. K-Means is the limiting case of GMM where this assignment becomes deterministic.
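The following sketch illustrates this limit numerically: the posterior of an isotropic GMM is a softmax of scaled negative squared distances, and shrinking \(\sigma^2\) drives it toward a one-hot, K-Means-style assignment. The centroids and the query point are arbitrary toy values.

```python
import numpy as np

def isotropic_posteriors(x, mus, sigma2, pis=None):
    """Posterior P(z=k | x) for an isotropic GMM: a softmax of -||x - mu_k||^2 / (2 sigma^2)."""
    mus = np.asarray(mus)
    if pis is None:
        pis = np.ones(len(mus)) / len(mus)
    sq_dists = np.sum((mus - x) ** 2, axis=1)
    logits = np.log(pis) - sq_dists / (2 * sigma2)
    logits -= logits.max()                      # stabilize before exponentiating
    weights = np.exp(logits)
    return weights / weights.sum()

mus = [[0.0, 0.0], [4.0, 0.0]]
x = np.array([1.5, 0.0])                        # closer to the first centroid
for sigma2 in [10.0, 1.0, 0.1, 0.01]:
    print(sigma2, isotropic_posteriors(x, mus, sigma2))   # posterior approaches [1, 0]
```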
Case 2: Equal Isotropic Covariances (Finite Variance)
Maintaining Isotropy with Finite Variance
What if we keep the spherical assumption \(\boldsymbol{\Sigma}_k = \sigma^2 \mathbf{I}\) but don't let \(\sigma^2 \to 0\)? Let us derive the decision boundary.
Derivation of the Decision Boundary:
For equal priors (\(\pi_1 = \pi_2\)), the boundary occurs where:
$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \sigma^2\mathbf{I}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \sigma^2\mathbf{I})$$
Taking logs and using the fact that \(|\sigma^2\mathbf{I}| = (\sigma^2)^d\) cancels out:
$$-\frac{\|\mathbf{x} - \boldsymbol{\mu}_1\|^2}{2\sigma^2} = -\frac{\|\mathbf{x} - \boldsymbol{\mu}_2\|^2}{2\sigma^2}$$
Multiplying by \(-2\sigma^2\) and expanding:
$$\|\mathbf{x}\|^2 - 2\boldsymbol{\mu}_1^\top\mathbf{x} + \|\boldsymbol{\mu}_1\|^2 = \|\mathbf{x}\|^2 - 2\boldsymbol{\mu}_2^\top\mathbf{x} + \|\boldsymbol{\mu}_2\|^2$$
The \(\|\mathbf{x}\|^2\) terms cancel! We get:
$$2(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^\top\mathbf{x} = \|\boldsymbol{\mu}_1\|^2 - \|\boldsymbol{\mu}_2\|^2$$
Result: With equal isotropic covariances, the decision boundary is a hyperplane (a line in 2D). Specifically, it is the perpendicular bisector of the line segment connecting \(\boldsymbol{\mu}_1\) and \(\boldsymbol{\mu}_2\). This is exactly the same boundary as K-Means!
However, unlike K-Means, points near the boundary have uncertain assignments. The width of the uncertainty region is controlled by \(\sigma^2\).
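Here is a small sketch, using made-up means, that constructs this hyperplane and verifies the two geometric facts just stated: the midpoint of the two means lies on the boundary, and the boundary's normal is parallel to \(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\).

```python
import numpy as np

def bisector_boundary(mu1, mu2):
    """Boundary for equal isotropic covariances: 2(mu1 - mu2)^T x = ||mu1||^2 - ||mu2||^2."""
    w = 2 * (mu1 - mu2)                         # normal vector of the hyperplane
    c = mu1 @ mu1 - mu2 @ mu2                   # offset
    return w, c

mu1, mu2 = np.array([0.0, 0.0]), np.array([4.0, 2.0])
w, c = bisector_boundary(mu1, mu2)

midpoint = 0.5 * (mu1 + mu2)
v = mu1 - mu2
cross = w[0] * v[1] - w[1] * v[0]               # 2-D cross product: zero iff w is parallel to v
print(np.isclose(w @ midpoint, c))              # True: the midpoint lies on the boundary
print(np.isclose(cross, 0.0))                   # True: the normal is parallel to mu1 - mu2
```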
Interactive Visualization: Isotropic Gaussian Mixture Model
In an isotropic Gaussian mixture, both clusters share the same spherical covariance: \(\Sigma_1 = \Sigma_2 = \sigma^2 I\). This creates perfect circular symmetry around each cluster center, and the decision boundary becomes the Euclidean bisector — a straight line perpendicular to the line connecting the two means.
Mathematical Foundation: With isotropic covariances \(\Sigma_1 = \Sigma_2 = \sigma^2 I\), the decision boundary satisfies:
\[\|\mathbf{x}-\mu_1\|^2 = \|\mathbf{x}-\mu_2\|^2\]
Expanding this equation yields:
\[(\mu_1-\mu_2)^\top \mathbf{x} = \frac{1}{2}(\|\mu_1\|^2 - \|\mu_2\|^2)\]
This is the equation of a hyperplane — the perpendicular bisector of the segment connecting \(\mu_1\) and \(\mu_2\). The boundary is linear, orthogonal to the vector \((\mu_1-\mu_2)\), and passes through the midpoint.
Case 3: Equal Non-Isotropic Covariances — Enter the Mahalanobis Distance
Breaking Isotropy
Now let us relax the isotropy constraint while maintaining equality: all components share the same covariance matrix \(\boldsymbol{\Sigma}\), but this matrix is no longer a multiple of the identity. This means our clusters can be elliptical rather than circular, and they can be oriented in any direction.
Deriving the Decision Boundary
Derivation:
The boundary occurs where (assuming equal priors):
$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})$$
Taking logs:
$$-\frac{1}{2}\log|\boldsymbol{\Sigma}| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_1)$$ $$= -\frac{1}{2}\log|\boldsymbol{\Sigma}| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_2)^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_2)$$
The \(\log|\boldsymbol{\Sigma}|\) terms cancel. Expanding the quadratic forms:
$$\mathbf{x}^\top\boldsymbol{\Sigma}^{-1}\mathbf{x} - 2\boldsymbol{\mu}_1^\top\boldsymbol{\Sigma}^{-1}\mathbf{x} + \boldsymbol{\mu}_1^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1$$ $$= \mathbf{x}^\top\boldsymbol{\Sigma}^{-1}\mathbf{x} - 2\boldsymbol{\mu}_2^\top\boldsymbol{\Sigma}^{-1}\mathbf{x} + \boldsymbol{\mu}_2^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2$$
Crucial observation: The quadratic terms \(\mathbf{x}^\top\boldsymbol{\Sigma}^{-1}\mathbf{x}\) cancel out! We're left with:
$$2(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^\top\boldsymbol{\Sigma}^{-1}\mathbf{x} = \boldsymbol{\mu}_1^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2$$
Result: The decision boundary is still a hyperplane (linear), but its orientation is determined by \(\boldsymbol{\Sigma}^{-1}\mathbf{v}\), where \(\mathbf{v} = \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\), not by \(\mathbf{v}\) itself!
The vector \(\boldsymbol{\Sigma}^{-1}\mathbf{v}\) defines the direction perpendicular to the decision boundary. In general this differs from the Euclidean bisector; the two coincide when \(\boldsymbol{\Sigma} = \sigma^2\mathbf{I}\) and, as we will see shortly, whenever \(\mathbf{v}\) is an eigenvector of \(\boldsymbol{\Sigma}\).
The Mahalanobis Distance
The quantity \((\mathbf{x} - \boldsymbol{\mu})^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\) is called the squared Mahalanobis distance. Unlike Euclidean distance, it accounts for:
- The different variances along different directions (some directions have more spread)
- The correlations between variables (variables might increase together)
The Mahalanobis distance essentially whitens the space: it applies a linear transformation under which the covariance becomes the identity, and then measures ordinary Euclidean distance in that transformed space.
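The whitening interpretation is easy to verify numerically. In the sketch below (toy values, plain NumPy), the squared Mahalanobis distance computed directly matches the squared Euclidean distance computed after multiplying by the inverse Cholesky factor of \(\boldsymbol{\Sigma}\).

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = x - mu
    return diff @ np.linalg.solve(Sigma, diff)

def whiten(x, Sigma):
    """Map x into a space where Sigma becomes the identity, via the Cholesky factor."""
    L = np.linalg.cholesky(Sigma)               # Sigma = L L^T
    return np.linalg.solve(L, x)                # L^{-1} x

Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
mu = np.array([1.0, -1.0])
x = np.array([2.5, 0.5])

d2 = mahalanobis_sq(x, mu, Sigma)
d2_whitened = np.sum((whiten(x, Sigma) - whiten(mu, Sigma)) ** 2)
print(np.isclose(d2, d2_whitened))              # True: Euclidean distance after whitening
```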
When Do the Boundaries Coincide?
Key Theorem: The GMM (Mahalanobis) boundary coincides with the Euclidean bisector if and only if \(\mathbf{v} = \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\) is an eigenvector of \(\boldsymbol{\Sigma}\).
Proof sketch: If \(\mathbf{v}\) is an eigenvector of \(\boldsymbol{\Sigma}\), it is also an eigenvector of \(\boldsymbol{\Sigma}^{-1}\), so \(\boldsymbol{\Sigma}^{-1}\mathbf{v} = \lambda\mathbf{v}\) for some scalar \(\lambda > 0\), and the normal to the hyperplane is parallel to \(\mathbf{v}\). Since the boundary always passes through the midpoint \(\frac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)\) when the priors are equal, a boundary whose normal is parallel to \(\mathbf{v}\) is exactly the perpendicular bisector. Conversely, if \(\boldsymbol{\Sigma}^{-1}\mathbf{v}\) is parallel to \(\mathbf{v}\), then \(\mathbf{v}\) is an eigenvector of \(\boldsymbol{\Sigma}^{-1}\), and hence of \(\boldsymbol{\Sigma}\).
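The sketch below checks this condition numerically for a diagonal (axis-aligned) covariance: when \(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\) lies along a principal axis, the boundary normal \(\boldsymbol{\Sigma}^{-1}\mathbf{v}\) stays parallel to \(\mathbf{v}\); otherwise it does not. The helper names and parameters are illustrative.

```python
import numpy as np

def lda_normal(mu1, mu2, Sigma):
    """Normal vector of the equal-covariance (LDA) boundary: Sigma^{-1} (mu1 - mu2)."""
    return np.linalg.solve(Sigma, mu1 - mu2)

def boundaries_coincide(mu1, mu2, Sigma, tol=1e-10):
    """True when Sigma^{-1} v is parallel to v = mu1 - mu2, i.e. v is an eigenvector of Sigma."""
    v = mu1 - mu2
    n = lda_normal(mu1, mu2, Sigma)
    return abs(n[0] * v[1] - n[1] * v[0]) < tol     # 2-D cross product vanishes for parallel vectors

Sigma = np.array([[2.0, 0.0], [0.0, 0.5]])          # axis-aligned elliptical covariance
print(boundaries_coincide(np.array([1.0, 0.0]), np.array([-1.0, 0.0]), Sigma))   # True: v along an eigenvector
print(boundaries_coincide(np.array([1.0, 1.0]), np.array([-1.0, -1.0]), Sigma))  # False: v is not an eigenvector
```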
Connection to Linear Discriminant Analysis
This case—GMM with equal covariances—is exactly Linear Discriminant Analysis (LDA). LDA is a classical statistical method for classification that assumes each class has a Gaussian distribution with the same covariance matrix.
Interactive Visualization: Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis assumes that both classes share the same covariance matrix \(\Sigma\), but have different means \(\mu_1\) and \(\mu_2\). This shared covariance constraint produces a linear decision boundary. The black line shows the Mahalanobis boundary, while the green dashed line shows the Euclidean bisector for comparison. The dashed gray lines show the principal axes of the covariance ellipses.
Key Observation: When covariance is non-zero, the Mahalanobis boundary (black) diverges from the Euclidean bisector (green). The ellipses show how the covariance structure affects the cluster shapes. Try adjusting the covariance to see how the decision boundary rotates!
Note: The covariance range is automatically constrained to ensure the covariance matrix remains positive semi-definite (PSD). The maximum allowed covariance is \(\sqrt{\sigma_x^2 \cdot \sigma_y^2}\).
Case 4: Different Covariances — Quadratic Discriminant Analysis
The Most General Case
Finally, let us consider the fully general case where each component has its own covariance matrix: \(\boldsymbol{\Sigma}_1 \neq \boldsymbol{\Sigma}_2\). What happens to the decision boundary?
Derivation:
The boundary equation is:
$$\log\pi_1 - \frac{1}{2}\log|\boldsymbol{\Sigma}_1| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)^\top\boldsymbol{\Sigma}_1^{-1}(\mathbf{x} - \boldsymbol{\mu}_1)$$ $$= \log\pi_2 - \frac{1}{2}\log|\boldsymbol{\Sigma}_2| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_2)^\top\boldsymbol{\Sigma}_2^{-1}(\mathbf{x} - \boldsymbol{\mu}_2)$$
Now, crucially, the quadratic terms do not cancel:
$$\mathbf{x}^\top\boldsymbol{\Sigma}_1^{-1}\mathbf{x} \neq \mathbf{x}^\top\boldsymbol{\Sigma}_2^{-1}\mathbf{x}$$
Rearranging:
$$\mathbf{x}^\top(\boldsymbol{\Sigma}_1^{-1} - \boldsymbol{\Sigma}_2^{-1})\mathbf{x} + \text{linear terms in } \mathbf{x} + \text{constant} = 0$$
Result: The decision boundary is a quadratic surface—in 2D, this is a conic section (ellipse, parabola, or hyperbola depending on the discriminant).
This is Quadratic Discriminant Analysis (QDA), the most flexible version of discriminant analysis. It can capture much more complex decision boundaries than LDA, but requires estimating more parameters.
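To make the quadratic form explicit, the boundary can be written as \(\mathbf{x}^\top \mathbf{A}\mathbf{x} + \mathbf{b}^\top\mathbf{x} + c = 0\) with \(\mathbf{A} = -\tfrac{1}{2}(\boldsymbol{\Sigma}_1^{-1} - \boldsymbol{\Sigma}_2^{-1})\). The sketch below (hypothetical parameters) computes these coefficients and inspects the eigenvalues of \(\mathbf{A}\) to get a rough idea of which conic the boundary traces out.

```python
import numpy as np

def quadratic_boundary_terms(mu1, S1, mu2, S2, pi1=0.5, pi2=0.5):
    """Coefficients of x^T A x + b^T x + c = 0, the QDA decision boundary."""
    S1_inv, S2_inv = np.linalg.inv(S1), np.linalg.inv(S2)
    A = -0.5 * (S1_inv - S2_inv)
    b = S1_inv @ mu1 - S2_inv @ mu2
    c = (np.log(pi1 / pi2)
         - 0.5 * (np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S2)[1])
         - 0.5 * (mu1 @ S1_inv @ mu1 - mu2 @ S2_inv @ mu2))
    return A, b, c

mu1, S1 = np.array([0.0, 0.0]), np.eye(2)
mu2, S2 = np.array([3.0, 0.0]), np.array([[3.0, 1.0], [1.0, 2.0]])
A, b, c = quadratic_boundary_terms(mu1, S1, mu2, S2)

eigvals = np.linalg.eigvalsh(A)
kind = "ellipse" if np.all(eigvals > 0) or np.all(eigvals < 0) else "hyperbola or degenerate"
print(A, b, c, kind, sep="\n")
```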
Interactive Visualization: Anisotropic Gaussian Mixture
This comprehensive visualization allows you to explore both Linear Discriminant Analysis (LDA) with shared covariances and Quadratic Discriminant Analysis (QDA) with different covariances. Use the checkbox to switch between models, or try the preset buttons to explore classic cases.
Explore Different Models: Use the checkbox to switch between LDA (shared covariance) and QDA (separate covariances). The preset buttons configure the visualization for classic cases:
- Isotropic GMM: Spherical clusters with equal variance
- LDA (Equal Σ): Both clusters share the same covariance matrix
- Eigenvector Case: The mean difference vector aligns with an eigenvector of Σ
- QDA (Different Σ): Each cluster has its own distinct covariance matrix
Connection to Logistic Regression
From Generative to Discriminative
We've seen that GMM with equal covariances (Case 3) gives us LDA. But there's another famous algorithm lurking here: logistic regression. How are they connected?
Consider a two-class GMM with equal covariances \(\boldsymbol{\Sigma}\). The log-odds (the log of the ratio of the posterior probabilities) works out as follows.
Derivation:
$$\log\frac{P(z=1|\mathbf{x})}{P(z=2|\mathbf{x})} = \log\frac{\pi_1 \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_1,\boldsymbol{\Sigma})}{\pi_2 \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_2,\boldsymbol{\Sigma})}$$
Expanding and using the fact that the quadratic terms cancel (as we showed in Case 3):
$$= \log\frac{\pi_1}{\pi_2} + (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^\top\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_1^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2$$
This can be written as:
$$= \mathbf{w}^\top\mathbf{x} + b$$
where \(\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)\) and \(b = \log\frac{\pi_1}{\pi_2} - \frac{1}{2}\boldsymbol{\mu}_1^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2\) is a constant.
Key Insight: The log-odds is a linear function of \(\mathbf{x}\)! It follows that \(P(z=1\mid\mathbf{x}) = \sigma(\mathbf{w}^\top\mathbf{x} + b)\), where \(\sigma(z) = \frac{1}{1+e^{-z}}\) is the sigmoid function. This is exactly the functional form of logistic regression.
The difference: LDA/GMM learns \(\mathbf{w}\) and \(b\) by estimating the class means and covariance (generative approach), while logistic regression optimizes them directly to maximize the conditional likelihood (discriminative approach).
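As a sanity check, the sketch below builds \(\mathbf{w}\) and \(b\) from made-up GMM parameters and confirms that the sigmoid of \(\mathbf{w}^\top\mathbf{x} + b\) reproduces the Bayes posterior computed directly from the two Gaussian densities.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def lda_weights(pi1, pi2, mu1, mu2, Sigma):
    """Logistic-regression form of the equal-covariance posterior: w = Sigma^{-1}(mu1 - mu2)."""
    S_inv = np.linalg.inv(Sigma)
    w = S_inv @ (mu1 - mu2)
    b = np.log(pi1 / pi2) - 0.5 * (mu1 @ S_inv @ mu1 - mu2 @ S_inv @ mu2)
    return w, b

def gaussian_density(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad) / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))

pi1, pi2 = 0.3, 0.7
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.5, 0.4], [0.4, 1.0]])
x = np.array([1.0, 0.2])

w, b = lda_weights(pi1, pi2, mu1, mu2, Sigma)
p_sigmoid = sigmoid(w @ x + b)
p_bayes = pi1 * gaussian_density(x, mu1, Sigma) / (
    pi1 * gaussian_density(x, mu1, Sigma) + pi2 * gaussian_density(x, mu2, Sigma))
print(np.isclose(p_sigmoid, p_bayes))           # True: the two posteriors agree
```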
From Soft to Hard Boundaries: The Magnitude of the Weight Vector
In logistic regression, the decision boundary is determined by \(\mathbf{w}^\top\mathbf{x} + b = 0\), and the probability is given by:
$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top\mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top\mathbf{x} + b)}}$$
The magnitude of the weight vector \(\|\mathbf{w}\|\) controls how "sharp" the transition is around the decision boundary. Consider what happens as we scale \(\mathbf{w}\):
- When \(\|\mathbf{w}\|\) is small: The sigmoid changes gradually—points far from the boundary still have significant probability for both classes (soft boundary)
- When \(\|\mathbf{w}\|\) is large: The sigmoid becomes very steep—even points slightly away from the boundary have probabilities close to 0 or 1 (hard boundary)
- As \(\|\mathbf{w}\| \to \infty\): The sigmoid approaches a step function—we get deterministic, hard classification similar to binary K-Means
This parallels exactly what we saw in GMM: decreasing \(\sigma^2\) in GMM has the same effect as increasing \(\|\mathbf{w}\|\) in logistic regression; both transition from soft to hard assignments. Indeed, for the isotropic case \(\boldsymbol{\Sigma} = \sigma^2\mathbf{I}\), the induced weight vector is \(\mathbf{w} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)/\sigma^2\), so shrinking \(\sigma^2\) literally grows \(\|\mathbf{w}\|\).
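A tiny sketch of this effect: scaling \(\mathbf{w}\) and \(b\) by a common factor \(s\) leaves the boundary \(\{\mathbf{w}^\top\mathbf{x} + b = 0\}\) unchanged but pushes the predicted probability of a fixed nearby point toward 0 or 1. All numbers here are arbitrary.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

w = np.array([1.0, -0.5])
b = -0.25
x = np.array([0.5, 0.3])                        # a point slightly on the positive side of w^T x + b = 0

# Scaling both w and b by s keeps the boundary fixed but sharpens the transition.
for s in [0.5, 2.0, 10.0, 100.0]:
    print(s, sigmoid(s * (w @ x + b)))          # probability climbs toward 1 as s grows
```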
Interactive Visualization: Soft to Hard Boundaries in Logistic Regression
This visualization demonstrates how the magnitude of the weight vector \(\|\mathbf{w}\|\) controls the transition from soft probabilistic assignments to hard deterministic boundaries in logistic regression. Use the slider to adjust the effective weight magnitude and observe how the decision boundary becomes sharper.
Observe: As the effective weight magnitude \(\|\mathbf{w}\|\) increases, the sigmoid function becomes steeper, creating a sharper transition between classes. At very high values, we approach a hard linear classifier similar to binary K-Means. The checkbox "Hard limit" immediately switches to deterministic classification.
The Complete Picture: A Unified View
The GMM Hierarchy
We can now see how classical algorithms emerge from GMM:
- Full GMM (different \(\boldsymbol{\Sigma}_k\)): Quadratic Discriminant Analysis (QDA) → most flexible, quadratic boundaries
- GMM with \(\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma}\): Linear Discriminant Analysis (LDA) → linear boundaries, Mahalanobis metric
- LDA viewed discriminatively: Logistic Regression → same linear boundary, different learning approach
- GMM with \(\boldsymbol{\Sigma}_k = \sigma^2\mathbf{I}\) (finite \(\sigma^2\)): Isotropic LDA → linear boundaries, Euclidean metric, soft assignments
- GMM with \(\boldsymbol{\Sigma}_k = \sigma^2\mathbf{I}\) and \(\sigma^2 \to 0\): K-Means → linear boundaries, Euclidean metric, hard assignments
- Logistic Regression with \(\|\mathbf{w}\| \to \infty\): Hard linear classifier → similar to binary K-Means
The Role of Covariance
The covariance matrix \(\boldsymbol{\Sigma}\) is the key that controls:
- Shape: Isotropic (\(\sigma^2\mathbf{I}\)) → spherical clusters; General \(\boldsymbol{\Sigma}\) → elliptical clusters
- Boundary geometry: Equal \(\boldsymbol{\Sigma}_k\) → linear; Different \(\boldsymbol{\Sigma}_k\) → quadratic
- Distance metric: \(\sigma^2\mathbf{I}\) → Euclidean; General \(\boldsymbol{\Sigma}\) → Mahalanobis
The Role of Variance Magnitude
The magnitude of variance (in GMM) or weight vector (in logistic regression) controls the transition from soft to hard:
- Large \(\sigma^2\) (or small \(\|\mathbf{w}\|\)): Soft, probabilistic assignments—significant uncertainty near boundaries
- Small \(\sigma^2\) (or large \(\|\mathbf{w}\|\)): Sharp, nearly deterministic assignments
- \(\sigma^2 \to 0\) (or \(\|\mathbf{w}\| \to \infty\)): Completely hard assignments—clustering rather than probability
Conclusion
The Gaussian Mixture Model is far more than just another clustering algorithm—it is a unifying framework that connects generative and discriminative learning, soft and hard assignments, and linear and nonlinear decision boundaries.
By systematically varying the constraints on the covariance matrices and examining limiting behaviors, we've uncovered the deep relationships between seemingly disparate algorithms:
- K-Means is GMM in the limit of infinitesimal isotropic variance
- LDA is GMM with equal covariances across classes
- Logistic regression is the discriminative counterpart to LDA
- QDA is the unrestricted GMM allowing different covariances
Understanding these connections provides valuable intuition for choosing and designing machine learning models. When you select K-Means, you're implicitly assuming spherical clusters with hard assignments. When you use logistic regression, you're making the same distributional assumptions as LDA. And when you need more flexibility, QDA offers the full expressive power of GMM. The sharpness of the decision boundary (controlled by \(\sigma^2\) in GMM or \(\|\mathbf{w}\|\) in logistic regression) determines whether assignments are probabilistic or deterministic.
The journey from GMM to these classical algorithms reveals a beautiful mathematical structure—one where geometric intuition, probabilistic reasoning, and algebraic manipulation come together to illuminate the hidden unity of machine learning.