Table of Contents
- Chapter I: Introduction: What are Gaussian Mixture Models?
- Chapter II: Decision Boundaries in GMM: The General Case
- Chapter III: Case 1: Isotropic Covariances — Recovering K-Means
- Chapter IV: Case 2: Equal Isotropic Covariances
- Chapter V: Case 3: Equal Non-Isotropic Covariances
- Chapter VI: Case 4: Different Covariances — QDA
- Chapter VII: Connection to Logistic Regression
- Chapter VIII: The Complete Picture: A Unified View
- Chapter IX: Conclusion
Introduction: What are Gaussian Mixture Models?
The Machine Learning Context
In machine learning, we often encounter two fundamental types of models: discriminative and generative models. Discriminative models (like logistic regression or support vector machines) directly learn the decision boundary between classes—they answer the question "Given the features \(\mathbf{x}\), what is the probability of class \(y\)?" Formally, they model \(P(y \mid \mathbf{x})\).
Generative models, on the other hand, take a fundamentally different approach. They model the joint distribution \(P(\mathbf{x}, y)\) by learning how the data is generated. A generative model answers: "What is the probability of observing these features \(\mathbf{x}\) together with class \(y\)?" By modeling this joint distribution, we can then use Bayes' rule to perform classification:
$$P(y \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y)\,P(y)}{P(\mathbf{x})} = \frac{P(\mathbf{x} \mid y)\,P(y)}{\sum_{y'} P(\mathbf{x} \mid y')\,P(y')}$$
Gaussian Mixture Models (GMMs) are a powerful class of generative models. The fundamental assumption is that our data is generated from a mixture of several Gaussian (normal) distributions. Each Gaussian represents a different "cluster" or "component" in the data.
The Core Assumption
Let us consider a dataset in which each point is a vector \(\mathbf{x} = (x_1, x_2)^\top \in \mathbb{R}^2\). The GMM assumes that each data point is generated through the following process:
- First, we randomly select one of \(K\) components (clusters) according to probabilities \(\pi_1, \pi_2, \ldots, \pi_K\) where \(\sum_{k=1}^K \pi_k = 1\).
- Then, given that we selected component \(k\), we sample the point from a Gaussian distribution \(\mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\) with mean \(\boldsymbol{\mu}_k\) and covariance matrix \(\boldsymbol{\Sigma}_k\).
Mathematically, if we introduce a latent variable \(z \in \{1, 2, \ldots, K\}\) that indicates which component generated the data point, we have:
$$P(z = k) = \pi_k, \qquad p(\mathbf{x} \mid z = k) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
Since we don't observe \(z\) (it's latent/hidden), we marginalize over it to get the probability of observing \(\mathbf{x}\):
$$p(\mathbf{x}) = \sum_{k=1}^K \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
This is the fundamental equation of the Gaussian Mixture Model. The Gaussian density is:
$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$
where \(d\) is the dimension of \(\mathbf{x}\).
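To make this concrete, here is a minimal NumPy sketch that evaluates the mixture density for a two-component model. The function names and the toy parameters are illustrative choices for this guide, not part of any particular library.

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def gmm_density(x, pis, mus, Sigmas):
    """Mixture density p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi * gaussian_density(x, mu, S) for pi, mu, S in zip(pis, mus, Sigmas))

# Toy two-component mixture in R^2 (hypothetical parameters).
pis = [0.6, 0.4]
mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 2.0]])]
print(gmm_density(np.array([1.0, 0.5]), pis, mus, Sigmas))
```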
Why GMMs Matter: A Unifying Framework
The beauty of GMMs lies in their generality. By imposing different constraints on the covariance matrices \(\boldsymbol{\Sigma}_k\) and examining limiting behaviors, we recover many classical machine learning algorithms as special cases:
Special Cases of GMM:
- K-Means Clustering: GMM with isotropic covariances \(\boldsymbol{\Sigma}_k = \sigma^2 \mathbf{I}\) where \(\sigma^2 \to 0\)
- Linear Discriminant Analysis (LDA): GMM with equal covariances across all components \(\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \cdots = \boldsymbol{\Sigma}_K\)
- Logistic Regression: Two-class LDA viewed discriminatively
- Quadratic Discriminant Analysis (QDA): GMM with different covariances for each component
In this guide, we will systematically explore these connections by starting from the general GMM formula and deriving what happens under different assumptions.
Decision Boundaries in GMM: The General Case
Before examining special cases, let us understand how GMM makes decisions. In a classification context with \(K\) classes, we assign a point \(\mathbf{x}\) to the class with the highest posterior probability:
$$\hat{z} = \arg\max_k \; P(z = k \mid \mathbf{x})$$
Using Bayes' rule:
$$P(z = k \mid \mathbf{x}) = \frac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^K \pi_j \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$
For binary classification (\(K=2\)), the decision boundary is the set of points where \(P(z = 1 \mid \mathbf{x}) = P(z = 2 \mid \mathbf{x})\), which occurs when:
$$\pi_1 \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) = \pi_2 \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$$
Taking logarithms (which is monotonic and preserves the equality):
Derivation of the General Decision Boundary:
$$\log \pi_1 + \log \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) = \log \pi_2 + \log \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$$
Expanding the Gaussian densities:
$$\log \pi_1 - \frac{1}{2}\log|\boldsymbol{\Sigma}_1| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)^\top \boldsymbol{\Sigma}_1^{-1}(\mathbf{x} - \boldsymbol{\mu}_1)$$ $$= \log \pi_2 - \frac{1}{2}\log|\boldsymbol{\Sigma}_2| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_2)^\top \boldsymbol{\Sigma}_2^{-1}(\mathbf{x} - \boldsymbol{\mu}_2)$$
Rearranging:
$$\mathbf{x}^\top \boldsymbol{\Sigma}_1^{-1}\mathbf{x} - \mathbf{x}^\top \boldsymbol{\Sigma}_2^{-1}\mathbf{x} - 2\boldsymbol{\mu}_1^\top \boldsymbol{\Sigma}_1^{-1}\mathbf{x} + 2\boldsymbol{\mu}_2^\top \boldsymbol{\Sigma}_2^{-1}\mathbf{x} + \text{const} = 0$$
This is a quadratic equation in \(\mathbf{x}\). The decision boundary is, in general, a conic section (ellipse, parabola, or hyperbola). The key insight is that the shape of the boundary depends entirely on the covariance matrices.
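The derivation above is easy to check numerically. The sketch below (plain NumPy, hypothetical parameters) evaluates the two log posteriors up to their shared constant and classifies a point by comparing them; the decision boundary is wherever the difference of the two scores is zero.

```python
import numpy as np

def log_discriminant(x, pi, mu, Sigma):
    """log pi_k + log N(x | mu_k, Sigma_k), dropping the shared -d/2 * log(2*pi) term."""
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    _, logdet = np.linalg.slogdet(Sigma)
    return np.log(pi) - 0.5 * logdet - 0.5 * quad

def classify(x, params):
    """Assign x to the component with the largest posterior (shared constants cancel)."""
    scores = [log_discriminant(x, *p) for p in params]
    return int(np.argmax(scores))

# Hypothetical two-component setup; the boundary is where the two scores are equal.
params = [
    (0.5, np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]])),
    (0.5, np.array([3.0, 0.0]), np.array([[2.0, 0.8], [0.8, 1.0]])),
]
x = np.array([1.5, 0.2])
print(classify(x, params), log_discriminant(x, *params[0]) - log_discriminant(x, *params[1]))
```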
Case 1: Isotropic Covariances with \(\sigma^2 \to 0\) — Recovering K-Means
The Setup
Let us now examine our first special case. Suppose all components have isotropic (spherical) covariances with the same variance:
$$\boldsymbol{\Sigma}_k = \sigma^2 \mathbf{I} \quad \text{for all } k$$
where \(\mathbf{I}\) is the identity matrix. This means each cluster is circular (in 2D) or spherical (in higher dimensions) with the same "spread" \(\sigma^2\).
What Happens to the Posterior?
With this constraint, the posterior probability for component \(k\) takes a particularly simple form.
Derivation:
The Gaussian density simplifies to:
$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \sigma^2\mathbf{I}) = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_k\|^2}{2\sigma^2}\right)$$
Therefore, the posterior is:
$$P(z = k \mid \mathbf{x}) = \frac{\pi_k \exp\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_k\|^2}{2\sigma^2}\right)}{\sum_{j=1}^K \pi_j \exp\left(-\frac{\|\mathbf{x} - \boldsymbol{\mu}_j\|^2}{2\sigma^2}\right)}$$
The Limit \(\sigma^2 \to 0\)
Now comes the crucial insight. What happens as \(\sigma^2 \to 0\)? The exponential terms become extremely sharp. Write \(d_k = \|\mathbf{x} - \boldsymbol{\mu}_k\|^2\) and let \(k^* = \arg\min_j d_j\) denote the closest component. Dividing the numerator and denominator of the posterior by \(\exp\left(-\frac{d_{k^*}}{2\sigma^2}\right)\), each term becomes \(\pi_j \exp\left(-\frac{d_j - d_{k^*}}{2\sigma^2}\right)\):
- For the closest component \(j = k^*\), the exponent is zero, so the term stays at \(\pi_{k^*}\)
- For every other component, \(d_j - d_{k^*} > 0\), so the term \(\exp\left(-\frac{d_j - d_{k^*}}{2\sigma^2}\right) \to 0\) exponentially fast
Key Insight: As \(\sigma^2 \to 0\), the posterior probability becomes:
$$P(z = k \mid \mathbf{x}) \to \begin{cases} 1 & \text{if } k = \arg\min_j \|\mathbf{x} - \boldsymbol{\mu}_j\|^2 \\ 0 & \text{otherwise} \end{cases}$$
This is exactly the K-Means assignment rule: assign each point to the nearest centroid!
From Soft to Hard Assignment
GMM normally performs soft assignment—each point has some probability of belonging to each cluster. But as \(\sigma^2 \to 0\), we transition to hard assignment—each point belongs definitively to exactly one cluster. K-Means is the limiting case of GMM where this assignment becomes deterministic.
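The following sketch illustrates this limit numerically: the posterior of an isotropic GMM is a softmax of scaled negative squared distances, and shrinking \(\sigma^2\) drives it toward a one-hot, K-Means-style assignment. The centroids and the query point are arbitrary toy values.

```python
import numpy as np

def isotropic_posteriors(x, mus, sigma2, pis=None):
    """Posterior P(z=k | x) for an isotropic GMM: a softmax of -||x - mu_k||^2 / (2 sigma^2)."""
    mus = np.asarray(mus)
    if pis is None:
        pis = np.ones(len(mus)) / len(mus)
    sq_dists = np.sum((mus - x) ** 2, axis=1)
    logits = np.log(pis) - sq_dists / (2 * sigma2)
    logits -= logits.max()                      # stabilize before exponentiating
    weights = np.exp(logits)
    return weights / weights.sum()

mus = [[0.0, 0.0], [4.0, 0.0]]
x = np.array([1.5, 0.0])                        # closer to the first centroid
for sigma2 in [10.0, 1.0, 0.1, 0.01]:
    print(sigma2, isotropic_posteriors(x, mus, sigma2))   # posterior approaches [1, 0]
```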
Case 2: Equal Isotropic Covariances (Finite Variance)
Maintaining Isotropy with Finite Variance
What if we keep the spherical assumption \(\boldsymbol{\Sigma}_k = \sigma^2 \mathbf{I}\) but don't let \(\sigma^2 \to 0\)? Let us derive the decision boundary.
Derivation of the Decision Boundary:
For equal priors (\(\pi_1 = \pi_2\)), the boundary occurs where:
$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \sigma^2\mathbf{I}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \sigma^2\mathbf{I})$$
Taking logs and using the fact that \(|\sigma^2\mathbf{I}| = (\sigma^2)^d\) cancels out:
$$-\frac{\|\mathbf{x} - \boldsymbol{\mu}_1\|^2}{2\sigma^2} = -\frac{\|\mathbf{x} - \boldsymbol{\mu}_2\|^2}{2\sigma^2}$$
Multiplying by \(-2\sigma^2\) and expanding:
$$\|\mathbf{x}\|^2 - 2\boldsymbol{\mu}_1^\top\mathbf{x} + \|\boldsymbol{\mu}_1\|^2 = \|\mathbf{x}\|^2 - 2\boldsymbol{\mu}_2^\top\mathbf{x} + \|\boldsymbol{\mu}_2\|^2$$
The \(\|\mathbf{x}\|^2\) terms cancel! We get:
$$2(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^\top\mathbf{x} = \|\boldsymbol{\mu}_1\|^2 - \|\boldsymbol{\mu}_2\|^2$$
Result: With equal isotropic covariances, the decision boundary is a hyperplane (a line in 2D). Specifically, it is the perpendicular bisector of the line segment connecting \(\boldsymbol{\mu}_1\) and \(\boldsymbol{\mu}_2\). This is exactly the same boundary as K-Means!
However, unlike K-Means, points near the boundary have uncertain assignments. The width of the uncertainty region is controlled by \(\sigma^2\).
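Here is a small sketch, using made-up means, that constructs this hyperplane and verifies the two geometric facts just stated: the midpoint of the two means lies on the boundary, and the boundary's normal is parallel to \(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\).

```python
import numpy as np

def bisector_boundary(mu1, mu2):
    """Boundary for equal isotropic covariances: 2(mu1 - mu2)^T x = ||mu1||^2 - ||mu2||^2."""
    w = 2 * (mu1 - mu2)                         # normal vector of the hyperplane
    c = mu1 @ mu1 - mu2 @ mu2                   # offset
    return w, c

mu1, mu2 = np.array([0.0, 0.0]), np.array([4.0, 2.0])
w, c = bisector_boundary(mu1, mu2)

midpoint = 0.5 * (mu1 + mu2)
v = mu1 - mu2
cross = w[0] * v[1] - w[1] * v[0]               # 2-D cross product: zero iff w is parallel to v
print(np.isclose(w @ midpoint, c))              # True: the midpoint lies on the boundary
print(np.isclose(cross, 0.0))                   # True: the normal is parallel to mu1 - mu2
```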
Interactive Visualization: Isotropic Gaussian Mixture Model
In an isotropic Gaussian mixture, both clusters share the same spherical covariance: \(\Sigma_1 = \Sigma_2 = \sigma^2 I\). This creates perfect circular symmetry around each cluster center, and the decision boundary becomes the Euclidean bisector — a straight line perpendicular to the line connecting the two means.
Mathematical Foundation: With isotropic covariances \(\Sigma_1 = \Sigma_2 = \sigma^2 I\), the decision boundary satisfies:
\[\|\mathbf{x}-\mu_1\|^2 = \|\mathbf{x}-\mu_2\|^2\]
Expanding this equation yields:
\[(\mu_1-\mu_2)^\top \mathbf{x} = \frac{1}{2}(\|\mu_1\|^2 - \|\mu_2\|^2)\]
This is the equation of a hyperplane — the perpendicular bisector of the segment connecting \(\mu_1\) and \(\mu_2\). The boundary is linear, orthogonal to the vector \((\mu_1-\mu_2)\), and passes through the midpoint.
Case 3: Equal Non-Isotropic Covariances — Enter the Mahalanobis Distance
Breaking Isotropy
Now let us relax the isotropy constraint while maintaining equality: all components share the same covariance matrix \(\boldsymbol{\Sigma}\), but this matrix is no longer a multiple of the identity. This means our clusters can be elliptical rather than circular, and they can be oriented in any direction.
Deriving the Decision Boundary
Derivation:
The boundary occurs where (assuming equal priors):
$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})$$
Taking logs:
$$-\frac{1}{2}\log|\boldsymbol{\Sigma}| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_1)$$ $$= -\frac{1}{2}\log|\boldsymbol{\Sigma}| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_2)^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_2)$$
The \(\log|\boldsymbol{\Sigma}|\) terms cancel. Expanding the quadratic forms:
$$\mathbf{x}^\top\boldsymbol{\Sigma}^{-1}\mathbf{x} - 2\boldsymbol{\mu}_1^\top\boldsymbol{\Sigma}^{-1}\mathbf{x} + \boldsymbol{\mu}_1^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1$$ $$= \mathbf{x}^\top\boldsymbol{\Sigma}^{-1}\mathbf{x} - 2\boldsymbol{\mu}_2^\top\boldsymbol{\Sigma}^{-1}\mathbf{x} + \boldsymbol{\mu}_2^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2$$
Crucial observation: The quadratic terms \(\mathbf{x}^\top\boldsymbol{\Sigma}^{-1}\mathbf{x}\) cancel out! We're left with:
$$2(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^\top\boldsymbol{\Sigma}^{-1}\mathbf{x} = \boldsymbol{\mu}_1^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2$$
Result: The decision boundary is still a hyperplane (linear), but its orientation is determined by \(\boldsymbol{\Sigma}^{-1}\mathbf{v}\), where \(\mathbf{v} = \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\), not by \(\mathbf{v}\) itself!
The vector \(\boldsymbol{\Sigma}^{-1}\mathbf{v}\) defines the direction perpendicular to the decision boundary. In general this differs from the Euclidean bisector; the two coincide when \(\boldsymbol{\Sigma} = \sigma^2\mathbf{I}\) and, as we will see shortly, whenever \(\mathbf{v}\) is an eigenvector of \(\boldsymbol{\Sigma}\).
The Mahalanobis Distance
The quantity \((\mathbf{x} - \boldsymbol{\mu})^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\) is called the squared Mahalanobis distance. Unlike Euclidean distance, it accounts for:
- The different variances along different directions (some directions have more spread)
- The correlations between variables (variables might increase together)
The Mahalanobis distance essentially whitens the space: it applies a linear transformation under which the covariance becomes the identity, and then measures ordinary Euclidean distance in that transformed space.
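The whitening interpretation is easy to verify numerically. In the sketch below (toy values, plain NumPy), the squared Mahalanobis distance computed directly matches the squared Euclidean distance computed after multiplying by the inverse Cholesky factor of \(\boldsymbol{\Sigma}\).

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = x - mu
    return diff @ np.linalg.solve(Sigma, diff)

def whiten(x, Sigma):
    """Map x into a space where Sigma becomes the identity, via the Cholesky factor."""
    L = np.linalg.cholesky(Sigma)               # Sigma = L L^T
    return np.linalg.solve(L, x)                # L^{-1} x

Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
mu = np.array([1.0, -1.0])
x = np.array([2.5, 0.5])

d2 = mahalanobis_sq(x, mu, Sigma)
d2_whitened = np.sum((whiten(x, Sigma) - whiten(mu, Sigma)) ** 2)
print(np.isclose(d2, d2_whitened))              # True: Euclidean distance after whitening
```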
When Do the Boundaries Coincide?
Key Theorem: The GMM (Mahalanobis) boundary coincides with the Euclidean bisector if and only if \(\mathbf{v} = \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\) is an eigenvector of \(\boldsymbol{\Sigma}\).
Proof sketch: If \(\mathbf{v}\) is an eigenvector of \(\boldsymbol{\Sigma}\), it is also an eigenvector of \(\boldsymbol{\Sigma}^{-1}\), so \(\boldsymbol{\Sigma}^{-1}\mathbf{v} = \lambda\mathbf{v}\) for some scalar \(\lambda > 0\), and the normal to the hyperplane is parallel to \(\mathbf{v}\). Since the boundary always passes through the midpoint \(\frac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)\) when the priors are equal, a boundary whose normal is parallel to \(\mathbf{v}\) is exactly the perpendicular bisector. Conversely, if \(\boldsymbol{\Sigma}^{-1}\mathbf{v}\) is parallel to \(\mathbf{v}\), then \(\mathbf{v}\) is an eigenvector of \(\boldsymbol{\Sigma}^{-1}\), and hence of \(\boldsymbol{\Sigma}\).
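The sketch below checks this condition numerically for a diagonal (axis-aligned) covariance: when \(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\) lies along a principal axis, the boundary normal \(\boldsymbol{\Sigma}^{-1}\mathbf{v}\) stays parallel to \(\mathbf{v}\); otherwise it does not. The helper names and parameters are illustrative.

```python
import numpy as np

def lda_normal(mu1, mu2, Sigma):
    """Normal vector of the equal-covariance (LDA) boundary: Sigma^{-1} (mu1 - mu2)."""
    return np.linalg.solve(Sigma, mu1 - mu2)

def boundaries_coincide(mu1, mu2, Sigma, tol=1e-10):
    """True when Sigma^{-1} v is parallel to v = mu1 - mu2, i.e. v is an eigenvector of Sigma."""
    v = mu1 - mu2
    n = lda_normal(mu1, mu2, Sigma)
    return abs(n[0] * v[1] - n[1] * v[0]) < tol     # 2-D cross product vanishes for parallel vectors

Sigma = np.array([[2.0, 0.0], [0.0, 0.5]])          # axis-aligned elliptical covariance
print(boundaries_coincide(np.array([1.0, 0.0]), np.array([-1.0, 0.0]), Sigma))   # True: v along an eigenvector
print(boundaries_coincide(np.array([1.0, 1.0]), np.array([-1.0, -1.0]), Sigma))  # False: v is not an eigenvector
```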
Connection to Linear Discriminant Analysis
This case—GMM with equal covariances—is exactly Linear Discriminant Analysis (LDA). LDA is a classical statistical method for classification that assumes each class has a Gaussian distribution with the same covariance matrix.
Interactive Visualization: Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis assumes that both classes share the same covariance matrix \(\Sigma\), but have different means \(\mu_1\) and \(\mu_2\). This shared covariance constraint produces a linear decision boundary. The black line shows the Mahalanobis boundary, while the green dashed line shows the Euclidean bisector for comparison. The dashed gray lines show the principal axes of the covariance ellipses.
Key Observation: When covariance is non-zero, the Mahalanobis boundary (black) diverges from the Euclidean bisector (green). The ellipses show how the covariance structure affects the cluster shapes. Try adjusting the covariance to see how the decision boundary rotates!
Note: The covariance range is automatically constrained to ensure the covariance matrix remains positive semi-definite (PSD). The maximum allowed covariance is \(\sqrt{\sigma_x^2 \cdot \sigma_y^2}\).
Case 4: Different Covariances — Quadratic Discriminant Analysis
The Most General Case
Finally, let us consider the fully general case where each component has its own covariance matrix: \(\boldsymbol{\Sigma}_1 \neq \boldsymbol{\Sigma}_2\). What happens to the decision boundary?
Derivation:
The boundary equation is:
$$\log\pi_1 - \frac{1}{2}\log|\boldsymbol{\Sigma}_1| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_1)^\top\boldsymbol{\Sigma}_1^{-1}(\mathbf{x} - \boldsymbol{\mu}_1)$$ $$= \log\pi_2 - \frac{1}{2}\log|\boldsymbol{\Sigma}_2| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_2)^\top\boldsymbol{\Sigma}_2^{-1}(\mathbf{x} - \boldsymbol{\mu}_2)$$
Now, crucially, the quadratic terms do not cancel:
$$\mathbf{x}^\top\boldsymbol{\Sigma}_1^{-1}\mathbf{x} \neq \mathbf{x}^\top\boldsymbol{\Sigma}_2^{-1}\mathbf{x}$$
Rearranging:
$$\mathbf{x}^\top(\boldsymbol{\Sigma}_1^{-1} - \boldsymbol{\Sigma}_2^{-1})\mathbf{x} + \text{linear terms in } \mathbf{x} + \text{constant} = 0$$
Result: The decision boundary is a quadratic surface—in 2D, this is a conic section (ellipse, parabola, or hyperbola depending on the discriminant).
This is Quadratic Discriminant Analysis (QDA), the most flexible version of discriminant analysis. It can capture much more complex decision boundaries than LDA, but requires estimating more parameters.
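To make the quadratic form explicit, the boundary can be written as \(\mathbf{x}^\top \mathbf{A}\mathbf{x} + \mathbf{b}^\top\mathbf{x} + c = 0\) with \(\mathbf{A} = -\tfrac{1}{2}(\boldsymbol{\Sigma}_1^{-1} - \boldsymbol{\Sigma}_2^{-1})\). The sketch below (hypothetical parameters) computes these coefficients and inspects the eigenvalues of \(\mathbf{A}\) to get a rough idea of which conic the boundary traces out.

```python
import numpy as np

def quadratic_boundary_terms(mu1, S1, mu2, S2, pi1=0.5, pi2=0.5):
    """Coefficients of x^T A x + b^T x + c = 0, the QDA decision boundary."""
    S1_inv, S2_inv = np.linalg.inv(S1), np.linalg.inv(S2)
    A = -0.5 * (S1_inv - S2_inv)
    b = S1_inv @ mu1 - S2_inv @ mu2
    c = (np.log(pi1 / pi2)
         - 0.5 * (np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S2)[1])
         - 0.5 * (mu1 @ S1_inv @ mu1 - mu2 @ S2_inv @ mu2))
    return A, b, c

mu1, S1 = np.array([0.0, 0.0]), np.eye(2)
mu2, S2 = np.array([3.0, 0.0]), np.array([[3.0, 1.0], [1.0, 2.0]])
A, b, c = quadratic_boundary_terms(mu1, S1, mu2, S2)

eigvals = np.linalg.eigvalsh(A)
kind = "ellipse" if np.all(eigvals > 0) or np.all(eigvals < 0) else "hyperbola or degenerate"
print(A, b, c, kind, sep="\n")
```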
Interactive Visualization: Anisotropic Gaussian Mixture
This comprehensive visualization allows you to explore both Linear Discriminant Analysis (LDA) with shared covariances and Quadratic Discriminant Analysis (QDA) with different covariances. Use the checkbox to switch between models, or try the preset buttons to explore classic cases.
Explore Different Models: Use the checkbox to switch between LDA (shared covariance) and QDA (separate covariances). The preset buttons configure the visualization for classic cases:
- Isotropic GMM: Spherical clusters with equal variance
- LDA (Equal Σ): Both clusters share the same covariance matrix
- Eigenvector Case: The mean difference vector aligns with an eigenvector of Σ
- QDA (Different Σ): Each cluster has its own distinct covariance matrix
Connection to Logistic Regression
From Generative to Discriminative
We've seen that GMM with equal covariances (Case 3) gives us LDA. But there's another famous algorithm lurking here: logistic regression. How are they connected?
Consider a two-class GMM with equal covariances \(\boldsymbol{\Sigma}\). The log-odds (the log of the ratio of the posterior probabilities) works out as follows.
Derivation:
$$\log\frac{P(z=1|\mathbf{x})}{P(z=2|\mathbf{x})} = \log\frac{\pi_1 \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_1,\boldsymbol{\Sigma})}{\pi_2 \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_2,\boldsymbol{\Sigma})}$$
Expanding and using the fact that the quadratic terms cancel (as we showed in Case 3):
$$= \log\frac{\pi_1}{\pi_2} + (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^\top\boldsymbol{\Sigma}^{-1}\mathbf{x} - \frac{1}{2}\boldsymbol{\mu}_1^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2$$
This can be written as:
$$= \mathbf{w}^\top\mathbf{x} + b$$
where \(\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)\) and \(b = \log\frac{\pi_1}{\pi_2} - \frac{1}{2}\boldsymbol{\mu}_1^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_2^\top\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2\) is a constant.
Key Insight: The log-odds is a linear function of \(\mathbf{x}\)! It follows that \(P(z=1\mid\mathbf{x}) = \sigma(\mathbf{w}^\top\mathbf{x} + b)\), where \(\sigma(z) = \frac{1}{1+e^{-z}}\) is the sigmoid function. This is exactly the functional form of logistic regression.
The difference: LDA/GMM learns \(\mathbf{w}\) and \(b\) by estimating the class means and covariance (generative approach), while logistic regression optimizes them directly to maximize the conditional likelihood (discriminative approach).
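As a sanity check, the sketch below builds \(\mathbf{w}\) and \(b\) from made-up GMM parameters and confirms that the sigmoid of \(\mathbf{w}^\top\mathbf{x} + b\) reproduces the Bayes posterior computed directly from the two Gaussian densities.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def lda_weights(pi1, pi2, mu1, mu2, Sigma):
    """Logistic-regression form of the equal-covariance posterior: w = Sigma^{-1}(mu1 - mu2)."""
    S_inv = np.linalg.inv(Sigma)
    w = S_inv @ (mu1 - mu2)
    b = np.log(pi1 / pi2) - 0.5 * (mu1 @ S_inv @ mu1 - mu2 @ S_inv @ mu2)
    return w, b

def gaussian_density(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    return np.exp(-0.5 * quad) / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))

pi1, pi2 = 0.3, 0.7
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.5, 0.4], [0.4, 1.0]])
x = np.array([1.0, 0.2])

w, b = lda_weights(pi1, pi2, mu1, mu2, Sigma)
p_sigmoid = sigmoid(w @ x + b)
p_bayes = pi1 * gaussian_density(x, mu1, Sigma) / (
    pi1 * gaussian_density(x, mu1, Sigma) + pi2 * gaussian_density(x, mu2, Sigma))
print(np.isclose(p_sigmoid, p_bayes))           # True: the two posteriors agree
```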
From Soft to Hard Boundaries: The Magnitude of the Weight Vector
In logistic regression, the decision boundary is determined by \(\mathbf{w}^\top\mathbf{x} + b = 0\), and the probability is given by:
$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top\mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top\mathbf{x} + b)}}$$
The magnitude of the weight vector \(\|\mathbf{w}\|\) controls how "sharp" the transition is around the decision boundary. Consider what happens as we scale \(\mathbf{w}\):
- When \(\|\mathbf{w}\|\) is small: The sigmoid changes gradually—points far from the boundary still have significant probability for both classes (soft boundary)
- When \(\|\mathbf{w}\|\) is large: The sigmoid becomes very steep—even points slightly away from the boundary have probabilities close to 0 or 1 (hard boundary)
- As \(\|\mathbf{w}\| \to \infty\): The sigmoid approaches a step function—we get deterministic, hard classification similar to binary K-Means
This parallels exactly what we saw in GMM: decreasing \(\sigma^2\) in GMM has the same effect as increasing \(\|\mathbf{w}\|\) in logistic regression; both transition from soft to hard assignments. Indeed, for the isotropic case \(\boldsymbol{\Sigma} = \sigma^2\mathbf{I}\), the induced weight vector is \(\mathbf{w} = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)/\sigma^2\), so shrinking \(\sigma^2\) literally grows \(\|\mathbf{w}\|\).
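A tiny sketch of this effect: scaling \(\mathbf{w}\) and \(b\) by a common factor \(s\) leaves the boundary \(\{\mathbf{w}^\top\mathbf{x} + b = 0\}\) unchanged but pushes the predicted probability of a fixed nearby point toward 0 or 1. All numbers here are arbitrary.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

w = np.array([1.0, -0.5])
b = -0.25
x = np.array([0.5, 0.3])                        # a point slightly on the positive side of w^T x + b = 0

# Scaling both w and b by s keeps the boundary fixed but sharpens the transition.
for s in [0.5, 2.0, 10.0, 100.0]:
    print(s, sigmoid(s * (w @ x + b)))          # probability climbs toward 1 as s grows
```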
Interactive Visualization: Soft to Hard Boundaries in Logistic Regression
This visualization demonstrates how the magnitude of the weight vector \(\|\mathbf{w}\|\) controls the transition from soft probabilistic assignments to hard deterministic boundaries in logistic regression. Use the slider to adjust the effective weight magnitude and observe how the decision boundary becomes sharper.
Observe: As the effective weight magnitude \(\|\mathbf{w}\|\) increases, the sigmoid function becomes steeper, creating a sharper transition between classes. At very high values, we approach a hard linear classifier similar to binary K-Means. The checkbox "Hard limit" immediately switches to deterministic classification.
The Complete Picture: A Unified View
The GMM Hierarchy
We can now see how classical algorithms emerge from GMM:
- Full GMM (different \(\boldsymbol{\Sigma}_k\)): Quadratic Discriminant Analysis (QDA) → most flexible, quadratic boundaries
- GMM with \(\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma}\): Linear Discriminant Analysis (LDA) → linear boundaries, Mahalanobis metric
- LDA viewed discriminatively: Logistic Regression → same linear boundary, different learning approach
- GMM with \(\boldsymbol{\Sigma}_k = \sigma^2\mathbf{I}\) (finite \(\sigma^2\)): Isotropic LDA → linear boundaries, Euclidean metric, soft assignments
- GMM with \(\boldsymbol{\Sigma}_k = \sigma^2\mathbf{I}\) and \(\sigma^2 \to 0\): K-Means → linear boundaries, Euclidean metric, hard assignments
- Logistic Regression with \(\|\mathbf{w}\| \to \infty\): Hard linear classifier → similar to binary K-Means
The Role of Covariance
The covariance matrix \(\boldsymbol{\Sigma}\) is the key that controls:
- Shape: Isotropic (\(\sigma^2\mathbf{I}\)) → spherical clusters; General \(\boldsymbol{\Sigma}\) → elliptical clusters
- Boundary geometry: Equal \(\boldsymbol{\Sigma}_k\) → linear; Different \(\boldsymbol{\Sigma}_k\) → quadratic
- Distance metric: \(\sigma^2\mathbf{I}\) → Euclidean; General \(\boldsymbol{\Sigma}\) → Mahalanobis
The Role of Variance Magnitude
The magnitude of variance (in GMM) or weight vector (in logistic regression) controls the transition from soft to hard:
- Large \(\sigma^2\) (or small \(\|\mathbf{w}\|\)): Soft, probabilistic assignments—significant uncertainty near boundaries
- Small \(\sigma^2\) (or large \(\|\mathbf{w}\|\)): Sharp, nearly deterministic assignments
- \(\sigma^2 \to 0\) (or \(\|\mathbf{w}\| \to \infty\)): Completely hard assignments—clustering rather than probability
Conclusion
The Gaussian Mixture Model is far more than just another clustering algorithm—it is a unifying framework that connects generative and discriminative learning, soft and hard assignments, and linear and nonlinear decision boundaries.
By systematically varying the constraints on the covariance matrices and examining limiting behaviors, we've uncovered the deep relationships between seemingly disparate algorithms:
- K-Means is GMM in the limit of infinitesimal isotropic variance
- LDA is GMM with equal covariances across classes
- Logistic regression is the discriminative counterpart to LDA
- QDA is the unrestricted GMM allowing different covariances
Understanding these connections provides valuable intuition for choosing and designing machine learning models. When you select K-Means, you're implicitly assuming spherical clusters with hard assignments. When you use logistic regression, you're making the same distributional assumptions as LDA. And when you need more flexibility, QDA offers the full expressive power of GMM. The sharpness of the decision boundary (controlled by \(\sigma^2\) in GMM or \(\|\mathbf{w}\|\) in logistic regression) determines whether assignments are probabilistic or deterministic.
The journey from GMM to these classical algorithms reveals a beautiful mathematical structure—one where geometric intuition, probabilistic reasoning, and algebraic manipulation come together to illuminate the hidden unity of machine learning.