Multi-Class Classification
Introduction to Multi-Class Linear Classification
Rather than decomposing a multi-class problem into multiple binary classification tasks, we can construct a unified model that handles all $K$ classes simultaneously. This approach yields a more elegant formulation and possesses desirable theoretical properties, including the natural emergence of convex decision regions.
Comparison: Binary vs Multi-Class Classification
Fundamental Difference for 2 Classes:
Binary Classification Approach:
• Uses a SINGLE weight vector $\mathbf{w}$
• One discriminant: $y(\mathbf{x}) = \mathbf{w}^\top\mathbf{x} + b$
• Decision: Class 1 if $y(\mathbf{x}) > 0$, Class 2 if $y(\mathbf{x}) < 0$
Multi-Class Approach (even for K=2):
• Uses TWO weight vectors: $\mathbf{w}_1$ and $\mathbf{w}_2$
• Two discriminants: $y_1(\mathbf{x}) = \mathbf{w}_1^\top\mathbf{x} + b_1$ and $y_2(\mathbf{x}) = \mathbf{w}_2^\top\mathbf{x} + b_2$
• Decision: Class 1 if $y_1(\mathbf{x}) > y_2(\mathbf{x})$, Class 2 if $y_2(\mathbf{x}) > y_1(\mathbf{x})$
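The equivalence of the two formulations for $K=2$ is easy to check numerically: comparing $y_1(\mathbf{x}) > y_2(\mathbf{x})$ is the same as testing the sign of a single discriminant built from $\mathbf{w} = \mathbf{w}_1 - \mathbf{w}_2$ and $b = b_1 - b_2$. A minimal sketch (the weight values are arbitrary assumptions, not taken from the text):

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration.
w1, b1 = np.array([2.0, 1.0]), 0.5
w2, b2 = np.array([-1.0, 2.0]), 1.0

def predict_multiclass(x):
    """Winner-takes-all over two discriminants."""
    y1 = w1 @ x + b1
    y2 = w2 @ x + b2
    return 1 if y1 > y2 else 2

def predict_binary(x):
    """Equivalent single-discriminant rule: w = w1 - w2, b = b1 - b2."""
    w, b = w1 - w2, b1 - b2
    return 1 if w @ x + b > 0 else 2

# The two rules agree everywhere.
rng = np.random.default_rng(0)
for x in rng.normal(size=(100, 2)):
    assert predict_multiclass(x) == predict_binary(x)
```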
Why This Changes the Interpretation:
In Binary: The single $\mathbf{w}$ is perpendicular to the decision boundary and points from Class 2 toward Class 1. The boundary is defined by $\mathbf{w}^\top\mathbf{x} + b = 0$.
In Multi-Class: Each $\mathbf{w}_k$ represents the “direction of preference” for class $C_k$. Neither $\mathbf{w}_1$ nor $\mathbf{w}_2$ is perpendicular to the boundary! Instead:
• $\mathbf{w}_1$ points in the direction where $y_1$ increases (favoring Class 1)
• $\mathbf{w}_2$ points in the direction where $y_2$ increases (favoring Class 2)
• The boundary between classes occurs where $y_1(\mathbf{x}) = y_2(\mathbf{x})$
• The normal to this boundary is $(\mathbf{w}_1 - \mathbf{w}_2)$, NOT $\mathbf{w}_1$ or $\mathbf{w}_2$ alone
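The claim that $(\mathbf{w}_1 - \mathbf{w}_2)$, rather than either weight vector alone, is normal to the boundary can be verified numerically: take two points where $y_1 = y_2$ and check that the displacement between them is orthogonal to the difference vector but not to $\mathbf{w}_1$ itself. A sketch with assumed weights:

```python
import numpy as np

# Hypothetical weights, for illustration only.
w1, b1 = np.array([2.0, 1.0]), 0.5
w2, b2 = np.array([-1.0, 2.0]), 1.0

n = w1 - w2                   # candidate normal to the boundary
c = b1 - b2
x0 = -c * n / (n @ n)         # a point with y1(x0) = y2(x0)
t = np.array([-n[1], n[0]])   # direction orthogonal to n
xa, xb = x0 + 1.7 * t, x0 - 0.4 * t  # two points on the boundary

# Both points lie on the boundary: the discriminants agree there ...
for x in (xa, xb):
    assert np.isclose(w1 @ x + b1, w2 @ x + b2)

# ... the displacement along the boundary is orthogonal to w1 - w2,
# but NOT to w1 alone.
assert np.isclose(n @ (xa - xb), 0.0)
assert not np.isclose(w1 @ (xa - xb), 0.0)
```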
Geometric Properties of Decision Boundaries: The decision boundary separating classes $C_k$ and $C_j$ corresponds to the hyperplane where the two discriminant functions are equal, i.e., where $y_k(\mathbf{x}) = y_j(\mathbf{x})$. This can be expressed in standard hyperplane form as:
$(\mathbf{w}_k - \mathbf{w}_j)^\top \mathbf{x} + (b_k - b_j) = 0$
The decision region associated with each class is defined as the intersection of $K-1$ half-spaces, thereby forming a convex polyhedron. This constitutes a significant advantage over alternative approaches such as one-versus-rest or one-versus-one decomposition: the multi-class discriminant formulation inherently produces convex, singly-connected decision regions, ensuring a geometrically well-behaved partition of the input space.
Mathematical Setup
Understanding the Weight Vectors
The weight vector $\mathbf{w}_1$ indicates the direction in which the discriminant $y_1$ increases most rapidly, rather than directly specifying the location of class $C_1$ in the input space.
Motion along the direction of $\mathbf{w}_1$ increases the value of $y_1$, but the classification decision depends on the relative comparison between $y_1$ and $y_2$, not on the absolute value of either discriminant.
The difference vector $(\mathbf{w}_1 - \mathbf{w}_2)$ represents the normal to the decision boundary, oriented from region $C_2$ toward region $C_1$.
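The distinction between "increasing $y_1$" and "being classified as $C_1$" can be made concrete: below, a point deep inside the $C_2$ region is moved a small step along $\mathbf{w}_1$, which raises $y_1$ without changing the predicted class (the weights are illustrative assumptions, not values from the text):

```python
import numpy as np

# Hypothetical weights, for illustration.
w1, b1 = np.array([2.0, 1.0]), 0.0
w2, b2 = np.array([-1.0, 2.0]), 1.0

def discriminants(x):
    return np.array([w1 @ x + b1, w2 @ x + b2])

x = np.array([-2.0, 4.0])                  # a point well inside region C2
step = x + 0.1 * w1 / np.linalg.norm(w1)   # small move along w1

y_before, y_after = discriminants(x), discriminants(step)
assert y_after[0] > y_before[0]   # y1 increased, as expected ...
assert np.argmax(y_before) == 1   # ... yet the point is classified C2
assert np.argmax(y_after) == 1    # both before and after the move
```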
Example (two-class setup): $\mathbf{w}_2 = [-1, 2]$, $b_2 = 1$; predict $C_2$ wherever $y_2(\mathbf{x}) > y_1(\mathbf{x})$.
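A runnable version of this decision rule, using the $\mathbf{w}_2$, $b_2$ values above. Since $\mathbf{w}_1$ and $b_1$ are not specified here, the values below are assumptions for illustration only:

```python
import numpy as np

# w2 and b2 are taken from the example above.
w2, b2 = np.array([-1.0, 2.0]), 1.0
# w1 and b1 are NOT given in the text; assumed values for illustration.
w1, b1 = np.array([2.0, 1.0]), 0.0

def predict(x):
    """Predict C2 exactly when y2(x) > y1(x)."""
    return 2 if (w2 @ x + b2) > (w1 @ x + b1) else 1

assert predict(np.array([0.0, 3.0])) == 2   # y1 = 3, y2 = 7
assert predict(np.array([3.0, 0.0])) == 1   # y1 = 6, y2 = -2
```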
Three Classes (K=3)
Three-Class Setup
$\mathbf{w}_1 = [1.5, 0.5]$ (implied by the pairwise normals below)
$\mathbf{w}_2 = [-0.8, 1.8]$, $b_2 = -0.5$
$\mathbf{w}_3 = [-0.7, -2.0]$, $b_3 = 3.5$
$C_1$ vs $C_2$: Where $y_1(\mathbf{x}) = y_2(\mathbf{x})$
Normal: $\mathbf{w}_1 - \mathbf{w}_2 = [2.3, -1.3]$
$C_1$ vs $C_3$: Where $y_1(\mathbf{x}) = y_3(\mathbf{x})$
Normal: $\mathbf{w}_1 - \mathbf{w}_3 = [2.2, 2.5]$
$C_2$ vs $C_3$: Where $y_2(\mathbf{x}) = y_3(\mathbf{x})$
Normal: $\mathbf{w}_2 - \mathbf{w}_3 = [-0.1, 3.8]$
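These normals can be verified in a few lines of NumPy. The value of $\mathbf{w}_1$ is not listed explicitly with the other weights, but it is implied by the stated differences (e.g. $\mathbf{w}_2 + [2.3, -1.3] = [1.5, 0.5]$):

```python
import numpy as np

# w2 and w3 are from the three-class setup; w1 is inferred from the
# listed normals (w2 + (w1 - w2) = [1.5, 0.5]).
w1 = np.array([1.5, 0.5])
w2 = np.array([-0.8, 1.8])
w3 = np.array([-0.7, -2.0])

# Each pairwise boundary normal is a difference of weight vectors.
assert np.allclose(w1 - w2, [2.3, -1.3])   # C1 vs C2
assert np.allclose(w1 - w3, [2.2, 2.5])    # C1 vs C3
assert np.allclose(w2 - w3, [-0.1, 3.8])   # C2 vs C3

# The three normals are linearly dependent:
# (w1 - w3) = (w1 - w2) + (w2 - w3).
assert np.allclose((w1 - w2) + (w2 - w3), w1 - w3)
```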
Convex Decision Regions in Multi-Class Linear Classification
For a classification problem involving $K$ classes with affine discriminant functions $y_k(\mathbf{x}) = \mathbf{w}_k^\top \mathbf{x} + b_k$, the decision rule assigns each point $\mathbf{x}$ to the class $C_i$ for which $y_i(\mathbf{x}) > y_j(\mathbf{x})$ holds for all $j \neq i$. This “winner-takes-all” strategy partitions the feature space $\mathbb{R}^D$ into $K$ convex regions.
Theorem (Convexity): The decision region $R_k = \{\mathbf{x} : y_k(\mathbf{x}) \geq y_j(\mathbf{x}) \text{ for all } j \neq k\}$ corresponding to any class $C_k$ forms a convex polyhedron. This result follows from the observation that $R_k$ can be expressed as the intersection of $K-1$ half-spaces, and since the intersection of convex sets is convex, $R_k$ must be convex.
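The convexity result can also be probed empirically: sample points, keep those assigned to one region, and confirm that the midpoint of any pair never leaves that region. The sketch below reuses the weights from the three-class example; $b_1$ is not given there, so $b_1 = 0$ is an assumption:

```python
import numpy as np

# Rows: w1, w2, w3 from the three-class example; b1 = 0 is assumed.
W = np.array([[1.5, 0.5], [-0.8, 1.8], [-0.7, -2.0]])
b = np.array([0.0, -0.5, 3.5])

def region(x):
    """Index of the class whose discriminant dominates at x."""
    return int(np.argmax(W @ x + b))

rng = np.random.default_rng(42)
pts = rng.uniform(-10, 10, size=(2000, 2))
in_r0 = [x for x in pts if region(x) == 0]

# Convexity check: every midpoint of two points in R_0 stays in R_0.
for xa in in_r0[:50]:
    for xb in in_r0[:50]:
        assert region(0.5 * (xa + xb)) == 0
```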
Decision Boundaries: The boundary separating classes $C_i$ and $C_j$ is characterized by the hyperplane where $y_i(\mathbf{x}) = y_j(\mathbf{x})$, or equivalently $(\mathbf{w}_i - \mathbf{w}_j)^\top \mathbf{x} + (b_i - b_j) = 0$. The normal vector to this hyperplane is $(\mathbf{w}_i - \mathbf{w}_j)$, oriented from region $C_j$ toward region $C_i$.
Mathematical Property (Concurrency): In the three-class setting, the boundaries defined by $y_1=y_2$, $y_1=y_3$, and $y_2=y_3$ are linearly dependent, as the relation $(y_1-y_3) = (y_1-y_2) + (y_2-y_3)$ holds identically. Consequently, these three hyperplanes must intersect at a single point (under non-degenerate conditions), which is the unique solution to the system:
$\begin{bmatrix} (\mathbf{w}_1-\mathbf{w}_2)^\top \\ (\mathbf{w}_1-\mathbf{w}_3)^\top \end{bmatrix} \mathbf{x} = -\begin{bmatrix} b_1-b_2 \\ b_1-b_3 \end{bmatrix}$

Class Regions: Each class occupies the region where its discriminant function dominates all others:
• $C_1$: $\{\mathbf{x} : y_1(\mathbf{x}) \geq y_2(\mathbf{x}) \text{ AND } y_1(\mathbf{x}) \geq y_3(\mathbf{x})\}$
• $C_2$: $\{\mathbf{x} : y_2(\mathbf{x}) \geq y_1(\mathbf{x}) \text{ AND } y_2(\mathbf{x}) \geq y_3(\mathbf{x})\}$
• $C_3$: $\{\mathbf{x} : y_3(\mathbf{x}) \geq y_1(\mathbf{x}) \text{ AND } y_3(\mathbf{x}) \geq y_2(\mathbf{x})\}$
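A sketch of the concurrency property, solving the $2 \times 2$ system above for the common intersection point with the three-class example weights ($b_1 = 0$ is assumed, as it is not given in the example):

```python
import numpy as np

# Weights from the three-class example; b1 = 0.0 is an assumption.
w1, b1 = np.array([1.5, 0.5]), 0.0
w2, b2 = np.array([-0.8, 1.8]), -0.5
w3, b3 = np.array([-0.7, -2.0]), 3.5

# Solve for the common point:
# [ (w1 - w2)^T ]       [ b1 - b2 ]
# [ (w1 - w3)^T ] x = - [ b1 - b3 ]
A = np.stack([w1 - w2, w1 - w3])
rhs = -np.array([b1 - b2, b1 - b3])
x_star = np.linalg.solve(A, rhs)

# All three discriminants agree at x_star, so the third boundary
# y2 = y3 necessarily passes through it as well.
ys = np.array([w1 @ x_star + b1, w2 @ x_star + b2, w3 @ x_star + b3])
assert np.allclose(ys, ys[0])
```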