Let’s unpack the identity $A (A^\top A)^{-1/2} = (A A^\top)^{-1/2} A = U V^\top$ and understand why it holds, how it comes from the polar decomposition / SVD, and what the pieces mean. I’ll start with some linear algebra background, then show the derivation, then explain the intuitive meaning.

Background: Polar decomposition & SVD

  1. Singular Value Decomposition (SVD)
  • For any real matrix $A \in \mathbb{R}^{m \times n}$, there is a factorization

$$A = U \Sigma V^\top,$$

where

  • $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$ have orthonormal columns (with $r = \operatorname{rank}(A)$).
  • $\Sigma \in \mathbb{R}^{r \times r}$ is diagonal with nonnegative singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r$, sorted decreasingly.
  2. Polar decomposition
  • Polar decomposition says that any (real) matrix $A$ can be written as

$$A = O P,$$

where $O$ has orthonormal columns and $P = (A^\top A)^{1/2}$ is symmetric positive semidefinite.

  • Then define

$$O = A (A^\top A)^{-1/2}$$

(assuming $A^\top A$ is invertible, or using the pseudo-inverse if $A$ is rank deficient).

  • Alternatively,

$$O = (A A^\top)^{-1/2} A.$$

These are equivalent forms (they both give the same $O$), where the inverse or inverse square root is suitably defined.
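As a quick sanity check of the decomposition itself, here is a small NumPy sketch (variable names and the random test matrix are my own, not from the text): it builds the polar factors from the SVD and verifies $A = OP$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))  # arbitrary full-rank example

# Polar decomposition via the SVD A = U S V^T:
#   O = U V^T (semi-orthogonal), P = V S V^T (symmetric PSD)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
O = U @ Vt
P = Vt.T @ np.diag(s) @ Vt

assert np.allclose(A, O @ P)            # A = O P
assert np.allclose(O.T @ O, np.eye(3))  # O has orthonormal columns
```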

Derivation of the identity

Starting from the SVD $A = U \Sigma V^\top$:

  • $U$ is $m \times r$, $V$ is $n \times r$.
  • $\Sigma$ is $r \times r$, diagonal, with positive singular values $\sigma_i$.

Compute

$$A^\top A = V \Sigma U^\top U \Sigma V^\top = V \Sigma^2 V^\top$$

because $U^\top U = I$. So

$$(A^\top A)^{-1/2} = V \Sigma^{-1} V^\top.$$

Then

$$A (A^\top A)^{-1/2} = U \Sigma V^\top \cdot V \Sigma^{-1} V^\top.$$

  • $V^\top V = I$ since $V$ has orthonormal columns.
  • So

$$A (A^\top A)^{-1/2} = U \Sigma \Sigma^{-1} V^\top = U V^\top.$$

Similarly, starting with

$$A A^\top = U \Sigma V^\top V \Sigma U^\top = U \Sigma^2 U^\top,$$

so $(A A^\top)^{-1/2} = U \Sigma^{-1} U^\top$. Then

$$(A A^\top)^{-1/2} A = U \Sigma^{-1} U^\top \cdot U \Sigma V^\top = U \Sigma^{-1} \Sigma V^\top = U V^\top,$$

since $U^\top U = I$. Thus both expressions produce

$$O = U V^\top,$$

which is the orthogonal polar factor.
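The derivation can be checked numerically. In this NumPy sketch (the helper `inv_sqrt_spd` and the random test matrix are mine, for illustration), both closed forms are compared against $U V^\top$ from the SVD:

```python
import numpy as np

def inv_sqrt_spd(M):
    """Inverse square root of a symmetric positive definite matrix,
    via its eigendecomposition M = Q diag(w) Q^T."""
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(w ** -0.5) @ Q.T

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))  # almost surely invertible

U, s, Vt = np.linalg.svd(A)
O_svd = U @ Vt                       # the orthogonal polar factor U V^T

O1 = A @ inv_sqrt_spd(A.T @ A)       # A (A^T A)^{-1/2}
O2 = inv_sqrt_spd(A @ A.T) @ A       # (A A^T)^{-1/2} A

assert np.allclose(O1, O_svd)
assert np.allclose(O2, O_svd)
```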

Summary: the identity holds because of SVD properties

  • Both

$$A (A^\top A)^{-1/2}$$

and

$$(A A^\top)^{-1/2} A$$

equal the same canonical “orthogonal part” $U V^\top$ in the SVD $A = U \Sigma V^\top$.

  • The square root / inverse square root comes from extracting the singular values: $A^\top A$ has eigenvalues $\sigma_i^2$, so $(A^\top A)^{1/2}$ has eigenvalues $\sigma_i$. Multiplying $A$ by $(A^\top A)^{-1/2}$ removes the magnitudes (singular values), leaving the pure directions $U V^\top$.
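This eigenvalue relationship is easy to verify in NumPy (an illustrative check with a random matrix, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))

s = np.linalg.svd(A, compute_uv=False)  # singular values, descending
w = np.linalg.eigvalsh(A.T @ A)[::-1]   # eigenvalues of A^T A, descending

assert np.allclose(w, s ** 2)  # eigenvalues of A^T A are sigma_i^2
```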

Intuition: what it means

  • Decompose $A$ into “magnitude” + “direction”:
    • $\Sigma$ = magnitudes / singular values (how much scaling along different orthogonal input/output directions).
    • $U$, $V$ capture “directions” (row/column subspaces).
  • $U V^\top$ is the “shape / rotation / directional” part, ignoring scaling. It’s an orthogonal matrix (or semi-orthogonal if $A$ isn’t square or full rank).
  • The expressions $A (A^\top A)^{-1/2}$ and $(A A^\top)^{-1/2} A$ are implementations that “divide out” the scaling imposed by $\Sigma$, by applying the inverse square root of the symmetric positive definite (or semidefinite) product $A^\top A$ or $A A^\top$.
  • In the context of the Muon optimizer (and adaptive variants such as AdaMuon):
    • $M$ holds the momentum (matrix).
    • The polar / orthogonal factor $U V^\top$ removes the scaling / magnitude part of $M$, leaving the structured directional information.
    • This helps ensure update steps follow coherent geometry, not dominated by large singular values (which might correspond to disproportionate scaling along some directions).

Why β€œboth” forms are valid / when to use which

  • $A (A^\top A)^{-1/2}$ might be more convenient when $n$ is small (so $A^\top A$ is a small $n \times n$ matrix) – computing its inverse square root is cheaper.
  • $(A A^\top)^{-1/2} A$ might be better when $m$ is smaller.
  • Both give the same $O = U V^\top$ in exact arithmetic; numerical approximations (like the Newton–Schulz iteration) approximate one of these forms more directly.
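Because forming the inverse square root exactly is expensive, iterative schemes are used in practice. Below is a minimal sketch of a cubic Newton–Schulz orthogonalization in NumPy; note that Muon itself uses a tuned higher-order polynomial iteration, so treat this as illustrative only (function name and parameters are my own):

```python
import numpy as np

def newton_schulz_orthogonalize(A, steps=50):
    """Approximate the orthogonal polar factor U V^T without explicitly
    forming an inverse matrix square root. Cubic iteration:
        X <- 0.5 * X (3 I - X^T X),
    which drives every singular value of X toward 1 while preserving
    the singular vectors, provided the singular values start in (0, 1]."""
    X = A / np.linalg.norm(A)  # Frobenius norm: singular values now in (0, 1]
    I = np.eye(A.shape[1])
    for _ in range(steps):
        X = 0.5 * X @ (3.0 * I - X.T @ X)
    return X

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
U, _, Vt = np.linalg.svd(A)
assert np.allclose(newton_schulz_orthogonalize(A), U @ Vt, atol=1e-6)
```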

Example: 2x2 Matrix

Let’s work through a concrete example with a $2 \times 2$ matrix. Let $A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$.

Step 1: Compute $A^\top A$ and $A A^\top$

$$A^\top A = A A^\top = \begin{pmatrix} 5 & 4 \\ 4 & 5 \end{pmatrix}$$

(since $A$ is symmetric, both products equal $A^2$).

Step 2: Compute $(A^\top A)^{1/2}$ and $(A^\top A)^{-1/2}$

First, find the eigenvalues and eigenvectors of $A^\top A$ (which has the same eigenvectors as $A$ in this symmetric case). The eigenvalues satisfy

$$\det(A^\top A - \lambda I) = (5 - \lambda)^2 - 16 = 0.$$

So $\lambda_1 = 9$, $\lambda_2 = 1$.

Eigenvector for $\lambda_1 = 9$: $(1, 1)^\top$ (normalized: $\tfrac{1}{\sqrt{2}}(1, 1)^\top$). Eigenvector for $\lambda_2 = 1$: $(1, -1)^\top$ (normalized: $\tfrac{1}{\sqrt{2}}(1, -1)^\top$).

Thus, the eigendecomposition is $A^\top A = Q \Lambda Q^\top$, where $Q = \tfrac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$ and $\Lambda = \begin{pmatrix} 9 & 0 \\ 0 & 1 \end{pmatrix}$.

Then

$$(A^\top A)^{1/2} = Q \Lambda^{1/2} Q^\top = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} = A, \qquad (A^\top A)^{-1/2} = Q \Lambda^{-1/2} Q^\top = \frac{1}{3}\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}.$$

Since $A A^\top = A^\top A$ in this case, $(A A^\top)^{-1/2}$ is the same matrix.

Step 3: Compute $O$ using both forms

First form:

$$A (A^\top A)^{-1/2} = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \cdot \frac{1}{3}\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$

Second form:

$$(A A^\top)^{-1/2} A = \frac{1}{3}\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix} \cdot \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$

Both forms yield the same orthogonal matrix $O = I$, the identity matrix. This makes sense because the original matrix $A$ is symmetric and positive definite, so its polar decomposition is $A = O P$ with $O = I$ and $P = A$ itself. The operations successfully removed the “scaling” (the eigenvalues 3 and 1 of $A$) to leave the pure “rotation” (which is the identity in this case).
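The worked example can be replayed numerically with a few lines of NumPy (a sketch mirroring the steps above):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Inverse square root of A^T A via its eigendecomposition
w, Q = np.linalg.eigh(A.T @ A)           # eigenvalues 1 and 9
inv_sqrt = Q @ np.diag(w ** -0.5) @ Q.T  # (A^T A)^{-1/2} = (1/3) [[2,-1],[-1,2]]

O1 = A @ inv_sqrt   # first form:  A (A^T A)^{-1/2}
O2 = inv_sqrt @ A   # second form: (A A^T)^{-1/2} A, since A A^T = A^T A here

assert np.allclose(O1, np.eye(2))  # both give the identity
assert np.allclose(O2, np.eye(2))
```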