Let's unpack the identity $A (A^\top A)^{-1/2} = (A A^\top)^{-1/2} A = U V^\top$ and understand why it holds, how it comes from the polar decomposition / SVD, and what the pieces mean. I'll start with some linear algebra background, then show the derivation, then explain the intuitive meaning.
Background: Polar decomposition & SVD
- Singular Value Decomposition (SVD)
- For any real matrix $A \in \mathbb{R}^{m \times n}$, there is a factorization
$$A = U \Sigma V^\top$$
where
- $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$ have orthonormal columns (with $r = \min(m, n)$).
- $\Sigma \in \mathbb{R}^{r \times r}$ is diagonal with nonnegative singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r \ge 0$, sorted decreasingly.
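To make the background concrete, here is a small NumPy sketch of these SVD properties (the random test matrix and variable names `U`, `s`, `Vt` are illustrative, not from the original):

```python
import numpy as np

# A random tall test matrix A (m x n)
rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, n))

# full_matrices=False gives the "economy" SVD: U is m x n, Vt is n x n here
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Columns of U and rows of Vt are orthonormal; s is nonnegative and sorted decreasingly
assert np.allclose(U.T @ U, np.eye(n))
assert np.allclose(Vt @ Vt.T, np.eye(n))
assert np.all(s[:-1] >= s[1:]) and np.all(s >= 0)

# Reconstruction: A = U diag(s) V^T
assert np.allclose(A, U @ np.diag(s) @ Vt)
```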
- Polar decomposition
- Polar decomposition says that any (real) matrix can be written as $A = O P$, where:
- $O$ is an orthogonal (or semi-orthogonal) matrix (if $A$ isn't square / full rank, some relaxed version applies).
- $P$ is a symmetric positive definite (SPD) matrix (positive semidefinite if $A$ is rank deficient).
- One canonical construction is: let $P = (A^\top A)^{1/2}$.
- Then define $O = A P^{-1} = A (A^\top A)^{-1/2}$ (assuming $P$ is invertible, or using the pseudo-inverse if rank deficient).
- Alternatively, $O = A (A^\top A)^{-1/2}$ and $O = (A A^\top)^{-1/2} A$ are equivalent forms (they both give the same $O$), where the inverse or inverse square root is suitably defined.
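The canonical construction can be sketched numerically. This is a minimal illustration assuming NumPy and a square, full-rank matrix, with the matrix square root / inverse square root computed through an eigendecomposition (the helper `spd_power` is my own name, not a library function):

```python
import numpy as np

def spd_power(S, p):
    """Raise a symmetric positive definite matrix to a real power via eigendecomposition."""
    w, E = np.linalg.eigh(S)
    return E @ np.diag(w ** p) @ E.T

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))        # square, almost surely full rank

P = spd_power(A.T @ A, 0.5)            # P = (A^T A)^{1/2}, the SPD factor
O1 = A @ spd_power(A.T @ A, -0.5)      # O = A (A^T A)^{-1/2}
O2 = spd_power(A @ A.T, -0.5) @ A      # O = (A A^T)^{-1/2} A

assert np.allclose(O1, O2)                 # both forms agree
assert np.allclose(O1.T @ O1, np.eye(4))   # O is orthogonal
assert np.allclose(O1 @ P, A)              # A = O P recovers the polar decomposition
```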
Derivation of the identity
Starting from the SVD $A = U \Sigma V^\top$:
- $U$ is $m \times r$, $V$ is $n \times r$.
- $\Sigma$ is $r \times r$, diagonal, with positive singular values $\sigma_i$.
Compute
$$A^\top A = V \Sigma U^\top U \Sigma V^\top = V \Sigma^2 V^\top$$
because $U^\top U = I$. So
$$(A^\top A)^{-1/2} = V \Sigma^{-1} V^\top.$$
Then
$$A (A^\top A)^{-1/2} = U \Sigma V^\top \, V \Sigma^{-1} V^\top = U \Sigma \Sigma^{-1} V^\top = U V^\top$$
- since $V$ has orthonormal columns ($V^\top V = I$).
- So $A (A^\top A)^{-1/2} = U V^\top$.
Similarly, starting with
$$A A^\top = U \Sigma V^\top V \Sigma U^\top = U \Sigma^2 U^\top,$$
so $(A A^\top)^{-1/2} = U \Sigma^{-1} U^\top$. Then
$$(A A^\top)^{-1/2} A = U \Sigma^{-1} U^\top \, U \Sigma V^\top = U V^\top$$
since $U^\top U = I$. Thus both expressions produce
$$O = U V^\top,$$
which is the orthogonal polar factor.
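The derivation can be checked numerically. This NumPy sketch assumes a tall matrix of full column rank, so the small Gram matrix $A^\top A$ is invertible:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))        # tall matrix; full column rank almost surely

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: U is 6x3, Vt is 3x3

# (A^T A)^{-1/2} via eigendecomposition of the small 3x3 Gram matrix
w, E = np.linalg.eigh(A.T @ A)
inv_sqrt = E @ np.diag(1.0 / np.sqrt(w)) @ E.T

# A (A^T A)^{-1/2} collapses to U V^T, exactly as in the derivation
assert np.allclose(A @ inv_sqrt, U @ Vt)
```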
Summary: so the identity holds because of SVD properties
- Both
$$A (A^\top A)^{-1/2}$$
and
$$(A A^\top)^{-1/2} A$$
equal the same canonical "orthogonal part" $U V^\top$ in the SVD $A = U \Sigma V^\top$.
- The square root / inverse square root comes from extracting the singular values: $A^\top A$ has eigenvalues $\sigma_i^2$, so $(A^\top A)^{-1/2}$ has eigenvalues $\sigma_i^{-1}$. Multiplying $A$ by that removes the magnitudes (singular values), leaving the pure directions $U V^\top$.
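The eigenvalue bookkeeping can also be verified directly (a NumPy sketch; note `eigvalsh` returns eigenvalues in ascending order while `svd` returns singular values in descending order, hence the sorting):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))

s = np.linalg.svd(A, compute_uv=False)   # singular values sigma_i, descending
w, E = np.linalg.eigh(A.T @ A)           # eigenvalues of A^T A, ascending

# A^T A has eigenvalues sigma_i^2 ...
assert np.allclose(np.sort(w), np.sort(s ** 2))

# ... and (A^T A)^{-1/2} has eigenvalues 1 / sigma_i
S_inv_sqrt = E @ np.diag(1.0 / np.sqrt(w)) @ E.T
assert np.allclose(np.linalg.eigvalsh(S_inv_sqrt), np.sort(1.0 / s))
```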
Intuition: what it means
- Decompose $A$ into "magnitude" + "direction":
- $\Sigma$ = magnitudes / singular values (how much scaling along different orthogonal input/output directions).
- $U$, $V$ capture "directions" (column/row subspaces).
- $U V^\top$ is like the "shape / rotation / directional" part, ignoring scaling. It's an orthogonal matrix (or semi-orthogonal if $A$ isn't square or full rank).
- The expressions $A (A^\top A)^{-1/2}$ or $(A A^\top)^{-1/2} A$ are implementations that "divide out" the scaling imposed by $\Sigma$, by applying the inverse square root of whichever product ($A^\top A$ or $A A^\top$) is symmetric positive definite (or semidefinite).
- In the context of the Muon optimizer (and follow-ups such as AdaMuon, the adaptive Muon optimizer):
- $M$ holds the momentum (a matrix).
- The polar / orthogonal factor removes the scaling / magnitude part of $M$, leaving the structured directional information.
- This helps ensure update steps follow coherent geometry and are not dominated by large singular values (which might correspond to disproportionate scaling along some directions).
Why "both" forms are valid / when to use which
- $A (A^\top A)^{-1/2}$ might be more convenient when $n$ is small (so $A^\top A$ is a small $n \times n$ matrix): computing its inverse square root is cheaper.
- $(A A^\top)^{-1/2} A$ might be better when $m$ is smaller.
- Both give the same $U V^\top$ in exact arithmetic; numerical approximations (like Newton–Schulz iteration) approximate one of these forms more directly.
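As a sketch of the Newton–Schulz idea: Muon's implementation uses a tuned quintic polynomial iteration, but the classic cubic iteration $X \leftarrow 1.5 X - 0.5\, X X^\top X$ already converges to $U V^\top$ once the singular values are normalized into $(0, 1]$. This is a minimal illustration, not Muon's actual code:

```python
import numpy as np

def newton_schulz_orthogonalize(A, steps=50):
    """Cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.

    After normalizing so all singular values lie in (0, 1], each step pushes
    every singular value toward 1 while leaving U and V untouched, so X
    converges to the orthogonal polar factor U V^T.
    """
    X = A / np.linalg.norm(A)          # Frobenius norm bound: now all sigma_i <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4))
U, _, Vt = np.linalg.svd(A)
X = newton_schulz_orthogonalize(A)
assert np.allclose(X, U @ Vt, atol=1e-6)
```

The appeal of this route is that it uses only matrix multiplications (no SVD, inverse, or eigendecomposition), which maps well onto accelerators.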
Example: 2x2 Matrix
Let's work through a concrete example with a $2 \times 2$ matrix. Let $A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$.
Step 1: Compute $A^\top A$ and $A A^\top$
$$A^\top A = A A^\top = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} 5 & 4 \\ 4 & 5 \end{pmatrix}$$
(the two products coincide because $A$ is symmetric).
Step 2: Compute $(A^\top A)^{-1/2}$ and $(A A^\top)^{-1/2}$. First, find the eigenvalues and eigenvectors of $A^\top A$ (which is the same as $A A^\top$ in this symmetric case). The eigenvalues satisfy $\det(A^\top A - \lambda I) = (5 - \lambda)^2 - 16 = 0$. So $\lambda_1 = 9$, $\lambda_2 = 1$.
Eigenvector for $\lambda_1 = 9$: $(1, 1)^\top$ (normalized: $\tfrac{1}{\sqrt{2}}(1, 1)^\top$). Eigenvector for $\lambda_2 = 1$: $(1, -1)^\top$ (normalized: $\tfrac{1}{\sqrt{2}}(1, -1)^\top$).
Thus, the eigendecomposition is $A^\top A = Q \Lambda Q^\top$, where $Q = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$ and $\Lambda = \begin{pmatrix} 9 & 0 \\ 0 & 1 \end{pmatrix}$.
Then
$$(A^\top A)^{-1/2} = Q \Lambda^{-1/2} Q^\top = Q \begin{pmatrix} 1/3 & 0 \\ 0 & 1 \end{pmatrix} Q^\top = \begin{pmatrix} 2/3 & -1/3 \\ -1/3 & 2/3 \end{pmatrix}.$$
Since $A^\top A = A A^\top$ in this case, $(A A^\top)^{-1/2}$ is the same matrix.
Step 3: Compute $O$ using both forms
First form:
$$A (A^\top A)^{-1/2} = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \begin{pmatrix} 2/3 & -1/3 \\ -1/3 & 2/3 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$
Second form:
$$(A A^\top)^{-1/2} A = \begin{pmatrix} 2/3 & -1/3 \\ -1/3 & 2/3 \end{pmatrix} \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$
Both forms yield the same orthogonal matrix $O = I$, the identity. This makes sense because the original matrix $A$ is symmetric and positive definite, so its polar decomposition is $A = O P$ with $O = I$ and $P = A$ itself. The operations successfully removed the "scaling" (the eigenvalues 3 and 1 of $A$) to leave the pure "rotation" (which is the identity in this case).
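The whole example can be checked in a few lines. This NumPy sketch assumes the symmetric matrix $A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$ with eigenvalues 3 and 1:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# (A^T A)^{-1/2} via eigendecomposition of the Gram matrix [[5, 4], [4, 5]]
w, Q = np.linalg.eigh(A.T @ A)                  # eigenvalues 1 and 9
inv_sqrt = Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T  # [[2/3, -1/3], [-1/3, 2/3]]

O1 = A @ inv_sqrt      # first form:  A (A^T A)^{-1/2}
O2 = inv_sqrt @ A      # second form: (A A^T)^{-1/2} A, since A A^T = A^T A here

# Both forms give the identity: A is SPD, so its orthogonal polar factor is I
assert np.allclose(O1, np.eye(2))
assert np.allclose(O2, np.eye(2))
```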