Let’s unpack the identity $A (A^\top A)^{-1/2} = (A A^\top)^{-1/2} A = U V^\top$ and understand why it holds, how it comes from the polar decomposition / SVD, and what the pieces mean. I’ll start with some linear algebra background, then show the derivation, then explain the intuitive meaning.

Background: Polar decomposition & SVD

  1. Singular Value Decomposition (SVD)
  • For any real matrix $A \in \mathbb{R}^{m \times n}$, there is a factorization

$$A = U \Sigma V^\top,$$

where

  • $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$ have orthonormal columns (with $r = \operatorname{rank}(A)$).
  • $\Sigma \in \mathbb{R}^{r \times r}$ is diagonal with nonnegative singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r$, sorted decreasingly.
  2. Polar decomposition
  • Polar decomposition says that any (real) matrix $A$ can be written as

$$A = O P,$$

where $O$ has orthonormal columns and $P = (A^\top A)^{1/2}$ is symmetric positive semidefinite.

  • Then define

$$O = A (A^\top A)^{-1/2}$$

(assuming $A^\top A$ is invertible, or using the pseudo-inverse if $A$ is rank deficient).

  • Alternatively,

$$O = (A A^\top)^{-1/2} A.$$

These are equivalent forms (they both give the same $O$), where the inverse or inverse square root is suitably defined.
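As a quick sanity check of the decomposition itself, here is a small NumPy sketch (variable names and the random test matrix are my own, not from the text): it builds the polar factors from the SVD and verifies $A = OP$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))  # arbitrary full-rank example

# Polar decomposition via the SVD A = U S V^T:
#   O = U V^T (semi-orthogonal), P = V S V^T (symmetric PSD)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
O = U @ Vt
P = Vt.T @ np.diag(s) @ Vt

assert np.allclose(A, O @ P)            # A = O P
assert np.allclose(O.T @ O, np.eye(3))  # O has orthonormal columns
```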

Derivation of the identity

Starting from the SVD $A = U \Sigma V^\top$:

  • $U$ is $m \times r$, $V$ is $n \times r$.
  • $\Sigma$ is $r \times r$, diagonal, with positive singular values $\sigma_i$.

Compute

$$A^\top A = V \Sigma U^\top U \Sigma V^\top = V \Sigma^2 V^\top$$

because $U^\top U = I$. So

$$(A^\top A)^{-1/2} = V \Sigma^{-1} V^\top.$$

Then

$$A (A^\top A)^{-1/2} = U \Sigma V^\top \cdot V \Sigma^{-1} V^\top.$$

  • $V^\top V = I$ since $V$ has orthonormal columns.
  • So

$$A (A^\top A)^{-1/2} = U \Sigma \Sigma^{-1} V^\top = U V^\top.$$

Similarly, starting with

$$A A^\top = U \Sigma V^\top V \Sigma U^\top = U \Sigma^2 U^\top,$$

so $(A A^\top)^{-1/2} = U \Sigma^{-1} U^\top$. Then

$$(A A^\top)^{-1/2} A = U \Sigma^{-1} U^\top \cdot U \Sigma V^\top = U \Sigma^{-1} \Sigma V^\top = U V^\top,$$

since $U^\top U = I$. Thus both expressions produce

$$O = U V^\top,$$

which is the orthogonal polar factor.
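The derivation can be checked numerically. In this NumPy sketch (the helper `inv_sqrt_spd` and the random test matrix are mine, for illustration), both closed forms are compared against $U V^\top$ from the SVD:

```python
import numpy as np

def inv_sqrt_spd(M):
    """Inverse square root of a symmetric positive definite matrix,
    via its eigendecomposition M = Q diag(w) Q^T."""
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(w ** -0.5) @ Q.T

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))  # almost surely invertible

U, s, Vt = np.linalg.svd(A)
O_svd = U @ Vt                       # the orthogonal polar factor U V^T

O1 = A @ inv_sqrt_spd(A.T @ A)       # A (A^T A)^{-1/2}
O2 = inv_sqrt_spd(A @ A.T) @ A       # (A A^T)^{-1/2} A

assert np.allclose(O1, O_svd)
assert np.allclose(O2, O_svd)
```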

Summary: the identity holds because of SVD properties

  • Both

$$A (A^\top A)^{-1/2}$$

and

$$(A A^\top)^{-1/2} A$$

equal the same canonical “orthogonal part” $U V^\top$ in the SVD $A = U \Sigma V^\top$.

  • The square root / inverse square root comes from extracting the singular values: $A^\top A$ has eigenvalues $\sigma_i^2$, so $(A^\top A)^{1/2}$ has eigenvalues $\sigma_i$. Multiplying $A$ by $(A^\top A)^{-1/2}$ removes the magnitudes (singular values), leaving the pure directions $U V^\top$.
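This eigenvalue relationship is easy to verify in NumPy (an illustrative check with a random matrix, not from the original text):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))

s = np.linalg.svd(A, compute_uv=False)  # singular values, descending
w = np.linalg.eigvalsh(A.T @ A)[::-1]   # eigenvalues of A^T A, descending

assert np.allclose(w, s ** 2)  # eigenvalues of A^T A are sigma_i^2
```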

Intuition: what it means

  • Decompose $A$ into “magnitude” + “direction”:
    • $\Sigma$ = magnitudes / singular values (how much scaling along different orthogonal input/output directions).
    • $U$, $V$ capture “directions” (row/column subspaces).
  • $U V^\top$ is the “shape / rotation / directional” part, ignoring scaling. It’s an orthogonal matrix (or semi-orthogonal if $A$ isn’t square or full rank).
  • The expressions $A (A^\top A)^{-1/2}$ and $(A A^\top)^{-1/2} A$ are implementations that “divide out” the scaling imposed by $\Sigma$, by applying the inverse square root of the symmetric positive definite (or semidefinite) product $A^\top A$ or $A A^\top$.
  • In the context of the Muon optimizer (and adaptive variants such as AdaMuon):
    • $M$ holds the momentum (matrix).
    • The polar / orthogonal factor $U V^\top$ removes the scaling / magnitude part of $M$, leaving the structured directional information.
    • This helps ensure update steps follow coherent geometry, not dominated by large singular values (which might correspond to disproportionate scaling along some directions).

Why β€œboth” forms are valid / when to use which

  • $A (A^\top A)^{-1/2}$ might be more convenient when $n$ is small (so $A^\top A$ is a small $n \times n$ matrix) – computing its inverse square root is cheaper.
  • $(A A^\top)^{-1/2} A$ might be better when $m$ is smaller.
  • Both give the same $O = U V^\top$ in exact arithmetic; numerical approximations (like the Newton–Schulz iteration) approximate one of these forms more directly.
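Because forming the inverse square root exactly is expensive, iterative schemes are used in practice. Below is a minimal sketch of a cubic Newton–Schulz orthogonalization in NumPy; note that Muon itself uses a tuned higher-order polynomial iteration, so treat this as illustrative only (function name and parameters are my own):

```python
import numpy as np

def newton_schulz_orthogonalize(A, steps=50):
    """Approximate the orthogonal polar factor U V^T without explicitly
    forming an inverse matrix square root. Cubic iteration:
        X <- 0.5 * X (3 I - X^T X),
    which drives every singular value of X toward 1 while preserving
    the singular vectors, provided the singular values start in (0, 1]."""
    X = A / np.linalg.norm(A)  # Frobenius norm: singular values now in (0, 1]
    I = np.eye(A.shape[1])
    for _ in range(steps):
        X = 0.5 * X @ (3.0 * I - X.T @ X)
    return X

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
U, _, Vt = np.linalg.svd(A)
assert np.allclose(newton_schulz_orthogonalize(A), U @ Vt, atol=1e-6)
```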

Example: 2x2 Matrix

Let’s work through a concrete example with a $2 \times 2$ matrix. Let $A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$.

Step 1: Compute $A^\top A$ and $A A^\top$

$$A^\top A = A A^\top = \begin{pmatrix} 5 & 4 \\ 4 & 5 \end{pmatrix}$$

(since $A$ is symmetric, both products equal $A^2$).

Step 2: Compute $(A^\top A)^{1/2}$ and $(A^\top A)^{-1/2}$

First, find the eigenvalues and eigenvectors of $A^\top A$ (which has the same eigenvectors as $A$ in this symmetric case). The eigenvalues satisfy

$$\det(A^\top A - \lambda I) = (5 - \lambda)^2 - 16 = 0.$$

So $\lambda_1 = 9$, $\lambda_2 = 1$.

Eigenvector for $\lambda_1 = 9$: $(1, 1)^\top$ (normalized: $\tfrac{1}{\sqrt{2}}(1, 1)^\top$). Eigenvector for $\lambda_2 = 1$: $(1, -1)^\top$ (normalized: $\tfrac{1}{\sqrt{2}}(1, -1)^\top$).

Thus, the eigendecomposition is $A^\top A = Q \Lambda Q^\top$, where $Q = \tfrac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$ and $\Lambda = \begin{pmatrix} 9 & 0 \\ 0 & 1 \end{pmatrix}$.

Then

$$(A^\top A)^{1/2} = Q \Lambda^{1/2} Q^\top = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} = A, \qquad (A^\top A)^{-1/2} = Q \Lambda^{-1/2} Q^\top = \frac{1}{3}\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}.$$

Since $A A^\top = A^\top A$ in this case, $(A A^\top)^{-1/2}$ is the same matrix.

Step 3: Compute $O$ using both forms

First form:

$$A (A^\top A)^{-1/2} = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} \cdot \frac{1}{3}\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$

Second form:

$$(A A^\top)^{-1/2} A = \frac{1}{3}\begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix} \cdot \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$

Both forms yield the same orthogonal matrix $O = I$, the identity matrix. This makes sense because the original matrix $A$ is symmetric and positive definite, so its polar decomposition is $A = O P$ with $O = I$ and $P = A$ itself. The operations successfully removed the “scaling” (the eigenvalues 3 and 1 of $A$) to leave the pure “rotation” (which is the identity in this case).
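The worked example can be replayed numerically with a few lines of NumPy (a sketch mirroring the steps above):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Inverse square root of A^T A via its eigendecomposition
w, Q = np.linalg.eigh(A.T @ A)           # eigenvalues 1 and 9
inv_sqrt = Q @ np.diag(w ** -0.5) @ Q.T  # (A^T A)^{-1/2} = (1/3) [[2,-1],[-1,2]]

O1 = A @ inv_sqrt   # first form:  A (A^T A)^{-1/2}
O2 = inv_sqrt @ A   # second form: (A A^T)^{-1/2} A, since A A^T = A^T A here

assert np.allclose(O1, np.eye(2))  # both give the identity
assert np.allclose(O2, np.eye(2))
```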