
How Word2vec Learns Representations: A Step-by-Step Guide

2026-05-04 04:43:25

Introduction

Word2vec is a foundational algorithm for learning dense vector representations of words. But what exactly does it learn, and how? Recent research reveals that under realistic conditions, word2vec's learning process reduces to unweighted least-squares matrix factorization, with final representations given by PCA. This guide breaks down the learning dynamics into concrete, observable steps, from initialization to the emergence of linear subspaces that enable analogies like 'king - man + woman = queen'. Whether you're a researcher or practitioner, understanding these steps illuminates how modern language models build interpretable internal structures.

Source: bair.berkeley.edu


Step-by-Step Process

Step 1: Initialize Embeddings at the Origin

Start with all word embedding vectors randomly initialized very close to the origin. This makes them effectively zero-dimensional: a blank slate. From this tiny initialization, the learning process unfolds in discrete, rank-incrementing stages. The randomness of the initialization breaks symmetry between dimensions, while its small scale ensures the network expands its representational capacity gradually rather than all at once.
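A minimal sketch of such an initialization (the vocabulary size, dimension, and scale below are illustrative choices, not values from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 50

# Hypothetical scale: small enough to keep the embeddings near the origin,
# but nonzero so that gradient descent can break symmetry.
init_scale = 1e-4
W = rng.normal(0.0, init_scale, size=(vocab_size, dim))  # word embeddings
C = rng.normal(0.0, init_scale, size=(vocab_size, dim))  # context embeddings

# The embedding matrix starts effectively rank-zero: every singular value is tiny.
print(np.linalg.svd(W, compute_uv=False)[:3])
```

All singular values start far below any threshold you would use to count "learned" dimensions, which is what makes the later rank jumps observable.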

Step 2: Train with Contrastive Gradient Descent

Iterate through your text corpus using the word2vec objective (either skip‑gram with negative sampling or CBOW). For each target‑context pair, update the two‑layer linear network via self‑supervised gradient descent. The loss function drives the model to distinguish true word co‑occurrences from random noise. Under mild approximations, these updates can be mathematically reduced to unweighted least‑squares matrix factorization. This step sets the stage for a predictable learning trajectory.
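One skip-gram-with-negative-sampling update can be sketched as follows. This is a simplified illustration with hypothetical variable names; real implementations add frequency subsampling, learning-rate decay, and vectorized batching:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W, C, target, context, negatives, lr=0.025):
    """One SGNS update: pull the target embedding toward its true context
    vector and push it away from randomly sampled noise words."""
    w = W[target]
    # Positive pair: increase sigma(w . c_pos).
    c_pos = C[context]
    g = 1.0 - sigmoid(w @ c_pos)      # gradient coefficient for the true pair
    grad_w = g * c_pos
    C[context] = c_pos + lr * g * w
    # Negative pairs: decrease sigma(w . c_neg).
    for n in negatives:
        c_neg = C[n]
        g = -sigmoid(w @ c_neg)
        grad_w += g * c_neg
        C[n] = c_neg + lr * g * w
    W[target] = w + lr * grad_w
    return W, C
```

Each call nudges the inner product of a true target-context pair upward and those of noise pairs downward, which is the contrastive signal that, in aggregate, amounts to factorizing a co-occurrence matrix.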

Step 3: Observe Rank‑Incrementing Learning Steps

Monitor the singular values or eigenvalues of the embedding weight matrix during training. You'll see the rank increase in discrete jumps: the network adds one new orthogonal linear subspace (a “concept”) at a time. Each rank increment corresponds to a sharp drop in the loss. The embeddings collectively shift from zero dimensions to 1D, then 2D, etc., until model capacity is saturated. This sequence is deterministic and mirrors how PCA extracts principal components one by one.
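The stepwise rank growth can be reproduced on a toy matrix-factorization problem that stands in for word2vec's implicit objective. Everything below (the planted matrix, learning rate, and rank threshold) is a hypothetical construction for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def effective_rank(W, tol=1e-2):
    """Count singular values above tol: the number of subspaces learned so far."""
    return int(np.sum(np.linalg.svd(W, compute_uv=False) > tol))

# Toy "co-occurrence" matrix with two planted concepts of different strength.
M = np.zeros((5, 5))
M[0, 0] = 4.0   # strong concept: learned first
M[1, 1] = 1.0   # weak concept: learned later

W = rng.normal(0.0, 1e-4, size=(5, 2))   # tiny initialization near the origin
ranks = []
for _ in range(400):
    W -= 0.05 * (W @ W.T - M) @ W        # gradient of (1/4) ||W W^T - M||_F^2
    ranks.append(effective_rank(W))

print(sorted(set(ranks)))  # the rank passes through 0, 1, 2 in discrete jumps
```

The stronger planted component is picked up first, then the weaker one, mirroring how PCA would extract principal components in order of variance.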


Step 4: Map Concepts to Interpretable Subspaces

After each rank increase, the newly added subspace encodes a specific semantic or syntactic concept—e.g., gender, verb tense, or dialect. Because the learned directions are orthogonal, you can isolate them via linear algebra. For instance, the direction from “man” to “woman” emerges as a latent axis. This is the origin of the celebrated linear representation hypothesis: the network naturally organizes knowledge along interpretable axes. These axes enable the famous word‑vector analogies (e.g., “king” − “man” + “woman” ≈ “queen”).
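A toy illustration of analogy arithmetic, using hand-built two-dimensional embeddings (a made-up "royalty" axis and "gender" axis standing in for the orthogonal directions word2vec learns):

```python
import numpy as np

# Hypothetical toy embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
emb = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),   # distractor word
}

def analogy(a, b, c):
    """Return the word closest to emb[a] - emb[b] + emb[c], excluding the inputs."""
    target = emb[a] - emb[b] + emb[c]
    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    return min(candidates, key=lambda w: np.linalg.norm(candidates[w] - target))

print(analogy("king", "man", "woman"))  # -> queen
```

Because the two axes are orthogonal, subtracting "man" and adding "woman" flips only the gender coordinate while leaving the royalty coordinate intact, which is exactly why the analogy resolves to "queen".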

Step 5: Reach Final PCA Solution

Training converges when the gradient flow dynamics reach a closed‑form solution: the final word embeddings are exactly given by the principal components of a certain co‑occurrence matrix. In practice, this means the learned vectors are the top d eigenvectors (scaled by square roots of eigenvalues) of a pointwise mutual information matrix. The process is equivalent to performing PCA on the log‑counts of word co‑occurrences, denoised by the contrastive training. This final representation is both optimal and interpretable.
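The closed-form endpoint can be sketched directly: take a PMI-style matrix, keep its top d eigenpairs, and scale the eigenvectors by the square roots of the eigenvalues. The matrix below is random placeholder data, not real co-occurrence statistics:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical symmetric "PMI-like" matrix for a 6-word vocabulary.
A = rng.normal(size=(6, 6))
pmi = (A + A.T) / 2.0

# Closed-form embeddings: top-d eigenvectors scaled by sqrt of eigenvalues.
d = 2
eigvals, eigvecs = np.linalg.eigh(pmi)     # eigenvalues in ascending order
top = np.argsort(eigvals)[::-1][:d]        # indices of the d largest eigenvalues
emb = eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))

# emb @ emb.T is a rank-d reconstruction of the PMI matrix from its
# leading components, i.e. exactly what PCA retains.
print(emb.shape)  # (6, 2)
```

In this formulation, training the network and diagonalizing the matrix arrive at the same place; gradient descent is just a particular route to the PCA solution.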

