Science & Space

How Word2vec Learns Representations: A Step-by-Step Guide to Understanding Its Internal Dynamics

Posted by u/Merekku · 2026-05-03 04:45:14

Introduction

Word2vec is a foundational algorithm for generating dense word embeddings, but understanding exactly how it learns these representations has been elusive for years. Recent research reveals that word2vec, when trained from a small initialization, learns in discrete, rank-incrementing steps—each step adding a new orthogonal concept to the embedding space. This process is equivalent to solving an unweighted least-squares matrix factorization, and the final embeddings are given by a principal component analysis (PCA). In this guide, we'll walk through the key insights that explain what word2vec learns and how it does so, step by step.

[Image. Source: bair.berkeley.edu]

What You Need

  • Basic understanding of neural networks, especially linear layers and gradient descent.
  • Familiarity with word embeddings and the skip-gram or CBOW architectures.
  • A text corpus (e.g., a small sample of Wikipedia) for experimental confirmation (optional).
  • Python environment with PyTorch or TensorFlow (if you wish to replicate the dynamics).
  • Access to the original paper referenced in the article for deeper mathematical details.

Step 1: Frame Word2vec as a Minimal Neural Language Model

Word2vec trains a shallow two-layer network with a self-supervised objective: the input is a one-hot encoding of a word, and the output predicts a nearby context word, either through a softmax over the vocabulary or through contrastive negative sampling. Apart from that output nonlinearity, the model is entirely linear, which makes it an ideal testing ground for understanding representation learning in language tasks.

During training, the network iterates through a corpus, adjusting the embedding vectors (the weights of the first layer) and the output weights to maximize the probability of observed word pairs. The embeddings end up in a low-dimensional space where semantic relationships are captured by vector angles and distances.
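
To make Step 1 concrete, here is a minimal sketch of the skip-gram variant as a two-layer network in PyTorch. It assumes the full-softmax formulation rather than negative sampling, and names like SkipGram, vocab_size, and embed_dim are illustrative choices, not fixed by the original implementation.

    import torch
    import torch.nn as nn

    class SkipGram(nn.Module):
        def __init__(self, vocab_size: int, embed_dim: int):
            super().__init__()
            # First layer: the embedding matrix (one-hot word -> dense vector).
            self.embed = nn.Embedding(vocab_size, embed_dim)
            # Second layer: output weights (dense vector -> vocabulary logits).
            self.out = nn.Linear(embed_dim, vocab_size, bias=False)

        def forward(self, center_ids: torch.Tensor) -> torch.Tensor:
            # No nonlinearity between the layers; the only nonlinearity is
            # the softmax applied inside the loss below.
            return self.out(self.embed(center_ids))

    model = SkipGram(vocab_size=10_000, embed_dim=64)
    loss_fn = nn.CrossEntropyLoss()                 # softmax over the vocabulary
    optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

    center = torch.tensor([42, 7])                  # toy (center, context) pairs
    context = torch.tensor([17, 99])
    loss = loss_fn(model(center), context)          # maximize p(context | center)
    loss.backward()
    optimizer.step()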

Step 2: Set Up the Learning Problem with Small Initialization

A critical detail is that the embeddings are initialized very close to the origin, so the weight matrices start out at what is effectively rank zero. This tiny initialization forces the network to learn concepts one at a time rather than through a random interplay of many directions at once. Under this condition, the gradient flow dynamics become tractable.

The key approximation is that the loss function can be simplified to an unweighted least-squares matrix factorization problem. The input-output co-occurrence matrix (or a shifted version of it) is factorized into two low-rank matrices: the embedding matrix and the output weight matrix.
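
To see this simplified learning problem in action, here is a sketch of the unweighted least-squares factorization trained by plain gradient descent from a tiny initialization. The target matrix M is a random low-rank stand-in for the (shifted) co-occurrence statistics, and all sizes, learning rates, and variable names are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    V, d, true_rank = 50, 8, 4                      # vocab size, embed dim, planted rank
    # Stand-in for the (shifted) co-occurrence matrix: a planted rank-4 target.
    M = rng.normal(size=(V, true_rank)) @ rng.normal(size=(true_rank, V))

    eps = 1e-4                                      # "small initialization"
    W_in = eps * rng.normal(size=(V, d))            # embedding matrix
    W_out = eps * rng.normal(size=(V, d))           # output weight matrix

    lr, steps = 1e-3, 20_000
    snapshots = []                                  # products saved for later analysis
    for t in range(steps):
        R = W_out @ W_in.T - M                      # residual of the factorization
        grad_out = R @ W_in                         # gradient of 0.5 * ||R||_F^2 w.r.t. W_out
        grad_in = R.T @ W_out                       # gradient of 0.5 * ||R||_F^2 w.r.t. W_in
        W_out -= lr * grad_out
        W_in -= lr * grad_in
        if t % 500 == 0:
            snapshots.append(W_out @ W_in.T)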

Step 3: Observe the Discrete, Sequential Learning Steps

If you monitor the weight matrix during training, you'll see that the rank increases in discrete jumps. At each jump, a new orthogonal subspace is added to the embedding space, and the loss drops sharply with each such rank increment. This is illustrated in the original paper's figure: the left panel shows the loss decreasing in a stepwise fashion, and the right panel shows the latent space expanding one dimension at a time.

Each step corresponds to learning a new “concept” (orthogonal direction) that helps explain the co-occurrence statistics. The order of these concepts is determined by the singular values of the underlying matrix—the largest singular value (the strongest statistical pattern) is learned first, followed by the next, and so on.
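
A quick way to watch these jumps, continuing the factorization sketch above (and reusing its M and snapshots, which are assumptions of that sketch rather than part of word2vec itself), is to track an effective rank of the learned product over time:

    import numpy as np

    # Reference scale: the largest singular value of the target matrix M.
    scale = np.linalg.svd(M, compute_uv=False)[0]

    for i, P in enumerate(snapshots):
        svals = np.linalg.svd(P, compute_uv=False)
        effective_rank = int(np.sum(svals > 0.01 * scale))   # threshold is arbitrary
        print(f"snapshot {i:3d}: effective rank ~ {effective_rank}, "
              f"top singular values {np.round(svals[:6], 2)}")

On the toy problem above, the printed rank typically climbs one unit at a time, mirroring the stepwise loss drops.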

Step 4: Recognize That Each Step Learns a Linear Subspace

Interesting empirical properties emerge: the learned linear subspaces often encode interpretable concepts like gender, verb tense, or dialect. This is the linear representation hypothesis—the idea that high-level semantic features align with linear directions in the embedding space. In word2vec, these linear directions directly support analogies via vector arithmetic (e.g., “man” – “woman” ≈ “king” – “queen”).
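
As a concrete illustration of the vector-arithmetic claim, here is a small sketch of analogy completion. It assumes a hypothetical dictionary vectors mapping words to unit-normalized NumPy arrays; neither the dictionary nor the function name comes from word2vec itself.

    import numpy as np

    def analogy(a: str, b: str, c: str, vectors: dict, topn: int = 1):
        """Return words d such that a - b is roughly c - d, i.e. d is near c - a + b."""
        target = vectors[c] - vectors[a] + vectors[b]
        target = target / np.linalg.norm(target)
        # Cosine similarity (vectors are assumed unit-normalized).
        scores = {w: float(v @ target)
                  for w, v in vectors.items()
                  if w not in (a, b, c)}            # exclude the query words
        return sorted(scores, key=scores.get, reverse=True)[:topn]

    # Expected behavior: analogy("man", "woman", "king", vectors) should rank
    # "queen" highly when the gender direction is encoded linearly.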

[Image. Source: bair.berkeley.edu]

The sequential learning process implies that the first dimension encodes the most dominant regularities, and later dimensions encode finer distinctions. This explains why analogies often work with just a few top principal components.

Step 5: Connect the Final Learned Representations to PCA

The closed-form solution of the gradient flow dynamics shows that the final learned embeddings are simply the principal components of a certain matrix (derived from the co-occurrence counts). In other words, word2vec, under the small-initialization regime, performs a form of PCA on the word-context co-occurrence statistics.

This result is powerful because it provides a quantitative, predictive theory of what word2vec learns. The embeddings are no longer a black box: they are given by the top eigenvectors (scaled by the corresponding eigenvalues) of a data-dependent matrix.
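
The paper derives the exact matrix whose principal components give the embeddings; as a rough stand-in, the sketch below builds a positive PMI matrix from raw co-occurrence counts and takes its truncated SVD, a classic way to obtain PCA-style embeddings from the same statistics. Treat the choice of PPMI, the function name, and the scaling by singular values as assumptions rather than the paper's exact recipe.

    import numpy as np

    def pca_style_embeddings(counts: np.ndarray, dim: int) -> np.ndarray:
        """counts[i, j] = number of times word j appears in the context of word i."""
        total = counts.sum()
        p_word = counts.sum(axis=1, keepdims=True) / total
        p_ctx = counts.sum(axis=0, keepdims=True) / total
        p_joint = counts / total
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log(p_joint / (p_word * p_ctx))
        ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)   # positive PMI
        # Keep the `dim` strongest directions, scaled by their singular values:
        # this is the PCA-like step.
        U, S, _ = np.linalg.svd(ppmi)
        return U[:, :dim] * S[:dim]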

Step 6: Apply the Insights to Modern Language Models

Understanding word2vec as a minimal language model that learns concepts sequentially via PCA gives us intuition for more complex models. Modern large language models (LLMs) also exhibit linear representation properties, and their training dynamics may also follow rank-incrementing patterns, albeit more intricately. The linear representation hypothesis enables semantic inspection and model steering techniques—so grasping word2vec's learning dynamics is a prerequisite for those advanced methods.

Tips

  • Start small: Replicate the word2vec dynamics on a tiny synthetic dataset to see the sequential rank increments yourself.
  • Monitor singular values: During training, plot the singular values of the weight matrix and you'll see them pop up one by one (see the plotting sketch after this list).
  • Use the theoretical approximations: For deeper analysis, derive the gradient flow equations assuming small initialization and no regularization—the simplification still holds in practice.
  • Experiment with different initializations: Larger initializations blur the discrete steps and make the dynamics harder to interpret.
  • Apply to other embedding methods: The same matrix factorization viewpoint has been extended to other methods like GloVe—compare and contrast.
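
For the "monitor singular values" tip, a minimal plotting sketch is below. It reuses the snapshots list from the factorization sketch in Step 2 and assumes matplotlib is available; both are assumptions for illustration.

    import numpy as np
    import matplotlib.pyplot as plt

    # Top singular values of the learned product at each saved snapshot.
    svals_over_time = np.array(
        [np.linalg.svd(P, compute_uv=False)[:6] for P in snapshots]
    )
    for k in range(svals_over_time.shape[1]):
        plt.plot(svals_over_time[:, k], label=f"singular value {k + 1}")
    plt.xlabel("snapshot index")
    plt.ylabel("singular values of W_out @ W_in.T")
    plt.legend()
    plt.show()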

In conclusion, word2vec is not just a heuristic for embeddings; its learning process is mathematically elegant, reducing to PCA through a series of discrete concept-acquisition steps. This understanding demystifies one of the most influential models in NLP and opens the door to designing better interpretable representations.