§2 Background — Feature Circuits in LLMs

Features

Features are the representational primitives of a neural network: the smallest units of representation, which cannot be decomposed into smaller parts. In image recognition, an example is a feature that activates in the presence of a cat. In language, a feature may activate for a specific entity, such as the Golden Gate Bridge, regardless of the language in which it is mentioned [Templeton et al., 2024].

Ideally, each neuron in an LLM would correspond to exactly one such feature—a monosemantic neuron. This is often the case in the visual domain, where neurons specializing in edges, textures, or objects are frequently observed. Language models, however, behave differently.

Polysemanticity & Superposition

In practice, features rarely map onto individual neurons in this way. Large language models instead exhibit polysemanticity, in which multiple unrelated features are packed into a single neuron. For example, the same neuron may fire for a celebrity name and also for a certain food.

Elhage et al. [2022] use a toy model to show how such polysemantic neurons arise through a phenomenon called superposition. They identify two key factors behind it: the sparsity of the input features and the ReLU activation. The toy model takes a high-dimensional input vector \(\mathbf{x} \in \mathbb{R}^n\), projects it into a low-dimensional representation \(\mathbf{h} = W\mathbf{x} \in \mathbb{R}^m\) using \(W \in \mathbb{R}^{m \times n}\), and then reconstructs the input from \(\mathbf{h}\) using \(W^T\) followed by a ReLU. The sparsity of the input features is varied from dense to sparse, and the columns of the encoder matrix \(W\) are visualized.
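The forward pass described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' code: the bias term `b`, the uniform distribution of active feature values, the random initialization of `W`, and the function names are assumptions made for this example, and no training is performed.

```python
import numpy as np

def sample_sparse_inputs(rng, batch, n, sparsity):
    """Each feature is zero with probability `sparsity` (assumed sampling scheme),
    otherwise drawn uniformly from [0, 1)."""
    values = rng.uniform(0.0, 1.0, size=(batch, n))
    active = rng.uniform(0.0, 1.0, size=(batch, n)) >= sparsity
    return values * active

def toy_forward(W, b, x):
    """Compress x (batch, n) to h (batch, m) with W, then reconstruct with W^T and ReLU."""
    h = x @ W.T                          # h = W x for every row of the batch
    x_hat = np.maximum(h @ W + b, 0.0)   # x_hat = ReLU(W^T h + b)
    return h, x_hat

rng = np.random.default_rng(0)
n, m = 5, 2                              # five input features, two hidden dimensions
W = rng.normal(scale=0.5, size=(m, n))   # untrained weights, for shape checking only
b = np.zeros(n)                          # bias is an assumption of this sketch

x = sample_sparse_inputs(rng, batch=1000, n=n, sparsity=0.9)
h, x_hat = toy_forward(W, b, x)
print(h.shape, x_hat.shape)
```

Training `W` and `b` to minimize the reconstruction error at different sparsity levels is what would produce the two regimes discussed in this section; the sketch only fixes the shapes and the order of operations.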

Superposition in a 2D toy model — Experiment visualization from [Elhage et al., 2022] demonstrating the emergence of superposition with increased sparsity in a simple 2D example. Left: dense inputs produce two orthogonal, interpretable directions. Right: with 90% sparsity, all five features are packed into two dimensions and interfere.

The 2D Case

The 2-dimensional case provides a particularly clean visualization. Here, the hidden state has two dimensions (\(m = 2\)) while the input has five features (\(n = 5\)), so \(W\) is a \(2 \times 5\) matrix. Each column of \(W\) is a 2-dimensional vector giving the direction in which one input feature is written into the hidden state, so the five columns can be plotted directly.

In the dense input case (left), two vectors correspond to the two most important input features (i.e., those that cause the most loss when omitted), while the others are mapped to zero. The network prioritizes representing these two features and discards the rest, behaving essentially like PCA. The discarded features have no connection to the hidden dimensions. Although the two orthogonal directions need not be aligned with the neuron axes, we can interpret the signal by applying a rotation that aligns them with the neurons. This yields interpretable neurons, each of which fires only in the presence of a specific input feature.
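The rotation argument can be checked numerically. The sketch below builds a hypothetical dense-case `W` whose two nonzero columns are orthonormal but tilted by an arbitrary 30 degrees (both the angle and the unit length are assumptions chosen for illustration), then rotates the hidden space so each represented feature lands on its own neuron.

```python
import numpy as np

theta = np.deg2rad(30.0)                        # arbitrary tilt, chosen for illustration
d1 = np.array([np.cos(theta), np.sin(theta)])   # direction of the first kept feature
d2 = np.array([-np.sin(theta), np.cos(theta)])  # orthogonal direction of the second

W = np.zeros((2, 5))                            # the three discarded features stay at zero
W[:, 0] = d1
W[:, 1] = d2

# For orthonormal columns, the inverse rotation is simply the transpose.
R = W[:, :2].T
W_rot = R @ W

print(np.round(W_rot, 6))
# First two columns become the identity: one neuron per represented feature.
```

After the rotation, reading neuron 0 or neuron 1 directly tells us whether the corresponding feature is present, which is what makes the dense case interpretable.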

However, with 90% sparsity (meaning each feature is present only 10% of the time), all five input features are represented in the 2-dimensional hidden state. Five vectors cannot be mutually orthogonal in two dimensions, so the features now interfere with each other: the presence of one feature also produces a nonzero readout for others. In this case, interpreting the signal is not trivial, as each neuron fires for multiple features.
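Interference can be made concrete by looking at the Gram matrix \(W^T W\), whose off-diagonal entries measure how much one feature leaks into the readout of another. The sketch below assumes an idealized arrangement in which the five feature directions are unit vectors spaced evenly around a circle; a trained model need not match this exactly.

```python
import numpy as np

# Assumed idealized layout: five unit vectors spaced 72 degrees apart.
angles = 2.0 * np.pi * np.arange(5) / 5
W = np.stack([np.cos(angles), np.sin(angles)])  # shape (2, 5)

G = W.T @ W              # G[i, j] = cosine of the angle between features i and j

x = np.zeros(5)
x[0] = 1.0               # only feature 0 is active
readout = G @ x          # linear readout W^T W x, before any ReLU

print(np.round(readout, 3))
# approximately [ 1.     0.309 -0.809 -0.809  0.309]
```

Feature 0 is recovered with value 1, but every other feature also receives a nonzero readout even though it is absent from the input. A rotation cannot remove this, since it leaves the angles between the columns unchanged.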

Tangled Networks & Noisily Simulated Disentanglement

Consequently, the inner workings of the model cannot be understood simply by reading the hidden state. Networks that exhibit emergent superposition are called tangled networks and are said to noisily simulate larger disentangled networks.

Without the ReLU activation, superposition is not observed regardless of input sparsity. With ReLU, the model can zero out negative interference in the reconstruction, which allows superposition to emerge. Input sparsity and the non-linear activation are therefore the primary drivers behind superposition.
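A minimal example of how ReLU removes negative interference, assuming a hypothetical one-dimensional hidden state that stores two features in opposite directions (this specific weight matrix is chosen for illustration and is not taken from the figure):

```python
import numpy as np

# Two features share a single hidden dimension, pointing in opposite directions.
W = np.array([[1.0, -1.0]])            # shape (m=1, n=2)

x = np.array([1.0, 0.0])               # only feature 0 is active

linear = W.T @ (W @ x)                 # reconstruction without a non-linearity
with_relu = np.maximum(linear, 0.0)    # reconstruction with ReLU

print(linear)      # [ 1. -1.]  feature 1 receives negative interference
print(with_relu)   # [1. 0.]    ReLU clips it, recovering x exactly
```

The clipping only helps when the two features rarely co-occur: if both were active at once, their contributions to the shared hidden dimension would cancel. This is why sparsity and the non-linearity work together.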
