§6 Discussion — Feature Circuits in LLMs

Discussion

The surveyed works mark a shift in mechanistic interpretability: from analyzing polysemantic neurons and coarse-grained components (e.g., attention heads) toward constructing fine-grained sparse feature circuits. Sparse Autoencoders (SAEs) and Transcoders make this possible by decomposing polysemantic activations into sparse, interpretable features. Using these features, the works reviewed here recover human-understandable algorithms for induction and counting, and they reveal the mechanisms behind hallucinations, unfaithful reasoning, and hidden goals.

Feature circuits are not merely descriptive; they also support intervention. Marks et al. [2024] edit their network to improve its generalization by removing spurious features. Ameisen et al. [2025] and Ge et al. [2024] validate their graphs through steering and ablation experiments, demonstrating that the discovered circuits are causally responsible for the observed behaviors rather than merely correlated with them.
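The difference between a feature that is merely active and one that is causally relevant can be illustrated with a zero-ablation check. The following is a minimal sketch, not the procedure of any surveyed work: the linear `readout`, the activations, and the weights are all hypothetical stand-ins for a real model's downstream computation.

```python
def readout(features, weights):
    """Toy downstream computation: a linear readout of feature activations."""
    return sum(a * w for a, w in zip(features, weights))


def ablation_effects(features, weights):
    """Change in the output when each feature is zero-ablated in turn."""
    base = readout(features, weights)
    effects = []
    for i in range(len(features)):
        ablated = list(features)
        ablated[i] = 0.0  # zero-ablate feature i, leave the others untouched
        effects.append(base - readout(ablated, weights))
    return effects


# Hypothetical activations on one prompt and hypothetical feature-to-logit weights.
features = [2.0, 1.5, 0.0]
weights = [0.8, 0.0, -1.2]

effects = ablation_effects(features, weights)
for i, effect in enumerate(effects):
    print(f"feature {i}: activation={features[i]}, ablation effect={effect:.2f}")
```

In this toy setup feature 1 is strongly active yet ablating it leaves the output unchanged, while ablating feature 0 removes the entire output. Activation alone would rank both as important; only the intervention separates them.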

Limitations

Despite these advancements, important limitations remain:

Reconstruction error. Transcoders and SAEs only approximate the underlying model. Information carried by the residual reconstruction error is invisible to the circuit, so components of the true circuit may be biased or omitted.
Fixed attention scores. Some works [Ameisen et al., 2025][Dunefsky et al., 2024] treat the output of the Query-Key (QK) circuitry as fixed scores. This leaves an integral part of model behavior unexplained, namely how the routing of information is itself computed.
Softmax non-linearity. Methods that trace attention scores [Kamath et al., 2025][Ge et al., 2024][He et al., 2024] do not account for the non-linear softmax operation, so their bilinear attribution only approximates the true attention weight.
Circuit complexity and feature splitting. The resulting computational graphs are often large and difficult to interpret. A single concept frequently appears split across several features, so manual grouping and other simplifications are required to make the circuits understandable.
Locality. Most works linearize the model with respect to a specific prompt, so the circuits they discover are local to that input. Constructing global circuits remains highly challenging. Although Transcoders represent a step toward global circuitry by providing input-invariant terms in the edges, fully generalized circuit discovery remains unsolved.
Scalability. Circuit discovery is difficult to scale. Among the surveyed works, only Ameisen et al. [2025] and Kamath et al. [2025] analyze a production-scale model (Claude 3.5).
No unified benchmark. There is no established benchmark or evaluation protocol shared across works for comparing the fidelity and utility of their discovery techniques, which makes systematic progress tracking difficult.
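The softmax limitation above can be made concrete with a small numerical sketch. It assumes, purely for illustration, that each key's pre-softmax score has already been split into additive per-feature contributions; the `contributions` table and the helper names are hypothetical and not taken from any surveyed method.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(contributions, removed=()):
    """Attention weights over keys after removing the given feature indices."""
    scores = [
        sum(c for i, c in enumerate(row) if i not in removed)
        for row in contributions
    ]
    return softmax(scores)


# contributions[k][i]: hypothetical contribution of feature i to the
# pre-softmax score of key k (3 keys, 2 features).
contributions = [
    [2.0, 1.0],
    [0.5, 0.5],
    [0.0, -1.0],
]

# Pre-softmax scores decompose exactly: each score is the sum of its row.
scores = [sum(row) for row in contributions]
weights = attention_weights(contributions)

# Post-softmax weights do not: compare single-feature and joint removals.
drop_f0 = weights[0] - attention_weights(contributions, removed={0})[0]
drop_f1 = weights[0] - attention_weights(contributions, removed={1})[0]
joint_drop = weights[0] - attention_weights(contributions, removed={0, 1})[0]

print(f"sum of individual drops: {drop_f0 + drop_f1:.4f}")
print(f"joint drop:              {joint_drop:.4f}")
```

The per-feature contributions add up to each pre-softmax score exactly, but the change in attention weight from removing both features differs from the sum of the two individual changes. An attribution computed on scores therefore cannot be read directly as a share of the attention weight.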

Future Research

Future research must address several open challenges:

Reduce reconstruction error in replacement models. This calls for better SAE and Transcoder training objectives, architectures (e.g., end-to-end jointly trained replacements), and regularization strategies.
Address feature splitting. Develop methods that automatically group related features into coherent concepts, reducing the manual effort currently required to make circuits interpretable.
Explain attention patterns end-to-end. Incorporate the softmax non-linearity and full QK dynamics into circuit attribution, rather than treating scores as fixed constants or relying on bilinear approximations.
Scale to frontier models. Develop computationally efficient methods that can produce global, input-invariant circuits for models with hundreds of billions of parameters.
Establish benchmarks. Create standardized evaluation protocols for circuit fidelity (how faithfully does the circuit reproduce the full model's behavior?) and circuit utility (how actionable is the circuit for model editing, safety analysis, and similar uses?).
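As one possible ingredient of such a fidelity protocol, the reconstruction quality of a replacement model can be summarized by the fraction of variance unexplained (FVU), a commonly used normalized reconstruction error. The sketch below is an assumption about what a benchmark might report, not an established standard; the function name and the toy activations are hypothetical.

```python
def fraction_variance_unexplained(originals, reconstructions):
    """FVU = squared reconstruction error / total variance around the mean.

    Both arguments are lists of equal-length activation vectors.
    0.0 means perfect reconstruction; 1.0 matches always predicting the mean.
    """
    if len(originals) != len(reconstructions) or not originals:
        raise ValueError("need two non-empty lists of the same length")
    n = len(originals)
    dim = len(originals[0])
    mean = [sum(x[d] for x in originals) / n for d in range(dim)]
    residual = sum(
        (x[d] - r[d]) ** 2
        for x, r in zip(originals, reconstructions)
        for d in range(dim)
    )
    total = sum((x[d] - mean[d]) ** 2 for x in originals for d in range(dim))
    if total == 0.0:
        raise ValueError("originals have zero variance; FVU is undefined")
    return residual / total


# Hypothetical activations and their reconstructions by a replacement model.
originals = [[1.0, 0.0], [3.0, 2.0]]
reconstructions = [[1.0, 0.0], [3.0, 1.0]]

fvu = fraction_variance_unexplained(originals, reconstructions)
print(f"FVU = {fvu:.2f}")
```

A metric of this kind measures only how well activations are reconstructed. It says nothing about whether downstream behavior is preserved, so a fuller protocol would likely pair it with behavioral and intervention-based measures.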
