Discussion

The surveyed works mark a shift in mechanistic interpretability: from analyzing polysemantic neurons and coarse-grained components (e.g., attention heads) toward constructing fine-grained sparse feature circuits. Sparse Autoencoders (SAEs) and Transcoders make this possible by decomposing polysemantic activations into sparse, interpretable features. Using these features, the works reviewed here recover human-understandable algorithms for induction and counting, and expose mechanisms behind hallucinations, unfaithful reasoning, and hidden goals.

Feature circuits are not merely descriptive. Marks et al. [2024] edit their network to improve its generalization by removing spurious features. Ameisen et al. [2025] and Ge et al. [2024] validate their graphs through steering and ablation experiments, demonstrating that the discovered circuits are causally responsible for the observed behaviors, not merely correlated.
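
The ablation-style validation described above can be illustrated with a minimal sketch. It assumes a toy SAE with a ReLU encoder and a linear decoder. It zeroes one feature, keeps the SAE's reconstruction error fixed so that only that feature's contribution changes, and measures the effect on a downstream readout. All function names, weights, and the readout vector are hypothetical and are not taken from any of the cited works.

```python
# Toy feature ablation: zero one SAE feature and measure the downstream effect.
# Every name and number below is an illustrative assumption.

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def encode(x, W_enc, b_enc):
    """ReLU encoder: sparse feature activations for activation vector x."""
    pre = [a + b for a, b in zip(matvec(W_enc, x), b_enc)]
    return [max(0.0, p) for p in pre]

def decode(f, W_dec):
    """Linear decoder: W_dec[i] is the output direction of feature i."""
    d = len(W_dec[0])
    return [sum(f[i] * W_dec[i][j] for i in range(len(f))) for j in range(d)]

def ablate_feature(x, W_enc, b_enc, W_dec, idx):
    """Return x with feature idx removed, keeping the SAE residual fixed."""
    f = encode(x, W_enc, b_enc)
    error = [a - b for a, b in zip(x, decode(f, W_dec))]  # part the SAE misses
    f_ablated = list(f)
    f_ablated[idx] = 0.0
    return [a + e for a, e in zip(decode(f_ablated, W_dec), error)]

# Hypothetical 2-feature SAE with an imperfect decoder.
W_enc = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [0.0, 0.0]
W_dec = [[1.0, 0.0], [0.0, 0.5]]
readout = [1.0, -1.0]  # stand-in for a downstream logit direction

x = [2.0, 1.0]
patched = ablate_feature(x, W_enc, b_enc, W_dec, idx=0)

baseline_logit = sum(w * a for w, a in zip(readout, x))
patched_logit = sum(w * a for w, a in zip(readout, patched))
effect = patched_logit - baseline_logit
```

A large change in the readout after ablation is evidence that the feature contributes causally to the behavior, whereas a feature that is only correlated with the behavior would leave the readout largely unchanged.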

Limitations

Despite these advancements, important limitations remain:

Future Research

Future research must address several open challenges:

  1. Reduce reconstruction error in replacement models: develop better SAE and Transcoder training objectives, architectures (e.g., end-to-end jointly trained replacements), and regularization strategies.
  2. Address feature splitting: develop methods that automatically group related features into coherent concepts, reducing the manual effort currently required to make circuits interpretable.
  3. Explain attention patterns end-to-end: incorporate the softmax non-linearity and full QK dynamics into circuit attribution, rather than treating attention scores as fixed constants or using bilinear approximations.
  4. Scale to frontier models: develop computationally efficient methods that can produce global, input-invariant circuits for models with hundreds of billions of parameters.
  5. Establish benchmarks: create standardized evaluation protocols for circuit fidelity (how faithfully does the circuit reproduce the full model's behavior?) and circuit utility (how actionable is the circuit for model editing, safety analysis, and similar applications?).
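
Two of the quantities named in the list above can be made concrete with a small sketch: reconstruction error, measured here as the fraction of variance unexplained (FVU), and circuit fidelity, measured here as a normalized faithfulness score. Both definitions are one possible choice among several, assumed for illustration, and the function names and numbers are hypothetical.

```python
# Sketch of two candidate metrics. The definitions are illustrative
# assumptions, not a standard fixed by the surveyed works.

def fraction_variance_unexplained(xs, x_hats):
    """Squared reconstruction error divided by total variance of xs.

    0.0 means perfect reconstruction; 1.0 is no better than
    predicting the mean activation.
    """
    n, d = len(xs), len(xs[0])
    mean = [sum(x[j] for x in xs) / n for j in range(d)]
    resid = sum((x[j] - xh[j]) ** 2 for x, xh in zip(xs, x_hats) for j in range(d))
    total = sum((x[j] - mean[j]) ** 2 for x in xs for j in range(d))
    if total == 0.0:
        raise ValueError("activations have zero variance")
    return resid / total

def faithfulness(m_circuit, m_full, m_empty):
    """Share of the full model's metric recovered by the circuit alone.

    m_circuit: metric with only the circuit active
    m_full:    metric of the unmodified model
    m_empty:   metric with everything ablated (baseline)
    """
    if m_full == m_empty:
        raise ValueError("full and empty metrics coincide")
    return (m_circuit - m_empty) / (m_full - m_empty)

# Hypothetical activations and their replacement-model reconstructions.
xs = [[0.0, 0.0], [2.0, 2.0]]
x_hats = [[0.0, 0.0], [2.0, 1.0]]
fvu = fraction_variance_unexplained(xs, x_hats)

# Hypothetical task metric (e.g., a logit difference) in three conditions.
score = faithfulness(m_circuit=3.0, m_full=4.0, m_empty=0.0)
```

A benchmark in the sense of item 5 would fix such definitions, along with the ablation baseline and the evaluation data, so that circuits from different methods can be compared on equal terms.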