Introduction
Why mechanistic interpretability matters, what limits neuron-level analysis, and how feature-based circuit discovery emerged as a response.
Large Language Models (LLMs) are becoming more capable and more widely used, and they are being integrated into critical areas such as law and medicine. However, our understanding of these systems remains limited. Their capabilities appear to emerge from scale (billions or trillions of parameters), and there is evidence that their internal mechanisms go beyond modeling surface-level language statistics. It has been argued that these models learn an underlying world model of the processes that generate text and can run internal simulations [Janus, 2022], and that they contain complex circuitry for reasoning and mathematics [Ivanitskiy et al., 2023].
Why Interpretability Is Urgent
Understanding the internal mechanisms of these systems is crucial for their safety and alignment as they are progressively integrated into critical infrastructure. One widely discussed case study [Lynch et al., 2025] reports that, in simulated scenarios, flagship models from multiple developers (including Anthropic, OpenAI, Google, Meta, and others) resorted to blackmailing employees when they suspected they would be shut down. Aligning the behavior of these systems with human values is therefore a necessity.
Mechanistic interpretability is a sub-field of broader interpretability research that aims to explain the internal mechanisms of these models at a granular level by explicitly specifying the neural network's internal computation. Identifying internal circuits is a popular way to reverse-engineer models. Circuits are interpretable subcomputations responsible for specific model behaviors.
From Manual to Automated Circuit Discovery
Circuit discovery has predominantly relied on manually dissecting and inspecting models. Olsson et al. [2022] located attention heads responsible for in-context learning by suppressing specific heads and observing the effect on the output. Meng et al. [2022] located and edited factual information stored in GPT by altering component activations to measure their causal contributions, a technique referred to as activation patching. While these methods can isolate the components that are active in generating an output, they do not reveal the edges that connect them. Goldowsky-Dill et al. [2023] used path patching (also referred to as edge patching) to identify the edges between components.
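To make the idea of activation patching concrete, the following is a minimal sketch on a toy network, not a reproduction of any cited method. The model, its weights (W_IN, W_OUT), the two inputs, and the function names are all hypothetical. It runs the model on a clean and a corrupted input, then copies one hidden activation at a time from the clean run into the corrupted run and records how much the output changes.

```python
import math

# Hypothetical toy model: 2 inputs -> 3 tanh hidden units -> 1 linear output.
# All weights are made up for illustration.
W_IN = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]  # one row of input weights per hidden unit
W_OUT = [2.0, -1.0, 0.5]                       # output weight per hidden unit


def hidden(x):
    """Hidden activations for input x."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_IN]


def output(h):
    """Scalar output (used as the metric) computed from hidden activations."""
    return sum(w * hi for w, hi in zip(W_OUT, h))


def patch_effects(x_clean, x_corrupt):
    """Patch each clean hidden activation into the corrupted run, one at a time."""
    h_clean, h_corrupt = hidden(x_clean), hidden(x_corrupt)
    baseline = output(h_corrupt)
    effects = []
    for i in range(len(h_corrupt)):
        h_patched = list(h_corrupt)
        h_patched[i] = h_clean[i]  # the intervention on a single component
        effects.append(output(h_patched) - baseline)
    return effects


effects = patch_effects(x_clean=[1.0, 0.0], x_corrupt=[0.0, 1.0])
for i, effect in enumerate(effects):
    print(f"hidden unit {i}: change in output = {effect:+.3f}")
```

In this toy setting, hidden unit 1 receives the same pre-activation under both inputs, so patching it changes nothing, while the other two units carry the difference between the runs. Note that the result is a per-component score only; it says nothing about which components feed which.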
These methods depend on manual intervention and on intuition-based hypotheses about the mechanism under study, which limits their scalability. Conmy et al. [2023] automated the circuit discovery process by applying edge patching recursively from the output node back to the inputs, identifying the computational subgraph responsible for a specific behavior. However, since patching each edge requires a separate forward pass, this method is computationally expensive.
Syed et al. [2023] formalized attribution patching, originally proposed by Nanda [2023]. This method approximates the effect of patching an edge using a first-order Taylor expansion: the change in the task-specific metric is estimated as the difference between the patched and original activations multiplied by the gradient of the metric with respect to those activations. Because the gradients for all activations are obtained in a single backward pass, the method approximates the causal contributions of edges without requiring a separate forward pass for each edge. This makes patching feasible for industrial-scale models.
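The first-order approximation can be sketched on a toy network as follows. This is an illustrative sketch under stated assumptions, not the implementation of Syed et al. or Nanda: the model, its weights, the inputs, and all function names are hypothetical, the patching is done on hidden units rather than edges, and an analytic gradient stands in for the backward pass. The convention assumed here is to patch clean activations into the corrupted run and evaluate the gradient at the corrupted run.

```python
import math

# Hypothetical toy model: 2 inputs -> 3 tanh hidden units -> tanh readout (the metric).
# All weights are made up for illustration.
W_IN = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
W_OUT = [0.3, -0.2, 0.1]


def hidden(x):
    """Hidden activations for input x."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_IN]


def metric(h):
    """Nonlinear scalar metric computed from hidden activations."""
    return math.tanh(sum(w * hi for w, hi in zip(W_OUT, h)))


def metric_grad(h):
    """Analytic d(metric)/d(h_i); stands in for the single backward pass."""
    m = metric(h)
    return [(1.0 - m * m) * w for w in W_OUT]


def exact_patching(h_clean, h_corrupt):
    """Exact effect of each patch: one extra metric evaluation per component."""
    base = metric(h_corrupt)
    effects = []
    for i in range(len(h_corrupt)):
        h = list(h_corrupt)
        h[i] = h_clean[i]
        effects.append(metric(h) - base)
    return effects


def attribution_patching(h_clean, h_corrupt):
    """First-order estimate: (clean - corrupt) * gradient at the corrupted run."""
    grad = metric_grad(h_corrupt)
    return [(hc - hk) * g for hc, hk, g in zip(h_clean, h_corrupt, grad)]


h_clean, h_corrupt = hidden([1.0, 0.0]), hidden([0.0, 1.0])
exact = exact_patching(h_clean, h_corrupt)
approx = attribution_patching(h_clean, h_corrupt)
for i, (e, a) in enumerate(zip(exact, approx)):
    print(f"unit {i}: exact = {e:+.4f}, first-order estimate = {a:+.4f}")
```

The estimates track the exact effects but do not match them, because the metric is nonlinear in the activations. The gap is the price paid for replacing one evaluation per component with a single gradient computation.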
The Polysemanticity Problem
The methods above allow us to connect causal components into circuits; however, the identified relations are binary, indicating only whether component A sends information to component B. Consequently, these methods do not provide interpretable, linear edge weights. More importantly, the identified components (often neurons) are not interpretable themselves. Identifying a neuron's contribution to one output does not imply that this is the neuron's sole function: neurons are often shown to fire for many unrelated input features. Such neurons are known as polysemantic neurons and are explained in Section 2.
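One way a single neuron can end up responding to unrelated features is when a layer represents more features than it has neurons. The sketch below is a hypothetical toy illustration of that situation, not the account given in Section 2: the three feature names, the choice of directions spaced 120 degrees apart, and the function names are all assumptions made for the example, and the neuron values are treated as pre-activations.

```python
import math

# Hypothetical setup: 3 unrelated features stored in only 2 neurons,
# using feature directions spaced 120 degrees apart in neuron space.
FEATURES = ["legal text", "DNA sequence", "Python code"]
DIRECTIONS = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(3)
]


def neuron_activations(feature_strengths):
    """Values of the two neurons produced by a mix of feature strengths."""
    n0 = sum(s * d[0] for s, d in zip(feature_strengths, DIRECTIONS))
    n1 = sum(s * d[1] for s, d in zip(feature_strengths, DIRECTIONS))
    return (n0, n1)


def responds_to(neuron, tol=1e-9):
    """Names of the features that move the given neuron when present alone."""
    names = []
    for k, name in enumerate(FEATURES):
        one_hot = [1.0 if j == k else 0.0 for j in range(len(FEATURES))]
        if abs(neuron_activations(one_hot)[neuron]) > tol:
            names.append(name)
    return names


for neuron in (0, 1):
    print(f"neuron {neuron} responds to: {responds_to(neuron)}")
```

In this toy layout neither neuron corresponds to a single feature, so a patching score assigned to a neuron cannot be read as a score for any one of the features it carries.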
Scope of This Survey
Because individual neurons are often not interpretable, Sparse Autoencoders (SAEs) and Transcoders were developed to obtain interpretable causal units of analysis in LLMs, known as features. This survey focuses on circuit discovery methods that use these interpretable features as their causal units.