Feature Circuits in LLMs

This blog post surveys recent mechanistic interpretability research on discovering interpretable circuits within Large Language Models (LLMs). Earlier work treated neurons as the fundamental unit of analysis, but neurons are often polysemantic: a single neuron can respond to several unrelated concepts, which makes it difficult to interpret. To address this, recent research uses Transcoders and Sparse Autoencoders (SAEs) to decompose activations into sparse, more interpretable features. These features are then connected, using either analytical attribution methods or activation patching, to recover interpretable circuits: computational subgraphs responsible for specific LLM behaviors.
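
As a rough illustration of the pipeline described above, the sketch below encodes a dense activation into sparse features, reconstructs it from decoder directions, and computes a linear attribution between an upstream and a downstream feature. Everything here is an assumption made for illustration: the function names are hypothetical, the weights are hand-picked toy values rather than trained dictionaries, and the attribution (upstream activation times the dot product of its decoder direction with the downstream encoder direction) is a simplified form that ignores attention, normalization layers, and bias terms.

```python
def matvec(matrix, vec):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def sae_encode(activation, w_enc, b_enc):
    """Map a dense activation to non-negative feature activations (ReLU encoder)."""
    pre = matvec(w_enc, activation)
    return [max(0.0, p + b) for p, b in zip(pre, b_enc)]


def sae_decode(features, w_dec, b_dec):
    """Reconstruct the activation as a feature-weighted sum of decoder directions."""
    out = list(b_dec)
    for f, direction in zip(features, w_dec):
        for k, d in enumerate(direction):
            out[k] += f * d
    return out


def feature_attribution(upstream_act, upstream_dec_dir, downstream_enc_dir):
    """Linear contribution of one upstream feature to a downstream feature's pre-activation."""
    dot = sum(a * b for a, b in zip(upstream_dec_dir, downstream_enc_dir))
    return upstream_act * dot


# Toy dictionary: 2-dimensional activations, 3 features (hand-picked, not trained).
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_dec = [0.0, 0.0]

activation = [2.0, 0.0]
features = sae_encode(activation, w_enc, b_enc)     # only feature 0 is active
reconstruction = sae_decode(features, w_dec, b_dec)

# Hypothetical encoder direction of a feature in a later layer.
downstream_enc_dir = [0.5, 0.5]
attribution = feature_attribution(features[0], w_dec[0], downstream_enc_dir)
```

In this toy setup only one of the three features fires, and the attribution score is an edge weight of the kind a circuit graph would record between two features. Activation patching, the alternative mentioned above, would estimate that edge by intervening on the upstream feature and measuring the downstream change instead.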

Key Works Covered

Dunefsky et al. (2024) — Transcoders Find Interpretable LLM Feature Circuits
Ameisen et al. (2025) — Circuit Tracing: Revealing Computational Graphs in Language Models
Kamath et al. (2025) — Tracing Attention Computation Through Feature Interactions
Ge et al. (2024) — Automatically Identifying Local and Global Circuits with Linear Computation Graphs
He et al. (2024) — Dictionary Learning Improves Patch-Free Circuit Discovery
Marks et al. (2024) — Sparse Feature Circuits
Lindsey et al. (2025) — On the Biology of a Large Language Model
