Feature Circuits in LLMs
This blog post surveys state-of-the-art mechanistic interpretability research focused on discovering
meaningful circuits within Large Language Models (LLMs). Traditionally, such work relied on
neurons as the fundamental units of analysis; however, neurons are often polysemantic and
difficult to interpret. To address this, recent research leverages Transcoders and Sparse
Autoencoders (SAEs) to decompose activations into interpretable features. These features are then
connected using analytical attribution methods or activation patching to recover interpretable
circuits—computational subgraphs responsible for specific LLM behaviors.
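
To make the two steps concrete, here is a minimal NumPy sketch under simplifying assumptions. It uses a single-layer ReLU SAE with random, untrained weights, and a hypothetical downstream feature that reads linearly from the reconstructed activation through a made-up vector `w_down`. All names and dimensions are illustrative and not taken from any particular paper or library. Real pipelines also have to account for attention, normalization layers, and reconstruction error.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode an activation vector into non-negative features, then reconstruct it."""
    features = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU encoder
    x_hat = features @ W_dec + b_dec               # linear decoder
    return features, x_hat

def direct_attribution(features, W_dec, w_down):
    """Per-feature contribution to a downstream unit that reads x_hat linearly.

    Contribution of feature i = activation_i * (decoder direction i . w_down).
    """
    return features * (W_dec @ w_down)

# Illustrative sizes and random weights (assumptions, not trained values).
rng = np.random.default_rng(0)
d_model, d_features = 8, 32
W_enc = rng.normal(size=(d_model, d_features)) * 0.1
b_enc = np.zeros(d_features)
W_dec = rng.normal(size=(d_features, d_model)) * 0.1
b_dec = np.zeros(d_model)
w_down = rng.normal(size=d_model)  # hypothetical downstream read-in direction

x = rng.normal(size=d_model)       # stand-in for a model activation
features, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
contributions = direct_attribution(features, W_dec, w_down)

# The strongest edges into the downstream unit are candidate circuit edges.
top_edges = np.argsort(-np.abs(contributions))[:5]
```

Because this path is linear, the per-feature contributions sum exactly to the downstream unit's input (minus the decoder bias term), which is what makes this kind of analytical attribution attractive. Activation patching would instead measure each edge by intervening on a feature and re-running the model.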