Feature Circuits in LLMs

Emre Uğur
This blog post surveys state-of-the-art mechanistic interpretability research on discovering interpretable circuits within Large Language Models (LLMs). Traditionally, such work treated neurons as the fundamental unit of analysis; however, neurons are often polysemantic, responding to several unrelated concepts, which makes them difficult to interpret. To address this, recent research uses Transcoders and Sparse Autoencoders (SAEs) to decompose activations into interpretable features. These features are then connected using analytical attribution methods or activation patching to recover interpretable circuits: computational subgraphs responsible for specific LLM behaviors.

Contents

Key Works Covered