References — Feature Circuits in LLMs

[1]
Ameisen, E., Lindsey, J., Pearce, A., Gurnee, W., Turner, N. L., Chen, B., et al.
Circuit Tracing: Revealing Computational Graphs in Language Models
Transformer Circuits Thread, 2025
[2]
Kissane, C., Krzyzanowski, R., Conmy, A., & Nanda, N.
Attention Output SAEs Improve Circuit Analysis
Alignment Forum, 2024
[3]
Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimersheim, S., & Garriga-Alonso, A.
Towards Automated Circuit Discovery for Mechanistic Interpretability
Advances in Neural Information Processing Systems (NeurIPS), 2023
[4]
Cunningham, H., Ewart, A., Riggs, L., Huben, R., & Sharkey, L.
Sparse Autoencoders Find Highly Interpretable Features in Language Models
ICLR, 2024
[5]
Dunefsky, J., Chlenski, P., & Nanda, N.
Transcoders Find Interpretable LLM Feature Circuits
Advances in Neural Information Processing Systems (NeurIPS), 2024
[6]
Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., et al.
A Mathematical Framework for Transformer Circuits
Transformer Circuits Thread, 2021
[7]
Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., et al.
Toy Models of Superposition
Transformer Circuits Thread, 2022
[8]
Ge, X., Zhu, F., Shu, W., Wang, J., He, Z., & Qiu, X.
Automatically Identifying Local and Global Circuits with Linear Computation Graphs
arXiv:2405.13868, 2024
[9]
Goldowsky-Dill, N., MacLeod, C., Sato, L., & Arora, A.
Localizing Model Behavior with Path Patching
arXiv:2304.05969, 2023
[10]
Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., et al.
Alignment Faking in Large Language Models
arXiv:2412.14093, 2024
[11]
Gurnee, W., Ameisen, E., Kauvar, I., Tarng, J., Pearce, A., Olah, C., & Batson, J.
When Models Manipulate Manifolds: The Geometry of a Counting Task
Transformer Circuits Thread, 2025
[12]
He, Z., Ge, X., Tang, Q., Sun, T., Cheng, Q., & Qiu, X.
Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT
arXiv:2402.12201, 2024
[13]
Ivanitskiy, M. I., Spies, A. F., Räuker, T., Corlouer, G., Mathwin, C., et al.
Structured World Representations in Maze-Solving Transformers
arXiv:2312.02566, 2023
[14]
Janus.
Simulators
LessWrong, 2022
[15]
Kamath, H., Ameisen, E., Kauvar, I., Luger, R., Gurnee, W., Pearce, A., et al.
Tracing Attention Computation Through Feature Interactions
Transformer Circuits Thread, 2025
[16]
Lindsey, J., Gurnee, W., Ameisen, E., Chen, B., Pearce, A., Turner, N. L., et al.
On the Biology of a Large Language Model
Transformer Circuits Thread, 2025
[17]
Lynch, A., Wright, B., Larson, C., Troy, K. K., Ritchie, S. J., et al.
Agentic Misalignment: How LLMs Could be an Insider Threat
Anthropic Research, 2025
[18]
Marks, S., Rager, C., Michaud, E. J., Belinkov, Y., Bau, D., & Mueller, A.
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
arXiv:2403.19647, 2024
[19]
Meng, K., Bau, D., Andonian, A., & Belinkov, Y.
Locating and Editing Factual Associations in GPT
Advances in Neural Information Processing Systems (NeurIPS), 2022
[20]
Nanda, N.
Attribution Patching: Activation Patching At Industrial Scale
Neel Nanda's Blog, 2023
[21]
Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., et al.
In-context Learning and Induction Heads
Transformer Circuits Thread, 2022
[22]
Syed, A., Rager, C., & Conmy, A.
Attribution Patching Outperforms Automated Circuit Discovery
NeurIPS Workshop on Attributing Model Behavior at Scale, 2023
[23]
Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., et al.
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Transformer Circuits Thread, 2024
[24]
Turpin, M., Michael, J., Perez, E., & Bowman, S. R.
Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
Advances in Neural Information Processing Systems (NeurIPS), 2023
