Bibliography
References
All citations link directly to the original research. Works on the Transformer Circuits Thread are linked to the canonical thread page.
[1] Ameisen, E., Lindsey, J., Pearce, A., Gurnee, W., Turner, N. L., Chen, B., et al. Circuit Tracing: Revealing Computational Graphs in Language Models. Transformer Circuits Thread, 2025.
[2] Kissane, C., Krzyzanowski, R., Conmy, A., & Nanda, N. Attention Output SAEs Improve Circuit Analysis. Alignment Forum, 2024.
[3] Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimersheim, S., & Garriga-Alonso, A. Towards Automated Circuit Discovery for Mechanistic Interpretability. Advances in Neural Information Processing Systems (NeurIPS), 2023.
[4] Cunningham, H., Ewart, A., Riggs, L., Huben, R., & Sharkey, L. Sparse Autoencoders Find Highly Interpretable Features in Language Models. ICLR, 2024.
[5] Dunefsky, J., Chlenski, P., & Nanda, N. Transcoders Find Interpretable LLM Feature Circuits. Advances in Neural Information Processing Systems (NeurIPS), 2024.
[6] Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., et al. A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread, 2021.
[7] Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., et al. Toy Models of Superposition. Transformer Circuits Thread, 2022.
[8] Ge, X., Zhu, F., Shu, W., Wang, J., He, Z., & Qiu, X. Automatically Identifying Local and Global Circuits with Linear Computation Graphs. arXiv:2405.13868, 2024.
[9] Goldowsky-Dill, N., MacLeod, C., Sato, L., & Arora, A. Localizing Model Behavior with Path Patching. arXiv:2304.05969, 2023.
[10] Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., et al. Alignment Faking in Large Language Models. arXiv:2412.14093, 2024.
[11] Gurnee, W., Ameisen, E., Kauvar, I., Tarng, J., Pearce, A., Olah, C., & Batson, J. When Models Manipulate Manifolds: The Geometry of a Counting Task. Transformer Circuits Thread, 2025.
[12] He, Z., Ge, X., Tang, Q., Sun, T., Cheng, Q., & Qiu, X. Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT. arXiv:2402.12201, 2024.
[13] Ivanitskiy, M. I., Spies, A. F., Räuker, T., Corlouer, G., Mathwin, C., et al. Structured World Representations in Maze-Solving Transformers. arXiv:2312.02566, 2023.
[14]
[15] Kamath, H., Ameisen, E., Kauvar, I., Luger, R., Gurnee, W., Pearce, A., et al. Tracing Attention Computation Through Feature Interactions. Transformer Circuits Thread, 2025.
[16] Lindsey, J., Gurnee, W., Ameisen, E., Chen, B., Pearce, A., Turner, N. L., et al. On the Biology of a Large Language Model. Transformer Circuits Thread, 2025.
[17] Lynch, A., Wright, B., Larson, C., Troy, K. K., Ritchie, S. J., et al. Agentic Misalignment: How LLMs Could be an Insider Threat. Anthropic Research, 2025.
[18] Marks, S., Rager, C., Michaud, E. J., Belinkov, Y., Bau, D., & Mueller, A. Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models. arXiv:2403.19647, 2024.
[19] Meng, K., Bau, D., Andonian, A., & Belinkov, Y. Locating and Editing Factual Associations in GPT. Advances in Neural Information Processing Systems (NeurIPS), 2022.
[20]
[21] Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., et al. In-context Learning and Induction Heads. Transformer Circuits Thread, 2022.
[22] Syed, A., Rager, C., & Conmy, A. Attribution Patching Outperforms Automated Circuit Discovery. NeurIPS Workshop on Attributing Model Behavior at Scale, 2023.
[23] Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., et al. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread, 2024.
[24] Turpin, M., Michael, J., Perez, E., & Bowman, S. R. Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. Advances in Neural Information Processing Systems (NeurIPS), 2023.