  1. [1]
    Ameisen, E., Lindsey, J., Pearce, A., Gurnee, W., Turner, N. L., Chen, B., et al.
    Circuit Tracing: Revealing Computational Graphs in Language Models
    Transformer Circuits Thread, 2025
  2. [2]
    Kissane, C., Krzyzanowski, R., Conmy, A., & Nanda, N.
    Attention Output SAEs Improve Circuit Analysis
    Alignment Forum, 2024
  3. [3]
    Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimersheim, S., & Garriga-Alonso, A.
    Towards Automated Circuit Discovery for Mechanistic Interpretability
    Advances in Neural Information Processing Systems (NeurIPS), 2023
  4. [4]
    Cunningham, H., Ewart, A., Riggs, L., Huben, R., & Sharkey, L.
    Sparse Autoencoders Find Highly Interpretable Features in Language Models
    ICLR, 2024
  5. [5]
    Dunefsky, J., Chlenski, P., & Nanda, N.
    Transcoders Find Interpretable LLM Feature Circuits
    Advances in Neural Information Processing Systems (NeurIPS), 2024
  6. [6]
    Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., et al.
    A Mathematical Framework for Transformer Circuits
    Transformer Circuits Thread, 2021
  7. [7]
    Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., et al.
    Toy Models of Superposition
    Transformer Circuits Thread, 2022
  8. [8]
    Ge, X., Zhu, F., Shu, W., Wang, J., He, Z., & Qiu, X.
    Automatically Identifying Local and Global Circuits with Linear Computation Graphs
    arXiv:2405.13868, 2024
  9. [9]
    Goldowsky-Dill, N., MacLeod, C., Sato, L., & Arora, A.
    Localizing Model Behavior with Path Patching
    arXiv:2304.05969, 2023
  10. [10]
    Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., et al.
    Alignment Faking in Large Language Models
    arXiv:2412.14093, 2024
  11. [11]
    Gurnee, W., Ameisen, E., Kauvar, I., Tarng, J., Pearce, A., Olah, C., & Batson, J.
    When Models Manipulate Manifolds: The Geometry of a Counting Task
    Transformer Circuits Thread, 2025
  12. [12]
  13. [13]
    Ivanitskiy, M. I., Spies, A. F., Räuker, T., Corlouer, G., Mathwin, C., et al.
    Structured World Representations in Maze-Solving Transformers
    arXiv:2312.02566, 2023
  14. [14]
    Janus.
    Simulators
    LessWrong, 2022
  15. [15]
    Kamath, H., Ameisen, E., Kauvar, I., Luger, R., Gurnee, W., Pearce, A., et al.
    Tracing Attention Computation Through Feature Interactions
    Transformer Circuits Thread, 2025
  16. [16]
    Lindsey, J., Gurnee, W., Ameisen, E., Chen, B., Pearce, A., Turner, N. L., et al.
    On the Biology of a Large Language Model
    Transformer Circuits Thread, 2025
  17. [17]
    Lynch, A., Wright, B., Larson, C., Troy, K. K., Ritchie, S. J., et al.
    Agentic Misalignment: How LLMs Could Be Insider Threats
    Anthropic Research, 2025
  18. [18]
    Marks, S., Rager, C., Michaud, E. J., Belinkov, Y., Bau, D., & Mueller, A.
    Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
    arXiv:2403.19647, 2024
  19. [19]
    Meng, K., Bau, D., Andonian, A., & Belinkov, Y.
    Locating and Editing Factual Associations in GPT
    Advances in Neural Information Processing Systems (NeurIPS), 2022
  20. [20]
  21. [21]
    Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., et al.
    In-context Learning and Induction Heads
    Transformer Circuits Thread, 2022
  22. [22]
    Syed, A., Rager, C., & Conmy, A.
    Attribution Patching Outperforms Automated Circuit Discovery
    NeurIPS Workshop on Attributing Model Behavior at Scale, 2023
  23. [23]
    Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., et al.
    Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
    Transformer Circuits Thread, 2024
  24. [24]
    Turpin, M., Michael, J., Perez, E., & Bowman, S. R.
    Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
    Advances in Neural Information Processing Systems (NeurIPS), 2023