Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.
[NeurIPS 2025 MechInterp Workshop - Spotlight] Official implementation of the paper "RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching"
Lightweight representation engineering dataflow operations for agent developers.
Investigating whether language models encode anticipated social consequences in their activations. Uses a 2x2 factorial design crossing truth × social valence to show that models are more sensitive to expected approval/disapproval than to truth itself.
Training and exploration of linear probes into Othello-GPT by Li et al. (2022)
Implementation and analysis of Sparse Autoencoders for neural network interpretability research. Features interactive visualization dashboard and W&B integration.
Testing role-based pathways on small LLMs
Evaluating how a model's ability to 'know what it knows' changes from the base model to the instruct-tuned variant
Does Quantization Kill Interpretability? Scaling study across 5 models (124M-2.8B): RTN destroys induction heads in small models, GPTQ preserves them at all scales.
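The RTN scheme contrasted above can be sketched in a few lines. This is a generic symmetric per-tensor round-to-nearest quantizer in NumPy, a minimal illustration and not code from the repository:

```python
import numpy as np

def rtn_quantize(weights: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Round-to-nearest (RTN): scale onto a signed integer grid,
    round, clip, then dequantize back to floats."""
    qmax = 2 ** (n_bits - 1) - 1           # e.g. 7 for 4-bit signed
    scale = np.abs(weights).max() / qmax   # single scale for the tensor
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax)
    return q * scale                       # dequantized weights

w = np.array([0.03, -0.71, 0.42, 0.0])
w_q = rtn_quantize(w, n_bits=4)
err = np.abs(w - w_q).max()  # error grows as n_bits shrinks
```

At 4 bits, small weights like 0.03 collapse to 0, which is one intuition for why RTN can destroy finely tuned circuits such as induction heads in small models.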
Open-source EU AI Act Annex IV compliance toolkit. Mechanistic interpretability + circuit discovery for transformers. One function call generates a court-ready evidence package
A Flax-based library for examining transformers, based on TransformerLens.
Mechanistic interpretability framework for Decision Transformers using TransformerLens - analyze neural circuits, perform causal interventions, train SAEs, and steer agent behavior through activation-level control.
ORION-TransformerLens Consciousness — Mechanistic interpretability for consciousness research. Fork of TransformerLens (3,115+ stars). Finding consciousness correlates in attention heads.
Reverse engineering the circuit responsible for the "greater than" capability in a language model
Probing where in Pythia's residual stream the decision to be sycophantic is already 'decided', using linear classifiers on per-layer activations against a small labeled sycophancy dataset.
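The per-layer probing recipe in that entry can be sketched generically: fit one linear classifier per layer and see where accuracy jumps. The layers, activations, and labels below are synthetic stand-ins, not Pythia data:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_samples, d_model = 6, 200, 32

# Synthetic stand-in for cached residual-stream activations: a binary
# signal that only becomes linearly readable from layer 3 onward.
labels = rng.integers(0, 2, n_samples)
direction = rng.standard_normal(d_model)
acts = rng.standard_normal((n_layers, n_samples, d_model))
for layer in range(3, n_layers):
    acts[layer] += np.outer(2 * labels - 1, direction)  # inject signal

def probe_accuracy(X: np.ndarray, y: np.ndarray) -> float:
    """Fit a least-squares linear probe; report train accuracy."""
    X1 = np.hstack([X, np.ones((len(X), 1))])  # bias column
    w, *_ = np.linalg.lstsq(X1, 2 * y - 1, rcond=None)
    return float(((X1 @ w > 0) == y).mean())

accs = [probe_accuracy(acts[l], labels) for l in range(n_layers)]
# accuracy jumps once the signal is present in the residual stream
```

The layer at which probe accuracy rises is read as the point where the decision is already 'decided' in the residual stream.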
Mechanistic study of the refusal direction across base, instruction-tuned, and reasoning-distilled Qwen2.5-1.5B variants: extraction, ablation, transplant, and phase-aware analysis.
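The extraction and ablation steps in that entry follow a common recipe: take the difference of mean activations between refusal-eliciting and benign prompts as the 'refusal direction', then project it out of the activations. A generic NumPy sketch on synthetic activations, not Qwen2.5 data:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64

# Synthetic stand-ins for activations on harmful vs. harmless prompts.
true_dir = rng.standard_normal(d_model)
true_dir /= np.linalg.norm(true_dir)
harmless = rng.standard_normal((100, d_model))
harmful = rng.standard_normal((100, d_model)) + 4.0 * true_dir

# Extraction: difference of means, normalized to a unit vector.
refusal_dir = harmful.mean(0) - harmless.mean(0)
refusal_dir /= np.linalg.norm(refusal_dir)

def ablate(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Ablation: remove each activation's component along `direction`."""
    return acts - np.outer(acts @ direction, direction)

ablated = ablate(harmful, refusal_dir)
# ablated activations have ~zero component along the refusal direction
```

Transplanting the direction into another model variant is the same projection run in reverse: add a scaled copy of the extracted vector instead of subtracting it.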
Automated detection, visualization and suppression of hallucination-associated neurons in open-source LLMs — LLM mechanistic interpretability research tool
(a1) Mechanistic interpretability using TransformerLens; (a2) PEFT