Arrakis is a library to conduct, track and visualize mechanistic interpretability experiments.
[NeurIPS 2025 MechInterp Workshop - Spotlight] Official implementation of the paper "RelP: Faithful and Efficient Circuit Discovery in Language Models via Relevance Patching"
Lightweight representation engineering dataflow operations for agent developers.
Investigating whether language models encode anticipated social consequences in their activations. Uses a 2x2 factorial design crossing truth × social valence to show that models are more sensitive to expected approval/disapproval than to truth itself.
Training and exploration of linear probes into Othello-GPT by Li et al. (2022)
Implementation and analysis of Sparse Autoencoders for neural network interpretability research. Features interactive visualization dashboard and W&B integration.
Testing role-based pathways on small LLMs
Evaluating how a model's ability to 'know what it knows' changes from the base model to the instruct-tuned variant
Does Quantization Kill Interpretability? Scaling study across 5 models (124M-2.8B): RTN destroys induction heads in small models, GPTQ preserves them at all scales.
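The RTN scheme contrasted above can be sketched in a few lines. This is a generic symmetric per-tensor round-to-nearest quantizer in NumPy, a minimal illustration and not code from the repository:

```python
import numpy as np

def rtn_quantize(weights: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Round-to-nearest (RTN): scale onto a signed integer grid,
    round, clip, then dequantize back to floats."""
    qmax = 2 ** (n_bits - 1) - 1           # e.g. 7 for 4-bit signed
    scale = np.abs(weights).max() / qmax   # single scale for the tensor
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax)
    return q * scale                       # dequantized weights

w = np.array([0.03, -0.71, 0.42, 0.0])
w_q = rtn_quantize(w, n_bits=4)
err = np.abs(w - w_q).max()  # error grows as n_bits shrinks
```

At 4 bits, small weights like 0.03 collapse to 0, which is one intuition for why RTN can destroy finely tuned circuits such as induction heads in small models.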
Open-source EU AI Act Annex IV compliance toolkit. Mechanistic interpretability + circuit discovery for transformers. One function call generates a court-ready evidence package
A Flax-based library for examining transformers, based on TransformerLens.
Mechanistic interpretability framework for Decision Transformers using TransformerLens - analyze neural circuits, perform causal interventions, train SAEs, and steer agent behavior through activation-level control.
ORION-TransformerLens Consciousness — Mechanistic interpretability for consciousness research. Fork of TransformerLens (3,115+ stars). Finding consciousness correlates in attention heads.
Reverse engineering the circuit responsible for the "greater than" capability in a language model
Probing where in Pythia's residual stream the decision to be sycophantic is already 'decided', using linear classifiers on per-layer activations against a small labeled sycophancy dataset.
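The per-layer probing recipe in that entry can be sketched generically: fit one linear classifier per layer and see where accuracy jumps. The layers, activations, and labels below are synthetic stand-ins, not Pythia data:

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_samples, d_model = 6, 200, 32

# Synthetic stand-in for cached residual-stream activations: a binary
# signal that only becomes linearly readable from layer 3 onward.
labels = rng.integers(0, 2, n_samples)
direction = rng.standard_normal(d_model)
acts = rng.standard_normal((n_layers, n_samples, d_model))
for layer in range(3, n_layers):
    acts[layer] += np.outer(2 * labels - 1, direction)  # inject signal

def probe_accuracy(X: np.ndarray, y: np.ndarray) -> float:
    """Fit a least-squares linear probe; report train accuracy."""
    X1 = np.hstack([X, np.ones((len(X), 1))])  # bias column
    w, *_ = np.linalg.lstsq(X1, 2 * y - 1, rcond=None)
    return float(((X1 @ w > 0) == y).mean())

accs = [probe_accuracy(acts[l], labels) for l in range(n_layers)]
# accuracy jumps once the signal is present in the residual stream
```

The layer at which probe accuracy rises is read as the point where the decision is already 'decided' in the residual stream.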
Mechanistic study of the refusal direction across base, instruction-tuned, and reasoning-distilled Qwen2.5-1.5B variants: extraction, ablation, transplant, and phase-aware analysis.
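The extraction and ablation steps in that entry follow a common recipe: take the difference of mean activations between refusal-eliciting and benign prompts as the 'refusal direction', then project it out of the activations. A generic NumPy sketch on synthetic activations, not Qwen2.5 data:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 64

# Synthetic stand-ins for activations on harmful vs. harmless prompts.
true_dir = rng.standard_normal(d_model)
true_dir /= np.linalg.norm(true_dir)
harmless = rng.standard_normal((100, d_model))
harmful = rng.standard_normal((100, d_model)) + 4.0 * true_dir

# Extraction: difference of means, normalized to a unit vector.
refusal_dir = harmful.mean(0) - harmless.mean(0)
refusal_dir /= np.linalg.norm(refusal_dir)

def ablate(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Ablation: remove each activation's component along `direction`."""
    return acts - np.outer(acts @ direction, direction)

ablated = ablate(harmful, refusal_dir)
# ablated activations have ~zero component along the refusal direction
```

Transplanting the direction into another model variant is the same projection run in reverse: add a scaled copy of the extracted vector instead of subtracting it.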
Automated detection, visualization and suppression of hallucination-associated neurons in open-source LLMs — LLM mechanistic interpretability research tool
(a1) Mechanistic interpretability using TransformerLens; (a2) PEFT