G2VTCR

Structure-Amplified Complex Graph Network for TCR-Epitope Binding Prediction

Overview

G2VTCR models TCR-epitope interactions as atomic-level complex graphs. Instead of embedding TCR and epitope as separate sequences, G2VTCR constructs a unified graph where inter-molecular contacts are predicted by a contact predictor pre-trained on crystal structures. A Graph Attention Network (GATv2) then classifies binding from this complex graph.

Key ideas

Atomic-level complex modeling — TCR CDR3β and epitope are represented as atom-level graphs (via RDKit), then combined into a single graph with predicted inter-molecular edges
Structure-to-sequence transfer — a contact predictor is pre-trained on ~200 TCR-pMHC crystal structures (from TCR3d), then used to predict contacts for any sequence pair
Interaction-specific embeddings — the same TCR gets different graph representations depending on which epitope it is paired with
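
The second and third ideas can be illustrated with a toy sketch in plain Python. This is not the repository's implementation: the `Graph` class, the `merge_complex` function, the atom labels, and the contact pairs are all hypothetical stand-ins. The real pipeline builds atom graphs with RDKit and takes contacts from the trained contact predictor. The sketch only shows that one TCR graph yields different complex graphs once its inter-molecular edges change.

```python
from dataclasses import dataclass


@dataclass
class Graph:
    """Toy molecular graph: one label per node, edges as index pairs."""
    nodes: list
    edges: list


def merge_complex(tcr, epitope, contacts):
    """Merge two graphs into a single typed complex graph.

    contacts: (tcr_atom_index, epitope_atom_index) pairs. Here they are
    hand-written; in G2VTCR they would come from the contact predictor.
    """
    offset = len(tcr.nodes)  # epitope atoms are renumbered after the TCR atoms
    nodes = [(label, "tcr") for label in tcr.nodes]
    nodes += [(label, "epitope") for label in epitope.nodes]
    edges = [(i, j, "intra_tcr") for i, j in tcr.edges]
    edges += [(i + offset, j + offset, "intra_epitope") for i, j in epitope.edges]
    edges += [(i, j + offset, "inter") for i, j in contacts]
    return nodes, edges


# Hypothetical 4-atom "TCR" paired with two different 3-atom "epitopes"
tcr = Graph(nodes=["N", "C", "C", "O"], edges=[(0, 1), (1, 2), (2, 3)])
epitope_a = Graph(nodes=["N", "C", "O"], edges=[(0, 1), (1, 2)])
epitope_b = Graph(nodes=["N", "C", "S"], edges=[(0, 1), (1, 2)])

nodes_a, edges_a = merge_complex(tcr, epitope_a, contacts=[(3, 0)])
nodes_b, edges_b = merge_complex(tcr, epitope_b, contacts=[(0, 2), (2, 1)])
```

The intra-TCR edges are identical in both results, while the inter-molecular edges differ, so a GNN run over the merged graph sees a different neighbourhood for the same TCR atoms.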

Architecture

Phase 1: Structural Pre-training
  TCR-pMHC crystal structures (TCR3d, 209 complexes)
    → Extract atom-level contact maps (< 5 Å)
    → Train ContactPredictor (cross-attention) to predict contacts from atom features

Phase 2: Complex Graph Construction
  CDR3β sequence + Epitope sequence
    → RDKit atomic graphs (atoms=nodes, bonds=edges)
    → ContactPredictor predicts inter-molecular edges
    → Merge into single heterogeneous complex graph
       (node types: TCR atom / epitope atom)
       (edge types: intra-TCR / intra-epitope / inter-molecular)

Phase 3: Binding Classification
  Complex graph → 4-layer GATv2 (with residual + LayerNorm)
    → Separate TCR/epitope readouts (mean + max pooling)
    → Cross-attention interaction features
    → MLP classifier → binding probability
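
Two steps in the diagram above are simple enough to sketch in plain Python: turning atom coordinates into contact labels with the 5 Å cutoff (Phase 1), and the mean + max readout (Phase 3). The function names, coordinates, and feature vectors are illustrative assumptions, and the code does not reflect how the repository implements either step.

```python
import math


def contact_map(tcr_coords, epitope_coords, cutoff=5.0):
    """Binary matrix: entry [i][j] is 1 when TCR atom i and epitope atom j
    are strictly closer than `cutoff` angstroms."""
    return [
        [1 if math.dist(a, b) < cutoff else 0 for b in epitope_coords]
        for a in tcr_coords
    ]


def mean_max_readout(node_features):
    """Concatenate the per-dimension mean and max over a set of node vectors."""
    dims = range(len(node_features[0]))
    mean = [sum(v[d] for v in node_features) / len(node_features) for d in dims]
    peak = [max(v[d] for v in node_features) for d in dims]
    return mean + peak


# Hypothetical coordinates (angstroms) for 2 TCR atoms and 3 epitope atoms
tcr_coords = [(0.0, 0.0, 0.0), (18.0, 0.0, 0.0)]
epitope_coords = [(3.0, 0.0, 0.0), (0.0, 4.0, 4.0), (20.0, 0.0, 0.0)]
labels = contact_map(tcr_coords, epitope_coords)

# Hypothetical 2-dimensional node embeddings for the TCR atoms
tcr_readout = mean_max_readout([[1.0, 0.0], [3.0, 2.0]])
```

Computing the readout separately for TCR atoms and epitope atoms, as the diagram describes, keeps the two molecules distinguishable before the interaction features are formed.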

Benchmark Results

G2VTCR was evaluated on three published benchmarks, with the baselines run on the same splits:

TChard (5-fold, epitope-hard splits)

Method	AUROC	AUPRC
Random	0.488	0.978
SVM (BLOSUM + physicochemical)	0.751	0.992
MLP (BLOSUM + physicochemical)	0.798	0.989
CNN (NetTCR-2.0 style)	0.861	0.992
G2VTCR	0.855	0.994
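
When reading this table, note that the expected AUPRC of an uninformative scorer is roughly the fraction of positives in the test set, whereas its expected AUROC is 0.5. The Random row's AUPRC of 0.978 is therefore consistent with test folds that are overwhelmingly positive. This is an inference from the table, not a figure stated here. The sketch below reproduces the effect on synthetic data; the label layout and helper functions are assumptions made for illustration and are not taken from the benchmark code.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative
    (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of precision@k taken at the rank of each positive,
    a common estimate of AUPRC."""
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Synthetic data: 98% positives, negatives spread evenly through the ranking,
# so the scores carry no information about the labels.
labels = [0 if i % 50 == 25 else 1 for i in range(1000)]
scores = list(range(1000))

prevalence = sum(labels) / len(labels)
roc = auroc(labels, scores)
ap = average_precision(labels, scores)
print(f"prevalence={prevalence:.3f}  AUROC={roc:.3f}  AUPRC={ap:.3f}")
```

Under this kind of imbalance the AUPRC column is compressed into a narrow band near the top, so the AUROC column is the more discriminating of the two.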

ePytope-TCR (mutation sensitivity)

Method	AUROC
SVM	0.515
CNN (NetTCR-2.0 style)	0.545
MLP	0.553
G2VTCR	0.577
Published: TITAN	0.57
Published: TULIP	0.59

IMMREP23 (unseen peptides)

Method	AUROC	AUPRC
All methods	0.48–0.55	0.17–0.24
G2VTCR	0.529	0.237 (best AUPRC)
Published 1st place	0.86	—

Installation

pip install torch torch-geometric rdkit-pypi scikit-learn pandas numpy
pip install -e .

Usage

Run the full benchmark pipeline

# All 3 benchmarks (TChard + IMMREP23 + ePytope-TCR)
python run_train.py

# Single benchmark
python run_train.py --benchmarks tchard
python run_train.py --benchmarks epytope

# Quick smoke test (5 epochs, few structures)
python run_train.py --quick --benchmarks tchard

# Skip structural pre-training (use saved checkpoint)
python run_train.py --skip-pretrain

Run on HPC with GPU

bash run_hpc.sh

Use as a library

from g2vtcr.models import AtomGraphBuilder, ComplexGraphBuilder, ContactPredictor, ComplexGNN

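# Build a complex graph for one CDR3β/epitope pair.
# `model` is assumed to be a trained ContactPredictor instance (e.g. from Phase 1 pre-training).
builder = ComplexGraphBuilder()
graph = builder.build("CASSLAPGATNEKLFF", "GILGFVFTL", contact_predictor=model)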

# Or use the full training pipeline
from g2vtcr.train import run_pipeline
results = run_pipeline(benchmarks=('tchard',), device='cuda')

Project Structure

g2vtcr/
├── models/
│   ├── complex_gnn.py        # GATv2-based binding classifier
│   ├── contact_predictor.py   # Cross-attention contact prediction
│   └── graph_builder.py       # Atom-level graph construction (RDKit)
├── train.py                   # Training pipeline (Phase 1 + 2 + evaluation)
├── baselines.py               # SVM, MLP, CNN (NetTCR-2.0), Random baselines
├── data_processing.py         # Legacy v1 data utilities
├── embedding.py               # Legacy v1 WL + Doc2Vec embedding
├── clustering.py              # Legacy v1 DBSCAN clustering
├── classification.py          # Legacy v1 RandomForest classification
└── pattern_finding.py         # Legacy v1 TF-IDF motif extraction
run_train.py                   # CLI entry point
run_hpc.sh                     # HPC launch script
results/                       # Benchmark results and reports

Data

Training data is not included in this repository due to size. The pipeline automatically downloads benchmark datasets on first run:

TCR3d: ~350 TCR-pMHC crystal structures for contact predictor pre-training
TChard: 5 epitope-hard splits (~228K train / ~40K test each)
IMMREP23: IMMREP 2023 challenge dataset (22.6K train / 3.5K test)
ePytope-TCR: Mutation sensitivity benchmark (1.3K train / 4.2K test)

License

GNU General Public License v3.0

Contact

Shen Lab, Columbia University
Zicheng Wang: zw2595@cumc.columbia.edu
GitHub Issues
