Structure-Amplified Complex Graph Network for TCR-Epitope Binding Prediction
G2VTCR models TCR-epitope interactions as atomic-level complex graphs. Instead of embedding TCR and epitope as separate sequences, G2VTCR constructs a unified graph where inter-molecular contacts are predicted by a contact predictor pre-trained on crystal structures. A Graph Attention Network (GATv2) then classifies binding from this complex graph.
- Atomic-level complex modeling — TCR CDR3β and epitope are represented as atom-level graphs (via RDKit), then combined into a single graph with predicted inter-molecular edges
- Structure-to-sequence transfer — a contact predictor is pre-trained on ~200 TCR-pMHC crystal structures (from TCR3d), then used to predict contacts for any sequence pair
- Interaction-specific embeddings — the same TCR gets different graph representations depending on which epitope it is paired with
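The merge step behind the last bullet can be sketched in plain Python. This is a minimal illustration, not the repository's `ComplexGraphBuilder`: `MolGraph`, `ComplexGraph`, `merge`, the 0.5 threshold, and the hand-written probability matrices are all hypothetical stand-ins (the real pipeline builds atom graphs with RDKit and takes contact probabilities from the trained `ContactPredictor`).

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class MolGraph:
    """Toy molecular graph: atom labels plus undirected bond pairs."""
    atoms: list[str]
    bonds: list[tuple[int, int]]


@dataclass
class ComplexGraph:
    """Merged graph with typed nodes and typed edges."""
    atoms: list[str] = field(default_factory=list)
    node_type: list[str] = field(default_factory=list)  # "tcr" or "epitope"
    edges: list[tuple[int, int, str]] = field(default_factory=list)


def merge(tcr: MolGraph, epitope: MolGraph, contact_prob, threshold=0.5) -> ComplexGraph:
    """Combine two graphs into one. An inter-molecular edge is added wherever
    contact_prob[i][j] >= threshold (i indexes TCR atoms, j epitope atoms)."""
    g = ComplexGraph()
    g.atoms = tcr.atoms + epitope.atoms
    g.node_type = ["tcr"] * len(tcr.atoms) + ["epitope"] * len(epitope.atoms)
    offset = len(tcr.atoms)  # epitope atom indices are shifted past the TCR atoms
    g.edges += [(a, b, "intra_tcr") for a, b in tcr.bonds]
    g.edges += [(a + offset, b + offset, "intra_epitope") for a, b in epitope.bonds]
    for i, row in enumerate(contact_prob):
        for j, p in enumerate(row):
            if p >= threshold:
                g.edges.append((i, j + offset, "inter"))
    return g


# Same toy TCR paired with two toy epitopes (made-up contact probabilities).
tcr = MolGraph(["N", "C", "C"], [(0, 1), (1, 2)])
epitope_a = MolGraph(["C", "O"], [(0, 1)])
epitope_b = MolGraph(["C", "N"], [(0, 1)])

graph_a = merge(tcr, epitope_a, [[0.9, 0.1], [0.2, 0.1], [0.0, 0.7]])
graph_b = merge(tcr, epitope_b, [[0.1, 0.1], [0.8, 0.1], [0.0, 0.0]])

inter_a = [(u, v) for u, v, kind in graph_a.edges if kind == "inter"]
inter_b = [(u, v) for u, v, kind in graph_b.edges if kind == "inter"]
print(inter_a)  # [(0, 3), (2, 4)]
print(inter_b)  # [(1, 3)]
```

The TCR atoms and bonds are identical in both graphs; only the predicted inter-molecular edges differ, which is what makes the resulting representation specific to the pairing.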
Phase 1: Structural Pre-training
TCR-pMHC crystal structures (TCR3d, 209 complexes)
→ Extract atom-level contact maps (< 5 Å)
→ Train ContactPredictor (cross-attention) to predict contacts from atom features
Phase 2: Complex Graph Construction
CDR3β sequence + Epitope sequence
→ RDKit atomic graphs (atoms=nodes, bonds=edges)
→ ContactPredictor predicts inter-molecular edges
→ Merge into single heterogeneous complex graph
(node types: TCR atom / epitope atom)
(edge types: intra-TCR / intra-epitope / inter-molecular)
Phase 3: Binding Classification
Complex graph → 4-layer GATv2 (with residual + LayerNorm)
→ Separate TCR/epitope readouts (mean + max pooling)
→ Cross-attention interaction features
→ MLP classifier → binding probability
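As an illustration of the Phase 1 target, the sketch below derives a binary contact map from 3D atom coordinates using the 5 Å cutoff stated above. It is a simplified stand-in under stated assumptions: `contact_map` is a hypothetical helper, the coordinates are invented, and the actual pipeline would parse coordinates from the TCR3d structures.

```python
import math


def contact_map(tcr_coords, epitope_coords, cutoff=5.0):
    """Return a binary matrix M where M[i][j] is 1 if TCR atom i and
    epitope atom j are closer than `cutoff` angstroms, else 0."""
    return [
        [1 if math.dist(a, b) < cutoff else 0 for b in epitope_coords]
        for a in tcr_coords
    ]


# Invented coordinates in angstroms, for illustration only.
tcr_atoms = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
epitope_atoms = [(3.0, 0.0, 0.0), (0.0, 4.9, 0.0), (10.0, 0.0, 6.0)]

print(contact_map(tcr_atoms, epitope_atoms))  # [[1, 1, 0], [0, 0, 0]]
```

A matrix of this shape (TCR atoms by epitope atoms) is the kind of label a contact predictor can be trained against, and its predicted counterpart is what supplies the inter-molecular edges in Phase 2.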
G2VTCR is evaluated on three published benchmarks (TChard, IMMREP23, and ePytope-TCR), with all baselines run on the same splits:
| Method | AUROC | AUPRC |
|---|---|---|
| Random | 0.488 | 0.978 |
| SVM (BLOSUM + physicochemical) | 0.751 | 0.992 |
| MLP (BLOSUM + physicochemical) | 0.798 | 0.989 |
| CNN (NetTCR-2.0 style) | 0.861 | 0.992 |
| G2VTCR | 0.855 | 0.994 |
| Method | AUROC |
|---|---|
| SVM | 0.515 |
| CNN (NetTCR-2.0 style) | 0.545 |
| MLP | 0.553 |
| G2VTCR | 0.577 |
| Published: TITAN | 0.57 |
| Published: TULIP | 0.59 |
| Method | AUROC | AUPRC |
|---|---|---|
| All methods | 0.48–0.55 | 0.17–0.24 |
| G2VTCR | 0.529 | 0.237 (best AUPRC) |
| Published 1st place | 0.86 | — |
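For readers comparing the two metrics in the tables above, here is a minimal pure-Python sketch of both. It assumes AUPRC is estimated as average precision, which is one common choice; the repository's evaluation code may use a different estimator (for example via scikit-learn). The function names and toy labels are illustrative only.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative
    (ties count as half), computed by direct pairwise comparison."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy example: three binders, two non-binders.
labels = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.1]
print(round(auroc(labels, scores), 3))              # 0.833
print(round(average_precision(labels, scores), 3))  # 0.917
```

An uninformative scorer is expected to reach an AUROC near 0.5 but an AUPRC near the fraction of positives in the test set, so AUPRC values are best read against that base rate rather than against 0.5.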
pip install torch torch-geometric rdkit-pypi scikit-learn pandas numpy
pip install -e .

# All 3 benchmarks (TChard + IMMREP23 + ePytope-TCR)
python run_train.py
# Single benchmark
python run_train.py --benchmarks tchard
python run_train.py --benchmarks epytope
# Quick smoke test (5 epochs, few structures)
python run_train.py --quick --benchmarks tchard
# Skip structural pre-training (use saved checkpoint)
python run_train.py --skip-pretrain

bash run_hpc.sh

from g2vtcr.models import AtomGraphBuilder, ComplexGraphBuilder, ContactPredictor, ComplexGNN
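# Build the complex graph for one TCR-epitope pair.
# `model` is assumed to be an already-trained ContactPredictor instance
# (e.g. the one produced by Phase 1 pre-training); it is not defined here.
builder = ComplexGraphBuilder()
graph = builder.build("CASSLAPGATNEKLFF", "GILGFVFTL", contact_predictor=model)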
# Or use the full training pipeline
from g2vtcr.train import run_pipeline
results = run_pipeline(benchmarks=('tchard',), device='cuda')

g2vtcr/
├── models/
│ ├── complex_gnn.py # GATv2-based binding classifier
│ ├── contact_predictor.py # Cross-attention contact prediction
│ └── graph_builder.py # Atom-level graph construction (RDKit)
├── train.py # Training pipeline (Phase 1 + 2 + evaluation)
├── baselines.py # SVM, MLP, CNN (NetTCR-2.0), Random baselines
├── data_processing.py # Legacy v1 data utilities
├── embedding.py # Legacy v1 WL + Doc2Vec embedding
├── clustering.py # Legacy v1 DBSCAN clustering
├── classification.py # Legacy v1 RandomForest classification
└── pattern_finding.py # Legacy v1 TF-IDF motif extraction
run_train.py # CLI entry point
run_hpc.sh # HPC launch script
results/ # Benchmark results and reports
Training data is not included in this repository due to size. The pipeline automatically downloads benchmark datasets on first run:
- TCR3d: ~200 TCR-pMHC crystal structures (209 complexes) for contact predictor pre-training
- TChard: 5 epitope-hard splits (~228K train / ~40K test each)
- IMMREP23: IMMREP 2023 challenge dataset (22.6K train / 3.5K test)
- ePytope-TCR: Mutation sensitivity benchmark (1.3K train / 4.2K test)
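To make the "epitope-hard" idea concrete, the sketch below splits records so that no epitope is shared between training and test sets. It assumes that this is what the hard splits enforce; the pipeline downloads precomputed splits, so `epitope_hard_split` is a hypothetical helper for illustration only, and apart from the pair used in the usage example above, the records are made up.

```python
import random


def epitope_hard_split(pairs, test_fraction=0.2, seed=0):
    """Split (cdr3b, epitope, label) records so that no epitope
    appears in both the training set and the test set."""
    epitopes = sorted({ep for _, ep, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(epitopes)
    n_test = max(1, round(len(epitopes) * test_fraction))
    test_eps = set(epitopes[:n_test])
    train = [r for r in pairs if r[1] not in test_eps]
    test = [r for r in pairs if r[1] in test_eps]
    return train, test


# Toy records: (CDR3β, epitope, binding label).
records = [
    ("CASSLAPGATNEKLFF", "GILGFVFTL", 1),
    ("CASSIRSSYEQYF", "GILGFVFTL", 1),
    ("CASSPVTGGIYGYTF", "NLVPMVATV", 1),
    ("CASSLAPGATNEKLFF", "NLVPMVATV", 0),
    ("CSARDGTGNGYTF", "GLCTLVAML", 1),
    ("CASSIRSSYEQYF", "GLCTLVAML", 0),
]

train, test = epitope_hard_split(records, test_fraction=0.34)
train_eps = {ep for _, ep, _ in train}
test_eps = {ep for _, ep, _ in test}
print(train_eps.isdisjoint(test_eps))  # True
```

Splitting by epitope rather than by row means test-time performance reflects generalization to unseen epitopes, which is a harder setting than a random split of the same records.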
GNU General Public License v3.0
Shen Lab, Columbia University

Zicheng Wang: zw2595@cumc.columbia.edu

GitHub Issues
