princello/G2VTCR

G2VTCR

Structure-Amplified Complex Graph Network for TCR-Epitope Binding Prediction

[Figure: G2VTCR workflow]

Overview

G2VTCR models TCR-epitope interactions as atomic-level complex graphs. Instead of embedding TCR and epitope as separate sequences, G2VTCR constructs a unified graph where inter-molecular contacts are predicted by a contact predictor pre-trained on crystal structures. A Graph Attention Network (GATv2) then classifies binding from this complex graph.

Key ideas

  1. Atomic-level complex modeling — TCR CDR3β and epitope are represented as atom-level graphs (via RDKit), then combined into a single graph with predicted inter-molecular edges
  2. Structure-to-sequence transfer — a contact predictor is pre-trained on ~200 TCR-pMHC crystal structures (from TCR3d), then used to predict contacts for any sequence pair
  3. Interaction-specific embeddings — the same TCR gets different graph representations depending on which epitope it is paired with
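
The three ideas above can be illustrated with a minimal, library-free sketch. It is not the repository's `ComplexGraphBuilder`; the `AtomGraph` class, the `merge_complex` function, and the hand-written contact lists are hypothetical stand-ins (in G2VTCR the atom graphs come from RDKit and the contacts from the pre-trained contact predictor). The point is only to show how one TCR graph yields different complex graphs for different epitopes.

```python
from dataclasses import dataclass


@dataclass
class AtomGraph:
    """Toy molecular graph: atoms are nodes, bonds are undirected edges."""
    atoms: list  # element symbols, one per node
    bonds: list  # (i, j) index pairs local to this graph


def merge_complex(tcr, epitope, contacts):
    """Merge two atom graphs into one graph with typed nodes and edges.

    `contacts` holds (tcr_atom, epitope_atom) index pairs. Here they are
    written by hand; they stand in for contact-predictor output.
    """
    offset = len(tcr.atoms)  # epitope node indices are shifted past the TCR nodes
    nodes = [(a, "tcr") for a in tcr.atoms] + [(a, "epitope") for a in epitope.atoms]
    edges = [(i, j, "intra_tcr") for i, j in tcr.bonds]
    edges += [(i + offset, j + offset, "intra_epitope") for i, j in epitope.bonds]
    edges += [(i, j + offset, "inter") for i, j in contacts]
    return nodes, edges


# Toy fragments, not real CDR3β or epitope chemistry.
tcr = AtomGraph(atoms=["N", "C", "C", "O"], bonds=[(0, 1), (1, 2), (2, 3)])
epi_a = AtomGraph(atoms=["N", "C", "O"], bonds=[(0, 1), (1, 2)])
epi_b = AtomGraph(atoms=["N", "C", "C", "S"], bonds=[(0, 1), (1, 2), (2, 3)])

# The same TCR paired with two epitopes gives two different complex graphs.
nodes_a, edges_a = merge_complex(tcr, epi_a, contacts=[(3, 0)])
nodes_b, edges_b = merge_complex(tcr, epi_b, contacts=[(0, 3), (3, 0)])
```

Because the inter-molecular edges depend on the partner, any embedding computed over the merged graph is specific to the pair rather than to the TCR alone.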

Architecture

Phase 1: Structural Pre-training
  TCR-pMHC crystal structures (TCR3d, 209 complexes)
    → Extract atom-level contact maps (< 5 Å)
    → Train ContactPredictor (cross-attention) to predict contacts from atom features

Phase 2: Complex Graph Construction
  CDR3β sequence + Epitope sequence
    → RDKit atomic graphs (atoms=nodes, bonds=edges)
    → ContactPredictor predicts inter-molecular edges
    → Merge into single heterogeneous complex graph
       (node types: TCR atom / epitope atom)
       (edge types: intra-TCR / intra-epitope / inter-molecular)

Phase 3: Binding Classification
  Complex graph → 4-layer GATv2 (with residual + LayerNorm)
    → Separate TCR/epitope readouts (mean + max pooling)
    → Cross-attention interaction features
    → MLP classifier → binding probability
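
As a rough illustration of the Phase 1 contact-map step, the sketch below thresholds pairwise atom distances at 5 Å. The function name `contact_map` and the toy coordinates are assumptions made for this example; the repository's own extraction code may handle atom selection, chains, and file parsing differently.

```python
import math


def contact_map(tcr_coords, epitope_coords, cutoff=5.0):
    """Return a binary matrix with 1 where a TCR atom and an epitope atom
    are strictly closer than `cutoff` (in Å), and 0 otherwise.

    Rows index TCR atoms, columns index epitope atoms.
    """
    return [
        [1 if math.dist(a, b) < cutoff else 0 for b in epitope_coords]
        for a in tcr_coords
    ]


# Toy 3D coordinates in Å (not taken from any crystal structure).
tcr_coords = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
epitope_coords = [(3.0, 0.0, 0.0), (0.0, 4.0, 3.0), (20.0, 0.0, 0.0)]

cmap = contact_map(tcr_coords, epitope_coords)
```

Matrices of this kind, computed from crystal structures, would serve as training targets for the contact predictor, which then predicts inter-molecular edges for sequence pairs that have no structure.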

Benchmark Results

G2VTCR was evaluated on three published benchmarks, with all baselines run on the same splits:

TChard (5-fold, epitope-hard splits)

| Method                         | AUROC | AUPRC |
|--------------------------------|-------|-------|
| Random                         | 0.488 | 0.978 |
| SVM (BLOSUM + physicochemical) | 0.751 | 0.992 |
| MLP (BLOSUM + physicochemical) | 0.798 | 0.989 |
| CNN (NetTCR-2.0 style)         | 0.861 | 0.992 |
| G2VTCR                         | 0.855 | 0.994 |
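
When reading these numbers, it helps to remember that AUPRC depends strongly on class balance: an uninformative scorer gets an AUPRC close to the fraction of positives, while its AUROC stays near 0.5. The sketch below is a plain-Python illustration of both metrics on toy data with 95% positives. The class ratio is an assumption chosen for the example, not a figure reported for TChard, and the functions are simplified (ties in average precision are not handled the way scikit-learn handles them).

```python
import random


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative
    (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy test set: 95 positives, 5 negatives, scored by a random "model".
rng = random.Random(0)
labels = [1] * 95 + [0] * 5
scores = [rng.random() for _ in labels]

random_auroc = auroc(labels, scores)
random_ap = average_precision(labels, scores)
```

With this class ratio, average precision stays above 0.84 for any ordering of the scores, which is why AUROC is the more discriminating column when positives dominate.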

ePytope-TCR (mutation sensitivity)

| Method                 | AUROC |
|------------------------|-------|
| SVM                    | 0.515 |
| CNN (NetTCR-2.0 style) | 0.545 |
| MLP                    | 0.553 |
| G2VTCR                 | 0.577 |
| Published: TITAN       | 0.57  |
| Published: TULIP       | 0.59  |

IMMREP23 (unseen peptides)

| Method              | AUROC     | AUPRC              |
|---------------------|-----------|--------------------|
| All methods         | 0.48–0.55 | 0.17–0.24          |
| G2VTCR              | 0.529     | 0.237 (best AUPRC) |
| Published 1st place | 0.86      |                    |

Installation

pip install torch torch-geometric rdkit-pypi scikit-learn pandas numpy
pip install -e .

Usage

Run the full benchmark pipeline

# All 3 benchmarks (TChard + IMMREP23 + ePytope-TCR)
python run_train.py

# Single benchmark
python run_train.py --benchmarks tchard
python run_train.py --benchmarks epytope

# Quick smoke test (5 epochs, few structures)
python run_train.py --quick --benchmarks tchard

# Skip structural pre-training (use saved checkpoint)
python run_train.py --skip-pretrain

Run on HPC with GPU

bash run_hpc.sh

Use as a library

from g2vtcr.models import AtomGraphBuilder, ComplexGraphBuilder, ContactPredictor, ComplexGNN

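# Build a complex graph for one CDR3β/epitope pair.
# `model` must be a trained ContactPredictor instance
# (for example, one restored from the Phase 1 pre-training checkpoint).
builder = ComplexGraphBuilder()
graph = builder.build("CASSLAPGATNEKLFF", "GILGFVFTL", contact_predictor=model)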

# Or use the full training pipeline
from g2vtcr.train import run_pipeline
results = run_pipeline(benchmarks=('tchard',), device='cuda')

Project Structure

g2vtcr/
├── models/
│   ├── complex_gnn.py        # GATv2-based binding classifier
│   ├── contact_predictor.py   # Cross-attention contact prediction
│   └── graph_builder.py       # Atom-level graph construction (RDKit)
├── train.py                   # Training pipeline (Phase 1 + 2 + evaluation)
├── baselines.py               # SVM, MLP, CNN (NetTCR-2.0), Random baselines
├── data_processing.py         # Legacy v1 data utilities
├── embedding.py               # Legacy v1 WL + Doc2Vec embedding
├── clustering.py              # Legacy v1 DBSCAN clustering
├── classification.py          # Legacy v1 RandomForest classification
└── pattern_finding.py         # Legacy v1 TF-IDF motif extraction
run_train.py                   # CLI entry point
run_hpc.sh                     # HPC launch script
results/                       # Benchmark results and reports

Data

Training data is not included in this repository due to size. The pipeline automatically downloads benchmark datasets on first run:

  • TCR3d: ~350 TCR-pMHC crystal structures for contact predictor pre-training
  • TChard: 5 epitope-hard splits (~228K train / ~40K test each)
  • IMMREP23: IMMREP 2023 challenge dataset (22.6K train / 3.5K test)
  • ePytope-TCR: Mutation sensitivity benchmark (1.3K train / 4.2K test)
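
For readers unfamiliar with epitope-hard evaluation, the idea is that epitopes in the test set never appear in training, so a model cannot succeed by memorizing epitopes. The sketch below shows that constraint on toy rows. The function `epitope_hard_split`, the held-out set, and every row except the first pair (which reuses the sequences from the usage example) are made up for illustration; the actual TChard splits are downloaded as published and may be built with additional criteria.

```python
def epitope_hard_split(rows, test_epitopes):
    """Split (cdr3b, epitope, label) rows so that no test epitope
    appears anywhere in the training set."""
    train = [row for row in rows if row[1] not in test_epitopes]
    test = [row for row in rows if row[1] in test_epitopes]
    return train, test


# Toy rows; sequences other than the first pair and all labels are arbitrary.
rows = [
    ("CASSLAPGATNEKLFF", "GILGFVFTL", 1),
    ("CASSAAAAAEQYF", "GILGFVFTL", 0),
    ("CASSLAPGATNEKLFF", "AAAAAAAAA", 0),
    ("CASSGGGGGEQYF", "AAAAAAAAA", 1),
]

train, test = epitope_hard_split(rows, test_epitopes={"AAAAAAAAA"})

# Sanity check: the two sets share no epitope.
train_epitopes = {row[1] for row in train}
held_out_epitopes = {row[1] for row in test}
```

Note that the same CDR3β may still occur on both sides of the split; only the epitopes are held out.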

License

GNU General Public License v3.0

Contact

Shen Lab, Columbia University
Zicheng Wang: zw2595@cumc.columbia.edu
GitHub Issues
