YangLabHKUST/STAT-agent

STAT — Spatial Transcriptomics Analytical agenT

Ask in natural language, get a planned, verified, and executed analysis of spatial omics data.

Installation

Stable release from PyPI:

pip install stat-agent

With the full set of analysis skill dependencies (squidpy, scvi-tools, torch, liana, cell2location, …):

pip install "stat-agent[skills]"

Some skills require packages that aren't on PyPI or that conflict with other packages; install these separately as needed:

# STAGATE (requires PyG + matching torch_geometric/torch_sparse/torch_scatter wheels)
pip install git+https://github.com/QIFEIDKN/STAGATE_pyG.git
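
After installing, it can help to see which of the optional skill dependencies are actually importable in the current environment. The following is a minimal sketch, not part of STAT: the function name `check_skill_dependencies` is hypothetical, and the module list covers only the packages named above (note that scvi-tools is imported as `scvi`).

```python
from importlib.util import find_spec

# Import names for the optional dependencies listed above.
# scvi-tools is imported as "scvi"; the others match their package names.
SKILL_MODULES = ["squidpy", "scvi", "torch", "liana", "cell2location"]


def check_skill_dependencies(modules=SKILL_MODULES):
    """Return {module_name: bool} indicating whether each module is importable."""
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, available in check_skill_dependencies().items():
        print(f"{name:15s} {'ok' if available else 'MISSING'}")
```

This only checks that a module can be located; it does not verify versions or that compiled extensions load correctly.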

Quick start

Web interface

stat-web                    # serves on http://localhost:8889
# or
./start_web.sh              # also starts a Jupyter Lab alongside

In the UI:

  1. Enter the path to your dataset directory.
  2. Configure your LLM provider and paste an API key.
  3. Click Load Dataset.
  4. Ask questions in the chat panel:
    • "Annotate cell types using the breast-cancer reference."
    • "Find spatially variable genes."
    • "Show CD8A expression in slice 1."
    • "Run RCTD deconvolution and overlay the dominant cell type."

Data format

STAT auto-detects your data layout. Place files in a single directory.

Single-slice

dataset/
├── tissue.h5ad          # Required: AnnData with x, y in obs
└── he.tif               # Optional: H&E image (pixel coords = cell coords)

Multi-slice

dataset/
├── tissue_slice_0.h5ad
├── he_slice_0.tif
├── tissue_slice_1.h5ad
└── he_slice_1.tif

Multi-omics (gene + protein)

dataset/
├── tissue.h5ad          # Gene expression
├── tissue_protein.h5ad  # Protein expression
├── he.tif
└── protein_CD3.tif
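
The three layouts above differ only in their file names, so a directory can be sanity-checked before loading it. The sketch below is an illustration based solely on the naming patterns shown here; it is not STAT's actual detection logic, and `classify_layout` is a hypothetical helper.

```python
import re
from pathlib import Path


def classify_layout(dataset_dir):
    """Guess the layout of a dataset directory from its .h5ad file names."""
    names = sorted(p.name for p in Path(dataset_dir).glob("*.h5ad"))
    if not names:
        raise FileNotFoundError(f"no .h5ad files found in {dataset_dir}")
    # Multi-slice: tissue_slice_0.h5ad, tissue_slice_1.h5ad, ...
    if any(re.fullmatch(r"tissue_slice_\d+\.h5ad", n) for n in names):
        return "multi-slice"
    # Multi-omics: gene and protein expression side by side
    if "tissue.h5ad" in names and "tissue_protein.h5ad" in names:
        return "multi-omics"
    if "tissue.h5ad" in names:
        return "single-slice"
    return "unknown"


if __name__ == "__main__":
    import sys

    print(classify_layout(sys.argv[1] if len(sys.argv) > 1 else "dataset"))
```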

Coordinate convention. Cell coordinates (x, y) in adata.obs map directly to image pixel (x, y); no coordinate transformation is applied. Note the array indexing swap: the image array element img[y, x] corresponds to the cell at (x, y).
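
The indexing swap is easy to get wrong, so here is a small illustration using nested lists as a stand-in for an image array. `pixel_at` is a hypothetical helper, and rounding fractional coordinates to the nearest pixel is an assumption made for this sketch.

```python
def pixel_at(img, x, y):
    """Return the image value under a cell located at (x, y).

    Images are stored row-major, so the row index (y) comes first.
    Rounding to the nearest pixel is an assumption of this sketch.
    """
    return img[int(round(y))][int(round(x))]


# A tiny image with 2 rows and 3 columns. A NumPy array indexes the same way:
# img[y][x] there is equivalent to img[y, x].
img = [
    [10, 11, 12],  # y = 0
    [20, 21, 22],  # y = 1
]

print(pixel_at(img, x=2, y=1))  # 22
```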

Required AnnData fields: adata.obs['x'], adata.obs['y'], and the expression matrix adata.X. adata.obs['celltype'] is optional — annotation skills will populate it.
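
A quick pre-flight check for the required fields might look like the following. This is a sketch rather than STAT's own validation; `missing_required_fields` is a hypothetical name, and the function relies only on the `obs.columns` and `X` attributes that AnnData objects expose.

```python
REQUIRED_OBS_COLUMNS = ("x", "y")


def missing_required_fields(adata):
    """Return the required fields that are absent from an AnnData-like object."""
    missing = [
        f"obs['{col}']"
        for col in REQUIRED_OBS_COLUMNS
        if col not in adata.obs.columns
    ]
    if getattr(adata, "X", None) is None:
        missing.append("X")
    return missing


# Usage with a real dataset (requires the anndata package):
#   import anndata as ad
#   adata = ad.read_h5ad("dataset/tissue.h5ad")
#   print(missing_required_fields(adata))  # an empty list means nothing is missing
```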

Built-in skills

Skills are auto-discovered from stat_agent/skills/{slug}/SKILL.md. Each skill carries metadata (modalities, data level, prerequisites) and a templated code body. The current catalog:

Cell type annotation

| Skill | Summary |
| --- | --- |
| Cell Type Annotation with scANVI | Annotate cell types in spatial transcriptomics data using scANVI transfer learning from a reference scRNA-seq dataset. |
| Fast Cell Type Annotation (Clustering + LLM) | Annotate cell types using unsupervised clustering, marker genes, and LLM-based annotation. |
| Cell Type Annotation via Spatial Mapping (Tangram) | Map single-cell reference annotations onto spatial transcriptomics data using Tangram deep learning alignment. |

Spot deconvolution

| Skill | Summary |
| --- | --- |
| Cell Type Deconvolution (RCTD) | Perform cell type deconvolution (or annotation on spot) on spatial transcriptomics data (Visium spots) using RCTD with a single-cell refere… |
| Bayesian Cell Type Deconvolution (Cell2location) | Reference-based Bayesian deconvolution of spot-level spatial transcriptomics using Cell2location. |
| Fast Spot Deconvolution (FlashDeconv) | Ultra-fast reference-based cell type deconvolution for spot-level spatial data using FlashDeconv. |

Spatial domains

| Skill | Summary |
| --- | --- |
| Spatial Domain Detection (SpaGCN) | Identify spatial domains in spot-level spatial transcriptomics data using SpaGCN, integrating gene expression, spatial location, and H&E hi… |
| Spatial Domain Detection (STAGATE) | Identify spatial domains using STAGATE (Spatial-Transcriptomics Graph Attention Auto-Encoder). |
| Spatial Domain Detection (GraphST) | Identify spatial domains in spot-level data using GraphST (Graph Self-supervised Transformer). |

Spatial statistics & niches

| Skill | Summary |
| --- | --- |
| Spatial Statistics Analysis | Compute spatial statistics including Moran's I (spatial autocorrelation of genes), Ripley's K (spatial point pattern of cell types), co-occ… |
| Neighborhood Enrichment Analysis | Compute neighborhood enrichment z-scores to identify which cell types are spatially co-localized or depleted from each other's neighborhood… |
| Spatial Niche Detection | Identify spatial cellular niches using Harmonics hierarchical model. |
| Spatially Variable Genes (SpatialDE) | Identify spatially variable genes using SpatialDE Gaussian process regression. |

Differential expression & pathway

| Skill | Summary |
| --- | --- |
| Differential Gene Expression Analysis | Find differentially expressed marker genes between groups using scanpy rank_genes_groups with Wilcoxon test. |
| GO Enrichment Analysis | Find enriched Gene Ontology (GO) terms for a user-provided gene list. |
| Over-Representation & Pathway Enrichment Analysis (ORA) | Test whether a gene list is enriched for specific pathways or gene sets using Over-Representation Analysis (Fisher's exact test). |
| Per-Cell Pathway Activity Scoring (ssGSEA) | Compute per-cell pathway activity scores using single-sample Gene Set Enrichment Analysis (ssGSEA). |
| Two-Group Pathway Enrichment Comparison | Compare pathway / gene-set enrichment between two user-provided gene lists (typically markers of two cell populations, clusters, or conditi… |

Cell-cell communication

| Skill | Summary |
| --- | --- |
| Cell-Cell Communication Analysis (LIANA+) | Analyze cell-cell communication using LIANA+ to identify significant ligand-receptor interactions between cell types. |
| Cell-Cell Communication Analysis (CellPhoneDB) | Analyze cell-cell communication using CellPhoneDB statistical method to identify significant ligand-receptor interactions between cell type… |

Multi-slice integration

| Skill | Summary |
| --- | --- |
| Batch Integration (Harmony) | Integrate multiple spatial transcriptomics slices using Harmony batch correction. |
| Batch Integration (BBKNN) | Correct batch effects across multiple slices using BBKNN (Batch Balanced K-Nearest Neighbors). |
| Batch Integration (Scanorama) | Correct batch effects across multiple slices using Scanorama panoramic stitching. |

Slice alignment & registration

| Skill | Summary |
| --- | --- |
| Spatial Alignment (STalign) | Align two cell-level spatial transcriptomics slices using STalign. |
| Slice Registration (PASTE) | Align multiple spatial transcriptomics slices using PASTE (Probabilistic Alignment of ST Experiments). |

CNV inference & trajectory

| Skill | Summary |
| --- | --- |
| Expression-based CNV Inference (infercnvpy) | Infer copy number variations (CNVs) from gene expression data using infercnvpy. |
| Pseudotime Trajectory Analysis (Palantir / DPT) | Infer cell developmental trajectories and pseudotime ordering using expression-based methods. |

Adding a new skill. Create stat_agent/skills/<your-slug>/SKILL.md with YAML frontmatter (name, title, description, filter_requirements, prerequisites, optional default_skill), then write the analysis instructions and code template in the body. The registry will pick it up at startup.
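
To make the file layout concrete, the sketch below scaffolds a SKILL.md skeleton and reads back its frontmatter keys. It is an illustration only and does not use STAT's registry code. The helper names are hypothetical, reusing the slug as `name` is an assumption, and so are the placeholder value shapes for `filter_requirements` and `prerequisites`; check an existing skill in stat_agent/skills/ for the real schema.

```python
from pathlib import Path

# Frontmatter keys named in the paragraph above (default_skill is optional).
FRONTMATTER_KEYS = ["name", "title", "description",
                    "filter_requirements", "prerequisites"]


def write_skill_skeleton(skills_root, slug, title, description):
    """Create <skills_root>/<slug>/SKILL.md with placeholder frontmatter."""
    skill_dir = Path(skills_root) / slug
    skill_dir.mkdir(parents=True, exist_ok=True)
    lines = [
        "---",
        f"name: {slug}",  # assumption: name matches the directory slug
        f"title: {title}",
        f"description: {description}",
        "filter_requirements: {}  # placeholder; the value shape is an assumption",
        "prerequisites: []  # placeholder; the value shape is an assumption",
        "---",
        "",
        "Analysis instructions and the code template go here.",
        "",
    ]
    path = skill_dir / "SKILL.md"
    path.write_text("\n".join(lines), encoding="utf-8")
    return path


def frontmatter_keys(skill_md):
    """Return the top-level keys between the first pair of '---' lines."""
    lines = Path(skill_md).read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return []
    keys = []
    for line in lines[1:]:
        if line.strip() == "---":
            break
        # Skip indented (nested) lines and comment lines.
        if ":" in line and not line.startswith((" ", "#")):
            keys.append(line.split(":", 1)[0].strip())
    return keys
```

For example, `write_skill_skeleton("stat_agent/skills", "my-skill", "My Skill", "One-line summary.")` would create stat_agent/skills/my-skill/SKILL.md, and `frontmatter_keys` on that path can confirm that no required key was left out.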

LLM providers

STAT supports five providers via a unified LLMBackend. In the web UI's Configure LLM panel, pick a Provider from the dropdown, then type the bare Model ID as it appears at that provider's API — no prefix needed. (Older saved configs that include a prefix like anthropic/… still work for backward compatibility.)

For programmatic use, export the corresponding environment variable before launching stat-web. Every model ID below has been verified end-to-end against the live provider API.

| Provider | Where to get a key | Env var | Default model | Other verified IDs |
| --- | --- | --- | --- | --- |
| OpenAI | https://platform.openai.com/api-keys | OPENAI_API_KEY | gpt-5.4 | gpt-5.5, gpt-4o |
| Anthropic | https://console.anthropic.com/settings/keys | ANTHROPIC_API_KEY | claude-opus-4-7 | claude-opus-4-6, claude-sonnet-4-6 |
| Google Gemini | https://aistudio.google.com/app/apikey | GOOGLE_API_KEY | gemini-3.1-pro-preview | gemini-2.5-pro |
| DeepSeek | https://platform.deepseek.com/api_keys | DEEPSEEK_API_KEY | deepseek-v4-pro | deepseek-v4-flash |
| Poe (multi-model gateway) | https://poe.com/api_key | POE_API_KEY | claude-sonnet-4.5 | claude-opus-4.7, gpt-5.5, gemini-3.1-pro, deepseek-v4-pro-el |
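
Before launching stat-web from a script, it may be useful to confirm which provider keys are present in the environment. The sketch below uses only the environment variable names from the table above; `configured_providers` is a hypothetical helper and does not call STAT or any provider API.

```python
import os

# Environment variable for each provider, as listed in the table above.
PROVIDER_ENV_VARS = {
    "OpenAI": "OPENAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "Google Gemini": "GOOGLE_API_KEY",
    "DeepSeek": "DEEPSEEK_API_KEY",
    "Poe": "POE_API_KEY",
}


def configured_providers(env=None):
    """Return the providers whose API-key variable is set and non-empty."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_ENV_VARS.items() if env.get(var)]


if __name__ == "__main__":
    found = configured_providers()
    print("Configured providers:", ", ".join(found) if found else "none")
```

A key being set says nothing about whether it is valid; that still has to be checked against the provider itself.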

Poe caveat. claude-opus-4.6 and claude-sonnet-4.6 on Poe force extended-thinking on the bot side and are not yet supported through STAT — use claude-opus-4.7 instead, or switch to the direct Anthropic provider.

Tip. For long-context analysis (multi-slice integration, large reference profiles), prefer models with 200 k+ context: claude-opus-4-7, claude-opus-4-6, gpt-5.5, gemini-3.1-pro-preview.

Verify before a long run. Use the Test Connection button in the Configure LLM panel — it sends a one-token round-trip through the same LLMBackend code path as the agent and reports the exact error if anything is off.

Reproducing the paper

The analyses, figures, and benchmarks from the STAT paper live in a separate repository: https://github.com/chenyhvvvv/STAT-PaperRepro

License

BSD-3-Clause © STAT contributors.
