A comprehensive benchmarking framework for evaluating foundation models on histopathology ROI datasets.
HistoROIBench provides a standardized evaluation pipeline for testing various pretrained models on pathology image ROI classification tasks. The framework supports multiple state-of-the-art pathology image encoder models and offers a complete workflow from feature extraction to multi-task evaluation.
- 🎯 Multi-Model Support: Integrates 20+ state-of-the-art pathology image encoders
- 🔬 Multi-Task Evaluation: Supports 5 different evaluation paradigms (Linear Probe, KNN, Proto, Few-shot, Zero-shot)
- 🚀 Efficient Pipeline: Pre-extract features to avoid redundant computation
- 📊 Unified Metrics: Standardized evaluation metrics output for easy model comparison
- 🛠️ Easy Extension: Modular design for easy addition of new models
The framework supports 20+ pretrained patch encoders, all loadable via the `encoder_factory()` function; models are selected with the `--model_name` argument during feature extraction.
Mainstream Pathology Models
| Model | Model Name | Link |
|---|---|---|
| CONCH (v1) | `conch_v1` | MahmoodLab/CONCH |
| CONCH (v1.5) | `conch_v15` | MahmoodLab/conchv1_5 |
| UNI (v1) | `uni_v1` | MahmoodLab/UNI |
| UNI (v2) | `uni_v2` | MahmoodLab/UNI2-h |
| CTransPath | `ctranspath` | MahmoodLab/hest-bench |
| Phikon (v1) | `phikon` | owkin/phikon |
| Phikon (v2) | `phikon_v2` | owkin/phikon-v2 |
| Virchow (v1) | `virchow` | paige-ai/Virchow |
| Virchow (v2) | `virchow2` | paige-ai/Virchow2 |
| GigaPath | `gigapath` | prov-gigapath/prov-gigapath |
| H-Optimus (0) | `hoptimus0` | bioptimus/H-optimus-0 |
| H-Optimus (1) | `hoptimus1` | bioptimus/H-optimus-1 |
| Hibou-L | `hibou_l` | histai/hibou-L |
| MUSK | `musk` | xiangjx/musk |
Kaiko Series
| Model | Model Name | Link |
|---|---|---|
| Kaiko-ViT-S/8 | `kaiko-vits8` | Kaiko Models |
| Kaiko-ViT-S/16 | `kaiko-vits16` | Kaiko Models |
| Kaiko-ViT-B/8 | `kaiko-vitb8` | Kaiko Models |
| Kaiko-ViT-B/16 | `kaiko-vitb16` | Kaiko Models |
| Kaiko-ViT-L/14 | `kaiko-vitl14` | Kaiko Models |
Lunit Series
| Model | Model Name | Link |
|---|---|---|
| Lunit-ViT-S/8 | `lunit-vits8` | 1aurent/lunit_dino |
General Vision Models
| Model | Model Name | Link |
|---|---|---|
| ResNet50 | `resnet50` | timm/resnet50 |
Note: Configure the corresponding model weight paths in `model_utils/model_weights.json` before use. If a model's weight path is empty, the framework attempts to download the weights automatically from the Hugging Face Hub (internet connection required). Models that require extra packages return error messages with installation instructions. Gated models on Hugging Face require an access request.
The framework supports the following 5 evaluation tasks:
Linear Probe
Trains a linear classifier on top of a frozen feature extractor to evaluate feature quality.
Use Cases:
- Evaluate discriminative power of pretrained features
- Quick model selection
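The idea behind linear probing can be sketched in a few lines of pure Python — a toy logistic-regression probe trained by gradient descent on hand-made 2-D "features". This is only an illustration; the framework's actual implementation (PyTorch/sklearn-based) differs in detail:

```python
import math

def train_linear_probe(feats, labels, lr=0.5, epochs=200):
    """Toy binary linear probe: logistic regression via SGD.
    feats: frozen, pre-extracted feature vectors; labels: 0/1."""
    dim = len(feats[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            g = p - y                         # gradient of the log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Linearly separable toy "features": only the linear layer is trained
train_x = [[0.0, 1.0], [0.2, 0.8], [1.0, 0.1], [0.9, 0.3]]
train_y = [0, 0, 1, 1]
w, b = train_linear_probe(train_x, train_y)
acc = sum(predict(w, b, x) == y for x, y in zip(train_x, train_y)) / len(train_y)
print(acc)
```

High accuracy here means the frozen features are linearly separable, which is exactly what the linear-probe task measures.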
KNN
Classification using the K-nearest-neighbors algorithm, with no training required.
Use Cases:
- Evaluate feature clustering performance
- Non-parametric evaluation
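A minimal sketch of the KNN paradigm on toy features (the framework's implementation operates on the extracted `.pt` features and uses `--n_neighbors`; this pure-Python version just shows the mechanism):

```python
from collections import Counter
import math

def knn_predict(train_feats, train_labels, query, k=3):
    """Classify a query feature by majority vote among its k nearest
    training features (Euclidean distance) — no training step."""
    dists = sorted(
        (math.dist(query, x), y) for x, y in zip(train_feats, train_labels)
    )
    top_k = [y for _, y in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

feats = [[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 1.1]]
labels = ["stroma", "stroma", "tumor", "tumor"]
print(knn_predict(feats, labels, [0.95, 1.0], k=3))  # majority of neighbors are "tumor"
```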
Proto
Classification based on class prototypes (centroids).
Use Cases:
- Few-shot learning scenarios
- Class-balanced evaluation
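The prototype paradigm reduces each class to the centroid of its features and classifies by nearest centroid. A minimal sketch (toy data; the framework works on the extracted feature tensors):

```python
import math

def build_prototypes(feats, labels):
    """Average the features of each class into a prototype (centroid)."""
    sums, counts = {}, {}
    for x, y in zip(feats, labels):
        s = sums.setdefault(y, [0.0] * len(x))
        sums[y] = [a + b for a, b in zip(s, x)]
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def proto_predict(prototypes, query):
    """Assign the query to the class with the nearest prototype."""
    return min(prototypes, key=lambda y: math.dist(prototypes[y], query))

feats = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [0.8, 1.0]]
labels = ["normal", "normal", "tumor", "tumor"]
protos = build_prototypes(feats, labels)
print(proto_predict(protos, [0.9, 0.9]))  # nearest centroid is "tumor"
```

Because every class is represented by exactly one centroid regardless of its sample count, the decision rule is naturally class-balanced.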
Few-shot
Simulates few-shot learning scenarios to test model generalization capability.
Use Cases:
- Data-scarce scenarios
- N-way K-shot evaluation
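An N-way K-shot episode samples N classes and K support examples per class; the remaining samples of those classes form the query set. A sketch of the episode-sampling step (the framework's episode logic, controlled by `--n_way`/`--n_shot`/`--n_iter`, may differ in detail):

```python
import random

def sample_episode(labels, n_way, n_shot, rng):
    """Sample one N-way K-shot episode: pick n_way classes, then n_shot
    support indices per class; the rest of those classes become queries."""
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    classes = rng.sample(sorted(by_class), n_way)
    support, query = [], []
    for c in classes:
        idxs = by_class[c][:]
        rng.shuffle(idxs)
        support.extend(idxs[:n_shot])
        query.extend(idxs[n_shot:])
    return classes, support, query

rng = random.Random(42)
labels = ["TUM"] * 5 + ["STR"] * 5 + ["NORM"] * 5
classes, support, query = sample_episode(labels, n_way=2, n_shot=1, rng=rng)
print(len(classes), len(support), len(query))  # 2 2 8
```

Averaging a classifier's accuracy over many such episodes gives the mean ± std reported for the few-shot task.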
Zero-shot
Zero-shot classification using text-image alignment capabilities (requires multimodal support).
Use Cases:
- Open-vocabulary classification
- Cross-domain generalization evaluation
Dataset Preparation → Feature Extraction → Multi-Task Evaluation → Result Analysis
Use `00-ROI_Feature_Extract.py` to extract image features from datasets.
Dataset Parameters:
```
--dataset_split_csv   # Path to dataset split CSV file (required)
                      # CSV columns: image path, label, split (train/test)
--class2id_txt        # Path to class-to-ID mapping file (required)
                      # Format: one class name per line or "id:class_name"
--dataset_name        # Dataset name used when saving feature files
```

Model Parameters:
```
--model_name    # Model name (see the supported models list)
--resize_size   # Image resize size, default: 448
```

Inference Parameters:
```
--batch_size    # Batch size, default: 256
--num_workers   # Number of data loading workers, default: 8
--device        # Device, e.g., 'cuda:0' or 'cpu'
```

Save Path:
```
--save_dir      # Directory to save extracted features
```

Example:

```bash
python 00-ROI_Feature_Extract.py \
    --dataset_split_csv /path/to/dataset.csv \
    --class2id_txt /path/to/classes.txt \
    --dataset_name CAMEL \
    --model_name conch_v1 \
    --resize_size 448 \
    --batch_size 256 \
    --num_workers 8 \
    --device cuda:0 \
    --save_dir ./ROI_Features
```

Output Files:

```
Dataset_[dataset_name]_Model_[model_name]_Size_[size]_train.pt
Dataset_[dataset_name]_Model_[model_name]_Size_[size]_test.pt
```
Use `01-ROI_BenchMark_Main.py` to run various evaluation tasks.
General Parameters:
```
--TASK                 # Task list, comma-separated (required)
                       # Options: Linear-Probe,KNN,Proto,Few-shot,Zero-shot
--class2id_txt         # Path to class mapping file
--train_feature_file   # Path to training feature file (required)
--test_feature_file    # Path to test feature file (required)
--val_feature_file     # Path to validation feature file (optional)
--log_dir              # Directory to save results
--log_description      # Experiment description
--device               # Computing device, default: cuda (if available)
```

Linear Probe Parameters:
```
--max_iteration   # Maximum iterations, default: 1000
--use_sklearn     # Use sklearn's logistic regression, default: False
```

KNN & Proto Parameters:
```
--n_neighbors   # Number of neighbors for KNN, default: 20
```

Few-shot Parameters:
```
--n_iter        # Number of few-shot episodes, default: 100
--use_all_way   # Use all classes, default: True
--n_way         # N-way settings, comma-separated, default: '2,3,4,5,6,7,8,9,10'
--n_shot        # K-shot settings, comma-separated
                # Default: '1,2,4,8,16,32,64,128,256'
```

Zero-shot Parameters:
```
--zeroshot_model_name    # Model name for zero-shot (must support text encoding)
--zeroshot_prompt_file   # Path to prompt file
                         # Format: one complete prompt per line for each class
--zeroshot_batch_size    # Batch size, default: 32
--num_workers            # Number of data loader workers, default: 4
```

1. Run Single Task:
```bash
python 01-ROI_BenchMark_Main.py \
    --TASK Linear-Probe \
    --train_feature_file ./ROI_Features/Dataset_[CAMEL]_Model_[conch_v1]_Size_[448]_train.pt \
    --test_feature_file ./ROI_Features/Dataset_[CAMEL]_Model_[conch_v1]_Size_[448]_test.pt \
    --class2id_txt /path/to/classes.txt \
    --log_dir ./results \
    --device cuda:0
```

2. Run Multiple Tasks:
```bash
python 01-ROI_BenchMark_Main.py \
    --TASK Linear-Probe,KNN,Proto,Few-shot \
    --train_feature_file ./ROI_Features/train.pt \
    --test_feature_file ./ROI_Features/test.pt \
    --class2id_txt /path/to/classes.txt \
    --log_dir ./results \
    --n_neighbors 20 \
    --n_way 2,3,4,5 \
    --n_shot 1,2,4,8,16 \
    --device cuda:0
```

3. Run Zero-shot Task:
```bash
python 01-ROI_BenchMark_Main.py \
    --TASK Zero-shot \
    --train_feature_file ./ROI_Features/train.pt \
    --test_feature_file ./ROI_Features/test.pt \
    --class2id_txt /path/to/classes.txt \
    --zeroshot_model_name conch_v1 \
    --zeroshot_prompt_file /path/to/prompts.txt \
    --log_dir ./results \
    --device cuda:0
```

4. Run All Tasks:
```bash
python 01-ROI_BenchMark_Main.py \
    --TASK Linear-Probe,KNN,Proto,Few-shot,Zero-shot \
    --train_feature_file ./ROI_Features/train.pt \
    --test_feature_file ./ROI_Features/test.pt \
    --class2id_txt /path/to/classes.txt \
    --zeroshot_model_name conch_v1 \
    --zeroshot_prompt_file /path/to/prompts.txt \
    --log_dir ./results \
    --max_iteration 1000 \
    --n_neighbors 20 \
    --n_way 2,3,4,5 \
    --n_shot 1,2,4,8,16 \
    --device cuda:0
```

Evaluation results will be saved in the directory specified by `--log_dir`, organized by task type:
```
log_dir/
├── Linear-Probe/
│   ├── Linear-Probe_detailed_results.csv    # Per-sample predictions with probabilities
│   └── Linear-Probe_complete_results.json   # Complete metrics and confusion matrix
├── KNN/
│   ├── KNN_detailed_results.csv
│   └── KNN_complete_results.json
├── Proto/
│   ├── Proto_detailed_results.csv
│   └── Proto_complete_results.json
├── Few-shot/
│   ├── way_2/
│   │   ├── Fewshot_2way_1shot_detailed_results.csv
│   │   ├── Fewshot_2way_1shot_complete_results.json
│   │   ├── Fewshot_2way_1shot_per_episode_metrics.json
│   │   ├── Fewshot_2way_1shot_few_shot_results.json
│   │   └── ...
│   └── ...
└── Zero-shot/
    ├── Zero-shot_detailed_results.csv
    └── Zero-shot_complete_results.json
```
File Descriptions:
- `*_detailed_results.csv`: Per-sample predictions with columns:
  - `img_name`: Image file name (if available)
  - `true_label`: True class label
  - `predicted_label`: Predicted class label
  - `probabilities`: Probability distribution over all classes
- `*_complete_results.json`: Comprehensive evaluation metrics:
  - `task_name`: Task identifier
  - `metrics`: Dictionary with accuracy, balanced_accuracy, precision, recall, f1_score, auroc, etc.
  - `confusion_matrix`: Confusion matrix as a 2D array
  - `num_samples`: Total number of samples
  - `num_classes`: Number of classes
  - `additional_info`: Task-specific additional information
- `*_per_episode_metrics.json` (Few-shot only): Metrics for each individual episode
- `*_few_shot_results.json` (Few-shot only): Aggregated few-shot metrics with mean and std across episodes
Each task's metrics file contains:
- Accuracy
- Balanced Accuracy
- Precision, Recall, F1 Score
- Confusion Matrix
- ROC-AUC (if applicable)
- Detailed per-sample prediction probabilities
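Because the per-sample predictions are plain CSV, downstream analysis is straightforward. A hedged sketch with the standard-library `csv` module, using an inline stand-in for a `*_detailed_results.csv` file (column names as described above; the real files may carry extra columns):

```python
import csv
import io

# Toy stand-in for a *_detailed_results.csv file.
sample = """img_name,true_label,predicted_label,probabilities
roi_001.tif,TUM,TUM,"[0.91, 0.09]"
roi_002.tif,NORM,TUM,"[0.55, 0.45]"
roi_003.tif,NORM,NORM,"[0.12, 0.88]"
"""

# For a real file, replace io.StringIO(sample) with open(path, newline="").
rows = list(csv.DictReader(io.StringIO(sample)))
correct = sum(r["true_label"] == r["predicted_label"] for r in rows)
print(f"accuracy: {correct / len(rows):.2f}")  # accuracy: 0.67
```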
Use `02-Bootstrap_Statistical_Analysis.py` to perform Bootstrap statistical analysis on evaluation results and calculate confidence intervals for metrics.
```
--results_dir    # Path to results directory (default: ./results)
                 # Should contain experiment subdirectories with task results
--n_bootstrap    # Number of Bootstrap samples (default: 1000)
--random_state   # Random seed for reproducibility (default: 42)
```

Basic Usage:
```bash
python 02-Bootstrap_Statistical_Analysis.py \
    --results_dir ./results \
    --n_bootstrap 1000 \
    --random_state 42
```

Custom Parameters:
```bash
python 02-Bootstrap_Statistical_Analysis.py \
    --results_dir /path/to/results \
    --n_bootstrap 1000 \
    --random_state 42
```

The script will:
- Traverse all experiments in the results directory
- Process each task (Linear-Probe, KNN, Proto) that has `*_detailed_results.csv` files
- Calculate Bootstrap confidence intervals (95% CI by default) for all metrics
- Save results to JSON files
Output Structure:
```
results/
├── bootstrap_ci_summary.json                # Summary of all experiments
└── {experiment_name}/
    ├── Linear-Probe/
    │   └── Linear-Probe_bootstrap_ci.json   # Bootstrap CI for Linear-Probe
    ├── KNN/
    │   └── KNN_bootstrap_ci.json            # Bootstrap CI for KNN
    └── Proto/
        └── Proto_bootstrap_ci.json          # Bootstrap CI for Proto
```
Bootstrap CI JSON Format:
Each *_bootstrap_ci.json file contains:
```json
{
  "n_samples": 216912,
  "n_classes": 2,
  "n_bootstrap": 1000,
  "ci_level": 0.95,
  "metrics": {
    "acc": {
      "value": 0.6991,
      "ci_lower": 0.6978,
      "ci_upper": 0.7004,
      "std": 0.0007,
      "n_valid_bootstrap": 1000
    },
    "macro_auc": {
      "value": 0.7331,
      "ci_lower": 0.7315,
      "ci_upper": 0.7347,
      "std": 0.0008,
      "n_valid_bootstrap": 1000
    },
    ...
  }
}
```

Metrics Included:
- `acc`: Accuracy
- `bacc`: Balanced Accuracy
- `macro_f1`, `weighted_f1`, `micro_f1`: F1 Scores
- `macro_precision`, `weighted_precision`: Precision
- `macro_recall`, `weighted_recall`: Recall
- `macro_auc`, `weighted_auc`, `micro_auc`: ROC-AUC
- `macro_auprc`, `weighted_auprc`, `micro_auprc`: Average Precision
- `brier_score`: Brier Score
- `ece`, `mce`: Calibration Errors
- `quadratic_kappa`, `linear_kappa`: Cohen's Kappa
Features:
- ✅ Skip existing results: Automatically skips processing if output JSON already exists
- ✅ Batch processing: Processes all experiments and tasks in one run
- ✅ Comprehensive metrics: Calculates CI for all evaluation metrics
- ✅ Reproducible: Uses random seed for consistent results
Notes:
- The script only processes tasks that have `*_detailed_results.csv` files
- If a `*_bootstrap_ci.json` file already exists, it is skipped (useful for incremental processing)
- Bootstrap sampling provides robust statistical inference for model comparison
This framework provides complete example datasets in the `example_dataset/` directory, demonstrating the required file formats and structure.
The example dataset is based on the CRC-100K dataset and includes the following files:
```
example_dataset/
├── CRC-100K.csv                      # Dataset split file
├── CRC-100K.txt                      # Class mapping file
└── CRC-100K-Zero_Shot_Prompts.txt    # Zero-shot prompt file
```
The dataset split CSV file should contain the following columns:
Train/Validation/Test Separated Format
```
train_path,train_label,val_path,val_label,test_path,test_label
/path/to/train1.tif,8,,,/path/to/test1.tif,8.0
/path/to/train2.tif,8,,,/path/to/test2.tif,8.0
...
```

In general, no validation set is used for ROI benchmark evaluation.
Example: see `example_dataset/CRC-100K.csv`.
Description:
- `image_path`/`train_path`/`test_path`: Absolute or relative path to image files
- `label`/`train_label`/`test_label`: Class label (integer, starting from 0)
- `split`: Dataset split identifier (train/val/test)
- Empty columns are used as placeholders (e.g., when the validation set is empty)
`class2id_txt` file format (both formats supported):

```
ClassName,ID
class_name_1,0
class_name_2,1
class_name_3,2
```
Example: see `example_dataset/CRC-100K.txt`:

```
NORM,0
STR,1
TUM,2
MUS,3
MUC,4
LYM,5
DEB,6
BACK,7
ADI,8
```
Description:
- Class names can be abbreviations (e.g., `TUM`) or full names (e.g., `Tumor`)
- IDs must be consecutive integers starting from 0
- The order must correspond to the labels in the dataset CSV
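A small parser for the `ClassName,ID` format shown above can make the rules explicit — blank lines are skipped and IDs are checked for being consecutive from 0. This is a sketch; the framework's own loader may differ and also accepts the `id:class_name` variant:

```python
def load_class2id(lines):
    """Parse 'ClassName,ID' lines into a {name: id} mapping,
    skipping blank lines and '#' comments."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, idx = line.rsplit(",", 1)
        mapping[name] = int(idx)
    # IDs must be consecutive integers starting from 0
    assert sorted(mapping.values()) == list(range(len(mapping)))
    return mapping

lines = ["NORM,0", "STR,1", "TUM,2", "", "# comment"]
print(load_class2id(lines))  # {'NORM': 0, 'STR': 1, 'TUM': 2}
```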
One complete prompt per line, one per class, in the same order as `class2id_txt`:
```
A histopathology image showing class_name_1
A histopathology image showing class_name_2
A histopathology image showing class_name_3
```
Example: see `example_dataset/CRC-100K-Zero_Shot_Prompts.txt`:

```
This is a pathology image showing normal tissue characteristics
This is a pathology image showing stroma tissue characteristics
This is a pathology image showing tumor tissue characteristics
This is a pathology image showing muscle tissue characteristics
This is a pathology image showing mucosa tissue characteristics
This is a pathology image showing lymphocytes tissue characteristics
This is a pathology image showing debris tissue characteristics
This is a pathology image showing background tissue characteristics
This is a pathology image showing adipose tissue characteristics
```
Description:
- Each line corresponds to one class, order must exactly match the class mapping file
- Recommend using descriptive prompts that include tissue type characteristics
- Empty lines and comment lines starting with `#` are ignored
- Prompt quality directly affects zero-shot performance; domain-specific terminology is recommended
Before use, configure model weight paths in `model_utils/model_weights.json`:
```json
{
    "conch_v1": "/path/to/conch_v1/pytorch_model.bin",
    "uni_v1": "/path/to/uni_v1/weights.pth",
    "phikon": "/path/to/phikon/checkpoint.pth",
    ...
}
```

If a model's weight path is an empty string `""`, the framework attempts to download the weights automatically from the Hugging Face Hub (internet connection required).
When adding a new model to the framework, note that multimodal (vision-language) models need one extra step to support zero-shot evaluation: the model class must implement the `run_zero_shot` method, which performs classification from text prompts without any training examples.
Method Signature:

```python
def run_zero_shot(self, texts, image_features: torch.Tensor, device: str):
    """
    Perform zero-shot classification using text prompts and image features.

    Args:
        texts: List of text prompts (one per class)
        image_features: Pre-extracted image features tensor [N, D]
        device: Device string (e.g., 'cuda:0' or 'cpu')

    Returns:
        probs: Classification probabilities tensor [N, num_classes]
    """
```

You can refer to the `Conchv1InferenceEncoder` implementation in `model_utils/model_factory.py` for a complete example:
```python
class Conchv1InferenceEncoder(BasePatchEncoder):
    # ... initialization code ...

    def _from_text_to_embeddings(self, texts, device: str):
        """Convert text prompts to embeddings."""
        from .model_zoo.conch.open_clip_custom import tokenize

        tokenized_prompts = tokenize(texts=texts, tokenizer=self.tokenizer)
        tokenized_prompts = tokenized_prompts.to(device)
        text_features = self.model.encode_text(tokenized_prompts)
        return text_features

    def run_zero_shot(self, texts, image_features: torch.Tensor, device: str):
        """Perform zero-shot classification."""
        from torch.nn import functional as F

        image_features = image_features.to(device)
        text_features = self._from_text_to_embeddings(texts, device)
        logit_scale = self.model.logit_scale.exp()
        similarity = torch.matmul(image_features, text_features.T) * logit_scale
        probs = F.softmax(similarity, dim=-1)
        return probs.detach()
```

Key Implementation Steps:
- Text Encoding: Convert text prompts to embeddings using the model's text encoder
- Feature Alignment: Ensure image features and text features are on the same device
- Similarity Calculation: Compute similarity between image features and text features (often using cosine similarity with a learned temperature scale)
- Probability Conversion: Apply softmax to convert similarities to classification probabilities
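The steps above can be illustrated without torch, on toy unit-norm features and a hypothetical fixed logit scale (real models learn the scale and produce high-dimensional embeddings):

```python
import math

def zero_shot_probs(image_feats, text_feats, logit_scale=100.0):
    """Similarity + softmax (steps 3-4 above) in plain Python:
    scaled dot products between each image feature and the per-class
    text features, turned into class probabilities."""
    def softmax(xs):
        m = max(xs)
        exps = [math.exp(x - m) for x in xs]
        s = sum(exps)
        return [e / s for e in exps]
    return [
        softmax([logit_scale * sum(a * b for a, b in zip(img, txt))
                 for txt in text_feats])
        for img in image_feats
    ]

# One image feature, two class prompts (toy 2-D vectors)
image_feats = [[1.0, 0.0]]
text_feats = [[0.9, 0.1], [0.1, 0.9]]   # class 0, class 1
probs = zero_shot_probs(image_feats, text_feats)
print(probs[0][0] > probs[0][1])  # image aligns with class 0
```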
Notes:
- The `image_features` parameter contains pre-extracted features from the feature extraction stage
- The `texts` parameter is a list of prompts, one per class, in the same order as `class2id_txt`
- The method should return a tensor of shape `[N, num_classes]`, where N is the number of images
- If your model doesn't support zero-shot (e.g., vision-only models), you don't need to implement this method, and zero-shot evaluation will be skipped
- Choose an Appropriate Image Size: Different models have different recommended input sizes (224, 448, etc.); refer to the original model papers
- Adjust Batch Size: Set `--batch_size` according to GPU memory; feature extraction can use larger batches
- Few-shot Parameter Settings: Ensure `--n_shot` does not exceed the smallest class sample count
- Zero-shot Model Selection: Only multimodal models (e.g., CONCH) can perform zero-shot evaluation
- Multi-task Evaluation: Extract features first, then run multiple tasks on the same feature files to save time
Here's a complete example workflow:
Step 1: Extract Features
```bash
# Extract features using CONCH v1
python 00-ROI_Feature_Extract.py \
    --dataset_split_csv ./example_dataset/CRC-100K.csv \
    --class2id_txt ./example_dataset/CRC-100K.txt \
    --dataset_name CRC-100K \
    --model_name conch_v1 \
    --resize_size 448 \
    --batch_size 128 \
    --device cuda:0 \
    --save_dir ./example_dataset/features
```

Step 2: Run Comprehensive Evaluation
```bash
# Run linear-probe, KNN, and prototype evaluation
python 01-ROI_BenchMark_Main.py \
    --TASK Linear-Probe,KNN,Proto \
    --train_feature_file ./example_dataset/features/Dataset_[CRC-100K]_Model_[conch_v1]_Size_[448]_train.pt \
    --test_feature_file ./example_dataset/features/Dataset_[CRC-100K]_Model_[conch_v1]_Size_[448]_test.pt \
    --class2id_txt ./example_dataset/CRC-100K.txt \
    --log_dir ./results/CRC-100K_conch_v1 \
    --log_description "Comprehensive evaluation of CONCH v1 on CRC-100K dataset" \
    --device cuda:0
```

Step 3: Bootstrap Statistical Analysis
```bash
# Perform Bootstrap analysis to calculate confidence intervals
python 02-Bootstrap_Statistical_Analysis.py \
    --results_dir ./results \
    --n_bootstrap 1000 \
    --random_state 42
```

Step 4: Analyze Results
```
# Results will be saved in ./results/CRC-100K_conch_v1/
# Each task folder contains:
# - Detailed metrics and predictions (*_detailed_results.csv)
# - Complete results (*_complete_results.json)
# - Bootstrap confidence intervals (*_bootstrap_ci.json)
#
# Summary of all experiments: ./results/bootstrap_ci_summary.json
```

Linear Probe
Freezes the feature extractor and trains only a linear classifier. This task evaluates how linearly separable the learned features are.
Advantages: Fast, simple, good indicator of feature quality
Metrics: Accuracy, Balanced Accuracy, F1 Score, Confusion Matrix
KNN
Non-parametric classification using K-nearest neighbors in feature space.
Advantages: No training required, intuitive
Metrics: Accuracy, Balanced Accuracy, per-class performance
Proto
Classifies samples based on distance to class centroids (prototypes).
Advantages: Works well with imbalanced datasets, interpretable
Metrics: Accuracy, Balanced Accuracy, prototype distances
Few-shot
Evaluates generalization with limited labeled samples (N-way K-shot).
Advantages: Tests model robustness in low-data regimes
Metrics: Average accuracy across episodes, standard deviation
Zero-shot
Classifies without any training examples using text descriptions.
Advantages: No labeled data needed, open-vocabulary capability
Requirements: Model must support text encoding (multimodal)
Metrics: Accuracy, per-class performance
Contributions, issues, and feature requests are welcome!
This project benefits from the following excellent open-source projects and tools:
- TRIDENT - A toolkit for large-scale whole-slide image processing developed by Mahmood Lab, which provided inspiration and reference for our work
- Timm - PyTorch Image Models library
- HuggingFace - Platform for hosting and distributing pretrained models
- All authors and contributors of the open-source models
We thank the community for their continuous contributions and support!
This project follows the licenses of the respective models. Please ensure compliance with each model's usage terms before use.
If you use this framework in your research, please cite the relevant model papers and this repository.
For questions or suggestions, please submit an Issue or contact the project maintainers.
Happy Benchmarking! 🎉