Repository for multimodal action recognition using synthetic data from Sims4Action, with domain adaptation for real-world datasets (Toyota Smarthome, ETRI-Activity3D).
This repository contains implementations for two papers:
Multimodal Domain Generation for Synthetic-to-Real Activity Recognition
- Trains only on synthetic Sims4Action data
- Generates novel modalities via adversarial domain generation
- Evaluates on real Toyota Smarthome and ETRI-Activity3D datasets
- Domain generator code in `L2A-OT/`
Unsupervised Modality Selection for Cross-Domain Action Recognition
- Selects beneficial modalities via correlation and MMD thresholds
- Adds YOLO object detections as a 5th modality
- Late fusion strategies (Borda Count, Sum, Product)
```bash
git clone <repo-url>
pip install torch torchvision numpy pandas tqdm tensorboard opencv-python pillow
```

Requirements:
- PyTorch >= 1.7
- numpy, pandas, tensorboard, tqdm
- OpenCV (for video processing)
Expected structure for each modality:
```
dataset_root/
├── <Action>/
│   └── <video_id>/
│       ├── heatmaps.avi
│       ├── limbs.avi
│       ├── optical_flow.avi
│       └── rgb.avi
```
| Modality | Model | Input | Description |
|---|---|---|---|
| Heatmaps (H) | S3D | 1 ch | Gaussian maps at AlphaPose joint locations |
| Limbs (L) | S3D | 1 ch | Lines connecting skeleton joints |
| Optical Flow (OF) | S3D | 3 ch | Farneback algorithm (HSV encoded) |
| RGB | S3D | 3 ch | Original video frames |
| YOLO | MLP | 80-dim | Reciprocal distance vector to detected objects (ModSelect only) |
The MMGen paper trains action classifiers on synthetic data with generated novel domains to improve generalization to real-world data.
The domain generator transforms source modalities (top row) into novel synthetic domains (bottom row), creating a more diverse training set:
H: Heatmaps, L: Limbs, OF: Optical Flow. The model trains on all 8 modalities (4 source + 4 novel) and is evaluated on real data.
The approach uses three networks:
- Frozen Classifier (AC_f): Pre-trained S3D action classifier, frozen during domain generation training
- Task Classifier (AC): S3D trained on both source and novel modalities
- Domain Generator (DG): Transforms source modalities into novel synthetic domains
Additionally, a Domain Classifier (DC, a ResNet18) is trained to distinguish modalities; its frozen feature space provides the Sinkhorn distance loss for training DG.
- Phase 1 - Train DC: The domain classifier learns to distinguish source modalities (4-way classification: H, L, OF, RGB). This creates a feature space that separates domains.
- Phase 2 - Train DG + AC: DC is frozen (eval mode). DG and AC train jointly:
  - DG generates novel domains from source modalities
  - The Sinkhorn distance (computed using DC's frozen features) measures how different novel domains are from source domains
  - DG maximizes this distance (novelty loss) while AC learns to classify actions on both source and novel domains
  - AC_f (frozen) ensures generated domains preserve action semantics
Why not alternating? DC is not an adversary—it's a fixed domain distance metric. DG doesn't try to "fool" DC; instead, DG uses DC's frozen features to measure domain novelty. A stable feature space is crucial for meaningful Sinkhorn distances.
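For orientation, here is a minimal sketch of the two-phase schedule. All names are placeholders rather than this repository's API: `dc`, `dg`, `ac`, `ac_frozen`, `dc_features`, `sinkhorn`, the loaders, and the optimizers are assumptions for illustration; the cycle and diversity terms are omitted for brevity.

```python
import torch.nn.functional as F

def train_dc(dc, loader, opt):
    """Phase 1: 4-way source-modality classification (H, L, OF, RGB)."""
    dc.train()
    for clips, domain_labels in loader:
        loss = F.cross_entropy(dc(clips), domain_labels)
        opt.zero_grad()
        loss.backward()
        opt.step()

def train_dg_ac(dg, ac, ac_frozen, dc_features, sinkhorn, loader, opt):
    """Phase 2: DC is frozen; DG and AC train jointly."""
    for clips, actions, domains in loader:
        novel = dg(clips, domains)  # translate source clips to a novel domain
        # Novelty: maximize the Sinkhorn distance in DC's frozen feature
        # space (minimize its negative).
        loss_novel = -sinkhorn(dc_features(clips), dc_features(novel))
        # Frozen AC_f keeps the action semantics of generated domains intact.
        loss_sem = F.cross_entropy(ac_frozen(novel), actions)
        # AC learns to classify actions on both source and novel domains.
        loss_ac = (F.cross_entropy(ac(clips), actions)
                   + F.cross_entropy(ac(novel), actions))
        loss = loss_ac + loss_sem + loss_novel  # cycle/diversity terms omitted
        opt.zero_grad()
        loss.backward()
        opt.step()
```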
The paper evaluates all 15 modality combinations. Below we show the full 4-modality example.
Prerequisites for L2A-OT: Download ResNet18 pretrained weights for the domain classifier:
```bash
cd L2A-OT/checkpoints
wget https://download.pytorch.org/models/resnet18-5c106cde.pth
```

Train an S3D classifier using early fusion (channel concatenation) on Sims4Action:
```bash
python main.py --gpu 0 1 \
    --dataset sims_video_multimodal \
    --modalities heatmaps limbs optical_flow rgb \
    --dataset_roots /path/to/sims/heatmaps /path/to/sims/limbs /path/to/sims/optical_flow /path/to/sims/rgb \
    --n_modalities 4 \
    --n_channels_each_modality 1 1 3 3 \
    --n_channels 8 \
    --epochs 200 \
    --img_dim 112 \
    --seq_len 16
```

This produces a checkpoint at `experiments/<exp>/model/model_best_val_acc.pth.tar`, used as the frozen action classifier AC_f.
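Early fusion here simply concatenates the modalities along the channel axis before the S3D stem; a minimal sketch with assumed `(B, C, T, H, W)` clip tensors:

```python
import torch

# Assumed per-modality clip tensors of shape (B, C, T, H, W):
heatmaps = torch.randn(2, 1, 16, 112, 112)      # 1 channel
limbs = torch.randn(2, 1, 16, 112, 112)         # 1 channel
optical_flow = torch.randn(2, 3, 16, 112, 112)  # 3 channels
rgb = torch.randn(2, 3, 16, 112, 112)           # 3 channels

# Channel concatenation: 1 + 1 + 3 + 3 = 8 channels (--n_channels 8).
fused = torch.cat([heatmaps, limbs, optical_flow, rgb], dim=1)
print(fused.shape)  # torch.Size([2, 8, 16, 112, 112])
```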
Train all components in a single run. The script first trains DC for `num_iterations_D` iterations, then jointly trains DG and AC for `num_iterations_G` iterations:
```bash
cd L2A-OT
python main_SIMS_S3D.py \
    --gpu 0 1 \
    --modalities heatmaps limbs optical_flow rgb \
    --modality_indices 0 1 2 3 \
    --dataset_roots /path/to/sims/heatmaps /path/to/sims/limbs /path/to/sims/optical_flow /path/to/sims/rgb \
    --dataset_roots_test /path/to/adl/heatmaps /path/to/adl/limbs /path/to/adl/optical_flow /path/to/adl/rgb \
    --pretrained_model_C ../experiments/<exp>/model/model_best_val_acc.pth.tar \
    --num_iterations_D 1000 \
    --num_iterations_G 30000 \
    --test_every 500 \
    --save_img_every 1000 \
    --exp_tag GAN_h_l_of_rgb \
    --batch_size 6

# Parameter mapping:
# --pretrained_model_C → AC_f (frozen action classifier)
# --num_iterations_D   → DC training iterations
# --num_iterations_G   → DG + AC joint training iterations
```

The frozen action classifier AC_f (`--pretrained_model_C`) provides the classification loss that ensures generated domains preserve action semantics.
Evaluate the trained task classifier on Toyota Smarthome or ETRI:
```bash
cd L2A-OT
python main_SIMS_S3D.py \
    --gpu 0 1 \
    --modalities heatmaps limbs optical_flow rgb \
    --modality_indices 0 1 2 3 \
    --dataset_roots /path/to/sims/heatmaps /path/to/sims/limbs /path/to/sims/optical_flow /path/to/sims/rgb \
    --dataset_roots_test /path/to/adl/heatmaps /path/to/adl/limbs /path/to/adl/optical_flow /path/to/adl/rgb \
    --pretrained_model_DGC checkpoints/GAN_h_l_of_rgb/best_val_DGC.tar \
    --pretrained_model_G checkpoints/GAN_h_l_of_rgb/G_iteration_30000.pth \
    --test_classifier_only \
    --exp_tag eval_GAN_h_l_of_rgb

# Parameter mapping:
# --pretrained_model_DGC → AC (action classifier / task model)
# --pretrained_model_G   → DG (domain generator)
```

Outputs balanced and unbalanced accuracy on both source and novel domains.
The paper tests all 15 combinations. Examples:
```bash
# Single modality: Limbs only
--modalities limbs --modality_indices 1 --n_channels 1

# Two modalities: Heatmaps + Limbs
--modalities heatmaps limbs --modality_indices 0 1 --n_channels 2

# Three modalities: H + L + OF
--modalities heatmaps limbs optical_flow --modality_indices 0 1 2 --n_channels 5
```

The domain generator DG is trained with:
- Novelty Loss: Maximizes Sinkhorn distance between source and novel modality distributions
- Diversity Loss: Maximizes Sinkhorn distance between different novel modalities
- Classification Loss: Novel modalities should be correctly classified by frozen AC_f
- Cycle Loss: Reconstruction consistency (DG(DG(x)) ≈ x)
L_DG = λ_c * L_class + λ_r * L_cycle - λ_d * (L_novelty + L_diversity)
Default: λ_c = λ_d = 1, λ_r = 10
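As a sketch, the weighted combination above can be written directly; the individual loss tensors are placeholders assumed to be computed elsewhere:

```python
# Default weights from above: λ_c = λ_d = 1, λ_r = 10.
LAMBDA_C, LAMBDA_D, LAMBDA_R = 1.0, 1.0, 10.0

def dg_loss(l_class, l_cycle, l_novelty, l_diversity):
    # Minimizing this objective maximizes the Sinkhorn-based novelty and
    # diversity terms, since they enter with a minus sign.
    return (LAMBDA_C * l_class
            + LAMBDA_R * l_cycle
            - LAMBDA_D * (l_novelty + l_diversity))
```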
- Input size: 112 × 112
- Sequence length: 16 frames
- Video chunks: 90 frames
- Optimizer: Adam (lr=1e-4, β1=0.5, β2=0.999)
- Weight decay: 5e-5
- Pre-training: 200 epochs
- Joint training: ~50 epochs equivalent
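For reference, these settings correspond to the following standard PyTorch optimizer configuration (`model` is a placeholder):

```python
import torch

def make_optimizer(model):
    # Matches the hyperparameters listed above.
    return torch.optim.Adam(model.parameters(), lr=1e-4,
                            betas=(0.5, 0.999), weight_decay=5e-5)
```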
Training produces:
- `checkpoints/<exp_tag>/G_iteration_<N>.pth`: Domain generator (DG) weights
- `checkpoints/<exp_tag>/best_val_DGC.tar`: Best action classifier (AC)
- `results/<exp_tag>/*.jpg`: Sample generated images
- `runs/<exp_tag>/`: TensorBoard logs
The t-SNE visualization below shows how the domain generator learns to produce novel modalities that are distinct from the source modalities, effectively diversifying the training distribution:
Each color represents a different modality. Source and novel modalities form separate clusters, indicating the generator has learned to produce diverse but semantically consistent domains.
The ModSelect paper proposes an unsupervised modality selection method that identifies beneficial modalities without requiring target domain labels. It trains unimodal classifiers, computes prediction correlations and embedding MMD between modalities, and uses these metrics to select modalities that improve late fusion performance.
- Train unimodal S3D classifiers on each modality (H, L, OF, RGB) and an MLP on YOLO detection vectors
- Evaluate all 31 modality combinations with late fusion strategies
- ModSelect (unsupervised): Compute prediction correlation ρ and MMD between classifier embeddings to select beneficial modalities via Winsorized Mean thresholds
YOLO Representation: For each frame, an 80-dimensional vector v is computed where v[i] is the reciprocal Euclidean distance between the person's bounding box center and the i-th detected object's center. The vector is normalized: v ← v/||v||. Objects closer to the person have larger weights.
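A minimal sketch of this encoding, assuming detections arrive as `(class_id, cx, cy)` tuples with the 80 dimensions indexed by COCO class, and summing contributions when a class is detected more than once (the repository's exact preprocessing may differ):

```python
import numpy as np

def yolo_vector(person_center, detections, n_classes=80, eps=1e-6):
    """Reciprocal-distance encoding: v[i] weights class i by 1 / distance."""
    v = np.zeros(n_classes, dtype=np.float32)
    px, py = person_center
    for class_id, cx, cy in detections:
        dist = np.hypot(cx - px, cy - py)
        v[class_id] += 1.0 / (dist + eps)  # closer objects -> larger weight
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Example: a nearby object (e.g., COCO class 41) dominates a distant one (class 56).
print(yolo_vector((50, 50), [(41, 55, 52), (56, 200, 180)]))
```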
Train S3D classifiers for image-based modalities:
```bash
# Heatmaps
python main.py --gpu 0 1 --dataset sims_video --modality heatmaps \
    --n_channels 1 --epochs 200 --dataset_video_root /path/to/heatmaps

# Limbs
python main.py --gpu 0 1 --dataset sims_video --modality limbs \
    --n_channels 1 --epochs 200 --dataset_video_root /path/to/limbs

# Optical Flow
python main.py --gpu 0 1 --dataset sims_video --modality optical_flow \
    --n_channels 3 --epochs 200 --dataset_video_root /path/to/optical_flow

# RGB
python main.py --gpu 0 1 --dataset sims_video --modality rgb \
    --n_channels 3 --epochs 200 --dataset_video_root /path/to/rgb
```

Train MLP for YOLO detection vectors:
```bash
# YOLO (MLP on detection vectors)
# Requires both RGB videos (for video indexing) and precomputed YOLO detections
python main.py --gpu 0 1 \
    --dataset YOLO_detections_only \
    --model_vid YOLO_mlp \
    --epochs 200 \
    --dataset_video_root /path/to/rgb \
    --detections_root /path/to/yolo_detections

# Detection files expected at: /path/to/yolo_detections/<Action>/<video_id>/detections.csv
```

Before late fusion, test each trained classifier to generate CSV files with predictions and embeddings:
```bash
# Test each modality (repeat for all 5 modalities)
python main.py --gpu 0 1 --test_only \
    --dataset sims_video --modality heatmaps --n_channels 1 \
    --pretrained_model experiments/<exp_heatmaps>/model/model_best_val_acc.pth.tar \
    --eval_dataset_root /path/to/target/heatmaps

# For YOLO:
python main.py --gpu 0 1 --test_only \
    --dataset YOLO_detections_only --model_vid YOLO_mlp \
    --pretrained_model experiments/<exp_yolo>/model/model_best_val_acc.pth.tar \
    --dataset_video_root /path/to/target/rgb \
    --detections_root /path/to/target/yolo_detections
```

This produces (in `experiments/<exp>/logs/`):
- `results_test_*.csv` — Top-5 predictions per sample (for late fusion)
- `*_embeddings.npy` — Classifier embeddings (for MMD computation)
- `*_scores.npy` — Raw class scores (for correlation computation)
Combine predictions from multiple modalities using voting strategies. The fusion scripts expect CSV files with columns: vid_id, label, pred1, pred2, pred3, pred4, pred5.
Borda Count Voting:
```bash
python utils/late_fusion_borda_count.py \
    --csv_roots results_h.csv results_l.csv results_of.csv results_rgb.csv \
    --modalities heatmaps limbs optical_flow rgb
```

With save path (for the multimodal version):
```bash
python utils/late_fusion_borda_count_multimodal.py \
    --csv_roots results_h.csv results_l.csv results_of.csv results_rgb.csv results_yolo.csv \
    --modalities heatmaps limbs optical_flow rgb yolo \
    --save_path results_fused.txt
```

The paper evaluates 6 strategies. This repository implements:
| Strategy | Script | Status |
|---|---|---|
| Borda Count | `utils/late_fusion_borda_count.py` | ✓ Implemented |
| Sum | `utils/late_fusion_sum_square_multimodal.py` | ✓ Implemented |
| Squared Sum | `utils/late_fusion_sum_square_multimodal.py` | ✓ Implemented |
| Product | — | Not implemented |
| Maximum | — | Not implemented |
| Median | — | Not implemented |
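As a sketch, Borda-count fusion over the top-5 CSV format described above (columns `vid_id`, `label`, `pred1`..`pred5`) can be written as follows; the exact weighting in `late_fusion_borda_count.py` may differ:

```python
import pandas as pd
from collections import defaultdict

def borda_fuse(csv_paths):
    """Fuse per-modality top-5 predictions with Borda count (rank 1 -> 5 points)."""
    frames = [pd.read_csv(p) for p in csv_paths]
    fused, correct = [], 0
    for rows in zip(*(df.itertuples() for df in frames)):  # same sample order assumed
        points = defaultdict(int)
        for row in rows:  # one row per modality
            for rank, col in enumerate(["pred1", "pred2", "pred3", "pred4", "pred5"]):
                points[getattr(row, col)] += 5 - rank
        winner = max(points, key=points.get)
        fused.append(winner)
        correct += int(winner == rows[0].label)
    print(f"fused accuracy: {correct / len(fused):.3f}")
    return fused

# borda_fuse(["results_h.csv", "results_l.csv", "results_of.csv", "results_rgb.csv"])
```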
The paper proposes selecting modalities based on:
- Prediction Correlation ρ(m,n): Classifiers are more likely to agree on correct predictions than on wrong ones, so a high correlation between the predictions of modalities m and n indicates that both are beneficial
- MMD between embeddings: A lower Maximum Mean Discrepancy between two classifiers' embedding distributions indicates better agreement between the modalities
Thresholds: The Winsorized Mean (λ=0.2) is used to compute selection thresholds. Modalities are selected if they meet either criterion (high ρ OR low MMD).
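A sketch of the resulting selection rule, assuming ρ and MMD have already been computed for each modality pair; the winsorized mean uses `scipy.stats.mstats.winsorize`, and details may differ from the paper's implementation:

```python
import numpy as np
from scipy.stats import mstats

def winsorized_mean(values, lam=0.2):
    """Mean after replacing the lowest/highest lam fraction with the cut values."""
    return float(mstats.winsorize(np.asarray(values), limits=[lam, lam]).mean())

def select_modalities(rho, mmd):
    """rho, mmd: dicts mapping modality pairs (m, n) -> metric value."""
    t_rho = winsorized_mean(list(rho.values()))
    t_mmd = winsorized_mean(list(mmd.values()))
    selected = set()
    for pair in rho:
        if rho[pair] > t_rho or mmd[pair] < t_mmd:  # high ρ OR low MMD
            selected.update(pair)
    return selected
```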
The utils/modselect_analysis/ directory contains scripts for computing MMD, visualizing embeddings, and analyzing the relationship between domain discrepancy and performance.
After testing classifiers (Step 2 above), compute mean embedding vectors for each modality:
```bash
python utils/modselect_analysis/mean_embedding.py \
    --embeddings_path experiments/<exp>/logs/results_test_*_embeddings.npy \
    --save_path results/<dataset>/<modality>/mean_vec.npy
```

Compute MMD between all modality pairs using mean embeddings:

```bash
python utils/modselect_analysis/mmd_table.py --datasets Sims Toyota
# Outputs: MMD_tables/mmd_Sims_Toyota.npy, .svg, .pdf
```

The MMD is computed as the Euclidean distance between mean embedding vectors (linear kernel). YOLO is excluded due to its different embedding dimensionality.
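With a linear kernel, the MMD between two embedding sets reduces to the Euclidean distance between their means, so a table entry can be reproduced roughly as follows (the file paths are assumptions):

```python
import numpy as np

def linear_mmd(emb_a, emb_b):
    """MMD with a linear kernel = Euclidean distance between mean embeddings."""
    return float(np.linalg.norm(emb_a.mean(axis=0) - emb_b.mean(axis=0)))

# Assumed file layout; adjust to the saved *_embeddings.npy files.
h = np.load("results/Sims/heatmaps/test_embeddings.npy")    # (N, D)
t = np.load("results/Toyota/heatmaps/test_embeddings.npy")  # (M, D)
print(linear_mmd(h, t))
```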
Generate t-SNE visualizations of classifier embeddings:
```bash
python utils/modselect_analysis/tsne.py \
    --embeddings_path results/Sims/heatmaps/*_embeddings.npy \
    --labels_path results/Sims/*_labels.npy \
    --save_path tsne_plots/sims_heatmaps.svg
```

Plot the relationship between domain discrepancy metrics and late fusion performance:
```bash
# Correlation vs Performance comparison
python utils/modselect_analysis/line_plot_performance_discrepancy.py

# MMD/Energy Distance vs Performance comparison
python utils/modselect_analysis/line_plot_similarity_discrepancy.py
```

These scripts expect precomputed data in the `correlations/`, `MMD_tables/`, and `performance/` directories.
| Script | Purpose |
|---|---|
| `mean_embedding.py` | Compute mean embedding vector from saved embeddings |
| `mmd_table.py` | Compute and visualize MMD matrix between modalities |
| `energy_dist.py` | Compute energy distance between embedding distributions |
| `tsne.py` | Generate t-SNE visualizations of embeddings |
| `line_plot_performance_discrepancy.py` | Plot correlation vs late fusion performance |
| `line_plot_similarity_discrepancy.py` | Plot MMD/energy distance vs performance |
- Action classes: 10 (shared subset between Sims4Action, Toyota Smarthome, ETRI)
- Evaluation metric: Mean per-class accuracy (balanced accuracy)
- Late fusion operates on class probability scores
- YOLO not included in MMD experiments (different embedding size from S3D)
- Energy distance requires the `dcor` library
```bash
python utils/generate_skeletons_heatmaps.py \
    --root_dir /path/to/alphapose_results \
    --result_dir /path/to/output
```

Produces both the heatmaps and limbs modalities.
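For intuition, each heatmap frame is a sum of Gaussian bumps at the joint coordinates; a minimal sketch (the resolution and σ are assumptions, not the script's exact values):

```python
import numpy as np

def render_heatmap(joints, height=112, width=112, sigma=2.0):
    """Render one heatmap frame: a Gaussian bump at each (x, y) joint location."""
    ys, xs = np.mgrid[0:height, 0:width]
    frame = np.zeros((height, width), dtype=np.float32)
    for x, y in joints:  # AlphaPose joint coordinates, already scaled to the frame
        frame += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return np.clip(frame, 0.0, 1.0)

# Example: three joints of one person.
frame = render_heatmap([(40, 30), (56, 60), (72, 90)])
```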
```bash
python utils/generate_optical_flow.py \
    --root_dir /path/to/rgb_videos \
    --result_dir /path/to/output \
    --n_workers 3
```

Key `main.py` arguments:

| Argument | Description |
|---|---|
| `--dataset` | `sims_video`, `sims_video_multimodal`, `adl`, `YOLO_detections_only`, `sims_video_with_YOLO_detections` |
| `--model_vid` | `s3d`, `i3d`, `YOLO_mlp`, `s3d_yolo_fusion` |
| `--modality` | Single modality: `heatmaps`, `limbs`, `optical_flow`, `rgb` |
| `--modalities` | Multiple modalities for early fusion |
| `--n_channels` | Total input channels (sum over all modalities) |
| `--G_path` | Domain generator checkpoint for inference (MMGen) |
| `--split_policy` | `frac`, `cross-subject`, `cross-view-1`, `cross-view-2` |
| `--test_only` | Run evaluation only (no training) |
| `--pretrained_model` | Path to pretrained model checkpoint |
YOLO-specific arguments:

| Argument | Description |
|---|---|
| `--detections_root` | Root folder containing YOLO detection CSVs |
| `--yolo_arch` | MLP architecture: `SimpleNet`, `BaseNet`, `TanyaNet`, `PyramidNet`, `LongNet`, `LastNet` |
| `--pretrained_YOLO_mlp` | Pretrained YOLO MLP checkpoint |
| `--pretrained_s3d_yolo_fusion` | Pretrained S3D+YOLO fusion model |
| `--fine_tune_yolo_mlp` | Fine-tune the YOLO MLP submodel |
L2A-OT (MMGen) arguments:

| Argument | Description | Maps to |
|---|---|---|
| `--pretrained_model_C` | Frozen action classifier for classification loss | AC_f |
| `--pretrained_model_DGC` | Action classifier checkpoint | AC |
| `--pretrained_model_G` | Domain generator checkpoint | DG |
| `--num_iterations_D` | Domain classifier training iterations | DC |
| `--num_iterations_G` | Domain generator + action classifier training iterations | DG + AC |
| `--test_classifier_only` | Evaluation mode (no training) | — |
Monitor training with TensorBoard:

```bash
tensorboard --logdir experiments/<exp-folder>/logs
# or for L2A-OT:
tensorboard --logdir L2A-OT/runs/<exp_tag>
```

Repository structure:

```
├── main.py # Entry point for action classifier (AC) training
├── lib/ # Model architectures (S3D, I3D, YOLO MLP)
├── datasets/ # Data loaders
├── training/ # Training loops
├── testing/ # Evaluation scripts
├── utils/ # Preprocessing and fusion utilities
│ ├── late_fusion_*.py # Late fusion strategies (Borda, Sum, etc.)
│ ├── generate_*.py # Data preprocessing scripts
│ └── modselect_analysis/ # ModSelect analysis scripts
│ ├── mean_embedding.py # Mean embedding computation
│ ├── mmd_table.py # MMD matrix computation
│ ├── energy_dist.py # Energy distance computation
│ ├── tsne.py # t-SNE visualization
│ └── line_plot_*.py # Discrepancy vs performance plots
└── L2A-OT/ # Domain generation training (MMGen)
├── main_SIMS_S3D.py # Main training script (DC, DG, AC)
├── model.py # Domain generator (DG) architecture (StarGAN-style)
├── resnet.py # Domain classifier (DC) - ResNet18
├── lib/ # S3D backbone for AC and AC_f
    └── utils/ # Sinkhorn loss, augmentation
```
- MMGen is based on L2A-OT: Learning to Generate Novel Domains for Domain Generalization (ECCV 2020)
- S3D architecture from Rethinking Spatiotemporal Feature Learning (ECCV 2018)


