Hello,
I am currently training a land cover classification model with terramind_v1_base. From the documentation and the paper, TerraMind is trained with a dual-scale objective (pixel-level embeddings plus unimodal tokens from the FSQ-VAEs).
However, I built my model as follows and I am a bit confused about what is going on. My understanding is that this setup only uses pixel-level patch embeddings for all modalities (including NDVI), and that the unimodal tokenizer branch (FSQ-VAE tokens) is not used automatically.
```python
import terratorch.tasks
from terratorch.registry import BACKBONE_REGISTRY

# Standalone backbone build: pixel-level patch embeddings per modality.
model = BACKBONE_REGISTRY.build(
    "terramind_v1_base",
    modalities=["S2L2A", {"fc": 3}, "DEM", {"NDVI": 1}],
    pretrained=True,
)

# Classification task: the factory rebuilds the backbone from the backbone_* args.
model = terratorch.tasks.ClassificationTask(
    model_factory="EncoderDecoderFactory",
    model_args={
        "backbone": "terramind_v1_base",
        "backbone_pretrained": True,
        "backbone_modalities": ["S2L2A", {"fc": 3}, "DEM", {"NDVI": 1}],
        "necks": [
            {"name": "SelectIndices", "indices": [2, 5, 8, 11]},
            {"name": "ReshapeTokensToImage", "remove_cls_token": False},
        ],
        "decoder": "IdentityDecoder",
        "num_classes": 7,
        "head_dropout": 0.1,
    },
    loss="ce",
    optimizer="AdamW",
    lr=1e-4,
    optimizer_hparams={"weight_decay": 0.01},
    scheduler="ReduceLROnPlateau",
    scheduler_hparams={"factor": 0.5, "patience": 5},
    freeze_backbone=True,
    freeze_decoder=True,
    freeze_head=False,
    class_names=[
        "Forest", "Grassland", "Cropland", "Buildup",
        "Baresoil", "Water", "Mangroves",
    ],
)
```
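For context, this is the sanity check that led me to that conclusion. The input shapes are my guesses (e.g. 12 bands for S2L2A), and I am assuming the backbone's forward takes a dict keyed by modality name and returns one embedding tensor per block:

```python
import torch
from terratorch.registry import BACKBONE_REGISTRY

backbone = BACKBONE_REGISTRY.build(
    "terramind_v1_base",
    modalities=["S2L2A", {"fc": 3}, "DEM", {"NDVI": 1}],
    pretrained=True,
)
batch = {
    "S2L2A": torch.randn(1, 12, 224, 224),  # band count is my assumption
    "fc": torch.randn(1, 3, 224, 224),
    "DEM": torch.randn(1, 1, 224, 224),
    "NDVI": torch.randn(1, 1, 224, 224),
}
with torch.no_grad():
    feats = backbone(batch)
# What comes back looks like plain [B, num_patches, embed_dim] patch
# embeddings per block -- no FSQ-VAE token ids anywhere.
print(len(feats), feats[-1].shape)
```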
Thus, my questions are:
- Does terramind_v1_base, as built above, operate only on pixel-level patch embeddings (i.e., the tokenizers are never invoked)?
- If I want to leverage the dual-scale fusion (pixel + unimodal tokens) as described in the paper (which reportedly improves performance), how should I proceed in practice?
For example, would the right approach be the following (sketched in code below)?
- Build a pixel-level backbone with ["S2L2A", "DEM", "FC"].
- Separately encode NDVI (and other modalities if wanted) with FULL_MODEL_REGISTRY.build("terramind_v1_tokenizer_ndvi").
- Concatenate the resulting tokens with the patch embeddings before feeding them into the backbone (or at some neck stage).
- Train the fused model for classification.
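Concretely, something like the sketch below is what I have in mind. It is only a sketch: encode() is a placeholder for whatever method actually returns the FSQ-VAE token ids (I could not find the exact tokenizer API), and the embedding-table sizes are placeholders too, since the discrete ids would presumably need to be mapped into the backbone's embedding space before concatenation:

```python
import torch
import torch.nn as nn
from terratorch.registry import BACKBONE_REGISTRY, FULL_MODEL_REGISTRY

# Pixel-level branch for the raw modalities.
backbone = BACKBONE_REGISTRY.build(
    "terramind_v1_base",
    modalities=["S2L2A", "DEM", {"fc": 3}],
    pretrained=True,
)

# Unimodal FSQ-VAE tokenizer branch for NDVI.
ndvi_tokenizer = FULL_MODEL_REGISTRY.build(
    "terramind_v1_tokenizer_ndvi", pretrained=True
)

# Hypothetical embedding table mapping discrete token ids into the
# backbone's embedding space; both sizes are placeholders.
token_embed = nn.Embedding(num_embeddings=16384, embedding_dim=768)

def fused_features(batch: dict, ndvi: torch.Tensor) -> torch.Tensor:
    pixel_feats = backbone(batch)           # per-block [B, N, D] embeddings
    # Placeholder call: I don't know the tokenizer's actual encode method.
    ndvi_ids = ndvi_tokenizer.encode(ndvi)  # discrete FSQ-VAE token ids
    ndvi_emb = token_embed(ndvi_ids.flatten(1))          # [B, N_tok, D]
    # Fuse along the token axis at the last block (or at a neck stage).
    return torch.cat([pixel_feats[-1], ndvi_emb], dim=1)
```

The nn.Embedding here is just my guess at how the discrete ids would be fused; if the tokenizer already returns continuous embeddings, the lookup would be unnecessary.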