We introduce Motion-S, a novel framework for text-driven sign language motion generation. Motion-S adapts the generative masked modeling approach of MoMask to the domain of sign language, enabling high-quality 3D sign motion synthesis from natural language descriptions. Our approach employs a hierarchical residual vector quantization (RVQ) scheme to represent sign motions as multi-layer discrete tokens, preserving fine-grained details essential for accurate sign language expression. The framework consists of two key components: a Masked Transformer that generates base-layer motion tokens conditioned on text input through iterative masked token prediction, and a Residual Transformer that progressively refines the motion by predicting residual-layer tokens. This design enables efficient bidirectional generation of sign motions with precise semantic alignment to textual descriptions, making it suitable for applications in accessibility, education, and human-computer interaction.
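The iterative masked decoding performed by the Masked Transformer can be illustrated with a minimal, MaskGIT-style sketch: start from a fully masked token sequence, predict all positions each step, commit the most confident predictions, and re-mask the rest on a cosine schedule. The function and the dummy predictor below are hypothetical names for illustration, not the repository's actual API.

```python
import numpy as np

def iterative_masked_generation(predict_fn, seq_len, num_steps=4, mask_id=-1):
    """MaskGIT-style decoding sketch: begin fully masked, then on each
    step commit the highest-confidence predictions and re-mask the rest,
    shrinking the masked set on a cosine schedule."""
    tokens = np.full(seq_len, mask_id, dtype=int)
    for step in range(num_steps):
        probs = predict_fn(tokens)          # (seq_len, vocab_size) distributions
        pred = probs.argmax(axis=1)
        # confidence of each prediction; committed positions never re-mask
        conf = np.where(tokens == mask_id, probs.max(axis=1), np.inf)
        # cosine schedule: fraction of positions left masked after this step
        frac_masked = np.cos(np.pi / 2 * (step + 1) / num_steps)
        n_keep_masked = int(np.floor(frac_masked * seq_len))
        # commit every masked position, then re-mask the least confident ones
        tokens = np.where(tokens == mask_id, pred, tokens)
        if n_keep_masked > 0:
            tokens[np.argsort(conf)[:n_keep_masked]] = mask_id
    return tokens

# Hypothetical stand-in for the text-conditioned transformer's output.
rng = np.random.default_rng(1)
def dummy_predict(tokens):
    p = rng.random((len(tokens), 8))
    return p / p.sum(axis=1, keepdims=True)

out = iterative_masked_generation(dummy_predict, seq_len=16)
```

After the final step the schedule reaches zero, so every position holds a committed base-layer token; in the full pipeline the Residual Transformer would then predict the remaining RVQ layers on top of this sequence.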
Train the Masked Transformer for 500 epochs:

```bash
uv run python -m transformer.train_transformer \
    --vq_path models/rvq_vae_best.pth \
    --epochs 500 \
    --use_amp \
    --batch_size 64 \
    --gradient_accumulation_steps 2 \
    --num_workers 4 \
    --output_dir transformer_checkpoints
```

Then train the Residual Transformer for another 500 epochs:
```bash
uv run python -m transformer.train_transformer \
    --vq_path models/rvq_vae_best.pth \
    --train_residual_only \
    --mask_checkpoint transformer_checkpoints/best_model.pth \
    --residual_epochs 500 \
    --use_amp \
    --batch_size 64 \
    --gradient_accumulation_steps 2 \
    --num_workers 4 \
    --output_dir residual_checkpoints
```

This is a public excerpt/minimal version of work done at Signvrse.