Hi VILA Team, many thanks for the amazing work!
I have recently been adapting NVILA as the video-understanding model for a research project, and I noticed that NVILA-8B and NVILA-8B-Video differ by more than the extra video instruction-tuning stage described in #167. The two models have different visual sampling feature dimensions, and their base architectures also differ: one is annotated as qwen2vl and the other as qwen2.5-vl. Is there any information about these models missing from the repository?
@Lyken17