---
license: apache-2.0
language:
- vi
- en
tags:
- vision-language-model
- vlm
- qwen3
- fastvlm
- vietnamese
base_model: Qwen/Qwen3-0.6B
datasets:
- 5CD-AI/Viet-multimodal-open-r1-8k-verified
---

# Belle-VLM: Vietnamese Vision Language Model

## Model Description

Belle-VLM is a vision language model trained for Vietnamese multimodal reasoning tasks. It pairs a FastViTHD (MobileCLIP) vision encoder with a Qwen3-0.6B language backbone through a two-layer MLP projector.

### Architecture

- **LLM Backbone**: Qwen3-0.6B
- **Vision Encoder**: FastViTHD (MobileCLIP)
- **Projector**: 2-layer MLP (3072 -> 1024); a hedged sketch appears at the end of this card

### Training

- **Dataset**: 5CD-AI/Viet-multimodal-open-r1-8k-verified
- **Method**: LoRA fine-tuning (see the config sketch at the end of this card)
- **Epochs**: 2
- **Learning Rate**: 2e-05

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "beyoru/Belle-VLM",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("beyoru/Belle-VLM", trust_remote_code=True)
```

A minimal generation sketch appears at the end of this card.

## Training Details

| Parameter | Value |
|-----------|-------|
| Base Model | Qwen/Qwen3-0.6B |
| Vision Tower | mobileclip_l_384 |
| LoRA Rank | 8 |
| LoRA Alpha | 16 |
| Batch Size | 1 x 1 |
| Epochs | 2 |

## License

Apache 2.0
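
## Projector Sketch

For illustration only, here is a minimal sketch of the 2-layer MLP projector described in the Architecture section, mapping 3072-dim vision features into the 1024-dim hidden space of Qwen3-0.6B. The class name, activation, and layer layout are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class MLPProjector(nn.Module):
    """Hypothetical 2-layer MLP projector (3072 -> 1024).

    The GELU activation and layer names are assumptions; the card
    only states "2-layer MLP (3072 -> 1024)".
    """

    def __init__(self, vision_dim: int = 3072, llm_dim: int = 1024):
        super().__init__()
        self.fc1 = nn.Linear(vision_dim, llm_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(llm_dim, llm_dim)

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, 3072) from the vision tower
        return self.fc2(self.act(self.fc1(vision_features)))
```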
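
## LoRA Config Sketch

The hyperparameters in the Training Details table map onto a PEFT `LoraConfig` roughly as below. The `target_modules` and `lora_dropout` values are assumptions; the card does not state them.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,           # LoRA rank, from the table above
    lora_alpha=16, # LoRA alpha, from the table above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed, not stated
    lora_dropout=0.05,  # assumed, not stated
    task_type="CAUSAL_LM",
)
```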
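
## Generation Sketch

A minimal text-only generation example continuing from the `model` and `tokenizer` loaded in the Usage section. Image inputs go through the model's custom `trust_remote_code` processing path, which this card does not document, so this sketch sticks to the standard `transformers` text API and assumes the tokenizer inherits the Qwen3 chat template.

```python
# Continues from the Usage snippet above (model and tokenizer loaded).
messages = [{"role": "user", "content": "Xin chào! Bạn có thể làm gì?"}]  # "Hello! What can you do?"

# Assumes the tokenizer inherits the Qwen3 chat template.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=128)

# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```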