# Best Practices for Rapidly Training Vision-Language (VL) Models

This document provides best practices for quickly training vision-language (VL) models from scratch.

Model Links

- [Qwen2.5-VL-7B-Instruct](https://www.modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct)
- [Qwen3-8B](https://www.modelscope.cn/models/Qwen/Qwen3-8B)

Trained Model Link

- [Simple-VL-8B](https://www.modelscope.cn/models/swift/Simple-VL-8B/summary)

The training workflow builds upon the Qwen2.5-VL-7B-Instruct model architecture by replacing its internal large language model (LLM) component with the weights from Qwen3-8B, thereby enhancing the model's visual understanding capabilities. The process involves the following steps:

1. Modify the original model's configuration file `config.json` to align with Qwen3-8B.
2. Initialize and load the new model weights, and save them as a new model.
3. Fine-tune the new model in two stages:
   1. Stage 1: Train only the vision-to-language alignment module (aligner), freezing the ViT and LLM components.
   2. Stage 2: Unfreeze all modules and perform joint fine-tuning to improve overall performance.

## Model Modification

### Config File (config.json) Update

Due to structural differences between Qwen2.5-7B-Instruct and Qwen3-8B (e.g., number of layers, hidden dimensions), create a new `config.json` based on the Qwen2.5-VL-7B-Instruct config and update the following parameters to match Qwen3-8B:

```
Modified Parameters
1. hidden_size: 3584 -> 4096
2. intermediate_size: 18944 -> 12288
3. num_attention_heads: 28 -> 32
4. num_key_value_heads: 4 -> 8
5. num_hidden_layers: 28 -> 36
6. vocab_size: 152064 -> 151936
7. max_window_layers: 28 -> 36

Newly Added Parameter
1. head_dim: 128
```

### Model Weight Initialization and Replacement

Use the following Python script to initialize, replace, and save the model weights:

```python
import torch
from modelscope import Qwen2_5_VLForConditionalGeneration, AutoModelForCausalLM, AutoConfig
from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import Qwen2_5_VLPatchMerger, Qwen2_5_VLModel
from accelerate import Accelerator

# Load the original VL model and the Qwen3-8B model
qwen2_5_vl_7b_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    device_map="cuda",
    torch_dtype=torch.bfloat16
)
device = qwen2_5_vl_7b_model.device

qwen3_8b_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    device_map=device,
    torch_dtype=torch.bfloat16
)

# Load configurations
old_config = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")  # kept for reference/comparison
new_config = AutoConfig.from_pretrained("/path/to/new_config_dir")  # path to the new config directory
new_visual_config = new_config.vision_config

# Replace the merger (aligner) layer so its output dimension matches the new LLM
new_merger = Qwen2_5_VLPatchMerger(
    dim=new_visual_config.out_hidden_size,
    context_dim=new_visual_config.hidden_size,
    spatial_merge_size=new_visual_config.spatial_merge_size,
).to(device).to(torch.bfloat16)
qwen2_5_vl_7b_model.visual.merger = new_merger

# Replace the LLM part of the VL model: build a decoder from the new config,
# then copy over the matching Qwen3-8B weights
new_llm_model = Qwen2_5_VLModel(new_config).to(device).to(torch.bfloat16)
new_state_dict = new_llm_model.state_dict()
with torch.no_grad():
    for name, param in qwen3_8b_model.model.named_parameters():
        if name in new_state_dict:
            new_state_dict[name].copy_(param)

qwen2_5_vl_7b_model.model = new_llm_model
qwen2_5_vl_7b_model.lm_head = qwen3_8b_model.lm_head

# Save the modified model
accelerator = Accelerator()
accelerator.save_model(
    model=qwen2_5_vl_7b_model,
    save_directory="/path/to/save/Qwen3-VL-Model",
    max_shard_size="4GB",
    safe_serialization=True
)
```
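
The `/path/to/new_config_dir` consumed by the script above can also be produced programmatically instead of editing `config.json` by hand. Below is a minimal sketch, assuming the parameter list from the previous section; the output path is a placeholder, and the `vision_config.out_hidden_size` line is an additional assumption (it is not in the list above, but the script reads the merger's output dimension from it, so it must match the new LLM hidden size):

```python
from modelscope import AutoConfig

# Start from the original Qwen2.5-VL-7B-Instruct configuration.
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Align the LLM-related fields with Qwen3-8B (see the parameter list above).
config.hidden_size = 4096
config.intermediate_size = 12288
config.num_attention_heads = 32
config.num_key_value_heads = 8
config.num_hidden_layers = 36
config.vocab_size = 151936
config.max_window_layers = 36
config.head_dim = 128  # newly added parameter

# Assumption: the merger built in the script above uses vision_config.out_hidden_size
# as its output dimension, so it must equal the new LLM hidden size.
config.vision_config.out_hidden_size = 4096

# Save to the directory that the weight-initialization script loads from.
config.save_pretrained("/path/to/new_config_dir")
```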
## Training

To simplify the process, we skip pre-training and proceed directly to supervised fine-tuning (SFT). The training is divided into two stages:

### Stage 1: Train Aligner Layer

Train only the vision-to-language alignment module (aligner) while freezing the ViT and LLM parts:

```bash
NNODES=$WORLD_SIZE \
NODE_RANK=$RANK \
NPROC_PER_NODE=8 \
MAX_PIXELS=1003520 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
    --model /path/to/new_vl_model \
    --model_type qwen2_5_vl \
    --train_type full \
    --dataset xxx \
    --torch_dtype bfloat16 \
    --attn_impl flash_attn \
    --freeze_vit true \
    --freeze_llm true \
    --freeze_aligner false \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --learning_rate 5e-6 \
    --gradient_accumulation_steps 8 \
    --eval_steps -1 \
    --save_steps 1000 \
    --save_total_limit 10 \
    --logging_steps 5 \
    --max_length 8192 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 8 \
    --deepspeed zero2
```

### Stage 2: Full Model Training

Unfreeze all modules and train them jointly to enhance the model's visual understanding:

```bash
NNODES=$WORLD_SIZE \
NODE_RANK=$RANK \
NPROC_PER_NODE=8 \
MAX_PIXELS=1003520 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
    --model /path/to/stage1_checkpoint \
    --model_type qwen2_5_vl \
    --train_type full \
    --dataset xxx \
    --torch_dtype bfloat16 \
    --attn_impl flash_attn \
    --freeze_vit false \
    --freeze_llm false \
    --freeze_aligner false \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --learning_rate 5e-6 \
    --gradient_accumulation_steps 8 \
    --eval_steps -1 \
    --save_steps 1000 \
    --save_total_limit 10 \
    --logging_steps 5 \
    --max_length 8192 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 8 \
    --deepspeed zero2
```

## Inference / Deployment / Evaluation

### Inference

Perform inference using `swift infer`:

```bash
swift infer \
    --model /path/to/stage2_checkpoint
```

### Deployment

Accelerate model serving with vLLM:

```bash
CUDA_VISIBLE_DEVICES=0 \
MAX_PIXELS=1003520 \
VIDEO_MAX_PIXELS=50176 \
FPS_MAX_FRAMES=12 \
swift deploy \
    --model /path/to/stage2_checkpoint \
    --infer_backend vllm \
    --gpu_memory_utilization 0.9 \
    --max_model_len 8192 \
    --max_new_tokens 2048 \
    --limit_mm_per_prompt '{"image": 5, "video": 2}' \
    --served_model_name Qwen3-VL
```

### Evaluation

Evaluate the trained VL model using [EvalScope](https://github.com/modelscope/evalscope/).

Example evaluation using the MMMU benchmark:

```python
from evalscope import TaskConfig, run_task

task_cfg_dict = TaskConfig(
    work_dir='outputs',
    eval_backend='VLMEvalKit',
    eval_config={
        'data': ['MMMU_DEV_VAL'],
        'mode': 'all',
        'model': [
            {
                'api_base': 'http://localhost:8000/v1/chat/completions',
                'key': 'EMPTY',
                'name': 'CustomAPIModel',
                'temperature': 0.6,
                'type': 'Qwen3-VL',
                'img_size': -1,
                'video_llm': False,
                'max_tokens': 512,
            }
        ],
        'reuse': False,
        'nproc': 64,
        'judge': 'exact_matching'
    },
)

run_task(task_cfg=task_cfg_dict)
```
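
The evaluation above talks to the model through the OpenAI-compatible endpoint exposed by `swift deploy`, so it can be useful to sanity-check the server first. Below is a minimal sketch using the `openai` Python client; the base URL, port 8000, the served model name `Qwen3-VL`, and the placeholder image URL are assumptions that must match your deployment command.

```python
from openai import OpenAI

# Assumes the `swift deploy` command above is running locally on port 8000
# and was started with --served_model_name Qwen3-VL.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="Qwen3-VL",
    messages=[{
        "role": "user",
        "content": [
            # Placeholder: replace with a real, reachable image URL.
            {"type": "image_url", "image_url": {"url": "https://example.com/test.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    max_tokens=128,
    temperature=0.6,
)
print(response.choices[0].message.content)
```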