# Best Practices for Rapidly Training Vision-Language (VL) Models

This document provides best practices for quickly training vision-language (VL) models from scratch.
Model Links

- [Qwen2.5-VL-7B-Instruct](https://www.modelscope.cn/models/Qwen/Qwen2.5-VL-7B-Instruct)
- [Qwen3-8B](https://www.modelscope.cn/models/Qwen/Qwen3-8B)

Trained Model Link

- [Simple-VL-8B](https://www.modelscope.cn/models/swift/Simple-VL-8B/summary)
The training workflow builds upon the Qwen2.5-VL-7B-Instruct model architecture, replacing its internal large language model (LLM) component with the weights from Qwen3-8B, thereby enhancing the model's visual understanding capabilities. The process involves the following steps:

1. Modify the original model's configuration file config.json to align with Qwen3-8B.
2. Initialize and load the new model weights, and save them as a new model.
3. Fine-tune the new model in two stages:
   1. Stage 1: Train only the vision-to-language alignment module (aligner), freezing the ViT and LLM components.
   2. Stage 2: Unfreeze all modules and perform joint fine-tuning to improve overall performance.
## Model Modification

### Config File (config.json) Update

Due to structural differences between the LLM component of Qwen2.5-VL-7B-Instruct and Qwen3-8B (e.g., number of layers, hidden dimensions), create a new config.json based on the Qwen2.5-VL-7B-Instruct config and update the following parameters to match Qwen3-8B:
```
Modified parameters:
1. hidden_size: 3584 -> 4096
2. intermediate_size: 18944 -> 12288
3. num_attention_heads: 28 -> 32
4. num_key_value_heads: 4 -> 8
5. num_hidden_layers: 28 -> 36
6. vocab_size: 152064 -> 151936
7. max_window_layers: 28 -> 36

Newly added parameter:
1. head_dim: 128
```
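The same edits can be applied programmatically. Below is a minimal sketch, assuming a transformers version in which the text-model parameters sit at the top level of the Qwen2.5-VL config; the output path is a placeholder. Raising `vision_config.out_hidden_size` to the new hidden size is an assumption (not part of the table above) needed so the patch merger built in the next section projects visual features into the Qwen3 hidden dimension:

```python
from modelscope import AutoConfig

# Start from the original VL config and overwrite the LLM-related fields
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

config.hidden_size = 4096
config.intermediate_size = 12288
config.num_attention_heads = 32
config.num_key_value_heads = 8
config.num_hidden_layers = 36
config.vocab_size = 151936
config.max_window_layers = 36
config.head_dim = 128  # newly added parameter

# Assumption: the merger must project visual features into the new LLM
# hidden size, so the vision tower's output dimension is raised as well
config.vision_config.out_hidden_size = 4096

config.save_pretrained("/path/to/new_config_dir")  # placeholder path
```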
### Model Weight Initialization and Replacement

Use the following Python script to initialize, replace, and save the model weights:

```python
import torch
from modelscope import Qwen2_5_VLForConditionalGeneration, AutoModelForCausalLM, AutoConfig
from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import Qwen2_5_VLPatchMerger, Qwen2_5_VLModel
from accelerate import Accelerator

# Load the original VL model and the Qwen3-8B model
qwen2_5_vl_7b_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    device_map="cuda",
    torch_dtype=torch.bfloat16
)
device = qwen2_5_vl_7b_model.device
qwen3_8b_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    device_map=device,
    torch_dtype=torch.bfloat16
)

# Load the modified configuration (see the previous section)
new_config = AutoConfig.from_pretrained("/path/to/new_config_dir")  # Path to new config directory
new_visual_config = new_config.vision_config

# Replace the merger (aligner) layer so that it projects visual features
# into the new LLM hidden size
new_merger = Qwen2_5_VLPatchMerger(
    dim=new_visual_config.out_hidden_size,
    context_dim=new_visual_config.hidden_size,
    spatial_merge_size=new_visual_config.spatial_merge_size,
).to(device).to(torch.bfloat16)
qwen2_5_vl_7b_model.visual.merger = new_merger

# Replace the LLM part of the VL model, copying over the Qwen3-8B weights
# wherever the parameter names match
new_llm_model = Qwen2_5_VLModel(new_config).to(device).to(torch.bfloat16)
target_state_dict = new_llm_model.state_dict()
with torch.no_grad():
    for name, param in qwen3_8b_model.model.named_parameters():
        if name in target_state_dict:
            target_state_dict[name].copy_(param)
qwen2_5_vl_7b_model.model = new_llm_model
qwen2_5_vl_7b_model.lm_head = qwen3_8b_model.lm_head

# Save the modified model (weights only; copy the new config.json and the
# tokenizer/processor files into the save directory separately)
accelerator = Accelerator()
accelerator.save_model(
    model=qwen2_5_vl_7b_model,
    save_directory="/path/to/save/Qwen3-VL-Model",
    max_shard_size="4GB",
    safe_serialization=True
)
```
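Before training, a quick smoke test can confirm that the merged model loads and generates text. This is a minimal sketch, assuming the new config.json and the original tokenizer/processor files have been copied into the save directory (`accelerator.save_model` stores only the weights):

```python
import torch
from modelscope import Qwen2_5_VLForConditionalGeneration, AutoProcessor

# Reload the merged model from the save directory (placeholder path)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "/path/to/save/Qwen3-VL-Model",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained("/path/to/save/Qwen3-VL-Model")

# Text-only generation is enough to verify that the LLM swap is wired correctly
inputs = processor(text=["Hello, who are you?"], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```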
## Training

To simplify the process, we skip pre-training and proceed directly to supervised fine-tuning (SFT). The training is divided into two stages:

### Stage 1: Train the Aligner Layer

Train only the vision-to-language alignment module while freezing the ViT and LLM parts:

```bash
NNODES=$WORLD_SIZE \
NODE_RANK=$RANK \
NPROC_PER_NODE=8 \
MAX_PIXELS=1003520 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
    --model /path/to/new_vl_model \
    --model_type qwen2_5_vl \
    --train_type full \
    --dataset xxx \
    --torch_dtype bfloat16 \
    --attn_impl flash_attn \
    --freeze_vit true \
    --freeze_llm true \
    --freeze_aligner false \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --learning_rate 5e-6 \
    --gradient_accumulation_steps 8 \
    --eval_steps -1 \
    --save_steps 1000 \
    --save_total_limit 10 \
    --logging_steps 5 \
    --max_length 8192 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 8 \
    --deepspeed zero2
```
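The `--dataset` argument accepts registered dataset names or local data files. Below is a minimal sketch of writing one training sample, assuming ms-swift's custom multimodal dataset convention (`messages` plus `images`, with an `<image>` placeholder tag in the text); the file name, texts, and image path are placeholders:

```python
import json

# One SFT sample: the <image> tag in the user turn marks where the image is inserted
sample = {
    "messages": [
        {"role": "user", "content": "<image>What is shown in this picture?"},
        {"role": "assistant", "content": "A cat is sitting on a windowsill."},
    ],
    "images": ["/path/to/image.jpg"],
}

# Write one sample per line (JSONL); pass the file via --dataset train.jsonl
with open("train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```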
### Stage 2: Full Model Training

Unfreeze all modules and jointly train them to enhance the model's visual understanding:

```bash
NNODES=$WORLD_SIZE \
NODE_RANK=$RANK \
NPROC_PER_NODE=8 \
MAX_PIXELS=1003520 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift sft \
    --model /path/to/stage1_checkpoint \
    --model_type qwen2_5_vl \
    --train_type full \
    --dataset xxx \
    --torch_dtype bfloat16 \
    --attn_impl flash_attn \
    --freeze_vit false \
    --freeze_llm false \
    --freeze_aligner false \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --learning_rate 5e-6 \
    --gradient_accumulation_steps 8 \
    --eval_steps -1 \
    --save_steps 1000 \
    --save_total_limit 10 \
    --logging_steps 5 \
    --max_length 8192 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 8 \
    --deepspeed zero2
```
## Inference / Deployment / Evaluation

### Inference

Perform inference using `swift infer`:

```bash
swift infer \
    --model /path/to/stage2_checkpoint
```
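Inference can also be run directly from Python through the standard Qwen2.5-VL processing pipeline. A minimal sketch, assuming the `qwen-vl-utils` helper package is installed and using placeholder paths:

```python
import torch
from modelscope import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "/path/to/stage2_checkpoint", device_map="cuda", torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained("/path/to/stage2_checkpoint")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "/path/to/image.jpg"},
        {"type": "text", "text": "Describe this image."},
    ],
}]

# Build the chat prompt and extract the vision inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens before decoding
response = processor.batch_decode(
    output[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(response)
```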
### Deployment

Accelerate model serving with vLLM:

```bash
CUDA_VISIBLE_DEVICES=0 \
MAX_PIXELS=1003520 \
VIDEO_MAX_PIXELS=50176 \
FPS_MAX_FRAMES=12 \
swift deploy \
    --model /path/to/stage2_checkpoint \
    --infer_backend vllm \
    --gpu_memory_utilization 0.9 \
    --max_model_len 8192 \
    --max_new_tokens 2048 \
    --limit_mm_per_prompt '{"image": 5, "video": 2}' \
    --served_model_name Qwen3-VL
```
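`swift deploy` exposes an OpenAI-compatible endpoint (assumed below at the default `http://localhost:8000/v1`, consistent with the `api_base` used in the evaluation section). A minimal client sketch with a placeholder image URL:

```python
from openai import OpenAI

# The server started by `swift deploy` requires no real key
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="Qwen3-VL",  # must match --served_model_name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```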
### Evaluation

Evaluate the trained VL model using [EvalScope](https://github.com/modelscope/evalscope/).

Example evaluation on the MMMU benchmark (the configuration queries the OpenAI-compatible endpoint started by `swift deploy` above, so that service must be running):

```python
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    work_dir='outputs',
    eval_backend='VLMEvalKit',
    eval_config={
        'data': ['MMMU_DEV_VAL'],
        'mode': 'all',
        'model': [
            {
                'api_base': 'http://localhost:8000/v1/chat/completions',
                'key': 'EMPTY',
                'name': 'CustomAPIModel',
                'temperature': 0.6,
                'type': 'Qwen3-VL',
                'img_size': -1,
                'video_llm': False,
                'max_tokens': 512,
            }
        ],
        'reuse': False,
        'nproc': 64,
        'judge': 'exact_matching'
    },
)

run_task(task_cfg=task_cfg)
```