# MotiVate Qwen3-VL Pose Bridge
This repository contains a pose-conditioned adaptation of Qwen3-VL-4B-Instruct for exercise description and brief corrective feedback generation.
The model is not a plain LoRA checkpoint. In addition to the Qwen LoRA adapter, inference also requires a custom pose bridge:
- `qwen_adapter/`: LoRA weights applied to the Qwen language model
- `pose_projector.pt`: maps pose latents into Qwen token embedding space
- `pose_adapter.pt`: pose encoder adapter weights
- `pose_bridge_config.json`: pose token metadata used during embedding injection
## Intended Output Format
The model is trained to produce exactly two labeled lines:

```
Description: ...
Feedback: ...
```
- `Description:` summarizes the observed movement.
- `Feedback:` gives short coaching guidance.
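Downstream code can split the two labeled lines back into fields. A minimal sketch (the `parse_model_output` helper is hypothetical, not part of this repo):

```python
def parse_model_output(text: str) -> dict:
    """Split the model's two labeled output lines into a dict.

    Hypothetical helper: lines not starting with the expected labels
    are ignored rather than raising, since generation can drift.
    """
    result = {}
    for line in text.strip().splitlines():
        if line.startswith("Description:"):
            result["description"] = line[len("Description:"):].strip()
        elif line.startswith("Feedback:"):
            result["feedback"] = line[len("Feedback:"):].strip()
    return result
```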
## What To Upload
For a usable Hugging Face model repo, upload at least:
- `qwen_adapter/`
- `pose_projector.pt`
- `pose_adapter.pt`
- `pose_bridge_config.json`
- `tokenizer.json`
- `tokenizer_config.json`
- `special_tokens_map.json`
- `added_tokens.json`
- `preprocessor_config.json`
- `video_preprocessor_config.json`
- `chat_template.jinja`
- `vocab.json`
- `merges.txt`
- this `README.md`
You do not need to upload optimizer, scheduler, RNG, or trainer state files unless you want to resume training.
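Before uploading, it can be worth checking that the required artifacts are actually present in the local folder. A small stdlib-only sketch (the `REQUIRED` list below is a subset of the files above; `missing_files` is a hypothetical helper):

```python
from pathlib import Path

# Subset of the artifacts listed above; extend as needed.
REQUIRED = [
    "qwen_adapter",
    "pose_projector.pt",
    "pose_adapter.pt",
    "pose_bridge_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
]


def missing_files(repo_dir: str) -> list[str]:
    """Return the names from REQUIRED that are absent in repo_dir."""
    root = Path(repo_dir)
    return [name for name in REQUIRED if not (root / name).exists()]
```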
## Important Loading Note
This repo cannot be loaded with a single vanilla call:

```python
AutoModelForImageTextToText.from_pretrained(repo_id)
```

That call expects a standard Transformers checkpoint layout and would at best load base Qwen model weights; it knows nothing about the LoRA adapter or the pose bridge. This project instead uses a custom runtime assembly step:
- load the base Qwen3-VL model
- load the tokenizer / processor from this repo
- attach the LoRA adapter from `qwen_adapter/`
- rebuild the pose bridge modules
- load `pose_projector.pt` and `pose_adapter.pt`
- inject pose embeddings at `<|pose|>` token positions during generation
Example:

```python
from pathlib import Path

from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor

from main import build_pose_training_model, resolve_runtime, set_seed
from qwen_pose import load_config

repo_dir = Path("path/to/downloaded/hf-repo")
config_path = Path("path/to/stage2_pose_lora.json")

train_config = load_config(config_path)
training_args = train_config.training.build_training_arguments()
device, model_dtype = resolve_runtime(training_args)
set_seed(train_config.data.sampling_seed)

# Tokenizer / processor come from this repo (they carry the pose token).
processor = AutoProcessor.from_pretrained(
    str(repo_dir),
    min_pixels=train_config.data.image_min_pixels,
    max_pixels=train_config.data.image_max_pixels,
)
if processor.tokenizer.pad_token_id is None:
    processor.tokenizer.pad_token = processor.tokenizer.eos_token

# The base model is loaded from the original Qwen checkpoint, not this repo.
qwen_model = AutoModelForImageTextToText.from_pretrained(
    train_config.model.model_name,
    torch_dtype=model_dtype,
)

# Ensure the <|pose|> special token exists before looking up its id.
pose_token = train_config.pose.pose_special_token
if pose_token not in set(processor.tokenizer.additional_special_tokens or []):
    processor.tokenizer.add_special_tokens({"additional_special_tokens": [pose_token]})
    qwen_model.resize_token_embeddings(len(processor.tokenizer))
pose_token_id = processor.tokenizer.convert_tokens_to_ids(pose_token)

# Attach the LoRA adapter from this repo, frozen for inference.
qwen_model = PeftModel.from_pretrained(
    qwen_model,
    str(repo_dir / "qwen_adapter"),
    is_trainable=False,
)

# Temporarily point the pose-bridge loader at the downloaded repo so it
# picks up pose_projector.pt / pose_adapter.pt, then restore the config.
original_init_checkpoint_path = train_config.model.init_checkpoint_path
object.__setattr__(train_config.model, "init_checkpoint_path", str(repo_dir))
try:
    model, pose_loader = build_pose_training_model(
        train_config=train_config,
        qwen_model=qwen_model,
        device=device,
        pose_token_id=pose_token_id,
    )
finally:
    object.__setattr__(train_config.model, "init_checkpoint_path", original_init_checkpoint_path)

model = model.to(device)
model.eval()
```
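Once assembled, prompts handed to the model must contain one `<|pose|>` placeholder per pose latent so the bridge has positions to inject into. A hedged sketch of building such a prompt (the `build_pose_prompt` helper and the placement of the placeholders are illustrative assumptions, not this repo's exact prompt format):

```python
def build_pose_prompt(question: str, num_pose_tokens: int, pose_token: str = "<|pose|>") -> str:
    """Prepend one placeholder per pose latent.

    Illustrative only: the pose bridge replaces the embeddings at these
    token positions during generation, so the count must match the
    number of pose latents produced by the pose encoder.
    """
    return pose_token * num_pose_tokens + "\n" + question
```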
## Training Summary
- Base model: `Qwen/Qwen3-VL-4B-Instruct`
- Adaptation: language-side LoRA plus custom pose bridge
- Task: exercise movement description and short feedback generation
- Output format: two labeled lines beginning with `Description:` and `Feedback:`