# Qwen3-VL-2B-Instruct-Action
Qwen/Qwen3-VL-2B-Instruct extended with 2048 FAST action tokens so it can be used as the VLM backbone for the QwenFast framework in starVLA (autoregressive VLA via π₀-FAST-style discrete action tokens).
The base model weights are unchanged; only the input/output embedding tables are resized and the new rows are randomly initialised. The tokenizer, processor, and config are saved alongside so the directory loads as a drop-in replacement for the base model.
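The resize-and-init step described above can be sketched in NumPy. Only the vocab counts come from this model card; the embedding width is shrunk for illustration, and this is a standalone sketch, not the repo's actual code (which operates on the real `transformers` embedding tables).

```python
import numpy as np

OLD_VOCAB, NEW_TOKENS, DIM = 151936, 2048, 8  # DIM shrunk for illustration

rng = np.random.default_rng(0)
old_embed = rng.standard_normal((OLD_VOCAB, DIM)).astype(np.float32)

# Copy the existing rows unchanged and append 2048 rows drawn from N(0, 0.02),
# mirroring the "normal" init strategy listed below.
new_rows = rng.normal(loc=0.0, scale=0.02, size=(NEW_TOKENS, DIM)).astype(np.float32)
resized = np.concatenate([old_embed, new_rows], axis=0)

assert resized.shape == (153984, DIM)
assert np.array_equal(resized[:OLD_VOCAB], old_embed)  # base weights untouched
```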
## What was added
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3-VL-2B-Instruct |
| New tokens | 2048 FAST action tokens (added as special tokens) |
| Tokenizer size before (base / with special tokens) | 151643 / 151669 |
| Embedding size before / after | 151936 → 153984 |
| Action token id range | [151936, 153983] |
| Init strategy for new rows | normal (μ=0, σ=0.02) |
| Source token list | `fast_tokens.txt` |
| Saved dtype | bfloat16 |
The `{token: id}` mapping is stored in `added_custom_token_id_map.json`.
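Because the 2048 action tokens occupy a contiguous id range, mapping a generated vocab id back to its FAST code index is a simple offset. The helper below is illustrative only (not part of the repo); the constants come from the table above.

```python
ACTION_TOKEN_START = 151936   # first FAST action token id (see table above)
NUM_ACTION_TOKENS = 2048

def to_fast_index(token_id: int):
    """Map a vocab id to its FAST action-code index, or None for non-action tokens."""
    if ACTION_TOKEN_START <= token_id < ACTION_TOKEN_START + NUM_ACTION_TOKENS:
        return token_id - ACTION_TOKEN_START
    return None

assert to_fast_index(151936) == 0
assert to_fast_index(153983) == 2047
assert to_fast_index(42) is None
```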
## How it was produced
Built with the helper at `starVLA/model/modules/vlm/tools/add_qwen_special_tokens/add_special_tokens_to_qwen.py`, equivalent to:

```bash
python starVLA/model/modules/vlm/tools/add_qwen_special_tokens/add_special_tokens_to_qwen.py \
  --model-id Qwen/Qwen3-VL-2B-Instruct \
  --tokens-file starVLA/model/modules/vlm/tools/add_qwen_special_tokens/fast_tokens.txt \
  --save-dir ./results/Qwen3-VL-2B-Instruct-Action \
  --init-strategy normal
```
## Use in starVLA
Set `framework.qwenvl.base_vlm` to this repo in your training YAML, e.g. `examples/SimplerEnv/train_files/config_2b_fast.yaml`:
```yaml
framework:
  name: QwenFast
  qwenvl:
    base_vlm: LinhanWang/Qwen3-VL-2B-Instruct-Action
    attn_implementation: sdpa
  action_model:
    action_model_type: FAST
    action_dim: 7
    future_action_window_size: 15
    past_action_window_size: 0
```
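A quick sanity check on the action chunk this config implies, assuming starVLA concatenates the past window, the current step, and the future window (this convention is an assumption; check the framework source for the exact layout):

```python
# Window sizes taken from the YAML above; the chunk layout is an assumption.
past, future, action_dim = 0, 15, 7
chunk_len = past + 1 + future  # current action plus 15 future actions
assert (chunk_len, action_dim) == (16, 7)  # 16 actions of 7 DoF per chunk
```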
## Standalone load
```python
import torch
from transformers import AutoProcessor, AutoTokenizer, Qwen3VLForConditionalGeneration

repo = "LinhanWang/Qwen3-VL-2B-Instruct-Action"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
proc = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    repo, dtype=torch.bfloat16, attn_implementation="sdpa", device_map="cuda"
)
print(len(tok), model.get_input_embeddings().weight.shape[0])  # 153717 153984
```
The newly added embedding rows carry no learned signal — fine-tune the model (e.g. with QwenFast on a LeRobot dataset) before relying on the action tokens.
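Once the model is fine-tuned, one way to recover FAST codes from a generated sequence is a simple range filter over the output ids. The id range below comes from this model card; the sample ids are fabricated for illustration.

```python
ACTION_START, ACTION_END = 151936, 153983  # action token id range from this card
# Fabricated generated ids mixing text tokens and action tokens:
generated = [151644, 152000, 153983, 42, 151940]
action_codes = [t - ACTION_START for t in generated
                if ACTION_START <= t <= ACTION_END]
assert action_codes == [64, 2047, 4]
```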
## License
Apache-2.0, inherited from Qwen/Qwen3-VL-2B-Instruct.