TVP-SFTBox-Qwen2VL-2B

Box expert LoRA adapter for Thinking with Visual Primitives.

Stage 2: Specialized SFT (Box Expert), covering grounding, counting, and spatial reasoning with structured thinking.

  • Base model: Qwen/Qwen2-VL-2B-Instruct
  • LoRA: r=64, alpha=128
  • Training data: 30K grounding (neg_ratio=0.15) + 8K counting + 3K spatial = 41K samples
  • Template: "1. Analyzing the request → 2. Object grounding → 3. Conclusion"
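
The adapter is applied on top of the base model listed above. The following is a minimal loading sketch, not the project's official inference script. It assumes recent `transformers` (with Qwen2-VL support), `peft`, and `torch` are installed, that the adapter is published under the repo id `yunfengwang/TVP-SFTBox-Qwen2VL-2B`, and that bfloat16 is an acceptable dtype. The function name `load_box_expert` is hypothetical.

```python
def load_box_expert(
    adapter_id="yunfengwang/TVP-SFTBox-Qwen2VL-2B",
    base_id="Qwen/Qwen2-VL-2B-Instruct",
):
    """Load the base model and attach the box-expert LoRA adapter (sketch)."""
    # Heavy dependencies are imported lazily so that defining this helper
    # does not require them to be installed.
    import torch
    from peft import PeftModel
    from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

    # Assumption: bfloat16 weights; switch to float16/float32 if unsupported.
    base = Qwen2VLForConditionalGeneration.from_pretrained(
        base_id, torch_dtype=torch.bfloat16
    )
    # Wrap the base model with the LoRA weights from the adapter repo.
    model = PeftModel.from_pretrained(base, adapter_id)
    # The processor (tokenizer + image preprocessing) comes from the base model.
    processor = AutoProcessor.from_pretrained(base_id)
    return model.eval(), processor
```

Calling `load_box_expert()` downloads both the base weights and the adapter, so it needs network access and enough memory for a 2B-parameter model.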

Example Output

1. **Analyzing the request**
The user asks me to locate the person in this image.
2. **Object grounding**
I see a <|ref|>person<|/ref|><|box|>[[511,208,738,963]]<|/box|>.
3. **Conclusion**
The person is located at the specified coordinates.
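
The structured output above can be post-processed with a small parser. The sketch below extracts `(label, box)` pairs from the `<|ref|>…<|/ref|><|box|>[[…]]<|/box|>` markup shown in the example. Two points are assumptions rather than facts from this card: that a single `<|box|>` span may hold several boxes, and that coordinates are `[x1, y1, x2, y2]` on a 0-1000 normalized grid (the convention commonly used with Qwen2-VL grounding). The helper names `parse_grounding` and `to_pixels` are hypothetical.

```python
import re

# One grounded object: <|ref|>label<|/ref|><|box|>[[x1,y1,x2,y2], ...]<|/box|>
_REF_BOX = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|><\|box\|>(\[\[.*?\]\])<\|/box\|>", re.S
)
# One inner box of four integers.
_BOX = re.compile(r"\[\s*(-?\d+)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*\]")


def parse_grounding(text):
    """Return a list of (label, (x1, y1, x2, y2)) found in the model output."""
    results = []
    for label, boxes in _REF_BOX.findall(text):
        for match in _BOX.finditer(boxes):
            coords = tuple(int(v) for v in match.groups())
            results.append((label.strip(), coords))
    return results


def to_pixels(box, width, height, grid=1000):
    """Scale a box to pixel units. ASSUMPTION: coords are normalized to 0..grid."""
    x1, y1, x2, y2 = box
    return (x1 * width / grid, y1 * height / grid,
            x2 * width / grid, y2 * height / grid)


output = "I see a <|ref|>person<|/ref|><|box|>[[511,208,738,963]]<|/box|>."
detections = parse_grounding(output)
print(detections)  # [('person', (511, 208, 738, 963))]
```

For counting-style prompts, `len(parse_grounding(output))` gives the number of grounded boxes, provided the model emits one box per instance.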

See the project repo for full instructions.

Framework versions

  • PEFT 0.12.0