TVP-SFTBox-Qwen2VL-2B

Box expert LoRA adapter for Thinking with Visual Primitives.

Stage 2: Specialized SFT (Box Expert). Trains grounding, counting, and spatial reasoning with structured thinking.

Base model: Qwen/Qwen2-VL-2B-Instruct
LoRA: r=64, alpha=128
Training data: 30K grounding (neg_ratio=0.15) + 8K counting + 3K spatial = 41K samples
Template: "1. Analyzing the request → 2. Object grounding → 3. Conclusion"
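The three-step template can be checked mechanically on a generated response. The following is a minimal sketch, assuming each step appears on its own line as `N. **Title**` (the form used in the example output); the names `split_sections` and `follows_template` are hypothetical helpers, not part of this project.

```python
import re

# ASSUMPTION: each step heading is on its own line in the form "N. **Title**".
HEADING_RE = re.compile(r"^\s*(\d+)\.\s*\*\*(.+?)\*\*\s*$", re.MULTILINE)

# Step titles taken from the template line of this model card.
EXPECTED_TITLES = ["Analyzing the request", "Object grounding", "Conclusion"]


def split_sections(response):
    """Return a dict mapping each heading title to the text that follows it."""
    matches = list(HEADING_RE.finditer(response))
    sections = {}
    for i, m in enumerate(matches):
        # A section runs until the next heading, or to the end of the response.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(response)
        sections[m.group(2).strip()] = response[m.end():end].strip()
    return sections


def follows_template(response):
    """True if the response has exactly the three expected steps, in order."""
    return list(split_sections(response)) == EXPECTED_TITLES
```

A check like this can be used to filter malformed generations before extracting boxes from the grounding step.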

Example Output

1. **Analyzing the request**
The user asks me to locate the person in this image.
2. **Object grounding**
I see a <|ref|>person<|/ref|><|box|>[[511,208,738,963]]<|/box|>.
3. **Conclusion**
The person is located at the specified coordinates.
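The grounding tags in the output above can be extracted with a small parser. This is a sketch under stated assumptions: the coordinate order `(x1, y1, x2, y2)` and the 0 to 1000 normalization used in `to_pixels` are not stated in this card and should be confirmed against the project repo. `parse_boxes` and `to_pixels` are hypothetical helper names.

```python
import re

# Matches <|ref|>label<|/ref|><|box|>[[...], ...]<|/box|> as in the example output.
REF_BOX_RE = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|>\s*<\|box\|>(.*?)<\|/box\|>", re.DOTALL
)
# Matches one inner box of four integers, e.g. [511,208,738,963].
COORDS_RE = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def parse_boxes(text):
    """Return a list of (label, (a, b, c, d)) tuples found in the text."""
    results = []
    for label, body in REF_BOX_RE.findall(text):
        for coords in COORDS_RE.findall(body):
            results.append((label.strip(), tuple(int(c) for c in coords)))
    return results


def to_pixels(box, width, height, scale=1000):
    """Rescale a box to pixel coordinates.

    ASSUMPTION: the box is (x1, y1, x2, y2) normalized to [0, scale].
    Neither the order nor the scale is confirmed by this model card.
    """
    x1, y1, x2, y2 = box
    return (
        x1 * width / scale,
        y1 * height / scale,
        x2 * width / scale,
        y2 * height / scale,
    )
```

For example, `parse_boxes` applied to the grounding line above yields one entry labeled `person` with the four integers from the box tag.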

See the project repo for full instructions.