Model Card for Qween
This model is a vision‑language model fine‑tuned on DoraVQA, a pedagogically structured video question‑answering dataset extracted from Dora the Explorer. The model uses Group Relative Policy Optimization (GRPO) to learn spatial reasoning, navigation, object selection, and simple compositional logic from only 38 hours of educational video.
Model Details
Model Description
This model fine‑tunes Qwen2‑VL or Qwen3‑VL using GRPO on the DoraVQA dataset. DoraVQA consists of 5,344 question‑answer pairs aligned to the context → question → pause → answer structure of children’s educational television. The model learns to generate open‑ended answers during training and is evaluated on multiple‑choice reasoning benchmarks to test generalization.
- Developed by: Bishoy Galoaa, Xiangyu Bai, and Sarah Ostadabbas
- Model type: Vision‑Language Model (VLM) fine‑tuned with Reinforcement Learning (GRPO)
- Language(s) (NLP): English, Spanish
- License: follows the base Qwen model's license
- Finetuned from model: Qwen2‑VL or Qwen3‑VL (various sizes)
Model Sources
- Repository: Stay Tuned!
- Paper: https://arxiv.org/abs/2601.23251
- Demo: Stay Tuned!
Uses
Direct Use
- Video question answering
- Spatial reasoning tasks
- Object selection and spatial localization
- Navigation reasoning
- Multimodal temporal reasoning
- Research on structured supervision and pedagogical learning signals
Downstream Use
- Fine‑tuning on other structured educational datasets
- Transfer learning for spatial reasoning benchmarks
- Integration into multimodal tutoring or reasoning systems
- Research on reinforcement learning for VLMs
Out-of-Scope Use
- Safety‑critical decision‑making
- Real‑time navigation or robotics
- Applications involving minors or sensitive personal data
- High‑stakes factual retrieval
- Open‑world perception requiring robust real‑world grounding
Bias, Risks, and Limitations
The dataset is derived from a single children’s TV show, which may introduce narrative or cultural biases.
- Counting remains a known failure mode.
- The model may hallucinate in visually ambiguous scenes.
- Limited exposure to real‑world scenes reduces robustness.
- Answers may reflect scripted patterns rather than general reasoning.
Recommendations
Users (both direct and downstream) should be made aware of the model's risks, biases, and limitations. Evaluate the model in your target domain before deployment, and avoid using it in high‑risk or safety‑critical contexts.
How to Get Started with the Model
Use the code below to get started with the model. The checkpoint ID is a placeholder until release; the example frame path and question are illustrative.

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "Qwen/Qwen2-VL-7B-Instruct"  # placeholder; replace with the released checkpoint

model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# `images` is a list of PIL frames sampled from the clip; pass none for text-only queries.
images = [Image.open("frame_0.jpg")]
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Which door leads to the beach?"},
        ],
    }
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = processor(
    text=[text],
    images=images if images else None,
    padding=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=1.0,
        top_p=0.9,
        top_k=50,
    )

# Trim the prompt tokens so only the newly generated answer is decoded.
generated_ids_trimmed = [
    out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(response)
```
Training Details
Training Data
DoraVQA — 5,344 question‑answer pairs extracted from 96 episodes (8 seasons) of Dora the Explorer. Each example includes:
- Context frames
- Transcript window
- Explicit question
- Ground‑truth answer from the show
- Precise timestamp alignment
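To make the schema concrete, a single record might look like the following. The field names and values are illustrative only, not the dataset's actual format:

```python
# Hypothetical DoraVQA record layout (illustrative field names, not the real schema).
example = {
    "frames": ["ep03_s02_f0412.jpg", "ep03_s02_f0418.jpg"],  # context frames
    "transcript": "Dora: We need to get to the beach. Which way do we go?",
    "question": "Which path leads to the beach?",
    "answer": "The sandy path on the left.",
    "timestamp": {  # alignment of question and answer pause within the episode
        "question_s": 412.3,
        "pause_start_s": 414.0,
        "pause_end_s": 417.5,
    },
}
```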
Training Procedure
The model is trained using Group Relative Policy Optimization (GRPO), which uses group‑relative advantages instead of a value network. The reward combines F1 score and normalized Levenshtein distance between generated and ground‑truth answers.
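The exact reward implementation is not released; a minimal sketch of a reward that averages token-level F1 with normalized Levenshtein similarity, then applies the 2.0 reward scaling listed in the hyperparameters, might look like:

```python
from collections import Counter

def f1_score(pred: str, ref: str) -> float:
    """Token-level F1 between a generated answer and the ground truth."""
    pred_toks, ref_toks = pred.lower().split(), ref.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(ref_toks)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(pred_toks), overlap / len(ref_toks)
    return 2 * p * r / (p + r)

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def answer_reward(pred: str, ref: str, scale: float = 2.0) -> float:
    """Average of F1 and normalized Levenshtein similarity, times the reward scale."""
    lev_sim = 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref), 1)
    return scale * 0.5 * (f1_score(pred, ref) + lev_sim)
```

An exact match scores the full scaled reward; partial overlaps earn proportionally less, which gives GRPO a dense signal on open‑ended answers.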
Preprocessing
- Extract transcript segments from SRT files
- Align question timestamps with video frames
- Sample frames before and during the pause segment
- Format each example as {images, transcript context, question, answer}
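The transcript-alignment steps above can be sketched as follows; `parse_srt` and `transcript_window` are illustrative helpers, not the project's actual pipeline:

```python
import re

def srt_time_to_seconds(ts: str) -> float:
    """Convert an SRT timestamp like '00:06:52,300' to seconds."""
    h, m, rest = ts.split(":")
    s, ms = re.split(r"[,.]", rest)
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def parse_srt(text: str):
    """Yield (start_s, end_s, caption) for each cue block in an SRT string."""
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue
        start, _, end = lines[1].partition(" --> ")
        yield (srt_time_to_seconds(start.strip()),
               srt_time_to_seconds(end.strip()),
               " ".join(lines[2:]))

def transcript_window(cues, t: float, margin: float = 10.0):
    """Captions fully inside a window of +/- `margin` seconds around timestamp `t`."""
    return [c for s, e, c in cues if s >= t - margin and e <= t + margin]
```

The question timestamp selects both the transcript window and the video frames to sample before and during the pause segment.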
Training Hyperparameters
- Training regime: bf16 mixed precision (typical for Qwen‑VL; not explicitly stated)
- Learning rate: 1e‑4
- KL coefficient: 0.01
- Reward scaling: 2.0
- Group size: 8
- Sampling: temperature 1.0, top‑p 0.9
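With group size 8, GRPO scores each of the 8 sampled answers for a prompt against its own group's statistics rather than a learned value baseline. A minimal sketch of that advantage computation (not the project's actual code):

```python
def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward by its group's mean and std; no value network needed."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Answers that beat their group's average get positive advantages and are reinforced; a uniformly rewarded group yields zero advantage everywhere, so no update signal.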
Evaluation
Testing Data, Factors & Metrics
Testing Data
- DoraVQA (test split)
- CVBench
- Video‑MME
- NEXT‑QA
Factors
- Spatial vs. non‑spatial reasoning
- Immediate vs. sequential reasoning
- Text‑only, visual‑only, and multimodal inputs
Metrics
- Top‑1 accuracy (for MCQ benchmarks)
- Reward‑aligned correctness (during training)
- Qualitative spatial reasoning performance
Results
- +8–14 point improvement on DoraVQA
- 86.16% on CVBench (state‑of‑the‑art)
- Strong transfer to Video‑MME and NEXT‑QA
- Improved spatial localization, navigation, and object selection
- Counting improves but remains challenging
Summary
Structured educational content provides strong supervision signals that significantly improve spatial reasoning in VLMs, even with small datasets.
Technical Specifications
Model Architecture and Objective
- Base: Qwen2‑VL or Qwen3‑VL
- Objective: Reinforcement learning with GRPO
- Reward: F1 + normalized Levenshtein distance
- Input: Multimodal (video frames + transcript + question)
Compute Infrastructure
Hardware
1× NVIDIA H200 GPU.
Software
Linux.
Citation
BibTeX:
```bibtex
@misc{galoaa2026structuredscalelearningspatial,
  title={Structured Over Scale: Learning Spatial Reasoning from Educational Video},
  author={Bishoy Galoaa and Xiangyu Bai and Sarah Ostadabbas},
  year={2026},
  eprint={2601.23251},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.23251},
}
```

