Model Card for Qween
This model is a vision‑language model fine‑tuned on DoraVQA, a pedagogically structured video question‑answering dataset extracted from Dora the Explorer. The model uses Group Relative Policy Optimization (GRPO) to learn spatial reasoning, navigation, object selection, and simple compositional logic from only 38 hours of educational video.
Model Details
Model Description
This model fine‑tunes Qwen2‑VL or Qwen3‑VL using GRPO on the DoraVQA dataset. DoraVQA consists of 5,344 question‑answer pairs aligned to the context → question → pause → answer structure of children’s educational television. The model learns to generate open‑ended answers during training and is evaluated on multiple‑choice reasoning benchmarks to test generalization.
- Developed by: Bishoy Galoaa, Xiangyu Bai, and Sarah Ostadabbas
- Model type: Vision‑Language Model (VLM) fine‑tuned with Reinforcement Learning (GRPO)
- Language(s) (NLP): English, Spanish
- License: follows the base Qwen model's license
- Finetuned from model: Qwen2‑VL or Qwen3‑VL (various sizes)
Model Sources
- Repository: Stay Tuned!
- Paper: https://arxiv.org/abs/2601.23251
- Demo: Stay Tuned!
Uses
Direct Use
- Video question answering
- Spatial reasoning tasks
- Object selection and spatial localization
- Navigation reasoning
- Multimodal temporal reasoning
- Research on structured supervision and pedagogical learning signals
Downstream Use
- Fine‑tuning on other structured educational datasets
- Transfer learning for spatial reasoning benchmarks
- Integration into multimodal tutoring or reasoning systems
- Research on reinforcement learning for VLMs
Out-of-Scope Use
- Safety‑critical decision‑making
- Real‑time navigation or robotics
- Applications involving minors or sensitive personal data
- High‑stakes factual retrieval
- Open‑world perception requiring robust real‑world grounding
Bias, Risks, and Limitations
The dataset is derived from a single children’s TV show, which may introduce narrative or cultural biases.
- Counting remains a known failure mode.
- The model may hallucinate in visually ambiguous scenes.
- Limited exposure to real‑world scenes reduces robustness.
- Answers may reflect scripted patterns rather than general reasoning.
Recommendations
Users (both direct and downstream) should be made aware of the model's risks, biases, and limitations. Evaluate the model in your target domain before deployment, and avoid using it in high‑risk or safety‑critical contexts.
How to Get Started with the Model
Use the code below to get started with the model. The checkpoint ID is a placeholder until release; the example frame path and question are illustrative.

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "Qwen/Qwen2-VL-7B-Instruct"  # placeholder; replace with the released checkpoint

model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# `images` is a list of PIL frames sampled from the clip; pass none for text-only queries.
images = [Image.open("frame_0.jpg")]
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Which door leads to the beach?"},
        ],
    }
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = processor(
    text=[text],
    images=images if images else None,
    padding=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=1.0,
        top_p=0.9,
        top_k=50,
    )

# Trim the prompt tokens so only the newly generated answer is decoded.
generated_ids_trimmed = [
    out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)
]
response = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0]
print(response)
```
Training Details
Training Data
DoraVQA — 5,344 question‑answer pairs extracted from 96 episodes (8 seasons) of Dora the Explorer. Each example includes:
- Context frames
- Transcript window
- Explicit question
- Ground‑truth answer from the show
- Precise timestamp alignment
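To make the schema concrete, a single record might look like the following. The field names and values are illustrative only, not the dataset's actual format:

```python
# Hypothetical DoraVQA record layout (illustrative field names, not the real schema).
example = {
    "frames": ["ep03_s02_f0412.jpg", "ep03_s02_f0418.jpg"],  # context frames
    "transcript": "Dora: We need to get to the beach. Which way do we go?",
    "question": "Which path leads to the beach?",
    "answer": "The sandy path on the left.",
    "timestamp": {  # alignment of question and answer pause within the episode
        "question_s": 412.3,
        "pause_start_s": 414.0,
        "pause_end_s": 417.5,
    },
}
```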
Training Procedure
The model is trained using Group Relative Policy Optimization (GRPO), which uses group‑relative advantages instead of a value network. The reward combines F1 score and normalized Levenshtein distance between generated and ground‑truth answers.
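The exact reward implementation is not released; a minimal sketch of a reward that averages token-level F1 with normalized Levenshtein similarity, then applies the 2.0 reward scaling listed in the hyperparameters, might look like:

```python
from collections import Counter

def f1_score(pred: str, ref: str) -> float:
    """Token-level F1 between a generated answer and the ground truth."""
    pred_toks, ref_toks = pred.lower().split(), ref.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(ref_toks)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(pred_toks), overlap / len(ref_toks)
    return 2 * p * r / (p + r)

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def answer_reward(pred: str, ref: str, scale: float = 2.0) -> float:
    """Average of F1 and normalized Levenshtein similarity, times the reward scale."""
    lev_sim = 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref), 1)
    return scale * 0.5 * (f1_score(pred, ref) + lev_sim)
```

An exact match scores the full scaled reward; partial overlaps earn proportionally less, which gives GRPO a dense signal on open‑ended answers.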
Preprocessing
- Extract transcript segments from SRT files
- Align question timestamps with video frames
- Sample frames before and during the pause segment
- Format each example as {images, transcript context, question, answer}
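The transcript-alignment steps above can be sketched as follows; `parse_srt` and `transcript_window` are illustrative helpers, not the project's actual pipeline:

```python
import re

def srt_time_to_seconds(ts: str) -> float:
    """Convert an SRT timestamp like '00:06:52,300' to seconds."""
    h, m, rest = ts.split(":")
    s, ms = re.split(r"[,.]", rest)
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def parse_srt(text: str):
    """Yield (start_s, end_s, caption) for each cue block in an SRT string."""
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue
        start, _, end = lines[1].partition(" --> ")
        yield (srt_time_to_seconds(start.strip()),
               srt_time_to_seconds(end.strip()),
               " ".join(lines[2:]))

def transcript_window(cues, t: float, margin: float = 10.0):
    """Captions fully inside a window of +/- `margin` seconds around timestamp `t`."""
    return [c for s, e, c in cues if s >= t - margin and e <= t + margin]
```

The question timestamp selects both the transcript window and the video frames to sample before and during the pause segment.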
Training Hyperparameters
- Training regime: bf16 mixed precision (typical for Qwen‑VL; not explicitly stated)
- Learning rate: 1e‑4
- KL coefficient: 0.01
- Reward scaling: 2.0
- Group size: 8
- Sampling: temperature 1.0, top‑p 0.9
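With group size 8, GRPO scores each of the 8 sampled answers for a prompt against its own group's statistics rather than a learned value baseline. A minimal sketch of that advantage computation (not the project's actual code):

```python
def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward by its group's mean and std; no value network needed."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Answers that beat their group's average get positive advantages and are reinforced; a uniformly rewarded group yields zero advantage everywhere, so no update signal.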
Evaluation
Testing Data, Factors & Metrics
Testing Data
- DoraVQA (test split)
- CVBench
- Video‑MME
- NEXT‑QA
Factors
- Spatial vs. non‑spatial reasoning
- Immediate vs. sequential reasoning
- Text‑only, visual‑only, and multimodal inputs
Metrics
- Top‑1 accuracy (for MCQ benchmarks)
- Reward‑aligned correctness (during training)
- Qualitative spatial reasoning performance
Results
- +8–14 point improvement on DoraVQA
- 86.16% on CVBench (state‑of‑the‑art)
- Strong transfer to Video‑MME and NEXT‑QA
- Improved spatial localization, navigation, and object selection
- Counting improves but remains challenging
Summary
Structured educational content provides strong supervision signals that significantly improve spatial reasoning in VLMs, even with small datasets.
Technical Specifications
Model Architecture and Objective
- Base: Qwen2‑VL or Qwen3‑VL
- Objective: Reinforcement learning with GRPO
- Reward: F1 + normalized Levenshtein distance
- Input: Multimodal (video frames + transcript + question)
Compute Infrastructure
Hardware
1× NVIDIA H200 GPU.
Software
Linux.
Citation
BibTeX:
```bibtex
@misc{galoaa2026structuredscalelearningspatial,
  title={Structured Over Scale: Learning Spatial Reasoning from Educational Video},
  author={Bishoy Galoaa and Xiangyu Bai and Sarah Ostadabbas},
  year={2026},
  eprint={2601.23251},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.23251},
}
```

