Mr-Big-Eye Orchestrator v2 β€” distilled agentic video-QA LoRA (Qwen2.5-7B)

A QLoRA adapter that turns Qwen/Qwen2.5-7B-Instruct into an agentic video-QA orchestrator β€” a tool-calling policy that, given a question about a long video, plans a retrieval strategy, issues tool calls over a multimodal evidence store, checks evidence sufficiency, and answers with grounded timestamp citations. The policy is distilled from a strong proprietary teacher's tool-call trajectories (behavioral cloning of an agent, not plain QA fine-tuning).

TL;DR β€” on a 181-case leakage-free hard heldout (judged pass_rate), this adapter lifts the base 7B from 0.547 β†’ 0.669 (+12.2pp), with the biggest jump on audio-visual / temporal ("joint") reasoning: 0.362 β†’ 0.621 (+25.9pp). It is the best of three compared checkpoints and meets the pre-registered acceptance bar (+3pp over the previous model).

⚠️ This is a LoRA adapter + an agent policy, not a standalone chat model. It emits tool calls in the Mr-Big-Eye harness schema and needs that harness (retrieval indexes + a VLM tool) to run end-to-end.


1. What the model actually does (architecture)

Each video is pre-indexed offline into a multimodal evidence store: ASR transcript (SenseVoice + FSMN-VAD, sentence-level timestamps), scene/dense frames (SigLIP2 embeddings + VLM captions), and slides/OCR, each in its own vector index. At question time the orchestrator runs an agent loop over a tool suite:

  • retrieve_video_evidence / retrieve_transcript_evidence / retrieve_slide_evidence β€” modality-specific retrieval over the per-video indexes
  • search_transcript_keyword, segment_focus, build_timeline β€” targeted lookup, temporal zoom, and event ordering
  • align_audiovisual_evidence β€” couple audio (transcript) with visual (frames) for "joint" questions
  • retrieve_hypothesis_evidence, assess_evidence_sufficiency, stitched_verify β€” hypothesis-driven gathering, a stop-discipline sufficiency check, and grounding verification
  • answer_with_evidence β€” final answer with [TRANSCRIPT:t=..] / [FRAME:t=..] citations that are scored for correctness

The adapter is the policy that decides which tool to call when, with what arguments, and when evidence is sufficient to answer β€” the hard part of long-video QA.

2. Results (judged pass_rate, 181-case leakage-free heldout)

90 Video-MME + 91 WorldSense questions; no heldout video was ever trained on (verified 0 leakage at the trajectory source). All three models evaluated apples-to-apples on the same cases, judge on, served from a single vLLM with hot-swapped LoRA modules, 1 run. base = Qwen2.5-7B-Instruct; pilot = an earlier 355-sample adapter; v2 = this model.

pass_rate base pilot v2 (this) v2βˆ’base v2βˆ’pilot
overall 0.547 0.630 0.669 +12.2pp +3.9pp βœ…
joint (audio-visual + temporal) 0.362 0.621 0.621 +25.9pp +0.0
audio 0.356 0.444 0.444 +8.9pp +0.0
visual 0.785 0.708 0.815 +3.1pp +10.8pp
overview 0.846 0.923 0.923 +7.7pp +0.0
β€” Video-MME (90) 0.700 0.722 0.800 +10.0pp +7.8pp
β€” WorldSense (91) 0.396 0.538 0.538 +14.2pp +0.0

Per-modality n is 13–65; treat <3pp deltas as ties. W&B: training run ts8urbus, eval run gi7iicvl. Per-model breakdowns are in eval/.

3. Distillation methodology

  • Teacher: a strong proprietary orchestrator (deepseek-v4-pro) drives the same tool harness; its full tool-call trajectories are captured per question.
  • Quality tiering (the key data-quality lever): each trajectory is graded β€” tier-1 (judge-correct, no runtime guard), tier-2 (correct, only a benign dedup guard), tier-3 (teacher wrong / degenerate). Only tier-1+2 are kept β‡’ the SFT set is 100% teacher-correct; 40% of raw trajectories (all judge_wrong) are discarded.
  • SFT shaping: trajectories are sliced into per-step (context β†’ next tool call) samples; forced/guard targets are excluded so the student learns the policy, not the harness's safety rails. base64 image blobs are stripped from context.
  • QLoRA: r=32, Ξ±=64, 4-bit nf4, Liger fused cross-entropy (fits 8192-token agent contexts on a 20 GB card), lr 1e-4 cosine, batch 1 Γ— grad-accum 8, bf16, 2 epochs. ~1,499 SFT samples. Best checkpoint selected by held-out task pass_rate, not eval loss (see Β§5).

4. Data engineering

  • Difficulty/modality rebalance: short, near-trivial clips are dropped; training is refocused on harder medium videos selected by a modality-weighted greedy sampler that raises the multimodal (joint) share.
  • External audio-visual enrichment: integrated WorldSense (audio-visual-synchronised MCQ) to push the joint+audio share of new training data from ~14% β†’ 70%.
  • Leakage control: held-out videos are force-excluded from training and never teacher-captured; 0 leak verified at source.
  • Robust ingestion: per-frame VLM content-filter rejections are tolerated (drop one caption, keep the video) instead of failing a whole video β€” recovered held-out items.

5. Engineering rigor & findings (why this is more than a fine-tune)

  • Quality ≫ quantity, measured. A prior 4.5Γ— data scale-up (355 β†’ 1,618 samples) by adding easy clips regressed βˆ’4.2pp on hard cases; rebalancing toward harder, multimodal data at ~constant size produced the best model instead.
  • Honest model selection. All 20-step checkpoints were retained; the eval-loss minimum β‰  the pass_rate maximum, so the shipped checkpoint was chosen (and re-verified) by hard-set pass_rate. A cheap checkpoint-trajectory probe showed pass_rate peaks mid-training while eval loss keeps falling β€” so a 3rd epoch was rejected as overfitting.
  • Teacher-capacity ceiling (a clean negative result). The audio-visual enrichment gave no joint/audio gain over the pilot because the teacher itself fails ~56% on the hardest audio-visual cases β€” the bottleneck there is the teacher/tooling, not the student or data volume. This redirected strategy from "more data" to "stronger teacher / audio tool."
  • Reproducible infra: 4-GPU sharded ingest + teacher capture + sharded evaluation; W&B tracking; single-server multi-LoRA eval.

6. Usage (vLLM, OpenAI-compatible + tool calls)

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --enable-lora --lora-modules v2=michaelrhs/mr-big-eye-orch-v2 \
  --max-lora-rank 32 --enable-auto-tool-choice --tool-call-parser hermes \
  --max-model-len 16384

Then drive it with the Mr-Big-Eye harness (point the orchestrator at the served v2 model; keep your VLM/judge endpoints unchanged). The adapter expects the harness's tool schemas; loaded as a bare chat model it will answer but without evidence tools.

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", torch_dtype="bfloat16")
model = PeftModel.from_pretrained(base, "michaelrhs/mr-big-eye-orch-v2")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

7. Limitations

  • Research artifact; non-commercial (derives from WorldSense data, CC-BY-NC-4.0).
  • It is a tool-calling agent policy for the Mr-Big-Eye harness, not a general VQA/chat model.
  • Audio-visual reasoning is capped by the teacher (Β§5): expect gains mainly on visual/temporal retrieval, not on the hardest non-speech-audio questions.
  • Evaluated at nβ‰ˆ181 (per-modality 13–65) β€” small-N; <3pp deltas are ties.

8. Links

Downloads last month
22
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for michaelrhs/mr-big-eye-orch-v2

Base model

Qwen/Qwen2.5-7B
Adapter
(2138)
this model