VQVA/BAGEL-World-model

ZichengD committed on Oct 16, 2025

Commit e11274f · verified · 1 Parent(s): aa4d759

Update README.md

Files changed (1)

README.md +48 -3

@@ -1,3 +1,48 @@
----
-license: cc-by-4.0
----

+---
+tags:
+- Visual Question-Visual Answering (VQVA)
+- dataset-construction
+- image-editing
+- multimodal
+- instruction-tuning
+- visual-reasoning
+---
+# 🥯 **BAGEL-World-model**
+**An agentic, data-centric framework for producing large-scale interleaved Visual Question–Visual Answering (VQ-VA) data.**
+---
+The BAGEL-World framework outputs high-quality VQ-VA data via the following steps:
+### 🔄**Preprocessing**
+Filters noisy web-interleaved data and classifies it into design- and knowledge-related documents.
+### 🤖**Agentic Pipeline for VQ-VA Data Creation**
+**1. Retriever** selects image pairs containing non-trivial transformations from interleaved documents that can serve as the basis for free-form questions.
+**2. Instruction Generator** writes a natural-language question about one image so that the other image serves as the correct answer.
+**3. Filterer** removes low-quality triplets ⟨Question Image, Question Text, Answer Image⟩.
+**4. Rewriter** increases instruction diversity by producing multiple variants of the original questions.
+**5. Reasoner** generates a language-based chain-of-thought explanation describing how the source image should be transformed to obtain the target image.
+The framework finally outputs **interleaved quadruplets**:
+- 🧠 *Question Image*
+- 💬 *Visual Question / Instruction*
+- 🔍 *Reasoning Trace*
+- 🎨 *Answer Image*
+Stay tuned for updates and examples!
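The five agent stages and the quadruplet output described in the README above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Quadruplet` field names, the `run_pipeline` function, and every stage callable are hypothetical, and the sketch assumes that the original question is kept alongside its rewritten variants and that the Reasoner sees the question text as well as both images.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Quadruplet:
    """One output record; field names are illustrative, not from the README."""
    question_image: str   # path or ID of the question (source) image
    question_text: str    # visual question / instruction
    reasoning_trace: str  # chain-of-thought describing the transformation
    answer_image: str     # path or ID of the answer (target) image


def run_pipeline(
    documents: Iterable[List[str]],
    retrieve: Callable[[List[str]], List[Tuple[str, str]]],
    write_question: Callable[[str, str], str],
    keep: Callable[[str, str, str], bool],
    rewrite: Callable[[str], List[str]],
    reason: Callable[[str, str, str], str],
) -> List[Quadruplet]:
    """Chain the five stages; each stage is passed in as a plain callable."""
    results: List[Quadruplet] = []
    for doc in documents:
        for q_img, a_img in retrieve(doc):                  # 1. Retriever
            question = write_question(q_img, a_img)          # 2. Instruction Generator
            if not keep(q_img, question, a_img):             # 3. Filterer
                continue
            # Assumption: the original question is kept next to its variants.
            for variant in [question, *rewrite(question)]:   # 4. Rewriter
                trace = reason(q_img, variant, a_img)        # 5. Reasoner
                results.append(Quadruplet(q_img, variant, trace, a_img))
    return results


# Toy stand-ins so the sketch runs end to end; real stages would call models.
def toy_retrieve(doc: List[str]) -> List[Tuple[str, str]]:
    return list(zip(doc, doc[1:]))  # consecutive images as candidate pairs


def toy_write_question(q_img: str, a_img: str) -> str:
    return f"How does {q_img} become {a_img}?"


def toy_keep(q_img: str, question: str, a_img: str) -> bool:
    return a_img != "c.png"  # pretend the second pair is low quality


def toy_rewrite(question: str) -> List[str]:
    return [question.upper()]  # one trivial variant


def toy_reason(q_img: str, question: str, a_img: str) -> str:
    return f"Transform {q_img} into {a_img}."


quads = run_pipeline(
    [["a.png", "b.png", "c.png"]],
    toy_retrieve, toy_write_question, toy_keep, toy_rewrite, toy_reason,
)
```

Passing the stages in as callables keeps the control flow visible without committing to any particular model or API for the individual agents.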