	---
	tags:
	- Visual Question-Visual Answering (VQVA)
	- dataset-construction
	- image-editing
	- multimodal
	- instruction-tuning
	- visual-reasoning
	---

	# 🥯 BAGEL-World-model

	An agentic, data-centric framework for producing large-scale interleaved Visual Question–Visual Answering (VQ-VA) data.


	---

	The BAGEL-World framework outputs high-quality VQ-VA data via the following steps:

	### 🔄 Preprocessing

	Filters noisy web-interleaved data and classifies the remaining documents as design-related or knowledge-related.
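
The filter-and-classify step might look roughly like the sketch below. It is only an illustration: the keyword lists, the `Document` type, the `classify_document` function, and the two-image minimum are all assumptions made for this example, and the actual framework presumably uses a stronger classifier than keyword matching.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical keyword lists; the README does not say how documents are classified.
DESIGN_KEYWORDS = {"design", "layout", "logo", "sketch"}
KNOWLEDGE_KEYWORDS = {"science", "history", "process", "experiment"}


@dataclass
class Document:
    """A web document with interleaved text and images (hypothetical structure)."""
    text: str
    image_paths: List[str]


def classify_document(doc: Document, min_images: int = 2) -> Optional[str]:
    """Return 'design' or 'knowledge', or None if the document is filtered out."""
    # Assumption: a document needs at least two images to yield an image pair later.
    if len(doc.image_paths) < min_images:
        return None
    words = set(doc.text.lower().split())
    design_hits = len(words & DESIGN_KEYWORDS)
    knowledge_hits = len(words & KNOWLEDGE_KEYWORDS)
    # Documents matching neither category are treated as noise and dropped.
    if design_hits == 0 and knowledge_hits == 0:
        return None
    return "design" if design_hits >= knowledge_hits else "knowledge"
```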


	### 🤖 Agentic Pipeline for VQ-VA Data Creation

	1. Retriever selects image pairs containing non-trivial transformations from interleaved documents that can serve as the basis for free-form questions.

	2. Instruction Generator writes a natural-language question about one image so that the other image serves as the correct answer.

	3. Filterer removes low-quality triplets ⟨Question Image, Question Text, Answer Image⟩.

	4. Rewriter increases instruction diversity by producing multiple variants of the original questions.

	5. Reasoner generates a language-based chain-of-thought explanation describing how the source image should be transformed to obtain the target image.
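
The five stages above can be read as a simple composition of functions. The following is a minimal sketch of that control flow, not the framework's actual code: `Sample`, `run_pipeline`, and the callable signatures are hypothetical, and each stage (in practice likely a model-backed agent) is passed in as a plain function. Keeping the original question alongside its rewritten variants is also an assumption.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Sample:
    """One candidate example flowing through the pipeline (hypothetical structure)."""
    question_image: str
    answer_image: str
    question_text: str = ""
    reasoning: str = ""


def run_pipeline(
    documents: Iterable[List[str]],
    retrieve: Callable[[List[str]], List[Tuple[str, str]]],
    generate: Callable[[str, str], str],
    keep: Callable[[Sample], bool],
    rewrite: Callable[[str], List[str]],
    reason: Callable[[Sample], str],
) -> List[Sample]:
    """Chain the five stages over documents given as lists of image paths."""
    outputs: List[Sample] = []
    for images in documents:
        # 1. Retriever: pick image pairs with a non-trivial transformation.
        for q_img, a_img in retrieve(images):
            # 2. Instruction Generator: write a question answered by the second image.
            sample = Sample(q_img, a_img, generate(q_img, a_img))
            # 3. Filterer: drop low-quality triplets.
            if not keep(sample):
                continue
            # 4. Rewriter: keep the original question and add its variants.
            for text in [sample.question_text, *rewrite(sample.question_text)]:
                variant = replace(sample, question_text=text)
                # 5. Reasoner: attach a chain-of-thought explanation.
                outputs.append(replace(variant, reasoning=reason(variant)))
    return outputs
```

Passing the stages as callables keeps the sketch independent of any particular model or API; each one can be swapped for a real agent without changing the loop.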

	Finally, the framework outputs interleaved quadruplets, each consisting of:

	- 🧠 Question Image
	- 💬 Visual Question / Instruction
	- 🔍 Reasoning Trace
	- 🎨 Answer Image
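
One way to represent such a quadruplet on disk is a JSON Lines record with one field per element. This is a sketch under assumptions: the README does not specify a storage format, and the `Quadruplet` class, its field names, and the helper functions are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Quadruplet:
    """One output record; field names are illustrative, not the released schema."""
    question_image: str    # path or URL of the question image
    question_text: str     # visual question / instruction
    reasoning_trace: str   # chain-of-thought from the Reasoner
    answer_image: str      # path or URL of the answer image


def to_jsonl_line(item: Quadruplet) -> str:
    """Serialize one quadruplet as a single JSON line."""
    return json.dumps(asdict(item), ensure_ascii=False)


def from_jsonl_line(line: str) -> Quadruplet:
    """Parse a JSON line back into a quadruplet."""
    return Quadruplet(**json.loads(line))
```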





	Stay tuned for updates and examples!