mair-lab/thinking-sft-simple

	---
	language:
	- en
	base_model:
	- BAAI/Emu3-Stage1
	---
	# EARL - SFT think (S) (8B)

	- Model Size: 8B parameters
	- Base Model: [BAAI/Emu3-Stage1](https://huggingface.co/BAAI/Emu3-Stage1)
	- Dataset: Simple Edit
	- Training Objective: Supervised Fine-Tuning (SFT) with Chain-of-Thought reasoning

	This model is introduced in our paper: [EARL: The Promise of RL for Autoregressive Image Editing](https://arxiv.org/abs/2508.01119).

	## Overview

	EARL - SFT think (S) is a fine-tuned 8B vision-language model for autoregressive image editing. It extends the base Emu3 model with chain-of-thought supervision so that it can reason step by step when tackling complex editing tasks. It is trained on the Simple Edit dataset, whose editing instructions are grounded in visual understanding.

	🔗 Inference script and usage: [GitHub Repository](https://github.com/saba96/EARL?tab=readme-ov-file)
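The inference code itself lives in the GitHub repository linked above. As a minimal sketch of the step before it, the checkpoint files can be fetched from the Hub with `huggingface_hub`. This assumes the `huggingface_hub` package is installed; the function name `download_checkpoint` and the default `local_dir` are illustrative choices, not part of the EARL codebase.

```python
# Repository ID of this model card on the Hugging Face Hub.
REPO_ID = "mair-lab/thinking-sft-simple"


def download_checkpoint(local_dir="./thinking-sft-simple"):
    """Download all files of the model repository and return the local path.

    Hypothetical helper for illustration; the import is deferred so the
    module can be loaded even when huggingface_hub is not installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


if __name__ == "__main__":
    # Downloads several GB of weights; run only when you need the files.
    print(download_checkpoint())
```

The returned directory can then be passed to the inference script from the GitHub repository, following its README.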

	## Benchmark Results

	| Model         | OmniEdit | EmuEdit | AURORA | MB   | VisMin | I2EBench | AVG  |
	|---------------|----------|---------|--------|------|--------|----------|------|
	| SFT (S)       | 5.73     | 3.66    | 3.58   | 3.19 | 3.57   | 3.59     | 3.88 |
	| SFT think (S) | 4.34     | 3.76    | 2.88   | 3.36 | 3.46   | 3.21     | 3.50 |

	> ⚠️ Despite its added reasoning supervision, the SFT think variant underperforms the standard SFT model on average (3.50 vs. 3.88). It scores higher only on EmuEdit and MB.
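To make the comparison concrete, the short sketch below recomputes the per-benchmark gaps and the averages from the scores in the table. The variable names are illustrative, and because the inputs are the rounded table values, a recomputed mean can differ from the reported AVG column in the last digit.

```python
# Scores copied from the benchmark table in this model card.
BENCHMARKS = ["OmniEdit", "EmuEdit", "AURORA", "MB", "VisMin", "I2EBench"]
SFT = [5.73, 3.66, 3.58, 3.19, 3.57, 3.59]
SFT_THINK = [4.34, 3.76, 2.88, 3.36, 3.46, 3.21]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Positive delta: SFT think scores higher than plain SFT on that benchmark.
deltas = {
    name: round(think - base, 2)
    for name, base, think in zip(BENCHMARKS, SFT, SFT_THINK)
}
wins = [name for name, delta in deltas.items() if delta > 0]

for name, delta in deltas.items():
    print(f"{name:>9}: {delta:+.2f}")
print(f"mean SFT (S):       {mean(SFT):.3f}")
print(f"mean SFT think (S): {mean(SFT_THINK):.3f}")
print("SFT think is ahead on:", ", ".join(wins))
```

The largest gap is on OmniEdit, which accounts for most of the difference between the two averages.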

	## Intended Use

	This model is suited for research and development in image editing tasks that benefit from interpretable reasoning, such as instructional or multi-step visual modifications.