---
license: apache-2.0
pipeline_tag: any-to-any
library_name: bagel-mot
tags:
- sgt
- semantic-generative-tuning
- unified-multimodal
- image-segmentation
- visual-understanding
- visual-generation
---
# SGT: Semantic Generative Tuning for Unified Multimodal Models
This repository hosts checkpoints fine-tuned with **Semantic Generative Tuning (SGT)**, a training
paradigm that couples visual *understanding* and *generation* in Unified Multimodal Models (UMMs)
by using **image segmentation as a generative proxy**.
> Unified multimodal models typically optimize understanding and generation with *misaligned*
> objectives (sparse text tokens vs. dense pixel targets), which isolates the two capabilities.
> SGT introduces segmentation, a **high-level semantic task**, as a unified generative objective
> that aligns the two branches, improves feature linear separability, and optimizes visual-textual
> attention allocation.
## Method Overview
SGT reformulates classical visual tasks as generative proxies and establishes a **hierarchical
taxonomy** (low-/mid-/high-level). Extensive experiments show that **high-level semantic tasks
(e.g. image segmentation) are the optimal proxy**, outperforming depth, edge, reconstruction and
MAE/inpainting for synergizing understanding and generation.
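To make "segmentation as a generative proxy" concrete, the sketch below renders an integer label map as an RGB image, so that a generative branch could be trained to produce it like any other picture. This is an illustration under assumptions: the text above does not state how SGT encodes masks, and `label_color` and `mask_to_rgb` are hypothetical helpers that borrow the PASCAL VOC colormap convention.

```python
from typing import List, Sequence, Tuple

Color = Tuple[int, int, int]


def label_color(label: int) -> Color:
    """PASCAL VOC-style colormap: spread the bits of `label` across R, G, B.

    Assumption: this palette is only one possible encoding, not SGT's own.
    """
    r = g = b = 0
    for shift in range(7, -1, -1):
        r |= (label & 1) << shift
        g |= ((label >> 1) & 1) << shift
        b |= ((label >> 2) & 1) << shift
        label >>= 3
    return r, g, b


def mask_to_rgb(mask: Sequence[Sequence[int]]) -> List[List[Color]]:
    """Turn an H x W grid of class ids into an H x W grid of RGB pixels."""
    return [[label_color(v) for v in row] for row in mask]
```

With this encoding, background (class 0) stays black and each class id maps to a distinct color, which gives a dense pixel target of the same kind as an ordinary image-generation target.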
Key findings:
1. **High-level > low-level**: segmentation gives larger gains in visual understanding
   than depth / edge / pixel reconstruction.
2. **Perception, not reasoning**: visual supervision mainly strengthens perception
   (spatial, hallucination, vision-centric, general VQA) rather than abstract reasoning (e.g. math, chart).
3. **Architecture-agnostic**: the gains hold for both **BAGEL** and **OmniGen2**.
## Released Artifacts
| Repo | Type | Base Model | Content |
|---|---|---|---|
| [`Two-hot/SGT-BAGEL`](https://huggingface.co/Two-hot/SGT-BAGEL) | model | BAGEL-7B-MoT | SGT fine-tuned BAGEL checkpoint |
| [`Two-hot/SGT-Gen2`](https://huggingface.co/Two-hot/SGT-Gen2) | model | OmniGen2 | SGT fine-tuned OmniGen2 checkpoint (transformer/ only) |
| [`Two-hot/SAM-SGT`](https://huggingface.co/datasets/Two-hot/SAM-SGT) | dataset | N/A | Segmentation training data (tar-sharded) used by SGT |
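The artifacts in the table above can be fetched with the `huggingface_hub` client. The following is a minimal sketch, assuming `huggingface_hub` is installed; the `ARTIFACTS` mapping and the `fetch` helper are hypothetical names introduced here, while the repo ids and types come from the table.

```python
from typing import Optional

# Repo ids and types copied from the table above; the short keys are made up.
ARTIFACTS = {
    "sgt-bagel": {"repo_id": "Two-hot/SGT-BAGEL", "repo_type": "model"},
    "sgt-gen2": {"repo_id": "Two-hot/SGT-Gen2", "repo_type": "model"},
    "sam-sgt": {"repo_id": "Two-hot/SAM-SGT", "repo_type": "dataset"},
}


def fetch(name: str, local_dir: Optional[str] = None) -> str:
    """Download one artifact and return the local directory it was stored in."""
    spec = ARTIFACTS[name]  # raises KeyError for unknown names
    # Imported lazily so the mapping can be inspected without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=spec["repo_id"],
        repo_type=spec["repo_type"],
        local_dir=local_dir,  # None keeps the default Hugging Face cache
    )
```

For example, `fetch("sgt-gen2")` would download the OmniGen2 checkpoint. Since that repo is listed as `transformer/` only, the remaining OmniGen2 components presumably have to come from the base model.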
### Use the SAM-SGT dataset
See [`Two-hot/SAM-SGT`](https://huggingface.co/datasets/Two-hot/SAM-SGT) for the data
layout and the extraction instructions.
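Because the data is tar-sharded, unpacking it locally might look like the sketch below. It assumes the shards are plain `*.tar` files somewhere under a download directory; the actual file names and layout are not given here, so the dataset page remains the authoritative source. `extract_shards` is a hypothetical helper.

```python
import tarfile
from pathlib import Path
from typing import List


def extract_shards(shard_dir: str, out_dir: str) -> List[str]:
    """Extract every *.tar shard found under `shard_dir` into `out_dir`.

    Assumption: shards are uncompressed or auto-detectable tar archives
    ending in ".tar". Returns the shard file names in processing order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    for shard in sorted(Path(shard_dir).rglob("*.tar")):
        with tarfile.open(shard) as tf:
            if hasattr(tarfile, "data_filter"):
                # Safer extraction on Python versions that support filters.
                tf.extractall(out, filter="data")
            else:
                tf.extractall(out)
        extracted.append(shard.name)
    return extracted
```

Shards are processed in sorted order so that repeated runs behave the same way; members with identical paths in later shards would overwrite earlier ones.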
## Highlights
- **+6.02%** average gain over BAGEL on the **CV-Bench** evaluation.
- Consistent improvements in **spatial reasoning**, **hallucination resistance**, **vision-centric**, and **general VQA**.
- Generation: gains across **GenEval** dimensions such as Position and Color.
- Verified on two representative UMM architectures (**BAGEL**, **OmniGen2**).
## License
Apache-2.0. Base models remain under their original licenses:
BAGEL (Apache-2.0, based on Qwen2.5-7B + SigLIP + FLUX VAE) and
OmniGen2 (based on Qwen2.5-VL + diffusion transformer).
## Citation
If you find this work useful, please cite our paper:
```bibtex
@article{sgt2026,
  title   = {Semantic Generative Tuning for Unified Multimodal Models},
  author  = {Songsong Yu and Yuxin Chen and Ying Shan and Yanwei Li},
  journal = {arXiv},
  year    = {2026}
}
```