---
license: apache-2.0
pipeline_tag: any-to-any
library_name: bagel-mot
tags:
- sgt
- semantic-generative-tuning
- unified-multimodal
- image-segmentation
- visual-understanding
- visual-generation
---
# SGT: Semantic Generative Tuning for Unified Multimodal Models
This repository hosts checkpoints fine-tuned with **Semantic Generative Tuning (SGT)**, a training
paradigm that couples visual *understanding* and *generation* in Unified Multimodal Models (UMMs)
by using **image segmentation as a generative proxy**.
> Unified multimodal models typically optimize understanding and generation with *misaligned*
> objectives (sparse text tokens vs. dense pixel targets), which isolates the two capabilities.
> SGT introduces segmentation, a **high-level semantic task**, as a unified generative objective
> that aligns the two branches, improves feature linear separability, and optimizes visual-textual
> attention allocation.
## Method Overview
SGT reformulates classical visual tasks as generative proxies and organizes them into a **hierarchical
taxonomy** (low-, mid-, and high-level). Extensive experiments show that **high-level semantic tasks
(e.g. image segmentation) are the most effective proxy**, outperforming depth, edge, reconstruction, and
MAE/inpainting objectives at coupling understanding and generation.
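To make "segmentation as a generative proxy" concrete, one plausible way to turn a segmentation label map into a dense generation target is to render it as a color-coded image. The sketch below assumes that encoding. This card does not specify how SGT actually represents masks, and `palette` and `mask_to_rgb` are hypothetical helpers written only for illustration.

```python
import colorsys


def palette(num_classes):
    """Return one RGB color per class id; class 0 (background) is black.

    Hypothetical helper: hues are spaced evenly so classes stay distinguishable.
    """
    colors = [(0, 0, 0)]
    for k in range(1, num_classes):
        hue = (k - 1) / max(num_classes - 1, 1)
        r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
        colors.append((round(r * 255), round(g * 255), round(b * 255)))
    return colors


def mask_to_rgb(mask, colors):
    """Map an H x W grid of integer class ids to an H x W grid of RGB tuples.

    The result is a dense pixel target of the same kind an image-generation
    branch is trained on, which is the assumption this sketch illustrates.
    """
    return [[colors[class_id] for class_id in row] for row in mask]


# Toy 2 x 2 label map with background (0) and two object classes (1, 2).
toy_mask = [[0, 1],
            [2, 0]]
toy_target = mask_to_rgb(toy_mask, palette(3))
```

Under this assumption the segmentation output lives in pixel space like any other generated image, while its content is determined by semantic class labels rather than low-level appearance.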
Key findings:
1. **High-level > low-level**: segmentation gives larger gains in visual understanding
than depth / edge / pixel reconstruction.
2. **Perception, not reasoning**: visual supervision mainly strengthens perception
(spatial understanding, hallucination resistance, vision-centric tasks, general VQA) rather than abstract reasoning (e.g. math, charts).
3. **Architecture-agnostic**: the gains hold for both **BAGEL** and **OmniGen2**.
## Released Artifacts
| Repo | Type | Base Model | Content |
|---|---|---|---|
| [`Two-hot/SGT-BAGEL`](https://huggingface.co/Two-hot/SGT-BAGEL) | model | BAGEL-7B-MoT | SGT fine-tuned BAGEL checkpoint |
| [`Two-hot/SGT-Gen2`](https://huggingface.co/Two-hot/SGT-Gen2) | model | OmniGen2 | SGT fine-tuned OmniGen2 checkpoint (transformer/ only) |
| [`Two-hot/SAM-SGT`](https://huggingface.co/datasets/Two-hot/SAM-SGT) | dataset | - | Segmentation training data (tar-sharded) used by SGT |
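The artifacts in the table can be fetched with `huggingface_hub.snapshot_download`. The following is a minimal sketch, not an official loading recipe: the repo IDs and repo types come from the table, while the short artifact names, the `checkpoints` directory, and the `download_kwargs` / `download` helpers are assumptions made for illustration.

```python
from pathlib import Path

# Repo IDs and types are taken from the table above.
# The dictionary keys are illustrative local names, not official identifiers.
ARTIFACTS = {
    "sgt-bagel": {"repo_id": "Two-hot/SGT-BAGEL", "repo_type": "model"},
    "sgt-gen2": {"repo_id": "Two-hot/SGT-Gen2", "repo_type": "model"},
    "sam-sgt": {"repo_id": "Two-hot/SAM-SGT", "repo_type": "dataset"},
}


def download_kwargs(name, root="checkpoints"):
    """Build keyword arguments for huggingface_hub.snapshot_download."""
    spec = ARTIFACTS[name]
    return {**spec, "local_dir": str(Path(root) / name)}


def download(name, root="checkpoints"):
    """Download one artifact and return the local path of the snapshot."""
    # Imported lazily so download_kwargs works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(name, root))


# Example (needs network access and `pip install huggingface_hub`):
# local_path = download("sgt-bagel")
```

Downloading only places the files on disk. Running inference still requires the corresponding base-model code (BAGEL or OmniGen2), and since the `Two-hot/SGT-Gen2` repo contains only `transformer/`, the remaining OmniGen2 components have to come from the base model.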
### Use the SAM-SGT dataset
See [`Two-hot/SAM-SGT`](https://huggingface.co/datasets/Two-hot/SAM-SGT) for the data
layout and the extraction instructions.
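As a rough illustration of working with tar-sharded data, the sketch below extracts every `.tar` file found under a directory using only the standard library. It assumes the shards are plain tar archives somewhere under the download directory; the actual file names and layout are documented in the dataset repo, which remains the authoritative source. `extract_shards` is a hypothetical helper.

```python
import tarfile
from pathlib import Path


def extract_shards(shard_dir, out_dir):
    """Extract every *.tar shard under shard_dir into out_dir.

    Returns the number of shards processed. Assumes plain tar archives;
    the real SAM-SGT layout may differ.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    shards = sorted(Path(shard_dir).rglob("*.tar"))
    for shard in shards:
        with tarfile.open(shard) as tar:
            tar.extractall(out)
    return len(shards)


# Example (paths are illustrative):
# extract_shards("checkpoints/sam-sgt", "data/sam-sgt")
```

Only extract archives from sources you trust. On Python 3.12 and later, passing `filter="data"` to `extractall` additionally restricts what an archive is allowed to write.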
## Highlights
- **+6.02%** average gain over BAGEL on the **CV-Bench** evaluation.
- Consistent improvements in **spatial reasoning**, **hallucination resistance**, **vision-centric**, and **general VQA**.
- Generation: gains across **GenEval** dimensions (e.g. Position, Color).
- Verified on two representative UMM architectures (**BAGEL**, **OmniGen2**).
## License
This repository is released under Apache-2.0. Base models remain under their original licenses:
BAGEL (Apache-2.0, based on Qwen2.5-7B + SigLIP + FLUX VAE) and
OmniGen2 (based on Qwen2.5-VL + diffusion transformer).
## Citation
If you find this work useful, please cite our paper:
```bibtex
@article{sgt2026,
title = {Semantic Generative Tuning for Unified Multimodal Models},
author = {Songsong Yu and Yuxin Chen and Ying Shan and Yanwei Li},
journal = {arXiv},
year = {2026}
}
```