---
license: apache-2.0
datasets:
- BLIP3o/BLIP3o-Pretrain-Long-Caption
- BLIP3o/BLIP3o-Pretrain-JourneyDB
- BLIP3o/BLIP3o-60k
- FreedomIntelligence/ShareGPT-4o-Image
- UCSC-VLAA/GPT-Image-Edit-1.5M
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
- Efficient-Large-Model/Sana_1600M_512px_diffusers
pipeline_tag: text-to-image
---
# TBAC-UniImage-3B

[arXiv](https://arxiv.org/abs/2508.08098) | [GitHub](https://github.com/DruryXu/TBAC-UniImage)

![Teaser](./assets/teaser.jpg)

## Overview
This repository contains the official model checkpoints of **TBAC-UniImage-3B**, a unified understanding and generation model developed by the Basic Algorithm Center, Platform and Content Group, Tencent.

Our model has two components: [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) serves as the understanding module, and [SANA-1600M](https://huggingface.co/Efficient-Large-Model/Sana_1600M_512px_diffusers) serves as the generation module. The conditions for generation originate from the representations of different Qwen2.5-VL-3B-Instruct layers.
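As a rough illustration of conditioning on several MLLM layers, the sketch below taps a few layers of a toy stack of hidden states and projects each one to a conditioning width. The layer indices, the dimensions, the helper names, and the per-layer linear projection are all assumptions made for illustration; the actual connector is defined in the GitHub repository.

```python
import random


def linear(x, weight):
    """Compute x @ weight for x of shape [tokens][in_dim] and weight [in_dim][out_dim]."""
    return [
        [sum(xi * w for xi, w in zip(row, col)) for col in zip(*weight)]
        for row in x
    ]


def layer_conditions(hidden_states, tap_layers, projections):
    """Return one projected condition per tapped layer (hypothetical helper)."""
    return [linear(hidden_states[layer], projections[layer]) for layer in tap_layers]


random.seed(0)
# Toy sizes, not the real model dimensions.
num_layers, seq_len, mllm_dim, cond_dim = 4, 5, 8, 3

# Stand-in for per-layer MLLM hidden states: [layers][tokens][mllm_dim].
hidden_states = [
    [[random.gauss(0, 1) for _ in range(mllm_dim)] for _ in range(seq_len)]
    for _ in range(num_layers)
]

# Assumed choice of layers to tap, each with its own random projection matrix.
tap_layers = [1, 3]
projections = {
    layer: [[random.gauss(0, 1) for _ in range(cond_dim)] for _ in range(mllm_dim)]
    for layer in tap_layers
}

conditions = layer_conditions(hidden_states, tap_layers, projections)
print(len(conditions), len(conditions[0]), len(conditions[0][0]))
```

Each entry of `conditions` has shape `[seq_len][cond_dim]`, so a generator could consume the tapped layers separately rather than only the final-layer output.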

![Model](./assets/model.png)


## Text-to-Image Generation Performance

### Qualitative Results
![t2i](./assets/t2i.png)

### GenEval and DPG-Bench
| Method | Base (M)LLM | GenEval | DPG-Bench |
| :--- | :--- | :--- | :--- |
| MetaQuery | Qwen2.5-VL-3B-Instruct | 0.78 | 81.10 |
| | Qwen2.5-VL-7B-Instruct | 0.80 | 82.05 |
| BLIP3-o | Qwen2.5-VL-3B-Instruct | 0.81 | 79.36 |
| | Qwen2.5-VL-7B-Instruct | 0.83 | 80.73 |
| BAGEL | MoT-7B | 0.82 | - |
| Show-o2 | Qwen2.5-1.5B-Instruct | 0.73 | 85.02 |
| | Qwen2.5-7B-Instruct | 0.76 | 86.14 |
| Tar | Qwen2.5-1.5B-Instruct | 0.76 | 82.96 |
| | Qwen2.5-7B-Instruct | 0.84 | 84.65 |
| Qwen-Image | Qwen2.5-VL-7B-Instruct | 0.87 | 88.32 |
| **Ours** | **Qwen2.5-VL-3B-Instruct** | **0.87** | 80.97 |

### TIIF-Bench

![TIIF](./assets/tiif_bench.png)

## Image Editing Performance

The input image is processed by the Qwen2.5-VL image encoder and then fed into the MLLM together with the text and the learnable queries. Only the learnable queries, which have fused the multimodal information, serve as the generative condition; unlike other works, we do not directly incorporate any image VAE representations. Despite this, the model achieves promising multimodal understanding and consistency in image editing tasks.
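The data flow described above can be sketched with a toy stand-in for the MLLM. This is a minimal sketch under stated assumptions: the token ordering (image, then text, then queries), the zero-initialised queries, and the running-mean `toy_mllm` are illustrative choices, not the released implementation.

```python
def build_sequence(image_tokens, text_tokens, query_tokens):
    """Concatenate image, text, and query embeddings; also return the query slice."""
    sequence = image_tokens + text_tokens + query_tokens
    query_slice = slice(len(sequence) - len(query_tokens), len(sequence))
    return sequence, query_slice


def toy_mllm(sequence):
    """Stand-in for the MLLM: each output is the running mean of all tokens so far."""
    outputs, running = [], [0.0] * len(sequence[0])
    for t, token in enumerate(sequence, start=1):
        running = [r + x for r, x in zip(running, token)]
        outputs.append([r / t for r in running])
    return outputs


def select_condition(mllm_outputs, query_slice):
    """Keep only the outputs at the query positions as the generative condition."""
    return mllm_outputs[query_slice]


# Toy 2-dimensional embeddings; real embeddings come from the encoders.
image_tokens = [[1.0, 0.0], [0.0, 1.0]]  # from the image encoder
text_tokens = [[2.0, 2.0]]               # from the text embedding layer
query_tokens = [[0.0, 0.0], [0.0, 0.0]]  # learnable in practice, zeros here

sequence, query_slice = build_sequence(image_tokens, text_tokens, query_tokens)
condition = select_condition(toy_mllm(sequence), query_slice)
print(condition)
```

Because the queries sit at the end of the sequence, their outputs depend on both the image and the text tokens, while no image token is passed to the generator directly.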

### Qualitative Results
![ti2i](./assets/ti2i.png)

### ImgEdit

![ImgEdit](./assets/imgedit.png)

## Training and Inference
Please refer to [GitHub](https://github.com/DruryXu/TBAC-UniImage) for the training and inference code.

### A Few Prompts Used in the Teaser

- A serene photograph of a ginger and white cat sitting in a sunlit grassy field. The cat is positioned slightly to the right, gazing upwards with a calm expression. Its fur is a soft orange with distinct white patches on its chest and face. The foreground features out-of-focus blades of grass, creating a dreamy bokeh effect. The background is a blurred mix of soft greens and browns, suggesting a natural outdoor setting. The lighting is warm and golden, highlighting the cat's fur and casting gentle shadows. The image has a shallow depth of field, emphasizing the cat while the background remains softly blurred. Photorealistic, tranquil, natural lighting, warm color palette, high contrast, intimate, peaceful atmosphere.
- a chinese rabbit made of white ceramic with blue ink, stunning intricate designs, (geometric and floral patterns, fine art))
- nighttime view of a gothic horror castle with vivid colors and a wood press style, dark fantasy
- few palm tree and an old car, simple background, pop art style
- Steampunk architecture in the forest, reactor, rusty green color scheme with Studio Ghibli style, lots of details, mechanical, green, forest, trees, moss, 8K, Unreal Engine, C4D rendering, Ultra HD details
- An astronaut holding a stop sign on the moon.

## Acknowledgements
The training and inference code is adapted from [MetaQuery](https://github.com/facebookresearch/metaquery). We thank the authors for their contribution!

## About
Created by the Tencent PCG Basic Algorithm Center. All rights reserved.