Paper: BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset (arXiv:2505.09568)
Illuma is an image generation model cloned from Salesforce/BLIP3o-NEXT-GRPO-TexT-3B, the first truly open-source image generation model that pairs an autoregressive (AR) backbone (3B Qwen2.5-VL) with a SANA 1.5 diffusion decoder.
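Illuma uses a two-stage generation process: the AR model first generates a sequence of image tokens from the prompt, and the diffusion decoder, conditioned on the AR output, then renders the final image.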
The GRPO (Group Relative Policy Optimization) RL training improves text rendering in generated images (GenEval 0.73 → 0.90).
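To make the "group relative" part of GRPO concrete, here is a minimal sketch of the advantage computation it is named after: several outputs are sampled for the same prompt, each gets a reward, and every reward is normalized against the mean and standard deviation of its own group. This is not the BLIP3o training code. The function name, the reward values, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four images sampled from one prompt,
# e.g. 1.0 when the rendered text matches the prompt and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Samples that score above their group's average get a positive advantage and are reinforced; those below get a negative one. No separate value network is needed, because the group itself serves as the baseline.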
This model includes a custom handler (handler.py) for deployment on HF Inference Endpoints:
Use `fuhaddesmond/illuma` as the model repository.

```python
import base64
from io import BytesIO

import requests
from PIL import Image

API_URL = "https://YOUR_ENDPOINT_ID.aws.endpoints.huggingface.cloud"
headers = {"Authorization": "Bearer hf_YOUR_TOKEN"}

payload = {
    "inputs": "A neon sign that says 'ILLUMA' glowing in purple against a dark wall",
    "parameters": {
        "seq_len": 729,
        "top_p": 0.95,
        "top_k": 1200,
    },
}

response = requests.post(API_URL, headers=headers, json=payload)
response.raise_for_status()  # fail early on HTTP errors

# The handler returns the image as a base64-encoded string under "image".
image_data = base64.b64decode(response.json()["image"])
image = Image.open(BytesIO(image_data))
image.save("illuma_output.png")
```
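When calling the endpoint from a larger application, it can help to separate building the request body from decoding the response. The sketch below uses only the standard library. The helper names `build_payload` and `extract_image_bytes` are hypothetical, the default values are simply copied from the example request rather than taken from documented handler defaults, and the range checks reflect the usual meaning of `top_p` and `top_k`, not rules stated by the handler.

```python
import base64


def build_payload(prompt, seq_len=729, top_p=0.95, top_k=1200):
    """Build the JSON body in the shape used by the example request."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p is expected to lie in (0, 1]")
    if seq_len < 1 or top_k < 1:
        raise ValueError("seq_len and top_k must be positive")
    return {
        "inputs": prompt,
        "parameters": {"seq_len": seq_len, "top_p": top_p, "top_k": top_k},
    }


def extract_image_bytes(body):
    """Return the decoded image bytes from a parsed JSON response."""
    if not isinstance(body, dict) or "image" not in body:
        raise KeyError(f"response has no 'image' field: {body!r}")
    # validate=True rejects characters outside the base64 alphabet.
    return base64.b64decode(body["image"], validate=True)


# Offline demonstration with a stand-in response instead of a real endpoint.
fake_body = {"image": base64.b64encode(b"not a real image").decode("ascii")}
payload = build_payload("A neon sign that says 'ILLUMA'")
image_bytes = extract_image_bytes(fake_body)
print(payload["parameters"], len(image_bytes))
```

With a real endpoint, `payload` would be passed as `json=payload` to `requests.post`, and `extract_image_bytes(response.json())` would replace the inline `base64.b64decode` call.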
```bash
# Clone BLIP3o repo (BLIP3o-NEXT branch)
git clone --branch BLIP3o-NEXT --single-branch https://github.com/JiuhaiChen/BLIP3o.git
cd BLIP3o
pip install -e .

# Download model
python -c "from huggingface_hub import snapshot_download; print(snapshot_download(repo_id='fuhaddesmond/illuma', repo_type='model'))"

# Run inference
python inference.py /path/to/downloaded/model
```
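```python
from huggingface_hub import snapshot_download

# Download the full model repository and print the local path.
local_dir = snapshot_download(
    repo_id="fuhaddesmond/illuma",
    repo_type="model",
)
print(local_dir)
```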
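After downloading, a quick sanity check of the local directory can catch an incomplete snapshot before running inference or deploying. The sketch below is a hypothetical helper, not part of `huggingface_hub`. It assumes `handler.py` sits at the root of the repository, and it runs against a temporary directory here so that it works without network access.

```python
import tempfile
from pathlib import Path


def summarize_snapshot(model_dir, required=("handler.py",)):
    """Count files, total their size, and report missing required files."""
    root = Path(model_dir)
    files = [p for p in root.rglob("*") if p.is_file()]
    missing = [name for name in required if not (root / name).is_file()]
    return {
        "file_count": len(files),
        "total_bytes": sum(p.stat().st_size for p in files),
        "missing": missing,
    }


# Stand-in directory; with a real download, pass the path that
# snapshot_download returned instead.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "handler.py").write_text("# placeholder\n")
    (Path(tmp) / "config.json").write_text("{}")
    summary = summarize_snapshot(tmp)

print(summary)
```

An empty `missing` list means every required file was found. Add other file names to `required` if your deployment depends on them.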
| Detail | Value |
|---|---|
| Base Model | BLIP3o-NEXT-GRPO-TexT-3B |
| Parameters | ~4B (3B AR + diffusion decoder) |
| Architecture | Qwen2.5-VL + SANA 1.5 |
| License | Apache 2.0 |
| GRPO Training | GenEval 0.73 → 0.90 |
| Specialty | Text rendering in images |
```bibtex
@article{chen2025blip3,
  title={BLIP3-o: A Family of Fully Open Unified Multimodal Models},
  author={Chen, Jiuhai and others},
  journal={arXiv preprint arXiv:2505.09568},
  year={2025}
}
```