---
pipeline_tag: image-text-to-text
library_name: transformers
license: apache-2.0
---

# LLaDA-o

We introduce **LLaDA-o**, an effective and length-adaptive omni diffusion model for unified multimodal understanding and generation.

LLaDA-o extends diffusion language modeling to a broader multimodal setting, supporting both visual understanding and visual generation within a single framework. The released codebase provides a practical inference pipeline for interleaved text-image processing and a notebook-based workflow for reproducible experiments.

It was presented in the paper [LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model](https://arxiv.org/abs/2603.01068).

Code: https://github.com/ML-GSAI/LLaDA-o

## Highlights

- Unified multimodal modeling for both understanding and generation
- Support for text-to-image generation
- Support for image understanding
- Support for instruction-based image editing
- Reproducible inference workflow through `multimodal_demo.ipynb`

## Supported Tasks

The current release is designed for the following multimodal inference settings:

- **Text-to-image**: generate images from natural language prompts
- **Image understanding**: produce textual responses conditioned on an input image
- **Image editing**: edit an image according to a textual instruction
- **Interleaved multimodal inference**: process text and image context within a shared diffusion-based framework

## Quick Start

Please first download the model checkpoint locally, then use the official repository for inference:

```bash
git clone https://github.com/ML-GSAI/LLaDA-o
cd LLaDA-o
bash init_env.sh
```
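
One possible way to fetch the checkpoint is the `huggingface_hub` client. The sketch below is an assumption, not the repository's documented procedure: the repo id `GSAI-ML/LLaDA-o` is only inferred from the `GSAI-ML-LLaDA-o` directory name used in the notebook, and `local_dir_for` and `download_checkpoint` are hypothetical helper names. Verify the repo id on the model page before running it.

```python
from pathlib import Path

# Assumed repo id, inferred from the "GSAI-ML-LLaDA-o" directory name; verify it on the Hub.
REPO_ID = "GSAI-ML/LLaDA-o"


def local_dir_for(repo_id: str, root: str) -> Path:
    """Map "org/name" to "<root>/org-name", mirroring the notebook's path style."""
    return Path(root) / repo_id.replace("/", "-")


def download_checkpoint(repo_id: str = REPO_ID, root: str = ".") -> Path:
    """Download every file of the model repo into a local directory and return it."""
    # Imported lazily so local_dir_for works even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = local_dir_for(repo_id, root)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling `download_checkpoint(root="/path/to/local")` would then return a directory that can be used as the local checkpoint path for inference.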

The recommended inference entry point is:

- `multimodal_demo.ipynb`

In the notebook, set:

```python
MODEL_PATH = "/path/to/local/GSAI-ML-LLaDA-o"
```

and run the cells sequentially to perform text-to-image generation, image understanding, and image editing.

## Notes

- The current inference pipeline expects a local checkpoint path.
- The released demo is intended for GPU-based inference.
- For a complete inference workflow and implementation details, please refer to the official GitHub repository.
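
The first two notes can be checked before opening the notebook. The following is a minimal preflight sketch and not part of the official repository: `check_model_path` and `gpu_available` are hypothetical helpers, and the GPU check assumes PyTorch is the backend (it reports `False` when `torch` cannot be imported).

```python
from pathlib import Path


def check_model_path(model_path: str) -> Path:
    """Return the resolved checkpoint directory, or raise if it is missing or empty."""
    path = Path(model_path).expanduser().resolve()
    if not path.is_dir():
        raise FileNotFoundError(f"Checkpoint directory not found: {path}")
    if not any(path.iterdir()):
        raise FileNotFoundError(f"Checkpoint directory is empty: {path}")
    return path


def gpu_available() -> bool:
    """Report whether a CUDA device is visible to PyTorch (assumed backend)."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()
```

Running `check_model_path(MODEL_PATH)` in the first notebook cell would surface a wrong path immediately instead of partway through model loading.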

## Citation

If you find LLaDA-o useful in your research, please consider citing:

```bibtex
@article{you2026lladao,
  title={LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model},
  author={You, Zebin and Zhang, Xiaolu and Zhou, Jun and Li, Chongxuan and Wen, Ji-Rong},
  journal={arXiv preprint arXiv:2603.01068},
  year={2026}
}
```

## Contact

If you have any questions, please feel free to contact us at zebin@ruc.edu.cn.