---
license: cc-by-nc-4.0
pipeline_tag: any-to-any
library_name: diffusers
---

# OneDiffusion

This repository contains the OneDiffusion model presented in the paper [One Diffusion to Generate Them All](https://arxiv.org/abs/2411.16318).

[Project Page](https://lehduong.github.io/OneDiffusion-homepage/) | [GitHub Repository](https://github.com/lehduong/OneDiffusion)

- VAE model: SD3 VAE
- Text encoder: T5-XL

## Installation

```bash
conda create -n onediffusion_env python=3.8 && \
  conda activate onediffusion_env && \
  pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu118 && \
  pip install "git+https://github.com/facebookresearch/pytorch3d.git" && \
  pip install -r requirements.txt
```

## Quick start

See `inference.py` for more details. For text-to-image, you can use the code snippet below.

```python
import torch
from onediffusion.diffusion.pipelines.onediffusion import OneDiffusionPipeline

device = torch.device('cuda:0')
pipeline = OneDiffusionPipeline.from_pretrained("lehduong/OneDiffusion").to(device=device, dtype=torch.bfloat16)

NEGATIVE_PROMPT = "monochrome, greyscale, low-res, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, artist name, poorly drawn, bad anatomy, wrong anatomy, extra limb, missing limb, floating limbs, disconnected limbs, mutation, mutated, ugly, disgusting, blurry, amputation"

output = pipeline(
    prompt="[[text2image]] A bipedal black cat wearing a huge oversized witch hat, a wizard's robe, casting a spell in an enchanted forest. The scene is filled with fireflies and moss on surrounding rocks and trees",
    negative_prompt=NEGATIVE_PROMPT,
    num_inference_steps=50,
    guidance_scale=4,
    height=1024,
    width=1024,
)
output.images[0].save('text2image_output.jpg')
```

You can run the Gradio demo with:

```bash
python gradio_demo.py --captioner molmo # [molmo, llava, disable]
```

The demo provides guidance and helps format the prompt properly for each task.

- By default, it loads the **quantized** Molmo for captioning source images. You generally need a GPU with at least 21 GB of memory to run the demo.
- Opting to use LLaVA instead requires approximately 27 GB, and the resulting captions may be less accurate in some cases.
- You can also provide the caption for each input image manually by running in `disable` mode. In this mode, the returned caption is an empty string, but you should still press the `Generate Caption` button so that the code formats the input text properly. The memory requirement for this mode is approximately 12 GB.

Note that the required memory may increase if you use a higher resolution or more input images.

## Qualitative Results

### 1. Text-to-Image
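Each task is selected through a double-bracket token prefixed to the prompt, as in the `[[text2image]]` example above; the Gradio demo's `Generate Caption` button performs this formatting for you. A minimal sketch of that formatting step is below — the helper name `format_task_prompt` is illustrative and not part of the library, and only the `text2image` token is confirmed by this card:

```python
def format_task_prompt(task: str, caption: str) -> str:
    """Prefix a caption with the double-bracket task token the pipeline expects.

    Hypothetical helper: mirrors the "[[text2image]] <caption>" format shown in
    the quick-start snippet; it is not an API exported by OneDiffusion.
    """
    return f"[[{task}]] {caption.strip()}"


prompt = format_task_prompt("text2image", "A bipedal black cat wearing a huge oversized witch hat")
print(prompt)  # [[text2image]] A bipedal black cat wearing a huge oversized witch hat
```

The resulting string can be passed directly as the `prompt` argument of the pipeline call shown earlier.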