---
license: apache-2.0
base_model:
- stabilityai/stable-diffusion-3-medium
tags:
- Custom-Pipeline
---

# 🌀 DreamCoil-Diffusion-Mini

*Developed as part of the research at **EngineerG Lab**.* 🔬

**DreamCoil-Diffusion-Mini** (`muverqqw/DreamCoil-Diffusion-Mini`) is a lightweight modification of Stable Diffusion 3 Medium. We removed the heavy T5-XXL text encoder and replaced it with the compact **`Qwen3-Embedding-0.6B`**. This reduces VRAM usage, RAM requirements, and model loading time while maintaining a strong level of prompt understanding.

The swap is made possible by a custom-trained neural network, the **DreamCoil Projector**: an MLP that maps Qwen's 1024-dimensional hidden states into the 4096-dimensional text-conditioning space that the SD3 transformer expects from T5-XXL.

Additionally, this pipeline includes a built-in **Safe VAE Decode** patch to prevent the "black square" (NaN) generation errors common in SD3.

### 🌟 Key Features

* **No T5 Required:** Fast loading and low VRAM footprint.
* **Powered by Qwen:** Uses `Qwen3-Embedding-0.6B` as the primary semantic engine.
* **Custom Projector:** Specifically trained to bridge the Qwen language model and the SD3 transformer.
* **NaN-Safe VAE:** The custom pipeline automatically handles VAE NaN outputs, ensuring stable generation.

### ⚠️ IMPORTANT: Always Use Negative Prompts!

Because the `0.6B` language model is significantly smaller than the original `4.7B` T5 encoder, it may occasionally miss fine details or hallucinate. **Using a negative prompt is highly recommended** to guide the model and achieve the best visual results.

## 🚀 Quick Start (Usage)

Because we changed the architecture (replacing T5 with Qwen), the standard `diffusers` loading mechanism might throw key mismatch errors. To solve this, we provide a custom loading script.
This script automatically downloads our custom pipeline logic and uses a helper function (`load_dreamcoil_model`) to correctly initialize the Qwen text encoder and the MLP projector.

---

### Run this script:

```python
import importlib
import os
import shutil
import sys

from huggingface_hub import hf_hub_download

# --- 1. Settings ---
REPO_ID = "muverqqw/DreamCoil-Diffusion-Mini"
FILENAME = "pipeline.py"
LOCAL_FILENAME = "dreamcoil_pipeline.py"
MODULE_NAME = "dreamcoil_pipeline"

# --- 2. Download Custom Architecture ---
print(f"📦 Downloading DreamCoil architecture from {REPO_ID}...")
cached_file = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)

# Copy and rename to avoid conflicts with system modules
shutil.copy(cached_file, LOCAL_FILENAME)
if os.getcwd() not in sys.path:
    sys.path.append(os.getcwd())

# Import the custom loader. If a stale copy is already loaded
# (e.g. when re-running a notebook cell), reload it instead.
try:
    if MODULE_NAME in sys.modules:
        dreamcoil_pipeline = importlib.reload(sys.modules[MODULE_NAME])
    else:
        dreamcoil_pipeline = importlib.import_module(MODULE_NAME)
    load_dreamcoil_model = dreamcoil_pipeline.load_dreamcoil_model
    print("✅ Architecture imported successfully.")
except ImportError as e:
    print(f"❌ Import error: {e}")
    raise  # nothing below can run without the loader

# --- 3. Load the Model ---
print("🚀 Loading weights (this might take a minute)...")
pipe = load_dreamcoil_model(model_id=REPO_ID, device="cuda")

# --- 4. Generation ---
prompt = (
    "A high-quality, realistic photography shot of a young woman with long blonde hair, seen from behind. "
    "She is wearing a light, semi-transparent white summer dress. She stands on a sandy beach, "
    "looking at the beautiful turquoise ocean waves with white sea foam. Bright sunny day, "
    "natural lighting, cinematic composition, 8k resolution, highly detailed skin and fabric textures."
)

# A strong negative prompt is highly recommended for this mini-encoder!
negative_prompt = (
    "deformed, distorted, disfigured, poorly drawn, bad anatomy, wrong anatomy, "
    "extra limb, missing limb, floating limbs, mutated, ugly, blurry, text, watermark"
)

print("🎨 Generating image...")
image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]

# --- 5. Save/Display ---
image.save("dreamcoil_output.png")
print("✅ Image saved as dreamcoil_output.png")
```

---

## 🛠 Training Details

**DreamCoil-Diffusion-Mini** was built in two stages:

* **Projector Alignment:** First, we trained the custom `DreamCoilProjector` (MLP) to map the 1024-dimensional hidden states of `Qwen3-Embedding-0.6B` into the 4096-dimensional text-conditioning space expected by the SD3 Medium Transformer. During this stage, the base model weights were frozen.
* **LoRA Fine-Tuning:** Once the text encoder was aligned, we performed LoRA fine-tuning on the model to adapt its visual generation to the new semantic understanding of the Qwen encoder.

*All training artifacts and LoRA weights are included in this repository.*

---

## ⚠️ Limitations

* **Complex Prompts:** Because a `0.6B` text encoder replaces the original `4.7B` T5, the model may struggle with highly complex, multi-subject prompts or precise text rendering compared to the base SD3.
* **Prompt Dependency:** The model relies heavily on negative prompts to steer away from artifacts.

---

## ☕ Support the Project

This model was developed as part of the independent research at **EngineerG Lab**. Training custom projectors and fine-tuning require significant GPU resources.

If you find this model useful and want to support our future developments, consider buying us a coffee! Every donation helps rent GPUs for the next breakthrough. ❤️
Donate with Donatello

---

## 📊 Performance Benchmark (NVIDIA T4 16GB)

We ran a head-to-head comparison between **DreamCoil-Mini** and the **Original SD3-Medium** on a standard NVIDIA T4 GPU (16GB VRAM) in a Kaggle environment.

| Metric | DreamCoil-Mini 🌀 | Original SD3-Medium | Improvement |
| :--- | :--- | :--- | :--- |
| **Generation Time** | **35.11 s** | 118.92 s | **~3.4x Faster** |
| **Peak VRAM** | **11.53 GB** | 13.66 GB* | **-2.13 GB** |
| **Load Time** | **38.05 s** | 68.84 s | **~1.8x Faster** |
| **Prompt Alignment (CLIP Score)** | 27.37 | **28.81** | ~5% lower |

*\*Original SD3 requires CPU offloading to run on a 16GB T4, which significantly slows down generation.*

### 📈 Analysis:

* **Speed King:** DreamCoil-Mini generates images about **3.4x as fast** as the original model on this hardware (35.11 s vs 118.92 s), because it fits in VRAM without CPU offloading and so avoids slow CPU-to-GPU data transfers.
* **Efficient Semantics:** After replacing the 4.7B T5-XXL with a 0.6B Qwen encoder, DreamCoil-Mini retains about **95% of the original CLIP score** (27.37 vs 28.81) while substantially reducing the model's footprint.
* **Accessibility:** This model makes SD3-level generation viable for users with older or mid-range GPUs (12GB - 16GB VRAM) without the painful slowness of offloading.
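
The figures in the Improvement column can be re-derived from the raw measurements in the table. The snippet below is a minimal sketch of that arithmetic; the dictionary keys and the `speedup` helper are illustrative names chosen here, not part of the DreamCoil code, and only the numbers come from the benchmark table.

```python
# Raw measurements copied from the benchmark table (NVIDIA T4 16GB).
# Key names are illustrative, not taken from the DreamCoil repository.
dreamcoil = {"gen_s": 35.11, "vram_gb": 11.53, "load_s": 38.05, "clip": 27.37}
sd3_medium = {"gen_s": 118.92, "vram_gb": 13.66, "load_s": 68.84, "clip": 28.81}


def speedup(baseline_s: float, candidate_s: float) -> float:
    """Return how many times as fast the candidate is (baseline / candidate)."""
    return baseline_s / candidate_s


gen_speedup = speedup(sd3_medium["gen_s"], dreamcoil["gen_s"])
load_speedup = speedup(sd3_medium["load_s"], dreamcoil["load_s"])
vram_saved_gb = sd3_medium["vram_gb"] - dreamcoil["vram_gb"]
clip_retained = dreamcoil["clip"] / sd3_medium["clip"]

# Fraction of generation time saved, which is a different quantity
# from the speedup factor and should not be confused with it.
time_reduction = 1.0 - dreamcoil["gen_s"] / sd3_medium["gen_s"]

print(f"Generation speedup: {gen_speedup:.2f}x")
print(f"Load speedup:       {load_speedup:.2f}x")
print(f"Peak VRAM saved:    {vram_saved_gb:.2f} GB")
print(f"CLIP score kept:    {clip_retained:.1%}")
print(f"Generation time cut by {time_reduction:.1%}")
```

Note that a speedup factor of roughly 3.4x corresponds to cutting generation time by roughly 70%, so the two ways of stating the result should not be mixed.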