# ZYI-0.2: Lightweight Text-to-Image Diffusion Model
ZYI-0.2 is a lightweight text-to-image diffusion model (~67M parameters) designed to generate small images from natural language prompts.
The model is optimized for educational use, experimentation, and running on low-VRAM GPUs.
It uses:
- PyTorch
- CLIP text encoder (see the sketch below)
- UNet diffusion backbone
- DDIM fast sampling
Image resolution: 128 × 128
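
As a rough sketch of the text-conditioning path, the prompt is first embedded with the CLIP ViT-B/32 text encoder. The snippet below shows how this can be done with the stock encoder from `transformers`; the repo's own inference code may load CLIP differently, and the cross-attention conditioning mentioned in the comment is an assumption based on typical text-conditioned UNets:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Stock ViT-B/32 CLIP text encoder (the repo may bundle or load its own copy)
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

tokens = tokenizer(
    "a group of people at the beach",
    padding="max_length", truncation=True, return_tensors="pt",
)
with torch.no_grad():
    # Shape (1, 77, 512): per-token embeddings that condition the UNet,
    # presumably via cross-attention
    text_emb = text_encoder(**tokens).last_hidden_state
```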
## Example Generations

![](https://huggingface.co/caikybaldo999/ZYI-0.2/resolve/main/exemplo1.png)
## Model Details

- Model name: ZYI-0.2
- Parameters: ~67M
- Architecture: Text-conditioned UNet diffusion
- Text encoder: CLIP (ViT-B/32)
- Training dataset: COCO-style captions dataset
- Framework: PyTorch
## How to Use
Download the repository:
```python
from huggingface_hub import snapshot_download
import sys

# Download (and cache) the repo files, then make its Python modules importable
path = snapshot_download("caikybaldo999/ZYI-0.2")
sys.path.append(path)
```
Then generate an image:
```python
from inference import generate       # provided by the downloaded repo
from IPython.display import display

img = generate("a group of people at the beach")
display(img)
```
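
Judging from the `display` call, `generate` returns a PIL image, so outside a notebook you can save the result instead (an assumption; check `inference.py` for the actual return type):

```python
img = generate("a group of people at the beach")
img.save("beach.png")  # assumes generate() returns a PIL.Image
```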
## Installation

Install the dependencies (`huggingface_hub` is also needed for the download step above):

```bash
pip install torch transformers numpy Pillow tqdm huggingface_hub
```
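
Alternatively, the repository ships a `requirements.txt` (see the structure below), which should cover the same dependencies when run from the downloaded snapshot directory:

```bash
pip install -r requirements.txt
```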
## Inference Speed

The model uses DDIM sampling, which needs far fewer denoising steps than plain DDPM:

| Sampler | Steps | Relative speed |
|---|---|---|
| DDPM | 1000 | slow |
| DDIM | 30 | fast |

Typical generation time: ~1–3 seconds on a GPU.
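
For intuition, here is a minimal sketch of a deterministic DDIM loop (η = 0). This is not the repo's `ddim_sampler.py`; the linear beta schedule and the `unet(x, t, text_emb)` signature are assumptions for illustration:

```python
import torch

@torch.no_grad()
def ddim_sample(unet, text_emb, steps=30, size=128, device="cuda"):
    # Assumed linear beta schedule over 1000 training timesteps
    betas = torch.linspace(1e-4, 0.02, 1000, device=device)
    alphas_cum = torch.cumprod(1.0 - betas, dim=0)

    # DDIM visits an evenly spaced subset of the training timesteps
    timesteps = torch.linspace(999, 0, steps, device=device).long()

    x = torch.randn(1, 3, size, size, device=device)  # start from pure noise
    for i, t in enumerate(timesteps):
        a_t = alphas_cum[t]
        a_prev = alphas_cum[timesteps[i + 1]] if i + 1 < steps else torch.tensor(1.0, device=device)

        eps = unet(x, t, text_emb)                          # predicted noise
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # implied clean image
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # deterministic DDIM update
    return x
```

Cutting 1000 DDPM steps down to 30 DDIM steps removes ~97% of the UNet evaluations, which is what makes the 1–3 second figure plausible on modest GPUs.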
## Repository Structure

```
ZYI-0.2/
│
├── model.pt
├── model.py
├── ddim_sampler.py
├── inference.py
├── requirements.txt
└── README.md
```
## Limitations

- Low resolution (128 × 128)
- Not trained on large-scale datasets
- May produce artifacts
This model is mainly intended for learning and experimentation.
## License
This project is released for research and educational use.
## Author

Created by Caiky Baldo ([Hugging Face](https://huggingface.co/caikybaldo999)).

.png)
.png)
.png)
.png)