# ZYI-0.2: Lightweight Text-to-Image Diffusion Model
ZYI-0.2 is a lightweight text-to-image diffusion model (~67M parameters) designed to generate small images from natural language prompts.
The model is optimized for educational use, experimentation, and running on low-VRAM GPUs.
It uses:
- PyTorch
- CLIP text encoder (see the sketch below)
- UNet diffusion backbone
- DDIM fast sampling
Image resolution: 128 × 128
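
As a rough sketch of the text-conditioning path, the prompt is first embedded with the CLIP ViT-B/32 text encoder. The snippet below shows how this can be done with the stock encoder from `transformers`; the repo's own inference code may load CLIP differently, and the cross-attention conditioning mentioned in the comment is an assumption based on typical text-conditioned UNets:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Stock ViT-B/32 CLIP text encoder (the repo may bundle or load its own copy)
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

tokens = tokenizer(
    "a group of people at the beach",
    padding="max_length", truncation=True, return_tensors="pt",
)
with torch.no_grad():
    # Shape (1, 77, 512): per-token embeddings that condition the UNet,
    # presumably via cross-attention
    text_emb = text_encoder(**tokens).last_hidden_state
```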
## Example Generations

![](https://huggingface.co/caikybaldo999/ZYI-0.2/resolve/main/exemplo1.png)
## Model Details

- Model name: ZYI-0.2
- Parameters: ~67M
- Architecture: Text-conditioned UNet diffusion
- Text encoder: CLIP (ViT-B/32)
- Training dataset: COCO-style captions dataset
- Framework: PyTorch
## How to Use
Download the repository:
```python
from huggingface_hub import snapshot_download
import sys

# Download (and cache) the repo files, then make its Python modules importable
path = snapshot_download("caikybaldo999/ZYI-0.2")
sys.path.append(path)
```
Then generate an image:
```python
from inference import generate       # provided by the downloaded repo
from IPython.display import display

img = generate("a group of people at the beach")
display(img)
```
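
Judging from the `display` call, `generate` returns a PIL image, so outside a notebook you can save the result instead (an assumption; check `inference.py` for the actual return type):

```python
img = generate("a group of people at the beach")
img.save("beach.png")  # assumes generate() returns a PIL.Image
```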
## Installation

Install the dependencies (`huggingface_hub` is also needed for the download step above):

```bash
pip install torch transformers numpy Pillow tqdm huggingface_hub
```
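
Alternatively, the repository ships a `requirements.txt` (see the structure below), which should cover the same dependencies when run from the downloaded snapshot directory:

```bash
pip install -r requirements.txt
```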
## Inference Speed

The model uses DDIM sampling, which needs far fewer denoising steps than plain DDPM:

| Sampler | Steps | Relative speed |
|---|---|---|
| DDPM | 1000 | slow |
| DDIM | 30 | fast |

Typical generation time: ~1–3 seconds on a GPU.
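
For intuition, here is a minimal sketch of a deterministic DDIM loop (η = 0). This is not the repo's `ddim_sampler.py`; the linear beta schedule and the `unet(x, t, text_emb)` signature are assumptions for illustration:

```python
import torch

@torch.no_grad()
def ddim_sample(unet, text_emb, steps=30, size=128, device="cuda"):
    # Assumed linear beta schedule over 1000 training timesteps
    betas = torch.linspace(1e-4, 0.02, 1000, device=device)
    alphas_cum = torch.cumprod(1.0 - betas, dim=0)

    # DDIM visits an evenly spaced subset of the training timesteps
    timesteps = torch.linspace(999, 0, steps, device=device).long()

    x = torch.randn(1, 3, size, size, device=device)  # start from pure noise
    for i, t in enumerate(timesteps):
        a_t = alphas_cum[t]
        a_prev = alphas_cum[timesteps[i + 1]] if i + 1 < steps else torch.tensor(1.0, device=device)

        eps = unet(x, t, text_emb)                          # predicted noise
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # implied clean image
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # deterministic DDIM update
    return x
```

Cutting 1000 DDPM steps down to 30 DDIM steps removes ~97% of the UNet evaluations, which is what makes the 1–3 second figure plausible on modest GPUs.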
## Repository Structure

```
ZYI-0.2/
│
├── model.pt
├── model.py
├── ddim_sampler.py
├── inference.py
├── requirements.txt
└── README.md
```
## Limitations

- Low resolution (128 × 128)
- Not trained on large-scale datasets
- May produce artifacts
This model is mainly intended for learning and experimentation.
## License
This project is released for research and educational use.
## Author

Created by Caiky Baldo ([Hugging Face](https://huggingface.co/caikybaldo999)).

.png)
.png)
.png)
.png)