Model Card for my-coco-diffusion-model
Model Description
my-coco-diffusion-model is a pixel-space text-to-image diffusion model trained using the 🧨 Diffusers library.
It uses:
- UNet2DConditionModel
- DDPM (1000-timestep) noise schedule
- CLIPTokenizer + CLIPTextModel (openai/clip-vit-large-patch14)
- 256×256 RGB images
- Cross-attention text conditioning
- Trained from scratch on a subset of COCO captions
- Trained on NVIDIA A100 80GB
This is an experimental research model created by Guus (AI Nexus Studios).
Developed By
- Guus @ AI Nexus Studios
Model Type
- Denoising Diffusion Probabilistic Model (DDPM)
- Text-conditioned
- Pixel-space UNet (in_channels=3, out_channels=3)
Languages
- English captions
License
MIT (suggested)
Finetuned From
- Not fine-tuned; trained from scratch
Model Sources
- Repository: https://huggingface.co/guus4324343/my-coco-diffusion-model
- DDPM Paper: https://arxiv.org/abs/2006.11239
- CLIP Paper: https://arxiv.org/abs/2103.00020
- COCO Paper: https://arxiv.org/abs/1405.0312
🔧 Uses
Direct Use
- Research on pixel-space diffusion
- Studying diffusion learned from scratch
- Basic low-resolution text-to-image synthesis
- Educational ML experiments
Downstream Use
- Further training
- Fine-tuning
- Architecture research
- Integrating alternative text encoders
- Switching to VAE latent-space models
Out-of-Scope Use
⚠️ NOT suitable for:
- High-quality realistic image generation
- Safety-critical applications
- Harmful, sensitive, or NSFW generation
- Deepfakes / misinformation
- Commercial deployment
⚠️ Bias, Risks, and Limitations
- Trained on COCO, so it inherits that dataset's biases
- Pixel-space diffusion produces low-detail images
- Model is unstable at low epoch counts (1–3 epochs)
- Poor at fine details (faces, text, hands)
- May output distorted or abstract content
- Not trained with safety filtering
- Not appropriate for real-world decision making
🚀 How to Get Started with the Model
Below is a minimal working example to generate an image from a prompt.
```python
import torch
from diffusers import UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTokenizer, CLIPTextModel
import torchvision.utils as vutils

device = "cuda"

# Load tokenizer + text encoder
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
text_encoder.eval()

# Load UNet
model = UNet2DConditionModel.from_pretrained(
    "guus4324343/my-coco-diffusion-model"
).to(device)
model.eval()

scheduler = DDPMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(250)

prompt = "a cute cat sitting in a box"
tokens = tokenizer(
    prompt, return_tensors="pt", padding="max_length", truncation=True
).input_ids.to(device)

with torch.no_grad():
    text_emb = text_encoder(tokens)[0]  # (1, 77, 768) last hidden state

    # Pixel-space model: the sample is the image itself, not a VAE latent
    sample = torch.randn(1, 3, 256, 256, device=device)
    for t in scheduler.timesteps:
        with torch.autocast("cuda"):
            eps = model(sample, t, encoder_hidden_states=text_emb).sample
        sample = scheduler.step(eps, t, sample).prev_sample

# Rescale from [-1, 1] to [0, 1] and save
img = (sample.clamp(-1, 1) + 1) / 2
vutils.save_image(img, "sample.png")
```
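Note: `set_timesteps(250)` subsamples the 1000-step training schedule for faster sampling; raising the step count toward 1000 trades speed for somewhat better sample quality.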
🏋️ Training Details
Training Data
- Dataset: Multimodal-Fatima/COCO_captions_train
- Used: 50,000 image-caption pairs (subset)
- Resolution: 256×256
- Caption field: sentences_raw[0]
- Preprocessing:
  - Resize(256)
  - CenterCrop(256)
  - ToTensor()
  - Normalize([0.5], [0.5])
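For reference, below is a minimal sketch of this preprocessing pipeline using datasets and torchvision. The "image" column name and the split are assumptions, not details taken from the actual training script.

```python
from datasets import load_dataset
from torchvision import transforms

# Preprocessing exactly as listed above; Normalize maps [0, 1] -> [-1, 1]
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])

# Assumed layout: an "image" column with PIL images and a
# "sentences_raw" column with the list of captions per image
dataset = load_dataset("Multimodal-Fatima/COCO_captions_train", split="train")

def to_example(record):
    return {
        "pixel_values": preprocess(record["image"].convert("RGB")),
        "caption": record["sentences_raw"][0],  # first caption, as above
    }
```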
Training Procedure
Model Architecture
- UNet2DConditionModel
- in_channels: 3
- out_channels: 3
- block_out_channels: (128, 256, 512, 512)
- layers_per_block: 2
- cross_attention_dim: 768 (CLIP ViT-L/14 text hidden size)
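For clarity, here is a sketch of how this configuration maps onto the UNet2DConditionModel constructor; sample_size and the down/up block types (left at the diffusers defaults) are assumptions based on the 256×256 pixel-space setup described above:

```python
from diffusers import UNet2DConditionModel

# Hypothetical reconstruction of the config listed above; block types
# are the diffusers defaults and may differ from the actual training run.
unet = UNet2DConditionModel(
    sample_size=256,                          # 256x256 pixel-space input
    in_channels=3,                            # RGB in
    out_channels=3,                           # predicted noise, RGB out
    layers_per_block=2,
    block_out_channels=(128, 256, 512, 512),
    cross_attention_dim=768,                  # CLIP ViT-L/14 text hidden size
)
```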
Hyperparameters
- Optimizer: AdamW
- LR: 1e-4
- Batch size: 8
- Scheduler: DDPM, 1000 timesteps
- Epochs: 3
- Mixed precision: FP16 (torch.amp)
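Putting these together, here is a minimal sketch of one FP16 training step consistent with the hyperparameters above. The actual training script is not published; unet, text_emb, and the batch layout are assumed from the earlier sections.

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-4)
scaler = torch.amp.GradScaler("cuda")  # loss scaling for FP16

def train_step(pixel_values, text_emb):
    # pixel_values: (B, 3, 256, 256) in [-1, 1]; text_emb: (B, 77, 768)
    noise = torch.randn_like(pixel_values)
    timesteps = torch.randint(
        0, scheduler.config.num_train_timesteps,
        (pixel_values.shape[0],), device=pixel_values.device,
    )
    noisy = scheduler.add_noise(pixel_values, noise, timesteps)

    with torch.autocast("cuda", dtype=torch.float16):
        pred = unet(noisy, timesteps, encoder_hidden_states=text_emb).sample
        loss = F.mse_loss(pred.float(), noise.float())  # MSE on predicted noise

    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```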
Hardware
- NVIDIA A100 80GB
- CUDA acceleration enabled
- High-speed NVMe storage
Software
- Python 3.10
- PyTorch (CUDA 12.1)
- Diffusers
- Transformers
- Accelerate
📊 Evaluation
Testing Data
No formal testing; qualitative inspection only.
Factors
- Epoch count
- Prompt complexity
- Sampling steps
Metrics
None; experimental model
Results
- Epoch 1: abstract textures
- Epoch 3: early emergence of shapes
- More epochs required for meaningful images
🌱 Environmental Impact
Estimated using the Machine Learning Impact Calculator:
- Hardware: NVIDIA A100 80GB
- Train time: 3–6 hours
- Power draw: ~250–300 W
- Estimated CO₂ emissions: 0.2–0.6 kg CO₂ (depending on the region's energy mix)
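As a rough sanity check: 250–300 W over 3–6 hours is about 0.75–1.8 kWh; multiplied by a typical grid intensity of roughly 0.2–0.4 kg CO₂/kWh, this lands in the 0.2–0.6 kg range above.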
📐 Technical Specifications
Model Architecture and Objective
- DDPM UNet
- Pixel-space denoising
- Cross-attention text conditioning
- No VAE or latent compression
- MSE loss on predicted noise
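In standard DDPM notation, the objective above is the simplified noise-prediction loss, here with the CLIP text embedding $c$ as the conditioning signal:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[\lVert \epsilon - \epsilon_\theta(x_t, t, c) \rVert_2^2\right], \qquad x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$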
Compute Infrastructure
- Cloud A100 instance
- High-bandwidth memory
- Python + CUDA toolchain
Citation
BibTeX
```bibtex
@misc{guus2025cocodiffusion,
  title={My COCO Diffusion Model},
  author={Guus},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/guus4324343/my-coco-diffusion-model}}
}
```
Model Card Contact
For questions or issues: Guus / AI Nexus Studios