Model Card for my-coco-diffusion-model
Model Description
my-coco-diffusion-model is a pixel-space text-to-image diffusion model trained using the 🧨 Diffusers library.
It uses:
- UNet2DConditionModel
- DDPM (1000-timestep) noise schedule
- CLIPTokenizer + CLIPTextModel (openai/clip-vit-large-patch14)
- 256×256 RGB images
- Cross-attention text conditioning
- Trained from scratch on a subset of COCO captions
- Trained on NVIDIA A100 80GB
This is an experimental research model created by Guus (AI Nexus Studios).
Developed By
- Guus @ AI Nexus Studios
Model Type
- Denoising Diffusion Probabilistic Model (DDPM)
- Text-conditioned
- Pixel-space UNet (in_channels=3, out_channels=3)
Languages
- English captions
License
MIT (suggested)
Finetuned From
- Not fine-tuned; trained from scratch
Model Sources
- Repository: https://huggingface.co/guus4324343/my-coco-diffusion-model
- DDPM Paper: https://arxiv.org/abs/2006.11239
- CLIP Paper: https://arxiv.org/abs/2103.00020
- COCO Paper: https://arxiv.org/abs/1405.0312
🔧 Uses
Direct Use
- Research on pixel-space diffusion
- Studying diffusion learned from scratch
- Basic low-resolution text-to-image synthesis
- Educational ML experiments
Downstream Use
- Further training
- Fine-tuning
- Architecture research
- Integrating alternative text encoders
- Switching to VAE latent-space models
Out-of-Scope Use
⚠️ NOT suitable for:
- High-quality realistic image generation
- Safety-critical applications
- Harmful, sensitive, or NSFW generation
- Deepfakes / misinformation
- Commercial deployment
⚠️ Bias, Risks, and Limitations
- Trained on COCO, so it inherits that dataset's biases
- Pixel-space diffusion produces low-detail images
- Model is unstable at low epoch counts (1–3 epochs)
- Poor at fine details (faces, text, hands)
- May output distorted or abstract content
- Not trained with safety filtering
- Not appropriate for real-world decision making
🚀 How to Get Started with the Model
Below is a minimal working example to generate an image from a prompt.
```python
import torch
from diffusers import UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTokenizer, CLIPTextModel
import torchvision.utils as vutils

device = "cuda"

# Load tokenizer + text encoder
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
text_encoder.eval()

# Load UNet
model = UNet2DConditionModel.from_pretrained(
    "guus4324343/my-coco-diffusion-model"
).to(device)
model.eval()

scheduler = DDPMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(250)

prompt = "a cute cat sitting in a box"
tokens = tokenizer(
    prompt, return_tensors="pt", padding="max_length", truncation=True
).input_ids.to(device)

with torch.no_grad():
    text_emb = text_encoder(tokens)[0]  # (1, 77, 768) last hidden state

    # Pixel-space model: the sample is the image itself, not a VAE latent
    sample = torch.randn(1, 3, 256, 256, device=device)
    for t in scheduler.timesteps:
        with torch.autocast("cuda"):
            eps = model(sample, t, encoder_hidden_states=text_emb).sample
        sample = scheduler.step(eps, t, sample).prev_sample

# Rescale from [-1, 1] to [0, 1] and save
img = (sample.clamp(-1, 1) + 1) / 2
vutils.save_image(img, "sample.png")
```
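Note: `set_timesteps(250)` subsamples the 1000-step training schedule for faster sampling; raising the step count toward 1000 trades speed for somewhat better sample quality.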
🏋️ Training Details
Training Data
- Dataset: Multimodal-Fatima/COCO_captions_train
- Used: 50,000 image-caption pairs (subset)
- Resolution: 256×256
- Caption field: sentences_raw[0]
- Preprocessing:
  - Resize(256)
  - CenterCrop(256)
  - ToTensor()
  - Normalize([0.5], [0.5])
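For reference, below is a minimal sketch of this preprocessing pipeline using datasets and torchvision. The "image" column name and the split are assumptions, not details taken from the actual training script.

```python
from datasets import load_dataset
from torchvision import transforms

# Preprocessing exactly as listed above; Normalize maps [0, 1] -> [-1, 1]
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])

# Assumed layout: an "image" column with PIL images and a
# "sentences_raw" column with the list of captions per image
dataset = load_dataset("Multimodal-Fatima/COCO_captions_train", split="train")

def to_example(record):
    return {
        "pixel_values": preprocess(record["image"].convert("RGB")),
        "caption": record["sentences_raw"][0],  # first caption, as above
    }
```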
Training Procedure
Model Architecture
- UNet2DConditionModel
- in_channels: 3
- out_channels: 3
- block_out_channels: (128, 256, 512, 512)
- layers_per_block: 2
- cross_attention_dim: 768 (CLIP ViT-L/14 text hidden size)
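For clarity, here is a sketch of how this configuration maps onto the UNet2DConditionModel constructor; sample_size and the down/up block types (left at the diffusers defaults) are assumptions based on the 256×256 pixel-space setup described above:

```python
from diffusers import UNet2DConditionModel

# Hypothetical reconstruction of the config listed above; block types
# are the diffusers defaults and may differ from the actual training run.
unet = UNet2DConditionModel(
    sample_size=256,                          # 256x256 pixel-space input
    in_channels=3,                            # RGB in
    out_channels=3,                           # predicted noise, RGB out
    layers_per_block=2,
    block_out_channels=(128, 256, 512, 512),
    cross_attention_dim=768,                  # CLIP ViT-L/14 text hidden size
)
```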
Hyperparameters
- Optimizer: AdamW
- LR: 1e-4
- Batch size: 8
- Scheduler: DDPM, 1000 timesteps
- Epochs: 3
- Mixed precision: FP16 (torch.amp)
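Putting these together, here is a minimal sketch of one FP16 training step consistent with the hyperparameters above. The actual training script is not published; unet, text_emb, and the batch layout are assumed from the earlier sections.

```python
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-4)
scaler = torch.amp.GradScaler("cuda")  # loss scaling for FP16

def train_step(pixel_values, text_emb):
    # pixel_values: (B, 3, 256, 256) in [-1, 1]; text_emb: (B, 77, 768)
    noise = torch.randn_like(pixel_values)
    timesteps = torch.randint(
        0, scheduler.config.num_train_timesteps,
        (pixel_values.shape[0],), device=pixel_values.device,
    )
    noisy = scheduler.add_noise(pixel_values, noise, timesteps)

    with torch.autocast("cuda", dtype=torch.float16):
        pred = unet(noisy, timesteps, encoder_hidden_states=text_emb).sample
        loss = F.mse_loss(pred.float(), noise.float())  # MSE on predicted noise

    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```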
Hardware
- NVIDIA A100 80GB
- CUDA acceleration enabled
- High-speed NVMe storage
Software
- Python 3.10
- PyTorch (CUDA 12.1)
- Diffusers
- Transformers
- Accelerate
📊 Evaluation
Testing Data
No formal testing; qualitative inspection only.
Factors
- Epoch count
- Prompt complexity
- Sampling steps
Metrics
None; experimental model
Results
- Epoch 1: abstract textures
- Epoch 3: early emergence of shapes
- More epochs required for meaningful images
🌱 Environmental Impact
Estimated using the Machine Learning Impact Calculator:
- Hardware: NVIDIA A100 80GB
- Train time: 3–6 hours
- Power draw: ~250–300 W
- Estimated CO₂ emissions: 0.2–0.6 kg CO₂ (depending on the region's energy mix)
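As a rough sanity check: 250–300 W over 3–6 hours is about 0.75–1.8 kWh; multiplied by a typical grid intensity of roughly 0.2–0.4 kg CO₂/kWh, this lands in the 0.2–0.6 kg range above.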
📐 Technical Specifications
Model Architecture and Objective
- DDPM UNet
- Pixel-space denoising
- Cross-attention text conditioning
- No VAE or latent compression
- MSE loss on predicted noise
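In standard DDPM notation, the objective above is the simplified noise-prediction loss, here with the CLIP text embedding $c$ as the conditioning signal:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[\lVert \epsilon - \epsilon_\theta(x_t, t, c) \rVert_2^2\right], \qquad x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$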
Compute Infrastructure
- Cloud A100 instance
- High-bandwidth memory
- Python + CUDA toolchain
Citation
BibTeX
```bibtex
@misc{guus2025cocodiffusion,
  title={My COCO Diffusion Model},
  author={Guus},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/guus4324343/my-coco-diffusion-model}}
}
```
Model Card Contact
For questions or issues: Guus / AI Nexus Studios