Model Card for my-coco-diffusion-model

Model Description

my-coco-diffusion-model is a pixel-space text-to-image diffusion model trained using the 🧨 Diffusers library.

It uses:

  • UNet2DConditionModel
  • DDPM (1000-timestep) noise schedule
  • CLIPTokenizer + CLIPTextModel (openai/clip-vit-large-patch14)
  • 256×256 RGB images
  • Cross-attention text conditioning
  • Trained from scratch on a subset of COCO captions
  • Trained on NVIDIA A100 80GB

This is an experimental research model created by Guus (AI Nexus Studios).

Developed By

  • Guus @ AI Nexus Studios

Model Type

  • Denoising Diffusion Probabilistic Model (DDPM)
  • Text-conditioned
  • Pixel-space UNet (in_channels=3, out_channels=3)

Languages

  • English captions

License

MIT (suggested)

Finetuned From

  • Not finetuned; trained from scratch

Model Sources

  • Repository: https://huggingface.co/guus4324343/my-coco-diffusion-model

🧠 Uses

Direct Use

  • Research on pixel-space diffusion
  • Studying diffusion learned from scratch
  • Basic low-resolution text-to-image synthesis
  • Educational ML experiments

Downstream Use

  • Further training
  • Fine-tuning
  • Architecture research
  • Integrating alternative text encoders
  • Switching to a VAE latent-space variant (see the sketch after this list)
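
To illustrate the latent-space direction, here is a minimal sketch of encoding images with a pretrained KL-VAE so a UNet could be trained on latents instead of pixels. The VAE checkpoint (stabilityai/sd-vae-ft-mse) is an assumption for illustration, not part of this model.

import torch
from diffusers import AutoencoderKL

# Assumed checkpoint: any pretrained KL-VAE with an 8x downsampling factor works
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

image = torch.randn(1, 3, 256, 256)  # stand-in for a normalized training image

with torch.no_grad():
    # 256x256x3 pixels -> 32x32x4 latents; a latent UNet would use in/out_channels=4
    latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
    recon = vae.decode(latents / vae.config.scaling_factor).sample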

Out-of-Scope Use

⚠️ NOT suitable for:

  • High-quality realistic image generation
  • Safety-critical applications
  • Harmful, sensitive, or NSFW generation
  • Deepfakes / misinformation
  • Commercial deployment

⚠️ Bias, Risks, and Limitations

  • Model trained on COCO → inherits dataset biases
  • Pixel-space diffusion produces low-detail images
  • Model is unstable at low epoch counts (1–3 epochs)
  • Poor at fine details (faces, text, hands)
  • May output distorted or abstract content
  • Not trained with safety filtering
  • Not appropriate for real-world decision making

πŸš€ How to Get Started with the Model

Below is a minimal working example to generate an image from a prompt.

import torch
from diffusers import UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTokenizer, CLIPTextModel
import torchvision.utils as vutils

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load tokenizer + text encoder
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device)

# Load UNet
model = UNet2DConditionModel.from_pretrained(
    "guus4324343/my-coco-diffusion-model"
).to(device)

scheduler = DDPMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(250)

prompt = "a cute cat sitting in a box"

tokens = tokenizer(prompt, return_tensors="pt", padding="max_length",
                   truncation=True).input_ids.to(device)

with torch.no_grad():
    text_emb = text_encoder(tokens)[0]

# Pixel-space model: the sample being denoised is the 3-channel image itself
image = torch.randn(1, 3, 256, 256, device=device)

with torch.no_grad():
    for t in scheduler.timesteps:
        with torch.autocast(device_type=device):
            # Predict the noise at timestep t, conditioned on the text embedding
            eps = model(image, t, encoder_hidden_states=text_emb).sample
        image = scheduler.step(eps, t, image).prev_sample

# Map from [-1, 1] to [0, 1] before saving
img = (image.clamp(-1, 1) + 1) / 2
vutils.save_image(img, "sample.png")

πŸ‹οΈ Training Details

Training Data

  • Dataset: Multimodal-Fatima/COCO_captions_train
  • Used: 50,000 image-caption pairs (subset)
  • Resolution: 256×256
  • Caption field: sentences_raw[0]
  • Preprocessing: Resize(256) → CenterCrop(256) → ToTensor() → Normalize([0.5], [0.5])
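
A minimal loading/preprocessing sketch matching the settings above. The dataset name and caption field come from this card; the image column name ("image") and the use of 🤗 Datasets are assumptions.

import torchvision.transforms as T
from datasets import load_dataset

# Transform chain from this card: resize, center-crop, scale to [-1, 1]
transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(256),
    T.ToTensor(),
    T.Normalize([0.5], [0.5]),
])

ds = load_dataset("Multimodal-Fatima/COCO_captions_train", split="train")

def preprocess(example):
    return {
        "pixel_values": transform(example["image"].convert("RGB")),  # assumed column name
        "caption": example["sentences_raw"][0],  # first raw caption as conditioning text
    }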

Training Procedure

Model Architecture

  • UNet2DConditionModel
  • in_channels: 3
  • out_channels: 3
  • block_out_channels: (128, 256, 512, 512)
  • layers_per_block: 2
  • cross_attention_dim: 768 (CLIP ViT-L/14 text hidden size)
  • ~0.1B parameters (F32 weights)
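
A sketch of how a UNet with this configuration could be instantiated; only the listed values come from this card, and the remaining arguments fall back to Diffusers defaults.

from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel(
    sample_size=256,
    in_channels=3,
    out_channels=3,
    layers_per_block=2,
    block_out_channels=(128, 256, 512, 512),
    cross_attention_dim=768,  # CLIP ViT-L/14 text hidden size
)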

Hyperparameters

  • Optimizer: AdamW
  • LR: 1e-4
  • Batch size: 8
  • Scheduler: DDPM, 1000 timesteps
  • Epochs: 3
  • Mixed precision: FP16 (torch.amp)
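
A minimal sketch of one training step under these hyperparameters, reusing unet from the sketch above; the function shape and variable names are assumptions, and the objective (MSE on predicted noise) is stated under Technical Specifications below.

import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

scheduler = DDPMScheduler(num_train_timesteps=1000)
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # FP16 mixed precision

def train_step(pixel_values, text_emb):
    # Sample per-image noise and random timesteps
    noise = torch.randn_like(pixel_values)
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (pixel_values.shape[0],), device=pixel_values.device)
    noisy = scheduler.add_noise(pixel_values, noise, t)

    with torch.autocast("cuda"):
        pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
        loss = F.mse_loss(pred, noise)  # MSE on the predicted noise

    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()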

Hardware

  • NVIDIA A100 80GB
  • CUDA acceleration enabled
  • High-speed NVMe storage

Software

  • Python 3.10
  • PyTorch (CUDA 12.1)
  • Diffusers
  • Transformers
  • Accelerate

πŸ“Š Evaluation

Testing Data

No formal testing; qualitative inspection only.

Factors

  • Epoch count
  • Prompt complexity
  • Sampling steps

Metrics

None; experimental model.

Results

  • Epoch 1: abstract textures
  • Epoch 3: early emergence of shapes
  • More epochs required for meaningful images

🌱 Environmental Impact

Estimated using the Machine Learning Impact Calculator:

  • Hardware: NVIDIA A100 80GB
  • Training time: 3–6 hours
  • Power draw: ~250–300 W
  • Estimated CO₂ emissions: 0.2–0.6 kg CO₂ (depending on the region's energy mix)
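
As a rough consistency check (a grid intensity of 0.2–0.5 kg CO₂/kWh is an assumed range): ~275 W × ~4.5 h ≈ 1.2 kWh, which gives roughly 0.25–0.6 kg CO₂, in line with the estimate above.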

πŸ›  Technical Specifications

Model Architecture and Objective

  • DDPM UNet
  • Pixel-space denoising
  • Cross-attention text conditioning
  • No VAE or latent compression
  • MSE loss on predicted noise

Compute Infrastructure

  • Cloud A100 instance
  • High-bandwidth memory
  • Python + CUDA toolchain

Citation

BibTeX

@misc{guus2025cocodiffusion,
  title={My COCO Diffusion Model},
  author={Guus},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/guus4324343/my-coco-diffusion-model}}
}

Model Card Contact

For questions or issues: Guus / AI Nexus Studios

