NoisyCLIP

Early Estimation of Language-to-Latent Alignment in Diffusion Models
ECCV 2026


Model Description

NoisyCLIP is a noise-aware, twin-tower contrastive model that enables early language-to-latent alignment estimation in diffusion models. Instead of waiting for a diffusion model to finish denoising before checking whether the generated image matches the prompt, NoisyCLIP scores the alignment between a prompt and an intermediate (noisy) latent, turning alignment assessment from an expensive final check into a continuous monitoring tool during generation.

This checkpoint is a fine-tune of openai/clip-vit-large-patch14. Only the vision tower is fine-tuned; the text encoder and both projection heads (text_projection, visual_projection) are kept frozen at their original CLIP weights. Fine-tuning only the vision tower adapts the image encoder to RGB renderings of partially denoised SDXL latents while preserving CLIP's text–image embedding space.

  • Architecture: CLIPModel (ViT-L/14, fully compatible with 🤗 Transformers)
  • Base model: openai/clip-vit-large-patch14
  • Fine-tuned components: vision encoder only (text encoder + projections frozen)
  • Developed by: NOVA LINCS, NOVA School of Science and Technology + Google Research
  • License: MIT

Intended Uses

  • Best-of-N selection / early stopping during diffusion sampling — rank or prune candidate generations from their intermediate latents before full denoising.
  • Reward / alignment signal for inference-time optimization of text-to-image models.
  • Zero-shot prompt–image (and prompt–latent) alignment scoring, like standard CLIP.

It is a drop-in replacement for CLIP ViT-L/14 wherever you would compute a CLIP similarity score, with the key difference that it remains reliable on noisy latent inputs.
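
The best-of-N use case above amounts to scoring each candidate's intermediate latent against the prompt and keeping only the top few for further denoising. Here is a minimal sketch, assuming the scores come from a callable you supply (for example, a wrapper around the model's prompt-latent similarity). The names prune_candidates and score_fn are hypothetical, chosen for illustration, and are not part of NoisyCLIP or 🤗 Transformers:

```python
from typing import Callable, List, Sequence, TypeVar

C = TypeVar("C")


def prune_candidates(
    candidates: Sequence[C],
    score_fn: Callable[[C], float],
    keep: int,
) -> List[C]:
    """Return the `keep` highest-scoring candidates, best first.

    `score_fn` is a hypothetical callable mapping a candidate (e.g. an
    intermediate latent) to its prompt-alignment score.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return ranked[:keep]


# Toy usage: made-up alignment scores keyed by sampling seed.
scores = {"seed-0": 0.21, "seed-1": 0.34, "seed-2": 0.18, "seed-3": 0.29}
survivors = prune_candidates(list(scores), scores.get, keep=2)
print(survivors)  # ['seed-1', 'seed-3']
```

Only the survivors would continue denoising. How early to score and how many candidates to keep are choices this sketch leaves open.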

Usage

NoisyCLIP loads exactly like any CLIP model in 🤗 Transformers:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("asiimo/noisyclip")
processor = CLIPProcessor.from_pretrained("asiimo/noisyclip")

image = Image.open("example.png")  # an RGB image or an RGB-decoded (noisy) latent
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

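# Softmax over the candidate prompts: how strongly each prompt matches the image
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)  # shape (1, 2): one image, two prompts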

Scoring noisy latents

NoisyCLIP is designed to score intermediate diffusion latents. Decode the latent to a 3-channel RGB image with the lightweight linear approximation below (no VAE decode required), then pass it to the processor as a normal image:

import torch
from PIL import Image

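def latents_to_rgb(latents):
    # Linear latent-to-RGB approximation: one row per output colour (R, G, B),
    # one column per latent channel (4 channels, shape (batch, 4, H, W)).
    weights = (
        (60, -60,  25, -70),
        (60,  -5,  15, -50),
        (60,  10,  -5, -35),
    )
    # Transpose to (latent channel, colour channel) for the einsum below.
    w = torch.t(torch.tensor(weights, dtype=latents.dtype, device=latents.device))
    b = torch.tensor((150, 140, 130), dtype=latents.dtype, device=latents.device)
    # Mix the latent channels at every spatial position, then add the bias.
    rgb = torch.einsum("...lxy,lr -> ...rxy", latents, w) + b[:, None, None]
    # Clamp to 0..255, keep only the first batch item, and go from (C, H, W) to (H, W, C).
    arr = rgb.clamp(0, 255)[0].byte().cpu().numpy().transpose(1, 2, 0)
    return Image.fromarray(arr)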
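
To see what the linear approximation above does, it helps to work through a single spatial position. Each output colour is a weighted sum of the four latent channels plus a bias, clamped to the 8-bit range. The sketch below repeats that arithmetic in plain Python with the same weights and bias. The helper latent_pixel_to_rgb is hypothetical, shown for illustration only, and is not needed for scoring:

```python
# Same coefficients as latents_to_rgb: one row per colour (R, G, B),
# one column per latent channel.
WEIGHTS = (
    (60, -60, 25, -70),
    (60, -5, 15, -50),
    (60, 10, -5, -35),
)
BIAS = (150, 140, 130)


def latent_pixel_to_rgb(z):
    """Map one 4-channel latent value to an (R, G, B) tuple in 0..255."""
    if len(z) != 4:
        raise ValueError("expected 4 latent channels")
    rgb = []
    for row, bias in zip(WEIGHTS, BIAS):
        value = sum(w * c for w, c in zip(row, z)) + bias
        # Clamp to the 8-bit range, then truncate like a uint8 cast.
        rgb.append(int(min(255.0, max(0.0, value))))
    return tuple(rgb)


print(latent_pixel_to_rgb((0, 0, 0, 0)))  # (150, 140, 130)
print(latent_pixel_to_rgb((1, 0, 0, 0)))  # (210, 200, 190)
print(latent_pixel_to_rgb((0, 0, 0, 3)))  # (0, 0, 25)
```

A zero latent maps to the bias colour. The first latent channel carries the same weight (60) in all three colours, so it shifts overall brightness rather than hue.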

Training

NoisyCLIP is trained with the standard CLIP contrastive objective on pairs of prompts and intermediate SDXL latents (decoded to RGB). The text encoder and projection heads are frozen so that only the vision backbone adapts to the noisy-latent domain. See the project page and repository for the training pipeline.
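
For reference, the standard CLIP objective is a symmetric cross-entropy over the similarity matrix of a batch. The form below is the usual textbook statement and is not taken from the NoisyCLIP training code. The notation is assumed here: $N$ is the batch size, $v_i$ the L2-normalised embedding of the $i$-th RGB-decoded latent, $t_i$ the L2-normalised embedding of its prompt, and $\tau$ the temperature. How the temperature is handled during fine-tuning is not stated in this card.

```latex
s_{ij} = \frac{v_i^{\top} t_j}{\tau}, \qquad
\mathcal{L} = -\frac{1}{2N} \sum_{i=1}^{N} \left[
  \log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ij})}
  + \log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ji})}
\right]
```

With the text encoder and projection heads frozen, the $t_j$ act as fixed targets, and gradients of $\mathcal{L}$ update only the vision backbone that produces the $v_i$.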

Limitations

  • The vision encoder is specialized for SDXL-style latents decoded via the linear approximation above; behavior on other latent spaces or decoders may differ.
  • Inherits the biases and failure modes of the underlying CLIP ViT-L/14 and of the SDXL data used for fine-tuning.
  • Intended as an alignment / ranking signal.

Citation

If you find NoisyCLIP useful for your research, please cite:

TO BE UPDATED