NoisyCLIP

Early Estimation of Language-to-Latent Alignment in Diffusion Models
ECCV 2026

Model Description

NoisyCLIP is a noise-aware, twin-tower contrastive model that enables early language-to-latent alignment estimation in diffusion models. Instead of waiting for a diffusion model to finish denoising before checking whether the generated image matches the prompt, NoisyCLIP scores the alignment between a prompt and an intermediate (noisy) latent, turning alignment assessment from an expensive final check into a continuous monitoring tool during generation.

This checkpoint is a fine-tune of openai/clip-vit-large-patch14. Only the vision tower is fine-tuned; the text encoder and both projection heads (text_projection, visual_projection) are kept frozen from the original CLIP. This adapts the image encoder to operate on RGB renderings of partially-denoised SDXL latents, while preserving CLIP's text–image embedding space.

Architecture: CLIPModel (ViT-L/14, fully compatible with 🤗 Transformers)
Base model: openai/clip-vit-large-patch14
Fine-tuned components: vision encoder only (text encoder + projections frozen)
Developed by: NOVA LINCS, NOVA School of Science and Technology + Google Research
License: MIT

Intended Uses

Best-of-N selection / early stopping during diffusion sampling: rank or prune candidate generations from their intermediate latents before full denoising.
Reward / alignment signal for inference-time optimization of text-to-image models.
Zero-shot prompt–image (and prompt–latent) alignment scoring, like standard CLIP.

NoisyCLIP is a drop-in replacement for CLIP ViT-L/14 wherever you would compute a CLIP similarity score. The key difference is that its vision encoder is fine-tuned to remain reliable on noisy latent inputs.
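The best-of-N use case can be sketched as a small ranking helper. This is a minimal illustration, not the authors' pipeline: `select_top_k` and `score_fn` are hypothetical names, and `score_fn` stands in for whatever code renders an intermediate latent to RGB and returns its NoisyCLIP similarity to the prompt.

```python
from typing import Callable, List, Sequence, TypeVar

Candidate = TypeVar("Candidate")


def select_top_k(
    candidates: Sequence[Candidate],
    score_fn: Callable[[Candidate], float],
    k: int,
) -> List[int]:
    """Return the indices of the k highest-scoring candidates, best first."""
    if k <= 0:
        raise ValueError("k must be positive")
    scores = [score_fn(c) for c in candidates]
    # Sort indices by descending score; sorted() is stable, so ties keep input order.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return ranked[:k]


# Stand-in scores for four intermediate latents (made-up numbers for illustration).
fake_scores = {"a": 0.21, "b": 0.34, "c": 0.18, "d": 0.29}
kept = select_top_k(list(fake_scores), fake_scores.get, k=2)
print(kept)  # [1, 3]
```

Only the candidates whose indices are returned would continue denoising; the rest can be dropped to save compute.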

Usage

NoisyCLIP loads exactly like any CLIP model in 🤗 Transformers:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("asiimo/noisyclip")
processor = CLIPProcessor.from_pretrained("asiimo/noisyclip")

image = Image.open("example.png")  # an RGB image or an RGB-decoded (noisy) latent
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

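# logits_per_image has shape (num_images, num_texts); the softmax over texts
# gives the relative match of each prompt to the image.
probs = outputs.logits_per_image.softmax(dim=-1)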
print(probs)
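For reference, the relationship between these probabilities and the underlying similarity can be written out. This follows how CLIPModel in Transformers computes logits_per_image; the symbols v_i, t_j, \tau, \ell and p are notation introduced here, not names from the model card.

```latex
\[
\ell_{ij} = e^{\tau}\,\frac{v_i^{\top} t_j}{\lVert v_i \rVert\,\lVert t_j \rVert},
\qquad
p_{ij} = \frac{\exp(\ell_{ij})}{\sum_{k} \exp(\ell_{ik})}
\]
```

Here v_i is the projected embedding of image i, t_j the projected embedding of prompt j, \tau the learned model.logit_scale, \ell the logits_per_image matrix, and p the printed probs. Because the softmax normalizes over the prompts supplied, a single prompt always yields a probability of 1. For an absolute score, compare logits_per_image directly, or divide it by model.logit_scale.exp() to recover the cosine similarity.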

Scoring noisy latents

NoisyCLIP is designed to score intermediate diffusion latents. Decode the latent to a 3-channel RGB image with the lightweight linear approximation below (no VAE decode required), then pass it to the processor as a normal image:

import torch
from PIL import Image

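def latents_to_rgb(latents):
    """Approximate 4-channel latents of shape (B, 4, H, W) as an RGB PIL image.

    Only the first latent in the batch is converted and returned.
    """
    # Each row maps the 4 latent channels to one output channel (R, G, B).
    weights = (
        (60, -60,  25, -70),
        (60,  -5,  15, -50),
        (60,  10,  -5, -35),
    )
    # Transpose to (4, 3) so the einsum contracts over the latent channel axis.
    w = torch.t(torch.tensor(weights, dtype=latents.dtype, device=latents.device))
    b = torch.tensor((150, 140, 130), dtype=latents.dtype, device=latents.device)
    rgb = torch.einsum("...lxy,lr -> ...rxy", latents, w) + b[:, None, None]
    # Clamp to the valid pixel range, take the first batch item, convert to HWC uint8.
    arr = rgb.clamp(0, 255)[0].byte().cpu().numpy().transpose(1, 2, 0)
    return Image.fromarray(arr)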
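Written out per pixel, the function above is an affine map from the four latent channels to RGB, followed by clamping. The coefficients are taken directly from the code; the symbols z_0, ..., z_3 for the latent channel values are notation introduced here.

```latex
\[
\begin{pmatrix} R \\ G \\ B \end{pmatrix}
= \operatorname{clamp}_{[0,\,255]}\!\left(
\begin{pmatrix}
60 & -60 & 25 & -70 \\
60 & -5  & 15 & -50 \\
60 & 10  & -5 & -35
\end{pmatrix}
\begin{pmatrix} z_0 \\ z_1 \\ z_2 \\ z_3 \end{pmatrix}
+
\begin{pmatrix} 150 \\ 140 \\ 130 \end{pmatrix}
\right)
\]
```

For example, an all-zero latent pixel maps to (R, G, B) = (150, 140, 130).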

Training

NoisyCLIP is trained with the standard CLIP contrastive objective on pairs of prompts and intermediate SDXL latents (decoded to RGB). The text encoder and projection heads are frozen so that only the vision backbone adapts to the noisy-latent domain. See the project page and repository for the training pipeline.
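As a reference point, the symmetric contrastive loss usually meant by "the standard CLIP contrastive objective" can be written as below. This is a sketch under that assumption: the batch size N, the notation, and whether the temperature \tau is trained or frozen are not stated in this model card, so consult the repository for the exact formulation.

```latex
\[
s_{ij} = e^{\tau}\,\frac{v_i^{\top} t_j}{\lVert v_i \rVert\,\lVert t_j \rVert},
\qquad
\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}
\left[
\log\frac{\exp(s_{ii})}{\sum_{j=1}^{N}\exp(s_{ij})}
+
\log\frac{\exp(s_{ii})}{\sum_{j=1}^{N}\exp(s_{ji})}
\right]
\]
```

Here v_i is the embedding of the i-th RGB-decoded latent and t_i the embedding of its paired prompt. With the text encoder and projection heads frozen, gradients from this loss would update only the vision backbone that produces v_i.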

Limitations

The vision encoder is specialized for SDXL-style latents decoded via the linear approximation above; behavior on other latent spaces or decoders may differ.
Inherits the biases and failure modes of the underlying CLIP ViT-L/14 and of the SDXL data used for fine-tuning.
Intended as an alignment / ranking signal.

Citation

If you find NoisyCLIP useful for your research, please cite:

TO BE UPDATED

Model size: 0.4B params (Safetensors, tensor type F32)

Paper

Beyond the Noise: Aligning Prompts with Latent Representations in Diffusion Models (2512.08505, published Dec 9, 2025)