NoisyCLIP
Early Estimation of Language-to-Latent Alignment in Diffusion Models
ECCV 2026
Model Description
NoisyCLIP is a noise-aware, twin-tower contrastive model that enables early language-to-latent alignment estimation in diffusion models. Instead of waiting for a diffusion model to finish denoising before checking whether the generated image matches the prompt, NoisyCLIP scores the alignment between a prompt and an intermediate (noisy) latent, turning alignment assessment from an expensive final check into a continuous monitoring tool during generation.
This checkpoint is a fine-tune of openai/clip-vit-large-patch14. Only the vision tower is fine-tuned; the text encoder and both projection heads (text_projection, visual_projection) are kept frozen from the original CLIP. This adapts the image encoder to operate on RGB renderings of partially-denoised SDXL latents, while preserving CLIP's text–image embedding space.
- Architecture: CLIPModel (ViT-L/14, fully compatible with 🤗 Transformers)
- Base model: openai/clip-vit-large-patch14
- Fine-tuned components: vision encoder only (text encoder + projections frozen)
- Developed by: NOVA LINCS, NOVA School of Science and Technology + Google Research
- License: MIT
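The frozen/trainable split described above can be sketched as follows. This is an illustration, not the authors' training code: the helper names `freeze_all_but_vision` and `count_trainable` are hypothetical, and the attribute names (`text_model`, `vision_model`, `text_projection`, `visual_projection`) are assumed to match those of `CLIPModel` in 🤗 Transformers. The card does not say how the temperature parameter (`logit_scale`) was handled, so the sketch leaves it untouched.

```python
import torch.nn as nn


def freeze_all_but_vision(model: nn.Module) -> nn.Module:
    """Freeze the text tower and both projection heads; keep the vision tower trainable.

    Assumes CLIPModel-style attribute names (an assumption, see the lead-in).
    """
    for module in (model.text_model, model.text_projection, model.visual_projection):
        for param in module.parameters():
            param.requires_grad = False
    for param in model.vision_model.parameters():
        param.requires_grad = True
    return model


def count_trainable(model: nn.Module) -> int:
    """Number of parameters that would still receive gradients."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

Applied to a model loaded with `CLIPModel.from_pretrained("openai/clip-vit-large-patch14")`, this should leave gradients enabled for the vision encoder (and for any parameter not covered by the four attributes above).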
Intended Uses
- Best-of-N selection / early stopping during diffusion sampling — rank or prune candidate generations from their intermediate latents before full denoising.
- Reward / alignment signal for inference-time optimization of text-to-image models.
- Zero-shot prompt–image (and prompt–latent) alignment scoring, like standard CLIP.
It is a drop-in replacement for CLIP ViT-L/14 wherever you would compute a CLIP similarity score, with the key difference that it remains reliable on noisy latent inputs.
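As a rough illustration of the best-of-N use case, the pruning step reduces to keeping the highest-scoring candidates at some intermediate denoising step. The sketch below assumes the alignment scores have already been computed (one per candidate); the function name `keep_top_k` and the score values are hypothetical.

```python
from typing import List, Sequence


def keep_top_k(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring candidates, best first."""
    if k <= 0:
        return []
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Hypothetical prompt-latent alignment scores for 4 candidates at an intermediate step.
scores = [0.21, 0.34, 0.18, 0.29]
survivors = keep_top_k(scores, k=2)
print(survivors)  # [1, 3]
```

Only the surviving candidates would then be denoised to completion, which is where the compute saving comes from.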
Usage
NoisyCLIP loads exactly like any CLIP model in 🤗 Transformers:
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
model = CLIPModel.from_pretrained("asiimo/noisyclip")
processor = CLIPProcessor.from_pretrained("asiimo/noisyclip")
image = Image.open("example.png") # an RGB image or an RGB-decoded (noisy) latent
texts = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
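For reference, `logits_per_image` in a CLIP-style model is the cosine similarity between the projected image and text embeddings, multiplied by a learned temperature. The sketch below reproduces that computation on random stand-in embeddings so it runs without downloading the checkpoint. The function name `clip_style_probs` and the default `logit_scale=100.0` are assumptions for illustration; with the real model the scale would come from `model.logit_scale.exp()`.

```python
import torch
import torch.nn.functional as F


def clip_style_probs(image_embeds, text_embeds, logit_scale=100.0):
    """Softmax over texts of scaled cosine similarities, one row per image."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    cosine = image_embeds @ text_embeds.T           # (num_images, num_texts)
    return (logit_scale * cosine).softmax(dim=-1), cosine


# Random stand-ins for one image embedding and two text embeddings.
torch.manual_seed(0)
probs, cosine = clip_style_probs(torch.randn(1, 768), torch.randn(2, 768))
print(probs.shape, cosine.shape)
```

Note that the softmax is relative to the candidate texts supplied. When comparing several latents against a single prompt, the raw cosine similarity (or `logits_per_text`) is the more natural score to rank by.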
Scoring noisy latents
NoisyCLIP is designed to score intermediate diffusion latents. Decode the latent to a 3-channel RGB image with the lightweight linear approximation below (no VAE decode required), then pass it to the processor as a normal image:
import torch
from PIL import Image
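def latents_to_rgb(latents):
    # Linear map from the 4 latent channels to RGB: one row per output
    # channel (R, G, B), one column per latent channel.
    weights = (
        (60, -60, 25, -70),
        (60, -5, 15, -50),
        (60, 10, -5, -35),
    )
    w = torch.t(torch.tensor(weights, dtype=latents.dtype, device=latents.device))  # (4, 3)
    b = torch.tensor((150, 140, 130), dtype=latents.dtype, device=latents.device)  # per-channel offset
    # (..., 4, H, W) x (4, 3) -> (..., 3, H, W)
    rgb = torch.einsum("...lxy,lr -> ...rxy", latents, w) + b[:, None, None]
    # Clamp to [0, 255] and convert the first latent in the batch to an HWC uint8 array.
    arr = rgb.clamp(0, 255)[0].byte().cpu().numpy().transpose(1, 2, 0)
    return Image.fromarray(arr)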
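The function above converts only the first latent in a batch. For best-of-N scoring it can be convenient to convert every candidate at once; the variant below is a sketch of that, using the same weights and offsets. The name `latents_to_rgb_batch` is hypothetical and the input is assumed to have shape (B, 4, H, W).

```python
import torch
from PIL import Image


def latents_to_rgb_batch(latents: torch.Tensor) -> list:
    """Convert a (B, 4, H, W) latent batch into a list of B RGB PIL images."""
    weights = torch.tensor(
        ((60, -60, 25, -70), (60, -5, 15, -50), (60, 10, -5, -35)),
        dtype=latents.dtype, device=latents.device,
    )  # (3, 4): rows are R, G, B
    bias = torch.tensor((150, 140, 130), dtype=latents.dtype, device=latents.device)
    # (B, 4, H, W) x (3, 4) -> (B, 3, H, W)
    rgb = torch.einsum("blxy,rl->brxy", latents, weights) + bias[None, :, None, None]
    # Clamp, cast to uint8, and move channels last for PIL.
    arr = rgb.clamp(0, 255).byte().permute(0, 2, 3, 1).contiguous().cpu().numpy()
    return [Image.fromarray(a) for a in arr]


# Random stand-in for a batch of 3 intermediate latents.
images = latents_to_rgb_batch(torch.randn(3, 4, 16, 24))
print(len(images), images[0].size, images[0].mode)
```

The resulting list can be passed as `images=` to the processor together with a single prompt, in which case `outputs.logits_per_text[0]` should hold one score per latent.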
Training
NoisyCLIP is trained with the standard CLIP contrastive objective on pairs of prompts and intermediate SDXL latents (decoded to RGB). The text encoder and projection heads are frozen so that only the vision backbone adapts to the noisy-latent domain. See the project page and repository for the training pipeline.
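For readers unfamiliar with the objective, a minimal sketch of the symmetric CLIP contrastive loss is given below. It is not the authors' training code: the function name `clip_contrastive_loss` and the fixed `logit_scale=100.0` are assumptions, and row i of each embedding batch is assumed to belong to the same prompt-latent pair.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_embeds, text_embeds, logit_scale=100.0):
    """Symmetric cross-entropy over a (B, B) similarity matrix; matches lie on the diagonal."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits_per_text = logit_scale * text_embeds @ image_embeds.T   # (B, B)
    targets = torch.arange(logits_per_text.size(0), device=logits_per_text.device)
    loss_text = F.cross_entropy(logits_per_text, targets)       # text -> image
    loss_image = F.cross_entropy(logits_per_text.T, targets)    # image -> text
    return (loss_text + loss_image) / 2


# Perfectly aligned pairs give a loss near zero; shuffled pairs give a large loss.
aligned = clip_contrastive_loss(torch.eye(4), torch.eye(4))
shuffled = clip_contrastive_loss(torch.eye(4), torch.eye(4).roll(1, dims=0))
print(float(aligned), float(shuffled))
```

With the text tower and projections frozen, gradients from this loss would update only the vision encoder, which is what moves noisy-latent renderings toward the existing CLIP text embeddings.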
Limitations
- The vision encoder is specialized for SDXL-style latents decoded via the linear approximation above; behavior on other latent spaces or decoders may differ.
- Inherits the biases and failure modes of the underlying CLIP ViT-L/14 and of the SDXL data used for fine-tuning.
- Intended as an alignment / ranking signal.
Citation
If you find NoisyCLIP useful for your research, please cite:
TO BE UPDATED