# CLIP-Based Color Prediction Model (CS2 Skins)

## Overview

This model predicts color attributes from a natural-language query using a fine-tuned CLIP model. It is designed for attribute extraction (color prediction) rather than image ranking.
## Model Objective

Given a query:

`"forest theme"`

the model outputs:

`["green", "brown"]`

This enables downstream filtering systems to retrieve all items matching the predicted colors.
## Model Architecture

Base model: `openai/clip-vit-large-patch14`

Fine-tuning strategy:

- Frozen: vision encoder, text encoder
- Trainable: `text_projection`, `visual_projection`, `logit_scale`
## Inference Pipeline

### 1. Text Encoding

The query is tokenized and passed through the CLIP text encoder:

`T = TextEncoder(query)`
### 2. Projection

The pooled output is projected into the joint embedding space:

`z_q = normalize(W_t · T)`

where:

- `W_t` = text projection matrix
- `normalize(x) = x / ||x||`
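The projection step can be sketched in plain Python with toy numbers (the dimensions and values below are illustrative only, not the model's real weights; CLIP-L/14 actually projects a 768-d pooled output):

```python
import math

# Toy 3-d pooled text encoder output and a toy 2x3 projection matrix.
T = [0.5, -1.0, 2.0]
W_t = [[1.0, 0.0, 0.5],
       [0.0, 2.0, -1.0]]

# W_t · T : matrix-vector product into the joint embedding space.
proj = [sum(w * t for w, t in zip(row, T)) for row in W_t]

# normalize(x) = x / ||x|| : L2-normalize onto the unit sphere.
norm = math.sqrt(sum(x * x for x in proj))
z_q = [x / norm for x in proj]

print(round(math.sqrt(sum(x * x for x in z_q)), 6))  # 1.0 (unit norm)
```

After normalization every embedding has unit length, which is what later makes cosine similarity collapse to a dot product.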
### 3. Color Embedding Space

A fixed vocabulary of colors is used:

`C = ["red", "blue", "green", ...]`

Each color is embedded with the same encoder and projection:

`z_c = normalize(W_t · TextEncoder(color))`
### 4. Similarity Computation

Similarity is computed as cosine similarity:

`sim(q, c) = z_q · z_c`

Since both embeddings are L2-normalized, the cosine similarity reduces to a plain dot product.
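A quick numeric check of this identity, using arbitrary toy vectors (not model embeddings):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

z_q = normalize([0.3, -0.8, 0.5])
z_c = normalize([0.2, -0.7, 0.1])

# cosine = (a · b) / (||a|| ||b||); for unit vectors the denominator is 1.
cosine = dot(z_q, z_c) / (math.sqrt(dot(z_q, z_q)) * math.sqrt(dot(z_c, z_c)))
print(abs(cosine - dot(z_q, z_c)) < 1e-12)  # True
```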
### 5. Color Selection Rule

Let `s_max = max_c sim(q, c)`. The selected colors are:

`S = { c | sim(q, c) ≥ s_max - margin }`

with `margin ≈ 0.1`. Thresholding relative to the top score, rather than against a fixed absolute value, keeps the selection robust across queries whose similarity scores differ in overall scale.
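The selection rule in isolation is a few lines of Python (the scores below are illustrative, not real model output):

```python
def select_colors(sims, margin=0.1):
    """Keep every color whose score is within `margin` of the top score."""
    s_max = max(sims.values())
    return sorted(
        [(c, s) for c, s in sims.items() if s >= s_max - margin],
        key=lambda x: x[1],
        reverse=True,
    )

# Illustrative similarity scores for a "forest theme" style query:
sims = {"green": 0.66, "brown": 0.60, "red": 0.41}
print(select_colors(sims))  # [('green', 0.66), ('brown', 0.60)]
```

With `s_max = 0.66` and `margin = 0.1`, every color scoring at least `0.56` survives, so both `green` and `brown` are kept while `red` is dropped.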
## Output

The model returns a list of `(color, score)` pairs:

`[(color_1, score_1), (color_2, score_2), ...]`

Example:

`[("green", 0.66), ("brown", 0.60)]`
## Usage

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("YOUR_USERNAME/clip-color-filter-cs2")
processor = CLIPProcessor.from_pretrained("YOUR_USERNAME/clip-color-filter-cs2")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)


def embed_texts(texts):
    """Encode texts with the CLIP text tower and L2-normalize the projections."""
    inputs = processor(text=texts, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        outputs = model.text_model(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
        )
        z = model.text_projection(outputs.pooler_output)
    return z / z.norm(dim=-1, keepdim=True)


def predict_colors(query, color_vocab, margin=0.1):
    z_q = embed_texts([query])       # (1, d) query embedding
    z_c = embed_texts(color_vocab)   # (len(vocab), d) color embeddings

    # Cosine similarity: a dot product, since both sides are unit vectors.
    sims = (z_q @ z_c.T).squeeze(0)
    s_max = sims.max().item()

    # Keep every color within `margin` of the best score.
    selected = [
        (color_vocab[i], sims[i].item())
        for i in range(len(color_vocab))
        if sims[i].item() >= s_max - margin
    ]
    return sorted(selected, key=lambda x: x[1], reverse=True)


# Example call:
# predict_colors("forest theme", ["red", "blue", "green", "brown"])
```
## Notes

- The model predicts color attributes only.
- It is intended to be used as one stage of a filtering pipeline.
- It does not perform image ranking.
## Limitations

- No explicit modeling of texture or pattern.
- Performance depends on the quality and coverage of the color vocabulary.
- The semantic mapping from themes (e.g. "forest") to colors is approximate.