# CLIP-Based Color Prediction Model (CS2 Skins)

## Overview

This model predicts color attributes from a natural-language query using a fine-tuned CLIP model. It is designed for attribute extraction (color prediction) rather than image ranking.
## Model Objective

Given a query:

`"forest theme"`

the model outputs:

`["green", "brown"]`

This enables downstream filtering systems to retrieve all items matching the predicted colors.
## Model Architecture

Base model: `openai/clip-vit-large-patch14`

Fine-tuning strategy:

- Frozen: vision encoder, text encoder
- Trainable: `text_projection`, `visual_projection`, `logit_scale`
## Inference Pipeline

### 1. Text Encoding

The query is tokenized and passed through the CLIP text encoder:

`T = TextEncoder(query)`
### 2. Projection

The pooled output is projected into the joint embedding space:

`z_q = normalize(W_t · T)`

where:

- `W_t` = text projection matrix
- `normalize(x) = x / ||x||`
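The projection step can be sketched in plain Python with toy numbers (the dimensions and values below are illustrative only, not the model's real weights; CLIP-L/14 actually projects a 768-d pooled output):

```python
import math

# Toy 3-d pooled text encoder output and a toy 2x3 projection matrix.
T = [0.5, -1.0, 2.0]
W_t = [[1.0, 0.0, 0.5],
       [0.0, 2.0, -1.0]]

# W_t · T : matrix-vector product into the joint embedding space.
proj = [sum(w * t for w, t in zip(row, T)) for row in W_t]

# normalize(x) = x / ||x|| : L2-normalize onto the unit sphere.
norm = math.sqrt(sum(x * x for x in proj))
z_q = [x / norm for x in proj]

print(round(math.sqrt(sum(x * x for x in z_q)), 6))  # 1.0 (unit norm)
```

After normalization every embedding has unit length, which is what later makes cosine similarity collapse to a dot product.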
### 3. Color Embedding Space

A fixed vocabulary of colors is used:

`C = ["red", "blue", "green", ...]`

Each color is embedded with the same encoder and projection:

`z_c = normalize(W_t · TextEncoder(color))`
### 4. Similarity Computation

Similarity is computed as cosine similarity:

`sim(q, c) = z_q · z_c`

Since both embeddings are L2-normalized, the cosine similarity reduces to a plain dot product.
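A quick numeric check of this identity, using arbitrary toy vectors (not model embeddings):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

z_q = normalize([0.3, -0.8, 0.5])
z_c = normalize([0.2, -0.7, 0.1])

# cosine = (a · b) / (||a|| ||b||); for unit vectors the denominator is 1.
cosine = dot(z_q, z_c) / (math.sqrt(dot(z_q, z_q)) * math.sqrt(dot(z_c, z_c)))
print(abs(cosine - dot(z_q, z_c)) < 1e-12)  # True
```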
### 5. Color Selection Rule

Let `s_max = max_c sim(q, c)`. The selected colors are:

`S = { c | sim(q, c) ≥ s_max - margin }`

with `margin ≈ 0.1`. Thresholding relative to the top score, rather than against a fixed absolute value, keeps the selection robust across queries whose similarity scores differ in overall scale.
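The selection rule in isolation is a few lines of Python (the scores below are illustrative, not real model output):

```python
def select_colors(sims, margin=0.1):
    """Keep every color whose score is within `margin` of the top score."""
    s_max = max(sims.values())
    return sorted(
        [(c, s) for c, s in sims.items() if s >= s_max - margin],
        key=lambda x: x[1],
        reverse=True,
    )

# Illustrative similarity scores for a "forest theme" style query:
sims = {"green": 0.66, "brown": 0.60, "red": 0.41}
print(select_colors(sims))  # [('green', 0.66), ('brown', 0.60)]
```

With `s_max = 0.66` and `margin = 0.1`, every color scoring at least `0.56` survives, so both `green` and `brown` are kept while `red` is dropped.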
## Output

The model returns a list of `(color, score)` pairs:

`[(color_1, score_1), (color_2, score_2), ...]`

Example:

`[("green", 0.66), ("brown", 0.60)]`
## Usage

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("YOUR_USERNAME/clip-color-filter-cs2")
processor = CLIPProcessor.from_pretrained("YOUR_USERNAME/clip-color-filter-cs2")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)


def embed_texts(texts):
    """Encode texts with the CLIP text tower and L2-normalize the projections."""
    inputs = processor(text=texts, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        outputs = model.text_model(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
        )
        z = model.text_projection(outputs.pooler_output)
    return z / z.norm(dim=-1, keepdim=True)


def predict_colors(query, color_vocab, margin=0.1):
    z_q = embed_texts([query])       # (1, d) query embedding
    z_c = embed_texts(color_vocab)   # (len(vocab), d) color embeddings

    # Cosine similarity: a dot product, since both sides are unit vectors.
    sims = (z_q @ z_c.T).squeeze(0)
    s_max = sims.max().item()

    # Keep every color within `margin` of the best score.
    selected = [
        (color_vocab[i], sims[i].item())
        for i in range(len(color_vocab))
        if sims[i].item() >= s_max - margin
    ]
    return sorted(selected, key=lambda x: x[1], reverse=True)


# Example call:
# predict_colors("forest theme", ["red", "blue", "green", "brown"])
```
## Notes

- The model predicts color attributes only.
- It is intended to be used as one stage of a filtering pipeline.
- It does not perform image ranking.
## Limitations

- No explicit modeling of texture or pattern.
- Performance depends on the quality and coverage of the color vocabulary.
- The semantic mapping from themes (e.g. "forest") to colors is approximate.