---
library_name: cinemaclip
pipeline_tag: zero-shot-image-classification
tags:
- clip
- mobile-clip
- cinema
- film
- movies
- multi-task
- hybrid
- cinematography
- domain-specific
- image-classification
- zero-shot
base_model: apple/MobileCLIP-S1-OpenCLIP
base_model_relation: finetune
license: apple-amlr
license_name: apple-ascl
license_link: https://github.com/apple/ml-mobileclip/blob/main/LICENSE_MODELS
---
# CinemaCLIP-1.0.0
**CinemaCLIP** is a [MobileCLIP-S1](https://huggingface.co/apple/MobileCLIP-S1-OpenCLIP) fine-tune specialized for understanding the visual language of cinema at a frame level. It is a hybrid CLIP model with 23 classifier heads that represent a comprehensive taxonomy built with domain experts. For more info, see our [launch blog post](https://www.ozu.ai/cinemaclip).
This repository ships three serialized forms of the same model:
- **Torch** (`model.safetensors`): load via the `cinemaclip` Python package.
- **CoreML** (`ImageEncoder.mlmodel`, `ImageEncoder.mlpackage` and `TextEncoder.mlpackage`): on-device Apple Neural Engine inference.
- **ONNX** (`ImageEncoder.onnx`, `TextEncoder.onnx`, plus `_fp16` variants): cross-platform inference.
## Install
```bash
pip install cinemaclip # core
pip install "cinemaclip[coreml]" # CoreML export/inference
pip install "cinemaclip[onnx]" # ONNX export/inference
```
## Usage (PyTorch)
```python
from PIL import Image
from cinemaclip import CinemaCLIP
model = CinemaCLIP.from_pretrained("OZU-Technology/CinemaCLIP").eval()
# End-to-end classification on a PIL image
image = Image.open("still.jpg").convert("RGB")
predictions = model.predict_image(image)
predictions["classifier_preds"] # Classifier predictions
predictions["clip_image_embedding"]
# Just the image embedding
x = model.preprocess(image).unsqueeze(0)
image_embedding = model.encode_image(x, normalize=True) # [1, 512]
# Just the text embedding
tokens = model.tokenizer(["a medium closeup of "])
text_embedding = model.encode_text(tokens, normalize=True) # [1, 512]
```
The `CinemaCLIP.predict_image` method demonstrates how to get post-processed classifier outputs from the model. It is not optimized or production-ready and should be treated as a reference implementation.
## Usage (CoreML)
```python
import coremltools as ct
from PIL import Image
img_encoder = ct.models.MLModel("ImageEncoder.mlpackage")
# Input must be 256x256 RGB, resized with BICUBIC for parity with the released torch outputs.
img = Image.open("still.jpg").convert("RGB").resize((256, 256), Image.Resampling.BICUBIC)
out = img_encoder.predict({"Image": img})
embedding = out["clip_image_embedding"] # [512]
probabilities = out["probabilities"] # [101] — concat of 23 per-category outputs
# TODO
text_encoder = ct.models.MLModel("TextEncoder.mlpackage")
```
## Usage (ONNX)
```python
from PIL import Image
from onnxruntime import InferenceSession
from torchvision import transforms as T
img = Image.open("still.jpg").convert("RGB")
preprocess = T.Compose([
    T.Resize((256, 256), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),  # yields float tensor in [0, 1]; no mean/std normalization
])
x = preprocess(img).unsqueeze(0).numpy()
session = InferenceSession("ImageEncoder.onnx", providers=["CPUExecutionProvider"])
emb, probs = session.run(None, {"Image": x})
```
## Output structure
`probabilities` is a flat `[101]` vector — the concatenation of all 23 classifier heads' post-activation outputs. Label names and positions are in the shipped `CinemaNetSchema.json`:
```python
import json
with open("CinemaNetSchema.json") as f:
    schema = json.load(f)
label_names = schema["probabilities_labels"] # len == 101
```
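
To make the flat vector easier to read, each label can be paired with its score. The following is a minimal sketch and not part of the `cinemaclip` package: `top_labels` is a hypothetical helper, and the label strings and scores are made-up stand-ins for `schema["probabilities_labels"]` and a real `probabilities` output.

```python
def top_labels(label_names, probabilities, k=3):
    """Return the k (label, score) pairs with the highest scores."""
    if len(label_names) != len(probabilities):
        raise ValueError("labels and probabilities must have the same length")
    pairs = zip(label_names, (float(p) for p in probabilities))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up stand-ins; real inputs would have 101 entries each.
demo_labels = ["label_a", "label_b", "label_c", "label_d"]
demo_probs = [0.10, 0.70, 0.05, 0.15]

for name, score in top_labels(demo_labels, demo_probs, k=2):
    print(f"{name}: {score:.2f}")
```

Because the vector concatenates several heads, a global top-k mixes scores from different classifiers; grouping the labels by head first is usually more meaningful.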
The classifier heads fall into three types:
- Single label (softmax activation)
- Multi label (sigmoid activation)
- Binary (sigmoid activation)
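
The difference between the two activations determines how each head's scores should be read. The sketch below uses only the standard library and hypothetical raw outputs; it illustrates the general technique, not the exact post-processing in `cinemaclip`. The `THRESHOLD` value is illustrative only, since the shipped schema carries its own confidence thresholds.

```python
import math


def softmax(logits):
    """Single-label head: scores compete and sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logit):
    """Multi-label or binary head: each score is independent, in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical raw outputs for one head of each type.
single_label = softmax([2.0, 0.5, -1.0])         # pick the argmax
multi_label = [sigmoid(v) for v in [1.2, -0.3]]  # keep every score above a threshold
binary = sigmoid(0.8)                            # one score, one threshold

THRESHOLD = 0.5  # illustrative value, not taken from the schema
print(single_label.index(max(single_label)))
print([p >= THRESHOLD for p in multi_label])
print(binary >= THRESHOLD)
```

For a single-label head only the highest-scoring label is meaningful, while a multi-label head can return zero, one, or several labels for the same frame.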
## Evaluation
On cinematic understanding tasks, `CinemaCLIP` outperforms both the largest existing CLIP models (up to 28x larger) and the leading `4B` VLMs we benchmarked.
Two inference modes are reported for CinemaCLIP:
- **Classifier** — the shipped supervised heads on the CinemaCLIP image embedding.
- **0-shot** — zero-shot text/image similarity using CinemaCLIP's own text encoder.
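
The 0-shot mode can be sketched as picking the text prompt whose embedding is most similar to the image embedding. The example below is a standard-library illustration with toy 3-d vectors and hypothetical prompts; the prompts used for the benchmark are not given here. With real embeddings from `model.encode_image` and `model.encode_text` (both called with `normalize=True`), the cosine similarity reduces to a dot product.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot(image_embedding, text_embeddings):
    """Return the best-matching prompt and the score for every prompt."""
    scores = {
        prompt: cosine(image_embedding, emb)
        for prompt, emb in text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy vectors standing in for 512-d CinemaCLIP embeddings.
image_embedding = [0.9, 0.1, 0.0]
text_embeddings = {
    "a close-up shot": [1.0, 0.0, 0.0],  # hypothetical prompt
    "a wide shot": [0.0, 1.0, 0.0],      # hypothetical prompt
}
best, scores = zero_shot(image_embedding, text_embeddings)
print(best)
```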
| Category | CinemaCLIP 0-shot | CinemaCLIP Classifier | Qwen3.5-4B | Gemma4-4B | InternVL3.5-4B | Molmo2-4B | DFN ViT-H-14 | MetaCLIP PE-bigG | OpenAI ViT-L-14 | MobileCLIP-S1 | DFN ViT-L-14 | SigLIP2 SO400M | SigLIP2 ViT-gopt |
|---|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| **Mean** | **82.9** | **87.6** | **57.6** | **56.7** | **55.3** | **55.3** | **45.9** | **45.2** | **44.8** | **44.2** | **39.0** | **38.7** | **36.5** |
| Color Contrast | 89.6 | 86.8 | 33.7 | 35.3 | 33.7 | 35.3 | 34.0 | 33.1 | 49.4 | 38.7 | 37.1 | 57.7 | 25.2 |
| Color Key | 84.9 | 92.9 | 78.1 | 78.1 | 80.3 | 64.3 | 58.2 | 50.2 | 53.2 | 59.4 | 48.3 | 22.8 | 52.6 |
| Color Saturation | 82.6 | 82.6 | 66.5 | 65.4 | 72.1 | 45.9 | 55.1 | 61.8 | 58.1 | 35.8 | 46.8 | 33.3 | 31.8 |
| Color Theory | 71.3 | 72.7 | 54.0 | 51.7 | 50.7 | 48.7 | 54.7 | 51.7 | 50.7 | 47.3 | 47.7 | 31.3 | 31.7 |
| Color Tones | 86.0 | 86.5 | 50.2 | 62.6 | 70.6 | 62.1 | 58.5 | 50.2 | 52.0 | 55.7 | 47.2 | 24.0 | 17.7 |
| Lighting Cast | 85.9 | 90.4 | 38.3 | 53.3 | 39.8 | 35.7 | 25.4 | 29.3 | 28.8 | 35.7 | 22.8 | 37.8 | 18.2 |
| Lighting Contrast | 93.9 | 95.3 | 29.8 | 39.1 | 38.7 | 46.1 | 35.3 | 35.5 | 32.6 | 39.0 | 39.4 | 48.4 | 37.6 |
| Lighting Edge | 87.6 | 90.4 | 22.8 | 38.8 | 31.2 | 40.4 | 22.4 | 31.6 | 41.6 | 34.0 | 21.2 | 26.0 | 25.6 |
| Lighting Silhouette | 88.4 | 93.1 | 80.9 | 63.0 | 48.9 | 48.8 | 66.6 | 67.1 | 67.4 | 58.4 | 43.5 | 46.2 | 78.9 |
| Shot Angle | 73.4 | 82.3 | 41.9 | 49.2 | 33.2 | 49.9 | 28.0 | 13.7 | 19.0 | 19.6 | 25.9 | 21.3 | 17.2 |
| Shot Composition | 95.5 | 96.0 | 46.0 | 54.5 | 55.7 | 60.5 | 27.8 | 24.3 | 21.3 | 22.0 | 25.2 | 31.4 | 11.4 |
| Shot Dutch Angle | 61.9 | 78.5 | 62.2 | 65.1 | 46.7 | 49.3 | 27.3 | 44.5 | 38.4 | 56.6 | 25.9 | 47.6 | 68.7 |
| Shot Focus | 71.3 | 71.2 | 19.9 | 26.6 | 26.3 | 25.1 | 32.9 | 31.2 | 24.4 | 31.3 | 37.3 | 48.2 | 12.6 |
| Shot Framing | 79.2 | 83.8 | 38.0 | 29.6 | 40.1 | 34.6 | 33.6 | 24.9 | 23.5 | 23.9 | 33.0 | 7.3 | 9.8 |
| Shot Height | 90.5 | 91.8 | 38.1 | 37.4 | 41.2 | 53.0 | 37.6 | 33.7 | 28.9 | 24.0 | 33.6 | 29.6 | 23.9 |
| Shot Lens Size | 67.9 | 70.6 | 49.6 | 28.0 | 43.6 | 46.6 | 32.1 | 28.0 | 34.5 | 30.1 | 25.7 | 30.1 | 17.6 |
| Shot Location | 90.9 | 93.9 | 81.0 | 82.2 | 81.5 | 79.2 | 73.0 | 68.4 | 68.0 | 75.6 | 66.1 | 65.0 | 46.7 |
| Shot Symmetry | 88.3 | 92.9 | 90.2 | 86.7 | 76.0 | 80.2 | 76.6 | 78.0 | 54.0 | 39.3 | 24.9 | 46.0 | 82.4 |
| Shot Time of Day | 69.2 | 89.0 | 75.1 | 66.1 | 70.7 | 70.7 | 68.1 | 69.6 | 60.3 | 73.7 | 71.2 | 48.5 | 42.7 |
| Shot Type | 81.8 | 90.5 | 81.3 | 61.2 | 57.0 | 57.4 | 52.8 | 40.4 | 36.5 | 35.7 | 56.7 | 46.5 | 29.7 |
| Shot Type - Crowd | 91.5 | 99.6 | 97.2 | 88.2 | 94.3 | 94.8 | 55.9 | 69.1 | 68.6 | 77.2 | 37.3 | 52.4 | 69.3 |
| Shot Type - OTS | 92.0 | 95.5 | 92.5 | 85.0 | 83.9 | 87.6 | 53.2 | 57.0 | 73.9 | 60.3 | 42.1 | 50.5 | 51.2 |
The `shot.lighting.direction` head ships with the classifier heads but is excluded from the table above because it is a multi-label classifier.
## Files in this repo
| File | Purpose |
|---|---|
| `model.safetensors` | Blended (α=0.75) torch weights — `CinemaCLIP.from_pretrained` target |
| `config.json` | Autogenerated `__init__` kwargs for `CinemaCLIP` |
| `CinemaNetSchema.json` | Schema detailing classifier head metadata, confidence thresholds, preprocessing info |
| `ImageEncoder.mlmodel` | CoreML `"neuralnetwork"` ImageEncoder (unified embedding + probabilities) |
| `ImageEncoder.mlpackage` | CoreML ImageEncoder (unified embedding + probabilities) |
| `TextEncoder.mlpackage` | CoreML TextEncoder |
| `ImageEncoder.onnx` / `_fp16.onnx` | ONNX ImageEncoder |
| `TextEncoder.onnx` / `_fp16.onnx` | ONNX TextEncoder |
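
This card does not say how the `model.safetensors` weights were blended. One common reading is linear interpolation between a base checkpoint and a fine-tuned checkpoint, and the sketch below illustrates that reading only, with toy lists in place of tensors. Both the interpolation formula and the choice to apply α to the fine-tuned weights are assumptions, and `blend_weights` is a hypothetical helper rather than part of `cinemaclip`.

```python
def blend_weights(base, finetuned, alpha):
    """Interpolate per parameter: alpha * finetuned + (1 - alpha) * base."""
    if base.keys() != finetuned.keys():
        raise ValueError("checkpoints must share the same parameter names")
    blended = {}
    for name in base:
        blended[name] = [
            alpha * f + (1.0 - alpha) * b
            for b, f in zip(base[name], finetuned[name])
        ]
    return blended


# Toy checkpoints with a single two-element parameter.
base = {"layer.weight": [0.0, 4.0]}
finetuned = {"layer.weight": [4.0, 0.0]}
blended = blend_weights(base, finetuned, alpha=0.75)
print(blended)
```

Under this assumption, α=1 would reproduce the fine-tuned weights and α=0 the base weights.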
## Citation
```bibtex
@misc{cinemaclip2026,
title = {CinemaCLIP: A hybrid CLIP model and taxonomy for the visual language of cinema},
author = {Somani, Rahul and Marini, Anton and Stewart, Damian},
year = {2026},
publisher = {HuggingFace},
doi = {10.57967/hf/8539},
howpublished = {\url{https://huggingface.co/OZU-Technology/CinemaCLIP}},
note = {Model weights and taxonomy}
}
```