---
library_name: cinemaclip
pipeline_tag: zero-shot-image-classification
tags:
  - clip
  - mobile-clip
  - cinema
  - film
  - movies
  - multi-task
  - hybrid
  - cinematography
  - domain-specific
  - image-classification
  - zero-shot
base_model: apple/MobileCLIP-S1-OpenCLIP
base_model_relation: finetune
license: apple-amlr
license_name: apple-ascl
license_link: https://github.com/apple/ml-mobileclip/blob/main/LICENSE_MODELS
---

# CinemaCLIP-1.0.0

**CinemaCLIP** is a [MobileCLIP-S1](https://huggingface.co/apple/MobileCLIP-S1-OpenCLIP) fine-tune specialized for understanding the visual language of cinema at the frame level. It is a hybrid CLIP model with 23 classifier heads that represent a comprehensive taxonomy built with domain experts. For more information, see our [launch blog post](https://www.ozu.ai/cinemaclip).

This repository ships three serialized forms of the same model:
- **Torch** (`model.safetensors`): load via the `cinemaclip` Python package.
- **CoreML** (`ImageEncoder.mlmodel`, `ImageEncoder.mlpackage` and `TextEncoder.mlpackage`): on-device Apple Neural Engine inference.
- **ONNX** (`ImageEncoder.onnx`, `TextEncoder.onnx`, plus `_fp16` variants): cross-platform inference.

## Install

```bash
pip install cinemaclip            # core
pip install "cinemaclip[coreml]"  # CoreML export/inference
pip install "cinemaclip[onnx]"    # ONNX export/inference
```

## Usage (PyTorch)

```python
from PIL import Image
from cinemaclip import CinemaCLIP

model = CinemaCLIP.from_pretrained("OZU-Technology/CinemaCLIP").eval()

# End-to-end classification on a PIL image
image = Image.open("still.jpg").convert("RGB")
predictions = model.predict_image(image)
predictions["classifier_preds"]       # post-processed classifier predictions
predictions["clip_image_embedding"]   # CLIP image embedding

# Just the image embedding
x = model.preprocess(image).unsqueeze(0)
image_embedding = model.encode_image(x, normalize=True)   # [1, 512]

# Just the text embedding
tokens = model.tokenizer(["a medium closeup of "])
text_embedding = model.encode_text(tokens, normalize=True)  # [1, 512]
```

The `CinemaCLIP.predict_image` method demonstrates how to get post-processed classifier outputs from the model. It is not optimized for efficiency or production use; treat it as a reference implementation.

## Usage (CoreML)

```python
import coremltools as ct
from PIL import Image

img_encoder = ct.models.MLModel("ImageEncoder.mlpackage")
# Input must be 256x256 RGB, resized with BICUBIC for parity with the released torch outputs.
img = Image.open("still.jpg").convert("RGB").resize((256, 256), Image.Resampling.BICUBIC)
out = img_encoder.predict({"Image": img})
embedding = out["clip_image_embedding"]    # [512]
probabilities = out["probabilities"]       # [101] — concat of 23 per-category outputs

# TODO
text_encoder = ct.models.MLModel("TextEncoder.mlpackage")
```

## Usage (ONNX)

```python
from PIL import Image
from onnxruntime import InferenceSession
from torchvision import transforms as T

img = Image.open("still.jpg").convert("RGB")
preprocess = T.Compose([
    T.Resize((256, 256), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),   # yields float tensor in [0, 1] — no mean/std normalization
])
x = preprocess(img).unsqueeze(0).numpy()

session = InferenceSession("ImageEncoder.onnx", providers=["CPUExecutionProvider"])
emb, probs = session.run(None, {"Image": x})
```

## Output structure

`probabilities` is a flat `[101]` vector — the concatenation of all 23 classifier heads' post-activation outputs. Label names and positions are in the shipped `CinemaNetSchema.json`:

```python
import json
with open("CinemaNetSchema.json") as f:  # close the file handle after reading
    schema = json.load(f)
label_names = schema["probabilities_labels"]  # len == 101
```
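
To inspect a prediction, the flat vector can be paired with the label names by position. The following is a minimal sketch, not part of the `cinemaclip` package: `top_labels` is a hypothetical helper, and the toy labels and values stand in for the real `probabilities` output and `schema["probabilities_labels"]`.

```python
def top_labels(probabilities, label_names, k=5):
    """Return the k highest-scoring (label, probability) pairs.

    Hypothetical helper for illustration; both sequences must be aligned
    by position, as the flat probabilities vector and its labels are.
    """
    if len(probabilities) != len(label_names):
        raise ValueError("probabilities and label_names must have the same length")
    ranked = sorted(
        zip(label_names, probabilities),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


# Toy stand-ins; real label names come from CinemaNetSchema.json.
toy_labels = ["label_a", "label_b", "label_c", "label_d"]
toy_probs = [0.10, 0.70, 0.05, 0.15]
print(top_labels(toy_probs, toy_labels, k=2))
# prints [('label_b', 0.7), ('label_d', 0.15)]
```

Because the vector concatenates heads with different activations, a global ranking like this mixes softmax and sigmoid outputs. It is useful for quick inspection, but per-head decoding is likely the better fit for real use.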

The classifier heads are a mix of 3 types of classifiers:
- Single label (softmax activation)
- Multi label (sigmoid activation)
- Binary (sigmoid activation)
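
The three head types differ in how their outputs are normalized and then decoded into labels. The sketch below illustrates the general technique in plain Python with toy values. The function names and the `0.5` threshold are assumptions for illustration; the shipped outputs are already post-activation, and the schema is described as carrying its own confidence thresholds.

```python
import math


def softmax(logits):
    """Normalize logits so they sum to 1 (single-label heads)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logit):
    """Squash one logit into (0, 1) independently (multi-label and binary heads)."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def decode_single_label(probs, labels):
    """Single label: exactly one winner, the highest probability."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best]


def decode_multi_label(probs, labels, threshold=0.5):
    """Multi label: every label at or above the threshold (assumed value)."""
    return [label for label, p in zip(labels, probs) if p >= threshold]


def decode_binary(prob, threshold=0.5):
    """Binary: a single yes/no decision against the threshold (assumed value)."""
    return prob >= threshold


# Toy logits and labels, not taken from the real taxonomy.
print(decode_single_label(softmax([0.1, 2.0, -1.0]), ["a", "b", "c"]))
print(decode_multi_label([sigmoid(2.0), sigmoid(-2.0), sigmoid(0.5)], ["x", "y", "z"]))
print(decode_binary(sigmoid(-1.0)))
```

Softmax outputs within one head compete and sum to 1, while sigmoid outputs are independent, so several multi-label entries can be active at once or none at all.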


## Evaluation

On cinematic understanding tasks, `CinemaCLIP` outperforms both the largest existing CLIP models (up to 28x larger) and the leading `4B` VLMs we benchmarked against.

Two inference modes are reported for CinemaCLIP:
- **Classifier** — the shipped supervised heads on the CinemaCLIP image embedding.
- **0-shot** — zero-shot text/image similarity using CinemaCLIP's own text encoder.
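
For readers unfamiliar with the 0-shot mode, the conventional CLIP recipe is a softmax over scaled cosine similarities between one image embedding and a set of prompt embeddings. The sketch below shows that recipe in plain Python with toy 2-dimensional vectors. It is not necessarily the exact evaluation procedure used for the table: `zero_shot_probs` is a hypothetical helper, and the `logit_scale` of `100.0` is an assumed value, not one confirmed for CinemaCLIP.

```python
import math


def zero_shot_probs(image_embedding, text_embeddings, logit_scale=100.0):
    """Softmax over scaled similarities between one image and N prompts.

    Assumes every embedding is already L2-normalized, so the dot product
    equals cosine similarity. logit_scale is an assumed value.
    """
    sims = [
        sum(a * b for a, b in zip(image_embedding, text))
        for text in text_embeddings
    ]
    scaled = [logit_scale * s for s in sims]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy unit vectors standing in for real 512-dimensional embeddings.
image = [1.0, 0.0]
prompts = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
probs = zero_shot_probs(image, prompts)
best = probs.index(max(probs))
print(best)  # index of the best-matching prompt
```

With real embeddings, `image` would come from `encode_image(..., normalize=True)` and each entry of `prompts` from `encode_text(..., normalize=True)`, one prompt per candidate label.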

| Category | CinemaCLIP 0-shot | CinemaCLIP Classifier | Qwen3.5-4B | Gemma4-4B | InternVL3.5-4B | Molmo2-4B | DFN ViT-H-14 | MetaCLIP PE-bigG | OpenAI ViT-L-14 | MobileCLIP-S1 | DFN ViT-L-14 | SigLIP2 SO400M | SigLIP2 ViT-gopt |
|---|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| **Mean**              | **82.9** | **87.6** | **57.6** | **56.7** | **55.3** | **55.3** | **45.9** | **45.2** | **44.8** | **44.2** | **39.0** | **38.7** | **36.5** |
| Color Contrast      | 89.6 | 86.8 | 33.7 | 35.3 | 33.7 | 35.3 | 34.0 | 33.1 | 49.4 | 38.7 | 37.1 | 57.7 | 25.2 |
| Color Key           | 84.9 | 92.9 | 78.1 | 78.1 | 80.3 | 64.3 | 58.2 | 50.2 | 53.2 | 59.4 | 48.3 | 22.8 | 52.6 |
| Color Saturation    | 82.6 | 82.6 | 66.5 | 65.4 | 72.1 | 45.9 | 55.1 | 61.8 | 58.1 | 35.8 | 46.8 | 33.3 | 31.8 |
| Color Theory        | 71.3 | 72.7 | 54.0 | 51.7 | 50.7 | 48.7 | 54.7 | 51.7 | 50.7 | 47.3 | 47.7 | 31.3 | 31.7 |
| Color Tones         | 86.0 | 86.5 | 50.2 | 62.6 | 70.6 | 62.1 | 58.5 | 50.2 | 52.0 | 55.7 | 47.2 | 24.0 | 17.7 |
| Lighting Cast       | 85.9 | 90.4 | 38.3 | 53.3 | 39.8 | 35.7 | 25.4 | 29.3 | 28.8 | 35.7 | 22.8 | 37.8 | 18.2 |
| Lighting Contrast   | 93.9 | 95.3 | 29.8 | 39.1 | 38.7 | 46.1 | 35.3 | 35.5 | 32.6 | 39.0 | 39.4 | 48.4 | 37.6 |
| Lighting Edge       | 87.6 | 90.4 | 22.8 | 38.8 | 31.2 | 40.4 | 22.4 | 31.6 | 41.6 | 34.0 | 21.2 | 26.0 | 25.6 |
| Lighting Silhouette | 88.4 | 93.1 | 80.9 | 63.0 | 48.9 | 48.8 | 66.6 | 67.1 | 67.4 | 58.4 | 43.5 | 46.2 | 78.9 |
| Shot Angle          | 73.4 | 82.3 | 41.9 | 49.2 | 33.2 | 49.9 | 28.0 | 13.7 | 19.0 | 19.6 | 25.9 | 21.3 | 17.2 |
| Shot Composition    | 95.5 | 96.0 | 46.0 | 54.5 | 55.7 | 60.5 | 27.8 | 24.3 | 21.3 | 22.0 | 25.2 | 31.4 | 11.4 |
| Shot Dutch Angle    | 61.9 | 78.5 | 62.2 | 65.1 | 46.7 | 49.3 | 27.3 | 44.5 | 38.4 | 56.6 | 25.9 | 47.6 | 68.7 |
| Shot Focus          | 71.3 | 71.2 | 19.9 | 26.6 | 26.3 | 25.1 | 32.9 | 31.2 | 24.4 | 31.3 | 37.3 | 48.2 | 12.6 |
| Shot Framing        | 79.2 | 83.8 | 38.0 | 29.6 | 40.1 | 34.6 | 33.6 | 24.9 | 23.5 | 23.9 | 33.0 |  7.3 |  9.8 |
| Shot Height         | 90.5 | 91.8 | 38.1 | 37.4 | 41.2 | 53.0 | 37.6 | 33.7 | 28.9 | 24.0 | 33.6 | 29.6 | 23.9 |
| Shot Lens Size      | 67.9 | 70.6 | 49.6 | 28.0 | 43.6 | 46.6 | 32.1 | 28.0 | 34.5 | 30.1 | 25.7 | 30.1 | 17.6 |
| Shot Location       | 90.9 | 93.9 | 81.0 | 82.2 | 81.5 | 79.2 | 73.0 | 68.4 | 68.0 | 75.6 | 66.1 | 65.0 | 46.7 |
| Shot Symmetry       | 88.3 | 92.9 | 90.2 | 86.7 | 76.0 | 80.2 | 76.6 | 78.0 | 54.0 | 39.3 | 24.9 | 46.0 | 82.4 |
| Shot Time of Day    | 69.2 | 89.0 | 75.1 | 66.1 | 70.7 | 70.7 | 68.1 | 69.6 | 60.3 | 73.7 | 71.2 | 48.5 | 42.7 |
| Shot Type           | 81.8 | 90.5 | 81.3 | 61.2 | 57.0 | 57.4 | 52.8 | 40.4 | 36.5 | 35.7 | 56.7 | 46.5 | 29.7 |
| Shot Type - Crowd   | 91.5 | 99.6 | 97.2 | 88.2 | 94.3 | 94.8 | 55.9 | 69.1 | 68.6 | 77.2 | 37.3 | 52.4 | 69.3 |
| Shot Type - OTS     | 92.0 | 95.5 | 92.5 | 85.0 | 83.9 | 87.6 | 53.2 | 57.0 | 73.9 | 60.3 | 42.1 | 50.5 | 51.2 |

The `shot.lighting.direction` head is shipped with the other classifier heads but is excluded from the table above because it is a multi-label classifier.

## Files in this repo

| File | Purpose |
|---|---|
| `model.safetensors` | Blended (α=0.75) torch weights — `CinemaCLIP.from_pretrained` target |
| `config.json` | Autogenerated `__init__` kwargs for `CinemaCLIP` |
| `CinemaNetSchema.json` | Schema detailing classifier head metadata, confidence thresholds, preprocessing info |
| `ImageEncoder.mlmodel` | CoreML `"neuralnetwork"` ImageEncoder (unified embedding + probabilities) |
| `ImageEncoder.mlpackage` | CoreML ImageEncoder (unified embedding + probabilities) |
| `TextEncoder.mlpackage` | CoreML TextEncoder |
| `ImageEncoder.onnx` / `_fp16.onnx` | ONNX ImageEncoder |
| `TextEncoder.onnx` / `_fp16.onnx` | ONNX TextEncoder |


## Citation

```bibtex
@misc{cinemaclip2026,
  title        = {CinemaCLIP: A hybrid CLIP model and taxonomy for the visual language of cinema},
  author       = {Somani, Rahul and Marini, Anton and Stewart, Damian},
  year         = {2026},
  publisher    = {HuggingFace},
  doi          = {10.57967/hf/8539},
  howpublished = {\url{https://huggingface.co/OZU-Technology/CinemaCLIP}},
  note         = {Model weights and taxonomy}
}
```