| --- |
| library_name: cinemaclip |
| pipeline_tag: zero-shot-image-classification |
| tags: |
| - clip |
| - mobile-clip |
| - cinema |
| - film |
| - movies |
| - multi-task |
| - hybrid |
| - cinematography |
| - domain-specific |
| - image-classification |
| - zero-shot |
| base_model: apple/MobileCLIP-S1-OpenCLIP |
| base_model_relation: finetune |
| license: apple-amlr |
| license_name: apple-ascl |
| license_link: https://github.com/apple/ml-mobileclip/blob/main/LICENSE_MODELS |
| --- |
| |
| # CinemaCLIP-1.0.0 |
|
|
| **CinemaCLIP** is a [MobileCLIP-S1](https://huggingface.co/apple/MobileCLIP-S1-OpenCLIP) fine-tune specialized for understanding the visual language of cinema at the frame level. It is a hybrid CLIP model: alongside the image and text encoders, it carries 23 classifier heads that cover a comprehensive taxonomy built with domain experts. For more information, see our [launch blog post](https://www.ozu.ai/cinemaclip). |
|
|
| This repository ships three serialized forms of the same model: |
| - **Torch** (`model.safetensors`): load via the `cinemaclip` Python package. |
| - **CoreML** (`ImageEncoder.mlmodel`, `ImageEncoder.mlpackage` and `TextEncoder.mlpackage`): on-device Apple Neural Engine inference. |
| - **ONNX** (`ImageEncoder.onnx`, `TextEncoder.onnx`, plus `_fp16` variants): cross-platform inference. |
|
|
| ## Install |
|
|
| ```bash |
| pip install cinemaclip # core |
| pip install "cinemaclip[coreml]" # CoreML export/inference |
| pip install "cinemaclip[onnx]" # ONNX export/inference |
| ``` |
|
|
| ## Usage (PyTorch) |
|
|
| ```python |
| from PIL import Image |
| from cinemaclip import CinemaCLIP |
| |
| model = CinemaCLIP.from_pretrained("OZU-Technology/CinemaCLIP").eval() |
| |
| # End-to-end classification on a PIL image |
| image = Image.open("still.jpg").convert("RGB") |
| predictions = model.predict_image(image) |
| predictions["classifier_preds"] # Classifier predictions |
| predictions["clip_image_embedding"] |
| |
| # Just the image embedding |
| x = model.preprocess(image).unsqueeze(0) |
| image_embedding = model.encode_image(x, normalize=True) # [1, 512] |
| |
| # Just the text embedding |
| tokens = model.tokenizer(["a medium closeup of "]) |
| text_embedding = model.encode_text(tokens, normalize=True) # [1, 512] |
| ``` |
|
|
| The `CinemaCLIP.predict_image` method demonstrates how to obtain post-processed classifier outputs from the model. It is not optimized for speed or production use; treat it as a reference implementation. |
|
|
| ## Usage (CoreML) |
|
|
| ```python |
| import coremltools as ct |
| from PIL import Image |
| |
| img_encoder = ct.models.MLModel("ImageEncoder.mlpackage") |
| # Input must be 256x256 RGB, resized with BICUBIC for parity with the released torch outputs. |
| img = Image.open("still.jpg").convert("RGB").resize((256, 256), Image.Resampling.BICUBIC) |
| out = img_encoder.predict({"Image": img}) |
| embedding = out["clip_image_embedding"] # [512] |
| probabilities = out["probabilities"] # [101] — concat of 23 per-category outputs |
| |
| # TODO |
| text_encoder = ct.models.MLModel("TextEncoder.mlpackage") |
| ``` |
|
|
| ## Usage (ONNX) |
|
|
| ```python |
| from PIL import Image |
| from onnxruntime import InferenceSession |
| from torchvision import transforms as T |
| |
| img = Image.open("still.jpg").convert("RGB") |
| preprocess = T.Compose([ |
|     T.Resize((256, 256), interpolation=T.InterpolationMode.BICUBIC), |
|     T.ToTensor(), # float tensor in [0, 1]; no mean/std normalization |
| ]) |
| x = preprocess(img).unsqueeze(0).numpy() # float32 array of shape [1, 3, 256, 256] |
| |
| session = InferenceSession("ImageEncoder.onnx", providers=["CPUExecutionProvider"]) |
| emb, probs = session.run(None, {"Image": x}) |
| ``` |
|
|
| ## Output structure |
|
|
| `probabilities` is a flat `[101]` vector — the concatenation of all 23 classifier heads' post-activation outputs. Label names and positions are in the shipped `CinemaNetSchema.json`: |
|
|
| ```python |
| import json |
| from pathlib import Path |
| |
| schema = json.loads(Path("CinemaNetSchema.json").read_text(encoding="utf-8")) |
| label_names = schema["probabilities_labels"] # len == 101 |
| ``` |
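
As a minimal sketch of how the flat vector lines up with the label list, the snippet below pairs each label with its probability and returns the highest-scoring entries. The label names and values are made-up placeholders rather than entries from `CinemaNetSchema.json`, and `top_labels` is a hypothetical helper, not part of the `cinemaclip` package.

```python
def top_labels(label_names, probabilities, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    if len(label_names) != len(probabilities):
        raise ValueError("label_names and probabilities must have the same length")
    pairs = zip(label_names, (float(p) for p in probabilities))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]


# Placeholder labels and values, for illustration only.
demo_labels = ["label_a", "label_b", "label_c", "label_d"]
demo_probs = [0.10, 0.70, 0.05, 0.15]
print(top_labels(demo_labels, demo_probs, k=2))
```

Because the vector concatenates every head, a global ranking mixes categories. Ranking within a single head's slice is usually the more meaningful view.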
|
|
| The classifier heads come in three types: |
| - Single label (softmax activation) |
| - Multi label (sigmoid activation) |
| - Binary (sigmoid activation) |
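
The three head types call for different decoding rules: a softmax head has exactly one winner, while sigmoid outputs are scored independently against a threshold. The sketch below assumes a hypothetical head layout (name, type, start and end index into the flat vector), a single output per binary head, and a default threshold of 0.5. The real positions and confidence thresholds should be read from `CinemaNetSchema.json`, whose field names for them are not shown here.

```python
def decode_head(labels, probs, head_type, threshold=0.5):
    """Decode one head's slice of the flat probability vector."""
    if head_type == "single_label":
        # Softmax head: pick the single highest-probability label.
        best = max(range(len(probs)), key=lambda i: probs[i])
        return [labels[best]]
    if head_type in ("multi_label", "binary"):
        # Sigmoid head: each label is scored independently.
        return [label for label, p in zip(labels, probs) if p >= threshold]
    raise ValueError(f"unknown head type: {head_type}")


# Hypothetical layout: (head name, head type, start index, end index).
heads = [
    ("head_single", "single_label", 0, 3),
    ("head_multi", "multi_label", 3, 6),
    ("head_binary", "binary", 6, 7),
]
labels = ["s0", "s1", "s2", "m0", "m1", "m2", "b0"]
probs = [0.2, 0.7, 0.1, 0.9, 0.3, 0.6, 0.4]

decoded = {
    name: decode_head(labels[start:end], probs[start:end], head_type)
    for name, head_type, start, end in heads
}
print(decoded)
```

An empty list for a sigmoid head means no label cleared the threshold, which for a binary head reads as a negative prediction.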
|
|
|
|
| ## Evaluation |
|
|
| Averaged over the cinematic understanding tasks below, `CinemaCLIP` outperforms both the largest existing CLIP models (up to 28x larger) and the leading `4B` VLMs we benchmarked. |
|
|
| Two inference modes are reported for CinemaCLIP: |
| - **Classifier** — the shipped supervised heads on the CinemaCLIP image embedding. |
| - **0-shot** — zero-shot text/image similarity using CinemaCLIP's own text encoder. |
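
Zero-shot classification in the CLIP style compares one image embedding against the text embeddings of candidate prompts and picks the closest. Here is a minimal sketch that uses tiny made-up vectors in place of the 512-dimensional outputs of `encode_image` and `encode_text`. The function name, the prompts, and the logit scale of 100 are illustrative assumptions, not the settings behind the reported numbers.

```python
import math


def zero_shot_probs(image_embedding, text_embeddings, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N prompts."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    img = unit(image_embedding)
    sims = [sum(a * b for a, b in zip(img, unit(t))) for t in text_embeddings]
    logits = [logit_scale * s for s in sims]
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Placeholder prompts and 3-dimensional stand-in embeddings.
prompts = ["a close-up shot", "a wide shot"]
image_vec = [0.9, 0.1, 0.0]
text_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
probs = zero_shot_probs(image_vec, text_vecs)
print(prompts[probs.index(max(probs))])
```

With real tensors the same computation is a matrix product of the normalized embeddings followed by a softmax. Plain lists are used here only to keep the sketch dependency-free.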
|
|
| | Category | CinemaCLIP 0-shot | CinemaCLIP Classifier | Qwen3.5-4B | Gemma4-4B | InternVL3.5-4B | Molmo2-4B | DFN ViT-H-14 | MetaCLIP PE-bigG | OpenAI ViT-L-14 | MobileCLIP-S1 | DFN ViT-L-14 | SigLIP2 SO400M | SigLIP2 ViT-gopt | |
| |---|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:| |
| | **Mean** | **82.9** | **87.6** | **57.6** | **56.7** | **55.3** | **55.3** | **45.9** | **45.2** | **44.8** | **44.2** | **39.0** | **38.7** | **36.5** | |
| | Color Contrast | 89.6 | 86.8 | 33.7 | 35.3 | 33.7 | 35.3 | 34.0 | 33.1 | 49.4 | 38.7 | 37.1 | 57.7 | 25.2 | |
| | Color Key | 84.9 | 92.9 | 78.1 | 78.1 | 80.3 | 64.3 | 58.2 | 50.2 | 53.2 | 59.4 | 48.3 | 22.8 | 52.6 | |
| | Color Saturation | 82.6 | 82.6 | 66.5 | 65.4 | 72.1 | 45.9 | 55.1 | 61.8 | 58.1 | 35.8 | 46.8 | 33.3 | 31.8 | |
| | Color Theory | 71.3 | 72.7 | 54.0 | 51.7 | 50.7 | 48.7 | 54.7 | 51.7 | 50.7 | 47.3 | 47.7 | 31.3 | 31.7 | |
| | Color Tones | 86.0 | 86.5 | 50.2 | 62.6 | 70.6 | 62.1 | 58.5 | 50.2 | 52.0 | 55.7 | 47.2 | 24.0 | 17.7 | |
| | Lighting Cast | 85.9 | 90.4 | 38.3 | 53.3 | 39.8 | 35.7 | 25.4 | 29.3 | 28.8 | 35.7 | 22.8 | 37.8 | 18.2 | |
| | Lighting Contrast | 93.9 | 95.3 | 29.8 | 39.1 | 38.7 | 46.1 | 35.3 | 35.5 | 32.6 | 39.0 | 39.4 | 48.4 | 37.6 | |
| | Lighting Edge | 87.6 | 90.4 | 22.8 | 38.8 | 31.2 | 40.4 | 22.4 | 31.6 | 41.6 | 34.0 | 21.2 | 26.0 | 25.6 | |
| | Lighting Silhouette | 88.4 | 93.1 | 80.9 | 63.0 | 48.9 | 48.8 | 66.6 | 67.1 | 67.4 | 58.4 | 43.5 | 46.2 | 78.9 | |
| | Shot Angle | 73.4 | 82.3 | 41.9 | 49.2 | 33.2 | 49.9 | 28.0 | 13.7 | 19.0 | 19.6 | 25.9 | 21.3 | 17.2 | |
| | Shot Composition | 95.5 | 96.0 | 46.0 | 54.5 | 55.7 | 60.5 | 27.8 | 24.3 | 21.3 | 22.0 | 25.2 | 31.4 | 11.4 | |
| | Shot Dutch Angle | 61.9 | 78.5 | 62.2 | 65.1 | 46.7 | 49.3 | 27.3 | 44.5 | 38.4 | 56.6 | 25.9 | 47.6 | 68.7 | |
| | Shot Focus | 71.3 | 71.2 | 19.9 | 26.6 | 26.3 | 25.1 | 32.9 | 31.2 | 24.4 | 31.3 | 37.3 | 48.2 | 12.6 | |
| | Shot Framing | 79.2 | 83.8 | 38.0 | 29.6 | 40.1 | 34.6 | 33.6 | 24.9 | 23.5 | 23.9 | 33.0 | 7.3 | 9.8 | |
| | Shot Height | 90.5 | 91.8 | 38.1 | 37.4 | 41.2 | 53.0 | 37.6 | 33.7 | 28.9 | 24.0 | 33.6 | 29.6 | 23.9 | |
| | Shot Lens Size | 67.9 | 70.6 | 49.6 | 28.0 | 43.6 | 46.6 | 32.1 | 28.0 | 34.5 | 30.1 | 25.7 | 30.1 | 17.6 | |
| | Shot Location | 90.9 | 93.9 | 81.0 | 82.2 | 81.5 | 79.2 | 73.0 | 68.4 | 68.0 | 75.6 | 66.1 | 65.0 | 46.7 | |
| | Shot Symmetry | 88.3 | 92.9 | 90.2 | 86.7 | 76.0 | 80.2 | 76.6 | 78.0 | 54.0 | 39.3 | 24.9 | 46.0 | 82.4 | |
| | Shot Time of Day | 69.2 | 89.0 | 75.1 | 66.1 | 70.7 | 70.7 | 68.1 | 69.6 | 60.3 | 73.7 | 71.2 | 48.5 | 42.7 | |
| | Shot Type | 81.8 | 90.5 | 81.3 | 61.2 | 57.0 | 57.4 | 52.8 | 40.4 | 36.5 | 35.7 | 56.7 | 46.5 | 29.7 | |
| | Shot Type - Crowd | 91.5 | 99.6 | 97.2 | 88.2 | 94.3 | 94.8 | 55.9 | 69.1 | 68.6 | 77.2 | 37.3 | 52.4 | 69.3 | |
| | Shot Type - OTS | 92.0 | 95.5 | 92.5 | 85.0 | 83.9 | 87.6 | 53.2 | 57.0 | 73.9 | 60.3 | 42.1 | 50.5 | 51.2 | |
|
|
| The `shot.lighting.direction` head ships with the other classifier heads but is excluded from the table above because it is a multi-label classifier. |
|
|
| ## Files in this repo |
|
|
| | File | Purpose | |
| |---|---| |
| | `model.safetensors` | Blended (α=0.75) torch weights — `CinemaCLIP.from_pretrained` target | |
| | `config.json` | Autogenerated `__init__` kwargs for `CinemaCLIP` | |
| | `CinemaNetSchema.json` | Schema detailing classifier head metadata, confidence thresholds, preprocessing info | |
| | `ImageEncoder.mlmodel` | CoreML `"neuralnetwork"` ImageEncoder (unified embedding + probabilities) | |
| | `ImageEncoder.mlpackage` | CoreML ImageEncoder (unified embedding + probabilities) | |
| | `TextEncoder.mlpackage` | CoreML TextEncoder | |
| | `ImageEncoder.onnx` / `_fp16.onnx` | ONNX ImageEncoder | |
| | `TextEncoder.onnx` / `_fp16.onnx` | ONNX TextEncoder | |
|
|
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{cinemaclip2026, |
| title = {CinemaCLIP: A hybrid CLIP model and taxonomy for the visual language of cinema}, |
| author = {Somani, Rahul and Marini, Anton and Stewart, Damian}, |
| year = {2026}, |
| publisher = {HuggingFace}, |
| doi = {10.57967/hf/8539}, |
| howpublished = {\url{https://huggingface.co/OZU-Technology/CinemaCLIP}}, |
| note = {Model weights and taxonomy} |
| } |
| ``` |
|
|