---
library_name: cinemaclip
pipeline_tag: zero-shot-image-classification
tags:
- clip
- mobile-clip
- cinema
- film
- movies
- multi-task
- hybrid
- cinematography
- domain-specific
- image-classification
- zero-shot
base_model: apple/MobileCLIP-S1-OpenCLIP
base_model_relation: finetune
license: apple-amlr
license_name: apple-ascl
license_link: https://github.com/apple/ml-mobileclip/blob/main/LICENSE_MODELS
---

# CinemaCLIP-1.0.0

**CinemaCLIP** is a [MobileCLIP-S1](https://huggingface.co/apple/MobileCLIP-S1-OpenCLIP) fine-tune specialized for understanding the visual language of cinema at the frame level. It is a hybrid CLIP model with 23 classifier heads that cover a comprehensive taxonomy built with domain experts. For more info, see our [launch blog post](https://www.ozu.ai/cinemaclip).

This repository ships three serialized forms of the same model:

- **Torch** (`model.safetensors`): load via the `cinemaclip` Python package.
- **CoreML** (`ImageEncoder.mlmodel`, `ImageEncoder.mlpackage`, and `TextEncoder.mlpackage`): on-device inference on the Apple Neural Engine.
- **ONNX** (`ImageEncoder.onnx`, `TextEncoder.onnx`, plus `_fp16` variants): cross-platform inference.

## Install

```bash
pip install cinemaclip             # core
pip install "cinemaclip[coreml]"   # CoreML export/inference
pip install "cinemaclip[onnx]"     # ONNX export/inference
```

## Usage (PyTorch)

```python
from PIL import Image

from cinemaclip import CinemaCLIP

model = CinemaCLIP.from_pretrained("OZU-Technology/CinemaCLIP").eval()

# End-to-end classification on a PIL image
image = Image.open("still.jpg").convert("RGB")
predictions = model.predict_image(image)
predictions["classifier_preds"]       # Classifier predictions
predictions["clip_image_embedding"]   # CLIP image embedding

# Just the image embedding
x = model.preprocess(image).unsqueeze(0)
image_embedding = model.encode_image(x, normalize=True)  # [1, 512]

# Just the text embedding
tokens = model.tokenizer(["a medium closeup of "])
text_embedding = model.encode_text(tokens, normalize=True)  # [1, 512]
```

The `CinemaCLIP.predict_image` method demonstrates how to get post-processed classifier outputs from the model. It is neither optimized nor production-ready; treat it as a reference implementation.

## Usage (CoreML)

```python
import coremltools as ct
from PIL import Image

img_encoder = ct.models.MLModel("ImageEncoder.mlpackage")

# Input must be 256x256 RGB, resized with BICUBIC for parity with the released torch outputs.
img = Image.open("still.jpg").convert("RGB").resize((256, 256), Image.Resampling.BICUBIC)
out = img_encoder.predict({"Image": img})

embedding = out["clip_image_embedding"]  # [512]
probabilities = out["probabilities"]     # [101]: concatenation of 23 per-category outputs

# TODO
text_encoder = ct.models.MLModel("TextEncoder.mlpackage")
```

## Usage (ONNX)

```python
from PIL import Image
from onnxruntime import InferenceSession
from torchvision import transforms as T

img = Image.open("still.jpg").convert("RGB")
preprocess = T.Compose([
    T.Resize((256, 256), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),  # yields a float tensor in [0, 1]; no mean/std normalization
])
x = preprocess(img).unsqueeze(0).numpy()

session = InferenceSession("ImageEncoder.onnx", providers=["CPUExecutionProvider"])
emb, probs = session.run(None, {"Image": x})
```
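The `_fp16` exports may declare different tensor dtypes (and the exported graphs may name tensors differently) than the fp32 example above. Rather than hard-coding names and types, you can query the session. A minimal sketch using standard `onnxruntime` introspection; the `ImageEncoder_fp16.onnx` filename is our reading of the `_fp16.onnx` naming in the files table below:

```python
from onnxruntime import InferenceSession

session = InferenceSession("ImageEncoder_fp16.onnx", providers=["CPUExecutionProvider"])

# Print declared input/output names, shapes, and dtypes so preprocessing
# can be matched to whichever export (fp32 or fp16) is loaded.
for inp in session.get_inputs():
    print("input: ", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)
```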
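Whichever runtime you use, the image and text encoders share the same 512-dimensional embedding space, so zero-shot classification over your own prompts is a scaled cosine-similarity softmax. A minimal sketch with the PyTorch API above; the prompt set and the 100x logit scale (the conventional CLIP temperature) are illustrative, not the tuned prompts behind the reported 0-shot numbers:

```python
import torch
from PIL import Image

from cinemaclip import CinemaCLIP

model = CinemaCLIP.from_pretrained("OZU-Technology/CinemaCLIP").eval()

# Illustrative prompts; pick whatever vocabulary your task needs.
prompts = ["an extreme closeup shot", "a medium shot", "a wide shot"]

with torch.no_grad():
    x = model.preprocess(Image.open("still.jpg").convert("RGB")).unsqueeze(0)
    image_embedding = model.encode_image(x, normalize=True)                        # [1, 512]
    text_embeddings = model.encode_text(model.tokenizer(prompts), normalize=True)  # [3, 512]

# Embeddings are unit-norm, so the dot product is cosine similarity.
zero_shot_probs = (100.0 * image_embedding @ text_embeddings.T).softmax(dim=-1)
print(dict(zip(prompts, zero_shot_probs.squeeze(0).tolist())))
```

For categories covered by the shipped taxonomy, the supervised classifier heads (via `predict_image`) are generally the stronger choice, as the evaluation below suggests.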
## Output structure

`probabilities` is a flat `[101]` vector: the concatenation of all 23 classifier heads' post-activation outputs. Label names and positions are in the shipped `CinemaNetSchema.json`:

```python
import json

with open("CinemaNetSchema.json") as f:
    schema = json.load(f)

label_names = schema["probabilities_labels"]  # len == 101
```

The classifier heads are a mix of three types:

- Single-label (softmax activation)
- Multi-label (sigmoid activation)
- Binary (sigmoid activation)
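Reading results is then a matter of pairing the label list with the flat vector. A minimal sketch, continuing from the ONNX example's `probs` and the `label_names` loaded above:

```python
# Continuing from the ONNX example (`probs`) and the schema snippet (`label_names`).
scores = probs.reshape(-1)  # [1, 101] -> [101]

# Note: scores are only comparable within a single head, since softmax/sigmoid
# is applied per head rather than across the whole 101-way vector; treat this
# global ranking as a rough overview only.
for name, score in sorted(zip(label_names, scores), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{name}: {score:.3f}")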
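The schema also carries per-head metadata and confidence thresholds (see the files table below). The JSON keys in this sketch (`heads`, `name`, `type`, `labels`, `threshold`) are hypothetical, shown only to illustrate how per-head decoding of the three head types would work; consult the shipped `CinemaNetSchema.json` for the real structure:

```python
import json

with open("CinemaNetSchema.json") as f:
    schema = json.load(f)

# HYPOTHETICAL field names ("heads", "name", "type", "labels", "threshold"):
# the real keys live in CinemaNetSchema.json; adapt accordingly.
offset = 0
for head in schema["heads"]:
    n = len(head["labels"])
    head_scores = scores[offset : offset + n]  # this head's slice of the flat vector
    offset += n
    if head["type"] == "single_label":
        # Softmax head: the argmax is the prediction.
        print(head["name"], "->", head["labels"][int(head_scores.argmax())])
    else:
        # Sigmoid head (binary or multi-label): keep labels above the
        # shipped confidence threshold.
        keep = [l for l, s in zip(head["labels"], head_scores) if s >= head["threshold"]]
        print(head["name"], "->", keep)
```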
## Evaluation

`CinemaCLIP` outperforms not only the largest existing CLIP models (up to 28x larger) but also the leading `4B` VLMs on cinematic understanding tasks.

Two inference modes are reported for CinemaCLIP:

- **Classifier**: the shipped supervised heads on the CinemaCLIP image embedding.
- **0-shot**: zero-shot text/image similarity using CinemaCLIP's own text encoder.

| Category | CinemaCLIP 0-shot | CinemaCLIP Classifier | Qwen3.5-4B | Gemma4-4B | InternVL3.5-4B | Molmo2-4B | DFN ViT-H-14 | MetaCLIP PE-bigG | OpenAI ViT-L-14 | MobileCLIP-S1 | DFN ViT-L-14 | SigLIP2 SO400M | SigLIP2 ViT-gopt |
|---|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| **Mean** | **82.9** | **87.6** | **57.6** | **56.7** | **55.3** | **55.3** | **45.9** | **45.2** | **44.8** | **44.2** | **39.0** | **38.7** | **36.5** |
| Color Contrast | 89.6 | 86.8 | 33.7 | 35.3 | 33.7 | 35.3 | 34.0 | 33.1 | 49.4 | 38.7 | 37.1 | 57.7 | 25.2 |
| Color Key | 84.9 | 92.9 | 78.1 | 78.1 | 80.3 | 64.3 | 58.2 | 50.2 | 53.2 | 59.4 | 48.3 | 22.8 | 52.6 |
| Color Saturation | 82.6 | 82.6 | 66.5 | 65.4 | 72.1 | 45.9 | 55.1 | 61.8 | 58.1 | 35.8 | 46.8 | 33.3 | 31.8 |
| Color Theory | 71.3 | 72.7 | 54.0 | 51.7 | 50.7 | 48.7 | 54.7 | 51.7 | 50.7 | 47.3 | 47.7 | 31.3 | 31.7 |
| Color Tones | 86.0 | 86.5 | 50.2 | 62.6 | 70.6 | 62.1 | 58.5 | 50.2 | 52.0 | 55.7 | 47.2 | 24.0 | 17.7 |
| Lighting Cast | 85.9 | 90.4 | 38.3 | 53.3 | 39.8 | 35.7 | 25.4 | 29.3 | 28.8 | 35.7 | 22.8 | 37.8 | 18.2 |
| Lighting Contrast | 93.9 | 95.3 | 29.8 | 39.1 | 38.7 | 46.1 | 35.3 | 35.5 | 32.6 | 39.0 | 39.4 | 48.4 | 37.6 |
| Lighting Edge | 87.6 | 90.4 | 22.8 | 38.8 | 31.2 | 40.4 | 22.4 | 31.6 | 41.6 | 34.0 | 21.2 | 26.0 | 25.6 |
| Lighting Silhouette | 88.4 | 93.1 | 80.9 | 63.0 | 48.9 | 48.8 | 66.6 | 67.1 | 67.4 | 58.4 | 43.5 | 46.2 | 78.9 |
| Shot Angle | 73.4 | 82.3 | 41.9 | 49.2 | 33.2 | 49.9 | 28.0 | 13.7 | 19.0 | 19.6 | 25.9 | 21.3 | 17.2 |
| Shot Composition | 95.5 | 96.0 | 46.0 | 54.5 | 55.7 | 60.5 | 27.8 | 24.3 | 21.3 | 22.0 | 25.2 | 31.4 | 11.4 |
| Shot Dutch Angle | 61.9 | 78.5 | 62.2 | 65.1 | 46.7 | 49.3 | 27.3 | 44.5 | 38.4 | 56.6 | 25.9 | 47.6 | 68.7 |
| Shot Focus | 71.3 | 71.2 | 19.9 | 26.6 | 26.3 | 25.1 | 32.9 | 31.2 | 24.4 | 31.3 | 37.3 | 48.2 | 12.6 |
| Shot Framing | 79.2 | 83.8 | 38.0 | 29.6 | 40.1 | 34.6 | 33.6 | 24.9 | 23.5 | 23.9 | 33.0 | 7.3 | 9.8 |
| Shot Height | 90.5 | 91.8 | 38.1 | 37.4 | 41.2 | 53.0 | 37.6 | 33.7 | 28.9 | 24.0 | 33.6 | 29.6 | 23.9 |
| Shot Lens Size | 67.9 | 70.6 | 49.6 | 28.0 | 43.6 | 46.6 | 32.1 | 28.0 | 34.5 | 30.1 | 25.7 | 30.1 | 17.6 |
| Shot Location | 90.9 | 93.9 | 81.0 | 82.2 | 81.5 | 79.2 | 73.0 | 68.4 | 68.0 | 75.6 | 66.1 | 65.0 | 46.7 |
| Shot Symmetry | 88.3 | 92.9 | 90.2 | 86.7 | 76.0 | 80.2 | 76.6 | 78.0 | 54.0 | 39.3 | 24.9 | 46.0 | 82.4 |
| Shot Time of Day | 69.2 | 89.0 | 75.1 | 66.1 | 70.7 | 70.7 | 68.1 | 69.6 | 60.3 | 73.7 | 71.2 | 48.5 | 42.7 |
| Shot Type | 81.8 | 90.5 | 81.3 | 61.2 | 57.0 | 57.4 | 52.8 | 40.4 | 36.5 | 35.7 | 56.7 | 46.5 | 29.7 |
| Shot Type - Crowd | 91.5 | 99.6 | 97.2 | 88.2 | 94.3 | 94.8 | 55.9 | 69.1 | 68.6 | 77.2 | 37.3 | 52.4 | 69.3 |
| Shot Type - OTS | 92.0 | 95.5 | 92.5 | 85.0 | 83.9 | 87.6 | 53.2 | 57.0 | 73.9 | 60.3 | 42.1 | 50.5 | 51.2 |

The `shot.lighting.direction` head ships with the classifier heads but is excluded from the table above because it is a multi-label classifier.

## Files in this repo

| File | Purpose |
|---|---|
| `model.safetensors` | Blended (α=0.75) torch weights; target of `CinemaCLIP.from_pretrained` |
| `config.json` | Autogenerated `__init__` kwargs for `CinemaCLIP` |
| `CinemaNetSchema.json` | Schema detailing classifier head metadata, confidence thresholds, and preprocessing info |
| `ImageEncoder.mlmodel` | CoreML `"neuralnetwork"` ImageEncoder (unified embedding + probabilities) |
| `ImageEncoder.mlpackage` | CoreML ImageEncoder (unified embedding + probabilities) |
| `TextEncoder.mlpackage` | CoreML TextEncoder |
| `ImageEncoder.onnx` / `_fp16.onnx` | ONNX ImageEncoder |
| `TextEncoder.onnx` / `_fp16.onnx` | ONNX TextEncoder |

## Citation

```bibtex
@misc{cinemaclip2026,
  title        = {CinemaCLIP: A hybrid CLIP model and taxonomy for the visual language of cinema},
  author       = {Somani, Rahul and Marini, Anton and Stewart, Damian},
  year         = {2026},
  publisher    = {HuggingFace},
  doi          = {10.57967/hf/8539},
  howpublished = {\url{https://huggingface.co/OZU-Technology/CinemaCLIP}},
  note         = {Model weights and taxonomy}
}
```