	---
	library_name: cinemaclip
	pipeline_tag: zero-shot-image-classification
	tags:
	- clip
	- mobile-clip
	- cinema
	- film
	- movies
	- multi-task
	- hybrid
	- cinematography
	- domain-specific
	- image-classification
	- zero-shot
	base_model: apple/MobileCLIP-S1-OpenCLIP
	base_model_relation: finetune
	license: apple-amlr
	license_name: apple-ascl
	license_link: https://github.com/apple/ml-mobileclip/blob/main/LICENSE_MODELS
	---

	# CinemaCLIP-1.0.0

	CinemaCLIP is a [MobileCLIP-S1](https://huggingface.co/apple/MobileCLIP-S1-OpenCLIP) fine-tune specialized for understanding the visual language of cinema at a frame level. It is a hybrid CLIP model with 23 classifier heads that represent a comprehensive taxonomy built with domain experts. For more info, see our [launch blog post](https://www.ozu.ai/cinemaclip).

	This repository ships three serialized forms of the same model:
	- Torch (`model.safetensors`): load via the `cinemaclip` Python package.
	- CoreML (`ImageEncoder.mlmodel`, `ImageEncoder.mlpackage` and `TextEncoder.mlpackage`): on-device Apple Neural Engine inference.
	- ONNX (`ImageEncoder.onnx`, `TextEncoder.onnx`, plus `_fp16` variants): cross-platform inference.

	## Install

	```bash
	pip install cinemaclip # core
	pip install "cinemaclip[coreml]" # CoreML export/inference
	pip install "cinemaclip[onnx]" # ONNX export/inference
	```

	## Usage (PyTorch)

	```python
	from PIL import Image
	from cinemaclip import CinemaCLIP

	model = CinemaCLIP.from_pretrained("OZU-Technology/CinemaCLIP").eval()

	# End-to-end classification on a PIL image
	image = Image.open("still.jpg").convert("RGB")
	predictions = model.predict_image(image)
	predictions["classifier_preds"] # Classifier predictions
	predictions["clip_image_embedding"]

	# Just the image embedding
	x = model.preprocess(image).unsqueeze(0)
	image_embedding = model.encode_image(x, normalize=True) # [1, 512]

	# Just the text embedding
	tokens = model.tokenizer(["a medium closeup of "])
	text_embedding = model.encode_text(tokens, normalize=True) # [1, 512]
	```

	The `CinemaCLIP.predict_image` method demonstrates how to get post-processed classifier outputs from the model. It is not optimized or production-ready; treat it as a reference implementation.
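The normalized embeddings from `encode_image` and `encode_text` can be compared directly for zero-shot classification. The following is a minimal sketch in plain Python, not the package's own implementation: the helper names, the toy 3-dimensional vectors (standing in for the 512-dimensional embeddings), and the `logit_scale` value of 100 are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_scores(image_embedding, text_embeddings, logit_scale=100.0):
    """Score one image embedding against several text embeddings.

    Assumes every embedding is already L2-normalized (normalize=True),
    so the dot product equals cosine similarity. logit_scale is an
    assumed temperature, not a value taken from the CinemaCLIP config.
    """
    logits = [
        logit_scale * sum(i * t for i, t in zip(image_embedding, text))
        for text in text_embeddings
    ]
    return softmax(logits)


# Toy unit vectors standing in for real embeddings (hypothetical values).
image = [1.0, 0.0, 0.0]
prompts = {
    "a close-up shot": [0.8, 0.6, 0.0],
    "a wide shot": [0.0, 1.0, 0.0],
}

scores = zero_shot_scores(image, list(prompts.values()))
best_prompt = max(zip(prompts, scores), key=lambda pair: pair[1])[0]
print(best_prompt)
```

With real embeddings, the same dot product and softmax would typically be done in batch with torch or NumPy.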

	## Usage (CoreML)

	```python
	import coremltools as ct
	from PIL import Image

	img_encoder = ct.models.MLModel("ImageEncoder.mlpackage")
	# Input must be 256x256 RGB, resized with BICUBIC for parity with the released torch outputs.
	img = Image.open("still.jpg").convert("RGB").resize((256, 256), Image.Resampling.BICUBIC)
	out = img_encoder.predict({"Image": img})
	embedding = out["clip_image_embedding"] # [512]
	probabilities = out["probabilities"] # [101] — concat of 23 per-category outputs

	# TODO
	text_encoder = ct.models.MLModel("TextEncoder.mlpackage")
	```

	## Usage (ONNX)

	```python
	from PIL import Image
	from onnxruntime import InferenceSession
	from torchvision import transforms as T

	img = Image.open("still.jpg").convert("RGB")
	preprocess = T.Compose([
	    T.Resize((256, 256), interpolation=T.InterpolationMode.BICUBIC),
	    T.ToTensor(),  # yields a float tensor in [0, 1]; no mean/std normalization is applied
	])
	x = preprocess(img).unsqueeze(0).numpy()

	session = InferenceSession("ImageEncoder.onnx", providers=["CPUExecutionProvider"])
	emb, probs = session.run(None, {"Image": x})
	```

	## Output structure

	`probabilities` is a flat `[101]` vector — the concatenation of all 23 classifier heads' post-activation outputs. Label names and positions are in the shipped `CinemaNetSchema.json`:

	```python
	import json

	with open("CinemaNetSchema.json") as f:
	    schema = json.load(f)
	label_names = schema["probabilities_labels"]  # len == 101
	```
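Because `label_names` and `probabilities` share the same ordering, the two can be zipped together to inspect the highest-scoring labels. A minimal sketch follows; the function name and the toy labels are hypothetical, and only the parallel-list relationship comes from the description above.

```python
def top_k_labels(label_names, probabilities, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    if len(label_names) != len(probabilities):
        raise ValueError("label_names and probabilities must have the same length")
    ranked = sorted(
        zip(label_names, probabilities),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


# Toy stand-ins for the 101 real labels and scores (hypothetical values).
toy_labels = ["label_a", "label_b", "label_c"]
toy_probs = [0.1, 0.7, 0.2]
top = top_k_labels(toy_labels, toy_probs, k=2)
print(top)
```

A global ranking like this mixes outputs from different heads, so it is mainly useful for quick inspection; softmax heads sum to 1 within a head while sigmoid outputs are independent.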

	The 23 classifier heads fall into three types:
	- Single label (softmax activation)
	- Multi label (sigmoid activation)
	- Binary (sigmoid activation)
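The head type determines how a slice of the flat vector should be read: a single-label head yields one winner (argmax), while multi-label and binary heads are thresholded per output. The sketch below illustrates that logic under stated assumptions. The `heads` layout (`name`, `type`, `start`, `end`, `threshold`), the type strings, and the toy labels are hypothetical and do not reflect the actual structure of `CinemaNetSchema.json`.

```python
def decode_head(head_type, labels, probs, threshold=0.5):
    """Turn one head's post-activation outputs into a list of predicted labels."""
    if head_type == "single_label":
        # Softmax head: exactly one class wins.
        best = max(range(len(probs)), key=probs.__getitem__)
        return [labels[best]]
    if head_type in ("multi_label", "binary"):
        # Sigmoid head: each output is judged independently.
        return [label for label, p in zip(labels, probs) if p >= threshold]
    raise ValueError(f"unknown head type: {head_type}")


def decode_all(heads, label_names, probabilities):
    """Slice the flat vector per head and decode each slice."""
    results = {}
    for head in heads:
        start, end = head["start"], head["end"]
        results[head["name"]] = decode_head(
            head["type"],
            label_names[start:end],
            probabilities[start:end],
            head.get("threshold", 0.5),
        )
    return results


# Hypothetical 6-entry layout standing in for the real 101-entry vector.
label_names = ["wide", "medium", "close_up", "backlight", "sidelight", "is_silhouette"]
probabilities = [0.1, 0.7, 0.2, 0.8, 0.6, 0.3]
heads = [
    {"name": "shot_size", "type": "single_label", "start": 0, "end": 3},
    {"name": "light_direction", "type": "multi_label", "start": 3, "end": 5},
    {"name": "silhouette", "type": "binary", "start": 5, "end": 6},
]

decoded = decode_all(heads, label_names, probabilities)
print(decoded)
```

In practice the slice boundaries and per-head thresholds would be read from the shipped schema rather than hard-coded.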


	## Evaluation

	On cinematic understanding tasks, `CinemaCLIP` outperforms both the largest existing CLIP models (up to 28x larger) and the leading `4B` VLMs we benchmarked against.

	Two inference modes are reported for CinemaCLIP:
	- Classifier — the shipped supervised heads on the CinemaCLIP image embedding.
	- 0-shot — zero-shot text/image similarity using CinemaCLIP's own text encoder.

	| Category | CinemaCLIP 0-shot | CinemaCLIP Classifier | Qwen3.5-4B | Gemma4-4B | InternVL3.5-4B | Molmo2-4B | DFN ViT-H-14 | MetaCLIP PE-bigG | OpenAI ViT-L-14 | MobileCLIP-S1 | DFN ViT-L-14 | SigLIP2 SO400M | SigLIP2 ViT-gopt |
	|---|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
	| Mean | 82.9 | 87.6 | 57.6 | 56.7 | 55.3 | 55.3 | 45.9 | 45.2 | 44.8 | 44.2 | 39.0 | 38.7 | 36.5 |
	| Color Contrast | 89.6 | 86.8 | 33.7 | 35.3 | 33.7 | 35.3 | 34.0 | 33.1 | 49.4 | 38.7 | 37.1 | 57.7 | 25.2 |
	| Color Key | 84.9 | 92.9 | 78.1 | 78.1 | 80.3 | 64.3 | 58.2 | 50.2 | 53.2 | 59.4 | 48.3 | 22.8 | 52.6 |
	| Color Saturation | 82.6 | 82.6 | 66.5 | 65.4 | 72.1 | 45.9 | 55.1 | 61.8 | 58.1 | 35.8 | 46.8 | 33.3 | 31.8 |
	| Color Theory | 71.3 | 72.7 | 54.0 | 51.7 | 50.7 | 48.7 | 54.7 | 51.7 | 50.7 | 47.3 | 47.7 | 31.3 | 31.7 |
	| Color Tones | 86.0 | 86.5 | 50.2 | 62.6 | 70.6 | 62.1 | 58.5 | 50.2 | 52.0 | 55.7 | 47.2 | 24.0 | 17.7 |
	| Lighting Cast | 85.9 | 90.4 | 38.3 | 53.3 | 39.8 | 35.7 | 25.4 | 29.3 | 28.8 | 35.7 | 22.8 | 37.8 | 18.2 |
	| Lighting Contrast | 93.9 | 95.3 | 29.8 | 39.1 | 38.7 | 46.1 | 35.3 | 35.5 | 32.6 | 39.0 | 39.4 | 48.4 | 37.6 |
	| Lighting Edge | 87.6 | 90.4 | 22.8 | 38.8 | 31.2 | 40.4 | 22.4 | 31.6 | 41.6 | 34.0 | 21.2 | 26.0 | 25.6 |
	| Lighting Silhouette | 88.4 | 93.1 | 80.9 | 63.0 | 48.9 | 48.8 | 66.6 | 67.1 | 67.4 | 58.4 | 43.5 | 46.2 | 78.9 |
	| Shot Angle | 73.4 | 82.3 | 41.9 | 49.2 | 33.2 | 49.9 | 28.0 | 13.7 | 19.0 | 19.6 | 25.9 | 21.3 | 17.2 |
	| Shot Composition | 95.5 | 96.0 | 46.0 | 54.5 | 55.7 | 60.5 | 27.8 | 24.3 | 21.3 | 22.0 | 25.2 | 31.4 | 11.4 |
	| Shot Dutch Angle | 61.9 | 78.5 | 62.2 | 65.1 | 46.7 | 49.3 | 27.3 | 44.5 | 38.4 | 56.6 | 25.9 | 47.6 | 68.7 |
	| Shot Focus | 71.3 | 71.2 | 19.9 | 26.6 | 26.3 | 25.1 | 32.9 | 31.2 | 24.4 | 31.3 | 37.3 | 48.2 | 12.6 |
	| Shot Framing | 79.2 | 83.8 | 38.0 | 29.6 | 40.1 | 34.6 | 33.6 | 24.9 | 23.5 | 23.9 | 33.0 | 7.3 | 9.8 |
	| Shot Height | 90.5 | 91.8 | 38.1 | 37.4 | 41.2 | 53.0 | 37.6 | 33.7 | 28.9 | 24.0 | 33.6 | 29.6 | 23.9 |
	| Shot Lens Size | 67.9 | 70.6 | 49.6 | 28.0 | 43.6 | 46.6 | 32.1 | 28.0 | 34.5 | 30.1 | 25.7 | 30.1 | 17.6 |
	| Shot Location | 90.9 | 93.9 | 81.0 | 82.2 | 81.5 | 79.2 | 73.0 | 68.4 | 68.0 | 75.6 | 66.1 | 65.0 | 46.7 |
	| Shot Symmetry | 88.3 | 92.9 | 90.2 | 86.7 | 76.0 | 80.2 | 76.6 | 78.0 | 54.0 | 39.3 | 24.9 | 46.0 | 82.4 |
	| Shot Time of Day | 69.2 | 89.0 | 75.1 | 66.1 | 70.7 | 70.7 | 68.1 | 69.6 | 60.3 | 73.7 | 71.2 | 48.5 | 42.7 |
	| Shot Type | 81.8 | 90.5 | 81.3 | 61.2 | 57.0 | 57.4 | 52.8 | 40.4 | 36.5 | 35.7 | 56.7 | 46.5 | 29.7 |
	| Shot Type - Crowd | 91.5 | 99.6 | 97.2 | 88.2 | 94.3 | 94.8 | 55.9 | 69.1 | 68.6 | 77.2 | 37.3 | 52.4 | 69.3 |
	| Shot Type - OTS | 92.0 | 95.5 | 92.5 | 85.0 | 83.9 | 87.6 | 53.2 | 57.0 | 73.9 | 60.3 | 42.1 | 50.5 | 51.2 |

	The `shot.lighting.direction` head ships with the classifier heads but is excluded from the table above because it is a multi-label classifier.
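The `Mean` row is consistent with an unweighted average over the 22 category rows in the table (23 heads minus the excluded multi-label head). This is an inference from the published numbers rather than a documented definition; a quick check for the CinemaCLIP Classifier column:

```python
# CinemaCLIP Classifier column, one score per category row, in table order.
classifier_scores = [
    86.8, 92.9, 82.6, 72.7, 86.5,        # Color Contrast ... Color Tones
    90.4, 95.3, 90.4, 93.1,              # Lighting Cast ... Lighting Silhouette
    82.3, 96.0, 78.5, 71.2, 83.8, 91.8,  # Shot Angle ... Shot Height
    70.6, 93.9, 92.9, 89.0,              # Shot Lens Size ... Shot Time of Day
    90.5, 99.6, 95.5,                    # Shot Type, Crowd, OTS
]

mean_score = sum(classifier_scores) / len(classifier_scores)
print(len(classifier_scores), round(mean_score, 1))  # 22 87.6
```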

	## Files in this repo

	| File | Purpose |
	|---|---|
	| `model.safetensors` | Blended (α=0.75) torch weights; the `CinemaCLIP.from_pretrained` target |
	| `config.json` | Autogenerated `__init__` kwargs for `CinemaCLIP` |
	| `CinemaNetSchema.json` | Schema detailing classifier head metadata, confidence thresholds, preprocessing info |
	| `ImageEncoder.mlmodel` | CoreML `"neuralnetwork"` ImageEncoder (unified embedding + probabilities) |
	| `ImageEncoder.mlpackage` | CoreML ImageEncoder (unified embedding + probabilities) |
	| `TextEncoder.mlpackage` | CoreML TextEncoder |
	| `ImageEncoder.onnx` / `_fp16.onnx` | ONNX ImageEncoder |
	| `TextEncoder.onnx` / `_fp16.onnx` | ONNX TextEncoder |


	## Citation

	```bibtex
	@misc{cinemaclip2026,
	  title = {CinemaCLIP: A hybrid CLIP model and taxonomy for the visual language of cinema},
	  author = {Somani, Rahul and Marini, Anton and Stewart, Damian},
	  year = {2026},
	  publisher = {HuggingFace},
	  doi = {10.57967/hf/8539},
	  howpublished = {\url{https://huggingface.co/OZU-Technology/CinemaCLIP}},
	  note = {Model weights and taxonomy}
	}
	```