--- license: apache-2.0 --- # Camera Level This model predicts an image's cinematic camera level [ground, hip, shoulder, eye, aerial]. The model is a DinoV2 with registers backbone (initiated with `facebook/dinov2-with-registers-large` weights) and trained on a diverse set of five thousand human-annotated images. ## How to use: ```python import torch from PIL import Image from transformers import AutoImageProcessor from transformers import AutoModelForImageClassification image_processor = AutoImageProcessor.from_pretrained("facebook/dinov2-with-registers-large") model = AutoModelForImageClassification.from_pretrained('aslakey/camera_level') model.eval() # Model labels: [ground, hip, shoulder, eye, aerial] image = Image.open('cinematic_shot.jpg') inputs = image_processor(image, return_tensors="pt") with torch.no_grad(): outputs = model(**inputs) # technically multi-label training, but argmax works too! predicted_label = outputs.logits.argmax(-1).item() print(model.config.id2label[predicted_label]) ``` ## Performance: | Category | Precision | Recall | |----------|-----------|--------| | ground | 65% | 51% | | hip | 69% | 62% | | shoulder | 68% | 74% | | eye | 51% | 39% | | aerial | 89% | 76% |