---
license: apache-2.0
---

# Shot Scale

This model predicts an image's cinematic shot scale: one of `extreme_close_up`, `close_up`, `medium`, `full`, or `wide`. It uses a DINOv2-with-registers backbone, initialized from the `facebook/dinov2-with-registers-large` weights, and was trained on a diverse set of five thousand human-annotated images.

## How to use:
```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Preprocessing matches the DINOv2-with-registers backbone the model was initialized from
image_processor = AutoImageProcessor.from_pretrained("facebook/dinov2-with-registers-large")
model = AutoModelForImageClassification.from_pretrained("aslakey/shot_scale")
model.eval()

# Example: a medium shot image
# Model labels: [extreme_close_up, close_up, medium, full, wide]
image = Image.open("medium.jpg").convert("RGB")  # ensure 3 channels (e.g. for RGBA/grayscale files)
inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Training was technically multi-label, but argmax over the logits
# still returns the single most likely shot scale
predicted_label = outputs.logits.argmax(-1).item()
print(model.config.id2label[predicted_label])
```
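The comment in the snippet notes that training was multi-label. If that means each logit is an independent per-label score (an assumption; the card does not name the loss function), applying a sigmoid to each logit yields per-label scores, which can flag ambiguous images such as full/wide cases instead of forcing a single argmax. The minimal pure-Python sketch below illustrates this; `sigmoid`, `shot_scale_scores`, and the 0.5 threshold are hypothetical helpers, not part of the model's API.

```python
import math
from typing import Dict, List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def shot_scale_scores(
    logits: Sequence[float],
    id2label: Dict[int, str],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the model card
) -> List[Tuple[str, float]]:
    """Return (label, sigmoid score) pairs at or above `threshold`, highest first.

    Assumes each logit is an independent per-label score (multi-label head).
    """
    scored = [(id2label[i], sigmoid(z)) for i, z in enumerate(logits)]
    kept = [(label, p) for label, p in scored if p >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up logits for illustration only; label order follows the model card comment
    labels = {0: "extreme_close_up", 1: "close_up", 2: "medium", 3: "full", 4: "wide"}
    print(shot_scale_scores([-3.0, 0.4, 2.1, -1.0, 0.1], labels))
```

With the model loaded as above, a single image's scores would be `shot_scale_scores(outputs.logits[0].tolist(), model.config.id2label)`.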

## Performance:

Extreme close-ups (ECU) are sparsely represented in the training data, so recall on that category is low (32%). The next version will oversample ECU images. Wide and full shots also overlap considerably: in practice, a full shot is often a wide shot with a human subject.
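The card does not say how ECU images will be oversampled. One common approach, sketched here as an assumption rather than the author's plan, draws training samples with probability inversely proportional to their class frequency, so a rare class like ECU is drawn about as often as a common one. The function names are hypothetical, and the toy labels in the test are illustrative, not the real dataset distribution.

```python
import random
from collections import Counter
from typing import List, Sequence


def inverse_frequency_weights(labels: Sequence[str]) -> List[float]:
    """Weight each sample by 1 / (size of its class), so every class
    has the same total weight and is drawn equally often in expectation."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def oversample(labels: Sequence[str], k: int, seed: int = 0) -> List[int]:
    """Draw k sample indices with replacement, class-balanced in expectation."""
    rng = random.Random(seed)
    weights = inverse_frequency_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=k)
```

In a PyTorch training loop, the same per-sample weights can be passed to `torch.utils.data.WeightedRandomSampler` instead of calling `oversample` directly.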

| Category | Label | Precision | Recall |
|----------|-------|-----------|--------|
| ECU (low coverage) | `extreme_close_up` | 75% | 32% |
| CU | `close_up` | 66% | 51% |
| M | `medium` | 88% | 90% |
| F | `full` | 69% | 68% |
| W | `wide` | 89% | 83% |
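Combining each row's precision and recall into an F1 score (their harmonic mean) makes the ECU trade-off, high precision but low recall, easier to compare across categories. The short sketch below derives F1 from the rounded percentages in the table, so its results are approximate and are not figures reported by the author. The helper names are hypothetical.

```python
from typing import Dict, Tuple

# Precision and recall per category, copied from the table (as fractions)
REPORTED: Dict[str, Tuple[float, float]] = {
    "ECU": (0.75, 0.32),
    "CU": (0.66, 0.51),
    "M": (0.88, 0.90),
    "F": (0.69, 0.68),
    "W": (0.89, 0.83),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def per_class_f1(reported: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """F1 for every category in the table."""
    return {name: f1(p, r) for name, (p, r) in reported.items()}


if __name__ == "__main__":
    scores = per_class_f1(REPORTED)
    for name, score in scores.items():
        print(f"{name}: F1 = {score:.2f}")
    # Unweighted (macro) mean over the five categories
    print(f"Macro F1 = {sum(scores.values()) / len(scores):.2f}")
```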