---
license: cc-by-nc-4.0
language:
- en
base_model:
- facebook/metaclip-2-worldwide-s16
pipeline_tag: image-classification
library_name: transformers
tags:
- text-generation-inference
- open-scene
---

![1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/Cwn-cWX3RDmAhywdLgocX.png)

# **MetaCLIP-2-Open-Scene**

> **MetaCLIP-2-Open-Scene** is an image classification model fine-tuned from the vision-language encoder **[facebook/metaclip-2-worldwide-s16](https://huggingface.co/facebook/metaclip-2-worldwide-s16)** for single-label classification.
> It identifies and categorizes natural and urban scenes using the **MetaClip2ForImageClassification** architecture.

> [!note]
> MetaCLIP 2: A Worldwide Scaling Recipe: https://huggingface.co/papers/2507.22062

```
Classification Report:
              precision    recall  f1-score   support

   buildings     0.9644    0.9703    0.9673      2625
      forest     0.9948    0.9978    0.9963      2694
     glacier     0.9531    0.9427    0.9479      2671
    mountain     0.9470    0.9512    0.9491      2723
         sea     0.9909    0.9920    0.9915      2758
      street     0.9728    0.9694    0.9711      2874

    accuracy                         0.9706     16345
   macro avg     0.9705    0.9706    0.9705     16345
weighted avg     0.9706    0.9706    0.9706     16345
```
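
Each F1 score in the report is the harmonic mean of that class's precision and recall, F1 = 2PR / (P + R). As a minimal sketch, the snippet below recomputes the per-class F1 scores and the two averages from the figures copied out of the report. Because the printed precision and recall are already rounded to four decimals, the recomputed values can differ from the reported ones in the last digit. The variable and function names are illustrative only.

```python
# Figures copied from the classification report above:
# class name -> (precision, recall, reported f1, support)
report = {
    "buildings": (0.9644, 0.9703, 0.9673, 2625),
    "forest":    (0.9948, 0.9978, 0.9963, 2694),
    "glacier":   (0.9531, 0.9427, 0.9479, 2671),
    "mountain":  (0.9470, 0.9512, 0.9491, 2723),
    "sea":       (0.9909, 0.9920, 0.9915, 2758),
    "street":    (0.9728, 0.9694, 0.9711, 2874),
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Per-class F1 recomputed from the rounded precision and recall columns
recomputed = {name: f1_from(p, r) for name, (p, r, _f, _s) in report.items()}

# Macro average: unweighted mean of the reported per-class F1 scores
macro_f1 = sum(f for _p, _r, f, _s in report.values()) / len(report)

# Weighted average: per-class F1 weighted by the number of test images (support)
total_support = sum(s for _p, _r, _f, s in report.values())
weighted_f1 = sum(f * s for _p, _r, f, s in report.values()) / total_support

for name, value in recomputed.items():
    print(f"{name:>10}: recomputed F1 = {value:.4f}, reported = {report[name][2]:.4f}")
print(f"macro F1 = {macro_f1:.4f}, weighted F1 = {weighted_f1:.4f}, support = {total_support}")
```

The macro and weighted averages are nearly identical here because the six classes have similar support.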

![download](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/gJUwvNsxBQAh30FprXlyP.png)

The model classifies images into six open-scene categories:

* **Class 0:** "buildings"
* **Class 1:** "forest"
* **Class 2:** "glacier"
* **Class 3:** "mountain"
* **Class 4:** "sea"
* **Class 5:** "street"
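
The index-to-name mapping above is what turns the model's six output logits into a readable prediction. As a minimal sketch in plain Python (no model download required), the snippet below applies a softmax to a made-up logit vector and picks the highest-scoring class. `ID2LABEL` mirrors the list above; the `top_scene` helper and the example logits are hypothetical and used only for illustration.

```python
import math

# Class index -> scene name, mirroring the list above
ID2LABEL = {
    0: "buildings",
    1: "forest",
    2: "glacier",
    3: "mountain",
    4: "sea",
    5: "street",
}


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_scene(logits):
    """Return (label, probability) for the highest-scoring class."""
    if len(logits) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} logits, got {len(logits)}")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Made-up logits for illustration; real values come from the model's output
example_logits = [0.2, 3.1, -1.0, 0.5, 0.0, -0.3]
label, score = top_scene(example_logits)
print(label, round(score, 3))
```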

# **Run with Transformers**

```shell
pip install -q transformers torch pillow gradio
```

```python
import gradio as gr
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Load model and processor
model_name = "prithivMLmods/MetaCLIP-2-Open-Scene"
model = AutoModelForImageClassification.from_pretrained(model_name)
processor = AutoImageProcessor.from_pretrained(model_name)

def scene_classification(image):
    """Return a {label: probability} dict for the scene shown in an image."""
    # Gradio passes a NumPy array; convert it to an RGB PIL image
    image = Image.fromarray(image).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    # Inference only, so skip gradient tracking
    with torch.no_grad():
        logits = model(**inputs).logits
        # Softmax over the class dimension; [0] selects the single image in the batch
        probs = torch.nn.functional.softmax(logits, dim=-1)[0].tolist()

    # Class index to scene name, matching the class list above
    labels = {
        0: "buildings",
        1: "forest",
        2: "glacier",
        3: "mountain",
        4: "sea",
        5: "street",
    }
    predictions = {labels[i]: round(p, 3) for i, p in enumerate(probs)}

    return predictions

# Create Gradio interface
iface = gr.Interface(
    fn=scene_classification,
    inputs=gr.Image(type="numpy"),
    outputs=gr.Label(label="Prediction Scores"),
    title="Open Scene Classification",
    description="Upload an image to classify the scene type (e.g., forest, sea, street, mountain, etc.)."
)

# Launch the app
if __name__ == "__main__":
    iface.launch()
```

# **Sample Inference:**

![Screenshot 2025-11-13 at 19-39-55 Open Scene Classification](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/9vHPyQsv3FeduOU1s4A6r.png)
![Screenshot 2025-11-13 at 19-37-07 Open Scene Classification](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/H_5f4qIB5XYQuOyeNGh_P.png)
![Screenshot 2025-11-13 at 19-37-50 Open Scene Classification](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/KM0hBDUFF5-zU9tCXoU1w.png)
![Screenshot 2025-11-13 at 19-38-37 Open Scene Classification](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/ErSRLvFr3fLGUTO86klOT.png)
![Screenshot 2025-11-13 at 19-39-24 Open Scene Classification](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/_NYjjKZXQx8fJGOyjhY4r.png)

# **Intended Use:**

The **MetaCLIP-2-Open-Scene** model is designed to classify a wide range of natural and urban environments.
Potential use cases include:

* **Geographical Image Analysis:** Categorizing landscapes for environmental and mapping studies.
* **Tourism and Travel Applications:** Automatically tagging scenic photos for organization and recommendations.
* **Autonomous Systems:** Supporting navigation and perception in robotics and self-driving systems.
* **Environmental Monitoring:** Detecting and classifying geographic features for research.
* **Media and Photography:** Assisting in photo organization and content-based retrieval.
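
For the photo-organization use case, one possible approach is to group images by their highest-scoring class and set aside low-confidence results for manual review. The sketch below assumes each image already has a score dictionary shaped like the output of `scene_classification`; the `group_by_scene` helper, the 0.6 threshold, and the file names are hypothetical.

```python
from collections import defaultdict


def group_by_scene(predictions, min_score=0.6):
    """Group file names by top predicted scene.

    predictions: {filename: {label: probability}}
    Images whose best score is below min_score go into "unsorted".
    """
    groups = defaultdict(list)
    for filename, scores in predictions.items():
        label, score = max(scores.items(), key=lambda item: item[1])
        bucket = label if score >= min_score else "unsorted"
        groups[bucket].append(filename)
    return dict(groups)


# Made-up scores for illustration; real ones would come from the model
example = {
    "img_001.jpg": {"forest": 0.97, "mountain": 0.02, "sea": 0.01},
    "img_002.jpg": {"street": 0.55, "buildings": 0.44, "sea": 0.01},
    "img_003.jpg": {"sea": 0.91, "glacier": 0.08, "mountain": 0.01},
}
albums = group_by_scene(example)
print(albums)
```

The threshold is a design choice: a higher value sends more borderline images (for example, street scenes dominated by buildings) to manual review.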