Subh775 committed on
Commit 6af5904 · verified · 1 Parent(s): 28fe299

Update README.md

Files changed (1)
  1. README.md +74 -1
README.md CHANGED
@@ -7,4 +7,77 @@ base_model:
  - vikhyatk/moondream2
  pipeline_tag: image-text-to-text
  library_name: transformers
- ---
+ ---
+
+ # Perception-moondream2
+
+ **Perception-moondream2** is a Vision-Language Model (VLM) fine-tuned for dense urban traffic scene understanding. Built on the compact `moondream2` architecture, it analyzes frames from CCTV and traffic camera feeds and generates detailed textual descriptions of traffic conditions.
+
+ ## Model Details
+ - **Base Model:** [vikhyatk/moondream2](https://huggingface.co/vikhyatk/moondream2) (Revision: 2024-08-26)
+ - **Architecture:** Vision Encoder + Phi-1.5 Text Decoder
+ - **Task:** Dense Image Captioning & Visual Question Answering (VQA)
+ - **Language:** English
+
+ ## Training Data
+ The model was fine-tuned on the [Subh775/Traffic-Perception-VL](https://huggingface.co/datasets/Subh775/Traffic-Perception-VL) dataset, which consists of real-world urban traffic scenes such as busy streets in Bengaluru, India.
+
+ Training focused on teaching the model to perceive and describe:
+ - **Vehicle Types & Colors:** Identifying auto-rickshaws, scooters, motorcycles, and cars.
+ - **Traffic Density & Flow:** Estimating congestion levels and movement.
+ - **Pedestrian Activity:** Describing people walking on sidewalks or crossing streets.
+ - **Infrastructure:** Recognizing road layouts, lanes, shops, signage, and greenery.
+
+ ## Intended Use Cases
+ - **Smart City Analytics:** Automated monitoring of CCTV feeds to detect congestion or accidents.
+ - **Traffic Management:** Generating real-time text logs of intersection activity.
+ - **Autonomous Driving Context:** Providing dense contextual descriptions for self-driving datasets.
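As an illustration of the text-log use case, here is a minimal sketch of turning one generated description into a timestamped JSON Lines record. The function name `format_log_entry`, the field names, and the camera identifier are all hypothetical and not part of the model or its dataset. The caption in the example is placeholder text, not model output.

```python
import json
from datetime import datetime, timezone


def format_log_entry(camera_id, caption, timestamp=None):
    """Return one JSON Lines record for a generated scene description.

    `camera_id` and the record layout are illustrative assumptions.
    """
    if timestamp is None:
        # Default to the current time in UTC when no capture time is given.
        timestamp = datetime.now(timezone.utc)
    record = {
        "timestamp": timestamp.isoformat(),
        "camera_id": camera_id,
        "caption": caption.strip(),
    }
    return json.dumps(record, ensure_ascii=False)


if __name__ == "__main__":
    line = format_log_entry(
        "cam-01",  # hypothetical camera identifier
        "Placeholder caption text.",  # stands in for a model-generated description
        datetime(2024, 1, 1, 8, 30, tzinfo=timezone.utc),
    )
    print(line)
```

Writing one JSON object per line keeps the log appendable and easy to filter by camera or time window.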
+
+ ---
+
+ ## Usage and Inference
+
+ Because this model relies on the custom Moondream2 architecture, you must pass `trust_remote_code=True` when loading it with the `transformers` library.
+
+ ### Prerequisites
+ Make sure the required libraries are installed (`torch` is imported by the inference script, and `accelerate` is needed for `device_map="auto"`):
+ ```bash
+ pip install torch transformers accelerate pillow einops
+ ```
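Before running inference, it can help to confirm that the modules are importable in the current environment. The following is a small sketch using only the standard library. The helper name `missing_modules` is hypothetical, and the module list is an assumption based on what the inference script imports plus `accelerate`, which `transformers` relies on when `device_map="auto"` is passed.

```python
from importlib.util import find_spec

# Import names assumed to be needed by the inference script.
# Note that the `pillow` package is imported as `PIL`.
REQUIRED_MODULES = ["torch", "transformers", "accelerate", "PIL", "einops"]


def missing_modules(names):
    """Return the names from `names` that cannot be located for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

The check only looks the modules up; it does not import them, so it stays fast even when `torch` is installed.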
+ ### Python Inference Script
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from PIL import Image
+
+ # 1. Define the model ID
+ model_id = "Subh775/Perception-moondream2"
+
+ # 2. Load the tokenizer and model
+ # Note: trust_remote_code=True is required for the moondream2 architecture
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     trust_remote_code=True,
+     torch_dtype=torch.float16,  # Recommended for memory efficiency
+     device_map="auto"
+ )
+
+ # 3. Load your traffic/CCTV image
+ image_path = "path_to_your_traffic_image.jpg"
+ image = Image.open(image_path).convert("RGB")
+
+ # 4. Encode the image using the vision encoder
+ enc_image = model.encode_image(image)
+
+ # 5. Ask the model to describe the scene
+ # We use the same prompt that the model was fine-tuned on
+ prompt = "Describe this traffic scene in detail."
+
+ answer = model.answer_question(enc_image, prompt, tokenizer)
+
+ print("Traffic Scene Analysis:")
+ print("-" * 50)
+ print(answer)
+ ```
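Since the model card also lists Visual Question Answering as a task, the same two calls can be reused to ask several questions about one image. The sketch below assumes `model` and `tokenizer` are loaded as in the script above. The helper name `ask_questions` is hypothetical, and the example question is only an illustration: the card states the model was fine-tuned on a single description prompt, so answer quality on other prompts is untested here.

```python
def ask_questions(model, tokenizer, image, questions):
    """Encode the image once, then answer each question against that encoding.

    Relies only on `encode_image` and `answer_question`, the two methods
    used in the inference script.
    """
    enc_image = model.encode_image(image)
    return {
        question: model.answer_question(enc_image, question, tokenizer)
        for question in questions
    }


# Example usage (requires the loaded model, tokenizer, and image):
# answers = ask_questions(
#     model,
#     tokenizer,
#     image,
#     ["Describe this traffic scene in detail.", "How congested is the road?"],
# )
# for question, answer in answers.items():
#     print(question, "->", answer)
```

Encoding the image once avoids re-running the vision encoder for every question.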
+