Thunderbird2410 committed 1f2da7e (verified) · parent: 91dbf58

Update README.md

Files changed (1): README.md (+202/-4)
 
---
tags:
- transformers
- unsloth
- qwen2_5_vl_text
- lora
- trl
- vision-language
- autonomous-driving
- robotics
- multi-view spatial-reasoning
license: apache-2.0
datasets:
- nvidia/PhysicalAI-Autonomous-Vehicles
pipeline_tag: image-text-to-text
language:
- en
---

<div align="center">

# KAIØ-SIGHT

**Multi-View Vision-Language Reasoning for Autonomous Robotics**

[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-yellow)](https://huggingface.co/Thunderbird2410/KAIO-SIGHT)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-blue)](https://github.com/poornachandra24/KAIO-SIGHT)
[![AMD ROCm](https://img.shields.io/badge/AMD-MI300X-orange)](https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE)

</div>

## Model Description

**KAIØ-SIGHT** is a fine-tuned Vision-Language Model (VLM) designed for **multi-view spatial-temporal reasoning** in autonomous robotics and driving scenarios. Built on `Qwen2.5-VL-7B-Instruct`, the model learns to fuse multi-camera video feeds into a coherent understanding of 360° environments.

This repository contains only the fine-tuned LoRA adapters; load the base model separately and apply the adapters on top.

### Key Capabilities

- 🎥 **Multi-View Fusion**: Processes synchronized feeds from up to 7 cameras (Front Wide, Front Tele, Cross Left/Right, Rear Left/Right, Rear Tele)
- 🧠 **Spatial Reasoning**: Understands object positions, motion trajectories, and scene dynamics across camera views
- 🚗 **Egomotion Prediction**: Predicts vehicle state, including position, velocity, and rotation
- ⏱️ **Temporal Context**: Analyzes 16-frame sliding windows to capture motion and causality

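The 16-frame sliding window can be sketched as a simple index sampler over a clip. This is an illustrative sketch, not the card's actual data pipeline; the stride of 8 (50% overlap) is an assumption for demonstration.

```python
def sliding_windows(num_frames: int, window: int = 16, stride: int = 8):
    """Return [start, end) frame-index windows covering a clip."""
    windows = []
    start = 0
    while start + window <= num_frames:
        windows.append((start, start + window))
        start += stride
    if not windows:
        # Clip shorter than one window: use whatever frames exist.
        windows.append((0, num_frames))
    elif windows[-1][1] < num_frames:
        # Right-align a final window so trailing frames are covered.
        windows.append((num_frames - window, num_frames))
    return windows
```

Each `(start, end)` pair selects the frames fed to the model as one temporal context.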
## Training Details

### Base Model
- **Architecture**: Qwen2.5-VL-7B-Instruct
- **Training Method**: LoRA (Low-Rank Adaptation) with Unsloth optimizations
- **Precision**: BFloat16

### LoRA Configuration

| Parameter | Value |
|-----------|-------|
| Rank | 128 |
| Alpha | 256 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Max Sequence Length | 65,536 tokens |

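The table above maps roughly onto a PEFT `LoraConfig` as follows. This is a hedged sketch: `lora_dropout` and `bias` are not stated in the card and are left at common defaults here.

```python
from peft import LoraConfig

# Rank, alpha, and target modules taken from the LoRA Configuration table;
# dropout and bias settings below are assumptions (PEFT defaults), not from the card.
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.0,
    bias="none",
)
```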
### Training Hyperparameters

| Parameter | Value |
|-----------|-------|
| Learning Rate | 1e-4 |
| Optimizer | Paged AdamW 8-bit |
| Effective Batch Size | 144 (48 × 3 gradient accumulation) |
| Weight Decay | 0.01 |
| LR Scheduler | Cosine with 10% warmup |
| Epochs | 1 |

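Expressed as `transformers.TrainingArguments`, the hyperparameters above look roughly like the following. The `output_dir` is a hypothetical path, and any argument not listed in the table is omitted or left at its default; this is a sketch, not the actual training script.

```python
from transformers import TrainingArguments

# Per-device batch size of 48 with 3 gradient-accumulation steps
# gives the effective batch size of 144 from the table above.
args = TrainingArguments(
    output_dir="kaio-sight-lora",      # hypothetical output path
    per_device_train_batch_size=48,
    gradient_accumulation_steps=3,
    learning_rate=1e-4,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    optim="paged_adamw_8bit",
    bf16=True,                          # BFloat16 precision per the card
)
```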
### Hardware
- **GPU**: AMD Instinct MI300X (192 GB VRAM)
- **Framework**: ROCm 6.4 with custom kernel optimizations

## Dataset

Trained on the [NVIDIA PhysicalAI Autonomous Vehicles](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles) dataset, featuring:

- Multi-camera video streams from 7 synchronized cameras
- Egomotion labels (position, velocity, rotation)
- High-quality urban driving scenarios

### Camera Configuration (7-Camera Setup)

```
┌─────────────┬─────────────┬─────────────┐
│ Front Wide  │ Front Tele  │   (empty)   │
│  120° FOV   │   30° FOV   │             │
├─────────────┼─────────────┼─────────────┤
│ Cross Left  │    (ego)    │ Cross Right │
│  120° FOV   │             │  120° FOV   │
├─────────────┼─────────────┼─────────────┤
│  Rear Left  │  Rear Tele  │ Rear Right  │
│   70° FOV   │   30° FOV   │   70° FOV   │
└─────────────┴─────────────┴─────────────┘
```
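One way to read the grid above as a tiling recipe: each camera occupies a fixed cell in a 3×3 composite frame. The camera keys and the cell size in this sketch are illustrative assumptions, not the card's actual preprocessing code.

```python
# 3x3 grid mirroring the camera diagram; None marks an unused cell.
# Camera key names and the 448x252 cell size are assumptions for illustration.
GRID = [
    ["front_wide", "front_tele", None],
    ["cross_left", None, "cross_right"],
    ["rear_left", "rear_tele", "rear_right"],
]

def tile_offsets(cell_w: int = 448, cell_h: int = 252) -> dict:
    """Map each camera name to its (x, y) pixel offset in the composite frame."""
    offsets = {}
    for row_idx, row in enumerate(GRID):
        for col_idx, name in enumerate(row):
            if name is not None:
                offsets[name] = (col_idx * cell_w, row_idx * cell_h)
    return offsets
```

Pasting each camera frame at its offset yields the single composite image the model consumes per timestep.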
## Intended Use

### Primary Use Cases
- 🤖 Autonomous robotics research and development
- 🚙 Driving-scenario understanding and prediction
- 📊 Multi-view video understanding research
- 🔬 Vision-language model experimentation

### Out-of-Scope Uses
- ⚠️ Production autonomous-vehicle deployment (experimental research only)
- ⚠️ Safety-critical applications without additional validation
- ⚠️ Real-time inference without hardware-specific optimization

## Usage

### Quick Start with Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from PIL import Image
import torch

# Load the base model
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Attach the LoRA adapter
model = PeftModel.from_pretrained(base_model, "Thunderbird2410/KAIO-SIGHT")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Prepare your multi-view image
image = Image.open("path/to/multi_view_image.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Analyze this multi-camera driving scene. Describe the surroundings and predict the vehicle's motion."},
        ],
    }
]

# Generate a response
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0], skip_special_tokens=True)
```

### With Unsloth (Recommended for Training)

```python
import torch
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "Thunderbird2410/KAIO-SIGHT",
    max_seq_length=65536,
    dtype=torch.bfloat16,
    load_in_4bit=True,  # Optional: for lower VRAM usage
)
```

## Limitations

- **Experimental Status**: This model is a research prototype and is not production-ready
- **Hardware Dependency**: Optimized for the AMD MI300X; performance on other GPUs may vary
- **Domain Specificity**: Trained primarily on urban driving scenarios
- **Temporal Windows**: Performs best on 4-frame sequences that match the training distribution and fit within the model's context window

## Model Architecture

```mermaid
graph LR
    A[7-Camera Video] -->|Tile to Grid| B[3×3 Composite Frame]
    B -->|16-Frame Window| C[Temporal Sequence]
    C -->|Vision Encoder| D[Qwen2.5-VL-7B]
    D -->|LoRA Adapters| E[Fine-tuned Model]
    E -->|Generate| F[Egomotion + Reasoning]
```

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{kaio-sight-2024,
  author    = {Poornachandra},
  title     = {KAIØ-SIGHT: Multi-View Vision-Language Reasoning for Autonomous Robotics},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/Thunderbird2410/KAIO-SIGHT}
}
```

## Acknowledgments

- [Qwen Team](https://huggingface.co/Qwen) for the Qwen2.5-VL foundation model
- [Unsloth](https://github.com/unslothai/unsloth) for efficient fine-tuning optimizations
- [NVIDIA](https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles) for the PhysicalAI dataset
- [AMD](https://rocm.docs.amd.com/) for ROCm and MI300X hardware support

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

---

<div align="center">

**⚠️ Experimental Research Model - Use at Your Own Risk ⚠️**

</div>

This qwen2_5_vl_text model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth).