Billavenu committed on
Commit dc3d1dd · verified · 1 Parent(s): 1de911d

Update README.md

Files changed (1)
  1. README.md +257 -7
README.md CHANGED
@@ -1,15 +1,265 @@
1
  ---
2
  title: ROCKIT Vision Intelligence
3
- emoji: 🏆
4
- colorFrom: yellow
5
  colorTo: purple
6
  sdk: gradio
7
- sdk_version: 6.14.0
8
- python_version: '3.13'
9
  app_file: app.py
10
- pinned: false
11
  license: apache-2.0
12
- short_description: GPU-Accelerated Multimodal Search Across Images, Videos
13
  ---
14
 
15
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
  ---
2
  title: ROCKIT Vision Intelligence
3
+ emoji: "\U0001F50D"
4
+ colorFrom: indigo
5
  colorTo: purple
6
  sdk: gradio
7
+ sdk_version: "4.44.0"
 
8
  app_file: app.py
9
+ pinned: true
10
  license: apache-2.0
 
11
  ---
12
 
13
+ <div align="center">
14
+
15
+ # ARIA Vision Intelligence
16
+
17
+ ### GPU-Accelerated Multimodal Search Engine
18
+
19
+ *Build isolated projects, ingest images and videos, and search with natural language, all powered by AMD hipVS CAGRA graph indexes with NVMe-backed hot-swap memory management.*
20
+
21
+ [![HuggingFace Space](https://img.shields.io/badge/HuggingFace-Space-yellow?logo=huggingface)](https://huggingface.co/spaces)
22
+ [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
23
+ [![Python 3.11+](https://img.shields.io/badge/Python-3.11%2B-green.svg)](https://www.python.org/)
24
+
25
+ </div>
26
+
27
+ ---
28
+
29
+ ## What Is This?
30
+
31
+ ARIA Vision Intelligence is an **open-source, self-hosted multimodal search engine** that lets you create isolated projects, ingest visual media (images, videos), and query them with natural language. It was built for the **AMD Hackathon** and is designed to showcase GPU-accelerated approximate nearest-neighbor (ANN) search using the **hipVS CAGRA** graph index on AMD ROCm hardware.
32
+
33
+ The core idea is simple:
34
+
35
+ > **Upload media → Embed everything into a shared vector space → Build a CAGRA graph on the GPU → Search in microseconds → Let an LLM interpret the results.**
36
+
37
+ There is no database and no required external API dependency. Every embedding, every index, and every LLM inference can run **entirely on local hardware**, from a single AMD GPU down to a CPU-only Hugging Face free tier.
38
+
39
+ ---
40
+
41
+ ## Key Features
42
+
43
+ ### Multi-Project Isolation
44
+ Create **multiple projects**, each with its own sources, indexes, and configuration. Projects are fully isolated: ingesting media into one never affects another.
45
+
46
+ ```
+ projects/
+ ├── security-cam/          # CCTV footage analysis
+ │   ├── sources/
+ │   ├── indexes/
+ │   └── config.json
+ ├── product-catalog/       # E-commerce image search
+ │   ├── sources/
+ │   ├── indexes/
+ │   └── config.json
+ └── nature-docs/           # Wildlife video intelligence
+     ├── sources/
+     ├── indexes/
+     └── config.json
+ ```
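A minimal sketch of how such a layout could be scaffolded, assuming the directory names shown in the tree above. The function name `create_project` and the keys written to `config.json` are hypothetical and are not taken from the repository.

```python
import json
import tempfile
from pathlib import Path


def create_project(root: Path, name: str, description: str = "") -> Path:
    """Create an isolated project directory with its own sources, indexes, and config."""
    project = root / name
    (project / "sources").mkdir(parents=True, exist_ok=True)
    (project / "indexes").mkdir(exist_ok=True)
    config = project / "config.json"
    if not config.exists():  # never overwrite an existing project's settings
        # Hypothetical config keys, chosen only for illustration.
        config.write_text(json.dumps({"name": name, "description": description}, indent=2))
    return project


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp()) / "projects"
    for project_name in ("security-cam", "product-catalog"):
        create_project(root, project_name)
    print(sorted(p.name for p in root.iterdir()))
```

Because each project owns its own subtree, deleting or re-indexing one project never touches files that belong to another.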
61
+
62
+ ### Native Multimodal Embedding (No Captioning)
63
+ Unlike caption-then-embed pipelines, ARIA uses **true vision-language embedding models** that encode images, video frames, and text queries into the **same vector space** directly. With no intermediate captioning step, no visual detail is lost in translation to text.
64
+
65
+ | Tier | Model | Dim | Use Case |
66
+ |------|-------|-----|----------|
67
+ | GPU (large) | `Qwen/Qwen3-VL-Embedding-8B` | 4096 | Highest quality, production |
68
+ | GPU (small) | `Qwen/Qwen3-VL-Embedding-2B` | 2048 | Balanced speed / quality |
69
+ | CPU fallback | `openai/clip-vit-large-patch14` | 768 | Free-tier HF Spaces, dev |
70
+
71
+ ### CAGRA Graph Index (hipVS)
72
+ The CAGRA graph index is among the fastest known ANN methods for GPU-resident data. ARIA rebuilds the CAGRA graph on every insert because this project is **optimized for inference and query speed**, not ingestion throughput. A 100K-vector CAGRA rebuild takes ~2 seconds on an MI250X, which is negligible compared to the embedding cost.
73
+
74
+ ### NVMe → VRAM Async Hot-Swap
75
+ Indexes live in three tiers of memory. When a project is queried, its index is **asynchronously copied from NVMe into VRAM** via pinned-memory DMA, without blocking other projects. When VRAM fills up, least-recently-used indexes are evicted back to NVMe rather than deleted.
76
+
77
+ ```
+ ┌──────────────┐   async copy    ┌──────────────┐      evict      ┌──────────────┐
+ │   NVMe SSD   │ ──────────────→ │   GPU VRAM   │ ──────────────→ │   NVMe SSD   │
+ │ (cold store) │ ←────────────── │ (hot index)  │                 │ (cold store) │
+ │ .cagra file  │     restore     │ CAGRA graph  │                 │ .cagra file  │
+ └──────────────┘                 └──────────────┘                 └──────────────┘
+                                         ↑
+                                      search()
+                                         ↑
+                                    query vector
+ ```
88
+
89
+ This design lets you run **dozens of projects** on a single GPU by keeping only the active ones hot. Full VRAM capacity is utilized.
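The eviction policy described above can be sketched as plain LRU bookkeeping. This is an illustration under stated assumptions, not the project's implementation: the class name `HotSwapCache`, its methods, and the byte-based budget are all hypothetical, and the actual NVMe and VRAM transfers are reduced to comments.

```python
from collections import OrderedDict


class HotSwapCache:
    """LRU bookkeeping for which project indexes are resident in VRAM (hypothetical)."""

    def __init__(self, vram_budget_bytes: int):
        self.budget = vram_budget_bytes
        self.hot = OrderedDict()  # project name -> index size in bytes, oldest first
        self.evicted = []         # projects swapped out, recorded for illustration

    def used(self) -> int:
        return sum(self.hot.values())

    def touch(self, project: str, size_bytes: int) -> None:
        """Mark a project as queried, loading it and evicting LRU entries if needed."""
        if project in self.hot:
            self.hot.move_to_end(project)  # already hot: only refresh recency
            return
        if size_bytes > self.budget:
            raise ValueError(f"index for {project!r} exceeds the VRAM budget")
        while self.hot and self.used() + size_bytes > self.budget:
            victim, _ = self.hot.popitem(last=False)  # least recently used
            self.evicted.append(victim)  # real code would keep the serialized file on NVMe
        self.hot[project] = size_bytes   # real code would copy NVMe -> VRAM here


if __name__ == "__main__":
    cache = HotSwapCache(vram_budget_bytes=100)
    for name, size in [("a", 60), ("b", 30), ("a", 60), ("c", 40)]:
        cache.touch(name, size)
    print(list(cache.hot), cache.evicted)
```

An `OrderedDict` keeps recency order for free, so picking the eviction victim is a constant-time pop from the front.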
90
+
91
+ ### LLM-Interpreted Results
92
+ Raw vector search returns `(id, score)` tuples. Before showing results to the user, ARIA passes them through an LLM that interprets the matches, merges adjacent video timestamps into time ranges, and generates a human-readable summary.
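Merging adjacent video timestamps into time ranges can be sketched as follows. This assumes hits arrive as timestamps in seconds and that two hits count as adjacent when they are at most one sampling interval apart; the function name `merge_frame_hits` is hypothetical.

```python
def merge_frame_hits(timestamps, frame_every_sec=5):
    """Merge frame timestamps (seconds) into inclusive (start, end) ranges.

    Two hits are treated as adjacent when they are at most one sampling
    interval apart. This adjacency rule is an assumption for illustration.
    """
    ranges = []
    for t in sorted(set(timestamps)):
        if ranges and t - ranges[-1][1] <= frame_every_sec:
            ranges[-1][1] = t      # extend the current range
        else:
            ranges.append([t, t])  # start a new range
    return [(start, end) for start, end in ranges]


if __name__ == "__main__":
    print(merge_frame_hits([10, 15, 20, 60, 65, 120]))
    # prints [(10, 20), (60, 65), (120, 120)]
```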
93
+
94
+ | Tier | Model | Notes |
95
+ |------|-------|-------|
96
+ | Primary | `Qwen/Qwen3-35B-A3B` | MoE: 35B total, 3B active; fast and capable |
97
+ | Fallback | `Qwen/Qwen3-1.7B` | Tiny, runs on anything |
98
+ | API | HF Inference API | Zero local compute, free tier |
99
+
100
+ ---
101
+
102
+ ## Architecture
103
+
104
+ ![Architecture](assests/Architecture.png)
105
+
106
+ ### Data Flow (Single Query)
107
+
108
+ ![Data Flow (Single Query)](assests/dataflow.png)
109
+
110
+ ---
111
+
112
+ ## GPU Compute Tiers
113
+
114
+ ARIA automatically detects available hardware and selects the best backend:
115
+
116
+ ![GPU Compute Tiers](assests/GPU_Compute.png)
117
+
118
+ | Tier | Backend | Search Latency (100K vectors) | When Used |
119
+ |------|---------|-------------------------------|-----------|
120
+ | 1 | CAGRA graph (hipVS / cuVS) | ~50 μs | AMD ROCm GPU + `hipvs` installed |
121
+ | 2 | Flat tensor (hipBLAS matmul) | ~2 ms | Any CUDA/ROCm GPU |
122
+ | 3 | NumPy cosine similarity | ~15 ms | CPU-only / free HF Space |
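The tier selection in the table can be sketched as a small decision function. The names `select_backend` and `detect_backend` and the returned labels are hypothetical, and the sketch assumes the import name `hipvs` matches the package name used in this README. `torch.cuda.is_available()` is used as the GPU check because ROCm builds of PyTorch report through the same call.

```python
import importlib.util
import os


def select_backend(use_gpu: bool, has_hipvs: bool, has_gpu_runtime: bool) -> str:
    """Map hardware facts to one of the three tiers in the table above."""
    if use_gpu and has_gpu_runtime and has_hipvs:
        return "cagra"      # Tier 1: CAGRA graph index
    if use_gpu and has_gpu_runtime:
        return "gpu_flat"   # Tier 2: flat tensor matmul on the GPU
    return "cpu_numpy"      # Tier 3: NumPy cosine similarity


def detect_backend() -> str:
    """Probe the environment without importing anything that is not installed."""
    use_gpu = os.environ.get("USE_GPU", "false").lower() == "true"
    has_hipvs = importlib.util.find_spec("hipvs") is not None  # assumed import name
    has_gpu_runtime = False
    if use_gpu and importlib.util.find_spec("torch") is not None:
        import torch
        has_gpu_runtime = torch.cuda.is_available()
    return select_backend(use_gpu, has_hipvs, has_gpu_runtime)


if __name__ == "__main__":
    print(detect_backend())
```

Keeping the decision in a pure function makes the fallback order easy to test without any GPU present.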
123
+
124
+ ---
125
+
126
+ ## Project Structure
127
+
128
+ ```
+ HF_Space_hipVS/
+ ├── app.py              # Gradio UI: 3 tabs (Search, Upload, About)
+ ├── config.py           # Env-aware configuration, auto-scales by hardware
+ ├── embedding.py        # Qwen3-VL / CLIP multimodal embedding + LLM calls
+ ├── vector_store.py     # 3-tier vector store (CAGRA → GPU → CPU) + NVMe swap
+ ├── ingest.py           # Image & video ingestion pipeline
+ ├── search.py           # Query → embed → search → LLM interpret
+ ├── seed_data.py        # Auto-seed from HF datasets on first launch
+ ├── requirements.txt    # HF-native dependencies
+ ├── README.md           # This file
+ ├── .env.example        # Environment variable template
+ └── data/
+     ├── projects/       # Per-project source files and indexes
+     │   └── default/
+     │       ├── images/
+     │       ├── videos/
+     │       └── indexes/
+     └── models/         # Cached model weights (auto-downloaded)
+ ```
148
+
149
+ ---
150
+
151
+ ## Models
152
+
153
+ ### Embedding (Multimodal: Images + Text in the Same Space)
154
+
155
+ No captioning model is used. Both images and text are embedded directly into a shared vector space by a single vision-language model.
156
+
157
+ | Model | Params | Dim | Modalities | Tier |
158
+ |-------|--------|-----|------------|------|
159
+ | `Qwen/Qwen3-VL-Embedding-8B` | 8B | 4096 | image, video frame, text | GPU (production) |
160
+ | `Qwen/Qwen3-VL-Embedding-2B` | 2B | 2048 | image, video frame, text | GPU (balanced) |
161
+ | `openai/clip-vit-large-patch14` | 428M | 768 | image, text | CPU (fallback) |
162
+
163
+ ### LLM (Search Result Interpretation)
164
+
165
+ | Model | Params | Architecture | Tier |
166
+ |-------|--------|-------------|------|
167
+ | `Qwen/Qwen3-35B-A3B` | 35B (3B active) | MoE | Primary: fast inference, smart |
168
+ | `Qwen/Qwen3-1.7B` | 1.7B | Dense | Fallback: runs on anything |
169
+ | HF Inference API | -- | Serverless | API fallback: zero local compute |
170
+
171
+ ---
172
+
173
+ ## Setup
174
+
175
+ ### Hugging Face Space (Recommended)
176
+
177
+ 1. Create a new Space (Gradio SDK)
178
+ 2. Push the `HF_Space_hipVS/` directory
179
+ 3. Set these **Secrets** in Space Settings:
180
+
181
+ | Secret | Required | Description |
182
+ |--------|----------|-------------|
183
+ | `HF_TOKEN` | Optional | HF write token for dataset persistence + Inference API |
184
+ | `USE_GPU` | Optional | Set `true` on GPU-enabled Spaces |
185
+ | `HF_DATASET_REPO` | Optional | e.g. `username/aria-index` for persistent storage |
186
+
187
+ 4. The Space auto-seeds demo content from `flickr30k` on first launch.
188
+
189
+ ### Local / AMD GPU Server
190
+
191
+ ```bash
192
+ cd HF_Space_hipVS
193
+ cp .env.example .env # edit with your settings
194
+ pip install -r requirements.txt
195
+ python app.py # starts on http://localhost:7860
196
+ ```
197
+
198
+ For CAGRA acceleration on AMD:
199
+ ```bash
200
+ pip install hipvs cupy-rocm # enables Tier 1
201
+ export USE_GPU=true
202
+ python app.py
203
+ ```
204
+
205
+ ---
206
+
207
+ ## Environment Variables
208
+
209
+ | Variable | Default | Description |
210
+ |----------|---------|-------------|
211
+ | `USE_GPU` | `false` | Enable GPU acceleration |
212
+ | `EMBED_MODEL` | auto-detected | Override embedding model |
213
+ | `EMBED_DIM` | auto-detected | Embedding dimensionality |
214
+ | `LLM_MODEL` | `Qwen/Qwen3-35B-A3B` | LLM for result summarization |
215
+ | `LLM_FALLBACK` | `Qwen/Qwen3-1.7B` | Fallback LLM |
216
+ | `FRAME_EVERY_SEC` | `5` | Video frame extraction interval |
217
+ | `HF_TOKEN` | -- | HF token for persistence + API |
218
+ | `HF_DATASET_REPO` | -- | HF Dataset repo for cold storage |
219
+ | `AUTO_SEED` | `true` | Auto-seed on first empty launch |
220
+ | `SEED_DATASET` | `nlphuji/flickr30k` | Dataset for auto-seeding |
221
+ | `SWAP_PATH` | `data/indexes` | NVMe path for index swap files |
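A minimal sketch of reading these variables with the defaults from the table. The `Settings` dataclass and `load_settings` function are hypothetical names; the auto-detected `EMBED_MODEL` and `EMBED_DIM` values and the optional token variables are left out because their resolution logic is not described here.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    use_gpu: bool
    llm_model: str
    llm_fallback: str
    frame_every_sec: int
    auto_seed: bool
    seed_dataset: str
    swap_path: str


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes"}


def load_settings(env=None) -> Settings:
    """Build settings from a mapping (defaults to os.environ), using the table's defaults."""
    env = os.environ if env is None else env
    return Settings(
        use_gpu=_as_bool(env.get("USE_GPU", "false")),
        llm_model=env.get("LLM_MODEL", "Qwen/Qwen3-35B-A3B"),
        llm_fallback=env.get("LLM_FALLBACK", "Qwen/Qwen3-1.7B"),
        frame_every_sec=int(env.get("FRAME_EVERY_SEC", "5")),
        auto_seed=_as_bool(env.get("AUTO_SEED", "true")),
        seed_dataset=env.get("SEED_DATASET", "nlphuji/flickr30k"),
        swap_path=env.get("SWAP_PATH", "data/indexes"),
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing the environment in as a mapping keeps the loader testable without mutating the real process environment.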
222
+
223
+ ---
224
+
225
+ ## How It Works
226
+
227
+ ### 1. Create a Project
228
+ Each project is an isolated workspace with its own sources, embeddings, and CAGRA index. You can have a "security-cam" project and a "product-catalog" project running on the same GPU without interference.
229
+
230
+ ### 2. Ingest Media
231
+ Upload images or videos. For videos, ffmpeg extracts one representative frame every N seconds. Every image and frame is embedded directly by the vision-language model (Qwen3-VL or CLIP), with no captioning and no text intermediary.
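One way to extract a frame every N seconds is ffmpeg's `fps` filter. The sketch below only builds the command line; the function name `frame_extract_command` and the output file pattern are hypothetical, and the project's actual ffmpeg invocation may differ.

```python
from pathlib import Path


def frame_extract_command(video_path, out_dir, every_sec=5):
    """Build an ffmpeg command that writes one frame per `every_sec` seconds."""
    if every_sec <= 0:
        raise ValueError("every_sec must be positive")
    return [
        "ffmpeg",
        "-i", str(video_path),
        "-vf", f"fps=1/{every_sec}",              # sample one frame per interval
        str(Path(out_dir) / "frame_%05d.jpg"),    # hypothetical output pattern
    ]


if __name__ == "__main__":
    cmd = frame_extract_command("clip.mp4", "frames", every_sec=5)
    print(" ".join(cmd))
    # To actually run it: subprocess.run(cmd, check=True), with ffmpeg on PATH.
```

With this naming, the k-th extracted frame corresponds roughly to timestamp `(k - 1) * every_sec`, which is what allows frame hits to be mapped back to positions in the video.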
232
+
233
+ ### 3. CAGRA Build
234
+ After every insert, the CAGRA graph index is **fully rebuilt** from the updated vector set. This is intentional: ARIA is optimized for query speed, not ingestion throughput. A 100K rebuild takes ~2s on MI250X. The built graph is immediately serialized to NVMe.
235
+
236
+ ### 4. Search
237
+ When you search, the query text is embedded by the same model. The CAGRA index is loaded into VRAM (if not already hot) via async pinned-memory DMA, and searched in microseconds. Results are post-processed: video frame hits are merged into time ranges, and the full result set is sent to the LLM for a human-friendly summary.
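To make the search step concrete, here is a brute-force cosine top-k in plain Python. It is a stand-in for the exhaustive CPU tier, not CAGRA: a graph index returns approximate neighbors without scoring every vector. The function name `cosine_top_k` is hypothetical.

```python
import math


def cosine_top_k(query, vectors, k=3):
    """Return the k (index, cosine score) pairs most similar to `query`."""

    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    query_norm = norm(query)
    scored = []
    for idx, vec in enumerate(vectors):
        denom = query_norm * norm(vec)
        dot = sum(q * x for q, x in zip(query, vec))
        scored.append((idx, dot / denom if denom else 0.0))  # zero vectors score 0
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(cosine_top_k([1.0, 0.0], corpus, k=2))
```

The returned `(id, score)` tuples have the same shape as the raw search output that the LLM interpretation step consumes.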
238
+
239
+ ### 5. Memory Management
240
+ Multiple project indexes coexist by swapping between NVMe and VRAM. Active indexes are kept hot; idle ones are evicted. Restoration from NVMe is a fast deserialization, with no re-embedding and no rebuild.
241
+
242
+ ---
243
+
244
+ ## Design Decisions
245
+
246
+ | Decision | Rationale |
247
+ |----------|-----------|
248
+ | **No captioning model** | Vision-language embedding models (Qwen3-VL, CLIP) encode images directly into the same space as text. Captioning adds latency and loses visual information. |
249
+ | **Rebuild CAGRA on every insert** | This project is inference-heavy. Query latency matters more than ingestion speed. CAGRA rebuild is fast enough (~2s for 100K vectors). |
250
+ | **NVMe swap, not deletion** | Indexes are expensive to build. Serializing to NVMe and restoring is 100x faster than re-embedding from source. |
251
+ | **Multi-project isolation** | Real-world use cases involve multiple distinct corpora. Isolation prevents cross-contamination and allows per-project model configuration. |
252
+ | **No external database** | Everything is `.npz` + `.cagra` files. Portable, debuggable, no ops overhead. HF Dataset push is optional backup. |
253
+ | **MoE LLM (Qwen3-35B-A3B)** | 35B params for quality, but only 3B active per token: roughly the per-token compute of a 3B model with the capacity of a 35B model. |
254
+
255
+ ---
256
+
257
+ ## License
258
+
259
+ Apache 2.0
260
+
261
+ ---
262
+
263
+ <div align="center">
264
+ <i>Built for the AMD Hackathon · ARIA Vision Intelligence Platform</i>
265
+ </div>