Update README.md
README.md
CHANGED
@@ -1,15 +1,265 @@
  ---
  title: ROCKIT Vision Intelligence
- emoji:
- colorFrom:
  colorTo: purple
  sdk: gradio
- sdk_version:
- python_version: '3.13'
  app_file: app.py
- pinned:
  license: apache-2.0
- short_description: GPU-Accelerated Multimodal Search Across Images, Videos
  ---

-
---
title: ROCKIT Vision Intelligence
emoji: "\U0001F50D"
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: "4.44.0"
app_file: app.py
pinned: true
license: apache-2.0
---

<div align="center">

# ARIA Vision Intelligence

### GPU-Accelerated Multimodal Search Engine

*Build isolated projects, ingest images and videos, and search with natural language, all powered by AMD hipVS CAGRA graph indexes with NVMe-backed hot-swap memory management.*

[Hugging Face Spaces](https://huggingface.co/spaces)
[License](LICENSE)
[Python](https://www.python.org/)

</div>

---

## What Is This?

ARIA Vision Intelligence is an **open-source, self-hosted multimodal search engine** that lets you create isolated projects, ingest visual media (images and videos), and query them with natural language. It was built for the **AMD Hackathon** to showcase GPU-accelerated approximate nearest-neighbor (ANN) search using the **hipVS CAGRA** graph index on AMD ROCm hardware.

The core idea is simple:

> **Upload media → Embed everything into a shared vector space → Build a CAGRA graph on the GPU → Search in microseconds → Let an LLM interpret the results.**

There is no database and no required external API. Every embedding, every index, and every LLM inference can run **entirely on local hardware**, from a single AMD GPU down to a CPU-only Hugging Face free tier.

---

## Key Features

### Multi-Project Isolation
Create **multiple projects**, each with its own sources, indexes, and configuration. Projects are fully isolated: ingesting media into one never affects another.

```
projects/
├── security-cam/          # CCTV footage analysis
│   ├── sources/
│   ├── indexes/
│   └── config.json
├── product-catalog/       # E-commerce image search
│   ├── sources/
│   ├── indexes/
│   └── config.json
└── nature-docs/           # Wildlife video intelligence
    ├── sources/
    ├── indexes/
    └── config.json
```

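As a rough illustration of the layout above, the sketch below creates one project skeleton on disk. The function name `create_project` and the contents of `config.json` are assumptions made for this example; the actual config schema is not documented here.

```python
import json
from pathlib import Path
from typing import Optional


def create_project(root: Path, name: str, config: Optional[dict] = None) -> Path:
    """Create an isolated project directory with its own sources, indexes, and config."""
    project_dir = root / "projects" / name
    # Each project gets private subdirectories, so nothing is shared between projects.
    (project_dir / "sources").mkdir(parents=True, exist_ok=True)
    (project_dir / "indexes").mkdir(parents=True, exist_ok=True)

    config_path = project_dir / "config.json"
    if not config_path.exists():
        # Hypothetical config fields: only the file name comes from the layout above.
        config_path.write_text(json.dumps(config or {"name": name}, indent=2))
    return project_dir
```

Calling `create_project(Path("."), "security-cam")` would produce the first branch of the tree shown above; calling it again for another name leaves the first project untouched.
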
### Native Multimodal Embedding (No Captioning)
Unlike caption-then-embed pipelines, ARIA uses **true vision-language embedding models** that encode images, video frames, and text queries directly into the **same vector space**. There is no intermediate captioning step, so no visual information is lost to a text bottleneck.

| Tier | Model | Dim | Use Case |
|------|-------|-----|----------|
| GPU (large) | `Qwen/Qwen3-VL-Embedding-8B` | 4096 | Highest quality, production |
| GPU (small) | `Qwen/Qwen3-VL-Embedding-2B` | 2048 | Balanced speed / quality |
| CPU fallback | `openai/clip-vit-large-patch14` | 768 | Free-tier HF Spaces, dev |

### CAGRA Graph Index (hipVS)
CAGRA is a graph-based ANN index designed for GPU-resident data, and it is among the fastest options for that setting. ARIA rebuilds the CAGRA graph on every insert because the project is **optimized for inference and query speed**, not ingestion throughput. A 100K-vector CAGRA rebuild takes ~2 seconds on an MI250X, which is negligible compared to the embedding cost.

### NVMe ↔ VRAM Async Hot-Swap
Indexes live in three tiers of memory. When a project is queried, its index is **asynchronously copied from NVMe into VRAM** via pinned-memory DMA, without blocking other projects. When VRAM fills up, least-recently-used indexes are evicted back to NVMe, not deleted.

```
┌──────────────┐   async copy    ┌──────────────┐      evict      ┌──────────────┐
│ NVMe SSD     │ ──────────────▶ │ GPU VRAM     │ ──────────────▶ │ NVMe SSD     │
│ (cold store) │ ──────────────▶ │ (hot index)  │                 │ (cold store) │
│ .cagra file  │     restore     │ CAGRA graph  │                 │ .cagra file  │
└──────────────┘                 └──────────────┘                 └──────────────┘
                                        ▲
                                     search()
                                        │
                                   query vector
```

This design lets you run **dozens of projects** on a single GPU by keeping only the active ones hot, so VRAM is spent on the indexes that are actually being queried.

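The eviction policy described above can be sketched as a small least-recently-used cache. This is a simplified, synchronous illustration: `HotIndexCache`, `load`, and `persist` are hypothetical names, capacity is counted in indexes rather than bytes of VRAM, and the async pinned-memory copy is left out.

```python
from collections import OrderedDict
from typing import Callable, List


class HotIndexCache:
    """Keep at most `capacity` indexes hot; evict the least recently used one."""

    def __init__(self, capacity: int,
                 load: Callable[[str], object],
                 persist: Callable[[str, object], None]) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._load = load        # hypothetical: restore an index from cold storage
        self._persist = persist  # hypothetical: write an index back to cold storage
        self._hot: "OrderedDict[str, object]" = OrderedDict()

    def get(self, project: str) -> object:
        if project in self._hot:
            self._hot.move_to_end(project)  # mark as most recently used
            return self._hot[project]
        while len(self._hot) >= self.capacity:
            victim, index = self._hot.popitem(last=False)  # least recently used
            self._persist(victim, index)    # moved to cold storage, never deleted
        self._hot[project] = self._load(project)
        return self._hot[project]

    def hot_projects(self) -> List[str]:
        return list(self._hot)


cold = {"a": "index-a", "b": "index-b", "c": "index-c"}
evicted = []
cache = HotIndexCache(2, load=cold.__getitem__,
                      persist=lambda name, index: evicted.append(name))
for name in ["a", "b", "a", "c"]:
    cache.get(name)
print(cache.hot_projects())  # ['a', 'c']
print(evicted)               # ['b']
```

Project `b` is evicted rather than `a` because `a` was queried again after `b` was loaded.
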
### LLM-Interpreted Results
Raw vector search returns `(id, score)` tuples. Before showing results to the user, ARIA passes them through an LLM that interprets the matches, merges adjacent video timestamps into time ranges, and generates a human-readable summary.

| Tier | Model | Notes |
|------|-------|-------|
| Primary | `Qwen/Qwen3-35B-A3B` | MoE: 35B total, 3B active; fast and capable |
| Fallback | `Qwen/Qwen3-1.7B` | Tiny, runs on anything |
| API | HF Inference API | Zero local compute, free tier |

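Merging adjacent video timestamps into time ranges is a deterministic step that could look roughly like the sketch below. The function name `merge_timestamps` and the rule "two hits are adjacent when they are at most one sampling interval apart" are assumptions for illustration, not the project's confirmed logic.

```python
from typing import List, Tuple


def merge_timestamps(hits: List[float],
                     frame_every_sec: float = 5.0) -> List[Tuple[float, float]]:
    """Merge frame timestamps (seconds) that are at most one sampling interval apart."""
    ranges: List[Tuple[float, float]] = []
    for t in sorted(hits):
        if ranges and t - ranges[-1][1] <= frame_every_sec:
            ranges[-1] = (ranges[-1][0], t)  # extend the current range
        else:
            ranges.append((t, t))            # start a new range
    return ranges


print(merge_timestamps([10, 15, 20, 60, 65, 300]))
# [(10, 20), (60, 65), (300, 300)]
```
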
---

## Architecture

[Figure: architecture diagram]

### Data Flow (Single Query)

[Figure: single-query data flow diagram]

---

## GPU Compute Tiers

ARIA automatically detects available hardware and selects the best backend:

[Figure: GPU compute tier selection diagram]

| Tier | Backend | Search Latency (100K vectors) | When Used |
|------|---------|-------------------------------|-----------|
| 1 | CAGRA graph (hipVS / cuVS) | ~50 μs | AMD ROCm GPU + `hipvs` installed |
| 2 | Flat tensor (hipBLAS matmul) | ~2 ms | Any CUDA/ROCm GPU |
| 3 | NumPy cosine similarity | ~15 ms | CPU-only / free HF Space |

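The tier selection in the table can be read as a simple fallback chain. The sketch below is one possible way to express it; `Backend`, `select_backend`, and `module_installed` are hypothetical names, and the importable module name `hipvs` is assumed from the package name rather than confirmed.

```python
import importlib.util
from enum import Enum


class Backend(Enum):
    CAGRA = 1      # Tier 1: graph index on the GPU
    FLAT_GPU = 2   # Tier 2: brute-force matmul on the GPU
    NUMPY_CPU = 3  # Tier 3: CPU cosine similarity


def select_backend(use_gpu: bool, gpu_available: bool, cagra_installed: bool) -> Backend:
    """Pick the highest tier that the hardware and installed packages allow."""
    if use_gpu and gpu_available:
        return Backend.CAGRA if cagra_installed else Backend.FLAT_GPU
    return Backend.NUMPY_CPU


def module_installed(name: str) -> bool:
    """Return True if a top-level module can be imported, without importing it."""
    return importlib.util.find_spec(name) is not None


# Assumed module name; on a CPU-only machine this falls through to Tier 3.
backend = select_backend(use_gpu=False, gpu_available=False,
                         cagra_installed=module_installed("hipvs"))
print(backend.name)  # NUMPY_CPU
```
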
---

## Project Structure

```
HF_Space_hipVS/
├── app.py              # Gradio UI: 3 tabs (Search, Upload, About)
├── config.py           # Env-aware configuration, auto-scales by hardware
├── embedding.py        # Qwen3-VL / CLIP multimodal embedding + LLM calls
├── vector_store.py     # 3-tier vector store (CAGRA → GPU → CPU) + NVMe swap
├── ingest.py           # Image & video ingestion pipeline
├── search.py           # Query → embed → search → LLM interpret
├── seed_data.py        # Auto-seed from HF datasets on first launch
├── requirements.txt    # HF-native dependencies
├── README.md           # This file
├── .env.example        # Environment variable template
└── data/
    ├── projects/       # Per-project source files and indexes
    │   └── default/
    │       ├── images/
    │       ├── videos/
    │       └── indexes/
    └── models/         # Cached model weights (auto-downloaded)
```

---

## Models

### Embedding (Multimodal: Images + Text in Same Space)

No captioning model is used. Both images and text are embedded directly into a shared vector space by a single vision-language model.

| Model | Params | Dim | Modalities | Tier |
|-------|--------|-----|------------|------|
| `Qwen/Qwen3-VL-Embedding-8B` | 8B | 4096 | image, video frame, text | GPU (production) |
| `Qwen/Qwen3-VL-Embedding-2B` | 2B | 2048 | image, video frame, text | GPU (balanced) |
| `openai/clip-vit-large-patch14` | 428M | 768 | image, text | CPU (fallback) |

### LLM (Search Result Interpretation)

| Model | Params | Architecture | Tier |
|-------|--------|-------------|------|
| `Qwen/Qwen3-35B-A3B` | 35B (3B active) | MoE | Primary: fast inference, capable |
| `Qwen/Qwen3-1.7B` | 1.7B | Dense | Fallback: runs on anything |
| HF Inference API | -- | Serverless | API fallback: zero local compute |

---

## Setup

### Hugging Face Space (Recommended)

1. Create a new Space (Gradio SDK).
2. Push the `HF_Space_hipVS/` directory.
3. Set these **Secrets** in Space Settings:

   | Secret | Required | Description |
   |--------|----------|-------------|
   | `HF_TOKEN` | Optional | HF write token for dataset persistence + Inference API |
   | `USE_GPU` | Optional | Set `true` on GPU-enabled Spaces |
   | `HF_DATASET_REPO` | Optional | e.g. `username/aria-index` for persistent storage |

4. The Space auto-seeds demo content from `flickr30k` (the `SEED_DATASET` default, `nlphuji/flickr30k`) on first launch.

### Local / AMD GPU Server

```bash
cd HF_Space_hipVS
cp .env.example .env          # edit with your settings
pip install -r requirements.txt
python app.py                 # starts on http://localhost:7860
```

For CAGRA acceleration on AMD:
```bash
pip install hipvs cupy-rocm   # enables Tier 1
export USE_GPU=true
python app.py
```

---

## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `USE_GPU` | `false` | Enable GPU acceleration |
| `EMBED_MODEL` | auto-detected | Override embedding model |
| `EMBED_DIM` | auto-detected | Embedding dimensionality |
| `LLM_MODEL` | `Qwen/Qwen3-35B-A3B` | LLM for result summarization |
| `LLM_FALLBACK` | `Qwen/Qwen3-1.7B` | Fallback LLM |
| `FRAME_EVERY_SEC` | `5` | Video frame extraction interval |
| `HF_TOKEN` | -- | HF token for persistence + API |
| `HF_DATASET_REPO` | -- | HF Dataset repo for cold storage |
| `AUTO_SEED` | `true` | Auto-seed on first empty launch |
| `SEED_DATASET` | `nlphuji/flickr30k` | Dataset for auto-seeding |
| `SWAP_PATH` | `data/indexes` | NVMe path for index swap files |

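A configuration loader for a subset of these variables might look like the sketch below. The variable names and defaults come from the table; the `Settings` dataclass, `load_settings`, and the accepted boolean spellings are assumptions, and the real `config.py` may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    # Assumed convention: these spellings count as "true", anything else is false.
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    use_gpu: bool
    embed_model: Optional[str]  # None means "auto-detect"
    llm_model: str
    llm_fallback: str
    frame_every_sec: int
    auto_seed: bool
    seed_dataset: str
    swap_path: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from an environment mapping, using the documented defaults."""
    return Settings(
        use_gpu=_as_bool(env.get("USE_GPU", "false")),
        embed_model=env.get("EMBED_MODEL") or None,
        llm_model=env.get("LLM_MODEL", "Qwen/Qwen3-35B-A3B"),
        llm_fallback=env.get("LLM_FALLBACK", "Qwen/Qwen3-1.7B"),
        frame_every_sec=int(env.get("FRAME_EVERY_SEC", "5")),
        auto_seed=_as_bool(env.get("AUTO_SEED", "true")),
        seed_dataset=env.get("SEED_DATASET", "nlphuji/flickr30k"),
        swap_path=env.get("SWAP_PATH", "data/indexes"),
    )


settings = load_settings({"USE_GPU": "true", "FRAME_EVERY_SEC": "2"})
print(settings.use_gpu, settings.frame_every_sec, settings.swap_path)
# True 2 data/indexes
```
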
---

## How It Works

### 1. Create a Project
Each project is an isolated workspace with its own sources, embeddings, and CAGRA index. You can have a "security-cam" project and a "product-catalog" project running on the same GPU without interference.

### 2. Ingest Media
Upload images or videos. For videos, ffmpeg extracts one representative frame every N seconds (`FRAME_EVERY_SEC`, default `5`). Every image and frame is embedded directly by the vision-language model (Qwen3-VL or CLIP), with no captioning and no text intermediary.

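One common way to sample a frame every N seconds is ffmpeg's `fps` filter. The sketch below only builds the command line; the helper names, output file pattern, and JPEG quality setting are assumptions for illustration, not the options `ingest.py` is known to use.

```python
from pathlib import Path
from typing import List


def frame_extract_cmd(video: Path, out_dir: Path, every_sec: int = 5) -> List[str]:
    """Build an ffmpeg command that writes one JPEG per `every_sec` seconds of video."""
    if every_sec <= 0:
        raise ValueError("every_sec must be positive")
    return [
        "ffmpeg", "-hide_banner", "-i", str(video),
        "-vf", f"fps=1/{every_sec}",      # fps filter: one output frame per interval
        "-q:v", "2",                       # high JPEG quality
        str(out_dir / "frame_%06d.jpg"),   # ffmpeg numbers output frames from 1
    ]


def frame_timestamp(frame_number: int, every_sec: int = 5) -> int:
    """Approximate timestamp in seconds for a 1-based output frame number."""
    return (frame_number - 1) * every_sec


cmd = frame_extract_cmd(Path("clip.mp4"), Path("frames"), every_sec=5)
print(" ".join(cmd))
```

The command can be run with `subprocess.run(cmd, check=True)` once `frames/` exists. The timestamp mapping is approximate because the `fps` filter picks the nearest source frame for each interval.
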
### 3. CAGRA Build
After every insert, the CAGRA graph index is **fully rebuilt** from the updated vector set. This is intentional: ARIA is optimized for query speed, not ingestion throughput. A 100K-vector rebuild takes ~2 s on an MI250X. The built graph is immediately serialized to NVMe.

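The rebuild-then-serialize pattern can be sketched independently of the GPU library. In the example below, `build` is a hypothetical stand-in for the CAGRA build call and `pickle` stands in for the `.cagra` serialization format; only the control flow (full rebuild on every insert, then an atomic write to disk) is the point.

```python
import os
import pickle
import tempfile
from pathlib import Path
from typing import Callable, List, Sequence


class RebuildOnInsertStore:
    """Rebuild the whole index on every insert and persist it right away."""

    def __init__(self, index_path: Path,
                 build: Callable[[List[Sequence[float]]], object]) -> None:
        self.index_path = index_path
        self._build = build  # hypothetical stand-in for the CAGRA build call
        self.vectors: List[Sequence[float]] = []
        self.index: object = None

    def insert(self, new_vectors: List[Sequence[float]]) -> None:
        self.vectors.extend(new_vectors)
        self.index = self._build(self.vectors)  # full rebuild, not an incremental update
        self._persist()

    def _persist(self) -> None:
        # Write to a temp file in the same directory, then rename atomically,
        # so a reader never sees a half-written index file.
        self.index_path.parent.mkdir(parents=True, exist_ok=True)
        fd, tmp_name = tempfile.mkstemp(dir=self.index_path.parent)
        with os.fdopen(fd, "wb") as handle:
            pickle.dump(self.index, handle)
        os.replace(tmp_name, self.index_path)
```

Because the file on disk always matches the latest build, a later restore needs only deserialization, with no re-embedding.
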
### 4. Search
When you search, the query text is embedded by the same model. The CAGRA index is loaded into VRAM (if not already hot) via async pinned-memory DMA, and searched in microseconds. Results are post-processed: video frame hits are merged into time ranges, and the full result set is sent to the LLM for a human-friendly summary.

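To make the search step concrete, the sketch below shows exact cosine-similarity top-k search, which is what the CPU fallback tier computes and what CAGRA approximates on the GPU. It is a dependency-free stand-in written in plain Python rather than NumPy, and `cosine_top_k` is a hypothetical name.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def _normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_top_k(query: Sequence[float],
                 vectors: Dict[str, Sequence[float]],
                 k: int = 5) -> List[Tuple[str, float]]:
    """Return the k (id, score) pairs with the highest cosine similarity to the query."""
    q = _normalize(query)
    scored = (
        (item_id, sum(a * b for a, b in zip(q, _normalize(vec))))
        for item_id, vec in vectors.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


corpus = {"cat.jpg": [1.0, 0.0], "dog.jpg": [0.0, 1.0], "kitten.jpg": [0.9, 0.1]}
print([item_id for item_id, _ in cosine_top_k([1.0, 0.0], corpus, k=2)])
# ['cat.jpg', 'kitten.jpg']
```

The toy two-dimensional vectors stand in for real embeddings; in practice the query and the stored items would come from the same embedding model.
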
### 5. Memory Management
Multiple project indexes coexist by swapping between NVMe and VRAM. Active indexes are kept hot; idle ones are evicted. Restoration from NVMe is a fast deserialization, with no re-embedding and no rebuild.

---

## Design Decisions

| Decision | Rationale |
|----------|-----------|
| **No captioning model** | Vision-language embedding models (Qwen3-VL, CLIP) encode images directly into the same space as text. Captioning adds latency and loses visual information. |
| **Rebuild CAGRA on every insert** | This project is inference-heavy. Query latency matters more than ingestion speed. CAGRA rebuild is fast enough (~2 s for 100K vectors). |
| **NVMe swap, not deletion** | Indexes are expensive to recreate because the embeddings behind them are costly to compute. Serializing to NVMe and restoring is roughly 100x faster than re-embedding from source. |
| **Multi-project isolation** | Real-world use cases involve multiple distinct corpora. Isolation prevents cross-contamination and allows per-project model configuration. |
| **No external database** | Everything is `.npz` + `.cagra` files. Portable, debuggable, no ops overhead. HF Dataset push is an optional backup. |
| **MoE LLM (Qwen3-35B-A3B)** | 35B parameters for quality, but only 3B active per token, so per-token compute is close to that of a 3B dense model while all 35B parameters still have to fit in memory. |

---

## License

Apache 2.0

---

<div align="center">
<i>Built for the AMD Hackathon: ARIA Vision Intelligence Platform</i>
</div>