---
title: ROCKIT Vision Intelligence
emoji: "π"
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: "5.12.0"
app_file: app.py
pinned: true
license: apache-2.0
python_version: 3.12
---
<div align="center">
# ROCKIT Vision Intelligence
### GPU-Accelerated Multimodal Search Engine
*Build isolated projects, ingest images and videos, and search with natural language, all powered by AMD hipVS CAGRA graph indexes with NVMe-backed hot-swap memory management.*
</div>
---
## What Is This?
ROCKIT Vision Intelligence is an **open-source, self-hosted multimodal search engine** that lets you create isolated projects, ingest visual media (images, videos), and query them with natural language. It is built for the **AMD Hackathon** and designed to showcase GPU-accelerated approximate nearest-neighbor (ANN) search using the **hipVS CAGRA** graph index on AMD ROCm hardware.
The core idea is simple:
> **Upload media → Embed everything into a shared vector space → Build a CAGRA graph on the GPU → Search in microseconds → Let an LLM interpret the results.**
There is no database. There are no external API dependencies. Every embedding, every index, and every LLM inference can run **entirely on local hardware**, from a single AMD GPU to a CPU-only Hugging Face free tier.
---
## Key Features
### Multi-Project Isolation
Create **multiple projects**, each with its own sources, indexes, and configuration. Projects are fully isolated: ingesting media into one never affects another.
```
projects/
├── security-cam/          # CCTV footage analysis
│   ├── sources/
│   ├── indexes/
│   └── config.json
├── product-catalog/       # E-commerce image search
│   ├── sources/
│   ├── indexes/
│   └── config.json
└── nature-docs/           # Wildlife video intelligence
    ├── sources/
    ├── indexes/
    └── config.json
```
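The layout above can be scaffolded with a few lines of standard-library Python. This is a minimal sketch, not the project's actual code: the function name `create_project` and the fields written to `config.json` are assumptions made for illustration.

```python
import json
from pathlib import Path


def create_project(root: Path, name: str, description: str = "") -> Path:
    """Create an isolated project directory with its own sources, indexes, and config."""
    project_dir = root / name
    # exist_ok=False: refuse to reuse a directory so projects never share state.
    project_dir.mkdir(parents=True, exist_ok=False)
    (project_dir / "sources").mkdir()
    (project_dir / "indexes").mkdir()
    # Hypothetical config fields; the real config.json schema is not documented here.
    config = {"name": name, "description": description}
    (project_dir / "config.json").write_text(json.dumps(config, indent=2))
    return project_dir


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        project = create_project(Path(tmp) / "projects", "security-cam", "CCTV footage analysis")
        print(sorted(child.name for child in project.iterdir()))
```

Because every project owns its own directory tree, deleting or re-indexing one project is a local filesystem operation that cannot touch another.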
### Native Multimodal Embedding (No Captioning)
Unlike caption-then-embed pipelines, ROCKIT uses **true vision-language embedding models** that encode images, video frames, and text queries into the **same vector space** directly. There is no intermediate captioning step, so no visual information is lost along the way.
| Tier | Model | Dim | Use Case |
|------|-------|-----|----------|
| GPU (large) | `Qwen/Qwen3-VL-Embedding-8B` | 4096 | Highest quality, production |
| GPU (small) | `Qwen/Qwen3-VL-Embedding-2B` | 2048 | Balanced speed / quality |
| CPU fallback | `openai/clip-vit-large-patch14` | 768 | Free-tier HF Spaces, dev |
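
Once images, frames, and text share one vector space, answering a text query reduces to ranking media vectors by similarity to the query vector. The sketch below illustrates that idea with plain-Python cosine similarity; the three-dimensional vectors and media ids are made-up stand-ins for real embeddings (768 to 4096 dimensions), and `rank_media` is a hypothetical helper, not part of ROCKIT.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_media(
    query_vec: Sequence[float],
    media_vecs: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k (media_id, score) pairs, best match first."""
    scored = [(media_id, cosine_similarity(query_vec, vec)) for media_id, vec in media_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings (illustrative values only).
MEDIA = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "car.jpg": [0.0, 0.2, 0.9],
    "clip.mp4@00:05": [0.7, 0.6, 0.1],
}
QUERY = [1.0, 0.2, 0.0]  # pretend this is the embedding of "a dog"

if __name__ == "__main__":
    for media_id, score in rank_media(QUERY, MEDIA):
        print(f"{media_id}: {score:.3f}")
```

The same ranking works for images and video frames alike, because nothing in the search step depends on which modality produced a vector.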
### CAGRA Graph Index (hipVS)
CAGRA is a graph-based ANN index built for GPU-resident data and is among the fastest GPU ANN algorithms available. ROCKIT rebuilds the CAGRA graph on every insert because this project is **optimized for inference and query speed**, not ingestion throughput. A 100K-vector CAGRA rebuild takes ~2 seconds on an MI250X, which is negligible compared to the embedding cost.
### NVMe ↔ VRAM Async Hot-Swap
Indexes live in three tiers of memory. When a project is queried, its index is **asynchronously copied from NVMe into VRAM** via pinned-memory DMA, without blocking other projects. When VRAM fills up, least-recently-used indexes are evicted back to NVMe rather than deleted.
```
┌──────────────┐  async copy   ┌──────────────┐     evict     ┌──────────────┐
│   NVMe SSD   │ ────────────▶ │   GPU VRAM   │ ────────────▶ │   NVMe SSD   │
│ (cold store) │ ────────────▶ │ (hot index)  │               │ (cold store) │
│ .cagra file  │    restore    │ CAGRA graph  │               │ .cagra file  │
└──────────────┘               └──────────────┘               └──────────────┘
                                      ▲
                                   search()
                                      ▲
                                 query vector
```
This design lets you run **dozens of projects** on a single GPU by keeping only the active ones hot. Indexes are evicted only when VRAM is full, so the full VRAM capacity stays in use.
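The eviction policy can be sketched as a small least-recently-used cache. This is an illustrative model under simplifying assumptions: the class name `HotSwapCache` is hypothetical, capacity is counted in index slots rather than bytes, the copy is synchronous, and both "VRAM" and "NVMe" are plain in-memory dictionaries standing in for real device memory and `.cagra` files.

```python
from collections import OrderedDict
from typing import Dict, List


class HotSwapCache:
    """LRU model: 'VRAM' holds a bounded number of indexes, 'NVMe' always holds all of them."""

    def __init__(self, vram_slots: int) -> None:
        if vram_slots < 1:
            raise ValueError("vram_slots must be at least 1")
        self.vram_slots = vram_slots
        self.vram: "OrderedDict[str, bytes]" = OrderedDict()  # hot set, most recent last
        self.nvme: Dict[str, bytes] = {}                       # cold store stand-in
        self.evicted: List[str] = []                           # eviction log, for inspection

    def store(self, project: str, index_blob: bytes) -> None:
        """Persist a freshly built index to the cold store."""
        self.nvme[project] = index_blob

    def acquire(self, project: str) -> bytes:
        """Return the project's index, loading it into the hot set if needed."""
        if project in self.vram:
            self.vram.move_to_end(project)  # mark as most recently used
            return self.vram[project]
        blob = self.nvme[project]  # raises KeyError if the project was never stored
        while len(self.vram) >= self.vram_slots:
            victim, _ = self.vram.popitem(last=False)  # least recently used
            self.evicted.append(victim)  # the copy in self.nvme survives eviction
        self.vram[project] = blob
        return blob


if __name__ == "__main__":
    cache = HotSwapCache(vram_slots=2)
    for name in ("security-cam", "product-catalog", "nature-docs"):
        cache.store(name, name.encode())
    for name in ("security-cam", "product-catalog", "security-cam", "nature-docs"):
        cache.acquire(name)
    print("hot:", list(cache.vram), "evicted:", cache.evicted)
```

A real implementation would budget by index size in bytes and overlap the NVMe read with ongoing searches, but the ordering logic stays the same.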
### LLM-Interpreted Results
Raw vector search returns `(id, score)` tuples. Before showing results to the user, ROCKIT passes them through an LLM that interprets the matches, merges adjacent video timestamps into time ranges, and generates a human-readable summary.
| Tier | Model | Notes |
|------|-------|-------|
| Primary | `Qwen/Qwen3-35B-A3B` | MoE: 35B total, 3B active (fast + smart) |
| Fallback | `Qwen/Qwen3-1.7B` | Tiny, runs on anything |
| API | HF Inference API | Zero local compute, free tier |
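
Merging adjacent video timestamps into time ranges is a simple interval-merge pass that can run before the LLM sees the results. The sketch below assumes that two hits are "adjacent" when they are at most one frame-sampling interval apart (5 seconds by default); that rule and the name `merge_timestamps` are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def merge_timestamps(
    hit_seconds: Sequence[float],
    frame_interval: float = 5.0,
) -> List[Tuple[float, float]]:
    """Merge frame hits that are at most frame_interval apart into (start, end) ranges."""
    ranges: List[Tuple[float, float]] = []
    for t in sorted(set(hit_seconds)):
        if ranges and t - ranges[-1][1] <= frame_interval:
            # Close enough to the previous hit: extend the current range.
            ranges[-1] = (ranges[-1][0], t)
        else:
            # Gap is too large: start a new range.
            ranges.append((t, t))
    return ranges


if __name__ == "__main__":
    print(merge_timestamps([40, 10, 15, 20, 45, 90]))
```

Presenting three ranges instead of six separate frame hits gives the LLM (and the user) a shorter, more meaningful result list.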
---
## Architecture
*Figure: system architecture diagram.*
### Data Flow (Single Query)
*Figure: data flow for a single query.*
---
## GPU Compute Tiers
ROCKIT automatically detects available hardware and selects the best backend:
| Tier | Backend | Search Latency (100K vectors) | When Used |
|------|---------|-------------------------------|-----------|
| 1 | CAGRA graph (hipVS / cuVS) | ~50 μs | AMD ROCm GPU + `hipvs` installed |
| 2 | Flat tensor (hipBLAS matmul) | ~2 ms | Any CUDA/ROCm GPU |
| 3 | NumPy cosine similarity | ~15 ms | CPU-only / free HF Space |
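
The tier selection in the table can be expressed as a small decision function. This is a hedged sketch rather than ROCKIT's actual detection code: it assumes the hipVS bindings are importable as a module named `hipvs` and that GPU visibility is probed through PyTorch, whose `torch.cuda.is_available()` also reports ROCm devices on ROCm builds. The backend labels are made up for the example.

```python
import importlib.util
import os


def module_available(name: str) -> bool:
    """True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def gpu_visible() -> bool:
    """Best-effort GPU probe via PyTorch, if it is installed."""
    if not module_available("torch"):
        return False
    import torch  # imported lazily so CPU-only installs never pay for it

    return bool(torch.cuda.is_available())


def select_backend(use_gpu: bool, has_gpu: bool, has_hipvs: bool) -> str:
    """Map the hardware facts onto one of the three tiers."""
    if use_gpu and has_gpu and has_hipvs:
        return "cagra"       # Tier 1: CAGRA graph index
    if use_gpu and has_gpu:
        return "flat_gpu"    # Tier 2: brute-force matmul on the GPU
    return "numpy_cpu"       # Tier 3: NumPy cosine similarity


def detect_backend() -> str:
    """Read USE_GPU from the environment and probe the machine."""
    use_gpu = os.environ.get("USE_GPU", "false").strip().lower() == "true"
    has_gpu = use_gpu and gpu_visible()
    return select_backend(use_gpu, has_gpu, module_available("hipvs"))
```

Keeping `select_backend` free of side effects makes the fallback order easy to unit-test without any GPU present.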
---
## Project Structure
```
HF_Space_hipVS/
├── app.py                # Gradio UI: 3 tabs (Search, Upload, About)
├── config.py             # Env-aware configuration, auto-scales by hardware
├── embedding.py          # Qwen3-VL / CLIP multimodal embedding + LLM calls
├── vector_store.py       # 3-tier vector store (CAGRA → GPU → CPU) + NVMe swap
├── ingest.py             # Image & video ingestion pipeline
├── search.py             # Query → embed → search → LLM interpret
├── seed_data.py          # Auto-seed from HF datasets on first launch
├── requirements.txt      # HF-native dependencies
├── README.md             # This file
├── .env.example          # Environment variable template
└── data/
    ├── projects/         # Per-project source files and indexes
    │   └── default/
    │       ├── images/
    │       ├── videos/
    │       └── indexes/
    └── models/           # Cached model weights (auto-downloaded)
```
---
## Models
### Embedding (Multimodal: Images + Text in Same Space)
No captioning model is used. Both images and text are embedded directly into a shared vector space by a single vision-language model.
| Model | Params | Dim | Modalities | Tier |
|-------|--------|-----|------------|------|
| `Qwen/Qwen3-VL-Embedding-8B` | 8B | 4096 | image, video frame, text | GPU (production) |
| `Qwen/Qwen3-VL-Embedding-2B` | 2B | 2048 | image, video frame, text | GPU (balanced) |
| `openai/clip-vit-large-patch14` | 428M | 768 | image, text | CPU (fallback) |
### LLM (Search Result Interpretation)
| Model | Params | Architecture | Tier |
|-------|--------|-------------|------|
| `Qwen/Qwen3-35B-A3B` | 35B (3B active) | MoE | Primary: fast inference, smart |
| `Qwen/Qwen3-1.7B` | 1.7B | Dense | Fallback: runs on anything |
| HF Inference API | -- | Serverless | API fallback: zero local compute |
---
## Setup
### Hugging Face Space (Recommended)
1. Create a new Space (Gradio SDK)
2. Push the `HF_Space_hipVS/` directory
3. Set these **Secrets** in Space Settings:
| Secret | Required | Description |
|--------|----------|-------------|
| `HF_TOKEN` | Optional | HF write token for dataset persistence + Inference API |
| `USE_GPU` | Optional | Set `true` on GPU-enabled Spaces |
| `HF_DATASET_REPO` | Optional | e.g. `username/aria-index` for persistent storage |
4. The Space auto-seeds demo content from `flickr30k` on first launch.
### Local / AMD GPU Server
```bash
cd HF_Space_hipVS
cp .env.example .env # edit with your settings
pip install -r requirements.txt
python app.py # starts on http://localhost:7860
```
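For CAGRA acceleration on AMD (the package names below may differ by ROCm release; check the hipVS and CuPy installation guides for the exact wheels):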
```bash
pip install hipvs cupy-rocm # enables Tier 1
export USE_GPU=true
python app.py
```
---
## Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `USE_GPU` | `false` | Enable GPU acceleration |
| `EMBED_MODEL` | auto-detected | Override embedding model |
| `EMBED_DIM` | auto-detected | Embedding dimensionality |
| `LLM_MODEL` | `Qwen/Qwen3-35B-A3B` | LLM for result summarization |
| `LLM_FALLBACK` | `Qwen/Qwen3-1.7B` | Fallback LLM |
| `FRAME_EVERY_SEC` | `5` | Video frame extraction interval |
| `HF_TOKEN` | -- | HF token for persistence + API |
| `HF_DATASET_REPO` | -- | HF Dataset repo for cold storage |
| `AUTO_SEED` | `true` | Auto-seed on first empty launch |
| `SEED_DATASET` | `nlphuji/flickr30k` | Dataset for auto-seeding |
| `SWAP_PATH` | `data/indexes` | NVMe path for index swap files |
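
A configuration loader for these variables might look like the following. It is a minimal sketch, assuming the defaults in the table; the `Settings` dataclass, the `load_settings` function, and the set of strings accepted as "true" are illustrative choices, not the contents of the project's `config.py`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings; everything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    use_gpu: bool
    embed_model: Optional[str]      # None means "auto-detect"
    llm_model: str
    llm_fallback: str
    frame_every_sec: int
    hf_token: Optional[str]
    hf_dataset_repo: Optional[str]
    auto_seed: bool
    seed_dataset: str
    swap_path: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, falling back to the documented defaults."""
    return Settings(
        use_gpu=_as_bool(env.get("USE_GPU", "false")),
        embed_model=env.get("EMBED_MODEL"),
        llm_model=env.get("LLM_MODEL", "Qwen/Qwen3-35B-A3B"),
        llm_fallback=env.get("LLM_FALLBACK", "Qwen/Qwen3-1.7B"),
        frame_every_sec=int(env.get("FRAME_EVERY_SEC", "5")),
        hf_token=env.get("HF_TOKEN"),
        hf_dataset_repo=env.get("HF_DATASET_REPO"),
        auto_seed=_as_bool(env.get("AUTO_SEED", "true")),
        seed_dataset=env.get("SEED_DATASET", "nlphuji/flickr30k"),
        swap_path=env.get("SWAP_PATH", "data/indexes"),
    )
```

Passing the environment in as a mapping keeps the loader testable: `load_settings({"USE_GPU": "true"})` exercises the GPU path without touching the real process environment.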
---
## How It Works
### 1. Create a Project
Each project is an isolated workspace with its own sources, embeddings, and CAGRA index. You can have a "security-cam" project and a "product-catalog" project running on the same GPU without interference.
### 2. Ingest Media
Upload images or videos. For videos, ffmpeg extracts one representative frame every N seconds (`FRAME_EVERY_SEC`, default 5). Every image and frame is embedded directly by the vision-language model (Qwen3-VL or CLIP), with no captioning and no text intermediary.
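One common way to sample a frame every N seconds is ffmpeg's `fps` video filter. The sketch below only builds the command line; whether ROCKIT's `ingest.py` uses this exact filter is an assumption, and the function name, output filename pattern, and JPEG quality setting are illustrative.

```python
from pathlib import Path
from typing import List


def frame_extraction_cmd(video: Path, out_dir: Path, every_sec: int = 5) -> List[str]:
    """Build an ffmpeg command that writes one JPEG per every_sec seconds of video."""
    if every_sec <= 0:
        raise ValueError("every_sec must be a positive number of seconds")
    return [
        "ffmpeg",
        "-i", str(video),
        "-vf", f"fps=1/{every_sec}",       # one output frame per every_sec seconds
        "-q:v", "2",                       # high JPEG quality
        str(out_dir / "frame_%06d.jpg"),   # ffmpeg numbers the output files
    ]


if __name__ == "__main__":
    cmd = frame_extraction_cmd(Path("clip.mp4"), Path("frames"), every_sec=5)
    print(" ".join(cmd))
    # To actually run it (requires ffmpeg on PATH and an existing output directory):
    # import subprocess; subprocess.run(cmd, check=True)
```

Returning the argument list instead of a shell string avoids quoting problems with file names that contain spaces.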
### 3. CAGRA Build
After every insert, the CAGRA graph index is **fully rebuilt** from the updated vector set. This is intentional: ROCKIT is optimized for query speed, not ingestion throughput. A 100K-vector rebuild takes ~2 s on an MI250X. The built graph is immediately serialized to NVMe.
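The insert, rebuild, serialize cycle can be modeled without a GPU. In this sketch the index builder is injectable: `build_flat_index` is a trivial stand-in for a real CAGRA build, JSON stands in for the `.cagra` format, and the class name `RebuildOnInsertStore` is hypothetical. Only the control flow is meant to mirror the description above.

```python
import json
from pathlib import Path
from typing import Any, Callable, Iterable, List, Sequence

Vector = Sequence[float]


def build_flat_index(vectors: List[List[float]]) -> dict:
    """Stand-in builder: a real deployment would build a CAGRA graph here."""
    return {"size": len(vectors), "vectors": vectors}


class RebuildOnInsertStore:
    """Rebuilds the whole index after each insert, then writes it to disk."""

    def __init__(self, index_path: Path, build_fn: Callable[[List[List[float]]], Any] = build_flat_index) -> None:
        self.index_path = Path(index_path)
        self.build_fn = build_fn
        self.vectors: List[List[float]] = []
        self.index: Any = None
        self.builds = 0

    def insert(self, new_vectors: Iterable[Vector]) -> None:
        self.vectors.extend([float(x) for x in vec] for vec in new_vectors)
        # Full rebuild from the complete vector set, never an incremental update.
        self.index = self.build_fn(self.vectors)
        self.builds += 1
        # Persist immediately so an evicted index can be restored without rebuilding.
        self.index_path.write_text(json.dumps(self.index))


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        store = RebuildOnInsertStore(Path(tmp) / "default.index.json")
        store.insert([[1.0, 0.0]])
        store.insert([[0.0, 1.0], [0.5, 0.5]])
        print(store.builds, store.index["size"])
```

Batching several uploads into one `insert` call amortizes the rebuild cost, since the number of rebuilds depends on the number of calls rather than the number of vectors.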
### 4. Search
When you search, the query text is embedded by the same model. The CAGRA index is loaded into VRAM (if not already hot) via async pinned-memory DMA, and searched in microseconds. Results are post-processed: video frame hits are merged into time ranges, and the full result set is sent to the LLM for a human-friendly summary.
### 5. Memory Management
Multiple project indexes coexist by swapping between NVMe and VRAM. Active indexes are kept hot; idle ones are evicted. Restoration from NVMe is a fast deserialization that requires no re-embedding and no rebuild.
---
## Design Decisions
| Decision | Rationale |
|----------|-----------|
| **No captioning model** | Vision-language embedding models (Qwen3-VL, CLIP) encode images directly into the same space as text. Captioning adds latency and loses visual information. |
| **Rebuild CAGRA on every insert** | This project is inference-heavy. Query latency matters more than ingestion speed. CAGRA rebuild is fast enough (~2s for 100K vectors). |
| **NVMe swap, not deletion** | Indexes are expensive to build. Serializing to NVMe and restoring is 100x faster than re-embedding from source. |
| **Multi-project isolation** | Real-world use cases involve multiple distinct corpora. Isolation prevents cross-contamination and allows per-project model configuration. |
| **No external database** | Everything is `.npz` + `.cagra` files. Portable, debuggable, no ops overhead. HF Dataset push is optional backup. |
| **MoE LLM (Qwen3-35B-A3B)** | 35B params for quality, but only 3B active per token: the inference cost of a 3B model with the reasoning of a 35B. |
---
## License
Apache 2.0
---
<div align="center">
<i>Built for the AMD Hackathon: ROCKIT Vision Intelligence Platform</i>
</div>