# 🎬 Video Intelligence Platform
> **Akinator-style video search with RAG, boolean queries, and tree-based refinement.**

Upload any video → the system indexes every frame → then search with natural language to find exact timestamps.
## ✨ Features
- **Natural Language Search**: "person wearing white clothes" → exact timestamps
- **Boolean Queries**: "red car AND bicycle" → timestamps where BOTH appear together
- **Akinator Tree Refinement**: Too many results? The system asks discriminative questions to narrow them down (indoor/outdoor? day/night? etc.)
- **RAG Answers**: Generates grounded answers citing specific timestamps
- **Multi-Channel Fusion**: Combines visual similarity + caption search + object detection
## 🏗️ Architecture

```
Video → Frame Extraction (1 fps)
          │
          ├──▶ Grounding DINO → Object detection with attributes
          │    (detects "person in white shirt", "red car", etc.)
          │    → SQLite structured DB
          │
          ├──▶ SigLIP2 → Frame embeddings (1152-dim)
          │    → FAISS vector index
          │
          └──▶ Gemini 2.0 Flash → Dense captions
               → Gemini text-embedding-004 → Caption embeddings (768-dim)
               → FAISS vector index

Query → Gemini (decompose boolean) → Sub-queries
          │
          ├──▶ Visual search (SigLIP2 FAISS)
          ├──▶ Caption search (Gemini FAISS)
          └──▶ Detection search (SQL)
                │
                ▼
      Score Fusion → Boolean Ops (AND/OR) → Ranked Timestamps
                │
                ▼
      Akinator Refinement (if too many results)
                │
                ▼
      RAG Answer Generation (Gemini)
```
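The Score Fusion → Boolean Ops stage can be sketched in a few lines. This is a minimal illustration under assumed conventions (weighted-sum fusion across channels, `min` for AND, `max` for OR), not the actual code in `query_engine.py`:

```python
def fuse_scores(channels, weights=None):
    """Merge per-channel {timestamp: score} dicts into one fused dict.

    channels: dict of channel_name -> {timestamp: score}
    weights:  optional dict of channel_name -> weight (defaults to 1.0)
    """
    weights = weights or {}
    fused = {}
    for name, scores in channels.items():
        w = weights.get(name, 1.0)
        for ts, s in scores.items():
            fused[ts] = fused.get(ts, 0.0) + w * s
    return fused

def boolean_and(a, b):
    """AND: keep timestamps present in both result sets, scored by the weaker match."""
    return {ts: min(s, b[ts]) for ts, s in a.items() if ts in b}

def boolean_or(a, b):
    """OR: union of timestamps, scored by the stronger match."""
    out = dict(a)
    for ts, s in b.items():
        out[ts] = max(out.get(ts, 0.0), s)
    return out

def rank(results, top_k=10):
    """Sort timestamps by fused score, highest first."""
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

For "red car AND bicycle", each sub-query is fused across channels first, then `boolean_and` keeps only timestamps where both sub-queries matched.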
## 🚀 Quick Start
### 1. Clone the repo
```bash
git clone https://huggingface.co/notRaphael/video-intelligence-platform
cd video-intelligence-platform
```
### 2. Install dependencies
```bash
pip install -r requirements.txt
```
> **Note:** Requires `transformers >= 4.49` (for SigLIP2 support). The system uses ~2.2GB RAM for model loading (SigLIP2 ~1.5GB + Grounding DINO ~657MB). A machine with โฅ8GB RAM is recommended.
### 3. Get a Gemini API key (free)
- Go to https://aistudio.google.com/apikey
- Create a free API key
### 4. Launch the UI
```bash
export GEMINI_API_KEY="your-key-here"
python app.py
```
### 5. Or use the CLI
```bash
# Index a video
python app.py --index video.mp4 --api-key YOUR_KEY
# Search
python app.py --search "red car" --api-key YOUR_KEY
```
## 📊 Models Used

| Component | Model | Size | Runs On |
|---|---|---|---|
| Frame Embeddings | [SigLIP2](https://huggingface.co/google/siglip2-so400m-patch14-384) | ~1.5GB | CPU ✅ / GPU |
| Object Detection | [Grounding DINO](https://huggingface.co/IDEA-Research/grounding-dino-tiny) | ~657MB | CPU ✅ / GPU |
| Captioning | Gemini 2.0 Flash | API | Cloud |
| Text Embeddings | Gemini text-embedding-004 | API | Cloud |
| Query/RAG | Gemini 2.0 Flash | API | Cloud |
## 🔧 API Verification (Apr 2026)
All model APIs verified against **transformers 5.6.2** and **google-genai 1.73.1**:
### SigLIP2 (`google/siglip2-so400m-patch14-384`)
- `AutoModel` / `AutoProcessor` → resolve to `SiglipModel` / `SiglipProcessor`
- `model.get_image_features(**inputs)` returns `BaseModelOutputWithPooling` (`.pooler_output` = `[B, 1152]`)
- Text input **must** use `padding="max_length"` (training requirement)
- Uses sigmoid (not softmax) for similarity scores
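The bullets above can be condensed into a usage sketch. This is illustrative, not the platform's actual `visual_encoders.py`; imports are deferred inside the functions so the snippet loads without `transformers` installed, the model is loaded per call for brevity (cache it in practice), and the `.pooler_output` access matches the return behavior described above:

```python
SIGLIP2 = "google/siglip2-so400m-patch14-384"

def embed_frame(image):
    """Embed one PIL image with SigLIP2; returns a [1, 1152] tensor."""
    import torch
    from transformers import AutoModel, AutoProcessor

    model = AutoModel.from_pretrained(SIGLIP2)          # resolves to SiglipModel
    processor = AutoProcessor.from_pretrained(SIGLIP2)  # resolves to SiglipProcessor
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        out = model.get_image_features(**inputs)
    # Per the API note above: BaseModelOutputWithPooling, .pooler_output = [B, 1152]
    return out.pooler_output

def embed_text(queries):
    """Embed text queries; padding='max_length' is mandatory (training-time setting)."""
    import torch
    from transformers import AutoModel, AutoProcessor

    model = AutoModel.from_pretrained(SIGLIP2)
    processor = AutoProcessor.from_pretrained(SIGLIP2)
    inputs = processor(text=queries, padding="max_length", return_tensors="pt")
    with torch.no_grad():
        return model.get_text_features(**inputs)
```

Frame-query similarity is then a dot product of the two embeddings, passed through a sigmoid rather than a softmax.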
### Grounding DINO (`IDEA-Research/grounding-dino-tiny`)
- `AutoModelForZeroShotObjectDetection` / `AutoProcessor` → resolve to `GroundingDinoForObjectDetection` / `GroundingDinoProcessor`
- Processor accepts text as `str`, `list[str]`, or `list[list[str]]` → auto-converts internally
- `post_process_grounded_object_detection`: `threshold` kwarg (not `box_threshold`), `input_ids` optional
- Returns dict with both `"text_labels"` and `"labels"` keys
- `target_sizes` expects `(height, width)` tuples
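A detection sketch following the notes above (illustrative only; the import is deferred so the snippet loads without `transformers`, and the 0.3 threshold is an assumed default, not the platform's configured value):

```python
DINO = "IDEA-Research/grounding-dino-tiny"

def detect(image, prompts, threshold=0.3):
    """Zero-shot detection on one PIL image.

    prompts: list of phrases, e.g. ["person in white shirt", "red car"]
    """
    import torch
    from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

    processor = AutoProcessor.from_pretrained(DINO)  # GroundingDinoProcessor
    model = AutoModelForZeroShotObjectDetection.from_pretrained(DINO)

    # list[list[str]] form (one phrase list per image); the processor
    # converts other text shapes internally, per the note above.
    inputs = processor(images=image, text=[prompts], return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Kwarg is `threshold` (not `box_threshold`); `input_ids` may be omitted;
    # target_sizes expects (height, width) tuples.
    results = processor.post_process_grounded_object_detection(
        outputs,
        threshold=threshold,
        target_sizes=[(image.height, image.width)],
    )
    # One dict per image, with "boxes", "scores", "labels", and "text_labels".
    return results[0]
```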
### Gemini (`google-genai` SDK)
- Uses `google.genai` (NOT deprecated `google.generativeai`)
- `genai.Client(api_key=...)` → `client.models.generate_content(...)`, `client.models.embed_content(...)`
- `types.Part.from_bytes(data=..., mime_type=...)`, `types.Part.from_text(text=...)`
- Embedding is **text-only** → cannot embed images/video directly
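A sketch of the call pattern above. The prompt text and helper names are illustrative; create the client once with `genai.Client(api_key=...)` (`from google import genai`) and pass it in:

```python
def caption_frame(client, jpeg_bytes, model="gemini-2.0-flash"):
    """Dense-caption one frame via Gemini. Import deferred so the snippet
    loads without the google-genai SDK installed."""
    from google.genai import types

    response = client.models.generate_content(
        model=model,
        contents=[
            types.Part.from_bytes(data=jpeg_bytes, mime_type="image/jpeg"),
            types.Part.from_text(text="Describe everything visible in this frame."),
        ],
    )
    return response.text

def embed_caption(client, caption, model="text-embedding-004"):
    """Embed caption text. Text-only: frames must be captioned first,
    then the caption embedded."""
    result = client.models.embed_content(model=model, contents=caption)
    return result.embeddings[0].values  # 768-dim list of floats
```

This caption-then-embed indirection is why the pipeline captions frames with Gemini 2.0 Flash before indexing: the embedding endpoint cannot consume images or video.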
## 🌳 How the Akinator Tree Works
When a search returns too many results (>10), the system:
1. **Extracts attributes** from all candidate frames (objects, colors, location, time, actions)
2. **Computes information gain** for each attribute (same algorithm as decision trees!)
3. **Asks the most discriminative question** (e.g., "Indoor or outdoor?")
4. **Splits results** based on your answer
5. **Repeats** until results are manageable
```
"Found 47 clips with people"
        │
        ▼
"Indoor or outdoor?" → Outdoor (24 clips)
        │
        ▼
"Daytime or nighttime?" → Daytime (15 clips)
        │
        ▼
"What color clothing?" → White (6 clips) ✅ Done!
```
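Steps 1–3 reduce to a decision-tree split criterion. A self-contained sketch, assuming each candidate frame is represented as a dict of attribute → value; with no ground-truth label to predict, "information gain" reduces to picking the attribute whose values split the candidates most evenly (maximum entropy):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of a list of attribute values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_question(frames, attributes):
    """Pick the attribute whose values split the candidate frames most evenly.

    An even split carries the most information: whichever way the user
    answers, roughly half the candidates are eliminated.
    """
    return max(attributes, key=lambda attr: entropy([f[attr] for f in frames]))

def filter_frames(frames, attribute, answer):
    """Step 4: keep only the frames consistent with the user's answer."""
    return [f for f in frames if f[attribute] == answer]
```

For example, if `location` splits the candidates 50/50 but `time` is "day" for every frame, `best_question` asks about location first, since asking about time would eliminate nothing.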
## 🔮 Future: TPU Training
The platform is designed for future fine-tuning on TPU:
- **VLM2Vec-V2** (Qwen2-VL-7B + LoRA) for domain-specific video embeddings
- **TimeLens** recipe (GRPO/RLVR) for temporal grounding
- Uses `accelerate` + FSDPv2 with bf16 on TPU v5e
## 📚 Based On Research
| Paper | What We Use |
|---|---|
| [AVA](https://arxiv.org/abs/2505.00254) | Event Knowledge Graphs + semantic chunking |
| [VideoRAG](https://arxiv.org/abs/2502.01549) | Dual-channel retrieval architecture |
| [ForeSea](https://arxiv.org/abs/2603.22872) | Attribute-based forensic search |
| [TimeLens](https://arxiv.org/abs/2512.14698) | Temporal grounding recipes |
| [SigLIP2](https://arxiv.org/abs/2502.14786) | Frame-text shared embeddings |
| [Grounding DINO](https://arxiv.org/abs/2303.05499) | Open-vocab attribute detection |
## ⚠️ Troubleshooting
### "Could not import module 'AutoProcessor'"
This means your `transformers` version is too old. SigLIP2 requires `>= 4.49`:
```bash
pip install -U transformers
# Also clear stale cache:
rm -rf ~/.cache/huggingface/hub/models--google--siglip2-so400m-patch14-384
rm -rf ~/.cache/huggingface/hub/models--IDEA-Research--grounding-dino-tiny
```
### Out of Memory during model loading
SigLIP2 (~1.5GB) + Grounding DINO (~657MB) need ~2.2GB RAM just for weights. If your system has < 8GB RAM:
- Set `device="cpu"` in config (default)
- Close other memory-heavy applications
- Consider using only one model at a time
### Gemini rate limiting
The free tier allows ~15 requests/minute. The pipeline adds a 4-second delay between captioning calls. For longer videos, consider:
- Increasing `caption_every_n` (e.g., 5 = caption every 5th frame)
- Using a paid Gemini API tier
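To gauge that trade-off, here is a back-of-the-envelope helper (hypothetical, not part of the codebase) using the 4-second spacing quoted above:

```python
def captioning_time_estimate(video_seconds, fps=1, caption_every_n=1, delay_s=4.0):
    """Rough wall-clock estimate (seconds) for the captioning pass alone.

    At 1 fps extraction, a video yields ~video_seconds frames; every
    caption_every_n-th frame triggers one Gemini call, spaced delay_s apart.
    """
    total_frames = int(video_seconds * fps)
    calls = -(-total_frames // caption_every_n)  # ceiling division
    return calls * delay_s
```

A 10-minute video at the defaults means 600 calls, about 40 minutes of captioning; raising `caption_every_n` to 5 cuts that to roughly 8 minutes.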
## 📁 Project Structure

```
video_intelligence/
├── __init__.py          # Package init
├── config.py            # Configuration dataclass
├── frame_extractor.py   # OpenCV frame extraction
├── gemini_client.py     # Gemini API (captioning, embedding, RAG, query decomposition)
├── visual_encoders.py   # SigLIP2 + Grounding DINO
├── index_store.py       # SQLite + FAISS index
├── query_engine.py      # Multi-channel search + boolean ops + fusion
├── akinator.py          # Decision-tree refinement
├── pipeline.py          # End-to-end indexing orchestrator
└── app.py               # Gradio UI
app.py                   # Entry point (CLI + UI)
requirements.txt
```
## License
MIT