# 🎬 Video Intelligence Platform
> **Akinator-style video search with RAG, boolean queries, and tree-based refinement.**

Upload any video → the system indexes every frame → then search with natural language to find exact timestamps.
## ✨ Features
- **Natural Language Search**: "person wearing white clothes" → exact timestamps
- **Boolean Queries**: "red car AND bicycle" → timestamps where BOTH appear together
- **Akinator Tree Refinement**: Too many results? The system asks discriminative questions to narrow them down (indoor/outdoor? day/night? etc.)
- **RAG Answers**: Generates grounded answers citing specific timestamps
- **Multi-Channel Fusion**: Combines visual similarity + caption search + object detection
## 🏗️ Architecture

```
Video → Frame Extraction (1 fps)
          │
          ├──▶ Grounding DINO → Object detection with attributes
          │    (detects "person in white shirt", "red car", etc.)
          │    → SQLite structured DB
          │
          ├──▶ SigLIP2 → Frame embeddings (1152-dim)
          │    → FAISS vector index
          │
          └──▶ Gemini 2.0 Flash → Dense captions
               → Gemini text-embedding-004 → Caption embeddings (768-dim)
               → FAISS vector index

Query → Gemini (decompose boolean) → Sub-queries
          │
          ├──▶ Visual search (SigLIP2 FAISS)
          ├──▶ Caption search (Gemini FAISS)
          └──▶ Detection search (SQL)
                │
                ▼
      Score Fusion → Boolean Ops (AND/OR) → Ranked Timestamps
                │
                ▼
      Akinator Refinement (if too many results)
                │
                ▼
      RAG Answer Generation (Gemini)
```
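The Score Fusion → Boolean Ops stage can be sketched in a few lines. This is a minimal illustration under assumed conventions (weighted-sum fusion across channels, `min` for AND, `max` for OR), not the actual code in `query_engine.py`:

```python
def fuse_scores(channels, weights=None):
    """Merge per-channel {timestamp: score} dicts into one fused dict.

    channels: dict of channel_name -> {timestamp: score}
    weights:  optional dict of channel_name -> weight (defaults to 1.0)
    """
    weights = weights or {}
    fused = {}
    for name, scores in channels.items():
        w = weights.get(name, 1.0)
        for ts, s in scores.items():
            fused[ts] = fused.get(ts, 0.0) + w * s
    return fused

def boolean_and(a, b):
    """AND: keep timestamps present in both result sets, scored by the weaker match."""
    return {ts: min(s, b[ts]) for ts, s in a.items() if ts in b}

def boolean_or(a, b):
    """OR: union of timestamps, scored by the stronger match."""
    out = dict(a)
    for ts, s in b.items():
        out[ts] = max(out.get(ts, 0.0), s)
    return out

def rank(results, top_k=10):
    """Sort timestamps by fused score, highest first."""
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

For "red car AND bicycle", each sub-query is fused across channels first, then `boolean_and` keeps only timestamps where both sub-queries matched.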
## 🚀 Quick Start
### 1. Clone the repo
```bash
git clone https://huggingface.co/notRaphael/video-intelligence-platform
cd video-intelligence-platform
```
### 2. Install dependencies
```bash
pip install -r requirements.txt
```
> **Note:** Requires `transformers >= 4.49` (for SigLIP2 support). The system uses ~2.2GB RAM for model loading (SigLIP2 ~1.5GB + Grounding DINO ~657MB). A machine with โฅ8GB RAM is recommended.
### 3. Get a Gemini API key (free)
- Go to https://aistudio.google.com/apikey
- Create a free API key
### 4. Launch the UI
```bash
export GEMINI_API_KEY="your-key-here"
python app.py
```
### 5. Or use the CLI
```bash
# Index a video
python app.py --index video.mp4 --api-key YOUR_KEY
# Search
python app.py --search "red car" --api-key YOUR_KEY
```
## 📊 Models Used

| Component | Model | Size | Runs On |
|---|---|---|---|
| Frame Embeddings | [SigLIP2](https://huggingface.co/google/siglip2-so400m-patch14-384) | ~1.5GB | CPU ✅ / GPU |
| Object Detection | [Grounding DINO](https://huggingface.co/IDEA-Research/grounding-dino-tiny) | ~657MB | CPU ✅ / GPU |
| Captioning | Gemini 2.0 Flash | API | Cloud |
| Text Embeddings | Gemini text-embedding-004 | API | Cloud |
| Query/RAG | Gemini 2.0 Flash | API | Cloud |
## 🔧 API Verification (Apr 2026)
All model APIs verified against **transformers 5.6.2** and **google-genai 1.73.1**:
### SigLIP2 (`google/siglip2-so400m-patch14-384`)
- `AutoModel` / `AutoProcessor` → resolve to `SiglipModel` / `SiglipProcessor`
- `model.get_image_features(**inputs)` returns `BaseModelOutputWithPooling` (`.pooler_output` = `[B, 1152]`)
- Text input **must** use `padding="max_length"` (training requirement)
- Uses sigmoid (not softmax) for similarity scores
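The bullets above can be condensed into a usage sketch. This is illustrative, not the platform's actual `visual_encoders.py`; imports are deferred inside the functions so the snippet loads without `transformers` installed, the model is loaded per call for brevity (cache it in practice), and the `.pooler_output` access matches the return behavior described above:

```python
SIGLIP2 = "google/siglip2-so400m-patch14-384"

def embed_frame(image):
    """Embed one PIL image with SigLIP2; returns a [1, 1152] tensor."""
    import torch
    from transformers import AutoModel, AutoProcessor

    model = AutoModel.from_pretrained(SIGLIP2)          # resolves to SiglipModel
    processor = AutoProcessor.from_pretrained(SIGLIP2)  # resolves to SiglipProcessor
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        out = model.get_image_features(**inputs)
    # Per the API note above: BaseModelOutputWithPooling, .pooler_output = [B, 1152]
    return out.pooler_output

def embed_text(queries):
    """Embed text queries; padding='max_length' is mandatory (training-time setting)."""
    import torch
    from transformers import AutoModel, AutoProcessor

    model = AutoModel.from_pretrained(SIGLIP2)
    processor = AutoProcessor.from_pretrained(SIGLIP2)
    inputs = processor(text=queries, padding="max_length", return_tensors="pt")
    with torch.no_grad():
        return model.get_text_features(**inputs)
```

Frame-query similarity is then a dot product of the two embeddings, passed through a sigmoid rather than a softmax.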
### Grounding DINO (`IDEA-Research/grounding-dino-tiny`)
- `AutoModelForZeroShotObjectDetection` / `AutoProcessor` → resolve to `GroundingDinoForObjectDetection` / `GroundingDinoProcessor`
- Processor accepts text as `str`, `list[str]`, or `list[list[str]]` → auto-converts internally
- `post_process_grounded_object_detection`: `threshold` kwarg (not `box_threshold`), `input_ids` optional
- Returns dict with both `"text_labels"` and `"labels"` keys
- `target_sizes` expects `(height, width)` tuples
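A detection sketch following the notes above (illustrative only; the import is deferred so the snippet loads without `transformers`, and the 0.3 threshold is an assumed default, not the platform's configured value):

```python
DINO = "IDEA-Research/grounding-dino-tiny"

def detect(image, prompts, threshold=0.3):
    """Zero-shot detection on one PIL image.

    prompts: list of phrases, e.g. ["person in white shirt", "red car"]
    """
    import torch
    from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

    processor = AutoProcessor.from_pretrained(DINO)  # GroundingDinoProcessor
    model = AutoModelForZeroShotObjectDetection.from_pretrained(DINO)

    # list[list[str]] form (one phrase list per image); the processor
    # converts other text shapes internally, per the note above.
    inputs = processor(images=image, text=[prompts], return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Kwarg is `threshold` (not `box_threshold`); `input_ids` may be omitted;
    # target_sizes expects (height, width) tuples.
    results = processor.post_process_grounded_object_detection(
        outputs,
        threshold=threshold,
        target_sizes=[(image.height, image.width)],
    )
    # One dict per image, with "boxes", "scores", "labels", and "text_labels".
    return results[0]
```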
### Gemini (`google-genai` SDK)
- Uses `google.genai` (NOT deprecated `google.generativeai`)
- `genai.Client(api_key=...)` → `client.models.generate_content(...)`, `client.models.embed_content(...)`
- `types.Part.from_bytes(data=..., mime_type=...)`, `types.Part.from_text(text=...)`
- Embedding is **text-only** → cannot embed images/video directly
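A sketch of the call pattern above. The prompt text and helper names are illustrative; create the client once with `genai.Client(api_key=...)` (`from google import genai`) and pass it in:

```python
def caption_frame(client, jpeg_bytes, model="gemini-2.0-flash"):
    """Dense-caption one frame via Gemini. Import deferred so the snippet
    loads without the google-genai SDK installed."""
    from google.genai import types

    response = client.models.generate_content(
        model=model,
        contents=[
            types.Part.from_bytes(data=jpeg_bytes, mime_type="image/jpeg"),
            types.Part.from_text(text="Describe everything visible in this frame."),
        ],
    )
    return response.text

def embed_caption(client, caption, model="text-embedding-004"):
    """Embed caption text. Text-only: frames must be captioned first,
    then the caption embedded."""
    result = client.models.embed_content(model=model, contents=caption)
    return result.embeddings[0].values  # 768-dim list of floats
```

This caption-then-embed indirection is why the pipeline captions frames with Gemini 2.0 Flash before indexing: the embedding endpoint cannot consume images or video.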
## 🌳 How the Akinator Tree Works
When a search returns too many results (>10), the system:
1. **Extracts attributes** from all candidate frames (objects, colors, location, time, actions)
2. **Computes information gain** for each attribute (same algorithm as decision trees!)
3. **Asks the most discriminative question** (e.g., "Indoor or outdoor?")
4. **Splits results** based on your answer
5. **Repeats** until results are manageable
```
"Found 47 clips with people"
        │
        ▼
"Indoor or outdoor?" → Outdoor (24 clips)
        │
        ▼
"Daytime or nighttime?" → Daytime (15 clips)
        │
        ▼
"What color clothing?" → White (6 clips) ✅ Done!
```
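Steps 1–3 reduce to a decision-tree split criterion. A self-contained sketch, assuming each candidate frame is represented as a dict of attribute → value; with no ground-truth label to predict, "information gain" reduces to picking the attribute whose values split the candidates most evenly (maximum entropy):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of a list of attribute values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_question(frames, attributes):
    """Pick the attribute whose values split the candidate frames most evenly.

    An even split carries the most information: whichever way the user
    answers, roughly half the candidates are eliminated.
    """
    return max(attributes, key=lambda attr: entropy([f[attr] for f in frames]))

def filter_frames(frames, attribute, answer):
    """Step 4: keep only the frames consistent with the user's answer."""
    return [f for f in frames if f[attribute] == answer]
```

For example, if `location` splits the candidates 50/50 but `time` is "day" for every frame, `best_question` asks about location first, since asking about time would eliminate nothing.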
## 🔮 Future: TPU Training
The platform is designed for future fine-tuning on TPU:
- **VLM2Vec-V2** (Qwen2-VL-7B + LoRA) for domain-specific video embeddings
- **TimeLens** recipe (GRPO/RLVR) for temporal grounding
- Uses `accelerate` + FSDPv2 with bf16 on TPU v5e
## 📚 Based On Research
| Paper | What We Use |
|---|---|
| [AVA](https://arxiv.org/abs/2505.00254) | Event Knowledge Graphs + semantic chunking |
| [VideoRAG](https://arxiv.org/abs/2502.01549) | Dual-channel retrieval architecture |
| [ForeSea](https://arxiv.org/abs/2603.22872) | Attribute-based forensic search |
| [TimeLens](https://arxiv.org/abs/2512.14698) | Temporal grounding recipes |
| [SigLIP2](https://arxiv.org/abs/2502.14786) | Frame-text shared embeddings |
| [Grounding DINO](https://arxiv.org/abs/2303.05499) | Open-vocab attribute detection |
## ⚠️ Troubleshooting
### "Could not import module 'AutoProcessor'"
This means your `transformers` version is too old. SigLIP2 requires `>= 4.49`:
```bash
pip install -U transformers
# Also clear stale cache:
rm -rf ~/.cache/huggingface/hub/models--google--siglip2-so400m-patch14-384
rm -rf ~/.cache/huggingface/hub/models--IDEA-Research--grounding-dino-tiny
```
### Out of Memory during model loading
SigLIP2 (~1.5GB) + Grounding DINO (~657MB) need ~2.2GB RAM just for weights. If your system has < 8GB RAM:
- Set `device="cpu"` in config (default)
- Close other memory-heavy applications
- Consider using only one model at a time
### Gemini rate limiting
The free tier allows ~15 requests/minute. The pipeline adds a 4-second delay between captioning calls. For longer videos, consider:
- Increasing `caption_every_n` (e.g., 5 = caption every 5th frame)
- Using a paid Gemini API tier
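To gauge that trade-off, here is a back-of-the-envelope helper (hypothetical, not part of the codebase) using the 4-second spacing quoted above:

```python
def captioning_time_estimate(video_seconds, fps=1, caption_every_n=1, delay_s=4.0):
    """Rough wall-clock estimate (seconds) for the captioning pass alone.

    At 1 fps extraction, a video yields ~video_seconds frames; every
    caption_every_n-th frame triggers one Gemini call, spaced delay_s apart.
    """
    total_frames = int(video_seconds * fps)
    calls = -(-total_frames // caption_every_n)  # ceiling division
    return calls * delay_s
```

A 10-minute video at the defaults means 600 calls, about 40 minutes of captioning; raising `caption_every_n` to 5 cuts that to roughly 8 minutes.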
## 📁 Project Structure

```
video_intelligence/
├── __init__.py          # Package init
├── config.py            # Configuration dataclass
├── frame_extractor.py   # OpenCV frame extraction
├── gemini_client.py     # Gemini API (captioning, embedding, RAG, query decomposition)
├── visual_encoders.py   # SigLIP2 + Grounding DINO
├── index_store.py       # SQLite + FAISS index
├── query_engine.py      # Multi-channel search + boolean ops + fusion
├── akinator.py          # Decision-tree refinement
├── pipeline.py          # End-to-end indexing orchestrator
└── app.py               # Gradio UI
app.py                   # Entry point (CLI + UI)
requirements.txt
```
## License
MIT