# 🎬 Video Intelligence Platform

> **Akinator-style video search with RAG, boolean queries, and tree-based refinement.**

Upload any video → the system indexes every frame → then search with natural language to find exact timestamps.

## 🌟 Features

- **Natural Language Search**: "person wearing white clothes" → exact timestamps
- **Boolean Queries**: "red car AND bicycle" → timestamps where BOTH appear together
- **Akinator Tree Refinement**: Too many results? The system asks discriminative questions to narrow them down (indoor/outdoor? day/night? etc.)
- **RAG Answers**: Generates grounded answers citing specific timestamps
- **Multi-Channel Fusion**: Combines visual similarity + caption search + object detection
## ๐Ÿ—๏ธ Architecture
```
Video โ†’ Frame Extraction (1fps)
โ”‚
โ”œโ”€โ–บ Grounding DINO โ†’ Object detection with attributes
โ”‚ (detects "person in white shirt", "red car", etc.)
โ”‚ โ†’ SQLite structured DB
โ”‚
โ”œโ”€โ–บ SigLIP2 โ†’ Frame embeddings (1152-dim)
โ”‚ โ†’ FAISS vector index
โ”‚
โ””โ”€โ–บ Gemini 2.0 Flash โ†’ Dense captions
โ†’ Gemini text-embedding-004 โ†’ Caption embeddings (768-dim)
โ†’ FAISS vector index
Query โ†’ Gemini (decompose boolean) โ†’ Sub-queries
โ”‚
โ”œโ”€โ–บ Visual search (SigLIP2 FAISS)
โ”œโ”€โ–บ Caption search (Gemini FAISS)
โ””โ”€โ–บ Detection search (SQL)
โ”‚
โ–ผ
Score Fusion โ†’ Boolean Ops (AND/OR) โ†’ Ranked Timestamps
โ”‚
โ–ผ
Akinator Refinement (if too many results)
โ”‚
โ–ผ
RAG Answer Generation (Gemini)
```
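The fusion and boolean stages above can be sketched in a few lines. This is an illustrative sketch, not the repo's actual `query_engine.py`: the helper names and the min/max combination rules are assumptions about one reasonable implementation.

```python
from collections import defaultdict

def fuse_scores(channel_hits, weights):
    """Weighted sum of per-channel scores, keyed by timestamp.

    channel_hits: {"visual": [(ts, score), ...], "caption": [...], ...}
    """
    fused = defaultdict(float)
    for channel, hits in channel_hits.items():
        w = weights.get(channel, 1.0)
        for ts, score in hits:
            fused[ts] += w * score
    return dict(fused)

def boolean_and(per_query_scores):
    """AND: keep only timestamps matched by every sub-query,
    scored by the weakest match (min)."""
    common = set.intersection(*(set(s) for s in per_query_scores))
    return {ts: min(s[ts] for s in per_query_scores) for ts in common}

def boolean_or(per_query_scores):
    """OR: union of timestamps, keeping each one's best score (max)."""
    merged = {}
    for s in per_query_scores:
        for ts, score in s.items():
            merged[ts] = max(merged.get(ts, 0.0), score)
    return merged
```

Using min for AND makes a conjunction only as strong as its weakest sub-query, which penalizes frames where one term barely matches.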
## 🚀 Quick Start
### 1. Clone the repo
```bash
git clone https://huggingface.co/notRaphael/video-intelligence-platform
cd video-intelligence-platform
```
### 2. Install dependencies
```bash
pip install -r requirements.txt
```
> **Note:** Requires `transformers >= 4.49` (for SigLIP2 support). The system uses ~2.2GB RAM for model loading (SigLIP2 ~1.5GB + Grounding DINO ~657MB). A machine with ≥8GB RAM is recommended.
### 3. Get a Gemini API key (free)
- Go to https://aistudio.google.com/apikey
- Create a free API key
### 4. Launch the UI
```bash
export GEMINI_API_KEY="your-key-here"
python app.py
```
### 5. Or use the CLI
```bash
# Index a video
python app.py --index video.mp4 --api-key YOUR_KEY
# Search
python app.py --search "red car" --api-key YOUR_KEY
```
## 📋 Models Used

| Component | Model | Size | Runs On |
|---|---|---|---|
| Frame Embeddings | [SigLIP2](https://huggingface.co/google/siglip2-so400m-patch14-384) | ~1.5GB | CPU ✅ / GPU |
| Object Detection | [Grounding DINO](https://huggingface.co/IDEA-Research/grounding-dino-tiny) | ~657MB | CPU ✅ / GPU |
| Captioning | Gemini 2.0 Flash | API | Cloud |
| Text Embeddings | Gemini text-embedding-004 | API | Cloud |
| Query/RAG | Gemini 2.0 Flash | API | Cloud |
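Both local models feed FAISS vector indexes. As a rough illustration of what the index lookup does, here is a brute-force numpy stand-in for a cosine-similarity search (the real `index_store.py` presumably uses FAISS itself, e.g. an `IndexFlatIP`; the helper names here are hypothetical):

```python
import numpy as np

def build_index(embeddings):
    """L2-normalize rows so a dot product equals cosine similarity,
    mirroring inner-product search over normalized vectors."""
    embs = np.asarray(embeddings, dtype=np.float32)
    return embs / np.linalg.norm(embs, axis=1, keepdims=True)

def search(index, query, k=5):
    """Return (row ids, similarity scores) of the k nearest frames."""
    q = np.asarray(query, dtype=np.float32)
    q = q / np.linalg.norm(q)
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return top.tolist(), scores[top].tolist()
```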
## 🔧 API Verification (Apr 2026)
All model APIs verified against **transformers 5.6.2** and **google-genai 1.73.1**:
### SigLIP2 (`google/siglip2-so400m-patch14-384`)
- `AutoModel` / `AutoProcessor` → resolve to `SiglipModel` / `SiglipProcessor`
- `model.get_image_features(**inputs)` returns `BaseModelOutputWithPooling` (`.pooler_output` = `[B, 1152]`)
- Text input **must** use `padding="max_length"` (training requirement)
- Uses sigmoid (not softmax) for similarity scores
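A minimal sketch of this call pattern. The helper name is hypothetical (not part of this repo), imports are deferred so the heavy dependencies stay optional, and the `getattr` hedges across transformers releases that return either an output object or a bare tensor:

```python
def embed_frames_and_text(images, texts):
    """Embed a batch of PIL images and a list of strings with SigLIP2."""
    import torch
    from transformers import AutoModel, AutoProcessor

    model_id = "google/siglip2-so400m-patch14-384"
    processor = AutoProcessor.from_pretrained(model_id)  # -> SiglipProcessor
    model = AutoModel.from_pretrained(model_id)          # -> SiglipModel

    with torch.no_grad():
        img_inputs = processor(images=images, return_tensors="pt")
        img_out = model.get_image_features(**img_inputs)
        # pooled [B, 1152] features; some releases wrap them, some don't
        img_feats = getattr(img_out, "pooler_output", img_out)

        # SigLIP text inputs must be padded to max_length (training setup)
        txt_inputs = processor(text=texts, padding="max_length",
                               return_tensors="pt")
        txt_out = model.get_text_features(**txt_inputs)
        txt_feats = getattr(txt_out, "pooler_output", txt_out)
    return img_feats, txt_feats
```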
### Grounding DINO (`IDEA-Research/grounding-dino-tiny`)
- `AutoModelForZeroShotObjectDetection` / `AutoProcessor` → resolve to `GroundingDinoForObjectDetection` / `GroundingDinoProcessor`
- Processor accepts text as `str`, `list[str]`, or `list[list[str]]` (auto-converted internally)
- `post_process_grounded_object_detection`: `threshold` kwarg (not `box_threshold`), `input_ids` optional
- Returns a dict with both `"text_labels"` and `"labels"` keys
- `target_sizes` expects `(height, width)` tuples
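A minimal sketch of this call pattern (the `detect` helper is hypothetical; imports are deferred so the heavy dependencies stay optional):

```python
def detect(image, phrases, threshold=0.3):
    """Open-vocabulary detection on one PIL image.
    `phrases` is a list like ["red car", "bicycle"]."""
    import torch
    from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

    model_id = "IDEA-Research/grounding-dino-tiny"
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

    # the processor also accepts a plain str or list[str]
    inputs = processor(images=image, text=[phrases], return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # note: `threshold`, not `box_threshold`; target_sizes is (height, width)
    results = processor.post_process_grounded_object_detection(
        outputs,
        threshold=threshold,
        target_sizes=[(image.height, image.width)],
    )[0]
    return list(zip(results["text_labels"],
                    results["scores"].tolist(),
                    results["boxes"].tolist()))
```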
### Gemini (`google-genai` SDK)
- Uses `google.genai` (NOT the deprecated `google.generativeai`)
- `genai.Client(api_key=...)` → `client.models.generate_content(...)`, `client.models.embed_content(...)`
- `types.Part.from_bytes(data=..., mime_type=...)`, `types.Part.from_text(text=...)`
- Embedding is **text-only**: it cannot embed images or video directly
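A minimal sketch of these call patterns (the helper names are hypothetical, and a real `GEMINI_API_KEY` is needed to actually run them):

```python
def caption_frame(jpeg_bytes, api_key):
    """Dense-caption one frame with Gemini 2.0 Flash."""
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=api_key)
    resp = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=[
            types.Part.from_bytes(data=jpeg_bytes, mime_type="image/jpeg"),
            types.Part.from_text(text="Describe this frame in one dense sentence."),
        ],
    )
    return resp.text

def embed_caption(text, api_key):
    """Text-only embedding via text-embedding-004 (768-dim)."""
    from google import genai

    client = genai.Client(api_key=api_key)
    resp = client.models.embed_content(model="text-embedding-004",
                                       contents=text)
    return resp.embeddings[0].values
```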
## 🌳 How the Akinator Tree Works
When a search returns too many results (>10), the system:
1. **Extracts attributes** from all candidate frames (objects, colors, location, time, actions)
2. **Computes information gain** for each attribute (same algorithm as decision trees!)
3. **Asks the most discriminative question** (e.g., "Indoor or outdoor?")
4. **Splits results** based on your answer
5. **Repeats** until results are manageable
```
"Found 47 clips with people"
        │
        ▼
"Indoor or outdoor?" → Outdoor (24 clips)
        │
        ▼
"Daytime or nighttime?" → Daytime (15 clips)
        │
        ▼
"What color clothing?" → White (6 clips) ✅ Done!
```
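The question-selection step above can be sketched with a simple entropy measure: with a uniform prior over candidates, the information gained by asking about an attribute equals the entropy of the split its values induce, so the most even split wins. The helper names below are hypothetical, not the repo's `akinator.py`:

```python
import math
from collections import Counter

def split_entropy(frames, attribute):
    """Entropy (in bits) of the partition an attribute's values induce
    over the candidate frames. Higher = more even split = better question."""
    counts = Counter(f[attribute] for f in frames)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_question(frames, attributes):
    """Ask about the attribute whose answer splits candidates most evenly."""
    return max(attributes, key=lambda a: split_entropy(frames, a))
```

A 50/50 indoor/outdoor split yields 1 bit of gain and halves the candidate set regardless of the answer; an attribute every frame shares yields 0 bits and is never asked.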
## 🔮 Future: TPU Training
The platform is designed for future fine-tuning on TPU:
- **VLM2Vec-V2** (Qwen2-VL-7B + LoRA) for domain-specific video embeddings
- **TimeLens** recipe (GRPO/RLVR) for temporal grounding
- Uses `accelerate` + FSDPv2 with bf16 on TPU v5e
## 📚 Based On Research
| Paper | What We Use |
|---|---|
| [AVA](https://arxiv.org/abs/2505.00254) | Event Knowledge Graphs + semantic chunking |
| [VideoRAG](https://arxiv.org/abs/2502.01549) | Dual-channel retrieval architecture |
| [ForeSea](https://arxiv.org/abs/2603.22872) | Attribute-based forensic search |
| [TimeLens](https://arxiv.org/abs/2512.14698) | Temporal grounding recipes |
| [SigLIP2](https://arxiv.org/abs/2502.14786) | Frame-text shared embeddings |
| [Grounding DINO](https://arxiv.org/abs/2303.05499) | Open-vocab attribute detection |
## โš ๏ธ Troubleshooting
### "Could not import module 'AutoProcessor'"
This means your `transformers` version is too old. SigLIP2 requires `>= 4.49`:
```bash
pip install -U transformers
# Also clear stale cache:
rm -rf ~/.cache/huggingface/hub/models--google--siglip2-so400m-patch14-384
rm -rf ~/.cache/huggingface/hub/models--IDEA-Research--grounding-dino-tiny
```
### Out of Memory during model loading
SigLIP2 (~1.5GB) + Grounding DINO (~657MB) need ~2.2GB RAM just for weights. If your system has < 8GB RAM:
- Set `device="cpu"` in config (default)
- Close other memory-heavy applications
- Consider using only one model at a time
### Gemini rate limiting
The free tier allows ~15 requests/minute. The pipeline adds a 4-second delay between captioning calls. For longer videos, consider:
- Increasing `caption_every_n` (e.g., 5 = caption every 5th frame)
- Using a paid Gemini API tier
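One way to stay under the free-tier limit is to pace calls explicitly, as the pipeline's 4-second delay does. A minimal sketch (the `paced` helper is illustrative, not part of the pipeline):

```python
import time

def paced(calls, min_interval=4.0):
    """Run zero-argument callables, sleeping so successive calls are at
    least `min_interval` seconds apart (free tier: ~15 requests/min)."""
    last = float("-inf")
    for call in calls:
        wait = min_interval - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)
        last = time.monotonic()
        yield call()
```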
## ๐Ÿ“ Project Structure
```
video_intelligence/
├── __init__.py          # Package init
├── config.py            # Configuration dataclass
├── frame_extractor.py   # OpenCV frame extraction
├── gemini_client.py     # Gemini API (captioning, embedding, RAG, query decomposition)
├── visual_encoders.py   # SigLIP2 + Grounding DINO
├── index_store.py       # SQLite + FAISS index
├── query_engine.py      # Multi-channel search + boolean ops + fusion
├── akinator.py          # Decision-tree refinement
├── pipeline.py          # End-to-end indexing orchestrator
└── app.py               # Gradio UI
app.py                   # Entry point (CLI + UI)
requirements.txt
```
## License
MIT