# 🎬 Video Intelligence Platform
|
|
| > **Akinator-style video search with RAG, boolean queries, and tree-based refinement.** |
|
|
Upload any video → the system indexes every frame → then search with natural language to find exact timestamps.
|
|
## 🔍 Features
|
|
- **Natural Language Search**: "person wearing white clothes" → exact timestamps
- **Boolean Queries**: "red car AND bicycle" → timestamps where BOTH appear together
| - **Akinator Tree Refinement**: Too many results? The system asks discriminative questions to narrow down (indoor/outdoor? day/night? etc.) |
| - **RAG Answers**: Generates grounded answers citing specific timestamps |
| - **Multi-Channel Fusion**: Combines visual similarity + caption search + object detection |
|
|
## 🏗️ Architecture
|
|
| ``` |
Video → Frame Extraction (1fps)
        │
        ├─► Grounding DINO → Object detection with attributes
        │   (detects "person in white shirt", "red car", etc.)
        │   → SQLite structured DB
        │
        ├─► SigLIP2 → Frame embeddings (1152-dim)
        │   → FAISS vector index
        │
        └─► Gemini 2.0 Flash → Dense captions
            Gemini text-embedding-004 → Caption embeddings (768-dim)
            → FAISS vector index

Query → Gemini (decompose boolean) → Sub-queries
        │
        ├─► Visual search (SigLIP2 FAISS)
        ├─► Caption search (Gemini FAISS)
        └─► Detection search (SQL)
              │
              ▼
        Score Fusion → Boolean Ops (AND/OR) → Ranked Timestamps
              │
              ▼
        Akinator Refinement (if too many results)
              │
              ▼
        RAG Answer Generation (Gemini)
| ``` |
|
|
## 🚀 Quick Start
|
|
| ### 1. Clone the repo |
| ```bash |
| git clone https://huggingface.co/notRaphael/video-intelligence-platform |
| cd video-intelligence-platform |
| ``` |
|
|
| ### 2. Install dependencies |
| ```bash |
| pip install -r requirements.txt |
| ``` |
|
|
> **Note:** Requires `transformers >= 4.49` (for SigLIP2 support). The system uses ~2.2GB RAM for model loading (SigLIP2 ~1.5GB + Grounding DINO ~657MB). A machine with ≥8GB RAM is recommended.
|
|
| ### 3. Get a Gemini API key (free) |
| - Go to https://aistudio.google.com/apikey |
| - Create a free API key |
|
|
| ### 4. Launch the UI |
| ```bash |
| export GEMINI_API_KEY="your-key-here" |
| python app.py |
| ``` |
|
|
| ### 5. Or use the CLI |
| ```bash |
| # Index a video |
| python app.py --index video.mp4 --api-key YOUR_KEY |
| |
| # Search |
| python app.py --search "red car" --api-key YOUR_KEY |
| ``` |
|
|
## 📊 Models Used
|
|
| | Component | Model | Size | Runs On | |
| |---|---|---|---| |
| Frame Embeddings | [SigLIP2](https://huggingface.co/google/siglip2-so400m-patch14-384) | ~1.5GB | CPU ✅ / GPU |
| Object Detection | [Grounding DINO](https://huggingface.co/IDEA-Research/grounding-dino-tiny) | ~657MB | CPU ✅ / GPU |
| | Captioning | Gemini 2.0 Flash | API | Cloud | |
| | Text Embeddings | Gemini text-embedding-004 | API | Cloud | |
| | Query/RAG | Gemini 2.0 Flash | API | Cloud | |
|
|
## 🔧 API Verification (Apr 2026)
|
|
| All model APIs verified against **transformers 5.6.2** and **google-genai 1.73.1**: |
|
|
| ### SigLIP2 (`google/siglip2-so400m-patch14-384`) |
- `AutoModel` / `AutoProcessor` → resolve to `SiglipModel` / `SiglipProcessor`
| - `model.get_image_features(**inputs)` returns `BaseModelOutputWithPooling` (`.pooler_output` = `[B, 1152]`) |
| - Text input **must** use `padding="max_length"` (training requirement) |
| - Uses sigmoid (not softmax) for similarity scores |
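
A minimal sketch of these calls (frame path and query text are placeholders; the `.pooler_output` access follows the return type described in the notes above, and we assume `get_text_features` mirrors `get_image_features`):

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "google/siglip2-so400m-patch14-384"
model = AutoModel.from_pretrained(model_id)          # -> SiglipModel
processor = AutoProcessor.from_pretrained(model_id)  # -> SiglipProcessor

image = Image.open("frame_0001.jpg")  # hypothetical extracted frame
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    # Per the note above, returns BaseModelOutputWithPooling; take .pooler_output.
    image_emb = model.get_image_features(**inputs).pooler_output  # [1, 1152]

# Text side: padding="max_length" is mandatory (matches how SigLIP was trained).
text_inputs = processor(text=["a red car"], padding="max_length", return_tensors="pt")
with torch.no_grad():
    # Assumed to mirror get_image_features (see lead-in).
    text_emb = model.get_text_features(**text_inputs).pooler_output  # [1, 1152]

# Sigmoid, not softmax: each (image, text) pair is scored independently.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
score = torch.sigmoid(image_emb @ text_emb.T * model.logit_scale.exp() + model.logit_bias)
```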
|
|
| ### Grounding DINO (`IDEA-Research/grounding-dino-tiny`) |
- `AutoModelForZeroShotObjectDetection` / `AutoProcessor` → resolve to `GroundingDinoForObjectDetection` / `GroundingDinoProcessor`
- Processor accepts text as `str`, `list[str]`, or `list[list[str]]` → auto-converts internally
| - `post_process_grounded_object_detection`: `threshold` kwarg (not `box_threshold`), `input_ids` optional |
| - Returns dict with both `"text_labels"` and `"labels"` keys |
| - `target_sizes` expects `(height, width)` tuples |
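
A sketch of the call pattern those notes describe (frame path and prompt phrases are placeholders; `input_ids` is omitted since the notes say it is optional):

```python
import torch
from PIL import Image
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("frame_0001.jpg")  # hypothetical extracted frame
# Plain-string text works; phrases are separated by " . "
inputs = processor(images=image, text="person in white shirt . red car .", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Note: `threshold`, not `box_threshold`; target_sizes expects (height, width).
results = processor.post_process_grounded_object_detection(
    outputs,
    threshold=0.3,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],  # PIL .size is (width, height), so reverse it
)[0]
for box, score, label in zip(results["boxes"], results["scores"], results["text_labels"]):
    print(f"{label}: {score:.2f} @ {box.tolist()}")
```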
|
|
| ### Gemini (`google-genai` SDK) |
- Uses `google.genai` (not the deprecated `google.generativeai`)
- `genai.Client(api_key=...)` → `client.models.generate_content(...)`, `client.models.embed_content(...)`
| - `types.Part.from_bytes(data=..., mime_type=...)`, `types.Part.from_text(text=...)` |
- Embedding is **text-only**: it cannot embed images or video directly
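
Putting those calls together, a minimal caption-then-embed sketch (the frame path and prompt are placeholders):

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_KEY")  # or set GEMINI_API_KEY in the environment

# Caption one frame: image bytes go in as an inline Part.
with open("frame_0001.jpg", "rb") as f:  # hypothetical extracted frame
    frame_bytes = f.read()
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        types.Part.from_bytes(data=frame_bytes, mime_type="image/jpeg"),
        types.Part.from_text(text="Describe this frame in one dense sentence."),
    ],
)
print(response.text)

# Embedding is text-only: embed the caption, never the image itself.
emb = client.models.embed_content(
    model="text-embedding-004",
    contents=response.text,
)
vector = emb.embeddings[0].values  # 768 floats
```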
|
|
## 🌳 How the Akinator Tree Works
|
|
| When a search returns too many results (>10), the system: |
|
|
| 1. **Extracts attributes** from all candidate frames (objects, colors, location, time, actions) |
2. **Computes information gain** for each attribute (same algorithm as decision trees; see the sketch after the example below)
| 3. **Asks the most discriminative question** (e.g., "Indoor or outdoor?") |
| 4. **Splits results** based on your answer |
| 5. **Repeats** until results are manageable |
|
|
| ``` |
| "Found 47 clips with people" |
| โ |
| โผ |
| "Indoor or outdoor?" โ Outdoor (24 clips) |
| โ |
| โผ |
| "Daytime or nighttime?" โ Daytime (15 clips) |
| โ |
| โผ |
| "What color clothing?" โ White (6 clips) โ
Done! |
| ``` |
|
|
## 🔮 Future: TPU Training
|
|
| The platform is designed for future fine-tuning on TPU: |
|
|
| - **VLM2Vec-V2** (Qwen2-VL-7B + LoRA) for domain-specific video embeddings |
| - **TimeLens** recipe (GRPO/RLVR) for temporal grounding |
| - Uses `accelerate` + FSDPv2 with bf16 on TPU v5e |
|
|
## 📚 Based On Research
|
|
| | Paper | What We Use | |
| |---|---| |
| | [AVA](https://arxiv.org/abs/2505.00254) | Event Knowledge Graphs + semantic chunking | |
| | [VideoRAG](https://arxiv.org/abs/2502.01549) | Dual-channel retrieval architecture | |
| | [ForeSea](https://arxiv.org/abs/2603.22872) | Attribute-based forensic search | |
| | [TimeLens](https://arxiv.org/abs/2512.14698) | Temporal grounding recipes | |
| | [SigLIP2](https://arxiv.org/abs/2502.14786) | Frame-text shared embeddings | |
| | [Grounding DINO](https://arxiv.org/abs/2303.05499) | Open-vocab attribute detection | |
|
|
## ⚠️ Troubleshooting
|
|
| ### "Could not import module 'AutoProcessor'" |
| This means your `transformers` version is too old. SigLIP2 requires `>= 4.49`: |
| ```bash |
| pip install -U transformers |
| # Also clear stale cache: |
| rm -rf ~/.cache/huggingface/hub/models--google--siglip2-so400m-patch14-384 |
| rm -rf ~/.cache/huggingface/hub/models--IDEA-Research--grounding-dino-tiny |
| ``` |
|
|
| ### Out of Memory during model loading |
SigLIP2 (~1.5GB) + Grounding DINO (~657MB) need ~2.2GB RAM just for weights. If your system has < 8GB RAM:
| - Set `device="cpu"` in config (default) |
| - Close other memory-heavy applications |
| - Consider using only one model at a time |
|
|
| ### Gemini rate limiting |
| The free tier allows ~15 requests/minute. The pipeline adds a 4-second delay between captioning calls. For longer videos, consider: |
| - Increasing `caption_every_n` (e.g., 5 = caption every 5th frame) |
| - Using a paid Gemini API tier |
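
As a rough sketch of the throttling pattern (the loop and `caption_frame` helper are hypothetical; the 4-second delay and `caption_every_n` knob come from the text above):

```python
import time

def caption_frame(frame):
    """Placeholder for the Gemini captioning call (see gemini_client.py)."""
    return f"caption for {frame}"

frames = [f"frame_{i:04d}.jpg" for i in range(20)]  # hypothetical extracted frames
CAPTION_EVERY_N = 5    # caption only every 5th frame
CAPTION_DELAY_S = 4.0  # stays under the free tier's ~15 requests/minute

for i, frame in enumerate(frames):
    if i % CAPTION_EVERY_N != 0:
        continue
    print(caption_frame(frame))
    time.sleep(CAPTION_DELAY_S)
```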
|
|
## 📁 Project Structure
|
|
| ``` |
| video_intelligence/ |
├── __init__.py          # Package init
├── config.py            # Configuration dataclass
├── frame_extractor.py   # OpenCV frame extraction
├── gemini_client.py     # Gemini API (captioning, embedding, RAG, query decomposition)
├── visual_encoders.py   # SigLIP2 + Grounding DINO
├── index_store.py       # SQLite + FAISS index
├── query_engine.py      # Multi-channel search + boolean ops + fusion
├── akinator.py          # Decision-tree refinement
├── pipeline.py          # End-to-end indexing orchestrator
└── app.py               # Gradio UI
app.py                   # Entry point (CLI + UI)
| requirements.txt |
| ``` |
|
|
| ## License |
|
|
| MIT |
|
|