# 🎬 Video Intelligence Platform

> **Akinator-style video search with RAG, boolean queries, and tree-based refinement.**
> Upload any video → the system indexes every frame → then search with natural language to find exact timestamps.

## 🌟 Features

- **Natural Language Search**: "person wearing white clothes" → exact timestamps
- **Boolean Queries**: "red car AND bicycle" → timestamps where BOTH appear together
- **Akinator Tree Refinement**: Too many results? The system asks discriminative questions to narrow them down (indoor/outdoor? day/night? etc.)
- **RAG Answers**: Generates grounded answers citing specific timestamps
- **Multi-Channel Fusion**: Combines visual similarity + caption search + object detection

## 🏗️ Architecture

```
Video → Frame Extraction (1 fps)
          │
          ├─► Grounding DINO → Object detection with attributes
          │     (detects "person in white shirt", "red car", etc.)
          │     → SQLite structured DB
          │
          ├─► SigLIP2 → Frame embeddings (1152-dim)
          │     → FAISS vector index
          │
          └─► Gemini 2.0 Flash → Dense captions
                → Gemini text-embedding-004 → Caption embeddings (768-dim)
                → FAISS vector index

Query → Gemini (decompose boolean) → Sub-queries
          │
          ├─► Visual search (SigLIP2 FAISS)
          ├─► Caption search (Gemini FAISS)
          └─► Detection search (SQL)
          │
          ▼
     Score Fusion → Boolean Ops (AND/OR) → Ranked Timestamps
          │
          ▼
     Akinator Refinement (if too many results)
          │
          ▼
     RAG Answer Generation (Gemini)
```

## 🚀 Quick Start

### 1. Clone the repo

```bash
git clone https://huggingface.co/notRaphael/video-intelligence-platform
cd video-intelligence-platform
```

### 2. Install dependencies

```bash
pip install -r requirements.txt
```

> **Note:** Requires `transformers >= 4.49` (for SigLIP2 support). The system uses ~2.2GB RAM for model loading (SigLIP2 ~1.5GB + Grounding DINO ~657MB). A machine with ≥8GB RAM is recommended.

### 3. Get a Gemini API key (free)

- Go to https://aistudio.google.com/apikey
- Create a free API key

### 4. Launch the UI

```bash
export GEMINI_API_KEY="your-key-here"
python app.py
```

### 5. Or use the CLI

```bash
# Index a video
python app.py --index video.mp4 --api-key YOUR_KEY

# Search
python app.py --search "red car" --api-key YOUR_KEY
```

## 📋 Models Used

| Component | Model | Size | Runs On |
|---|---|---|---|
| Frame Embeddings | [SigLIP2](https://huggingface.co/google/siglip2-so400m-patch14-384) | ~1.5GB | CPU ✅ / GPU |
| Object Detection | [Grounding DINO](https://huggingface.co/IDEA-Research/grounding-dino-tiny) | ~657MB | CPU ✅ / GPU |
| Captioning | Gemini 2.0 Flash | API | Cloud |
| Text Embeddings | Gemini text-embedding-004 | API | Cloud |
| Query/RAG | Gemini 2.0 Flash | API | Cloud |

## 🔧 API Verification (Apr 2026)

All model APIs verified against **transformers 5.6.2** and **google-genai 1.73.1**:

### SigLIP2 (`google/siglip2-so400m-patch14-384`)

- `AutoModel` / `AutoProcessor` → resolve to `SiglipModel` / `SiglipProcessor`
- `model.get_image_features(**inputs)` returns `BaseModelOutputWithPooling` (`.pooler_output` = `[B, 1152]`)
- Text input **must** use `padding="max_length"` (training requirement)
- Uses sigmoid (not softmax) for similarity scores

### Grounding DINO (`IDEA-Research/grounding-dino-tiny`)

- `AutoModelForZeroShotObjectDetection` / `AutoProcessor` → resolve to `GroundingDinoForObjectDetection` / `GroundingDinoProcessor`
- Processor accepts text as `str`, `list[str]`, or `list[list[str]]` — auto-converts internally
- `post_process_grounded_object_detection`: `threshold` kwarg (not `box_threshold`), `input_ids` optional
- Returns dict with both `"text_labels"` and `"labels"` keys
- `target_sizes` expects `(height, width)` tuples

### Gemini (`google-genai` SDK)

- Uses `google.genai` (NOT the deprecated `google.generativeai`)
- `genai.Client(api_key=...)` → `client.models.generate_content(...)`, `client.models.embed_content(...)`
- `types.Part.from_bytes(data=..., mime_type=...)`, `types.Part.from_text(text=...)`
- Embedding is **text-only** — cannot embed images/video directly
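For reference, a minimal SigLIP2 call sketch matching the notes above. The frame path and text prompts are illustrative; the points being demonstrated are `padding="max_length"` and sigmoid scoring. The `getattr` guard covers both return types of `get_image_features` across transformers versions:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "google/siglip2-so400m-patch14-384"
model = AutoModel.from_pretrained(model_id)          # resolves to SiglipModel
processor = AutoProcessor.from_pretrained(model_id)  # resolves to SiglipProcessor

image = Image.open("frame_0001.jpg")  # illustrative frame path
texts = ["a red car", "a person in a white shirt"]

# Text input must use padding="max_length" (SigLIP training requirement)
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
scores = torch.sigmoid(outputs.logits_per_image)  # sigmoid, not softmax

# 1152-dim frame embedding for the FAISS index
with torch.no_grad():
    feats = model.get_image_features(**processor(images=image, return_tensors="pt"))
feats = getattr(feats, "pooler_output", feats)  # tensor or BaseModelOutputWithPooling, depending on version
print(scores.shape, feats.shape)  # (1, 2), (1, 1152)
```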
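Likewise for Grounding DINO, a sketch of the detection call (the prompt and threshold values are illustrative; the prompt follows the lowercase, period-separated phrase convention):

```python
import torch
from PIL import Image
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

model_id = "IDEA-Research/grounding-dino-tiny"
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)  # GroundingDinoForObjectDetection
processor = AutoProcessor.from_pretrained(model_id)                    # GroundingDinoProcessor

image = Image.open("frame_0001.jpg")  # illustrative frame path
text = "a person in a white shirt. a red car."  # lowercase phrases, each ending with a period

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    threshold=0.3,                    # `threshold`, not `box_threshold`; input_ids optional
    target_sizes=[image.size[::-1]],  # PIL gives (width, height); expects (height, width)
)[0]
for label, score, box in zip(results["text_labels"], results["scores"], results["boxes"]):
    print(f"{label}: {score:.2f} @ {box.tolist()}")
```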
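And the corresponding `google-genai` calls for captioning a frame and embedding its caption (prompt wording and frame path are illustrative):

```python
import os
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

# Dense-caption a frame (image bytes in, text out)
with open("frame_0001.jpg", "rb") as f:  # illustrative frame path
    frame = f.read()
caption = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        types.Part.from_bytes(data=frame, mime_type="image/jpeg"),
        types.Part.from_text(text="Describe this frame in one dense sentence."),
    ],
)

# Embed the caption text (embedding is text-only; frames reach the index via their captions)
emb = client.models.embed_content(model="text-embedding-004", contents=caption.text)
print(caption.text)
print(len(emb.embeddings[0].values))  # 768
```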
## 🌳 How the Akinator Tree Works

When a search returns too many results (>10), the system:

1. **Extracts attributes** from all candidate frames (objects, colors, location, time, actions)
2. **Computes information gain** for each attribute (the same algorithm as decision trees; see the sketch after this section)
3. **Asks the most discriminative question** (e.g., "Indoor or outdoor?")
4. **Splits results** based on your answer
5. **Repeats** until the result set is manageable

```
"Found 47 clips with people"
        │
        ▼
"Indoor or outdoor?" → Outdoor (24 clips)
        │
        ▼
"Daytime or nighttime?" → Daytime (15 clips)
        │
        ▼
"What color clothing?" → White (6 clips) ✅ Done!
```
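A minimal sketch of the attribute-selection step. Treating each candidate frame as its own outcome, maximizing information gain reduces to maximizing the entropy of the split an attribute induces; the attribute names and frame dicts below are illustrative, not the platform's actual schema:

```python
import math
from collections import Counter

def split_entropy(frames, attribute):
    """Entropy of the partition induced by asking about `attribute`.
    Higher = the possible answers divide the candidate set more evenly,
    so the question eliminates more candidates in expectation."""
    values = [f[attribute] for f in frames]
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_question(frames, attributes):
    """Pick the most discriminative attribute to ask about next."""
    return max(attributes, key=lambda a: split_entropy(frames, a))

# Illustrative candidate frames (hypothetical attribute schema)
frames = [
    {"location": "outdoor", "time": "day",   "clothing": "white"},
    {"location": "outdoor", "time": "day",   "clothing": "dark"},
    {"location": "outdoor", "time": "night", "clothing": "dark"},
    {"location": "indoor",  "time": "day",   "clothing": "white"},
]
print(best_question(frames, ["location", "time", "clothing"]))  # -> "clothing"
```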
## 🔮 Future: TPU Training

The platform is designed for future fine-tuning on TPU:

- **VLM2Vec-V2** (Qwen2-VL-7B + LoRA) for domain-specific video embeddings
- **TimeLens** recipe (GRPO/RLVR) for temporal grounding
- Uses `accelerate` + FSDPv2 with bf16 on TPU v5e

## 📚 Based On Research

| Paper | What We Use |
|---|---|
| [AVA](https://arxiv.org/abs/2505.00254) | Event Knowledge Graphs + semantic chunking |
| [VideoRAG](https://arxiv.org/abs/2502.01549) | Dual-channel retrieval architecture |
| [ForeSea](https://arxiv.org/abs/2603.22872) | Attribute-based forensic search |
| [TimeLens](https://arxiv.org/abs/2512.14698) | Temporal grounding recipes |
| [SigLIP2](https://arxiv.org/abs/2502.14786) | Frame-text shared embeddings |
| [Grounding DINO](https://arxiv.org/abs/2303.05499) | Open-vocab attribute detection |

## ⚠️ Troubleshooting

### "Could not import module 'AutoProcessor'"

Your `transformers` version is too old; SigLIP2 requires `>= 4.49`:

```bash
pip install -U transformers

# Also clear stale cache:
rm -rf ~/.cache/huggingface/hub/models--google--siglip2-so400m-patch14-384
rm -rf ~/.cache/huggingface/hub/models--IDEA-Research--grounding-dino-tiny
```

### Out of Memory during model loading

SigLIP2 (~1.5GB) + Grounding DINO (~657MB) need ~2.2GB of RAM for the weights alone. If your system has < 8GB RAM:

- Set `device="cpu"` in the config (the default)
- Close other memory-heavy applications
- Consider loading only one model at a time

### Gemini rate limiting

The free tier allows ~15 requests/minute, so the pipeline adds a 4-second delay between captioning calls. For longer videos, consider:

- Increasing `caption_every_n` (e.g., 5 = caption every 5th frame)
- Using a paid Gemini API tier

## 📁 Project Structure

```
video_intelligence/
├── __init__.py          # Package init
├── config.py            # Configuration dataclass
├── frame_extractor.py   # OpenCV frame extraction
├── gemini_client.py     # Gemini API (captioning, embedding, RAG, query decomposition)
├── visual_encoders.py   # SigLIP2 + Grounding DINO
├── index_store.py       # SQLite + FAISS index
├── query_engine.py      # Multi-channel search + boolean ops + fusion
├── akinator.py          # Decision-tree refinement
├── pipeline.py          # End-to-end indexing orchestrator
└── app.py               # Gradio UI
app.py                   # Entry point (CLI + UI)
requirements.txt
```

## License

MIT