# 🎬 Video Intelligence Platform
|
|
| > **Akinator-style video search with RAG, boolean queries, and tree-based refinement.** |
|
|
Upload any video → the system indexes every frame → then search with natural language to find exact timestamps.
|
|
## 🔍 Features
|
|
- **Natural Language Search**: "person wearing white clothes" → exact timestamps
- **Boolean Queries**: "red car AND bicycle" → timestamps where BOTH appear together
| - **Akinator Tree Refinement**: Too many results? The system asks discriminative questions to narrow down (indoor/outdoor? day/night? etc.) |
| - **RAG Answers**: Generates grounded answers citing specific timestamps |
| - **Multi-Channel Fusion**: Combines visual similarity + caption search + object detection |
|
|
## 🏗️ Architecture
|
|
| ``` |
Video → Frame Extraction (1fps)
        │
        ├─► Grounding DINO → Object detection with attributes
        │   (detects "person in white shirt", "red car", etc.)
        │   → SQLite structured DB
        │
        ├─► SigLIP2 → Frame embeddings (1152-dim)
        │   → FAISS vector index
        │
        └─► Gemini 2.0 Flash → Dense captions
            Gemini text-embedding-004 → Caption embeddings (768-dim)
            → FAISS vector index

Query → Gemini (decompose boolean) → Sub-queries
        │
        ├─► Visual search (SigLIP2 FAISS)
        ├─► Caption search (Gemini FAISS)
        └─► Detection search (SQL)
              │
              ▼
        Score Fusion → Boolean Ops (AND/OR) → Ranked Timestamps
              │
              ▼
        Akinator Refinement (if too many results)
              │
              ▼
        RAG Answer Generation (Gemini)
| ``` |
|
|
## 🚀 Quick Start
|
|
| ### 1. Clone the repo |
| ```bash |
| git clone https://huggingface.co/notRaphael/video-intelligence-platform |
| cd video-intelligence-platform |
| ``` |
|
|
| ### 2. Install dependencies |
| ```bash |
| pip install -r requirements.txt |
| ``` |
|
|
> **Note:** Requires `transformers >= 4.49` (for SigLIP2 support). The system uses ~2.2GB RAM for model loading (SigLIP2 ~1.5GB + Grounding DINO ~657MB). A machine with ≥8GB RAM is recommended.
|
|
| ### 3. Get a Gemini API key (free) |
| - Go to https://aistudio.google.com/apikey |
| - Create a free API key |
|
|
| ### 4. Launch the UI |
| ```bash |
| export GEMINI_API_KEY="your-key-here" |
| python app.py |
| ``` |
|
|
| ### 5. Or use the CLI |
| ```bash |
| # Index a video |
| python app.py --index video.mp4 --api-key YOUR_KEY |
| |
| # Search |
| python app.py --search "red car" --api-key YOUR_KEY |
| ``` |
|
|
## 📊 Models Used
|
|
| | Component | Model | Size | Runs On | |
| |---|---|---|---| |
| Frame Embeddings | [SigLIP2](https://huggingface.co/google/siglip2-so400m-patch14-384) | ~1.5GB | CPU ✅ / GPU |
| Object Detection | [Grounding DINO](https://huggingface.co/IDEA-Research/grounding-dino-tiny) | ~657MB | CPU ✅ / GPU |
| | Captioning | Gemini 2.0 Flash | API | Cloud | |
| | Text Embeddings | Gemini text-embedding-004 | API | Cloud | |
| | Query/RAG | Gemini 2.0 Flash | API | Cloud | |
|
|
## 🔧 API Verification (Apr 2026)
|
|
| All model APIs verified against **transformers 5.6.2** and **google-genai 1.73.1**: |
|
|
| ### SigLIP2 (`google/siglip2-so400m-patch14-384`) |
- `AutoModel` / `AutoProcessor` → resolve to `SiglipModel` / `SiglipProcessor`
| - `model.get_image_features(**inputs)` returns `BaseModelOutputWithPooling` (`.pooler_output` = `[B, 1152]`) |
| - Text input **must** use `padding="max_length"` (training requirement) |
| - Uses sigmoid (not softmax) for similarity scores |
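
A minimal sketch of these calls (frame path and query text are placeholders; the `.pooler_output` access follows the return type described in the notes above, and we assume `get_text_features` mirrors `get_image_features`):

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "google/siglip2-so400m-patch14-384"
model = AutoModel.from_pretrained(model_id)          # -> SiglipModel
processor = AutoProcessor.from_pretrained(model_id)  # -> SiglipProcessor

image = Image.open("frame_0001.jpg")  # hypothetical extracted frame
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    # Per the note above, returns BaseModelOutputWithPooling; take .pooler_output.
    image_emb = model.get_image_features(**inputs).pooler_output  # [1, 1152]

# Text side: padding="max_length" is mandatory (matches how SigLIP was trained).
text_inputs = processor(text=["a red car"], padding="max_length", return_tensors="pt")
with torch.no_grad():
    # Assumed to mirror get_image_features (see lead-in).
    text_emb = model.get_text_features(**text_inputs).pooler_output  # [1, 1152]

# Sigmoid, not softmax: each (image, text) pair is scored independently.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
score = torch.sigmoid(image_emb @ text_emb.T * model.logit_scale.exp() + model.logit_bias)
```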
|
|
| ### Grounding DINO (`IDEA-Research/grounding-dino-tiny`) |
- `AutoModelForZeroShotObjectDetection` / `AutoProcessor` → resolve to `GroundingDinoForObjectDetection` / `GroundingDinoProcessor`
- Processor accepts text as `str`, `list[str]`, or `list[list[str]]` → auto-converts internally
| - `post_process_grounded_object_detection`: `threshold` kwarg (not `box_threshold`), `input_ids` optional |
| - Returns dict with both `"text_labels"` and `"labels"` keys |
| - `target_sizes` expects `(height, width)` tuples |
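
A sketch of the call pattern those notes describe (frame path and prompt phrases are placeholders; `input_ids` is omitted since the notes say it is optional):

```python
import torch
from PIL import Image
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("frame_0001.jpg")  # hypothetical extracted frame
# Plain-string text works; phrases are separated by " . "
inputs = processor(images=image, text="person in white shirt . red car .", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Note: `threshold`, not `box_threshold`; target_sizes expects (height, width).
results = processor.post_process_grounded_object_detection(
    outputs,
    threshold=0.3,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],  # PIL .size is (width, height), so reverse it
)[0]
for box, score, label in zip(results["boxes"], results["scores"], results["text_labels"]):
    print(f"{label}: {score:.2f} @ {box.tolist()}")
```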
|
|
| ### Gemini (`google-genai` SDK) |
- Uses `google.genai` (not the deprecated `google.generativeai`)
- `genai.Client(api_key=...)` → `client.models.generate_content(...)`, `client.models.embed_content(...)`
| - `types.Part.from_bytes(data=..., mime_type=...)`, `types.Part.from_text(text=...)` |
- Embedding is **text-only**: it cannot embed images or video directly
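
Putting those calls together, a minimal caption-then-embed sketch (the frame path and prompt are placeholders):

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_KEY")  # or set GEMINI_API_KEY in the environment

# Caption one frame: image bytes go in as an inline Part.
with open("frame_0001.jpg", "rb") as f:  # hypothetical extracted frame
    frame_bytes = f.read()
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[
        types.Part.from_bytes(data=frame_bytes, mime_type="image/jpeg"),
        types.Part.from_text(text="Describe this frame in one dense sentence."),
    ],
)
print(response.text)

# Embedding is text-only: embed the caption, never the image itself.
emb = client.models.embed_content(
    model="text-embedding-004",
    contents=response.text,
)
vector = emb.embeddings[0].values  # 768 floats
```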
|
|
## 🌳 How the Akinator Tree Works
|
|
| When a search returns too many results (>10), the system: |
|
|
| 1. **Extracts attributes** from all candidate frames (objects, colors, location, time, actions) |
2. **Computes information gain** for each attribute (same algorithm as decision trees; see the sketch after the example below)
| 3. **Asks the most discriminative question** (e.g., "Indoor or outdoor?") |
| 4. **Splits results** based on your answer |
| 5. **Repeats** until results are manageable |
|
|
| ``` |
| "Found 47 clips with people" |
| โ |
| โผ |
| "Indoor or outdoor?" โ Outdoor (24 clips) |
| โ |
| โผ |
| "Daytime or nighttime?" โ Daytime (15 clips) |
| โ |
| โผ |
| "What color clothing?" โ White (6 clips) โ
Done! |
| ``` |
|
|
## 🔮 Future: TPU Training
|
|
| The platform is designed for future fine-tuning on TPU: |
|
|
| - **VLM2Vec-V2** (Qwen2-VL-7B + LoRA) for domain-specific video embeddings |
| - **TimeLens** recipe (GRPO/RLVR) for temporal grounding |
| - Uses `accelerate` + FSDPv2 with bf16 on TPU v5e |
|
|
## 📚 Based On Research
|
|
| | Paper | What We Use | |
| |---|---| |
| | [AVA](https://arxiv.org/abs/2505.00254) | Event Knowledge Graphs + semantic chunking | |
| | [VideoRAG](https://arxiv.org/abs/2502.01549) | Dual-channel retrieval architecture | |
| | [ForeSea](https://arxiv.org/abs/2603.22872) | Attribute-based forensic search | |
| | [TimeLens](https://arxiv.org/abs/2512.14698) | Temporal grounding recipes | |
| | [SigLIP2](https://arxiv.org/abs/2502.14786) | Frame-text shared embeddings | |
| | [Grounding DINO](https://arxiv.org/abs/2303.05499) | Open-vocab attribute detection | |
|
|
## ⚠️ Troubleshooting
|
|
| ### "Could not import module 'AutoProcessor'" |
| This means your `transformers` version is too old. SigLIP2 requires `>= 4.49`: |
| ```bash |
| pip install -U transformers |
| # Also clear stale cache: |
| rm -rf ~/.cache/huggingface/hub/models--google--siglip2-so400m-patch14-384 |
| rm -rf ~/.cache/huggingface/hub/models--IDEA-Research--grounding-dino-tiny |
| ``` |
|
|
| ### Out of Memory during model loading |
SigLIP2 (~1.5GB) + Grounding DINO (~657MB) need ~2.2GB RAM just for weights. If your system has < 8GB RAM:
| - Set `device="cpu"` in config (default) |
| - Close other memory-heavy applications |
| - Consider using only one model at a time |
|
|
| ### Gemini rate limiting |
| The free tier allows ~15 requests/minute. The pipeline adds a 4-second delay between captioning calls. For longer videos, consider: |
| - Increasing `caption_every_n` (e.g., 5 = caption every 5th frame) |
| - Using a paid Gemini API tier |
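
As a rough sketch of the throttling pattern (the loop and `caption_frame` helper are hypothetical; the 4-second delay and `caption_every_n` knob come from the text above):

```python
import time

def caption_frame(frame):
    """Placeholder for the Gemini captioning call (see gemini_client.py)."""
    return f"caption for {frame}"

frames = [f"frame_{i:04d}.jpg" for i in range(20)]  # hypothetical extracted frames
CAPTION_EVERY_N = 5    # caption only every 5th frame
CAPTION_DELAY_S = 4.0  # stays under the free tier's ~15 requests/minute

for i, frame in enumerate(frames):
    if i % CAPTION_EVERY_N != 0:
        continue
    print(caption_frame(frame))
    time.sleep(CAPTION_DELAY_S)
```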
|
|
## 📁 Project Structure
|
|
| ``` |
| video_intelligence/ |
├── __init__.py          # Package init
├── config.py            # Configuration dataclass
├── frame_extractor.py   # OpenCV frame extraction
├── gemini_client.py     # Gemini API (captioning, embedding, RAG, query decomposition)
├── visual_encoders.py   # SigLIP2 + Grounding DINO
├── index_store.py       # SQLite + FAISS index
├── query_engine.py      # Multi-channel search + boolean ops + fusion
├── akinator.py          # Decision-tree refinement
├── pipeline.py          # End-to-end indexing orchestrator
└── app.py               # Gradio UI
app.py                   # Entry point (CLI + UI)
| requirements.txt |
| ``` |
|
|
| ## License |
|
|
| MIT |
|
|