---
language: en
tags:
- sentence-transformers
- retrieval
- contrastive-learning
- multimodal
- video
- rag
- faiss
- pytorch
license: apache-2.0
---

# ViD-GAN Encoder (VideoRAG Retrieval Model)

ViD-GAN is a custom-trained multimodal retrieval model for **video-based question answering**.

This repository contains:
- **ViD-GAN Encoder** (SentenceTransformer-based)
- **ViD-GAN Discriminator** (grounding verification module)

---

## What This Model Does

ViD-GAN Encoder generates embeddings for:
- user questions
- video transcript chunks
- multimodal chunks (transcript + detected visual objects)

It is trained using **contrastive learning (InfoNCE)** to improve retrieval quality.

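
As a rough illustration of what an InfoNCE objective with in-batch negatives computes, here is a dependency-free sketch. It is not the training code used for this model: the function name `info_nce_loss`, the temperature of 0.05, and the toy similarity values are all assumptions made for the example.

```python
import math

def info_nce_loss(sim, temperature=0.05):
    """Mean InfoNCE loss for a square similarity matrix.

    sim[i][j] is the similarity between question i and chunk j.
    The matching (positive) chunk for question i sits at column i;
    every other column in that row acts as an in-batch negative.
    The temperature value is an assumed hyperparameter.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]  # -log softmax at the positive
    return total / len(sim)

# Toy cosine similarities: positives on the diagonal score highest.
aligned = [[0.9, 0.1, 0.0],
           [0.2, 0.8, 0.1],
           [0.0, 0.3, 0.7]]
# Same scores with the first two rows swapped, so two positives no longer win.
misaligned = [aligned[1], aligned[0], aligned[2]]

# The loss is lower when each question is most similar to its own chunk.
print(info_nce_loss(aligned) < info_nce_loss(misaligned))
```
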

---

## 🛠️ How to Use

### Load Encoder

```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('nandakrishnan1311/ViD-GAN-Encoder')
```

### Encode Text

```python
# Uses the `model` loaded above; encode() returns a NumPy array by default.
emb = model.encode('In the UK, what is totally illegal?')
print(emb.shape)
```
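
Once questions and chunks are embedded, retrieval comes down to ranking chunks by similarity to the question. The sketch below assumes cosine similarity as the scoring function; `cosine` and `rank_chunks` are hypothetical helpers, not part of this repository, and toy vectors stand in for real embeddings so the example runs without downloading the model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def rank_chunks(question_emb, chunk_embs, top_k=3):
    """Return (chunk_index, score) pairs, best match first."""
    scores = [(i, cosine(question_emb, c)) for i, c in enumerate(chunk_embs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]

# With the real model, the embeddings would come from model.encode(...), e.g.
#   question_emb = model.encode(question)
#   chunk_embs = model.encode(list_of_chunk_texts)
# Toy 3-dimensional vectors are used here instead (an assumption for the demo).
question_emb = [1.0, 0.0, 0.0]
chunk_embs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]

for index, score in rank_chunks(question_emb, chunk_embs, top_k=2):
    print(index, round(score, 3))
```
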

---

## 📦 Files

- `ViD-GAN-Encoder/` : SentenceTransformer encoder
- `ViD-GAN-Discriminator.pt` : grounding discriminator

---

## ⚠️ Limitations

- Trained on a small auto-generated dataset
- Visual info is based on YOLO object labels (may include false detections)
- Intended for research and prototype use

---

## 👤 Author

Developed by **Nandakrishnan O** 🇮🇳