|
|
--- |
|
|
license: mit |
|
|
tags: |
|
|
- rag |
|
|
- multimodal |
|
|
- faiss |
|
|
- sentence-transformers |
|
|
- clip |
|
|
- mistral |
|
|
- information-retrieval |
|
|
--- |
|
|
|
|
|
# Multimodal RAG System |
|
|
|
|
|
This repository contains a complete Multimodal Retrieval-Augmented Generation (RAG) system that combines text and image search with LLM-based answer generation. |
|
|
|
|
|
## System Components |
|
|
|
|
|
- **Text Embeddings**: Sentence-BERT (all-MiniLM-L6-v2) - 384 dimensions |
|
|
- **Image Embeddings**: CLIP (ViT-B/32) - 512 dimensions |
|
|
- **Vector Database**: FAISS indices for efficient similarity search |
|
|
- **LLM**: Mistral-7B-Instruct (4-bit quantized) |
|
|
- **Total Vectors**: 446 (161 text + 285 images) |
|
|
|
|
|
## Files |
|
|
|
|
|
- `text_index.faiss`: FAISS index for text embeddings |
|
|
- `image_index.faiss`: FAISS index for image embeddings |
|
|
- `text_metadata.pkl`: Metadata for text chunks (source, page, content) |
|
|
- `image_metadata.pkl`: Metadata for images (source, page, image_id) |
|
|
- `config.json`: System configuration |
|
|
- `image_summary.json`: Reference summary of images |
|
|
|
|
|
## Usage |
|
|
|
|
|
See the load cells in the notebook for loading and using this RAG system. |
|
|
|
|
|
## Features |
|
|
|
|
|
- Semantic text search |
|
|
- Cross-modal image search (text query → image results) |
|
|
- Multiple prompting strategies (Standard, Chain-of-Thought, Few-shot, Zero-shot) |
|
|
- Source attribution and traceability |
|
|
- Real-time answer generation |
|
|
|