| license: mit | |
| tags: | |
| - rag | |
| - multimodal | |
| - faiss | |
| - sentence-transformers | |
| - clip | |
| - mistral | |
| - information-retrieval | |
| # Multimodal RAG System | |
| This repository contains a complete Multimodal Retrieval-Augmented Generation (RAG) system that combines text and image search with LLM-based answer generation. | |
| ## System Components | |
| - **Text Embeddings**: Sentence-BERT (all-MiniLM-L6-v2) - 384 dimensions | |
| - **Image Embeddings**: CLIP (ViT-B/32) - 512 dimensions | |
| - **Vector Database**: FAISS indices for efficient similarity search | |
| - **LLM**: Mistral-7B-Instruct (4-bit quantized) | |
| - **Total Vectors**: 446 (161 text + 285 images) | |
| ## Files | |
| - `text_index.faiss`: FAISS index for text embeddings | |
| - `image_index.faiss`: FAISS index for image embeddings | |
| - `text_metadata.pkl`: Metadata for text chunks (source, page, content) | |
| - `image_metadata.pkl`: Metadata for images (source, page, image_id) | |
| - `config.json`: System configuration | |
| - `image_summary.json`: Reference summary of images | |
| ## Usage | |
| See the load cells in the notebook for loading and using this RAG system. | |
| ## Features | |
| - Semantic text search | |
| - Cross-modal image search (text query → image results) | |
| - Multiple prompting strategies (Standard, Chain-of-Thought, Few-shot, Zero-shot) | |
| - Source attribution and traceability | |
| - Real-time answer generation | |