Multimodal RAG System

This repository contains a complete Multimodal Retrieval-Augmented Generation (RAG) system that combines text and image search with LLM-based answer generation.

System Components

  • Text Embeddings: Sentence-BERT (all-MiniLM-L6-v2) - 384 dimensions
  • Image Embeddings: CLIP (ViT-B/32) - 512 dimensions
  • Vector Database: FAISS indices for efficient similarity search
  • LLM: Mistral-7B-Instruct (4-bit quantized)
  • Total Vectors: 446 (161 text + 285 images)
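
Because text vectors (384 dimensions) and image vectors (512 dimensions) differ in size, a query is compared only against vectors from the matching index. The sketch below shows what a similarity lookup does, using a brute-force cosine ranking in plain Python. It is a stand-in for FAISS, not the code in this repository: the function names and toy vectors are hypothetical, and the metric used by the actual indices is not stated here.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    if len(a) != len(b):
        # e.g. a 384-d text query cannot be scored against 512-d image vectors
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], vectors: List[Sequence[float]], k: int = 3
) -> List[Tuple[int, float]]:
    """Return (row_id, score) pairs for the k most similar stored vectors."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors (assumption for illustration only).
stored = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
results = top_k([1.0, 0.1, 0.0], stored, k=2)
print(results)
```

The returned row ids are positions in the index, so they can be used to look up the matching entry in the metadata files.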

Files

  • text_index.faiss: FAISS index for text embeddings
  • image_index.faiss: FAISS index for image embeddings
  • text_metadata.pkl: Metadata for text chunks (source, page, content)
  • image_metadata.pkl: Metadata for images (source, page, image_id)
  • config.json: System configuration
  • image_summary.json: Reference summary of images

Usage

See the load cells in the notebook for how to load the indices and metadata and query this RAG system.
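
For orientation, a minimal loading sketch follows. It assumes all files listed above sit in one directory and that each metadata pickle holds a sequence with one entry per indexed vector. Neither assumption is confirmed here, and the helper names are hypothetical. The notebook remains the authoritative reference.

```python
import json
import pickle
from pathlib import Path
from typing import Any, Dict


def load_metadata(directory: str) -> Dict[str, Any]:
    """Load config.json and both metadata pickles (standard library only)."""
    root = Path(directory)
    with open(root / "config.json", "r", encoding="utf-8") as f:
        config = json.load(f)
    # pickle can execute arbitrary code: only load files you trust.
    with open(root / "text_metadata.pkl", "rb") as f:
        text_meta = pickle.load(f)
    with open(root / "image_metadata.pkl", "rb") as f:
        image_meta = pickle.load(f)
    return {
        "config": config,
        "text_metadata": text_meta,
        "image_metadata": image_meta,
    }


def load_indices(directory: str) -> Dict[str, Any]:
    """Load both FAISS indices; requires the faiss package."""
    import faiss  # imported lazily so load_metadata works without it

    root = Path(directory)
    return {
        "text_index": faiss.read_index(str(root / "text_index.faiss")),
        "image_index": faiss.read_index(str(root / "image_index.faiss")),
    }


def check_alignment(index_size: int, metadata: Any) -> bool:
    """True when there is exactly one metadata entry per indexed vector."""
    return index_size == len(metadata)
```

After loading, a call such as `check_alignment(indices["text_index"].ntotal, meta["text_metadata"])` is a cheap sanity check. Given the counts listed under System Components, the text index should report 161 vectors and the image index 285.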

Features

  • Semantic text search
  • Cross-modal image search (text query → image results)
  • Multiple prompting strategies (Standard, Chain-of-Thought, Few-shot, Zero-shot)
  • Source attribution and traceability
  • Real-time answer generation
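
To illustrate how prompting strategies and source attribution can fit together, here is a small sketch that builds a prompt from retrieved text chunks. The template wording, the few-shot example, and the function names are all hypothetical. Only the metadata fields (source, page, content) come from the file descriptions above, and the notebook's actual prompts may differ.

```python
from typing import Dict, List

# Hypothetical instruction templates, one per strategy named above.
INSTRUCTIONS: Dict[str, str] = {
    "standard": "Answer the question using the context below.",
    "zero_shot": (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so."
    ),
    "chain_of_thought": (
        "Use the context below. Reason step by step, then give the final answer."
    ),
    "few_shot": (
        "Follow the style of the example, then answer the question "
        "using the context below."
    ),
}

# Hypothetical worked example, included only for the few-shot strategy.
FEW_SHOT_EXAMPLE = (
    "Example question: Which page mentions the warranty period?\n"
    "Example answer: The warranty period is mentioned on page 2 [1]."
)


def format_context(chunks: List[Dict[str, object]]) -> str:
    """Number each chunk and tag it with source and page for attribution."""
    lines = []
    for i, chunk in enumerate(chunks, start=1):
        lines.append(
            f"[{i}] ({chunk['source']}, page {chunk['page']}) {chunk['content']}"
        )
    return "\n".join(lines)


def build_prompt(
    question: str, chunks: List[Dict[str, object]], strategy: str = "standard"
) -> str:
    """Assemble instruction, optional example, context, and question."""
    if strategy not in INSTRUCTIONS:
        raise ValueError(f"unknown strategy: {strategy}")
    parts = [INSTRUCTIONS[strategy]]
    if strategy == "few_shot":
        parts.append(FEW_SHOT_EXAMPLE)
    parts.append("Context:\n" + format_context(chunks))
    parts.append(f"Question: {question}")
    parts.append("Answer (cite sources as [n]):")
    return "\n\n".join(parts)


chunks = [{"source": "manual.pdf", "page": 3, "content": "The pump runs at 50 Hz."}]
prompt = build_prompt("What frequency does the pump use?", chunks, "chain_of_thought")
print(prompt)
```

Numbering the chunks in the context lets the model cite `[n]` markers, which can then be mapped back to the source and page stored in the metadata.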