---
title: Mini RAG - Track B Assessment
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
---

# Mini RAG - Track B Assessment

A production-ready RAG (Retrieval-Augmented Generation) application that demonstrates text input, vector storage, retrieval + reranking, and LLM answering with inline citations.

## 🎯 Goal

Build and host a small RAG app where users submit text from the frontend (file upload is optional), store it in a cloud-hosted vector DB, retrieve the most relevant chunks with a retriever + reranker, and answer queries via an LLM with proper citations.

๐Ÿ—๏ธ Architecture

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚   Frontend      โ”‚    โ”‚   Backend       โ”‚    โ”‚   External      โ”‚
โ”‚   (Gradio UI)   โ”‚โ—„โ”€โ”€โ–บโ”‚   (Python)      โ”‚โ—„โ”€โ”€โ–บโ”‚   Services      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
โ”‚                        โ”‚                        โ”‚
โ”‚ โ€ข Text Input/Upload    โ”‚ โ€ข Text Processing      โ”‚ โ€ข OpenAI API    โ”‚
โ”‚ โ€ข Query Interface      โ”‚ โ€ข Chunking Strategy    โ”‚ โ€ข Groq API      โ”‚
โ”‚ โ€ข Results Display      โ”‚ โ€ข Embedding Generation โ”‚ โ€ข Cohere API    โ”‚
โ”‚ โ€ข Citations & Sources  โ”‚ โ€ข Vector Storage      โ”‚ โ€ข Pinecone      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

### Data Flow

1. **Ingestion**: Text → Chunking → Embedding → Pinecone Vector DB
2. **Query**: Question → Embedding → Vector Search → Top-K Retrieval
3. **Reranking**: Retrieved chunks → Cohere Reranker → Reordered results
4. **Generation**: Reranked chunks → LLM → Answer with inline citations [1], [2]
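The query step above boils down to embedding the question and ranking the stored chunks by cosine similarity. A minimal sketch of that ranking with toy 2-D vectors (function names are illustrative, not the app's actual API):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, chunks, k=3):
    # chunks: (chunk_id, vector, metadata) tuples, as stored in the vector DB.
    scored = [(cosine(query_vec, vec), cid, meta) for cid, vec, meta in chunks]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]

# Toy store with 2-D "embeddings" standing in for 1536-dim vectors.
store = [
    ("c1", [1.0, 0.0], {"source": "doc.md", "position": 0}),
    ("c2", [0.0, 1.0], {"source": "doc.md", "position": 1}),
    ("c3", [0.7, 0.7], {"source": "doc.md", "position": 2}),
]
hits = top_k([1.0, 0.1], store, k=2)
```

In the real pipeline the vector search happens inside Pinecone; this only illustrates the ranking semantics.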

## 🚀 Features

### ✅ Requirements Met

- **Vector Database**: Pinecone cloud-hosted with serverless index
- **Embeddings & Chunking**: OpenAI embeddings with configurable chunk size (400-1200 tokens) and overlap (10-15%)
- **Retriever + Reranker**: Top-k retrieval with optional Cohere reranker
- **LLM & Answering**: OpenAI/Groq with inline citations and source mapping
- **Frontend**: Text input/upload, query interface, citations display, timing & cost estimates
- **Metadata Storage**: Source, title, section, position tracking

### 🔧 Technical Details

- **Chunking Strategy**: 800 tokens by default with 120-token overlap (15%)
- **Vector Dimension**: 1536 (OpenAI text-embedding-3-small)
- **Index Configuration**: Pinecone serverless, cosine similarity
- **Upsert Strategy**: Batch processing (100 chunks per batch) with metadata preservation

๐Ÿ› ๏ธ Setup

Prerequisites

  • Python 3.8+
  • Pinecone account and API key
  • OpenAI API key
  • Groq API key (optional)
  • Cohere API key (optional, for reranking)

### Installation

1. **Clone and set up the environment**

   ```bash
   git clone <your-repo-url>
   cd mini-rag
   python -m venv .venv
   source .venv/bin/activate  # On Windows: .\.venv\Scripts\activate
   pip install -r requirements.txt
   ```

2. **Configure environment variables**

   ```bash
   cp .env.example .env
   # Edit .env with your API keys
   ```

3. **Create the data directory**

   ```bash
   mkdir data
   ```

4. **Run the application**

   ```bash
   python app.py
   ```

### Environment Variables

```bash
# Pinecone
PINECONE_API_KEY=your_pinecone_key
PINECONE_INDEX=mini-rag-index
PINECONE_CLOUD=aws
PINECONE_REGION=us-east-1

# LLMs
OPENAI_API_KEY=your_openai_key
GROQ_API_KEY=your_groq_key

# Reranker
COHERE_API_KEY=your_cohere_key

# Models
EMBEDDING_MODEL=text-embedding-3-small
LLM_PROVIDER=openai
LLM_MODEL=gpt-4o-mini
RERANK_PROVIDER=cohere
RERANK_MODEL=rerank-3

# Chunking
CHUNK_SIZE=800
CHUNK_OVERLAP=120
DATA_DIR=./data
```
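These variables can be read at startup with fallbacks matching the defaults above; a sketch (the app's actual config handling may differ):

```python
import os

def load_config():
    # Read RAG settings from the environment, falling back to documented defaults.
    return {
        "pinecone_index": os.getenv("PINECONE_INDEX", "mini-rag-index"),
        "embedding_model": os.getenv("EMBEDDING_MODEL", "text-embedding-3-small"),
        "llm_provider": os.getenv("LLM_PROVIDER", "openai"),
        "chunk_size": int(os.getenv("CHUNK_SIZE", "800")),
        "chunk_overlap": int(os.getenv("CHUNK_OVERLAP", "120")),
    }

cfg = load_config()
```

Casting `CHUNK_SIZE`/`CHUNK_OVERLAP` to `int` at the boundary avoids string-vs-number bugs deeper in the pipeline.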

## 📊 Evaluation

### Gold Set Q&A Pairs

1. **Q:** What is the main topic of the document?
   **Expected:** Clear identification of the document's subject
2. **Q:** What are the key findings or conclusions?
   **Expected:** Specific facts or conclusions from the text
3. **Q:** What methodology was used?
   **Expected:** A description of the approach or methods mentioned
4. **Q:** What are the limitations discussed?
   **Expected:** Any limitations or constraints mentioned
5. **Q:** What future work is suggested?
   **Expected:** Recommendations or future directions

### Success Metrics

- **Precision**: Relevant information in answers
- **Recall**: Coverage of available information
- **Citation Accuracy**: Proper source attribution in [1], [2] format
- **Response Time**: Query processing speed
- **Cost Efficiency**: Token usage and API cost estimates

## 🚀 Deployment

### Free Hosting Options

- **Hugging Face Spaces**: Gradio apps with a free tier
- **Render**: Free tier for Python web services
- **Railway**: Free tier for small applications
- **Vercel**: Free tier for static sites (with API routes)

### Deployment Steps

1. **Prepare for deployment**
   - Ensure all API keys are read from environment variables
   - Test locally with production settings
   - Add proper error handling and logging
2. **Deploy to the chosen platform**
   - Follow platform-specific deployment guides
   - Set environment variables in the platform dashboard
   - Configure domain and SSL if needed

๐Ÿ“ Project Structure

mini-rag/
โ”œโ”€โ”€ app.py              # Gradio UI and main application
โ”œโ”€โ”€ rag_core.py         # RAG orchestration logic
โ”œโ”€โ”€ llm.py             # LLM provider abstraction
โ”œโ”€โ”€ pinecone_client.py # Pinecone vector DB client
โ”œโ”€โ”€ ingest.py          # Document ingestion pipeline
โ”œโ”€โ”€ chunker.py         # Text chunking strategy
โ”œโ”€โ”€ requirements.txt   # Python dependencies
โ”œโ”€โ”€ .env.example      # Environment variables template
โ”œโ”€โ”€ README.md         # This file
โ””โ”€โ”€ data/             # Document storage directory

๐Ÿ” Usage Examples

1. Text Input Processing

  • Paste text into the "Text Input" tab
  • Configure chunk size (400-1200 tokens) and overlap (10-15%)
  • Click "Process & Store Text" to ingest into vector DB

2. File Ingestion

  • Place documents (.txt, .md, .pdf) in the data/ directory
  • Use the "File Ingestion" tab to process all files
  • Monitor chunk count and processing status

3. Query and Answer

  • Navigate to "Query" tab
  • Enter your question
  • Adjust Top-K retrieval and reranker settings
  • Get answer with inline citations [1], [2] and source details

## 📈 Performance & Monitoring

### Metrics Tracked

- **Processing Time**: End-to-end query response time
- **Token Usage**: Query, context, and answer token counts
- **Cost Estimates**: Embedding, LLM, and reranking costs
- **Retrieval Quality**: Vector similarity scores and rerank scores

### Optimization Tips

- Adjust chunk size based on document characteristics
- Use the reranker for better relevance (adds ~100ms but improves quality)
- Batch-process documents for efficient ingestion
- Monitor Pinecone index performance and costs

## 🚨 Error Handling

### Common Issues

- **Missing API Keys**: Check environment variables
- **Pinecone Connection**: Verify index name and region
- **Document Processing**: Check file formats and encoding
- **Rate Limits**: Implement exponential backoff for API calls

### Graceful Degradation

- Fall back to the original retrieval order if the reranker fails
- Continue processing if individual documents fail
- Provide clear error messages with troubleshooting steps

## 🔮 Future Enhancements

### Planned Improvements

- **Advanced Chunking**: Semantic chunking with sentence transformers
- **Hybrid Search**: Combine vector and keyword search
- **Multi-modal Support**: Image and document processing
- **Caching Layer**: Redis for frequently accessed results
- **Analytics Dashboard**: Query performance and usage metrics

### Scalability Considerations

- **Vector DB**: Pinecone pod scaling for larger datasets
- **Embedding Models**: Local models for cost reduction
- **Load Balancing**: Multiple LLM providers for redundancy
- **CDN Integration**: Static asset optimization

๐Ÿ“ Remarks

Trade-offs Made

  • API Dependencies: Relies on external services for embeddings and LLM
  • Cost vs Quality: OpenAI embeddings provide quality but add cost
  • Latency: Reranking adds ~100ms but significantly improves relevance
  • Chunking Strategy: Fixed-size chunks for simplicity vs semantic chunking

Provider Limits

  • OpenAI: Rate limits and token limits per request
  • Pinecone: Free tier index size and query limits
  • Cohere: Reranking API rate limits
  • Groq: Alternative LLM with different pricing model

### What I'd Do Next

  1. Implement semantic chunking for better document understanding
  2. Add hybrid search combining vector and keyword approaches
  3. Build evaluation framework with automated testing
  4. Optimize for production with proper logging and monitoring
  5. Add authentication for multi-user support

๐Ÿ‘จโ€๐Ÿ’ป Author

Your Name - AI Engineer Assessment Candidate

  • GitHub: [Your GitHub Profile]
  • LinkedIn: [Your LinkedIn Profile]
  • Portfolio: [Your Portfolio/Website]

## 📄 License

This project is created for the AI Engineer Assessment. Feel free to use and modify for learning purposes.


**Note**: This implementation demonstrates production-ready practices including proper error handling, environment variable management, comprehensive documentation, and scalable architecture design.