---
title: Mini RAG - Track B Assessment
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
---
# Mini RAG - Track B Assessment
A production-ready RAG (Retrieval-Augmented Generation) application that demonstrates text ingestion, vector storage, retrieval with reranking, and LLM answering with inline citations.
## 🎯 Goal
Build and host a small RAG app where users input text from the frontend (file upload is optional), store it in a cloud-hosted vector DB, retrieve the most relevant chunks with a retriever + reranker, and answer queries via an LLM with proper citations.
## ๐Ÿ—๏ธ Architecture
```
┌────────────────────────┐   ┌────────────────────────┐   ┌────────────────────────┐
│        Frontend        │   │        Backend         │   │        External        │
│      (Gradio UI)       │◄─►│        (Python)        │◄─►│        Services        │
├────────────────────────┤   ├────────────────────────┤   ├────────────────────────┤
│ • Text Input/Upload    │   │ • Text Processing      │   │ • OpenAI API           │
│ • Query Interface      │   │ • Chunking Strategy    │   │ • Groq API             │
│ • Results Display      │   │ • Embedding Generation │   │ • Cohere API           │
│ • Citations & Sources  │   │ • Vector Storage       │   │ • Pinecone             │
└────────────────────────┘   └────────────────────────┘   └────────────────────────┘
```
### Data Flow
1. **Ingestion**: Text → Chunking → Embedding → Pinecone Vector DB
2. **Query**: Question → Embedding → Vector Search → Top-K Retrieval
3. **Reranking**: Retrieved chunks → Cohere Reranker → Reordered results
4. **Generation**: Reranked chunks → LLM → Answer with inline citations [1], [2]
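The generation step works because each reranked chunk is numbered in the prompt, so the model can cite by index. A minimal sketch of how the context prompt might be assembled (the function and field names are illustrative, not the app's actual API):

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Number each retrieved chunk and instruct the LLM to cite by index."""
    context = "\n\n".join(
        f"[{i}] ({c.get('source', 'unknown')}) {c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using ONLY the context below. Cite sources inline "
        "with bracketed indices like [1] or [2].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# Usage: two retrieved chunks become numbered context entries
chunks = [
    {"text": "Pinecone hosts the vectors.", "source": "notes.md"},
    {"text": "Cohere reranks the top-k hits.", "source": "notes.md"},
]
prompt = build_prompt("Where are vectors stored?", chunks)
```

The answer's `[1]`/`[2]` markers can then be mapped back to the chunk list for the sources display.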
## 🚀 Features
### ✅ Requirements Met
- **Vector Database**: Pinecone cloud-hosted with serverless index
- **Embeddings & Chunking**: OpenAI embeddings with configurable chunk size (400-1200 tokens) and overlap (10-15%)
- **Retriever + Reranker**: Top-k retrieval with optional Cohere reranker
- **LLM & Answering**: OpenAI/Groq with inline citations and source mapping
- **Frontend**: Text input/upload, query interface, citations display, timing & cost estimates
- **Metadata Storage**: Source, title, section, position tracking
### 🔧 Technical Details
- **Chunking Strategy**: 800-token chunks by default with a 120-token overlap (15%)
- **Vector Dimension**: 1536 (OpenAI text-embedding-3-small)
- **Index Configuration**: Pinecone serverless, cosine similarity
- **Upsert Strategy**: Batch processing (100 chunks) with metadata preservation
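The fixed-size strategy and batched upsert above can be sketched with a word-based approximation of tokens (the real `chunker.py` presumably counts model tokens; the 800/120/100 numbers mirror the defaults listed here):

```python
def chunk_text(text: str, size: int = 800, overlap: int = 120) -> list[str]:
    """Split text into fixed-size windows with overlap (word-level proxy for tokens)."""
    words = text.split()
    step = size - overlap  # 680-unit stride yields a 15% overlap between neighbors
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words), 1), step)]

def batched(items: list, n: int = 100):
    """Yield upsert batches of at most n vectors, per the batch strategy above."""
    for i in range(0, len(items), n):
        yield items[i:i + n]
```

Each chunk's metadata (source, title, section, position) would ride along in the same batch payload.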
## ๐Ÿ› ๏ธ Setup
### Prerequisites
- Python 3.8+
- Pinecone account and API key
- OpenAI API key
- Groq API key (optional)
- Cohere API key (optional, for reranking)
### Installation
1. **Clone and setup environment**
```bash
git clone <your-repo-url>
cd mini-rag
python -m venv .venv
source .venv/bin/activate # On Windows: .\.venv\Scripts\activate
pip install -r requirements.txt
```
2. **Configure environment variables**
```bash
cp .env.example .env
# Edit .env with your API keys
```
3. **Create data directory**
```bash
mkdir data
```
4. **Run the application**
```bash
python app.py
```
### Environment Variables
```bash
# Pinecone
PINECONE_API_KEY=your_pinecone_key
PINECONE_INDEX=mini-rag-index
PINECONE_CLOUD=aws
PINECONE_REGION=us-east-1
# LLMs
OPENAI_API_KEY=your_openai_key
GROQ_API_KEY=your_groq_key
# Reranker
COHERE_API_KEY=your_cohere_key
# Models
EMBEDDING_MODEL=text-embedding-3-small
LLM_PROVIDER=openai
LLM_MODEL=gpt-4o-mini
RERANK_PROVIDER=cohere
RERANK_MODEL=rerank-3
# Chunking
CHUNK_SIZE=800
CHUNK_OVERLAP=120
DATA_DIR=./data
```
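At startup these variables can be read with safe fallbacks so the app degrades gracefully when optional keys are absent. A standard-library-only sketch (python-dotenv would layer the `.env` file on top of this; the `load_config` name is illustrative):

```python
import os

def load_config() -> dict:
    """Collect RAG settings from the environment, falling back to the documented defaults."""
    return {
        "pinecone_index": os.getenv("PINECONE_INDEX", "mini-rag-index"),
        "embedding_model": os.getenv("EMBEDDING_MODEL", "text-embedding-3-small"),
        "llm_model": os.getenv("LLM_MODEL", "gpt-4o-mini"),
        "chunk_size": int(os.getenv("CHUNK_SIZE", "800")),
        "chunk_overlap": int(os.getenv("CHUNK_OVERLAP", "120")),
    }
```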
## 📊 Evaluation
### Gold Set Q&A Pairs
1. **Q:** What is the main topic of the document?
**Expected:** Clear identification of document subject
2. **Q:** What are the key findings or conclusions?
**Expected:** Specific facts or conclusions from the text
3. **Q:** What methodology was used?
**Expected:** Description of approach or methods mentioned
4. **Q:** What are the limitations discussed?
**Expected:** Any limitations or constraints mentioned
5. **Q:** What future work is suggested?
**Expected:** Recommendations or future directions
### Success Metrics
- **Precision**: Relevant information in answers
- **Recall**: Coverage of available information
- **Citation Accuracy**: Proper source attribution with [1], [2] format
- **Response Time**: Query processing speed
- **Cost Efficiency**: Token usage and API cost estimates
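Citation accuracy, at least, can be spot-checked mechanically: every bracketed index in an answer should point at a chunk that was actually supplied. A small checker sketch (illustrative, not part of the project files listed below):

```python
import re

def cited_indices(answer: str) -> set[int]:
    """Extract the indices of [1]-style inline citations from an answer."""
    return {int(m) for m in re.findall(r"\[(\d+)\]", answer)}

def citations_valid(answer: str, num_chunks: int) -> bool:
    """True when every cited index refers to a chunk that was supplied as context."""
    return all(1 <= i <= num_chunks for i in cited_indices(answer))
```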
## 🚀 Deployment
### Free Hosting Options
- **Hugging Face Spaces**: Gradio apps with free tier
- **Render**: Free tier for Python web services
- **Railway**: Free tier for small applications
- **Vercel**: Free tier for static sites (with API routes)
### Deployment Steps
1. **Prepare for deployment**
- Ensure all API keys are environment variables
- Test locally with production settings
- Add proper error handling and logging
2. **Deploy to chosen platform**
- Follow platform-specific deployment guides
- Set environment variables in platform dashboard
- Configure domain and SSL if needed
## ๐Ÿ“ Project Structure
```
mini-rag/
├── app.py               # Gradio UI and main application
├── rag_core.py          # RAG orchestration logic
├── llm.py               # LLM provider abstraction
├── pinecone_client.py   # Pinecone vector DB client
├── ingest.py            # Document ingestion pipeline
├── chunker.py           # Text chunking strategy
├── requirements.txt     # Python dependencies
├── .env.example         # Environment variables template
├── README.md            # This file
└── data/                # Document storage directory
```
## ๐Ÿ” Usage Examples
### 1. Text Input Processing
- Paste text into the "Text Input" tab
- Configure chunk size (400-1200 tokens) and overlap (10-15%)
- Click "Process & Store Text" to ingest into vector DB
### 2. File Ingestion
- Place documents (.txt, .md, .pdf) in the `data/` directory
- Use the "File Ingestion" tab to process all files
- Monitor chunk count and processing status
### 3. Query and Answer
- Navigate to "Query" tab
- Enter your question
- Adjust Top-K retrieval and reranker settings
- Get answer with inline citations [1], [2] and source details
## 📈 Performance & Monitoring
### Metrics Tracked
- **Processing Time**: End-to-end query response time
- **Token Usage**: Query, context, and answer token counts
- **Cost Estimates**: Embedding, LLM, and reranking costs
- **Retrieval Quality**: Vector similarity scores and rerank scores
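Cost estimates reduce to token counts multiplied by per-token rates. A sketch with placeholder prices (the `PRICE_PER_1K` values below are illustrative assumptions, not current vendor pricing):

```python
# Assumed USD-per-1K-token rates; real code would load current provider pricing
PRICE_PER_1K = {"embedding": 0.00002, "llm_input": 0.00015, "llm_output": 0.0006}

def estimate_cost(embed_tokens: int, prompt_tokens: int, answer_tokens: int) -> float:
    """Sum per-stage costs from token counts (rates are placeholders)."""
    return (
        embed_tokens / 1000 * PRICE_PER_1K["embedding"]
        + prompt_tokens / 1000 * PRICE_PER_1K["llm_input"]
        + answer_tokens / 1000 * PRICE_PER_1K["llm_output"]
    )
```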
### Optimization Tips
- Adjust chunk size based on document characteristics
- Use reranker for better relevance (adds ~100ms but improves quality)
- Batch process documents for efficient ingestion
- Monitor Pinecone index performance and costs
## 🚨 Error Handling
### Common Issues
- **Missing API Keys**: Check environment variables
- **Pinecone Connection**: Verify index name and region
- **Document Processing**: Check file formats and encoding
- **Rate Limits**: Implement exponential backoff for API calls
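The exponential-backoff advice can be implemented as a small retry wrapper. A sketch (catching bare `Exception` and the 1s base delay are assumptions; real code would catch each SDK's specific rate-limit error):

```python
import time

def with_backoff(fn, retries: int = 5, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn, retrying on failure with a delay that doubles each attempt."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the original error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

The `sleep` parameter is injectable so the wait schedule can be tested without actually sleeping.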
### Graceful Degradation
- Fallback to original retrieval order if reranker fails
- Continue processing if individual documents fail
- Provide clear error messages with troubleshooting steps
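The reranker fallback above amounts to a try/except around the rerank call. A sketch (`rerank_fn` stands in for whatever reranking client the app actually uses):

```python
def rerank_with_fallback(query: str, chunks: list[str], rerank_fn) -> list[str]:
    """Return reranked chunks, or the original retrieval order if reranking fails."""
    try:
        return rerank_fn(query, chunks)
    except Exception:
        return chunks  # graceful degradation: keep vector-search order
```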
## 🔮 Future Enhancements
### Planned Improvements
- **Advanced Chunking**: Semantic chunking with sentence transformers
- **Hybrid Search**: Combine vector and keyword search
- **Multi-modal Support**: Image and document processing
- **Caching Layer**: Redis for frequently accessed results
- **Analytics Dashboard**: Query performance and usage metrics
### Scalability Considerations
- **Vector DB**: Pinecone pod scaling for larger datasets
- **Embedding Models**: Local models for cost reduction
- **Load Balancing**: Multiple LLM providers for redundancy
- **CDN Integration**: Static asset optimization
## ๐Ÿ“ Remarks
### Trade-offs Made
- **API Dependencies**: Relies on external services for embeddings and LLM
- **Cost vs Quality**: OpenAI embeddings provide quality but add cost
- **Latency**: Reranking adds ~100ms but significantly improves relevance
- **Chunking Strategy**: Fixed-size chunks for simplicity vs semantic chunking
### Provider Limits
- **OpenAI**: Rate limits and token limits per request
- **Pinecone**: Free tier index size and query limits
- **Cohere**: Reranking API rate limits
- **Groq**: Alternative LLM with different pricing model
### What I'd Do Next
1. **Implement semantic chunking** for better document understanding
2. **Add hybrid search** combining vector and keyword approaches
3. **Build evaluation framework** with automated testing
4. **Optimize for production** with proper logging and monitoring
5. **Add authentication** for multi-user support
## ๐Ÿ‘จโ€๐Ÿ’ป Author
**Your Name** - AI Engineer Assessment Candidate
- **GitHub**: [Your GitHub Profile]
- **LinkedIn**: [Your LinkedIn Profile]
- **Portfolio**: [Your Portfolio/Website]
## 📄 License
This project is created for the AI Engineer Assessment. Feel free to use and modify for learning purposes.
---
**Note**: This implementation demonstrates production-ready practices including proper error handling, environment variable management, comprehensive documentation, and scalable architecture design.