---
title: Mini RAG - Track B Assessment
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
---
# Mini RAG - Track B Assessment
A production-ready RAG (Retrieval-Augmented Generation) application that demonstrates text ingestion, vector storage, retrieval with reranking, and LLM answering with inline citations.
## 🎯 Goal
Build and host a small RAG app where users input text from the frontend (file upload is optional), store it in a cloud-hosted vector DB, retrieve the most relevant chunks with a retriever + reranker, and answer queries via an LLM with proper citations.
## ๐Ÿ—๏ธ Architecture
```
┌────────────────────────┐   ┌────────────────────────┐   ┌────────────────────────┐
│        Frontend        │   │        Backend         │   │        External        │
│      (Gradio UI)       │◄─►│        (Python)        │◄─►│        Services        │
├────────────────────────┤   ├────────────────────────┤   ├────────────────────────┤
│ • Text Input/Upload    │   │ • Text Processing      │   │ • OpenAI API           │
│ • Query Interface      │   │ • Chunking Strategy    │   │ • Groq API             │
│ • Results Display      │   │ • Embedding Generation │   │ • Cohere API           │
│ • Citations & Sources  │   │ • Vector Storage       │   │ • Pinecone             │
└────────────────────────┘   └────────────────────────┘   └────────────────────────┘
```
### Data Flow
1. **Ingestion**: Text → Chunking → Embedding → Pinecone Vector DB
2. **Query**: Question → Embedding → Vector Search → Top-K Retrieval
3. **Reranking**: Retrieved chunks → Cohere Reranker → Reordered results
4. **Generation**: Reranked chunks → LLM → Answer with inline citations [1], [2]
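The generation step works because each reranked chunk is numbered in the prompt, so the model can cite by index. A minimal sketch of how the context prompt might be assembled (the function and field names are illustrative, not the app's actual API):

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Number each retrieved chunk and instruct the LLM to cite by index."""
    context = "\n\n".join(
        f"[{i}] ({c.get('source', 'unknown')}) {c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using ONLY the context below. Cite sources inline "
        "with bracketed indices like [1] or [2].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# Usage: two retrieved chunks become numbered context entries
chunks = [
    {"text": "Pinecone hosts the vectors.", "source": "notes.md"},
    {"text": "Cohere reranks the top-k hits.", "source": "notes.md"},
]
prompt = build_prompt("Where are vectors stored?", chunks)
```

The answer's `[1]`/`[2]` markers can then be mapped back to the chunk list for the sources display.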
## 🚀 Features
### ✅ Requirements Met
- **Vector Database**: Pinecone cloud-hosted with serverless index
- **Embeddings & Chunking**: OpenAI embeddings with configurable chunk size (400-1200 tokens) and overlap (10-15%)
- **Retriever + Reranker**: Top-k retrieval with optional Cohere reranker
- **LLM & Answering**: OpenAI/Groq with inline citations and source mapping
- **Frontend**: Text input/upload, query interface, citations display, timing & cost estimates
- **Metadata Storage**: Source, title, section, position tracking
### 🔧 Technical Details
- **Chunking Strategy**: 800-token chunks by default with a 120-token overlap (15%)
- **Vector Dimension**: 1536 (OpenAI text-embedding-3-small)
- **Index Configuration**: Pinecone serverless, cosine similarity
- **Upsert Strategy**: Batch processing (100 chunks) with metadata preservation
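The fixed-size strategy and batched upsert above can be sketched with a word-based approximation of tokens (the real `chunker.py` presumably counts model tokens; the 800/120/100 numbers mirror the defaults listed here):

```python
def chunk_text(text: str, size: int = 800, overlap: int = 120) -> list[str]:
    """Split text into fixed-size windows with overlap (word-level proxy for tokens)."""
    words = text.split()
    step = size - overlap  # 680-unit stride yields a 15% overlap between neighbors
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words), 1), step)]

def batched(items: list, n: int = 100):
    """Yield upsert batches of at most n vectors, per the batch strategy above."""
    for i in range(0, len(items), n):
        yield items[i:i + n]
```

Each chunk's metadata (source, title, section, position) would ride along in the same batch payload.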
## ๐Ÿ› ๏ธ Setup
### Prerequisites
- Python 3.8+
- Pinecone account and API key
- OpenAI API key
- Groq API key (optional)
- Cohere API key (optional, for reranking)
### Installation
1. **Clone and setup environment**
```bash
git clone <your-repo-url>
cd mini-rag
python -m venv .venv
source .venv/bin/activate # On Windows: .\.venv\Scripts\activate
pip install -r requirements.txt
```
2. **Configure environment variables**
```bash
cp .env.example .env
# Edit .env with your API keys
```
3. **Create data directory**
```bash
mkdir data
```
4. **Run the application**
```bash
python app.py
```
### Environment Variables
```bash
# Pinecone
PINECONE_API_KEY=your_pinecone_key
PINECONE_INDEX=mini-rag-index
PINECONE_CLOUD=aws
PINECONE_REGION=us-east-1
# LLMs
OPENAI_API_KEY=your_openai_key
GROQ_API_KEY=your_groq_key
# Reranker
COHERE_API_KEY=your_cohere_key
# Models
EMBEDDING_MODEL=text-embedding-3-small
LLM_PROVIDER=openai
LLM_MODEL=gpt-4o-mini
RERANK_PROVIDER=cohere
RERANK_MODEL=rerank-3
# Chunking
CHUNK_SIZE=800
CHUNK_OVERLAP=120
DATA_DIR=./data
```
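At startup these variables can be read with safe fallbacks so the app degrades gracefully when optional keys are absent. A standard-library-only sketch (python-dotenv would layer the `.env` file on top of this; the `load_config` name is illustrative):

```python
import os

def load_config() -> dict:
    """Collect RAG settings from the environment, falling back to the documented defaults."""
    return {
        "pinecone_index": os.getenv("PINECONE_INDEX", "mini-rag-index"),
        "embedding_model": os.getenv("EMBEDDING_MODEL", "text-embedding-3-small"),
        "llm_model": os.getenv("LLM_MODEL", "gpt-4o-mini"),
        "chunk_size": int(os.getenv("CHUNK_SIZE", "800")),
        "chunk_overlap": int(os.getenv("CHUNK_OVERLAP", "120")),
    }
```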
## 📊 Evaluation
### Gold Set Q&A Pairs
1. **Q:** What is the main topic of the document?
**Expected:** Clear identification of document subject
2. **Q:** What are the key findings or conclusions?
**Expected:** Specific facts or conclusions from the text
3. **Q:** What methodology was used?
**Expected:** Description of approach or methods mentioned
4. **Q:** What are the limitations discussed?
**Expected:** Any limitations or constraints mentioned
5. **Q:** What future work is suggested?
**Expected:** Recommendations or future directions
### Success Metrics
- **Precision**: Relevant information in answers
- **Recall**: Coverage of available information
- **Citation Accuracy**: Proper source attribution with [1], [2] format
- **Response Time**: Query processing speed
- **Cost Efficiency**: Token usage and API cost estimates
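Citation accuracy, at least, can be spot-checked mechanically: every bracketed index in an answer should point at a chunk that was actually supplied. A small checker sketch (illustrative, not part of the project files listed below):

```python
import re

def cited_indices(answer: str) -> set[int]:
    """Extract the indices of [1]-style inline citations from an answer."""
    return {int(m) for m in re.findall(r"\[(\d+)\]", answer)}

def citations_valid(answer: str, num_chunks: int) -> bool:
    """True when every cited index refers to a chunk that was supplied as context."""
    return all(1 <= i <= num_chunks for i in cited_indices(answer))
```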
## 🚀 Deployment
### Free Hosting Options
- **Hugging Face Spaces**: Gradio apps with free tier
- **Render**: Free tier for Python web services
- **Railway**: Free tier for small applications
- **Vercel**: Free tier for static sites (with API routes)
### Deployment Steps
1. **Prepare for deployment**
- Ensure all API keys are environment variables
- Test locally with production settings
- Add proper error handling and logging
2. **Deploy to chosen platform**
- Follow platform-specific deployment guides
- Set environment variables in platform dashboard
- Configure domain and SSL if needed
## ๐Ÿ“ Project Structure
```
mini-rag/
├── app.py               # Gradio UI and main application
├── rag_core.py          # RAG orchestration logic
├── llm.py               # LLM provider abstraction
├── pinecone_client.py   # Pinecone vector DB client
├── ingest.py            # Document ingestion pipeline
├── chunker.py           # Text chunking strategy
├── requirements.txt     # Python dependencies
├── .env.example         # Environment variables template
├── README.md            # This file
└── data/                # Document storage directory
```
## ๐Ÿ” Usage Examples
### 1. Text Input Processing
- Paste text into the "Text Input" tab
- Configure chunk size (400-1200 tokens) and overlap (10-15%)
- Click "Process & Store Text" to ingest into vector DB
### 2. File Ingestion
- Place documents (.txt, .md, .pdf) in the `data/` directory
- Use the "File Ingestion" tab to process all files
- Monitor chunk count and processing status
### 3. Query and Answer
- Navigate to "Query" tab
- Enter your question
- Adjust Top-K retrieval and reranker settings
- Get answer with inline citations [1], [2] and source details
## 📈 Performance & Monitoring
### Metrics Tracked
- **Processing Time**: End-to-end query response time
- **Token Usage**: Query, context, and answer token counts
- **Cost Estimates**: Embedding, LLM, and reranking costs
- **Retrieval Quality**: Vector similarity scores and rerank scores
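Cost estimates reduce to token counts multiplied by per-token rates. A sketch with placeholder prices (the `PRICE_PER_1K` values below are illustrative assumptions, not current vendor pricing):

```python
# Assumed USD-per-1K-token rates; real code would load current provider pricing
PRICE_PER_1K = {"embedding": 0.00002, "llm_input": 0.00015, "llm_output": 0.0006}

def estimate_cost(embed_tokens: int, prompt_tokens: int, answer_tokens: int) -> float:
    """Sum per-stage costs from token counts (rates are placeholders)."""
    return (
        embed_tokens / 1000 * PRICE_PER_1K["embedding"]
        + prompt_tokens / 1000 * PRICE_PER_1K["llm_input"]
        + answer_tokens / 1000 * PRICE_PER_1K["llm_output"]
    )
```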
### Optimization Tips
- Adjust chunk size based on document characteristics
- Use reranker for better relevance (adds ~100ms but improves quality)
- Batch process documents for efficient ingestion
- Monitor Pinecone index performance and costs
## 🚨 Error Handling
### Common Issues
- **Missing API Keys**: Check environment variables
- **Pinecone Connection**: Verify index name and region
- **Document Processing**: Check file formats and encoding
- **Rate Limits**: Implement exponential backoff for API calls
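The exponential-backoff advice can be implemented as a small retry wrapper. A sketch (catching bare `Exception` and the 1s base delay are assumptions; real code would catch each SDK's specific rate-limit error):

```python
import time

def with_backoff(fn, retries: int = 5, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn, retrying on failure with a delay that doubles each attempt."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the original error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

The `sleep` parameter is injectable so the wait schedule can be tested without actually sleeping.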
### Graceful Degradation
- Fallback to original retrieval order if reranker fails
- Continue processing if individual documents fail
- Provide clear error messages with troubleshooting steps
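The reranker fallback above amounts to a try/except around the rerank call. A sketch (`rerank_fn` stands in for whatever reranking client the app actually uses):

```python
def rerank_with_fallback(query: str, chunks: list[str], rerank_fn) -> list[str]:
    """Return reranked chunks, or the original retrieval order if reranking fails."""
    try:
        return rerank_fn(query, chunks)
    except Exception:
        return chunks  # graceful degradation: keep vector-search order
```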
## 🔮 Future Enhancements
### Planned Improvements
- **Advanced Chunking**: Semantic chunking with sentence transformers
- **Hybrid Search**: Combine vector and keyword search
- **Multi-modal Support**: Image and document processing
- **Caching Layer**: Redis for frequently accessed results
- **Analytics Dashboard**: Query performance and usage metrics
### Scalability Considerations
- **Vector DB**: Pinecone pod scaling for larger datasets
- **Embedding Models**: Local models for cost reduction
- **Load Balancing**: Multiple LLM providers for redundancy
- **CDN Integration**: Static asset optimization
## ๐Ÿ“ Remarks
### Trade-offs Made
- **API Dependencies**: Relies on external services for embeddings and LLM
- **Cost vs Quality**: OpenAI embeddings provide quality but add cost
- **Latency**: Reranking adds ~100ms but significantly improves relevance
- **Chunking Strategy**: Fixed-size chunks for simplicity vs semantic chunking
### Provider Limits
- **OpenAI**: Rate limits and token limits per request
- **Pinecone**: Free tier index size and query limits
- **Cohere**: Reranking API rate limits
- **Groq**: Alternative LLM with different pricing model
### What I'd Do Next
1. **Implement semantic chunking** for better document understanding
2. **Add hybrid search** combining vector and keyword approaches
3. **Build evaluation framework** with automated testing
4. **Optimize for production** with proper logging and monitoring
5. **Add authentication** for multi-user support
## ๐Ÿ‘จโ€๐Ÿ’ป Author
**Your Name** - AI Engineer Assessment Candidate
- **GitHub**: [Your GitHub Profile]
- **LinkedIn**: [Your LinkedIn Profile]
- **Portfolio**: [Your Portfolio/Website]
## 📄 License
This project is created for the AI Engineer Assessment. Feel free to use and modify for learning purposes.
---
**Note**: This implementation demonstrates production-ready practices including proper error handling, environment variable management, comprehensive documentation, and scalable architecture design.