--- title: Mini RAG - Track B Assessment emoji: 🤖 colorFrom: blue colorTo: purple sdk: gradio sdk_version: 4.44.0 app_file: app.py pinned: false --- # Mini RAG - Track B Assessment A production-ready RAG (Retrieval-Augmented Generation) application that demonstrates text input, vector storage, retrieval + reranking, and LLM answering with inline citations. ## 🎯 Goal Build and host a small RAG app where users input text (upload file is optional) from the frontend, store it in a cloud-hosted vector DB, retrieve the most relevant chunks with a retriever + reranker, and answer queries via an LLM with proper citations. ## 🏗️ Architecture ``` ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │ Frontend │ │ Backend │ │ External │ │ (Gradio UI) │◄──►│ (Python) │◄──►│ Services │ └─────────────────┘ └─────────────────┘ └─────────────────┘ │ │ │ │ • Text Input/Upload │ • Text Processing │ • OpenAI API │ │ • Query Interface │ • Chunking Strategy │ • Groq API │ │ • Results Display │ • Embedding Generation │ • Cohere API │ │ • Citations & Sources │ • Vector Storage │ • Pinecone │ └─────────────────┘ └─────────────────┘ └─────────────────┘ ``` ### Data Flow 1. **Ingestion**: Text → Chunking → Embedding → Pinecone Vector DB 2. **Query**: Question → Embedding → Vector Search → Top-K Retrieval 3. **Reranking**: Retrieved chunks → Cohere Reranker → Reordered results 4. **Generation**: Reranked chunks → LLM → Answer with inline citations [1], [2] ## 🚀 Features ### ✅ Requirements Met - **Vector Database**: Pinecone cloud-hosted with serverless index - **Embeddings & Chunking**: OpenAI embeddings with configurable chunk size (400-1200 tokens) and overlap (10-15%) - **Retriever + Reranker**: Top-k retrieval with optional Cohere reranker - **LLM & Answering**: OpenAI/Groq with inline citations and source mapping - **Frontend**: Text input/upload, query interface, citations display, timing & cost estimates - **Metadata Storage**: Source, title, section, position tracking ### 🔧 Technical Details - **Chunking Strategy**: 800 tokens default with 120 token overlap (15%) - **Vector Dimension**: 1536 (OpenAI text-embedding-3-small) - **Index Configuration**: Pinecone serverless, cosine similarity - **Upsert Strategy**: Batch processing (100 chunks) with metadata preservation ## 🛠️ Setup ### Prerequisites - Python 3.8+ - Pinecone account and API key - OpenAI API key - Groq API key (optional) - Cohere API key (optional, for reranking) ### Installation 1. **Clone and setup environment** ```bash git clone cd mini-rag python -m venv .venv source .venv/bin/activate # On Windows: .\.venv\Scripts\activate pip install -r requirements.txt ``` 2. **Configure environment variables** ```bash cp .env.example .env # Edit .env with your API keys ``` 3. **Create data directory** ```bash mkdir data ``` 4. **Run the application** ```bash python app.py ``` ### Environment Variables ```bash # Pinecone PINECONE_API_KEY=your_pinecone_key PINECONE_INDEX=mini-rag-index PINECONE_CLOUD=aws PINECONE_REGION=us-east-1 # LLMs OPENAI_API_KEY=your_openai_key GROQ_API_KEY=your_groq_key # Reranker COHERE_API_KEY=your_cohere_key # Models EMBEDDING_MODEL=text-embedding-3-small LLM_PROVIDER=openai LLM_MODEL=gpt-4o-mini RERANK_PROVIDER=cohere RERANK_MODEL=rerank-3 # Chunking CHUNK_SIZE=800 CHUNK_OVERLAP=120 DATA_DIR=./data ``` ## 📊 Evaluation ### Gold Set Q&A Pairs 1. **Q:** What is the main topic of the document? **Expected:** Clear identification of document subject 2. **Q:** What are the key findings or conclusions? **Expected:** Specific facts or conclusions from the text 3. **Q:** What methodology was used? **Expected:** Description of approach or methods mentioned 4. **Q:** What are the limitations discussed? **Expected:** Any limitations or constraints mentioned 5. **Q:** What future work is suggested? **Expected:** Recommendations or future directions ### Success Metrics - **Precision**: Relevant information in answers - **Recall**: Coverage of available information - **Citation Accuracy**: Proper source attribution with [1], [2] format - **Response Time**: Query processing speed - **Cost Efficiency**: Token usage and API cost estimates ## 🚀 Deployment ### Free Hosting Options - **Hugging Face Spaces**: Gradio apps with free tier - **Render**: Free tier for Python web services - **Railway**: Free tier for small applications - **Vercel**: Free tier for static sites (with API routes) ### Deployment Steps 1. **Prepare for deployment** - Ensure all API keys are environment variables - Test locally with production settings - Add proper error handling and logging 2. **Deploy to chosen platform** - Follow platform-specific deployment guides - Set environment variables in platform dashboard - Configure domain and SSL if needed ## 📁 Project Structure ``` mini-rag/ ├── app.py # Gradio UI and main application ├── rag_core.py # RAG orchestration logic ├── llm.py # LLM provider abstraction ├── pinecone_client.py # Pinecone vector DB client ├── ingest.py # Document ingestion pipeline ├── chunker.py # Text chunking strategy ├── requirements.txt # Python dependencies ├── .env.example # Environment variables template ├── README.md # This file └── data/ # Document storage directory ``` ## 🔍 Usage Examples ### 1. Text Input Processing - Paste text into the "Text Input" tab - Configure chunk size (400-1200 tokens) and overlap (10-15%) - Click "Process & Store Text" to ingest into vector DB ### 2. File Ingestion - Place documents (.txt, .md, .pdf) in the `data/` directory - Use the "File Ingestion" tab to process all files - Monitor chunk count and processing status ### 3. Query and Answer - Navigate to "Query" tab - Enter your question - Adjust Top-K retrieval and reranker settings - Get answer with inline citations [1], [2] and source details ## 📈 Performance & Monitoring ### Metrics Tracked - **Processing Time**: End-to-end query response time - **Token Usage**: Query, context, and answer token counts - **Cost Estimates**: Embedding, LLM, and reranking costs - **Retrieval Quality**: Vector similarity scores and rerank scores ### Optimization Tips - Adjust chunk size based on document characteristics - Use reranker for better relevance (adds ~100ms but improves quality) - Batch process documents for efficient ingestion - Monitor Pinecone index performance and costs ## 🚨 Error Handling ### Common Issues - **Missing API Keys**: Check environment variables - **Pinecone Connection**: Verify index name and region - **Document Processing**: Check file formats and encoding - **Rate Limits**: Implement exponential backoff for API calls ### Graceful Degradation - Fallback to original retrieval order if reranker fails - Continue processing if individual documents fail - Provide clear error messages with troubleshooting steps ## 🔮 Future Enhancements ### Planned Improvements - **Advanced Chunking**: Semantic chunking with sentence transformers - **Hybrid Search**: Combine vector and keyword search - **Multi-modal Support**: Image and document processing - **Caching Layer**: Redis for frequently accessed results - **Analytics Dashboard**: Query performance and usage metrics ### Scalability Considerations - **Vector DB**: Pinecone pod scaling for larger datasets - **Embedding Models**: Local models for cost reduction - **Load Balancing**: Multiple LLM providers for redundancy - **CDN Integration**: Static asset optimization ## 📝 Remarks ### Trade-offs Made - **API Dependencies**: Relies on external services for embeddings and LLM - **Cost vs Quality**: OpenAI embeddings provide quality but add cost - **Latency**: Reranking adds ~100ms but significantly improves relevance - **Chunking Strategy**: Fixed-size chunks for simplicity vs semantic chunking ### Provider Limits - **OpenAI**: Rate limits and token limits per request - **Pinecone**: Free tier index size and query limits - **Cohere**: Reranking API rate limits - **Groq**: Alternative LLM with different pricing model ### What I'd Do Next 1. **Implement semantic chunking** for better document understanding 2. **Add hybrid search** combining vector and keyword approaches 3. **Build evaluation framework** with automated testing 4. **Optimize for production** with proper logging and monitoring 5. **Add authentication** for multi-user support ## 👨‍💻 Author **Your Name** - AI Engineer Assessment Candidate - **GitHub**: [Your GitHub Profile] - **LinkedIn**: [Your LinkedIn Profile] - **Portfolio**: [Your Portfolio/Website] ## 📄 License This project is created for the AI Engineer Assessment. Feel free to use and modify for learning purposes. --- **Note**: This implementation demonstrates production-ready practices including proper error handling, environment variable management, comprehensive documentation, and scalable architecture design.