---
title: Mini RAG - Track B Assessment
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
---
# Mini RAG - Track B Assessment

A production-ready RAG (Retrieval-Augmented Generation) application that demonstrates text input, vector storage, retrieval + reranking, and LLM answering with inline citations.
## Goal

Build and host a small RAG app where users enter text from the frontend (file upload is optional), store it in a cloud-hosted vector DB, retrieve the most relevant chunks with a retriever + reranker, and answer queries via an LLM with proper citations.
## Architecture

```
┌───────────────────────┐    ┌────────────────────────┐    ┌──────────────┐
│       Frontend        │    │        Backend         │    │   External   │
│      (Gradio UI)      │───▶│        (Python)        │───▶│   Services   │
│                       │    │                        │    │              │
│ • Text Input/Upload   │    │ • Text Processing      │    │ • OpenAI API │
│ • Query Interface     │    │ • Chunking Strategy    │    │ • Groq API   │
│ • Results Display     │    │ • Embedding Generation │    │ • Cohere API │
│ • Citations & Sources │    │ • Vector Storage       │    │ • Pinecone   │
└───────────────────────┘    └────────────────────────┘    └──────────────┘
```
### Data Flow

- **Ingestion**: Text → Chunking → Embedding → Pinecone Vector DB
- **Query**: Question → Embedding → Vector Search → Top-K Retrieval
- **Reranking**: Retrieved chunks → Cohere Reranker → Reordered results
- **Generation**: Reranked chunks → LLM → Answer with inline citations [1], [2]
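The query path above can be sketched as one orchestration function. The `embed`, `search`, `rerank`, and `generate` callables are hypothetical stand-ins for the OpenAI, Pinecone, and Cohere clients, injected so the flow itself stays testable:

```python
def answer_query(question, embed, search, rerank, generate, top_k=5):
    """Run the query path: embed -> vector search -> rerank -> generate."""
    vector = embed(question)               # question -> embedding
    chunks = search(vector, top_k=top_k)   # top-k nearest chunks from the index
    ordered = rerank(question, chunks)     # reranker reorders by relevance
    # Number the chunks so the LLM can cite them inline as [1], [2], ...
    context = "\n".join(f"[{i}] {c['text']}" for i, c in enumerate(ordered, 1))
    return generate(question, context), ordered
```

Returning the reranked chunks alongside the answer lets the UI map each `[n]` citation back to its source.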
## Features

### Requirements Met
- Vector Database: Pinecone cloud-hosted with serverless index
- Embeddings & Chunking: OpenAI embeddings with configurable chunk size (400-1200 tokens) and overlap (10-15%)
- Retriever + Reranker: Top-k retrieval with optional Cohere reranker
- LLM & Answering: OpenAI/Groq with inline citations and source mapping
- Frontend: Text input/upload, query interface, citations display, timing & cost estimates
- Metadata Storage: Source, title, section, position tracking
## Technical Details
- Chunking Strategy: 800 tokens default with 120 token overlap (15%)
- Vector Dimension: 1536 (OpenAI text-embedding-3-small)
- Index Configuration: Pinecone serverless, cosine similarity
- Upsert Strategy: Batch processing (100 chunks) with metadata preservation
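As an illustration of the fixed-size strategy, here is a minimal chunker using whitespace tokens as a stand-in for real tokenizer tokens (the app would presumably count tokens with the embedding model's tokenizer); the defaults match the 800/120 figures above:

```python
def chunk_text(text, chunk_size=800, overlap=120):
    """Split text into fixed-size token windows with the stated overlap."""
    tokens = text.split()
    step = chunk_size - overlap  # advance 680 tokens per window by default
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # last window already reached the end of the text
    return chunks
```

The overlap means each chunk repeats the tail of the previous one, so a fact that straddles a boundary is still retrievable from at least one chunk.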
## Setup

### Prerequisites
- Python 3.8+
- Pinecone account and API key
- OpenAI API key
- Groq API key (optional)
- Cohere API key (optional, for reranking)
### Installation

1. Clone and set up the environment:

```bash
git clone <your-repo-url>
cd mini-rag
python -m venv .venv
source .venv/bin/activate  # On Windows: .\.venv\Scripts\activate
pip install -r requirements.txt
```

2. Configure environment variables:

```bash
cp .env.example .env
# Edit .env with your API keys
```

3. Create the data directory:

```bash
mkdir data
```

4. Run the application:

```bash
python app.py
```
### Environment Variables

```env
# Pinecone
PINECONE_API_KEY=your_pinecone_key
PINECONE_INDEX=mini-rag-index
PINECONE_CLOUD=aws
PINECONE_REGION=us-east-1

# LLMs
OPENAI_API_KEY=your_openai_key
GROQ_API_KEY=your_groq_key

# Reranker
COHERE_API_KEY=your_cohere_key

# Models
EMBEDDING_MODEL=text-embedding-3-small
LLM_PROVIDER=openai
LLM_MODEL=gpt-4o-mini
RERANK_PROVIDER=cohere
RERANK_MODEL=rerank-3

# Chunking
CHUNK_SIZE=800
CHUNK_OVERLAP=120
DATA_DIR=./data
```
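A small sketch of how the app might read these settings at startup, with the documented values as defaults and a fail-fast check for required keys (`require_env` is an illustrative helper, not necessarily the app's actual code):

```python
import os

# Defaults mirror the values documented above; override via the environment.
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "800"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "120"))
LLM_PROVIDER = os.getenv("LLM_PROVIDER", "openai")

def require_env(name):
    """Fail fast at startup if a required key (e.g. PINECONE_API_KEY) is unset."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

Failing at startup beats failing mid-query: a missing `PINECONE_API_KEY` surfaces immediately instead of on the first user request.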
## Evaluation

### Gold Set Q&A Pairs

1. **Q:** What is the main topic of the document? **Expected:** Clear identification of the document's subject
2. **Q:** What are the key findings or conclusions? **Expected:** Specific facts or conclusions from the text
3. **Q:** What methodology was used? **Expected:** Description of the approach or methods mentioned
4. **Q:** What are the limitations discussed? **Expected:** Any limitations or constraints mentioned
5. **Q:** What future work is suggested? **Expected:** Recommendations or future directions
### Success Metrics
- Precision: Relevant information in answers
- Recall: Coverage of available information
- Citation Accuracy: Proper source attribution with [1], [2] format
- Response Time: Query processing speed
- Cost Efficiency: Token usage and API cost estimates
## Deployment

### Free Hosting Options
- Hugging Face Spaces: Gradio apps with free tier
- Render: Free tier for Python web services
- Railway: Free tier for small applications
- Vercel: Free tier for static sites (with API routes)
### Deployment Steps

1. **Prepare for deployment**
   - Ensure all API keys are environment variables
   - Test locally with production settings
   - Add proper error handling and logging
2. **Deploy to your chosen platform**
   - Follow platform-specific deployment guides
   - Set environment variables in the platform dashboard
   - Configure domain and SSL if needed
## Project Structure

```
mini-rag/
├── app.py              # Gradio UI and main application
├── rag_core.py         # RAG orchestration logic
├── llm.py              # LLM provider abstraction
├── pinecone_client.py  # Pinecone vector DB client
├── ingest.py           # Document ingestion pipeline
├── chunker.py          # Text chunking strategy
├── requirements.txt    # Python dependencies
├── .env.example        # Environment variables template
├── README.md           # This file
└── data/               # Document storage directory
```
## Usage Examples

### 1. Text Input Processing
- Paste text into the "Text Input" tab
- Configure chunk size (400-1200 tokens) and overlap (10-15%)
- Click "Process & Store Text" to ingest into vector DB
### 2. File Ingestion

- Place documents (.txt, .md, .pdf) in the data/ directory
- Use the "File Ingestion" tab to process all files
- Monitor chunk count and processing status
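The file-discovery step might look like the sketch below. PDF extraction is omitted here; the app would route `.pdf` files through a PDF library before chunking, and `iter_documents` is an illustrative name rather than the actual `ingest.py` API:

```python
from pathlib import Path

def iter_documents(data_dir="./data", exts=(".txt", ".md")):
    """Yield (filename, text) for supported files in the data directory."""
    for path in sorted(Path(data_dir).iterdir()):
        if path.suffix.lower() in exts:
            # Assumes UTF-8; real ingestion may need encoding detection
            yield path.name, path.read_text(encoding="utf-8")
```

Yielding one document at a time keeps memory flat even when the data directory holds many files.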
### 3. Query and Answer
- Navigate to "Query" tab
- Enter your question
- Adjust Top-K retrieval and reranker settings
- Get answer with inline citations [1], [2] and source details
## Performance & Monitoring

### Metrics Tracked
- Processing Time: End-to-end query response time
- Token Usage: Query, context, and answer token counts
- Cost Estimates: Embedding, LLM, and reranking costs
- Retrieval Quality: Vector similarity scores and rerank scores
### Optimization Tips
- Adjust chunk size based on document characteristics
- Use reranker for better relevance (adds ~100ms but improves quality)
- Batch process documents for efficient ingestion
- Monitor Pinecone index performance and costs
## Error Handling

### Common Issues
- Missing API Keys: Check environment variables
- Pinecone Connection: Verify index name and region
- Document Processing: Check file formats and encoding
- Rate Limits: Implement exponential backoff for API calls
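The exponential backoff mentioned above could be implemented as a small retry wrapper; the retry count and delays here are illustrative, not the app's actual values:

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry fn on any exception, doubling the delay each attempt, with jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            # 1s, 2s, 4s, ... plus jitter proportional to the base delay
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```

The jitter spreads retries out so that many concurrent clients hitting a rate limit do not all retry at the same instant.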
### Graceful Degradation
- Fallback to original retrieval order if reranker fails
- Continue processing if individual documents fail
- Provide clear error messages with troubleshooting steps
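The reranker fallback is the simplest of these to sketch; `rerank_fn` stands in for a wrapper around the Cohere call:

```python
def safe_rerank(question, chunks, rerank_fn):
    """Rerank chunks, falling back to the original vector-search order."""
    try:
        return rerank_fn(question, chunks)
    except Exception:
        # Degrade gracefully: keep the similarity-ranked order from the index
        return chunks
```

Because the retriever already returns chunks sorted by similarity, the fallback still produces a usable answer, just without the reranker's relevance boost.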
## Future Enhancements

### Planned Improvements
- Advanced Chunking: Semantic chunking with sentence transformers
- Hybrid Search: Combine vector and keyword search
- Multi-modal Support: Image and document processing
- Caching Layer: Redis for frequently accessed results
- Analytics Dashboard: Query performance and usage metrics
### Scalability Considerations
- Vector DB: Pinecone pod scaling for larger datasets
- Embedding Models: Local models for cost reduction
- Load Balancing: Multiple LLM providers for redundancy
- CDN Integration: Static asset optimization
## Remarks

### Trade-offs Made
- API Dependencies: Relies on external services for embeddings and LLM
- Cost vs Quality: OpenAI embeddings provide quality but add cost
- Latency: Reranking adds ~100ms but significantly improves relevance
- Chunking Strategy: Fixed-size chunks for simplicity vs semantic chunking
### Provider Limits
- OpenAI: Rate limits and token limits per request
- Pinecone: Free tier index size and query limits
- Cohere: Reranking API rate limits
- Groq: Alternative LLM with different pricing model
### What I'd Do Next
- Implement semantic chunking for better document understanding
- Add hybrid search combining vector and keyword approaches
- Build evaluation framework with automated testing
- Optimize for production with proper logging and monitoring
- Add authentication for multi-user support
## Author

Your Name - AI Engineer Assessment Candidate
- GitHub: [Your GitHub Profile]
- LinkedIn: [Your LinkedIn Profile]
- Portfolio: [Your Portfolio/Website]
## License
This project is created for the AI Engineer Assessment. Feel free to use and modify for learning purposes.
Note: This implementation demonstrates production-ready practices including proper error handling, environment variable management, comprehensive documentation, and scalable architecture design.