---
title: Mini RAG - Track B Assessment
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
---

# Mini RAG - Track B Assessment

A production-ready RAG (Retrieval-Augmented Generation) application that demonstrates text input, vector storage, retrieval + reranking, and LLM answering with inline citations.

## 🎯 Goal
Build and host a small RAG app where users input text (upload file is optional) from the frontend, store it in a cloud-hosted vector DB, retrieve the most relevant chunks with a retriever + reranker, and answer queries via an LLM with proper citations.

## 🏗️ Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Frontend      │    │   Backend       │    │   External      │
│   (Gradio UI)   │◄──►│   (Python)      │◄──►│   Services      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
│                        │                        │
│ • Text Input/Upload    │ • Text Processing      │ • OpenAI API    │
│ • Query Interface      │ • Chunking Strategy    │ • Groq API      │
│ • Results Display      │ • Embedding Generation │ • Cohere API    │
│ • Citations & Sources  │ • Vector Storage      │ • Pinecone      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```

### Data Flow
1. **Ingestion**: Text → Chunking → Embedding → Pinecone Vector DB
2. **Query**: Question → Embedding → Vector Search → Top-K Retrieval
3. **Reranking**: Retrieved chunks → Cohere Reranker → Reordered results
4. **Generation**: Reranked chunks → LLM → Answer with inline citations [1], [2]

## 🚀 Features

### ✅ Requirements Met
- **Vector Database**: Pinecone cloud-hosted with serverless index
- **Embeddings & Chunking**: OpenAI embeddings with configurable chunk size (400-1200 tokens) and overlap (10-15%)
- **Retriever + Reranker**: Top-k retrieval with optional Cohere reranker
- **LLM & Answering**: OpenAI/Groq with inline citations and source mapping
- **Frontend**: Text input/upload, query interface, citations display, timing & cost estimates
- **Metadata Storage**: Source, title, section, position tracking

### 🔧 Technical Details
- **Chunking Strategy**: 800 tokens default with 120 token overlap (15%)
- **Vector Dimension**: 1536 (OpenAI text-embedding-3-small)
- **Index Configuration**: Pinecone serverless, cosine similarity
- **Upsert Strategy**: Batch processing (100 chunks) with metadata preservation

## 🛠️ Setup

### Prerequisites
- Python 3.8+
- Pinecone account and API key
- OpenAI API key
- Groq API key (optional)
- Cohere API key (optional, for reranking)

### Installation

1. **Clone and setup environment**
```bash
git clone <your-repo-url>
cd mini-rag
python -m venv .venv
source .venv/bin/activate  # On Windows: .\.venv\Scripts\activate
pip install -r requirements.txt
```

2. **Configure environment variables**
```bash
cp .env.example .env
# Edit .env with your API keys
```

3. **Create data directory**
```bash
mkdir data
```

4. **Run the application**
```bash
python app.py
```

### Environment Variables
```bash
# Pinecone
PINECONE_API_KEY=your_pinecone_key
PINECONE_INDEX=mini-rag-index
PINECONE_CLOUD=aws
PINECONE_REGION=us-east-1

# LLMs
OPENAI_API_KEY=your_openai_key
GROQ_API_KEY=your_groq_key

# Reranker
COHERE_API_KEY=your_cohere_key

# Models
EMBEDDING_MODEL=text-embedding-3-small
LLM_PROVIDER=openai
LLM_MODEL=gpt-4o-mini
RERANK_PROVIDER=cohere
RERANK_MODEL=rerank-3

# Chunking
CHUNK_SIZE=800
CHUNK_OVERLAP=120
DATA_DIR=./data
```

## 📊 Evaluation

### Gold Set Q&A Pairs
1. **Q:** What is the main topic of the document?
   **Expected:** Clear identification of document subject
   
2. **Q:** What are the key findings or conclusions?
   **Expected:** Specific facts or conclusions from the text
   
3. **Q:** What methodology was used?
   **Expected:** Description of approach or methods mentioned
   
4. **Q:** What are the limitations discussed?
   **Expected:** Any limitations or constraints mentioned
   
5. **Q:** What future work is suggested?
   **Expected:** Recommendations or future directions

### Success Metrics
- **Precision**: Relevant information in answers
- **Recall**: Coverage of available information  
- **Citation Accuracy**: Proper source attribution with [1], [2] format
- **Response Time**: Query processing speed
- **Cost Efficiency**: Token usage and API cost estimates

## 🚀 Deployment

### Free Hosting Options
- **Hugging Face Spaces**: Gradio apps with free tier
- **Render**: Free tier for Python web services
- **Railway**: Free tier for small applications
- **Vercel**: Free tier for static sites (with API routes)

### Deployment Steps
1. **Prepare for deployment**
   - Ensure all API keys are environment variables
   - Test locally with production settings
   - Add proper error handling and logging

2. **Deploy to chosen platform**
   - Follow platform-specific deployment guides
   - Set environment variables in platform dashboard
   - Configure domain and SSL if needed

## 📁 Project Structure
```
mini-rag/
├── app.py              # Gradio UI and main application
├── rag_core.py         # RAG orchestration logic
├── llm.py             # LLM provider abstraction
├── pinecone_client.py # Pinecone vector DB client
├── ingest.py          # Document ingestion pipeline
├── chunker.py         # Text chunking strategy
├── requirements.txt   # Python dependencies
├── .env.example      # Environment variables template
├── README.md         # This file
└── data/             # Document storage directory
```

## 🔍 Usage Examples

### 1. Text Input Processing
- Paste text into the "Text Input" tab
- Configure chunk size (400-1200 tokens) and overlap (10-15%)
- Click "Process & Store Text" to ingest into vector DB

### 2. File Ingestion
- Place documents (.txt, .md, .pdf) in the `data/` directory
- Use the "File Ingestion" tab to process all files
- Monitor chunk count and processing status

### 3. Query and Answer
- Navigate to "Query" tab
- Enter your question
- Adjust Top-K retrieval and reranker settings
- Get answer with inline citations [1], [2] and source details

## 📈 Performance & Monitoring

### Metrics Tracked
- **Processing Time**: End-to-end query response time
- **Token Usage**: Query, context, and answer token counts
- **Cost Estimates**: Embedding, LLM, and reranking costs
- **Retrieval Quality**: Vector similarity scores and rerank scores

### Optimization Tips
- Adjust chunk size based on document characteristics
- Use reranker for better relevance (adds ~100ms but improves quality)
- Batch process documents for efficient ingestion
- Monitor Pinecone index performance and costs

## 🚨 Error Handling

### Common Issues
- **Missing API Keys**: Check environment variables
- **Pinecone Connection**: Verify index name and region
- **Document Processing**: Check file formats and encoding
- **Rate Limits**: Implement exponential backoff for API calls

### Graceful Degradation
- Fallback to original retrieval order if reranker fails
- Continue processing if individual documents fail
- Provide clear error messages with troubleshooting steps

## 🔮 Future Enhancements

### Planned Improvements
- **Advanced Chunking**: Semantic chunking with sentence transformers
- **Hybrid Search**: Combine vector and keyword search
- **Multi-modal Support**: Image and document processing
- **Caching Layer**: Redis for frequently accessed results
- **Analytics Dashboard**: Query performance and usage metrics

### Scalability Considerations
- **Vector DB**: Pinecone pod scaling for larger datasets
- **Embedding Models**: Local models for cost reduction
- **Load Balancing**: Multiple LLM providers for redundancy
- **CDN Integration**: Static asset optimization

## 📝 Remarks

### Trade-offs Made
- **API Dependencies**: Relies on external services for embeddings and LLM
- **Cost vs Quality**: OpenAI embeddings provide quality but add cost
- **Latency**: Reranking adds ~100ms but significantly improves relevance
- **Chunking Strategy**: Fixed-size chunks for simplicity vs semantic chunking

### Provider Limits
- **OpenAI**: Rate limits and token limits per request
- **Pinecone**: Free tier index size and query limits
- **Cohere**: Reranking API rate limits
- **Groq**: Alternative LLM with different pricing model

### What I'd Do Next
1. **Implement semantic chunking** for better document understanding
2. **Add hybrid search** combining vector and keyword approaches
3. **Build evaluation framework** with automated testing
4. **Optimize for production** with proper logging and monitoring
5. **Add authentication** for multi-user support

## 👨‍💻 Author

**Your Name** - AI Engineer Assessment Candidate
- **GitHub**: [Your GitHub Profile]
- **LinkedIn**: [Your LinkedIn Profile]
- **Portfolio**: [Your Portfolio/Website]

## 📄 License

This project is created for the AI Engineer Assessment. Feel free to use and modify for learning purposes.

---

**Note**: This implementation demonstrates production-ready practices including proper error handling, environment variable management, comprehensive documentation, and scalable architecture design.