---
title: Cortex RAG
sdk: docker
emoji: 🧠
colorFrom: blue
colorTo: purple
---
# Cortex RAG — Next-Gen Retrieval-Augmented Generation
**Production-grade RAG system with dense retrieval, semantic chunking, knowledge graph integration, CRAG gating, and multi-provider LLM support.**




---
## 🎯 Overview
**Cortex** is a production-ready Retrieval-Augmented Generation (RAG) framework that combines:
- **Dense Vector Search** — Fast, accurate document retrieval using BAAI embeddings (384-dim)
- **Semantic Chunking** — Intelligent split boundaries based on sentence-level cosine similarity
- **Parent-Child Chunks** — 256-token child chunks for precision, 1024-token parents for context
- **Multi-Strategy Retrieval** — Dense search, BM25 hybrid, knowledge graph traversal
- **CRAG Gating** — Automatic relevance assessment with fallback to web search
- **Multi-Provider LLM** — Support for Groq, OpenAI, NVIDIA NIM, and custom endpoints
- **Streaming Responses** — Real-time SSE-based answer generation with inline citations
- **Knowledge Graphs** — Automatic relation extraction and entity-based retrieval
- **Caching Layer** — Redis integration for query result caching
- **Evaluation Framework** — RAGAS-based RAG evaluation metrics
---
## 🏗️ Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ Document Ingestion │
├─────────────────────────────────────────────────────────────────┤
│ PDF/HTML/TXT → DocumentLoader → SemanticChunker │
│ ↓ │
│ Child (~256 tokens) + Parent (~1024 tokens) │
├─────────────────────────────────────────────────────────────────┤
│ Embedding Layer │
├─────────────────────────────────────────────────────────────────┤
│ BAAI/bge-small-en-v1.5 (384-dim, L2-normalized) │
│ → Milvus Store (IVF_FLAT, COSINE metric) │
│ → BM25 Index (keyword search) │
│ → Knowledge Graph (entities, relations, triples) │
├─────────────────────────────────────────────────────────────────┤
│ Query Processing │
├─────────────────────────────────────────────────────────────────┤
│ Dense Search (top-15) → Reranking → CRAG Gate │
│ ↓ ↓ │
│ High Confidence? Low Confidence? │
│ ↓ ↓ │
│ Use KnowledgeBase ⚠️ Web Search (Tavily) │
├─────────────────────────────────────────────────────────────────┤
│ LLM Generation (Streaming) │
├─────────────────────────────────────────────────────────────────┤
│ Groq Llama 3.3-70B / OpenAI GPT-4o / NVIDIA NIM / Custom │
│ Process context → Generate answer → Extract citations │
│ Stream via SSE → Client receives real-time response │
├─────────────────────────────────────────────────────────────────┤
│ Frontend Interfaces │
├─────────────────────────────────────────────────────────────────┤
│ Streamlit UI (Ask/Ingest/System) | REST API (FastAPI) │
└─────────────────────────────────────────────────────────────────┘
```
---
## ✨ Key Features
| Feature | Details |
|---------|---------|
| 🔍 **Dense Retrieval** | Sub-50ms semantic search via Milvus with 384-dim embeddings |
| 📚 **Smart Chunking** | Semantic splits + parent-child hierarchy for precision + context |
| 🧬 **Knowledge Graphs** | Automatic relation extraction (REBEL or LLM-based) |
| 🚨 **CRAG Gating** | Relevance assessment with web search fallback |
| 🔗 **Multi-Strategy** | Dense + BM25 keyword + graph traversal combined |
| 💾 **Redis Cache** | Query result caching with configurable TTL |
| 🌐 **Multi-Provider LLM** | Groq, OpenAI, NVIDIA NIM, Ollama, custom OpenAI-compatible |
| 📊 **Evaluation** | RAGAS metrics for answer relevance, faithfulness, context precision |
| 🎨 **Streaming UI** | Real-time responses with inline citations and source cards |
| 🐳 **Docker Ready** | Full Docker Compose setup with Milvus, Redis, API, UI |
---
## 🚀 Quick Start
### Prerequisites
- Python 3.10+
- Docker & Docker Compose (optional, for containerized setup)
- GROQ API key (default LLM provider)
### 1. Clone & Setup
```bash
# Clone repository
git clone
cd cortex
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
```
### 2. Environment Configuration
Create `.env` file in project root:
```bash
# LLM Providers
GROQ_API_KEY=your_groq_api_key
GROQ_MODEL=llama-3.3-70b-versatile
GROQ_TEMPERATURE=0.1
# Optional: Other LLM providers
OPENAI_API_KEY=your_openai_key
MISTRAL_API_KEY=your_mistral_key
NVIDIA_API_KEY=your_nvidia_key
# Embedding & Storage
EMBED_MODEL_NAME=BAAI/bge-small-en-v1.5
EMBED_DEVICE=cpu # "cuda" if GPU available
# Milvus Vector Store
MILVUS_HOST=localhost
MILVUS_PORT=19530
MILVUS_COLLECTION=cortex_chunks
MILVUS_INDEX_TYPE=IVF_FLAT
# Redis Cache (optional)
REDIS_URL=redis://localhost:6379
# Retrieval
RETRIEVAL_TOP_K=15
FINAL_TOP_K=5
# CRAG (Consistency-based Retrieval Augmented Generation)
CRAG_ENABLED=true
CRAG_RELEVANCE_THRESHOLD=0.5
# Knowledge Graph
GRAPH_ENABLED=true
GRAPH_EXTRACTOR=llm-filtered # "rebel", "llm", "rebel-filtered", "llm-filtered"
GRAPH_MAX_HOPS=2
# API
API_HOST=0.0.0.0
API_PORT=8000
```
### 3. Start Services
**Option A: Docker Compose (Recommended)**
```bash
docker-compose up -d
# API: http://localhost:8000
# Streamlit UI: http://localhost:8501
# Milvus: http://localhost:19530
```
**Option B: Local Setup**
Make sure Milvus is running:
```bash
# Using Milvus Docker (if not using compose)
docker run -d -p 19530:19530 -p 9091:9091 milvusdb/milvus:latest
# Start API
python -m uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
# In another terminal, start UI
streamlit run ui/app.py
```
### 4. Ingest Documents
**Via Streamlit UI:**
- Open http://localhost:8501
- Go to "📥 Ingest" tab
- Upload PDF/HTML/TXT or provide directory path
**Via REST API:**
```bash
curl -X POST "http://localhost:8000/ingest" \
-H "Content-Type: application/json" \
-d '{
"mode": "directory",
"path": "/path/to/documents"
}'
```
### 5. Ask Questions
**Via Streamlit UI:**
- Go to "🔍 Ask" tab
- Type your question
- Watch streaming response with citations
**Via REST API:**
```bash
curl -X POST "http://localhost:8000/query" \
-H "Content-Type: application/json" \
-d '{
"query": "What is machine learning?",
"provider": "groq",
"top_k": 5
}' | jq .
```
**Streaming Response:**
```bash
curl -X POST "http://localhost:8000/query/stream" \
-H "Content-Type: application/json" \
-d '{
"query": "Your question here",
"provider": "groq"
}'
```
---
## 📡 REST API Endpoints
### Health & Status
```http
GET /health
```
Returns system health, Milvus status, collection stats.
```json
{
"status": "healthy",
"milvus": {
"connected": true,
"collection_count": 2500,
"index_type": "IVF_FLAT"
}
}
```
### Document Ingestion
```http
POST /ingest
Content-Type: application/json
{
"mode": "directory|file|upload",
"path": "/path/to/documents",
"chunk_size": 256,
"overlap": 32
}
```
### Query (Blocking)
```http
POST /query
Content-Type: application/json
{
"query": "Your question",
"provider": "groq",
"model": "llama-3.3-70b-versatile",
"top_k": 5,
"crag": true,
"graph": true
}
```
**Response:**
```json
{
"answer": "Answer text with citations [1][2]...",
"chunks": [
{
"id": "chunk_001",
"text": "...",
"score": 0.87,
"source": "document_name.pdf"
}
],
"citations": [1, 2],
"latency_ms": 1245
}
```
### Query (Streaming)
```http
POST /query/stream
Content-Type: application/json
{
"query": "Your question",
"provider": "groq"
}
```
**Response:** Server-Sent Events (SSE) stream
```
data: {"type": "start"}
data: {"type": "chunk", "content": "Answer "}
data: {"type": "chunk", "content": "is "}
data: {"type": "chunk", "content": "streaming..."}
data: {"type": "citations", "citations": [1, 2]}
data: {"type": "end"}
```
### Model Information
```http
GET /providers
```
Lists all available LLM providers and models.
---
## 🛠️ Configuration Guide
### Retrieval Configuration
```env
# Chunk sizes (tokens)
CHUNK_SIZE_TOKENS=256 # Child chunk size
PARENT_CHUNK_SIZE_TOKENS=1024 # Parent chunk size
SEMANTIC_SIMILARITY_THRESHOLD=0.82 # Split boundary threshold
CHUNK_OVERLAP_TOKENS=32 # Overlap padding
# Retrieval settings
RETRIEVAL_TOP_K=15 # Candidates before reranking
FINAL_TOP_K=5 # Chunks sent to LLM
```
### Embedding Configuration
```env
EMBED_MODEL_NAME=BAAI/bge-small-en-v1.5 # Model identifier
EMBED_DIM=384 # Output dimension
EMBED_BATCH_SIZE=64 # Batch size for processing
EMBED_DEVICE=cpu # cpu or cuda
```
### Milvus Configuration
```env
MILVUS_HOST=localhost
MILVUS_PORT=19530
MILVUS_COLLECTION=cortex_chunks
MILVUS_INDEX_TYPE=IVF_FLAT # or HNSW for larger corpora
MILVUS_METRIC_TYPE=COSINE # Vector similarity metric
MILVUS_NLIST=128 # clustering parameter for IVF
MILVUS_NPROBE=16 # search parameter
```
### LLM Provider Configuration
**Groq (Default)**
```env
GROQ_API_KEY=your_key
GROQ_MODEL=llama-3.3-70b-versatile
GROQ_TEMPERATURE=0.1
GROQ_MAX_TOKENS=1024
GROQ_TIMEOUT=30
```
**OpenAI**
```env
OPENAI_API_KEY=your_key
```
**NVIDIA NIM**
```env
NVIDIA_API_KEY=your_key
```
**Custom/Ollama**
```env
CUSTOM_BASE_URL=http://localhost:11434/v1
CUSTOM_API_KEY=your_key
```
### CRAG (Consistency-based Retrieval Augmented Generation)
```env
CRAG_ENABLED=true
CRAG_RELEVANCE_THRESHOLD=0.5 # Grade boundary
TAVILY_API_KEY=your_tavily_key # For web search fallback
```
The CRAG gate automatically assesses retrieval quality:
- **High confidence** (score ≥ threshold) → Use knowledge base
- **Low confidence** (score < threshold) → Augment with web search
### Knowledge Graph
```env
GRAPH_ENABLED=true
GRAPH_EXTRACTOR=llm-filtered # rebel|llm|rebel-filtered|llm-filtered
GRAPH_MAX_HOPS=2 # Traversal depth
GRAPH_PATH=/data/storage/knowledge_graph.json
# Density filtering (for "filtered" extractors)
DENSITY_TOP_FRACTION=0.30 # Process top 30% entity-dense chunks
DENSITY_MIN_ENTITIES=2 # Minimum entities per chunk
```
### Caching
```env
REDIS_URL=redis://localhost:6379
CACHE_TTL_SECONDS=3600 # 1 hour
```
### Evaluation
```env
EVAL_DB_PATH=/data/storage/eval.db
```
---
## 📁 Project Structure
```
cortex/
├── api/ # FastAPI REST endpoints
│ ├── main.py # App initialization, endpoints
│ └── schemas.py # Request/response Pydantic models
│
├── ingestion/ # Document processing pipeline
│ ├── pipeline.py # Orchestration
│ ├── document_loader.py # PDF/HTML/TXT parsing
│ ├── chunker.py # Semantic chunking
│ └── __init__.py
│
├── retrieval/ # Multi-strategy retrieval
│ ├── orchestrator.py # Coordinate retrieval strategies
│ ├── dense.py # Milvus vector search
│ ├── bm25.py # Keyword search index
│ ├── embedder.py # HuggingFace embedding model
│ ├── router.py # Query routing logic
│ ├── fusion.py # Result fusion & reranking
│ ├── graph_builder.py # Build knowledge graphs
│ ├── graph_retriever.py # Entity-based retrieval
│ ├── relation_extractors.py # REBEL + LLM extractors
│ ├── cache.py # Redis caching wrapper
│ └── __init__.py
│
├── generation/ # LLM generation & CRAG
│ ├── generator.py # Multi-provider LLM wrapper
│ ├── crag.py # CRAG gate logic
│ └── __init__.py
│
├── evaluation/ # RAG evaluation metrics
│ ├── ragas_eval.py # RAGAS evaluator
│ ├── store.py # Evaluation database
│ └── __init__.py
│
├── ui/ # Streamlit frontend
│ ├── app.py # Main UI
│ └── static/ # (Optional) HTML/CSS/JS
│
├── data/ # Data storage
│ ├── documents/ # Input documents
│ ├── storage/ # Persistent storage
│ │ ├── knowledge_graph.json
│ │ ├── bm25_index.pkl
│ │ └── uploads/
│ └── synthetic_knowledge_items.txt
│
├── config.py # Configuration & settings
├── requirements.txt # Python dependencies
├── Dockerfile # Docker image build
├── docker-compose.yml # Multi-container orchestration
├── test.py # Test suite
└── README.md # This file
```
---
## 🐳 Docker & Deployment
### Docker Compose Quick Deploy
```bash
# Start all services
docker-compose up -d
# View logs
docker-compose logs -f api
# Stop services
docker-compose down
```
**Services:**
- `milvus` — Vector database (port 19530)
- `redis` — Caching layer (port 6379)
- `api` — FastAPI backend (port 8000)
- `ui` — Streamlit frontend (port 8501)
### Environment Variables in Compose
Edit `docker-compose.yml` to customize:
```yaml
services:
api:
environment:
- GROQ_API_KEY=${GROQ_API_KEY}
- GROQ_MODEL=llama-3.3-70b-versatile
- MILVUS_HOST=milvus
- REDIS_URL=redis://redis:6379
- GRAPH_EXTRACTOR=llm-filtered
```
### Production Deployment
For production, consider:
1. **Use HNSW index** instead of IVF_FLAT for better recall:
```env
MILVUS_INDEX_TYPE=HNSW
```
2. **Enable caching** for frequently asked questions:
```env
REDIS_URL=redis://redis-prod:6379
```
3. **Use stronger embedding model** for higher quality:
```env
EMBED_MODEL_NAME=BAAI/bge-base-en-v1.5 # 768-dim, better quality
```
4. **Configure CRAG** for reliability:
```env
CRAG_ENABLED=true
CRAG_RELEVANCE_THRESHOLD=0.6
TAVILY_API_KEY=your_key
```
---
## 🔄 Workflow Examples
### Example 1: Legal Document Q&A
```bash
# 1. Ingest legal documents
curl -X POST "http://localhost:8000/ingest" \
-H "Content-Type: application/json" \
-d '{
"mode": "directory",
"path": "/data/legal_documents"
}'
# 2. Query with graph enabled for relation extraction
curl -X POST "http://localhost:8000/query" \
-H "Content-Type: application/json" \
-d '{
"query": "What are the penalties for breach of contract?",
"provider": "groq",
"graph": true,
"crag": true
}'
```
### Example 2: Research Paper Analysis
```bash
# Ingest PDF papers
python -c "
from ingestion.pipeline import IngestionPipeline
from retrieval.embedder import Embedder
from retrieval.dense import MilvusStore
embedder = Embedder()
store = MilvusStore(embedder=embedder)
pipeline = IngestionPipeline(embedder=embedder, store=store, bm25=None)
pipeline.ingest('/data/papers', mode='pdf')
"
# Query for specific findings
curl -X POST "http://localhost:8000/query/stream" \
-H "Content-Type: application/json" \
-d '{
"query": "What are the key findings about transformer performance?",
"model": "gpt-4o"
}'
```
### Example 3: Customer Support Bot
```bash
# 1. Ingest FAQ and documentation
# 2. Set up CRAG with relevant threshold
# 3. Route low-confidence queries to web search
CRAG_RELEVANCE_THRESHOLD=0.6
TAVILY_API_KEY=your_key
```
---
## 📊 Advanced Features
### Knowledge Graph Extraction
Three modes available:
| Mode | Backend | Speed | Quality | Cost |
|------|---------|-------|---------|------|
| `rebel` | Local REBEL model | Fast | Good | Free |
| `llm` | LLM (Groq/OpenAI) | Slower | Excellent | $$ |
| `rebel-filtered` | REBEL + entity filtering | Fast | Good | Free |
| `llm-filtered` | LLM + entity filtering | Slower | Excellent | $$ |
Switch via config:
```env
GRAPH_EXTRACTOR=llm-filtered
```
### CRAG (Consistency-based RAG)
Automatically:
1. Evaluates retrieval confidence
2. Assigns relevance grade (Correct/Partially-Correct/Missing)
3. Supplements low-confidence with web search via Tavily
```python
from generation.crag import CRAGGate
crag = CRAGGate()
response = crag.evaluate(query, context, answer)
# Returns: grade, supplemental_docs
```
### Evaluation & Metrics
RAGAS-based evaluation:
```python
from evaluation.ragas_eval import RAGASEvaluator
from evaluation.store import EvalStore
evaluator = RAGASEvaluator(store=EvalStore())
metrics = evaluator.evaluate(query, context, answer)
# Returns: answer_relevance, faithfulness, context_precision
```
### Caching Strategy
```python
from retrieval.cache import CachedRetriever
retriever = CachedRetriever(base_retriever)
# First call: 1000ms (database query)
# Second call: 5ms (Redis cache hit, TTL: 1 hour)
results = retriever.retrieve("machine learning basics")
```
---
## ⚙️ Performance Tuning
### For Speed
```env
# Smaller embedding model
EMBED_MODEL_NAME=BAAI/bge-small-en-v1.5
# Smaller chunks
CHUNK_SIZE_TOKENS=128
PARENT_CHUNK_SIZE_TOKENS=512
# Faster index
MILVUS_INDEX_TYPE=IVF_FLAT
MILVUS_NPROBE=8 # Lower = faster
# Enable cache
REDIS_URL=redis://localhost:6379
# Fewer LLM tokens
GROQ_MAX_TOKENS=512
```
### For Quality
```env
# Larger embedding model
EMBED_MODEL_NAME=BAAI/bge-base-en-v1.5
# Optimal chunks
CHUNK_SIZE_TOKENS=512
PARENT_CHUNK_SIZE_TOKENS=2048
# More precise index
MILVUS_INDEX_TYPE=HNSW
MILVUS_NPROBE=32
# Better LLM
GROQ_MODEL=llama-3.3-70b-versatile
# Enable CRAG
CRAG_ENABLED=true
```
---
## 🐛 Troubleshooting
### Milvus Connection Failed
```bash
# Check if Milvus is running
curl http://localhost:19530/healthz
# Restart Milvus
docker-compose restart milvus
# Verify in settings
python -c "from config import get_settings; print(get_settings().milvus_host)"
```
### Low Retrieval Quality
1. **Check chunk quality:**
```python
from ingestion.chunker import SemanticChunker
chunker = SemanticChunker()
chunks = chunker.chunk("your document text")
print([c.text for c in chunks[:3]])
```
2. **Verify embeddings:**
```python
from retrieval.embedder import Embedder
embedder = Embedder()
emb = embedder.embed("test query")
print(f"Embedding dim: {len(emb)}, sample: {emb[:5]}")
```
3. **Enable CRAG** for automatic augmentation:
```env
CRAG_ENABLED=true
```
### Slow Response Times
1. Check cache hit rate
2. Reduce `MILVUS_NPROBE`
3. Use streaming endpoint (`/query/stream`)
4. Enable Redis caching
### Out of Memory
```env
# Reduce batch sizes
EMBED_BATCH_SIZE=16
# Reduce chunk sizes
CHUNK_SIZE_TOKENS=128
# Switch to CPU if using GPU
EMBED_DEVICE=cpu
```
---
## 📈 Monitoring & Evaluation
### Health Check
```bash
curl http://localhost:8000/health | jq .
```
### Collection Statistics
```python
from retrieval.dense import MilvusStore
from retrieval.embedder import Embedder
store = MilvusStore(embedder=Embedder())
stats = store.get_stats()
print(f"Documents: {stats['collection_count']}")
```
### Query Evaluation
```python
from evaluation.ragas_eval import RAGASEvaluator
from evaluation.store import EvalStore
evaluator = RAGASEvaluator(store=EvalStore(db_path="/data/storage/eval.db"))
metrics = evaluator.evaluate(query, context, answer)
print(f"Answer Relevance: {metrics['answer_relevance']:.2f}")
print(f"Faithfulness: {metrics['faithfulness']:.2f}")
print(f"Context Precision: {metrics['context_precision']:.2f}")
```
---
## 🤝 Contributing
Contributions welcome! Areas for enhancement:
- [ ] Multi-language support
- [ ] Fine-tuned domain-specific embeddings
- [ ] Advanced reranking strategies
- [ ] GraphQL API
- [ ] Persistent trace logging
- [ ] A/B testing framework
---
## 📝 License
MIT License — see LICENSE file for details
---
## 🔗 Resources
- [Milvus Documentation](https://milvus.io/docs)
- [FastAPI Guide](https://fastapi.tiangolo.com/)
- [RAGAS Evaluation Framework](https://github.com/explorerx3/ragas)
- [Groq API Reference](https://console.groq.com/docs/api-reference)
- [CRAG Paper](https://arxiv.org/abs/2401.15884)
---
**Questions?** Open an issue on GitHub or check the documentation.
source .venv/bin/activate
pip install -r requirements.txt
python -m nltk.downloader punkt
python -m spacy download en_core_web_sm
```
### 2. Configure
```bash
cp .env.example .env
# Edit .env — set GROQ_API_KEY at minimum
```
Get a free Groq API key at https://console.groq.com
### 3. Start Milvus
```bash
docker-compose up -d
# Wait ~30s for Milvus to be healthy
docker-compose ps # all three services should show "healthy"
```
### 4. Ingest documents
```bash
mkdir -p data/documents
# Copy PDFs / HTML / TXT files into data/documents/
python -m ingestion.pipeline data/documents
```
Or use the CLI:
```bash
python ingestion/pipeline.py data/documents
python ingestion/pipeline.py data/documents/paper.pdf
```
### 5. Start the API
```bash
uvicorn api.main:app --reload --port 8000
```
### 6. Start the UI
```bash
streamlit run ui/app.py
```
Open http://localhost:8501 in your browser.
---
## API endpoints
| Method | Path | Description |
|--------|------|-------------|
| GET | `/health` | Component health check |
| POST | `/ingest` | Trigger ingestion pipeline |
| POST | `/query` | Blocking query (full JSON) |
| POST | `/query/stream` | Streaming query (SSE) |
### Example — blocking query
```bash
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"query": "What is attention in transformers?", "top_k": 5}'
```
### Example — streaming query
```bash
curl -X POST http://localhost:8000/query/stream \
-H "Content-Type: application/json" \
-d '{"query": "Explain PagedAttention", "stream": true}'
```
---
## Key design decisions
### Semantic chunking
Fixed-size chunking (e.g. 1000 chars with 200 overlap) splits mid-sentence
and mid-concept. Semantic chunking detects topic boundaries using cosine
similarity between consecutive sentence embeddings, producing chunks that
align with natural concept transitions. Combined with a fallback on token
count (child_max = 256 tokens), chunks are both semantically coherent and
bounded in size.
### Parent-child hierarchy
The child chunk (≈256 tokens) is what gets embedded and indexed — small,
precise, high-relevance. When a child chunk is retrieved, its parent chunk
(≈1024 tokens, centred on the child) is what goes into the LLM context.
This separates the **retrieval granularity** from the **generation context
width**, giving you the precision of small chunks with the coherence of
large ones.
### BGE query prefix
`BAAI/bge-small-en-v1.5` is trained to expect a task-specific prefix on
query strings for retrieval tasks:
`"Represent this sentence for searching relevant passages: "`
Documents are embedded as-is. Skipping this prefix typically costs 3-5
points on retrieval benchmarks.
---
## Phase roadmap
| Phase | Status | What's added |
|-------|--------|--------------|
| 1 | ✅ Done | Dense RAG, semantic chunking, parent-child, streaming UI |
| 2 | ✅ Done | BM25 sparse, query router, RRF fusion, cross-encoder reranking |
| 3 | ✅ Done | GraphRAG (spaCy NER + NetworkX), CRAG gate, web fallback |
| 4 | ✅ Done | RAGAS eval harness, Redis cache, evaluation dashboard |