Spaces:

aditya-joshi-05
/

Cortex

Running

File size: 26,334 Bytes

---
title: Cortex RAG
sdk: docker
emoji: 🧠
colorFrom: blue
colorTo: purple
---

# Cortex RAG — Next-Gen Retrieval-Augmented Generation

<div align="center">

**Production-grade RAG system with dense retrieval, semantic chunking, knowledge graph integration, CRAG gating, and multi-provider LLM support.**

![Python](https://img.shields.io/badge/Python-3.10+-3776ab?logo=python&logoColor=white)
![FastAPI](https://img.shields.io/badge/FastAPI-0.100+-009688?logo=fastapi&logoColor=white)
![Docker](https://img.shields.io/badge/Docker-Ready-2496ED?logo=docker&logoColor=white)
![License](https://img.shields.io/badge/License-MIT-green)

</div>

---

## 🎯 Overview

**Cortex** is a production-ready Retrieval-Augmented Generation (RAG) framework that combines:

- **Dense Vector Search** — Fast, accurate document retrieval using BAAI embeddings (384-dim)
- **Semantic Chunking** — Intelligent split boundaries based on sentence-level cosine similarity
- **Parent-Child Chunks** — 256-token child chunks for precision, 1024-token parents for context
- **Multi-Strategy Retrieval** — Dense search, BM25 hybrid, knowledge graph traversal
- **CRAG Gating** — Automatic relevance assessment with fallback to web search
- **Multi-Provider LLM** — Support for Groq, OpenAI, NVIDIA NIM, and custom endpoints
- **Streaming Responses** — Real-time SSE-based answer generation with inline citations
- **Knowledge Graphs** — Automatic relation extraction and entity-based retrieval
- **Caching Layer** — Redis integration for query result caching
- **Evaluation Framework** — RAGAS-based RAG evaluation metrics

---

## 🏗️ Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                    Document Ingestion                            │
├─────────────────────────────────────────────────────────────────┤
│  PDF/HTML/TXT → DocumentLoader → SemanticChunker                 │
│                                       ↓                          │
│                   Child (~256 tokens) + Parent (~1024 tokens)    │
├─────────────────────────────────────────────────────────────────┤
│                        Embedding Layer                            │
├─────────────────────────────────────────────────────────────────┤
│  BAAI/bge-small-en-v1.5 (384-dim, L2-normalized)                │
│  → Milvus Store (IVF_FLAT, COSINE metric)                        │
│  → BM25 Index (keyword search)                                   │
│  → Knowledge Graph (entities, relations, triples)                │
├─────────────────────────────────────────────────────────────────┤
│                       Query Processing                            │
├─────────────────────────────────────────────────────────────────┤
│  Dense Search (top-15) → Reranking → CRAG Gate                   │
│         ↓                                    ↓                    │
│  High Confidence?                    Low Confidence?             │
│         ↓                                    ↓                    │
│    Use KnowledgeBase               ⚠️ Web Search (Tavily)        │
├─────────────────────────────────────────────────────────────────┤
│                    LLM Generation (Streaming)                     │
├─────────────────────────────────────────────────────────────────┤
│  Groq Llama 3.3-70B / OpenAI GPT-4o / NVIDIA NIM / Custom        │
│  Process context → Generate answer → Extract citations           │
│  Stream via SSE → Client receives real-time response             │
├─────────────────────────────────────────────────────────────────┤
│                  Frontend Interfaces                              │
├─────────────────────────────────────────────────────────────────┤
│  Streamlit UI (Ask/Ingest/System) | REST API (FastAPI)           │
└─────────────────────────────────────────────────────────────────┘
```

---

## ✨ Key Features

| Feature | Details |
|---------|---------|
| 🔍 **Dense Retrieval** | Sub-50ms semantic search via Milvus with 384-dim embeddings |
| 📚 **Smart Chunking** | Semantic splits + parent-child hierarchy for precision + context |
| 🧬 **Knowledge Graphs** | Automatic relation extraction (REBEL or LLM-based) |
| 🚨 **CRAG Gating** | Relevance assessment with web search fallback |
| 🔗 **Multi-Strategy** | Dense + BM25 keyword + graph traversal combined |
| 💾 **Redis Cache** | Query result caching with configurable TTL |
| 🌐 **Multi-Provider LLM** | Groq, OpenAI, NVIDIA NIM, Ollama, custom OpenAI-compatible |
| 📊 **Evaluation** | RAGAS metrics for answer relevance, faithfulness, context precision |
| 🎨 **Streaming UI** | Real-time responses with inline citations and source cards |
| 🐳 **Docker Ready** | Full Docker Compose setup with Milvus, Redis, API, UI |

---

## 🚀 Quick Start

### Prerequisites

- Python 3.10+
- Docker & Docker Compose (optional, for containerized setup)
- GROQ API key (default LLM provider)

### 1. Clone & Setup

```bash
# Clone repository
git clone <repo-url>
cd cortex

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### 2. Environment Configuration

Create `.env` file in project root:

```bash
# LLM Providers
GROQ_API_KEY=your_groq_api_key
GROQ_MODEL=llama-3.3-70b-versatile
GROQ_TEMPERATURE=0.1

# Optional: Other LLM providers
OPENAI_API_KEY=your_openai_key
MISTRAL_API_KEY=your_mistral_key
NVIDIA_API_KEY=your_nvidia_key

# Embedding & Storage
EMBED_MODEL_NAME=BAAI/bge-small-en-v1.5
EMBED_DEVICE=cpu  # "cuda" if GPU available

# Milvus Vector Store
MILVUS_HOST=localhost
MILVUS_PORT=19530
MILVUS_COLLECTION=cortex_chunks
MILVUS_INDEX_TYPE=IVF_FLAT

# Redis Cache (optional)
REDIS_URL=redis://localhost:6379

# Retrieval
RETRIEVAL_TOP_K=15
FINAL_TOP_K=5

# CRAG (Consistency-based Retrieval Augmented Generation)
CRAG_ENABLED=true
CRAG_RELEVANCE_THRESHOLD=0.5

# Knowledge Graph
GRAPH_ENABLED=true
GRAPH_EXTRACTOR=llm-filtered  # "rebel", "llm", "rebel-filtered", "llm-filtered"
GRAPH_MAX_HOPS=2

# API
API_HOST=0.0.0.0
API_PORT=8000
```

### 3. Start Services

**Option A: Docker Compose (Recommended)**

```bash
docker-compose up -d
# API: http://localhost:8000
# Streamlit UI: http://localhost:8501
# Milvus: http://localhost:19530
```

**Option B: Local Setup**

Make sure Milvus is running:

```bash
# Using Milvus Docker (if not using compose)
docker run -d -p 19530:19530 -p 9091:9091 milvusdb/milvus:latest

# Start API
python -m uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload

# In another terminal, start UI
streamlit run ui/app.py
```

### 4. Ingest Documents

**Via Streamlit UI:**
- Open http://localhost:8501
- Go to "📥 Ingest" tab
- Upload PDF/HTML/TXT or provide directory path

**Via REST API:**

```bash
curl -X POST "http://localhost:8000/ingest" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "directory",
    "path": "/path/to/documents"
  }'
```

### 5. Ask Questions

**Via Streamlit UI:**
- Go to "🔍 Ask" tab
- Type your question
- Watch streaming response with citations

**Via REST API:**

```bash
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is machine learning?",
    "provider": "groq",
    "top_k": 5
  }' | jq .
```

**Streaming Response:**

```bash
curl -X POST "http://localhost:8000/query/stream" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Your question here",
    "provider": "groq"
  }'
```

---

## 📡 REST API Endpoints

### Health & Status

```http
GET /health
```

Returns system health, Milvus status, collection stats.

```json
{
  "status": "healthy",
  "milvus": {
    "connected": true,
    "collection_count": 2500,
    "index_type": "IVF_FLAT"
  }
}
```

### Document Ingestion

```http
POST /ingest
Content-Type: application/json

{
  "mode": "directory|file|upload",
  "path": "/path/to/documents",
  "chunk_size": 256,
  "overlap": 32
}
```

### Query (Blocking)

```http
POST /query
Content-Type: application/json

{
  "query": "Your question",
  "provider": "groq",
  "model": "llama-3.3-70b-versatile",
  "top_k": 5,
  "crag": true,
  "graph": true
}
```

**Response:**

```json
{
  "answer": "Answer text with citations [1][2]...",
  "chunks": [
    {
      "id": "chunk_001",
      "text": "...",
      "score": 0.87,
      "source": "document_name.pdf"
    }
  ],
  "citations": [1, 2],
  "latency_ms": 1245
}
```

### Query (Streaming)

```http
POST /query/stream
Content-Type: application/json

{
  "query": "Your question",
  "provider": "groq"
}
```

**Response:** Server-Sent Events (SSE) stream

```
data: {"type": "start"}
data: {"type": "chunk", "content": "Answer "}
data: {"type": "chunk", "content": "is "}
data: {"type": "chunk", "content": "streaming..."}
data: {"type": "citations", "citations": [1, 2]}
data: {"type": "end"}
```

### Model Information

```http
GET /providers
```

Lists all available LLM providers and models.

---

## 🛠️ Configuration Guide

### Retrieval Configuration

```env
# Chunk sizes (tokens)
CHUNK_SIZE_TOKENS=256                    # Child chunk size
PARENT_CHUNK_SIZE_TOKENS=1024            # Parent chunk size
SEMANTIC_SIMILARITY_THRESHOLD=0.82       # Split boundary threshold
CHUNK_OVERLAP_TOKENS=32                  # Overlap padding

# Retrieval settings
RETRIEVAL_TOP_K=15                       # Candidates before reranking
FINAL_TOP_K=5                            # Chunks sent to LLM
```

### Embedding Configuration

```env
EMBED_MODEL_NAME=BAAI/bge-small-en-v1.5  # Model identifier
EMBED_DIM=384                             # Output dimension
EMBED_BATCH_SIZE=64                       # Batch size for processing
EMBED_DEVICE=cpu                          # cpu or cuda
```

### Milvus Configuration

```env
MILVUS_HOST=localhost
MILVUS_PORT=19530
MILVUS_COLLECTION=cortex_chunks
MILVUS_INDEX_TYPE=IVF_FLAT                # or HNSW for larger corpora
MILVUS_METRIC_TYPE=COSINE                 # Vector similarity metric
MILVUS_NLIST=128                          # clustering parameter for IVF
MILVUS_NPROBE=16                          # search parameter
```

### LLM Provider Configuration

**Groq (Default)**
```env
GROQ_API_KEY=your_key
GROQ_MODEL=llama-3.3-70b-versatile
GROQ_TEMPERATURE=0.1
GROQ_MAX_TOKENS=1024
GROQ_TIMEOUT=30
```

**OpenAI**
```env
OPENAI_API_KEY=your_key
```

**NVIDIA NIM**
```env
NVIDIA_API_KEY=your_key
```

**Custom/Ollama**
```env
CUSTOM_BASE_URL=http://localhost:11434/v1
CUSTOM_API_KEY=your_key
```

### CRAG (Consistency-based Retrieval Augmented Generation)

```env
CRAG_ENABLED=true
CRAG_RELEVANCE_THRESHOLD=0.5             # Grade boundary
TAVILY_API_KEY=your_tavily_key           # For web search fallback
```

The CRAG gate automatically assesses retrieval quality:
- **High confidence** (score ≥ threshold) → Use knowledge base
- **Low confidence** (score < threshold) → Augment with web search

### Knowledge Graph

```env
GRAPH_ENABLED=true
GRAPH_EXTRACTOR=llm-filtered             # rebel|llm|rebel-filtered|llm-filtered
GRAPH_MAX_HOPS=2                          # Traversal depth
GRAPH_PATH=/data/storage/knowledge_graph.json

# Density filtering (for "filtered" extractors)
DENSITY_TOP_FRACTION=0.30                 # Process top 30% entity-dense chunks
DENSITY_MIN_ENTITIES=2                    # Minimum entities per chunk
```

### Caching

```env
REDIS_URL=redis://localhost:6379
CACHE_TTL_SECONDS=3600                    # 1 hour
```

### Evaluation

```env
EVAL_DB_PATH=/data/storage/eval.db
```

---

## 📁 Project Structure

```
cortex/
├── api/                          # FastAPI REST endpoints
│   ├── main.py                   # App initialization, endpoints
│   └── schemas.py                # Request/response Pydantic models
│
├── ingestion/                    # Document processing pipeline
│   ├── pipeline.py               # Orchestration
│   ├── document_loader.py        # PDF/HTML/TXT parsing
│   ├── chunker.py                # Semantic chunking
│   └── __init__.py
│
├── retrieval/                    # Multi-strategy retrieval
│   ├── orchestrator.py           # Coordinate retrieval strategies
│   ├── dense.py                  # Milvus vector search
│   ├── bm25.py                   # Keyword search index
│   ├── embedder.py               # HuggingFace embedding model
│   ├── router.py                 # Query routing logic
│   ├── fusion.py                 # Result fusion & reranking
│   ├── graph_builder.py          # Build knowledge graphs
│   ├── graph_retriever.py        # Entity-based retrieval
│   ├── relation_extractors.py    # REBEL + LLM extractors
│   ├── cache.py                  # Redis caching wrapper
│   └── __init__.py
│
├── generation/                   # LLM generation & CRAG
│   ├── generator.py              # Multi-provider LLM wrapper
│   ├── crag.py                   # CRAG gate logic
│   └── __init__.py
│
├── evaluation/                   # RAG evaluation metrics
│   ├── ragas_eval.py             # RAGAS evaluator
│   ├── store.py                  # Evaluation database
│   └── __init__.py
│
├── ui/                           # Streamlit frontend
│   ├── app.py                    # Main UI
│   └── static/                   # (Optional) HTML/CSS/JS
│
├── data/                         # Data storage
│   ├── documents/                # Input documents
│   ├── storage/                  # Persistent storage
│   │   ├── knowledge_graph.json
│   │   ├── bm25_index.pkl
│   │   └── uploads/
│   └── synthetic_knowledge_items.txt
│
├── config.py                     # Configuration & settings
├── requirements.txt              # Python dependencies
├── Dockerfile                    # Docker image build
├── docker-compose.yml            # Multi-container orchestration
├── test.py                       # Test suite
└── README.md                     # This file
```

---

## 🐳 Docker & Deployment

### Docker Compose Quick Deploy

```bash
# Start all services
docker-compose up -d

# View logs
docker-compose logs -f api

# Stop services
docker-compose down
```

**Services:**
- `milvus` — Vector database (port 19530)
- `redis` — Caching layer (port 6379)
- `api` — FastAPI backend (port 8000)
- `ui` — Streamlit frontend (port 8501)

### Environment Variables in Compose

Edit `docker-compose.yml` to customize:

```yaml
services:
  api:
    environment:
      - GROQ_API_KEY=${GROQ_API_KEY}
      - GROQ_MODEL=llama-3.3-70b-versatile
      - MILVUS_HOST=milvus
      - REDIS_URL=redis://redis:6379
      - GRAPH_EXTRACTOR=llm-filtered
```

### Production Deployment

For production, consider:

1. **Use HNSW index** instead of IVF_FLAT for better recall:
   ```env
   MILVUS_INDEX_TYPE=HNSW
   ```

2. **Enable caching** for frequently asked questions:
   ```env
   REDIS_URL=redis://redis-prod:6379
   ```

3. **Use stronger embedding model** for higher quality:
   ```env
   EMBED_MODEL_NAME=BAAI/bge-base-en-v1.5  # 768-dim, better quality
   ```

4. **Configure CRAG** for reliability:
   ```env
   CRAG_ENABLED=true
   CRAG_RELEVANCE_THRESHOLD=0.6
   TAVILY_API_KEY=your_key
   ```

---

## 🔄 Workflow Examples

### Example 1: Legal Document Q&A

```bash
# 1. Ingest legal documents
curl -X POST "http://localhost:8000/ingest" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "directory",
    "path": "/data/legal_documents"
  }'

# 2. Query with graph enabled for relation extraction
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the penalties for breach of contract?",
    "provider": "groq",
    "graph": true,
    "crag": true
  }'
```

### Example 2: Research Paper Analysis

```bash
# Ingest PDF papers
python -c "
from ingestion.pipeline import IngestionPipeline
from retrieval.embedder import Embedder
from retrieval.dense import MilvusStore

embedder = Embedder()
store = MilvusStore(embedder=embedder)
pipeline = IngestionPipeline(embedder=embedder, store=store, bm25=None)

pipeline.ingest('/data/papers', mode='pdf')
"

# Query for specific findings
curl -X POST "http://localhost:8000/query/stream" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the key findings about transformer performance?",
    "model": "gpt-4o"
  }'
```

### Example 3: Customer Support Bot

```bash
# 1. Ingest FAQ and documentation
# 2. Set up CRAG with relevant threshold
# 3. Route low-confidence queries to web search

CRAG_RELEVANCE_THRESHOLD=0.6
TAVILY_API_KEY=your_key
```

---

## 📊 Advanced Features

### Knowledge Graph Extraction

Three modes available:

| Mode | Backend | Speed | Quality | Cost |
|------|---------|-------|---------|------|
| `rebel` | Local REBEL model | Fast | Good | Free |
| `llm` | LLM (Groq/OpenAI) | Slower | Excellent | $$ |
| `rebel-filtered` | REBEL + entity filtering | Fast | Good | Free |
| `llm-filtered` | LLM + entity filtering | Slower | Excellent | $$ |

Switch via config:
```env
GRAPH_EXTRACTOR=llm-filtered
```

### CRAG (Consistency-based RAG)

Automatically:
1. Evaluates retrieval confidence
2. Assigns relevance grade (Correct/Partially-Correct/Missing)
3. Supplements low-confidence with web search via Tavily

```python
from generation.crag import CRAGGate

crag = CRAGGate()
response = crag.evaluate(query, context, answer)
# Returns: grade, supplemental_docs
```

### Evaluation & Metrics

RAGAS-based evaluation:

```python
from evaluation.ragas_eval import RAGASEvaluator
from evaluation.store import EvalStore

evaluator = RAGASEvaluator(store=EvalStore())
metrics = evaluator.evaluate(query, context, answer)
# Returns: answer_relevance, faithfulness, context_precision
```

### Caching Strategy

```python
from retrieval.cache import CachedRetriever

retriever = CachedRetriever(base_retriever)
# First call: 1000ms (database query)
# Second call: 5ms (Redis cache hit, TTL: 1 hour)
results = retriever.retrieve("machine learning basics")
```

---

## ⚙️ Performance Tuning

### For Speed

```env
# Smaller embedding model
EMBED_MODEL_NAME=BAAI/bge-small-en-v1.5

# Smaller chunks
CHUNK_SIZE_TOKENS=128
PARENT_CHUNK_SIZE_TOKENS=512

# Faster index
MILVUS_INDEX_TYPE=IVF_FLAT
MILVUS_NPROBE=8  # Lower = faster

# Enable cache
REDIS_URL=redis://localhost:6379

# Fewer LLM tokens
GROQ_MAX_TOKENS=512
```

### For Quality

```env
# Larger embedding model
EMBED_MODEL_NAME=BAAI/bge-base-en-v1.5

# Optimal chunks
CHUNK_SIZE_TOKENS=512
PARENT_CHUNK_SIZE_TOKENS=2048

# More precise index
MILVUS_INDEX_TYPE=HNSW
MILVUS_NPROBE=32

# Better LLM
GROQ_MODEL=llama-3.3-70b-versatile

# Enable CRAG
CRAG_ENABLED=true
```

---

## 🐛 Troubleshooting

### Milvus Connection Failed

```bash
# Check if Milvus is running
curl http://localhost:19530/healthz

# Restart Milvus
docker-compose restart milvus

# Verify in settings
python -c "from config import get_settings; print(get_settings().milvus_host)"
```

### Low Retrieval Quality

1. **Check chunk quality:**
   ```python
   from ingestion.chunker import SemanticChunker
   chunker = SemanticChunker()
   chunks = chunker.chunk("your document text")
   print([c.text for c in chunks[:3]])
   ```

2. **Verify embeddings:**
   ```python
   from retrieval.embedder import Embedder
   embedder = Embedder()
   emb = embedder.embed("test query")
   print(f"Embedding dim: {len(emb)}, sample: {emb[:5]}")
   ```

3. **Enable CRAG** for automatic augmentation:
   ```env
   CRAG_ENABLED=true
   ```

### Slow Response Times

1. Check cache hit rate
2. Reduce `MILVUS_NPROBE`
3. Use streaming endpoint (`/query/stream`)
4. Enable Redis caching

### Out of Memory

```env
# Reduce batch sizes
EMBED_BATCH_SIZE=16

# Reduce chunk sizes
CHUNK_SIZE_TOKENS=128

# Switch to CPU if using GPU
EMBED_DEVICE=cpu
```

---

## 📈 Monitoring & Evaluation

### Health Check

```bash
curl http://localhost:8000/health | jq .
```

### Collection Statistics

```python
from retrieval.dense import MilvusStore
from retrieval.embedder import Embedder

store = MilvusStore(embedder=Embedder())
stats = store.get_stats()
print(f"Documents: {stats['collection_count']}")
```

### Query Evaluation

```python
from evaluation.ragas_eval import RAGASEvaluator
from evaluation.store import EvalStore

evaluator = RAGASEvaluator(store=EvalStore(db_path="/data/storage/eval.db"))
metrics = evaluator.evaluate(query, context, answer)
print(f"Answer Relevance: {metrics['answer_relevance']:.2f}")
print(f"Faithfulness: {metrics['faithfulness']:.2f}")
print(f"Context Precision: {metrics['context_precision']:.2f}")
```

---

## 🤝 Contributing

Contributions welcome! Areas for enhancement:

- [ ] Multi-language support
- [ ] Fine-tuned domain-specific embeddings
- [ ] Advanced reranking strategies
- [ ] GraphQL API
- [ ] Persistent trace logging
- [ ] A/B testing framework

---

## 📝 License

MIT License — see LICENSE file for details

---

## 🔗 Resources

- [Milvus Documentation](https://milvus.io/docs)
- [FastAPI Guide](https://fastapi.tiangolo.com/)
- [RAGAS Evaluation Framework](https://github.com/explorerx3/ragas)
- [Groq API Reference](https://console.groq.com/docs/api-reference)
- [CRAG Paper](https://arxiv.org/abs/2401.15884)

---

**Questions?** Open an issue on GitHub or check the documentation.
source .venv/bin/activate
pip install -r requirements.txt
python -m nltk.downloader punkt
python -m spacy download en_core_web_sm
```

### 2. Configure

```bash
cp .env.example .env
# Edit .env — set GROQ_API_KEY at minimum
```

Get a free Groq API key at https://console.groq.com

### 3. Start Milvus

```bash
docker-compose up -d
# Wait ~30s for Milvus to be healthy
docker-compose ps   # all three services should show "healthy"
```

### 4. Ingest documents

```bash
mkdir -p data/documents
# Copy PDFs / HTML / TXT files into data/documents/

python -m ingestion.pipeline data/documents
```

Or use the CLI:
```bash
python ingestion/pipeline.py data/documents
python ingestion/pipeline.py data/documents/paper.pdf
```

### 5. Start the API

```bash
uvicorn api.main:app --reload --port 8000
```

### 6. Start the UI

```bash
streamlit run ui/app.py
```

Open http://localhost:8501 in your browser.

---

## API endpoints

| Method | Path | Description |
|--------|------|-------------|
| GET  | `/health` | Component health check |
| POST | `/ingest` | Trigger ingestion pipeline |
| POST | `/query` | Blocking query (full JSON) |
| POST | `/query/stream` | Streaming query (SSE) |

### Example — blocking query

```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is attention in transformers?", "top_k": 5}'
```

### Example — streaming query

```bash
curl -X POST http://localhost:8000/query/stream \
  -H "Content-Type: application/json" \
  -d '{"query": "Explain PagedAttention", "stream": true}'
```

---

## Key design decisions

### Semantic chunking
Fixed-size chunking (e.g. 1000 chars with 200 overlap) splits mid-sentence
and mid-concept. Semantic chunking detects topic boundaries using cosine
similarity between consecutive sentence embeddings, producing chunks that
align with natural concept transitions. Combined with a fallback on token
count (child_max = 256 tokens), chunks are both semantically coherent and
bounded in size.

### Parent-child hierarchy
The child chunk (≈256 tokens) is what gets embedded and indexed — small,
precise, high-relevance. When a child chunk is retrieved, its parent chunk
(≈1024 tokens, centred on the child) is what goes into the LLM context.
This separates the **retrieval granularity** from the **generation context
width**, giving you the precision of small chunks with the coherence of
large ones.

### BGE query prefix
`BAAI/bge-small-en-v1.5` is trained to expect a task-specific prefix on
query strings for retrieval tasks:
`"Represent this sentence for searching relevant passages: <query>"`
Documents are embedded as-is. Skipping this prefix typically costs 3-5
points on retrieval benchmarks.

---

## Phase roadmap

| Phase | Status | What's added |
|-------|--------|--------------|
| 1 | ✅ Done | Dense RAG, semantic chunking, parent-child, streaming UI |
| 2 | ✅ Done | BM25 sparse, query router, RRF fusion, cross-encoder reranking |
| 3 | ✅ Done | GraphRAG (spaCy NER + NetworkX), CRAG gate, web fallback |
| 4 | ✅ Done | RAGAS eval harness, Redis cache, evaluation dashboard |