Aditya Joshi
metadata
title: Cortex RAG
sdk: docker
emoji: 🧠
colorFrom: blue
colorTo: purple

Cortex RAG: Next-Gen Retrieval-Augmented Generation

Production-grade RAG system with dense retrieval, semantic chunking, knowledge graph integration, CRAG gating, and multi-provider LLM support.



🎯 Overview

Cortex is a production-ready Retrieval-Augmented Generation (RAG) framework that combines:

  • Dense Vector Search: Fast, accurate document retrieval using BAAI embeddings (384-dim)
  • Semantic Chunking: Intelligent split boundaries based on sentence-level cosine similarity
  • Parent-Child Chunks: 256-token child chunks for precision, 1024-token parents for context
  • Multi-Strategy Retrieval: Dense search, BM25 hybrid, knowledge graph traversal
  • CRAG Gating: Automatic relevance assessment with fallback to web search
  • Multi-Provider LLM: Support for Groq, OpenAI, NVIDIA NIM, and custom endpoints
  • Streaming Responses: Real-time SSE-based answer generation with inline citations
  • Knowledge Graphs: Automatic relation extraction and entity-based retrieval
  • Caching Layer: Redis integration for query result caching
  • Evaluation Framework: RAGAS-based RAG evaluation metrics

🏗️ Architecture

┌──────────────────────────────────────────────────────────────────┐
│                    Document Ingestion                            │
├──────────────────────────────────────────────────────────────────┤
│  PDF/HTML/TXT → DocumentLoader → SemanticChunker                 │
│                                       ↓                          │
│                   Child (~256 tokens) + Parent (~1024 tokens)    │
├──────────────────────────────────────────────────────────────────┤
│                        Embedding Layer                           │
├──────────────────────────────────────────────────────────────────┤
│  BAAI/bge-small-en-v1.5 (384-dim, L2-normalized)                 │
│  → Milvus Store (IVF_FLAT, COSINE metric)                        │
│  → BM25 Index (keyword search)                                   │
│  → Knowledge Graph (entities, relations, triples)                │
├──────────────────────────────────────────────────────────────────┤
│                       Query Processing                           │
├──────────────────────────────────────────────────────────────────┤
│  Dense Search (top-15) → Reranking → CRAG Gate                   │
│         ↓                                    ↓                   │
│  High Confidence?                    Low Confidence?             │
│         ↓                                    ↓                   │
│    Use KnowledgeBase               ⚠️ Web Search (Tavily)        │
├──────────────────────────────────────────────────────────────────┤
│                    LLM Generation (Streaming)                    │
├──────────────────────────────────────────────────────────────────┤
│  Groq Llama 3.3-70B / OpenAI GPT-4o / NVIDIA NIM / Custom        │
│  Process context → Generate answer → Extract citations           │
│  Stream via SSE → Client receives real-time response             │
├──────────────────────────────────────────────────────────────────┤
│                  Frontend Interfaces                             │
├──────────────────────────────────────────────────────────────────┤
│  Streamlit UI (Ask/Ingest/System) | REST API (FastAPI)           │
└──────────────────────────────────────────────────────────────────┘

✨ Key Features

| Feature | Details |
| --- | --- |
| 🔍 Dense Retrieval | Sub-50ms semantic search via Milvus with 384-dim embeddings |
| 📚 Smart Chunking | Semantic splits + parent-child hierarchy for precision + context |
| 🧬 Knowledge Graphs | Automatic relation extraction (REBEL or LLM-based) |
| 🚨 CRAG Gating | Relevance assessment with web search fallback |
| 🔗 Multi-Strategy | Dense + BM25 keyword + graph traversal combined |
| 💾 Redis Cache | Query result caching with configurable TTL |
| 🌐 Multi-Provider LLM | Groq, OpenAI, NVIDIA NIM, Ollama, custom OpenAI-compatible |
| 📊 Evaluation | RAGAS metrics for answer relevance, faithfulness, context precision |
| 🎨 Streaming UI | Real-time responses with inline citations and source cards |
| 🐳 Docker Ready | Full Docker Compose setup with Milvus, Redis, API, UI |

🚀 Quick Start

Prerequisites

  • Python 3.10+
  • Docker & Docker Compose (optional, for containerized setup)
  • Groq API key (Groq is the default LLM provider)

1. Clone & Setup

# Clone repository
git clone <repo-url>
cd cortex

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Environment Configuration

Create a .env file in the project root:

# LLM Providers
GROQ_API_KEY=your_groq_api_key
GROQ_MODEL=llama-3.3-70b-versatile
GROQ_TEMPERATURE=0.1

# Optional: Other LLM providers
OPENAI_API_KEY=your_openai_key
MISTRAL_API_KEY=your_mistral_key
NVIDIA_API_KEY=your_nvidia_key

# Embedding & Storage
EMBED_MODEL_NAME=BAAI/bge-small-en-v1.5
EMBED_DEVICE=cpu  # "cuda" if GPU available

# Milvus Vector Store
MILVUS_HOST=localhost
MILVUS_PORT=19530
MILVUS_COLLECTION=cortex_chunks
MILVUS_INDEX_TYPE=IVF_FLAT

# Redis Cache (optional)
REDIS_URL=redis://localhost:6379

# Retrieval
RETRIEVAL_TOP_K=15
FINAL_TOP_K=5

# CRAG (Corrective Retrieval Augmented Generation)
CRAG_ENABLED=true
CRAG_RELEVANCE_THRESHOLD=0.5

# Knowledge Graph
GRAPH_ENABLED=true
GRAPH_EXTRACTOR=llm-filtered  # "rebel", "llm", "rebel-filtered", "llm-filtered"
GRAPH_MAX_HOPS=2

# API
API_HOST=0.0.0.0
API_PORT=8000

3. Start Services

Option A: Docker Compose (Recommended)

docker-compose up -d
# API: http://localhost:8000
# Streamlit UI: http://localhost:8501
# Milvus: localhost:19530 (gRPC endpoint, not a browsable URL)

Option B: Local Setup

Make sure Milvus is running:

# Using Milvus Docker (if not using compose)
docker run -d -p 19530:19530 -p 9091:9091 milvusdb/milvus:latest

# Start API
python -m uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload

# In another terminal, start UI
streamlit run ui/app.py

4. Ingest Documents

Via Streamlit UI:

Via REST API:

curl -X POST "http://localhost:8000/ingest" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "directory",
    "path": "/path/to/documents"
  }'

5. Ask Questions

Via Streamlit UI:

  • Go to the "🔍 Ask" tab
  • Type your question
  • Watch streaming response with citations

Via REST API:

curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is machine learning?",
    "provider": "groq",
    "top_k": 5
  }' | jq .

Streaming Response:

curl -X POST "http://localhost:8000/query/stream" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Your question here",
    "provider": "groq"
  }'

📡 REST API Endpoints

Health & Status

GET /health

Returns system health, Milvus status, collection stats.

{
  "status": "healthy",
  "milvus": {
    "connected": true,
    "collection_count": 2500,
    "index_type": "IVF_FLAT"
  }
}

Document Ingestion

POST /ingest
Content-Type: application/json

{
  "mode": "directory|file|upload",
  "path": "/path/to/documents",
  "chunk_size": 256,
  "overlap": 32
}

Query (Blocking)

POST /query
Content-Type: application/json

{
  "query": "Your question",
  "provider": "groq",
  "model": "llama-3.3-70b-versatile",
  "top_k": 5,
  "crag": true,
  "graph": true
}

Response:

{
  "answer": "Answer text with citations [1][2]...",
  "chunks": [
    {
      "id": "chunk_001",
      "text": "...",
      "score": 0.87,
      "source": "document_name.pdf"
    }
  ],
  "citations": [1, 2],
  "latency_ms": 1245
}
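
As a sketch of how a client might use this payload, the helper below resolves citation numbers to source file names. It assumes that citation `n` refers to the n-th entry of `chunks` (1-based), which the response shape suggests but does not state; `cited_sources` is a hypothetical name, not part of the Cortex codebase.

```python
def cited_sources(response: dict) -> list[str]:
    """Return the distinct source names referenced by the citation list."""
    chunks = response.get("chunks", [])
    sources: list[str] = []
    for n in response.get("citations", []):
        # Assumption: citations are 1-based positions into `chunks`.
        if 1 <= n <= len(chunks):
            source = chunks[n - 1]["source"]
            if source not in sources:
                sources.append(source)
    return sources


example = {
    "answer": "Answer text with citations [1][2]...",
    "chunks": [
        {"id": "chunk_001", "text": "...", "score": 0.87, "source": "a.pdf"},
        {"id": "chunk_002", "text": "...", "score": 0.81, "source": "b.pdf"},
    ],
    "citations": [1, 2],
}
print(cited_sources(example))
```

Out-of-range citation numbers are skipped rather than raising, so a malformed answer does not break the client.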

Query (Streaming)

POST /query/stream
Content-Type: application/json

{
  "query": "Your question",
  "provider": "groq"
}

Response: Server-Sent Events (SSE) stream

data: {"type": "start"}
data: {"type": "chunk", "content": "Answer "}
data: {"type": "chunk", "content": "is "}
data: {"type": "chunk", "content": "streaming..."}
data: {"type": "citations", "citations": [1, 2]}
data: {"type": "end"}
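
A minimal client-side sketch for consuming this stream, assuming each event arrives as a single `data: <json>` line with the event types shown above. `collect_sse_answer` is a hypothetical helper, not part of the Cortex codebase; it accepts any iterable of text lines, such as the line iterator of an HTTP client.

```python
import json
from typing import Iterable


def collect_sse_answer(lines: Iterable[str]) -> tuple[str, list[int]]:
    """Assemble the answer text and citation list from SSE `data:` lines."""
    parts: list[str] = []
    citations: list[int] = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        event = json.loads(line[len("data:"):].strip())
        kind = event.get("type")
        if kind == "chunk":
            parts.append(event.get("content", ""))
        elif kind == "citations":
            citations = list(event.get("citations", []))
        elif kind == "end":
            break
    return "".join(parts), citations


sample = [
    'data: {"type": "start"}',
    'data: {"type": "chunk", "content": "Answer "}',
    'data: {"type": "chunk", "content": "is "}',
    'data: {"type": "chunk", "content": "streaming..."}',
    'data: {"type": "citations", "citations": [1, 2]}',
    'data: {"type": "end"}',
]
answer, cites = collect_sse_answer(sample)
print(answer, cites)
```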

Model Information

GET /providers

Lists all available LLM providers and models.


🛠️ Configuration Guide

Retrieval Configuration

# Chunk sizes (tokens)
CHUNK_SIZE_TOKENS=256                    # Child chunk size
PARENT_CHUNK_SIZE_TOKENS=1024            # Parent chunk size
SEMANTIC_SIMILARITY_THRESHOLD=0.82       # Split boundary threshold
CHUNK_OVERLAP_TOKENS=32                  # Overlap padding

# Retrieval settings
RETRIEVAL_TOP_K=15                       # Candidates before reranking
FINAL_TOP_K=5                            # Chunks sent to LLM

Embedding Configuration

EMBED_MODEL_NAME=BAAI/bge-small-en-v1.5  # Model identifier
EMBED_DIM=384                             # Output dimension
EMBED_BATCH_SIZE=64                       # Batch size for processing
EMBED_DEVICE=cpu                          # cpu or cuda

Milvus Configuration

MILVUS_HOST=localhost
MILVUS_PORT=19530
MILVUS_COLLECTION=cortex_chunks
MILVUS_INDEX_TYPE=IVF_FLAT                # or HNSW for larger corpora
MILVUS_METRIC_TYPE=COSINE                 # Vector similarity metric
MILVUS_NLIST=128                          # clustering parameter for IVF
MILVUS_NPROBE=16                          # search parameter
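
For orientation, the sketch below shows how these variables could map onto the index and search parameter dictionaries that pymilvus accepts. The function names are hypothetical and the HNSW defaults (`M`, `efConstruction`, `ef`) are illustrative assumptions; the actual mapping in `retrieval/dense.py` may differ.

```python
import os


def milvus_index_params(env=os.environ) -> dict:
    """Build index parameters in the dict shape used when creating a Milvus index."""
    index_type = env.get("MILVUS_INDEX_TYPE", "IVF_FLAT")
    metric = env.get("MILVUS_METRIC_TYPE", "COSINE")
    if index_type.startswith("IVF"):
        params = {"nlist": int(env.get("MILVUS_NLIST", "128"))}
    else:
        # Assumed HNSW build parameters; not taken from the project config.
        params = {"M": 16, "efConstruction": 200}
    return {"index_type": index_type, "metric_type": metric, "params": params}


def milvus_search_params(env=os.environ) -> dict:
    """Build search parameters; nprobe only applies to IVF-family indexes."""
    index_type = env.get("MILVUS_INDEX_TYPE", "IVF_FLAT")
    metric = env.get("MILVUS_METRIC_TYPE", "COSINE")
    if index_type.startswith("IVF"):
        params = {"nprobe": int(env.get("MILVUS_NPROBE", "16"))}
    else:
        params = {"ef": 64}  # assumed HNSW search width
    return {"metric_type": metric, "params": params}


print(milvus_index_params({"MILVUS_NLIST": "128"}))
print(milvus_search_params({"MILVUS_NPROBE": "16"}))
```

A higher `nprobe` scans more of the `nlist` clusters per query, trading latency for recall.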

LLM Provider Configuration

Groq (Default)

GROQ_API_KEY=your_key
GROQ_MODEL=llama-3.3-70b-versatile
GROQ_TEMPERATURE=0.1
GROQ_MAX_TOKENS=1024
GROQ_TIMEOUT=30

OpenAI

OPENAI_API_KEY=your_key

NVIDIA NIM

NVIDIA_API_KEY=your_key

Custom/Ollama

CUSTOM_BASE_URL=http://localhost:11434/v1
CUSTOM_API_KEY=your_key

CRAG (Corrective Retrieval Augmented Generation)

CRAG_ENABLED=true
CRAG_RELEVANCE_THRESHOLD=0.5             # Grade boundary
TAVILY_API_KEY=your_tavily_key           # For web search fallback

The CRAG gate automatically assesses retrieval quality:

  • High confidence (score ≥ threshold) → Use knowledge base
  • Low confidence (score < threshold) → Augment with web search
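
The threshold rule above can be sketched as follows. This is a simplified illustration, not the logic in `generation/crag.py`: it assumes the confidence score is the best relevance score among the retrieved chunks, and `GateDecision` and `crag_gate` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class GateDecision:
    use_web_search: bool
    top_score: float


def crag_gate(scores: list[float], threshold: float = 0.5) -> GateDecision:
    """Route to web search when the best relevance score is below the threshold."""
    # Assumption: an empty result set counts as low confidence.
    top = max(scores, default=0.0)
    return GateDecision(use_web_search=top < threshold, top_score=top)


print(crag_gate([0.87, 0.41]))
print(crag_gate([0.32, 0.18]))
```

Raising `CRAG_RELEVANCE_THRESHOLD` makes the gate stricter, so more queries fall through to the web search fallback.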

Knowledge Graph

GRAPH_ENABLED=true
GRAPH_EXTRACTOR=llm-filtered             # rebel|llm|rebel-filtered|llm-filtered
GRAPH_MAX_HOPS=2                          # Traversal depth
GRAPH_PATH=/data/storage/knowledge_graph.json

# Density filtering (for "filtered" extractors)
DENSITY_TOP_FRACTION=0.30                 # Process top 30% entity-dense chunks
DENSITY_MIN_ENTITIES=2                    # Minimum entities per chunk
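
One way to read the two density settings is sketched below. It assumes density is measured as the entity count per chunk and that the fraction applies to the whole corpus; the project may normalize differently. `select_dense_chunks` is a hypothetical name.

```python
import math


def select_dense_chunks(entity_counts: dict[str, int],
                        top_fraction: float = 0.30,
                        min_entities: int = 2) -> list[str]:
    """Return chunk ids to pass to relation extraction, densest first."""
    # Drop chunks below the minimum entity count, then rank by count.
    eligible = [(cid, n) for cid, n in entity_counts.items() if n >= min_entities]
    eligible.sort(key=lambda item: item[1], reverse=True)
    # Budget: at most the top fraction of all chunks in the corpus.
    keep = math.ceil(len(entity_counts) * top_fraction)
    return [cid for cid, _ in eligible[:keep]]


counts = {"c1": 5, "c2": 1, "c3": 3, "c4": 2}
print(select_dense_chunks(counts, top_fraction=0.5))
```

Filtering this way limits the number of extractor calls, which matters most for the LLM-backed extractors.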

Caching

REDIS_URL=redis://localhost:6379
CACHE_TTL_SECONDS=3600                    # 1 hour

Evaluation

EVAL_DB_PATH=/data/storage/eval.db

📁 Project Structure

cortex/
├── api/                          # FastAPI REST endpoints
│   ├── main.py                   # App initialization, endpoints
│   └── schemas.py                # Request/response Pydantic models
│
├── ingestion/                    # Document processing pipeline
│   ├── pipeline.py               # Orchestration
│   ├── document_loader.py        # PDF/HTML/TXT parsing
│   ├── chunker.py                # Semantic chunking
│   └── __init__.py
│
├── retrieval/                    # Multi-strategy retrieval
│   ├── orchestrator.py           # Coordinate retrieval strategies
│   ├── dense.py                  # Milvus vector search
│   ├── bm25.py                   # Keyword search index
│   ├── embedder.py               # HuggingFace embedding model
│   ├── router.py                 # Query routing logic
│   ├── fusion.py                 # Result fusion & reranking
│   ├── graph_builder.py          # Build knowledge graphs
│   ├── graph_retriever.py        # Entity-based retrieval
│   ├── relation_extractors.py    # REBEL + LLM extractors
│   ├── cache.py                  # Redis caching wrapper
│   └── __init__.py
│
├── generation/                   # LLM generation & CRAG
│   ├── generator.py              # Multi-provider LLM wrapper
│   ├── crag.py                   # CRAG gate logic
│   └── __init__.py
│
├── evaluation/                   # RAG evaluation metrics
│   ├── ragas_eval.py             # RAGAS evaluator
│   ├── store.py                  # Evaluation database
│   └── __init__.py
│
├── ui/                           # Streamlit frontend
│   ├── app.py                    # Main UI
│   └── static/                   # (Optional) HTML/CSS/JS
│
├── data/                         # Data storage
│   ├── documents/                # Input documents
│   ├── storage/                  # Persistent storage
│   │   ├── knowledge_graph.json
│   │   ├── bm25_index.pkl
│   │   └── uploads/
│   └── synthetic_knowledge_items.txt
│
├── config.py                     # Configuration & settings
├── requirements.txt              # Python dependencies
├── Dockerfile                    # Docker image build
├── docker-compose.yml            # Multi-container orchestration
├── test.py                       # Test suite
└── README.md                     # This file

🐳 Docker & Deployment

Docker Compose Quick Deploy

# Start all services
docker-compose up -d

# View logs
docker-compose logs -f api

# Stop services
docker-compose down

Services:

  • milvus: Vector database (port 19530)
  • redis: Caching layer (port 6379)
  • api: FastAPI backend (port 8000)
  • ui: Streamlit frontend (port 8501)

Environment Variables in Compose

Edit docker-compose.yml to customize:

services:
  api:
    environment:
      - GROQ_API_KEY=${GROQ_API_KEY}
      - GROQ_MODEL=llama-3.3-70b-versatile
      - MILVUS_HOST=milvus
      - REDIS_URL=redis://redis:6379
      - GRAPH_EXTRACTOR=llm-filtered

Production Deployment

For production, consider:

  1. Use an HNSW index instead of IVF_FLAT for better recall:

    MILVUS_INDEX_TYPE=HNSW
    
  2. Enable caching for frequently asked questions:

    REDIS_URL=redis://redis-prod:6379
    
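  3. Use a stronger embedding model for higher quality. The output dimension changes from 384 to 768, so update EMBED_DIM to match and re-ingest into a fresh collection:

    EMBED_MODEL_NAME=BAAI/bge-base-en-v1.5  # 768-dim, better quality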
    
  4. Configure CRAG for reliability:

    CRAG_ENABLED=true
    CRAG_RELEVANCE_THRESHOLD=0.6
    TAVILY_API_KEY=your_key
    

🔄 Workflow Examples

Example 1: Legal Document Q&A

# 1. Ingest legal documents
curl -X POST "http://localhost:8000/ingest" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "directory",
    "path": "/data/legal_documents"
  }'

# 2. Query with graph retrieval and the CRAG gate enabled
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the penalties for breach of contract?",
    "provider": "groq",
    "graph": true,
    "crag": true
  }'

Example 2: Research Paper Analysis

# Ingest PDF papers
python -c "
from ingestion.pipeline import IngestionPipeline
from retrieval.embedder import Embedder
from retrieval.dense import MilvusStore

embedder = Embedder()
store = MilvusStore(embedder=embedder)
pipeline = IngestionPipeline(embedder=embedder, store=store, bm25=None)

pipeline.ingest('/data/papers', mode='pdf')
"

# Query for specific findings
curl -X POST "http://localhost:8000/query/stream" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the key findings about transformer performance?",
    "model": "gpt-4o"
  }'

Example 3: Customer Support Bot

# 1. Ingest FAQ and documentation
# 2. Set up CRAG with a suitable relevance threshold
# 3. Route low-confidence queries to web search

CRAG_RELEVANCE_THRESHOLD=0.6
TAVILY_API_KEY=your_key

📊 Advanced Features

Knowledge Graph Extraction

Four modes available:

| Mode | Backend | Speed | Quality | Cost |
| --- | --- | --- | --- | --- |
| rebel | Local REBEL model | Fast | Good | Free |
| llm | LLM (Groq/OpenAI) | Slower | Excellent | $$ |
| rebel-filtered | REBEL + entity filtering | Fast | Good | Free |
| llm-filtered | LLM + entity filtering | Slower | Excellent | $$ |

Switch via config:

GRAPH_EXTRACTOR=llm-filtered

CRAG (Corrective RAG)

Automatically:

  1. Evaluates retrieval confidence
  2. Assigns relevance grade (Correct/Partially-Correct/Missing)
  3. Supplements low-confidence with web search via Tavily

from generation.crag import CRAGGate

crag = CRAGGate()
response = crag.evaluate(query, context, answer)
# Returns: grade, supplemental_docs

Evaluation & Metrics

RAGAS-based evaluation:

from evaluation.ragas_eval import RAGASEvaluator
from evaluation.store import EvalStore

evaluator = RAGASEvaluator(store=EvalStore())
metrics = evaluator.evaluate(query, context, answer)
# Returns: answer_relevance, faithfulness, context_precision

Caching Strategy

from retrieval.cache import CachedRetriever

retriever = CachedRetriever(base_retriever)
# First call: 1000ms (database query)
# Second call: 5ms (Redis cache hit, TTL: 1 hour)
results = retriever.retrieve("machine learning basics")
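
The caching behaviour can be illustrated without Redis. The sketch below uses an in-memory dictionary as a stand-in and derives the cache key from a hash of the normalized query plus its parameters. `TTLQueryCache`, the key prefix, and the lowercase/strip normalization are assumptions for illustration, not the implementation in `retrieval/cache.py`.

```python
import hashlib
import json
import time


class TTLQueryCache:
    """In-memory stand-in for a Redis query cache with a fixed TTL."""

    def __init__(self, ttl_seconds: float = 3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: dict[str, tuple[float, object]] = {}

    @staticmethod
    def make_key(query: str, **params) -> str:
        # Normalize the query so trivially different spellings share an entry.
        payload = json.dumps(
            {"query": query.strip().lower(), "params": params}, sort_keys=True
        )
        return "cortex:query:" + hashlib.sha256(payload.encode()).hexdigest()

    def get_or_compute(self, query: str, compute, **params):
        key = self.make_key(query, **params)
        hit = self._store.get(key)
        if hit is not None and self.clock() - hit[0] < self.ttl:
            return hit[1]  # fresh cache hit
        value = compute(query)  # miss or expired: recompute and store
        self._store[key] = (self.clock(), value)
        return value


cache = TTLQueryCache(ttl_seconds=3600)
first = cache.get_or_compute("machine learning basics", lambda q: ["chunk_001"], top_k=5)
second = cache.get_or_compute("Machine learning basics ", lambda q: ["other"], top_k=5)
print(first, second)
```

Including retrieval parameters such as `top_k` in the key keeps results for different settings from overwriting each other.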

⚙️ Performance Tuning

For Speed

# Smaller embedding model
EMBED_MODEL_NAME=BAAI/bge-small-en-v1.5

# Smaller chunks
CHUNK_SIZE_TOKENS=128
PARENT_CHUNK_SIZE_TOKENS=512

# Faster index
MILVUS_INDEX_TYPE=IVF_FLAT
MILVUS_NPROBE=8  # Lower = faster

# Enable cache
REDIS_URL=redis://localhost:6379

# Fewer LLM tokens
GROQ_MAX_TOKENS=512

For Quality

# Larger embedding model
EMBED_MODEL_NAME=BAAI/bge-base-en-v1.5

# Larger chunks (more context per retrieved passage)
CHUNK_SIZE_TOKENS=512
PARENT_CHUNK_SIZE_TOKENS=2048

# More precise index
MILVUS_INDEX_TYPE=HNSW
MILVUS_NPROBE=32  # Applies to IVF indexes only; HNSW search is tuned with ef instead

# Better LLM
GROQ_MODEL=llama-3.3-70b-versatile

# Enable CRAG
CRAG_ENABLED=true

🐛 Troubleshooting

Milvus Connection Failed

# Check if Milvus is running (the health endpoint is served on port 9091, not the gRPC port 19530)
curl http://localhost:9091/healthz

# Restart Milvus
docker-compose restart milvus

# Verify in settings
python -c "from config import get_settings; print(get_settings().milvus_host)"

Low Retrieval Quality

  1. Check chunk quality:

    from ingestion.chunker import SemanticChunker
    chunker = SemanticChunker()
    chunks = chunker.chunk("your document text")
    print([c.text for c in chunks[:3]])
    
  2. Verify embeddings:

    from retrieval.embedder import Embedder
    embedder = Embedder()
    emb = embedder.embed("test query")
    print(f"Embedding dim: {len(emb)}, sample: {emb[:5]}")
    
  3. Enable CRAG for automatic augmentation:

    CRAG_ENABLED=true
    

Slow Response Times

  1. Check cache hit rate
  2. Reduce MILVUS_NPROBE
  3. Use streaming endpoint (/query/stream)
  4. Enable Redis caching

Out of Memory

# Reduce batch sizes
EMBED_BATCH_SIZE=16

# Reduce chunk sizes
CHUNK_SIZE_TOKENS=128

# Switch to CPU if using GPU
EMBED_DEVICE=cpu

📈 Monitoring & Evaluation

Health Check

curl http://localhost:8000/health | jq .

Collection Statistics

from retrieval.dense import MilvusStore
from retrieval.embedder import Embedder

store = MilvusStore(embedder=Embedder())
stats = store.get_stats()
print(f"Documents: {stats['collection_count']}")

Query Evaluation

from evaluation.ragas_eval import RAGASEvaluator
from evaluation.store import EvalStore

evaluator = RAGASEvaluator(store=EvalStore(db_path="/data/storage/eval.db"))
metrics = evaluator.evaluate(query, context, answer)
print(f"Answer Relevance: {metrics['answer_relevance']:.2f}")
print(f"Faithfulness: {metrics['faithfulness']:.2f}")
print(f"Context Precision: {metrics['context_precision']:.2f}")

🀝 Contributing

Contributions welcome! Areas for enhancement:

  • Multi-language support
  • Fine-tuned domain-specific embeddings
  • Advanced reranking strategies
  • GraphQL API
  • Persistent trace logging
  • A/B testing framework

📝 License

MIT License. See the LICENSE file for details.


🔗 Resources


Questions? Open an issue on GitHub or check the documentation.

source .venv/bin/activate
pip install -r requirements.txt
python -m nltk.downloader punkt
python -m spacy download en_core_web_sm


2. Configure

cp .env.example .env
# Edit .env and set GROQ_API_KEY at minimum

Get a free Groq API key at https://console.groq.com

3. Start Milvus

docker-compose up -d
# Wait ~30s for Milvus to be healthy
docker-compose ps   # all three services should show "healthy"

4. Ingest documents

mkdir -p data/documents
# Copy PDFs / HTML / TXT files into data/documents/

python -m ingestion.pipeline data/documents

Or use the CLI:

python ingestion/pipeline.py data/documents
python ingestion/pipeline.py data/documents/paper.pdf

5. Start the API

uvicorn api.main:app --reload --port 8000

6. Start the UI

streamlit run ui/app.py

Open http://localhost:8501 in your browser.


API endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | /health | Component health check |
| POST | /ingest | Trigger ingestion pipeline |
| POST | /query | Blocking query (full JSON) |
| POST | /query/stream | Streaming query (SSE) |

Example: blocking query

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is attention in transformers?", "top_k": 5}'

Example: streaming query

curl -X POST http://localhost:8000/query/stream \
  -H "Content-Type: application/json" \
  -d '{"query": "Explain PagedAttention", "stream": true}'

Key design decisions

Semantic chunking

Fixed-size chunking (e.g. 1000 chars with 200 overlap) splits mid-sentence and mid-concept. Semantic chunking detects topic boundaries using cosine similarity between consecutive sentence embeddings, producing chunks that align with natural concept transitions. Combined with a fallback on token count (child_max = 256 tokens), chunks are both semantically coherent and bounded in size.
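
The boundary rule can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `ingestion/chunker.py`: `embed` is any caller-supplied function returning a vector, whitespace-separated words stand in for tokens, and a split happens when the similarity between consecutive sentences drops below the threshold or the chunk would exceed the token cap.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str],
                    embed: Callable[[str], Sequence[float]],
                    threshold: float = 0.82,
                    max_tokens: int = 256) -> list[str]:
    """Group consecutive sentences; split on a similarity drop or a full chunk."""
    chunks: list[list[str]] = []
    current: list[str] = []
    tokens = 0
    prev_vec = None
    for sent in sentences:
        vec = embed(sent)
        n = len(sent.split())  # crude token count (assumption)
        boundary = prev_vec is not None and cosine(prev_vec, vec) < threshold
        if current and (boundary or tokens + n > max_tokens):
            chunks.append(current)
            current, tokens = [], 0
        current.append(sent)
        tokens += n
        prev_vec = vec
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]


def toy_embed(sentence: str) -> list[float]:
    # Toy two-topic embedding used only to make the example runnable.
    return [1.0, 0.0] if "cat" in sentence else [0.0, 1.0]


text = ["The cat sat.", "The cat slept.", "Stocks fell.", "Stocks rose."]
print(semantic_chunks(text, toy_embed))
```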

Parent-child hierarchy

The child chunk (≈256 tokens) is what gets embedded and indexed: small, precise, high-relevance. When a child chunk is retrieved, its parent chunk (≈1024 tokens, centred on the child) is what goes into the LLM context. This separates the retrieval granularity from the generation context width, giving you the precision of small chunks with the coherence of large ones.
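
A sketch of how a parent window centred on a child might be computed from token offsets. The clamping behaviour at document edges is an assumption, and `parent_window` is a hypothetical helper rather than the project's implementation.

```python
def parent_window(child_start: int, child_end: int, doc_len: int,
                  parent_size: int = 1024) -> tuple[int, int]:
    """Return [start, end) token offsets of a parent window centred on the child."""
    child_len = child_end - child_start
    pad = max(parent_size - child_len, 0)
    start = child_start - pad // 2
    end = start + max(parent_size, child_len)
    # Shift the window back inside the document if it runs off either edge.
    if start < 0:
        start, end = 0, min(doc_len, end - start)
    if end > doc_len:
        start, end = max(0, start - (end - doc_len)), doc_len
    return start, end


print(parent_window(2000, 2256, doc_len=10_000))
```

Near the start or end of a document the window slides rather than shrinks, so the parent keeps its full width whenever the document is long enough.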

BGE query prefix

BAAI/bge-small-en-v1.5 is trained to expect a task-specific prefix on query strings for retrieval tasks:

"Represent this sentence for searching relevant passages: <query>"

Documents are embedded as-is. Skipping this prefix typically costs 3-5 points on retrieval benchmarks.
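
In code, the asymmetry amounts to prefixing queries only. The helper name below is hypothetical; the prefix string is the one quoted above.

```python
BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def prepare_for_embedding(text: str, is_query: bool) -> str:
    """Prefix queries only; passages are embedded unchanged."""
    return BGE_QUERY_PREFIX + text if is_query else text


print(prepare_for_embedding("What is attention in transformers?", is_query=True))
print(prepare_for_embedding("Attention weighs token interactions.", is_query=False))
```

The returned string is what would be passed to the embedding model, for queries at search time and for passages at ingestion time.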


Phase roadmap

| Phase | Status | What's added |
| --- | --- | --- |
| 1 | ✅ Done | Dense RAG, semantic chunking, parent-child, streaming UI |
| 2 | ✅ Done | BM25 sparse, query router, RRF fusion, cross-encoder reranking |
| 3 | ✅ Done | GraphRAG (spaCy NER + NetworkX), CRAG gate, web fallback |
| 4 | ✅ Done | RAGAS eval harness, Redis cache, evaluation dashboard |
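
The RRF fusion listed for phase 2 combines the dense and BM25 rankings by rank position rather than raw score. A minimal sketch of reciprocal rank fusion follows; the constant `k = 60` is the conventional default and an assumption here, and `rrf_fuse` is a hypothetical name that may not match `retrieval/fusion.py`.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


dense = ["a", "b", "c"]
bm25 = ["b", "d", "a"]
print(rrf_fuse([dense, bm25]))
```

Because only ranks are used, the dense cosine scores and BM25 scores never need to be put on a common scale.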