title: Cortex RAG
sdk: docker
emoji: 🧠
colorFrom: blue
colorTo: purple
Cortex RAG: Next-Gen Retrieval-Augmented Generation
Production-grade RAG system with dense retrieval, semantic chunking, knowledge graph integration, CRAG gating, and multi-provider LLM support.
Overview
Cortex is a production-ready Retrieval-Augmented Generation (RAG) framework that combines:
- Dense Vector Search: Fast, accurate document retrieval using BAAI embeddings (384-dim)
- Semantic Chunking: Intelligent split boundaries based on sentence-level cosine similarity
- Parent-Child Chunks: 256-token child chunks for precision, 1024-token parents for context
- Multi-Strategy Retrieval: Dense search, BM25 hybrid, knowledge graph traversal
- CRAG Gating: Automatic relevance assessment with fallback to web search
- Multi-Provider LLM: Support for Groq, OpenAI, NVIDIA NIM, and custom endpoints
- Streaming Responses: Real-time SSE-based answer generation with inline citations
- Knowledge Graphs: Automatic relation extraction and entity-based retrieval
- Caching Layer: Redis integration for query result caching
- Evaluation Framework: RAGAS-based RAG evaluation metrics
Architecture
┌─────────────────────────────────────────────────────────────────┐
│                       Document Ingestion                        │
├─────────────────────────────────────────────────────────────────┤
│  PDF/HTML/TXT → DocumentLoader → SemanticChunker                │
│                        ↓                                        │
│  Child (~256 tokens) + Parent (~1024 tokens)                    │
├─────────────────────────────────────────────────────────────────┤
│                         Embedding Layer                         │
├─────────────────────────────────────────────────────────────────┤
│  BAAI/bge-small-en-v1.5 (384-dim, L2-normalized)                │
│    → Milvus Store (IVF_FLAT, COSINE metric)                     │
│    → BM25 Index (keyword search)                                │
│    → Knowledge Graph (entities, relations, triples)             │
├─────────────────────────────────────────────────────────────────┤
│                        Query Processing                         │
├─────────────────────────────────────────────────────────────────┤
│  Dense Search (top-15) → Reranking → CRAG Gate                  │
│           ↓                               ↓                     │
│    High Confidence?                Low Confidence?              │
│           ↓                               ↓                     │
│    Use KnowledgeBase               Web Search (Tavily)          │
├─────────────────────────────────────────────────────────────────┤
│                   LLM Generation (Streaming)                    │
├─────────────────────────────────────────────────────────────────┤
│  Groq Llama 3.3-70B / OpenAI GPT-4o / NVIDIA NIM / Custom       │
│  Process context → Generate answer → Extract citations          │
│  Stream via SSE → Client receives real-time response            │
├─────────────────────────────────────────────────────────────────┤
│                       Frontend Interfaces                       │
├─────────────────────────────────────────────────────────────────┤
│  Streamlit UI (Ask/Ingest/System) | REST API (FastAPI)          │
└─────────────────────────────────────────────────────────────────┘
Key Features
| Feature | Details |
|---|---|
| Dense Retrieval | Sub-50ms semantic search via Milvus with 384-dim embeddings |
| Smart Chunking | Semantic splits + parent-child hierarchy for precision + context |
| Knowledge Graphs | Automatic relation extraction (REBEL or LLM-based) |
| CRAG Gating | Relevance assessment with web search fallback |
| Multi-Strategy | Dense + BM25 keyword + graph traversal combined |
| Redis Cache | Query result caching with configurable TTL |
| Multi-Provider LLM | Groq, OpenAI, NVIDIA NIM, Ollama, custom OpenAI-compatible |
| Evaluation | RAGAS metrics for answer relevance, faithfulness, context precision |
| Streaming UI | Real-time responses with inline citations and source cards |
| Docker Ready | Full Docker Compose setup with Milvus, Redis, API, UI |
Quick Start
Prerequisites
- Python 3.10+
- Docker & Docker Compose (optional, for containerized setup)
- Groq API key (Groq is the default LLM provider)
1. Clone & Setup
# Clone repository
git clone <repo-url>
cd cortex
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
2. Environment Configuration
Create a .env file in the project root:
# LLM Providers
GROQ_API_KEY=your_groq_api_key
GROQ_MODEL=llama-3.3-70b-versatile
GROQ_TEMPERATURE=0.1
# Optional: Other LLM providers
OPENAI_API_KEY=your_openai_key
MISTRAL_API_KEY=your_mistral_key
NVIDIA_API_KEY=your_nvidia_key
# Embedding & Storage
EMBED_MODEL_NAME=BAAI/bge-small-en-v1.5
EMBED_DEVICE=cpu # "cuda" if GPU available
# Milvus Vector Store
MILVUS_HOST=localhost
MILVUS_PORT=19530
MILVUS_COLLECTION=cortex_chunks
MILVUS_INDEX_TYPE=IVF_FLAT
# Redis Cache (optional)
REDIS_URL=redis://localhost:6379
# Retrieval
RETRIEVAL_TOP_K=15
FINAL_TOP_K=5
# CRAG (Corrective Retrieval Augmented Generation)
CRAG_ENABLED=true
CRAG_RELEVANCE_THRESHOLD=0.5
# Knowledge Graph
GRAPH_ENABLED=true
GRAPH_EXTRACTOR=llm-filtered # "rebel", "llm", "rebel-filtered", "llm-filtered"
GRAPH_MAX_HOPS=2
# API
API_HOST=0.0.0.0
API_PORT=8000
3. Start Services
Option A: Docker Compose (Recommended)
docker-compose up -d
# API: http://localhost:8000
# Streamlit UI: http://localhost:8501
# Milvus: http://localhost:19530
Option B: Local Setup
Make sure Milvus is running:
# Using Milvus Docker (if not using compose)
docker run -d -p 19530:19530 -p 9091:9091 milvusdb/milvus:latest
# Start API
python -m uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
# In another terminal, start UI
streamlit run ui/app.py
4. Ingest Documents
Via Streamlit UI:
- Open http://localhost:8501
- Go to the "Ingest" tab
- Upload PDF/HTML/TXT files or provide a directory path
Via REST API:
curl -X POST "http://localhost:8000/ingest" \
-H "Content-Type: application/json" \
-d '{
"mode": "directory",
"path": "/path/to/documents"
}'
5. Ask Questions
Via Streamlit UI:
- Go to the "Ask" tab
- Type your question
- Watch the response stream in with citations
Via REST API:
curl -X POST "http://localhost:8000/query" \
-H "Content-Type: application/json" \
-d '{
"query": "What is machine learning?",
"provider": "groq",
"top_k": 5
}' | jq .
Streaming Response:
curl -X POST "http://localhost:8000/query/stream" \
-H "Content-Type: application/json" \
-d '{
"query": "Your question here",
"provider": "groq"
}'
REST API Endpoints
Health & Status
GET /health
Returns system health, Milvus status, collection stats.
{
"status": "healthy",
"milvus": {
"connected": true,
"collection_count": 2500,
"index_type": "IVF_FLAT"
}
}
Document Ingestion
POST /ingest
Content-Type: application/json
{
"mode": "directory|file|upload",
"path": "/path/to/documents",
"chunk_size": 256,
"overlap": 32
}
Query (Blocking)
POST /query
Content-Type: application/json
{
"query": "Your question",
"provider": "groq",
"model": "llama-3.3-70b-versatile",
"top_k": 5,
"crag": true,
"graph": true
}
Response:
{
"answer": "Answer text with citations [1][2]...",
"chunks": [
{
"id": "chunk_001",
"text": "...",
"score": 0.87,
"source": "document_name.pdf"
}
],
"citations": [1, 2],
"latency_ms": 1245
}
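A small sketch of how a client might turn the response above into a list of cited sources. It assumes that each number in `citations` is a 1-based index into `chunks`, which the response format suggests but does not state; `cited_sources` is a hypothetical helper, not part of the Cortex API.

```python
from typing import Any, Dict, List


def cited_sources(response: Dict[str, Any]) -> List[str]:
    """Map citation numbers to the `source` of the chunk they point at.

    Assumption: citation n refers to chunks[n - 1]. Numbers that fall
    outside the chunk list are skipped rather than raising.
    """
    chunks = response.get("chunks", [])
    sources: List[str] = []
    for number in response.get("citations", []):
        index = number - 1  # assumed 1-based numbering
        if 0 <= index < len(chunks):
            sources.append(chunks[index]["source"])
    return sources


example = {
    "answer": "Answer text with citations [1][2]...",
    "chunks": [
        {"id": "chunk_001", "text": "...", "score": 0.87, "source": "a.pdf"},
        {"id": "chunk_002", "text": "...", "score": 0.81, "source": "b.pdf"},
    ],
    "citations": [1, 2, 5],  # 5 is deliberately out of range
}
print(cited_sources(example))
```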
Query (Streaming)
POST /query/stream
Content-Type: application/json
{
"query": "Your question",
"provider": "groq"
}
Response: Server-Sent Events (SSE) stream
data: {"type": "start"}
data: {"type": "chunk", "content": "Answer "}
data: {"type": "chunk", "content": "is "}
data: {"type": "chunk", "content": "streaming..."}
data: {"type": "citations", "citations": [1, 2]}
data: {"type": "end"}
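A minimal sketch of consuming this event stream in Python, assuming each event arrives as a single `data: <json>` line with the four event types shown above. The function name `collect_answer` is hypothetical; in practice the lines would come from an HTTP client iterating over the response body of `/query/stream`.

```python
import json
from typing import Iterable, List, Tuple


def collect_answer(lines: Iterable[str]) -> Tuple[str, List[int]]:
    """Assemble the answer text and citation list from SSE lines.

    Assumes the event shapes shown above ("start", "chunk",
    "citations", "end"); unknown event types are ignored.
    """
    parts: List[str] = []
    citations: List[int] = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        event = json.loads(line[len("data:"):].strip())
        kind = event.get("type")
        if kind == "chunk":
            parts.append(event.get("content", ""))
        elif kind == "citations":
            citations = list(event.get("citations", []))
        elif kind == "end":
            break
    return "".join(parts), citations


sample = [
    'data: {"type": "start"}',
    'data: {"type": "chunk", "content": "Answer "}',
    'data: {"type": "chunk", "content": "is "}',
    'data: {"type": "chunk", "content": "streaming..."}',
    'data: {"type": "citations", "citations": [1, 2]}',
    'data: {"type": "end"}',
]
answer, cited = collect_answer(sample)
print(answer, cited)
```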
Model Information
GET /providers
Lists all available LLM providers and models.
Configuration Guide
Retrieval Configuration
# Chunk sizes (tokens)
CHUNK_SIZE_TOKENS=256 # Child chunk size
PARENT_CHUNK_SIZE_TOKENS=1024 # Parent chunk size
SEMANTIC_SIMILARITY_THRESHOLD=0.82 # Split boundary threshold
CHUNK_OVERLAP_TOKENS=32 # Overlap padding
# Retrieval settings
RETRIEVAL_TOP_K=15 # Candidates before reranking
FINAL_TOP_K=5 # Chunks sent to LLM
Embedding Configuration
EMBED_MODEL_NAME=BAAI/bge-small-en-v1.5 # Model identifier
EMBED_DIM=384 # Output dimension
EMBED_BATCH_SIZE=64 # Batch size for processing
EMBED_DEVICE=cpu # cpu or cuda
Milvus Configuration
MILVUS_HOST=localhost
MILVUS_PORT=19530
MILVUS_COLLECTION=cortex_chunks
MILVUS_INDEX_TYPE=IVF_FLAT # or HNSW for larger corpora
MILVUS_METRIC_TYPE=COSINE # Vector similarity metric
MILVUS_NLIST=128 # clustering parameter for IVF
MILVUS_NPROBE=16 # search parameter
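As a rough guide to how `MILVUS_NLIST` and `MILVUS_NPROBE` interact: an IVF index partitions the vectors into `nlist` clusters and scans only the `nprobe` closest clusters per query, so the ratio of the two approximates the share of the index each search touches. The helper below is purely illustrative and its name is hypothetical.

```python
def probe_fraction(nlist: int, nprobe: int) -> float:
    """Fraction of IVF clusters scanned per query (nprobe / nlist)."""
    if nlist <= 0 or not 0 < nprobe <= nlist:
        raise ValueError("require 0 < nprobe <= nlist")
    return nprobe / nlist


# Defaults from the configuration above, then the lower nprobe value
# used in the speed-oriented settings later in this README.
print(probe_fraction(128, 16))
print(probe_fraction(128, 8))
```

With the defaults, each query scans about one eighth of the clusters; halving `nprobe` halves that share at some cost in recall.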
LLM Provider Configuration
Groq (Default)
GROQ_API_KEY=your_key
GROQ_MODEL=llama-3.3-70b-versatile
GROQ_TEMPERATURE=0.1
GROQ_MAX_TOKENS=1024
GROQ_TIMEOUT=30
OpenAI
OPENAI_API_KEY=your_key
NVIDIA NIM
NVIDIA_API_KEY=your_key
Custom/Ollama
CUSTOM_BASE_URL=http://localhost:11434/v1
CUSTOM_API_KEY=your_key
CRAG (Corrective Retrieval Augmented Generation)
CRAG_ENABLED=true
CRAG_RELEVANCE_THRESHOLD=0.5 # Grade boundary
TAVILY_API_KEY=your_tavily_key # For web search fallback
The CRAG gate automatically assesses retrieval quality:
- High confidence (score ≥ threshold) → Use knowledge base
- Low confidence (score < threshold) → Augment with web search
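The threshold rule can be sketched as follows. This is an illustration only: it assumes the gate's confidence is the best relevance score among the retrieved chunks, which may not match how `generation/crag.py` aggregates scores, and `route_by_confidence` is a hypothetical name.

```python
from typing import List


def route_by_confidence(scores: List[float], threshold: float = 0.5) -> str:
    """Return "knowledge_base" or "web_search" for a set of retrieval scores.

    Assumption: confidence is the maximum relevance score among the
    retrieved chunks; an empty result counts as zero confidence.
    """
    confidence = max(scores) if scores else 0.0
    return "knowledge_base" if confidence >= threshold else "web_search"


print(route_by_confidence([0.87, 0.41]))
print(route_by_confidence([0.32, 0.18]))
```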
Knowledge Graph
GRAPH_ENABLED=true
GRAPH_EXTRACTOR=llm-filtered # rebel|llm|rebel-filtered|llm-filtered
GRAPH_MAX_HOPS=2 # Traversal depth
GRAPH_PATH=/data/storage/knowledge_graph.json
# Density filtering (for "filtered" extractors)
DENSITY_TOP_FRACTION=0.30 # Process top 30% entity-dense chunks
DENSITY_MIN_ENTITIES=2 # Minimum entities per chunk
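One way to read the two density settings, sketched in Python. The order of operations is an assumption: chunks with fewer than `DENSITY_MIN_ENTITIES` entities are dropped first, then at most `DENSITY_TOP_FRACTION` of all chunks are kept, densest first. `select_dense_chunks` is a hypothetical helper, not the project's implementation.

```python
from typing import Dict, List


def select_dense_chunks(entity_counts: Dict[str, int],
                        top_fraction: float = 0.30,
                        min_entities: int = 2) -> List[str]:
    """Pick the chunk ids that would be sent to the relation extractor."""
    eligible = [(cid, n) for cid, n in entity_counts.items() if n >= min_entities]
    # Densest first; ties are broken by chunk id for a deterministic order.
    eligible.sort(key=lambda item: (-item[1], item[0]))
    budget = max(1, round(len(entity_counts) * top_fraction))
    return [cid for cid, _ in eligible[:budget]]


counts = {"c1": 5, "c2": 0, "c3": 2, "c4": 1, "c5": 7,
          "c6": 3, "c7": 0, "c8": 1, "c9": 4, "c10": 2}
selected = select_dense_chunks(counts)
print(selected)
```

Filtering before extraction keeps the LLM-based extractors from spending tokens on chunks that are unlikely to yield relations.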
Caching
REDIS_URL=redis://localhost:6379
CACHE_TTL_SECONDS=3600 # 1 hour
Evaluation
EVAL_DB_PATH=/data/storage/eval.db
Project Structure
cortex/
├── api/                        # FastAPI REST endpoints
│   ├── main.py                 # App initialization, endpoints
│   └── schemas.py              # Request/response Pydantic models
│
├── ingestion/                  # Document processing pipeline
│   ├── pipeline.py             # Orchestration
│   ├── document_loader.py      # PDF/HTML/TXT parsing
│   ├── chunker.py              # Semantic chunking
│   └── __init__.py
│
├── retrieval/                  # Multi-strategy retrieval
│   ├── orchestrator.py         # Coordinate retrieval strategies
│   ├── dense.py                # Milvus vector search
│   ├── bm25.py                 # Keyword search index
│   ├── embedder.py             # HuggingFace embedding model
│   ├── router.py               # Query routing logic
│   ├── fusion.py               # Result fusion & reranking
│   ├── graph_builder.py        # Build knowledge graphs
│   ├── graph_retriever.py      # Entity-based retrieval
│   ├── relation_extractors.py  # REBEL + LLM extractors
│   ├── cache.py                # Redis caching wrapper
│   └── __init__.py
│
├── generation/                 # LLM generation & CRAG
│   ├── generator.py            # Multi-provider LLM wrapper
│   ├── crag.py                 # CRAG gate logic
│   └── __init__.py
│
├── evaluation/                 # RAG evaluation metrics
│   ├── ragas_eval.py           # RAGAS evaluator
│   ├── store.py                # Evaluation database
│   └── __init__.py
│
├── ui/                         # Streamlit frontend
│   ├── app.py                  # Main UI
│   └── static/                 # (Optional) HTML/CSS/JS
│
├── data/                       # Data storage
│   ├── documents/              # Input documents
│   ├── storage/                # Persistent storage
│   │   ├── knowledge_graph.json
│   │   ├── bm25_index.pkl
│   │   └── uploads/
│   └── synthetic_knowledge_items.txt
│
├── config.py                   # Configuration & settings
├── requirements.txt            # Python dependencies
├── Dockerfile                  # Docker image build
├── docker-compose.yml          # Multi-container orchestration
├── test.py                     # Test suite
└── README.md                   # This file
Docker & Deployment
Docker Compose Quick Deploy
# Start all services
docker-compose up -d
# View logs
docker-compose logs -f api
# Stop services
docker-compose down
Services:
- milvus: Vector database (port 19530)
- redis: Caching layer (port 6379)
- api: FastAPI backend (port 8000)
- ui: Streamlit frontend (port 8501)
Environment Variables in Compose
Edit docker-compose.yml to customize:
services:
  api:
    environment:
      - GROQ_API_KEY=${GROQ_API_KEY}
      - GROQ_MODEL=llama-3.3-70b-versatile
      - MILVUS_HOST=milvus
      - REDIS_URL=redis://redis:6379
      - GRAPH_EXTRACTOR=llm-filtered
Production Deployment
For production, consider:
Use the HNSW index instead of IVF_FLAT for better recall:
MILVUS_INDEX_TYPE=HNSW
Enable caching for frequently asked questions:
REDIS_URL=redis://redis-prod:6379
Use a stronger embedding model for higher quality:
EMBED_MODEL_NAME=BAAI/bge-base-en-v1.5 # 768-dim, better quality
Configure CRAG for reliability:
CRAG_ENABLED=true
CRAG_RELEVANCE_THRESHOLD=0.6
TAVILY_API_KEY=your_key
Workflow Examples
Example 1: Legal Document Q&A
# 1. Ingest legal documents
curl -X POST "http://localhost:8000/ingest" \
-H "Content-Type: application/json" \
-d '{
"mode": "directory",
"path": "/data/legal_documents"
}'
# 2. Query with graph retrieval and the CRAG gate enabled
curl -X POST "http://localhost:8000/query" \
-H "Content-Type: application/json" \
-d '{
"query": "What are the penalties for breach of contract?",
"provider": "groq",
"graph": true,
"crag": true
}'
Example 2: Research Paper Analysis
# Ingest PDF papers
python -c "
from ingestion.pipeline import IngestionPipeline
from retrieval.embedder import Embedder
from retrieval.dense import MilvusStore
embedder = Embedder()
store = MilvusStore(embedder=embedder)
pipeline = IngestionPipeline(embedder=embedder, store=store, bm25=None)
pipeline.ingest('/data/papers', mode='pdf')
"
# Query for specific findings
curl -X POST "http://localhost:8000/query/stream" \
-H "Content-Type: application/json" \
-d '{
"query": "What are the key findings about transformer performance?",
"provider": "openai",
"model": "gpt-4o"
}'
Example 3: Customer Support Bot
# 1. Ingest FAQ and documentation
# 2. Set up CRAG with a suitable relevance threshold
# 3. Route low-confidence queries to web search
CRAG_RELEVANCE_THRESHOLD=0.6
TAVILY_API_KEY=your_key
Advanced Features
Knowledge Graph Extraction
Four modes are available:
| Mode | Backend | Speed | Quality | Cost |
|---|---|---|---|---|
| rebel | Local REBEL model | Fast | Good | Free |
| llm | LLM (Groq/OpenAI) | Slower | Excellent | $$ |
| rebel-filtered | REBEL + entity filtering | Fast | Good | Free |
| llm-filtered | LLM + entity filtering | Slower | Excellent | $$ |
Switch via config:
GRAPH_EXTRACTOR=llm-filtered
CRAG (Corrective RAG)
Automatically:
- Evaluates retrieval confidence
- Assigns a relevance grade (Correct/Partially-Correct/Missing)
- Supplements low-confidence retrievals with web search via Tavily
from generation.crag import CRAGGate
crag = CRAGGate()
response = crag.evaluate(query, context, answer)
# Returns: grade, supplemental_docs
Evaluation & Metrics
RAGAS-based evaluation:
from evaluation.ragas_eval import RAGASEvaluator
from evaluation.store import EvalStore
evaluator = RAGASEvaluator(store=EvalStore())
metrics = evaluator.evaluate(query, context, answer)
# Returns: answer_relevance, faithfulness, context_precision
Caching Strategy
from retrieval.cache import CachedRetriever
retriever = CachedRetriever(base_retriever)
# First call: 1000ms (database query)
# Second call: 5ms (Redis cache hit, TTL: 1 hour)
results = retriever.retrieve("machine learning basics")
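The caching idea can be illustrated without Redis. The sketch below is an in-memory stand-in, not the code in `retrieval/cache.py`: `TTLCache` and `fake_retrieve` are hypothetical names, and normalising the query before hashing is an assumption about how cache keys might be built.

```python
import hashlib
import time
from typing import Callable, Dict, List, Tuple


class TTLCache:
    """In-memory stand-in for the Redis cache (illustration only)."""

    def __init__(self, ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, List[str]]] = {}

    @staticmethod
    def key_for(query: str) -> str:
        # Normalise case and whitespace so near-identical queries share a key.
        normalised = " ".join(query.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def get_or_compute(self, query: str,
                       compute: Callable[[str], List[str]]) -> List[str]:
        key = self.key_for(query)
        hit = self._store.get(key)
        if hit is not None and self.clock() - hit[0] < self.ttl:
            return hit[1]  # fresh cache hit
        result = compute(query)  # miss or expired entry: run the retrieval
        self._store[key] = (self.clock(), result)
        return result


calls: List[str] = []


def fake_retrieve(query: str) -> List[str]:
    """Placeholder for the real (slow) retriever."""
    calls.append(query)
    return [f"chunk for: {query}"]


cache = TTLCache(ttl_seconds=3600)
first = cache.get_or_compute("machine learning basics", fake_retrieve)
second = cache.get_or_compute("Machine  Learning basics", fake_retrieve)
print(len(calls))
```

The second call differs only in case and spacing, so it is served from the cache and the retriever runs once.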
Performance Tuning
For Speed
# Smaller embedding model
EMBED_MODEL_NAME=BAAI/bge-small-en-v1.5
# Smaller chunks
CHUNK_SIZE_TOKENS=128
PARENT_CHUNK_SIZE_TOKENS=512
# Faster index
MILVUS_INDEX_TYPE=IVF_FLAT
MILVUS_NPROBE=8 # Lower = faster
# Enable cache
REDIS_URL=redis://localhost:6379
# Fewer LLM tokens
GROQ_MAX_TOKENS=512
For Quality
# Larger embedding model
EMBED_MODEL_NAME=BAAI/bge-base-en-v1.5
# Larger chunks
CHUNK_SIZE_TOKENS=512
PARENT_CHUNK_SIZE_TOKENS=2048
# More precise index
MILVUS_INDEX_TYPE=HNSW
MILVUS_NPROBE=32 # IVF-only search parameter; HNSW searches are tuned with ef instead
# Better LLM
GROQ_MODEL=llama-3.3-70b-versatile
# Enable CRAG
CRAG_ENABLED=true
Troubleshooting
Milvus Connection Failed
# Check if Milvus is running (the health endpoint is served on port 9091, not the 19530 gRPC port)
curl http://localhost:9091/healthz
# Restart Milvus
docker-compose restart milvus
# Verify in settings
python -c "from config import get_settings; print(get_settings().milvus_host)"
Low Retrieval Quality
Check chunk quality:
from ingestion.chunker import SemanticChunker
chunker = SemanticChunker()
chunks = chunker.chunk("your document text")
print([c.text for c in chunks[:3]])
Verify embeddings:
from retrieval.embedder import Embedder
embedder = Embedder()
emb = embedder.embed("test query")
print(f"Embedding dim: {len(emb)}, sample: {emb[:5]}")
Enable CRAG for automatic augmentation:
CRAG_ENABLED=true
Slow Response Times
- Check cache hit rate
- Reduce MILVUS_NPROBE
- Use the streaming endpoint (/query/stream)
- Enable Redis caching
Out of Memory
# Reduce batch sizes
EMBED_BATCH_SIZE=16
# Reduce chunk sizes
CHUNK_SIZE_TOKENS=128
# Switch to CPU if using GPU
EMBED_DEVICE=cpu
Monitoring & Evaluation
Health Check
curl http://localhost:8000/health | jq .
Collection Statistics
from retrieval.dense import MilvusStore
from retrieval.embedder import Embedder
store = MilvusStore(embedder=Embedder())
stats = store.get_stats()
print(f"Documents: {stats['collection_count']}")
Query Evaluation
from evaluation.ragas_eval import RAGASEvaluator
from evaluation.store import EvalStore
evaluator = RAGASEvaluator(store=EvalStore(db_path="/data/storage/eval.db"))
metrics = evaluator.evaluate(query, context, answer)
print(f"Answer Relevance: {metrics['answer_relevance']:.2f}")
print(f"Faithfulness: {metrics['faithfulness']:.2f}")
print(f"Context Precision: {metrics['context_precision']:.2f}")
Contributing
Contributions welcome! Areas for enhancement:
- Multi-language support
- Fine-tuned domain-specific embeddings
- Advanced reranking strategies
- GraphQL API
- Persistent trace logging
- A/B testing framework
License
MIT License. See the LICENSE file for details.
Resources
Questions? Open an issue on GitHub or check the documentation.
source .venv/bin/activate
pip install -r requirements.txt
python -m nltk.downloader punkt
python -m spacy download en_core_web_sm
2. Configure
cp .env.example .env
# Edit .env and set GROQ_API_KEY at minimum
Get a free Groq API key at https://console.groq.com
3. Start Milvus
docker-compose up -d
# Wait ~30s for Milvus to be healthy
docker-compose ps # all three services should show "healthy"
4. Ingest documents
mkdir -p data/documents
# Copy PDFs / HTML / TXT files into data/documents/
python -m ingestion.pipeline data/documents
Or use the CLI:
python ingestion/pipeline.py data/documents
python ingestion/pipeline.py data/documents/paper.pdf
5. Start the API
uvicorn api.main:app --reload --port 8000
6. Start the UI
streamlit run ui/app.py
Open http://localhost:8501 in your browser.
API endpoints
| Method | Path | Description |
|---|---|---|
| GET | /health | Component health check |
| POST | /ingest | Trigger ingestion pipeline |
| POST | /query | Blocking query (full JSON) |
| POST | /query/stream | Streaming query (SSE) |
Example: blocking query
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"query": "What is attention in transformers?", "top_k": 5}'
Example: streaming query
curl -X POST http://localhost:8000/query/stream \
-H "Content-Type: application/json" \
-d '{"query": "Explain PagedAttention", "stream": true}'
Key design decisions
Semantic chunking
Fixed-size chunking (e.g. 1000 chars with 200 overlap) splits mid-sentence and mid-concept. Semantic chunking detects topic boundaries using cosine similarity between consecutive sentence embeddings, producing chunks that align with natural concept transitions. Combined with a fallback on token count (child_max = 256 tokens), chunks are both semantically coherent and bounded in size.
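The boundary rule can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `ingestion/chunker.py`: the embedding function is passed in by the caller, tokens are approximated by whitespace-separated words, and the 0.82 default mirrors `SEMANTIC_SIMILARITY_THRESHOLD` from the configuration guide. The names `cosine` and `semantic_chunks` are hypothetical.

```python
import math
from typing import Callable, List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: List[str],
                    embed: Callable[[str], Sequence[float]],
                    threshold: float = 0.82,
                    max_tokens: int = 256) -> List[List[str]]:
    """Group consecutive sentences, splitting on a topic shift or size cap."""
    chunks: List[List[str]] = []
    current: List[str] = []
    tokens = 0
    prev_vec: Optional[Sequence[float]] = None
    for sentence in sentences:
        vec = embed(sentence)
        n = len(sentence.split())  # crude token estimate (assumption)
        topic_shift = prev_vec is not None and cosine(prev_vec, vec) < threshold
        too_big = tokens + n > max_tokens
        if current and (topic_shift or too_big):
            chunks.append(current)
            current, tokens = [], 0
        current.append(sentence)
        tokens += n
        prev_vec = vec
    if current:
        chunks.append(current)
    return chunks


# Toy 2-d "embeddings": the first two sentences point the same way,
# the third points elsewhere, so a boundary falls before it.
toy = {
    "Cats purr.": (1.0, 0.0),
    "Cats also meow.": (0.98, 0.2),
    "Milvus stores vectors.": (0.0, 1.0),
}
result = semantic_chunks(list(toy), toy.__getitem__)
print(result)
```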
Parent-child hierarchy
The child chunk (≈256 tokens) is what gets embedded and indexed: small, precise, high-relevance. When a child chunk is retrieved, its parent chunk (≈1024 tokens, centred on the child) is what goes into the LLM context. This separates the retrieval granularity from the generation context width, giving you the precision of small chunks with the coherence of large ones.
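A sketch of how a parent span centred on a child might be computed from token offsets. How the project handles document edges is not stated, so the clamping behaviour here (shift the window inward rather than shrink it) is an assumption, and `parent_window` is a hypothetical helper.

```python
from typing import Tuple


def parent_window(child_start: int, child_end: int, doc_len: int,
                  parent_size: int = 1024) -> Tuple[int, int]:
    """Return [start, end) token offsets of a parent span centred on a child.

    Near either edge of the document the window is shifted inward so it
    keeps its full size; documents shorter than parent_size are returned whole.
    """
    size = min(parent_size, doc_len)
    centre = (child_start + child_end) // 2
    start = centre - size // 2
    start = max(0, min(start, doc_len - size))
    return start, start + size


print(parent_window(2000, 2256, doc_len=10_000))   # child in the middle
print(parent_window(0, 256, doc_len=10_000))       # child at the start
print(parent_window(9900, 10_000, doc_len=10_000)) # child at the end
```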
BGE query prefix
BAAI/bge-small-en-v1.5 is trained to expect a task-specific prefix on
query strings for retrieval tasks:
"Represent this sentence for searching relevant passages: <query>"
Documents are embedded as-is. Skipping this prefix typically costs 3-5
points on retrieval benchmarks.
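In code, the asymmetry amounts to prefixing queries and leaving documents untouched. The helper names below are hypothetical and not taken from `retrieval/embedder.py`; only the prefix string comes from the text above.

```python
from typing import List

# Instruction prefix quoted above; note the trailing space before the query.
BGE_QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def prepare_query(query: str) -> str:
    """Text to embed for a search query: prefix plus the trimmed query."""
    return BGE_QUERY_PREFIX + query.strip()


def prepare_documents(texts: List[str]) -> List[str]:
    """Documents are embedded as-is, so they pass through unchanged."""
    return list(texts)


print(prepare_query("  What is attention in transformers? "))
```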
Phase roadmap
| Phase | Status | What's added |
|---|---|---|
| 1 | Done | Dense RAG, semantic chunking, parent-child, streaming UI |
| 2 | Done | BM25 sparse, query router, RRF fusion, cross-encoder reranking |
| 3 | Done | GraphRAG (spaCy NER + NetworkX), CRAG gate, web fallback |
| 4 | Done | RAGAS eval harness, Redis cache, evaluation dashboard |
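Phase 2 lists RRF (reciprocal rank fusion) as the way dense and BM25 results are combined. A minimal sketch of the standard formula, where each document scores the sum of 1 / (k + rank) over the rankings it appears in. The constant k = 60 is the commonly used default and an assumption here, and `rrf_fuse` is a hypothetical name rather than the function in `retrieval/fusion.py`.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked id lists with reciprocal rank fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense = ["a", "b", "c"]  # ids ranked by vector similarity
bm25 = ["b", "d", "a"]   # ids ranked by keyword match
fused = rrf_fuse([dense, bm25])
print(fused)
```

Because only ranks are used, the dense and BM25 scores never need to be put on a common scale.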