Spaces:

aditya-joshi-05
/

Cortex

Sleeping

App Files Files Community

Cortex / README.md

Aditya Joshi

Update README.md

f063263 unverified about 1 month ago

preview code

raw

history blame contribute delete

26.3 kB

metadata

title: Cortex RAG
sdk: docker
emoji: 🧠
colorFrom: blue
colorTo: purple

Cortex RAG — Next-Gen Retrieval-Augmented Generation

Production-grade RAG system with dense retrieval, semantic chunking, knowledge graph integration, CRAG gating, and multi-provider LLM support.

🎯 Overview

Cortex is a production-ready Retrieval-Augmented Generation (RAG) framework that combines:

Dense Vector Search — Fast, accurate document retrieval using BAAI embeddings (384-dim)
Semantic Chunking — Intelligent split boundaries based on sentence-level cosine similarity
Parent-Child Chunks — 256-token child chunks for precision, 1024-token parents for context
Multi-Strategy Retrieval — Dense search, BM25 hybrid, knowledge graph traversal
CRAG Gating — Automatic relevance assessment with fallback to web search
Multi-Provider LLM — Support for Groq, OpenAI, NVIDIA NIM, and custom endpoints
Streaming Responses — Real-time SSE-based answer generation with inline citations
Knowledge Graphs — Automatic relation extraction and entity-based retrieval
Caching Layer — Redis integration for query result caching
Evaluation Framework — RAGAS-based RAG evaluation metrics

🏗️ Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    Document Ingestion                            │
├─────────────────────────────────────────────────────────────────┤
│  PDF/HTML/TXT → DocumentLoader → SemanticChunker                 │
│                                       ↓                          │
│                   Child (~256 tokens) + Parent (~1024 tokens)    │
├─────────────────────────────────────────────────────────────────┤
│                        Embedding Layer                            │
├─────────────────────────────────────────────────────────────────┤
│  BAAI/bge-small-en-v1.5 (384-dim, L2-normalized)                │
│  → Milvus Store (IVF_FLAT, COSINE metric)                        │
│  → BM25 Index (keyword search)                                   │
│  → Knowledge Graph (entities, relations, triples)                │
├─────────────────────────────────────────────────────────────────┤
│                       Query Processing                            │
├─────────────────────────────────────────────────────────────────┤
│  Dense Search (top-15) → Reranking → CRAG Gate                   │
│         ↓                                    ↓                    │
│  High Confidence?                    Low Confidence?             │
│         ↓                                    ↓                    │
│    Use KnowledgeBase               ⚠️ Web Search (Tavily)        │
├─────────────────────────────────────────────────────────────────┤
│                    LLM Generation (Streaming)                     │
├─────────────────────────────────────────────────────────────────┤
│  Groq Llama 3.3-70B / OpenAI GPT-4o / NVIDIA NIM / Custom        │
│  Process context → Generate answer → Extract citations           │
│  Stream via SSE → Client receives real-time response             │
├─────────────────────────────────────────────────────────────────┤
│                  Frontend Interfaces                              │
├─────────────────────────────────────────────────────────────────┤
│  Streamlit UI (Ask/Ingest/System) | REST API (FastAPI)           │
└─────────────────────────────────────────────────────────────────┘

✨ Key Features

Feature	Details
🔍 Dense Retrieval	Sub-50ms semantic search via Milvus with 384-dim embeddings
📚 Smart Chunking	Semantic splits + parent-child hierarchy for precision + context
🧬 Knowledge Graphs	Automatic relation extraction (REBEL or LLM-based)
🚨 CRAG Gating	Relevance assessment with web search fallback
🔗 Multi-Strategy	Dense + BM25 keyword + graph traversal combined
💾 Redis Cache	Query result caching with configurable TTL
🌐 Multi-Provider LLM	Groq, OpenAI, NVIDIA NIM, Ollama, custom OpenAI-compatible
📊 Evaluation	RAGAS metrics for answer relevance, faithfulness, context precision
🎨 Streaming UI	Real-time responses with inline citations and source cards
🐳 Docker Ready	Full Docker Compose setup with Milvus, Redis, API, UI

🚀 Quick Start

Prerequisites

Python 3.10+
Docker & Docker Compose (optional, for containerized setup)
GROQ API key (default LLM provider)

1. Clone & Setup

# Clone repository
git clone <repo-url>
cd cortex

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Environment Configuration

Create .env file in project root:

# LLM Providers
GROQ_API_KEY=your_groq_api_key
GROQ_MODEL=llama-3.3-70b-versatile
GROQ_TEMPERATURE=0.1

# Optional: Other LLM providers
OPENAI_API_KEY=your_openai_key
MISTRAL_API_KEY=your_mistral_key
NVIDIA_API_KEY=your_nvidia_key

# Embedding & Storage
EMBED_MODEL_NAME=BAAI/bge-small-en-v1.5
EMBED_DEVICE=cpu  # "cuda" if GPU available

# Milvus Vector Store
MILVUS_HOST=localhost
MILVUS_PORT=19530
MILVUS_COLLECTION=cortex_chunks
MILVUS_INDEX_TYPE=IVF_FLAT

# Redis Cache (optional)
REDIS_URL=redis://localhost:6379

# Retrieval
RETRIEVAL_TOP_K=15
FINAL_TOP_K=5

# CRAG (Consistency-based Retrieval Augmented Generation)
CRAG_ENABLED=true
CRAG_RELEVANCE_THRESHOLD=0.5

# Knowledge Graph
GRAPH_ENABLED=true
GRAPH_EXTRACTOR=llm-filtered  # "rebel", "llm", "rebel-filtered", "llm-filtered"
GRAPH_MAX_HOPS=2

# API
API_HOST=0.0.0.0
API_PORT=8000

3. Start Services

Option A: Docker Compose (Recommended)

docker-compose up -d
# API: http://localhost:8000
# Streamlit UI: http://localhost:8501
# Milvus: http://localhost:19530

Option B: Local Setup

Make sure Milvus is running:

# Using Milvus Docker (if not using compose)
docker run -d -p 19530:19530 -p 9091:9091 milvusdb/milvus:latest

# Start API
python -m uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload

# In another terminal, start UI
streamlit run ui/app.py

4. Ingest Documents

Via Streamlit UI:

Open http://localhost:8501
Go to "📥 Ingest" tab
Upload PDF/HTML/TXT or provide directory path

Via REST API:

curl -X POST "http://localhost:8000/ingest" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "directory",
    "path": "/path/to/documents"
  }'

5. Ask Questions

Via Streamlit UI:

Go to "🔍 Ask" tab
Type your question
Watch streaming response with citations

Via REST API:

curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is machine learning?",
    "provider": "groq",
    "top_k": 5
  }' | jq .

Streaming Response:

curl -X POST "http://localhost:8000/query/stream" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "Your question here",
    "provider": "groq"
  }'

📡 REST API Endpoints

Health & Status

GET /health

Returns system health, Milvus status, collection stats.

{
  "status": "healthy",
  "milvus": {
    "connected": true,
    "collection_count": 2500,
    "index_type": "IVF_FLAT"
  }
}

Document Ingestion

POST /ingest
Content-Type: application/json

{
  "mode": "directory|file|upload",
  "path": "/path/to/documents",
  "chunk_size": 256,
  "overlap": 32
}

Query (Blocking)

POST /query
Content-Type: application/json

{
  "query": "Your question",
  "provider": "groq",
  "model": "llama-3.3-70b-versatile",
  "top_k": 5,
  "crag": true,
  "graph": true
}

Response:

{
  "answer": "Answer text with citations [1][2]...",
  "chunks": [
    {
      "id": "chunk_001",
      "text": "...",
      "score": 0.87,
      "source": "document_name.pdf"
    }
  ],
  "citations": [1, 2],
  "latency_ms": 1245
}

Query (Streaming)

POST /query/stream
Content-Type: application/json

{
  "query": "Your question",
  "provider": "groq"
}

Response: Server-Sent Events (SSE) stream

data: {"type": "start"}
data: {"type": "chunk", "content": "Answer "}
data: {"type": "chunk", "content": "is "}
data: {"type": "chunk", "content": "streaming..."}
data: {"type": "citations", "citations": [1, 2]}
data: {"type": "end"}

Model Information

GET /providers

Lists all available LLM providers and models.

🛠️ Configuration Guide

Retrieval Configuration

# Chunk sizes (tokens)
CHUNK_SIZE_TOKENS=256                    # Child chunk size
PARENT_CHUNK_SIZE_TOKENS=1024            # Parent chunk size
SEMANTIC_SIMILARITY_THRESHOLD=0.82       # Split boundary threshold
CHUNK_OVERLAP_TOKENS=32                  # Overlap padding

# Retrieval settings
RETRIEVAL_TOP_K=15                       # Candidates before reranking
FINAL_TOP_K=5                            # Chunks sent to LLM

Embedding Configuration

EMBED_MODEL_NAME=BAAI/bge-small-en-v1.5  # Model identifier
EMBED_DIM=384                             # Output dimension
EMBED_BATCH_SIZE=64                       # Batch size for processing
EMBED_DEVICE=cpu                          # cpu or cuda

Milvus Configuration

MILVUS_HOST=localhost
MILVUS_PORT=19530
MILVUS_COLLECTION=cortex_chunks
MILVUS_INDEX_TYPE=IVF_FLAT                # or HNSW for larger corpora
MILVUS_METRIC_TYPE=COSINE                 # Vector similarity metric
MILVUS_NLIST=128                          # clustering parameter for IVF
MILVUS_NPROBE=16                          # search parameter

LLM Provider Configuration

Groq (Default)

GROQ_API_KEY=your_key
GROQ_MODEL=llama-3.3-70b-versatile
GROQ_TEMPERATURE=0.1
GROQ_MAX_TOKENS=1024
GROQ_TIMEOUT=30

OpenAI

OPENAI_API_KEY=your_key

NVIDIA NIM

NVIDIA_API_KEY=your_key

Custom/Ollama

CUSTOM_BASE_URL=http://localhost:11434/v1
CUSTOM_API_KEY=your_key

CRAG (Consistency-based Retrieval Augmented Generation)

CRAG_ENABLED=true
CRAG_RELEVANCE_THRESHOLD=0.5             # Grade boundary
TAVILY_API_KEY=your_tavily_key           # For web search fallback

The CRAG gate automatically assesses retrieval quality:

High confidence (score ≥ threshold) → Use knowledge base
Low confidence (score < threshold) → Augment with web search

Knowledge Graph

GRAPH_ENABLED=true
GRAPH_EXTRACTOR=llm-filtered             # rebel|llm|rebel-filtered|llm-filtered
GRAPH_MAX_HOPS=2                          # Traversal depth
GRAPH_PATH=/data/storage/knowledge_graph.json

# Density filtering (for "filtered" extractors)
DENSITY_TOP_FRACTION=0.30                 # Process top 30% entity-dense chunks
DENSITY_MIN_ENTITIES=2                    # Minimum entities per chunk

Caching

REDIS_URL=redis://localhost:6379
CACHE_TTL_SECONDS=3600                    # 1 hour

Evaluation

EVAL_DB_PATH=/data/storage/eval.db

📁 Project Structure

cortex/
├── api/                          # FastAPI REST endpoints
│   ├── main.py                   # App initialization, endpoints
│   └── schemas.py                # Request/response Pydantic models
│
├── ingestion/                    # Document processing pipeline
│   ├── pipeline.py               # Orchestration
│   ├── document_loader.py        # PDF/HTML/TXT parsing
│   ├── chunker.py                # Semantic chunking
│   └── __init__.py
│
├── retrieval/                    # Multi-strategy retrieval
│   ├── orchestrator.py           # Coordinate retrieval strategies
│   ├── dense.py                  # Milvus vector search
│   ├── bm25.py                   # Keyword search index
│   ├── embedder.py               # HuggingFace embedding model
│   ├── router.py                 # Query routing logic
│   ├── fusion.py                 # Result fusion & reranking
│   ├── graph_builder.py          # Build knowledge graphs
│   ├── graph_retriever.py        # Entity-based retrieval
│   ├── relation_extractors.py    # REBEL + LLM extractors
│   ├── cache.py                  # Redis caching wrapper
│   └── __init__.py
│
├── generation/                   # LLM generation & CRAG
│   ├── generator.py              # Multi-provider LLM wrapper
│   ├── crag.py                   # CRAG gate logic
│   └── __init__.py
│
├── evaluation/                   # RAG evaluation metrics
│   ├── ragas_eval.py             # RAGAS evaluator
│   ├── store.py                  # Evaluation database
│   └── __init__.py
│
├── ui/                           # Streamlit frontend
│   ├── app.py                    # Main UI
│   └── static/                   # (Optional) HTML/CSS/JS
│
├── data/                         # Data storage
│   ├── documents/                # Input documents
│   ├── storage/                  # Persistent storage
│   │   ├── knowledge_graph.json
│   │   ├── bm25_index.pkl
│   │   └── uploads/
│   └── synthetic_knowledge_items.txt
│
├── config.py                     # Configuration & settings
├── requirements.txt              # Python dependencies
├── Dockerfile                    # Docker image build
├── docker-compose.yml            # Multi-container orchestration
├── test.py                       # Test suite
└── README.md                     # This file

🐳 Docker & Deployment

Docker Compose Quick Deploy

# Start all services
docker-compose up -d

# View logs
docker-compose logs -f api

# Stop services
docker-compose down

Services:

milvus — Vector database (port 19530)
redis — Caching layer (port 6379)
api — FastAPI backend (port 8000)
ui — Streamlit frontend (port 8501)

Environment Variables in Compose

Edit docker-compose.yml to customize:

services:
  api:
    environment:
      - GROQ_API_KEY=${GROQ_API_KEY}
      - GROQ_MODEL=llama-3.3-70b-versatile
      - MILVUS_HOST=milvus
      - REDIS_URL=redis://redis:6379
      - GRAPH_EXTRACTOR=llm-filtered

Production Deployment

For production, consider:

Use HNSW index instead of IVF_FLAT for better recall:
```
MILVUS_INDEX_TYPE=HNSW
```
Enable caching for frequently asked questions:
```
REDIS_URL=redis://redis-prod:6379
```

Use stronger embedding model for higher quality:

EMBED_MODEL_NAME=BAAI/bge-base-en-v1.5  # 768-dim, better quality

Configure CRAG for reliability:

CRAG_ENABLED=true
CRAG_RELEVANCE_THRESHOLD=0.6
TAVILY_API_KEY=your_key

🔄 Workflow Examples

Example 1: Legal Document Q&A

# 1. Ingest legal documents
curl -X POST "http://localhost:8000/ingest" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "directory",
    "path": "/data/legal_documents"
  }'

# 2. Query with graph enabled for relation extraction
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the penalties for breach of contract?",
    "provider": "groq",
    "graph": true,
    "crag": true
  }'

Example 2: Research Paper Analysis

# Ingest PDF papers
python -c "
from ingestion.pipeline import IngestionPipeline
from retrieval.embedder import Embedder
from retrieval.dense import MilvusStore

embedder = Embedder()
store = MilvusStore(embedder=embedder)
pipeline = IngestionPipeline(embedder=embedder, store=store, bm25=None)

pipeline.ingest('/data/papers', mode='pdf')
"

# Query for specific findings
curl -X POST "http://localhost:8000/query/stream" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What are the key findings about transformer performance?",
    "model": "gpt-4o"
  }'

Example 3: Customer Support Bot

# 1. Ingest FAQ and documentation
# 2. Set up CRAG with relevant threshold
# 3. Route low-confidence queries to web search

CRAG_RELEVANCE_THRESHOLD=0.6
TAVILY_API_KEY=your_key

📊 Advanced Features

Knowledge Graph Extraction

Three modes available:

Mode	Backend	Speed	Quality	Cost
`rebel`	Local REBEL model	Fast	Good	Free
`llm`	LLM (Groq/OpenAI)	Slower	Excellent	$$
`rebel-filtered`	REBEL + entity filtering	Fast	Good	Free
`llm-filtered`	LLM + entity filtering	Slower	Excellent	$$

Switch via config:

GRAPH_EXTRACTOR=llm-filtered

CRAG (Consistency-based RAG)

Automatically:

Evaluates retrieval confidence
Assigns relevance grade (Correct/Partially-Correct/Missing)
Supplements low-confidence with web search via Tavily

from generation.crag import CRAGGate

crag = CRAGGate()
response = crag.evaluate(query, context, answer)
# Returns: grade, supplemental_docs

Evaluation & Metrics

RAGAS-based evaluation:

from evaluation.ragas_eval import RAGASEvaluator
from evaluation.store import EvalStore

evaluator = RAGASEvaluator(store=EvalStore())
metrics = evaluator.evaluate(query, context, answer)
# Returns: answer_relevance, faithfulness, context_precision

Caching Strategy

from retrieval.cache import CachedRetriever

retriever = CachedRetriever(base_retriever)
# First call: 1000ms (database query)
# Second call: 5ms (Redis cache hit, TTL: 1 hour)
results = retriever.retrieve("machine learning basics")

⚙️ Performance Tuning

For Speed

# Smaller embedding model
EMBED_MODEL_NAME=BAAI/bge-small-en-v1.5

# Smaller chunks
CHUNK_SIZE_TOKENS=128
PARENT_CHUNK_SIZE_TOKENS=512

# Faster index
MILVUS_INDEX_TYPE=IVF_FLAT
MILVUS_NPROBE=8  # Lower = faster

# Enable cache
REDIS_URL=redis://localhost:6379

# Fewer LLM tokens
GROQ_MAX_TOKENS=512

For Quality

# Larger embedding model
EMBED_MODEL_NAME=BAAI/bge-base-en-v1.5

# Optimal chunks
CHUNK_SIZE_TOKENS=512
PARENT_CHUNK_SIZE_TOKENS=2048

# More precise index
MILVUS_INDEX_TYPE=HNSW
MILVUS_NPROBE=32

# Better LLM
GROQ_MODEL=llama-3.3-70b-versatile

# Enable CRAG
CRAG_ENABLED=true

🐛 Troubleshooting

Milvus Connection Failed

# Check if Milvus is running
curl http://localhost:19530/healthz

# Restart Milvus
docker-compose restart milvus

# Verify in settings
python -c "from config import get_settings; print(get_settings().milvus_host)"

Low Retrieval Quality

Check chunk quality:

from ingestion.chunker import SemanticChunker
chunker = SemanticChunker()
chunks = chunker.chunk("your document text")
print([c.text for c in chunks[:3]])

Verify embeddings:

from retrieval.embedder import Embedder
embedder = Embedder()
emb = embedder.embed("test query")
print(f"Embedding dim: {len(emb)}, sample: {emb[:5]}")

Enable CRAG for automatic augmentation:
```
CRAG_ENABLED=true
```

Slow Response Times

Check cache hit rate
Reduce MILVUS_NPROBE
Use streaming endpoint (/query/stream)
Enable Redis caching

Out of Memory

# Reduce batch sizes
EMBED_BATCH_SIZE=16

# Reduce chunk sizes
CHUNK_SIZE_TOKENS=128

# Switch to CPU if using GPU
EMBED_DEVICE=cpu

📈 Monitoring & Evaluation

Health Check

curl http://localhost:8000/health | jq .

Collection Statistics

from retrieval.dense import MilvusStore
from retrieval.embedder import Embedder

store = MilvusStore(embedder=Embedder())
stats = store.get_stats()
print(f"Documents: {stats['collection_count']}")

Query Evaluation

from evaluation.ragas_eval import RAGASEvaluator
from evaluation.store import EvalStore

evaluator = RAGASEvaluator(store=EvalStore(db_path="/data/storage/eval.db"))
metrics = evaluator.evaluate(query, context, answer)
print(f"Answer Relevance: {metrics['answer_relevance']:.2f}")
print(f"Faithfulness: {metrics['faithfulness']:.2f}")
print(f"Context Precision: {metrics['context_precision']:.2f}")

🤝 Contributing

Contributions welcome! Areas for enhancement:

Multi-language support
Fine-tuned domain-specific embeddings
Advanced reranking strategies
GraphQL API
Persistent trace logging
A/B testing framework

📝 License

MIT License — see LICENSE file for details

🔗 Resources

Questions? Open an issue on GitHub or check the documentation. source .venv/bin/activate pip install -r requirements.txt python -m nltk.downloader punkt python -m spacy download en_core_web_sm


### 2. Configure

```bash
cp .env.example .env
# Edit .env — set GROQ_API_KEY at minimum

Get a free Groq API key at https://console.groq.com

3. Start Milvus

docker-compose up -d
# Wait ~30s for Milvus to be healthy
docker-compose ps   # all three services should show "healthy"

4. Ingest documents

mkdir -p data/documents
# Copy PDFs / HTML / TXT files into data/documents/

python -m ingestion.pipeline data/documents

Or use the CLI:

python ingestion/pipeline.py data/documents
python ingestion/pipeline.py data/documents/paper.pdf

5. Start the API

uvicorn api.main:app --reload --port 8000

6. Start the UI

streamlit run ui/app.py

Open http://localhost:8501 in your browser.

API endpoints

Method	Path	Description
GET	`/health`	Component health check
POST	`/ingest`	Trigger ingestion pipeline
POST	`/query`	Blocking query (full JSON)
POST	`/query/stream`	Streaming query (SSE)

Example — blocking query

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is attention in transformers?", "top_k": 5}'

Example — streaming query

curl -X POST http://localhost:8000/query/stream \
  -H "Content-Type: application/json" \
  -d '{"query": "Explain PagedAttention", "stream": true}'

Key design decisions

Semantic chunking

Fixed-size chunking (e.g. 1000 chars with 200 overlap) splits mid-sentence and mid-concept. Semantic chunking detects topic boundaries using cosine similarity between consecutive sentence embeddings, producing chunks that align with natural concept transitions. Combined with a fallback on token count (child_max = 256 tokens), chunks are both semantically coherent and bounded in size.

Parent-child hierarchy

The child chunk (≈256 tokens) is what gets embedded and indexed — small, precise, high-relevance. When a child chunk is retrieved, its parent chunk (≈1024 tokens, centred on the child) is what goes into the LLM context. This separates the retrieval granularity from the generation context width, giving you the precision of small chunks with the coherence of large ones.

BGE query prefix

BAAI/bge-small-en-v1.5 is trained to expect a task-specific prefix on query strings for retrieval tasks: "Represent this sentence for searching relevant passages: <query>" Documents are embedded as-is. Skipping this prefix typically costs 3-5 points on retrieval benchmarks.

Phase roadmap

Phase	Status	What's added
1	✅ Done	Dense RAG, semantic chunking, parent-child, streaming UI
2	✅ Done	BM25 sparse, query router, RRF fusion, cross-encoder reranking
3	✅ Done	GraphRAG (spaCy NER + NetworkX), CRAG gate, web fallback
4	✅ Done	RAGAS eval harness, Redis cache, evaluation dashboard