Spaces:
Sleeping
Sleeping
| title: Cortex RAG | |
| sdk: docker | |
| emoji: π§ | |
| colorFrom: blue | |
| colorTo: purple | |
| # Cortex RAG β Next-Gen Retrieval-Augmented Generation | |
| <div align="center"> | |
| **Production-grade RAG system with dense retrieval, semantic chunking, knowledge graph integration, CRAG gating, and multi-provider LLM support.** | |
|  | |
|  | |
|  | |
|  | |
| </div> | |
| --- | |
| ## π― Overview | |
| **Cortex** is a production-ready Retrieval-Augmented Generation (RAG) framework that combines: | |
| - **Dense Vector Search** β Fast, accurate document retrieval using BAAI embeddings (384-dim) | |
| - **Semantic Chunking** β Intelligent split boundaries based on sentence-level cosine similarity | |
| - **Parent-Child Chunks** β 256-token child chunks for precision, 1024-token parents for context | |
| - **Multi-Strategy Retrieval** β Dense search, BM25 hybrid, knowledge graph traversal | |
| - **CRAG Gating** β Automatic relevance assessment with fallback to web search | |
| - **Multi-Provider LLM** β Support for Groq, OpenAI, NVIDIA NIM, and custom endpoints | |
| - **Streaming Responses** β Real-time SSE-based answer generation with inline citations | |
| - **Knowledge Graphs** β Automatic relation extraction and entity-based retrieval | |
| - **Caching Layer** β Redis integration for query result caching | |
| - **Evaluation Framework** β RAGAS-based RAG evaluation metrics | |
| --- | |
| ## ποΈ Architecture | |
| ``` | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| β Document Ingestion β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ | |
| β PDF/HTML/TXT β DocumentLoader β SemanticChunker β | |
| β β β | |
| β Child (~256 tokens) + Parent (~1024 tokens) β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ | |
| β Embedding Layer β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ | |
| β BAAI/bge-small-en-v1.5 (384-dim, L2-normalized) β | |
| β β Milvus Store (IVF_FLAT, COSINE metric) β | |
| β β BM25 Index (keyword search) β | |
| β β Knowledge Graph (entities, relations, triples) β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ | |
| β Query Processing β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ | |
| β Dense Search (top-15) β Reranking β CRAG Gate β | |
| β β β β | |
| β High Confidence? Low Confidence? β | |
| β β β β | |
| β Use KnowledgeBase β οΈ Web Search (Tavily) β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ | |
| β LLM Generation (Streaming) β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ | |
| β Groq Llama 3.3-70B / OpenAI GPT-4o / NVIDIA NIM / Custom β | |
| β Process context β Generate answer β Extract citations β | |
| β Stream via SSE β Client receives real-time response β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ | |
| β Frontend Interfaces β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€ | |
| β Streamlit UI (Ask/Ingest/System) | REST API (FastAPI) β | |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | |
| ``` | |
| --- | |
| ## β¨ Key Features | |
| | Feature | Details | | |
| |---------|---------| | |
| | π **Dense Retrieval** | Sub-50ms semantic search via Milvus with 384-dim embeddings | | |
| | π **Smart Chunking** | Semantic splits + parent-child hierarchy for precision + context | | |
| | 𧬠**Knowledge Graphs** | Automatic relation extraction (REBEL or LLM-based) | | |
| | π¨ **CRAG Gating** | Relevance assessment with web search fallback | | |
| | π **Multi-Strategy** | Dense + BM25 keyword + graph traversal combined | | |
| | πΎ **Redis Cache** | Query result caching with configurable TTL | | |
| | π **Multi-Provider LLM** | Groq, OpenAI, NVIDIA NIM, Ollama, custom OpenAI-compatible | | |
| | π **Evaluation** | RAGAS metrics for answer relevance, faithfulness, context precision | | |
| | π¨ **Streaming UI** | Real-time responses with inline citations and source cards | | |
| | π³ **Docker Ready** | Full Docker Compose setup with Milvus, Redis, API, UI | | |
| --- | |
| ## π Quick Start | |
| ### Prerequisites | |
| - Python 3.10+ | |
| - Docker & Docker Compose (optional, for containerized setup) | |
| - GROQ API key (default LLM provider) | |
| ### 1. Clone & Setup | |
| ```bash | |
| # Clone repository | |
| git clone <repo-url> | |
| cd cortex | |
| # Create virtual environment | |
| python -m venv .venv | |
| source .venv/bin/activate # On Windows: .venv\Scripts\activate | |
| # Install dependencies | |
| pip install -r requirements.txt | |
| ``` | |
| ### 2. Environment Configuration | |
| Create `.env` file in project root: | |
| ```bash | |
| # LLM Providers | |
| GROQ_API_KEY=your_groq_api_key | |
| GROQ_MODEL=llama-3.3-70b-versatile | |
| GROQ_TEMPERATURE=0.1 | |
| # Optional: Other LLM providers | |
| OPENAI_API_KEY=your_openai_key | |
| MISTRAL_API_KEY=your_mistral_key | |
| NVIDIA_API_KEY=your_nvidia_key | |
| # Embedding & Storage | |
| EMBED_MODEL_NAME=BAAI/bge-small-en-v1.5 | |
| EMBED_DEVICE=cpu # "cuda" if GPU available | |
| # Milvus Vector Store | |
| MILVUS_HOST=localhost | |
| MILVUS_PORT=19530 | |
| MILVUS_COLLECTION=cortex_chunks | |
| MILVUS_INDEX_TYPE=IVF_FLAT | |
| # Redis Cache (optional) | |
| REDIS_URL=redis://localhost:6379 | |
| # Retrieval | |
| RETRIEVAL_TOP_K=15 | |
| FINAL_TOP_K=5 | |
| # CRAG (Consistency-based Retrieval Augmented Generation) | |
| CRAG_ENABLED=true | |
| CRAG_RELEVANCE_THRESHOLD=0.5 | |
| # Knowledge Graph | |
| GRAPH_ENABLED=true | |
| GRAPH_EXTRACTOR=llm-filtered # "rebel", "llm", "rebel-filtered", "llm-filtered" | |
| GRAPH_MAX_HOPS=2 | |
| # API | |
| API_HOST=0.0.0.0 | |
| API_PORT=8000 | |
| ``` | |
| ### 3. Start Services | |
| **Option A: Docker Compose (Recommended)** | |
| ```bash | |
| docker-compose up -d | |
| # API: http://localhost:8000 | |
| # Streamlit UI: http://localhost:8501 | |
| # Milvus: http://localhost:19530 | |
| ``` | |
| **Option B: Local Setup** | |
| Make sure Milvus is running: | |
| ```bash | |
| # Using Milvus Docker (if not using compose) | |
| docker run -d -p 19530:19530 -p 9091:9091 milvusdb/milvus:latest | |
| # Start API | |
| python -m uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload | |
| # In another terminal, start UI | |
| streamlit run ui/app.py | |
| ``` | |
| ### 4. Ingest Documents | |
| **Via Streamlit UI:** | |
| - Open http://localhost:8501 | |
| - Go to "π₯ Ingest" tab | |
| - Upload PDF/HTML/TXT or provide directory path | |
| **Via REST API:** | |
| ```bash | |
| curl -X POST "http://localhost:8000/ingest" \ | |
| -H "Content-Type: application/json" \ | |
| -d '{ | |
| "mode": "directory", | |
| "path": "/path/to/documents" | |
| }' | |
| ``` | |
| ### 5. Ask Questions | |
| **Via Streamlit UI:** | |
| - Go to "π Ask" tab | |
| - Type your question | |
| - Watch streaming response with citations | |
| **Via REST API:** | |
| ```bash | |
| curl -X POST "http://localhost:8000/query" \ | |
| -H "Content-Type: application/json" \ | |
| -d '{ | |
| "query": "What is machine learning?", | |
| "provider": "groq", | |
| "top_k": 5 | |
| }' | jq . | |
| ``` | |
| **Streaming Response:** | |
| ```bash | |
| curl -X POST "http://localhost:8000/query/stream" \ | |
| -H "Content-Type: application/json" \ | |
| -d '{ | |
| "query": "Your question here", | |
| "provider": "groq" | |
| }' | |
| ``` | |
| --- | |
| ## π‘ REST API Endpoints | |
| ### Health & Status | |
| ```http | |
| GET /health | |
| ``` | |
| Returns system health, Milvus status, collection stats. | |
| ```json | |
| { | |
| "status": "healthy", | |
| "milvus": { | |
| "connected": true, | |
| "collection_count": 2500, | |
| "index_type": "IVF_FLAT" | |
| } | |
| } | |
| ``` | |
| ### Document Ingestion | |
| ```http | |
| POST /ingest | |
| Content-Type: application/json | |
| { | |
| "mode": "directory|file|upload", | |
| "path": "/path/to/documents", | |
| "chunk_size": 256, | |
| "overlap": 32 | |
| } | |
| ``` | |
| ### Query (Blocking) | |
| ```http | |
| POST /query | |
| Content-Type: application/json | |
| { | |
| "query": "Your question", | |
| "provider": "groq", | |
| "model": "llama-3.3-70b-versatile", | |
| "top_k": 5, | |
| "crag": true, | |
| "graph": true | |
| } | |
| ``` | |
| **Response:** | |
| ```json | |
| { | |
| "answer": "Answer text with citations [1][2]...", | |
| "chunks": [ | |
| { | |
| "id": "chunk_001", | |
| "text": "...", | |
| "score": 0.87, | |
| "source": "document_name.pdf" | |
| } | |
| ], | |
| "citations": [1, 2], | |
| "latency_ms": 1245 | |
| } | |
| ``` | |
| ### Query (Streaming) | |
| ```http | |
| POST /query/stream | |
| Content-Type: application/json | |
| { | |
| "query": "Your question", | |
| "provider": "groq" | |
| } | |
| ``` | |
| **Response:** Server-Sent Events (SSE) stream | |
| ``` | |
| data: {"type": "start"} | |
| data: {"type": "chunk", "content": "Answer "} | |
| data: {"type": "chunk", "content": "is "} | |
| data: {"type": "chunk", "content": "streaming..."} | |
| data: {"type": "citations", "citations": [1, 2]} | |
| data: {"type": "end"} | |
| ``` | |
| ### Model Information | |
| ```http | |
| GET /providers | |
| ``` | |
| Lists all available LLM providers and models. | |
| --- | |
| ## π οΈ Configuration Guide | |
| ### Retrieval Configuration | |
| ```env | |
| # Chunk sizes (tokens) | |
| CHUNK_SIZE_TOKENS=256 # Child chunk size | |
| PARENT_CHUNK_SIZE_TOKENS=1024 # Parent chunk size | |
| SEMANTIC_SIMILARITY_THRESHOLD=0.82 # Split boundary threshold | |
| CHUNK_OVERLAP_TOKENS=32 # Overlap padding | |
| # Retrieval settings | |
| RETRIEVAL_TOP_K=15 # Candidates before reranking | |
| FINAL_TOP_K=5 # Chunks sent to LLM | |
| ``` | |
| ### Embedding Configuration | |
| ```env | |
| EMBED_MODEL_NAME=BAAI/bge-small-en-v1.5 # Model identifier | |
| EMBED_DIM=384 # Output dimension | |
| EMBED_BATCH_SIZE=64 # Batch size for processing | |
| EMBED_DEVICE=cpu # cpu or cuda | |
| ``` | |
| ### Milvus Configuration | |
| ```env | |
| MILVUS_HOST=localhost | |
| MILVUS_PORT=19530 | |
| MILVUS_COLLECTION=cortex_chunks | |
| MILVUS_INDEX_TYPE=IVF_FLAT # or HNSW for larger corpora | |
| MILVUS_METRIC_TYPE=COSINE # Vector similarity metric | |
| MILVUS_NLIST=128 # clustering parameter for IVF | |
| MILVUS_NPROBE=16 # search parameter | |
| ``` | |
| ### LLM Provider Configuration | |
| **Groq (Default)** | |
| ```env | |
| GROQ_API_KEY=your_key | |
| GROQ_MODEL=llama-3.3-70b-versatile | |
| GROQ_TEMPERATURE=0.1 | |
| GROQ_MAX_TOKENS=1024 | |
| GROQ_TIMEOUT=30 | |
| ``` | |
| **OpenAI** | |
| ```env | |
| OPENAI_API_KEY=your_key | |
| ``` | |
| **NVIDIA NIM** | |
| ```env | |
| NVIDIA_API_KEY=your_key | |
| ``` | |
| **Custom/Ollama** | |
| ```env | |
| CUSTOM_BASE_URL=http://localhost:11434/v1 | |
| CUSTOM_API_KEY=your_key | |
| ``` | |
| ### CRAG (Consistency-based Retrieval Augmented Generation) | |
| ```env | |
| CRAG_ENABLED=true | |
| CRAG_RELEVANCE_THRESHOLD=0.5 # Grade boundary | |
| TAVILY_API_KEY=your_tavily_key # For web search fallback | |
| ``` | |
| The CRAG gate automatically assesses retrieval quality: | |
| - **High confidence** (score β₯ threshold) β Use knowledge base | |
| - **Low confidence** (score < threshold) β Augment with web search | |
| ### Knowledge Graph | |
| ```env | |
| GRAPH_ENABLED=true | |
| GRAPH_EXTRACTOR=llm-filtered # rebel|llm|rebel-filtered|llm-filtered | |
| GRAPH_MAX_HOPS=2 # Traversal depth | |
| GRAPH_PATH=/data/storage/knowledge_graph.json | |
| # Density filtering (for "filtered" extractors) | |
| DENSITY_TOP_FRACTION=0.30 # Process top 30% entity-dense chunks | |
| DENSITY_MIN_ENTITIES=2 # Minimum entities per chunk | |
| ``` | |
| ### Caching | |
| ```env | |
| REDIS_URL=redis://localhost:6379 | |
| CACHE_TTL_SECONDS=3600 # 1 hour | |
| ``` | |
| ### Evaluation | |
| ```env | |
| EVAL_DB_PATH=/data/storage/eval.db | |
| ``` | |
| --- | |
| ## π Project Structure | |
| ``` | |
| cortex/ | |
| βββ api/ # FastAPI REST endpoints | |
| β βββ main.py # App initialization, endpoints | |
| β βββ schemas.py # Request/response Pydantic models | |
| β | |
| βββ ingestion/ # Document processing pipeline | |
| β βββ pipeline.py # Orchestration | |
| β βββ document_loader.py # PDF/HTML/TXT parsing | |
| β βββ chunker.py # Semantic chunking | |
| β βββ __init__.py | |
| β | |
| βββ retrieval/ # Multi-strategy retrieval | |
| β βββ orchestrator.py # Coordinate retrieval strategies | |
| β βββ dense.py # Milvus vector search | |
| β βββ bm25.py # Keyword search index | |
| β βββ embedder.py # HuggingFace embedding model | |
| β βββ router.py # Query routing logic | |
| β βββ fusion.py # Result fusion & reranking | |
| β βββ graph_builder.py # Build knowledge graphs | |
| β βββ graph_retriever.py # Entity-based retrieval | |
| β βββ relation_extractors.py # REBEL + LLM extractors | |
| β βββ cache.py # Redis caching wrapper | |
| β βββ __init__.py | |
| β | |
| βββ generation/ # LLM generation & CRAG | |
| β βββ generator.py # Multi-provider LLM wrapper | |
| β βββ crag.py # CRAG gate logic | |
| β βββ __init__.py | |
| β | |
| βββ evaluation/ # RAG evaluation metrics | |
| β βββ ragas_eval.py # RAGAS evaluator | |
| β βββ store.py # Evaluation database | |
| β βββ __init__.py | |
| β | |
| βββ ui/ # Streamlit frontend | |
| β βββ app.py # Main UI | |
| β βββ static/ # (Optional) HTML/CSS/JS | |
| β | |
| βββ data/ # Data storage | |
| β βββ documents/ # Input documents | |
| β βββ storage/ # Persistent storage | |
| β β βββ knowledge_graph.json | |
| β β βββ bm25_index.pkl | |
| β β βββ uploads/ | |
| β βββ synthetic_knowledge_items.txt | |
| β | |
| βββ config.py # Configuration & settings | |
| βββ requirements.txt # Python dependencies | |
| βββ Dockerfile # Docker image build | |
| βββ docker-compose.yml # Multi-container orchestration | |
| βββ test.py # Test suite | |
| βββ README.md # This file | |
| ``` | |
| --- | |
| ## π³ Docker & Deployment | |
| ### Docker Compose Quick Deploy | |
| ```bash | |
| # Start all services | |
| docker-compose up -d | |
| # View logs | |
| docker-compose logs -f api | |
| # Stop services | |
| docker-compose down | |
| ``` | |
| **Services:** | |
| - `milvus` β Vector database (port 19530) | |
| - `redis` β Caching layer (port 6379) | |
| - `api` β FastAPI backend (port 8000) | |
| - `ui` β Streamlit frontend (port 8501) | |
| ### Environment Variables in Compose | |
| Edit `docker-compose.yml` to customize: | |
| ```yaml | |
| services: | |
| api: | |
| environment: | |
| - GROQ_API_KEY=${GROQ_API_KEY} | |
| - GROQ_MODEL=llama-3.3-70b-versatile | |
| - MILVUS_HOST=milvus | |
| - REDIS_URL=redis://redis:6379 | |
| - GRAPH_EXTRACTOR=llm-filtered | |
| ``` | |
| ### Production Deployment | |
| For production, consider: | |
| 1. **Use HNSW index** instead of IVF_FLAT for better recall: | |
| ```env | |
| MILVUS_INDEX_TYPE=HNSW | |
| ``` | |
| 2. **Enable caching** for frequently asked questions: | |
| ```env | |
| REDIS_URL=redis://redis-prod:6379 | |
| ``` | |
| 3. **Use stronger embedding model** for higher quality: | |
| ```env | |
| EMBED_MODEL_NAME=BAAI/bge-base-en-v1.5 # 768-dim, better quality | |
| ``` | |
| 4. **Configure CRAG** for reliability: | |
| ```env | |
| CRAG_ENABLED=true | |
| CRAG_RELEVANCE_THRESHOLD=0.6 | |
| TAVILY_API_KEY=your_key | |
| ``` | |
| --- | |
| ## π Workflow Examples | |
| ### Example 1: Legal Document Q&A | |
| ```bash | |
| # 1. Ingest legal documents | |
| curl -X POST "http://localhost:8000/ingest" \ | |
| -H "Content-Type: application/json" \ | |
| -d '{ | |
| "mode": "directory", | |
| "path": "/data/legal_documents" | |
| }' | |
| # 2. Query with graph enabled for relation extraction | |
| curl -X POST "http://localhost:8000/query" \ | |
| -H "Content-Type: application/json" \ | |
| -d '{ | |
| "query": "What are the penalties for breach of contract?", | |
| "provider": "groq", | |
| "graph": true, | |
| "crag": true | |
| }' | |
| ``` | |
| ### Example 2: Research Paper Analysis | |
| ```bash | |
| # Ingest PDF papers | |
| python -c " | |
| from ingestion.pipeline import IngestionPipeline | |
| from retrieval.embedder import Embedder | |
| from retrieval.dense import MilvusStore | |
| embedder = Embedder() | |
| store = MilvusStore(embedder=embedder) | |
| pipeline = IngestionPipeline(embedder=embedder, store=store, bm25=None) | |
| pipeline.ingest('/data/papers', mode='pdf') | |
| " | |
| # Query for specific findings | |
| curl -X POST "http://localhost:8000/query/stream" \ | |
| -H "Content-Type: application/json" \ | |
| -d '{ | |
| "query": "What are the key findings about transformer performance?", | |
| "model": "gpt-4o" | |
| }' | |
| ``` | |
| ### Example 3: Customer Support Bot | |
| ```bash | |
| # 1. Ingest FAQ and documentation | |
| # 2. Set up CRAG with relevant threshold | |
| # 3. Route low-confidence queries to web search | |
| CRAG_RELEVANCE_THRESHOLD=0.6 | |
| TAVILY_API_KEY=your_key | |
| ``` | |
| --- | |
| ## π Advanced Features | |
| ### Knowledge Graph Extraction | |
| Three modes available: | |
| | Mode | Backend | Speed | Quality | Cost | | |
| |------|---------|-------|---------|------| | |
| | `rebel` | Local REBEL model | Fast | Good | Free | | |
| | `llm` | LLM (Groq/OpenAI) | Slower | Excellent | $$ | | |
| | `rebel-filtered` | REBEL + entity filtering | Fast | Good | Free | | |
| | `llm-filtered` | LLM + entity filtering | Slower | Excellent | $$ | | |
| Switch via config: | |
| ```env | |
| GRAPH_EXTRACTOR=llm-filtered | |
| ``` | |
| ### CRAG (Consistency-based RAG) | |
| Automatically: | |
| 1. Evaluates retrieval confidence | |
| 2. Assigns relevance grade (Correct/Partially-Correct/Missing) | |
| 3. Supplements low-confidence with web search via Tavily | |
| ```python | |
| from generation.crag import CRAGGate | |
| crag = CRAGGate() | |
| response = crag.evaluate(query, context, answer) | |
| # Returns: grade, supplemental_docs | |
| ``` | |
| ### Evaluation & Metrics | |
| RAGAS-based evaluation: | |
| ```python | |
| from evaluation.ragas_eval import RAGASEvaluator | |
| from evaluation.store import EvalStore | |
| evaluator = RAGASEvaluator(store=EvalStore()) | |
| metrics = evaluator.evaluate(query, context, answer) | |
| # Returns: answer_relevance, faithfulness, context_precision | |
| ``` | |
| ### Caching Strategy | |
| ```python | |
| from retrieval.cache import CachedRetriever | |
| retriever = CachedRetriever(base_retriever) | |
| # First call: 1000ms (database query) | |
| # Second call: 5ms (Redis cache hit, TTL: 1 hour) | |
| results = retriever.retrieve("machine learning basics") | |
| ``` | |
| --- | |
| ## βοΈ Performance Tuning | |
| ### For Speed | |
| ```env | |
| # Smaller embedding model | |
| EMBED_MODEL_NAME=BAAI/bge-small-en-v1.5 | |
| # Smaller chunks | |
| CHUNK_SIZE_TOKENS=128 | |
| PARENT_CHUNK_SIZE_TOKENS=512 | |
| # Faster index | |
| MILVUS_INDEX_TYPE=IVF_FLAT | |
| MILVUS_NPROBE=8 # Lower = faster | |
| # Enable cache | |
| REDIS_URL=redis://localhost:6379 | |
| # Fewer LLM tokens | |
| GROQ_MAX_TOKENS=512 | |
| ``` | |
| ### For Quality | |
| ```env | |
| # Larger embedding model | |
| EMBED_MODEL_NAME=BAAI/bge-base-en-v1.5 | |
| # Optimal chunks | |
| CHUNK_SIZE_TOKENS=512 | |
| PARENT_CHUNK_SIZE_TOKENS=2048 | |
| # More precise index | |
| MILVUS_INDEX_TYPE=HNSW | |
| MILVUS_NPROBE=32 | |
| # Better LLM | |
| GROQ_MODEL=llama-3.3-70b-versatile | |
| # Enable CRAG | |
| CRAG_ENABLED=true | |
| ``` | |
| --- | |
| ## π Troubleshooting | |
| ### Milvus Connection Failed | |
| ```bash | |
| # Check if Milvus is running | |
| curl http://localhost:19530/healthz | |
| # Restart Milvus | |
| docker-compose restart milvus | |
| # Verify in settings | |
| python -c "from config import get_settings; print(get_settings().milvus_host)" | |
| ``` | |
| ### Low Retrieval Quality | |
| 1. **Check chunk quality:** | |
| ```python | |
| from ingestion.chunker import SemanticChunker | |
| chunker = SemanticChunker() | |
| chunks = chunker.chunk("your document text") | |
| print([c.text for c in chunks[:3]]) | |
| ``` | |
| 2. **Verify embeddings:** | |
| ```python | |
| from retrieval.embedder import Embedder | |
| embedder = Embedder() | |
| emb = embedder.embed("test query") | |
| print(f"Embedding dim: {len(emb)}, sample: {emb[:5]}") | |
| ``` | |
| 3. **Enable CRAG** for automatic augmentation: | |
| ```env | |
| CRAG_ENABLED=true | |
| ``` | |
| ### Slow Response Times | |
| 1. Check cache hit rate | |
| 2. Reduce `MILVUS_NPROBE` | |
| 3. Use streaming endpoint (`/query/stream`) | |
| 4. Enable Redis caching | |
| ### Out of Memory | |
| ```env | |
| # Reduce batch sizes | |
| EMBED_BATCH_SIZE=16 | |
| # Reduce chunk sizes | |
| CHUNK_SIZE_TOKENS=128 | |
| # Switch to CPU if using GPU | |
| EMBED_DEVICE=cpu | |
| ``` | |
| --- | |
| ## π Monitoring & Evaluation | |
| ### Health Check | |
| ```bash | |
| curl http://localhost:8000/health | jq . | |
| ``` | |
| ### Collection Statistics | |
| ```python | |
| from retrieval.dense import MilvusStore | |
| from retrieval.embedder import Embedder | |
| store = MilvusStore(embedder=Embedder()) | |
| stats = store.get_stats() | |
| print(f"Documents: {stats['collection_count']}") | |
| ``` | |
| ### Query Evaluation | |
| ```python | |
| from evaluation.ragas_eval import RAGASEvaluator | |
| from evaluation.store import EvalStore | |
| evaluator = RAGASEvaluator(store=EvalStore(db_path="/data/storage/eval.db")) | |
| metrics = evaluator.evaluate(query, context, answer) | |
| print(f"Answer Relevance: {metrics['answer_relevance']:.2f}") | |
| print(f"Faithfulness: {metrics['faithfulness']:.2f}") | |
| print(f"Context Precision: {metrics['context_precision']:.2f}") | |
| ``` | |
| --- | |
| ## π€ Contributing | |
| Contributions welcome! Areas for enhancement: | |
| - [ ] Multi-language support | |
| - [ ] Fine-tuned domain-specific embeddings | |
| - [ ] Advanced reranking strategies | |
| - [ ] GraphQL API | |
| - [ ] Persistent trace logging | |
| - [ ] A/B testing framework | |
| --- | |
| ## π License | |
| MIT License β see LICENSE file for details | |
| --- | |
| ## π Resources | |
| - [Milvus Documentation](https://milvus.io/docs) | |
| - [FastAPI Guide](https://fastapi.tiangolo.com/) | |
| - [RAGAS Evaluation Framework](https://github.com/explorerx3/ragas) | |
| - [Groq API Reference](https://console.groq.com/docs/api-reference) | |
| - [CRAG Paper](https://arxiv.org/abs/2401.15884) | |
| --- | |
| **Questions?** Open an issue on GitHub or check the documentation. | |
| source .venv/bin/activate | |
| pip install -r requirements.txt | |
| python -m nltk.downloader punkt | |
| python -m spacy download en_core_web_sm | |
| ``` | |
| ### 2. Configure | |
| ```bash | |
| cp .env.example .env | |
| # Edit .env β set GROQ_API_KEY at minimum | |
| ``` | |
| Get a free Groq API key at https://console.groq.com | |
| ### 3. Start Milvus | |
| ```bash | |
| docker-compose up -d | |
| # Wait ~30s for Milvus to be healthy | |
| docker-compose ps # all three services should show "healthy" | |
| ``` | |
| ### 4. Ingest documents | |
| ```bash | |
| mkdir -p data/documents | |
| # Copy PDFs / HTML / TXT files into data/documents/ | |
| python -m ingestion.pipeline data/documents | |
| ``` | |
| Or use the CLI: | |
| ```bash | |
| python ingestion/pipeline.py data/documents | |
| python ingestion/pipeline.py data/documents/paper.pdf | |
| ``` | |
| ### 5. Start the API | |
| ```bash | |
| uvicorn api.main:app --reload --port 8000 | |
| ``` | |
| ### 6. Start the UI | |
| ```bash | |
| streamlit run ui/app.py | |
| ``` | |
| Open http://localhost:8501 in your browser. | |
| --- | |
| ## API endpoints | |
| | Method | Path | Description | | |
| |--------|------|-------------| | |
| | GET | `/health` | Component health check | | |
| | POST | `/ingest` | Trigger ingestion pipeline | | |
| | POST | `/query` | Blocking query (full JSON) | | |
| | POST | `/query/stream` | Streaming query (SSE) | | |
| ### Example β blocking query | |
| ```bash | |
| curl -X POST http://localhost:8000/query \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"query": "What is attention in transformers?", "top_k": 5}' | |
| ``` | |
| ### Example β streaming query | |
| ```bash | |
| curl -X POST http://localhost:8000/query/stream \ | |
| -H "Content-Type: application/json" \ | |
| -d '{"query": "Explain PagedAttention", "stream": true}' | |
| ``` | |
| --- | |
| ## Key design decisions | |
| ### Semantic chunking | |
| Fixed-size chunking (e.g. 1000 chars with 200 overlap) splits mid-sentence | |
| and mid-concept. Semantic chunking detects topic boundaries using cosine | |
| similarity between consecutive sentence embeddings, producing chunks that | |
| align with natural concept transitions. Combined with a fallback on token | |
| count (child_max = 256 tokens), chunks are both semantically coherent and | |
| bounded in size. | |
| ### Parent-child hierarchy | |
| The child chunk (β256 tokens) is what gets embedded and indexed β small, | |
| precise, high-relevance. When a child chunk is retrieved, its parent chunk | |
| (β1024 tokens, centred on the child) is what goes into the LLM context. | |
| This separates the **retrieval granularity** from the **generation context | |
| width**, giving you the precision of small chunks with the coherence of | |
| large ones. | |
| ### BGE query prefix | |
| `BAAI/bge-small-en-v1.5` is trained to expect a task-specific prefix on | |
| query strings for retrieval tasks: | |
| `"Represent this sentence for searching relevant passages: <query>"` | |
| Documents are embedded as-is. Skipping this prefix typically costs 3-5 | |
| points on retrieval benchmarks. | |
| --- | |
| ## Phase roadmap | |
| | Phase | Status | What's added | | |
| |-------|--------|--------------| | |
| | 1 | β Done | Dense RAG, semantic chunking, parent-child, streaming UI | | |
| | 2 | β Done | BM25 sparse, query router, RRF fusion, cross-encoder reranking | | |
| | 3 | β Done | GraphRAG (spaCy NER + NetworkX), CRAG gate, web fallback | | |
| | 4 | β Done | RAGAS eval harness, Redis cache, evaluation dashboard | | |