Bhaskar Ram commited on
Commit
b1a3dce
·
0 Parent(s):

feat: Kerdos AI RAG API v1.0

Browse files
Files changed (9) hide show
  1. .env.example +19 -0
  2. .gitignore +41 -0
  3. Dockerfile +35 -0
  4. README.md +119 -0
  5. api.py +366 -0
  6. models.py +73 -0
  7. rag_core.py +313 -0
  8. requirements.txt +27 -0
  9. sessions.py +102 -0
.env.example ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # ─── Kerdos AI RAG API — Environment Variables ───────────────────────────────
2
+
3
+ # Your Hugging Face API token (Write access required for Llama 3.1)
4
+ # Get yours at: https://huggingface.co/settings/tokens
5
+ # You must also accept the Llama 3.1 license:
6
+ # https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
7
+ HF_TOKEN=hf_your_token_here
8
+
9
+ # Session time-to-live in minutes (default: 60)
10
+ SESSION_TTL_MINUTES=60
11
+
12
+ # Maximum file size for uploads in megabytes (default: 50)
13
+ MAX_UPLOAD_MB=50
14
+
15
+ # Server bind address
16
+ HOST=0.0.0.0
17
+
18
+ # Server port
19
+ PORT=8000
.gitignore ADDED
@@ -0,0 +1,41 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ *.pyo
5
+ *.pyd
6
+ .Python
7
+ *.egg-info/
8
+ dist/
9
+ build/
10
+ *.egg
11
+
12
+ # Env
13
+ .env
14
+ *.env.local
15
+
16
+ # Test artifacts
17
+ .pytest_cache/
18
+ .coverage
19
+ htmlcov/
20
+
21
+ # IDEs
22
+ .vscode/
23
+ .idea/
24
+ *.suo
25
+ *.user
26
+
27
+ # OS
28
+ .DS_Store
29
+ Thumbs.db
30
+
31
+ # API test file (contains token)
32
+ api.txt
33
+
34
+ # Stray files from curl test commands
35
+ files-@*
36
+
37
+ # Sample doc (don't need in repo)
38
+ sample_doc.txt
39
+
40
+ # Uploaded files (never persisted, but just in case)
41
+ uploads/
Dockerfile ADDED
@@ -0,0 +1,35 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Read the doc: https://huggingface.co/docs/hub/spaces-sdks-docker
2
+ # Kerdos AI — Custom LLM RAG API
3
+
4
+ FROM python:3.11
5
+
6
+ # Create non-root user as HF Spaces recommends
7
+ RUN useradd -m -u 1000 user
8
+ USER user
9
+ ENV PATH="/home/user/.local/bin:$PATH"
10
+
11
+ WORKDIR /app
12
+
13
+ # Install OS-level dependency for faiss at runtime
14
+ # (must be done before switching to non-root, but faiss-cpu binary wheel
15
+ # includes its own libgomp so extra system libs aren't needed on py3.11-slim)
16
+
17
+ # Install Python dependencies first (Docker cache layer)
18
+ COPY --chown=user requirements.txt .
19
+ RUN pip install --no-cache-dir --upgrade pip \
20
+ && pip install --no-cache-dir -r requirements.txt
21
+
22
+ # Pre-download embedding model at build time (avoids cold-start delay)
23
+ RUN python -c "from sentence_transformers import SentenceTransformer; \
24
+ SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')"
25
+
26
+ # Copy application source
27
+ COPY --chown=user api.py models.py rag_core.py sessions.py ./
28
+
29
+ # HF Spaces required port
30
+ EXPOSE 7860
31
+
32
+ ENV HOST=0.0.0.0 \
33
+ PORT=7860
34
+
35
+ CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "7860"]
README.md ADDED
@@ -0,0 +1,119 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ title: Kerdos AI — Custom LLM RAG API
3
+ emoji: 🤖
4
+ colorFrom: blue
5
+ colorTo: indigo
6
+ sdk: docker
7
+ app_port: 7860
8
+ pinned: false
9
+ license: mit
10
+ tags:
11
+ - rag
12
+ - document-qa
13
+ - fastapi
14
+ - llama
15
+ - faiss
16
+ - nlp
17
+ - question-answering
18
+ - kerdos
19
+ - private-llm
20
+ - api
21
+ ---
22
+
23
+ # 🤖 Kerdos AI — Custom LLM RAG API
24
+
25
+ > **A REST API by [Kerdos Infrasoft Private Limited](https://kerdos.in)**
26
+ > Upload documents. Ask questions. Get answers — strictly grounded in your data.
27
+
28
+ ---
29
+
30
+ ## ✨ Features
31
+
32
+ | | |
33
+ | -------------------- | ---------------------------------------------------------- |
34
+ | 📄 **Multi-format** | PDF, DOCX, TXT, MD, CSV |
35
+ | 🧠 **LLM** | `meta-llama/Llama-3.1-8B-Instruct` via HF Inference Router |
36
+ | 🔒 **Grounded** | Answers only from your uploaded documents |
37
+ | 💬 **Multi-turn** | Conversation history per session |
38
+ | ⚡ **Fast** | `all-MiniLM-L6-v2` + FAISS in-memory |
39
+ | 🔑 **Session-based** | Each client gets an isolated FAISS index |
40
+
41
+ ---
42
+
43
+ ## 📡 API Reference
44
+
45
+ Interactive docs → `/docs` (Swagger UI)
46
+
47
+ | Method | Path | Description |
48
+ | -------- | -------------------------- | ----------------------------------- |
49
+ | `POST` | `/sessions` | Create a session → get `session_id` |
50
+ | `GET` | `/sessions/{id}` | Session status |
51
+ | `DELETE` | `/sessions/{id}` | Delete session |
52
+ | `POST` | `/sessions/{id}/documents` | Upload & index files |
53
+ | `POST` | `/sessions/{id}/chat` | Ask a question |
54
+ | `DELETE` | `/sessions/{id}/history` | Clear chat history |
55
+ | `GET` | `/health` | Health check |
56
+
57
+ ---
58
+
59
+ ## 🔁 Typical Workflow
60
+
61
+ ```bash
62
+ BASE=https://kerdosdotio-kerdos-llm-rag-api.hf.space
63
+
64
+ # 1. Create session
65
+ curl -X POST $BASE/sessions
66
+
67
+ # 2. Upload a document
68
+ curl -X POST "$BASE/sessions/{session_id}/documents" \
69
+ -F "files=@your_doc.pdf"
70
+
71
+ # 3. Ask a question
72
+ curl -X POST "$BASE/sessions/{session_id}/chat" \
73
+ -H "Content-Type: application/json" \
74
+ -d '{"question": "Summarise this document", "hf_token": "hf_..."}'
75
+ ```
76
+
77
+ ---
78
+
79
+ ## ⚙️ Environment / Secrets
80
+
81
+ Set these in **Settings → Variables and secrets** of this Space:
82
+
83
+ | Secret | Description |
84
+ | --------------------- | ------------------------------------------------------------------ |
85
+ | `HF_TOKEN` | Your HuggingFace token (Write access + Llama 3.1 licence accepted) |
86
+ | `SESSION_TTL_MINUTES` | Session expiry (default: 60) |
87
+ | `MAX_UPLOAD_MB` | Max upload size in MB (default: 50) |
88
+
89
+ ---
90
+
91
+ ## 🏗️ Architecture
92
+
93
+ ```
94
+ FastAPI (api.py)
95
+ ├── SessionStore — UUID sessions, TTL, per-session lock
96
+ └── RAGSession
97
+ ├── parse_file() — PDF/DOCX/TXT/CSV
98
+ ├── chunk_text() — 512-char chunks, 64 overlap
99
+ ├── all-MiniLM-L6-v2 — embeddings
100
+ ├── FAISS — in-memory vector search
101
+ └── call_llm() — HF Router → Llama 3.1 8B
102
+ ```
103
+
104
+ ---
105
+
106
+ ## 💼 Enterprise Edition
107
+
108
+ Interested in **private, on-premise** deployment?
109
+
110
+ - 🔒 Private LLM Hosting
111
+ - 🎛️ Custom Model Fine-tuning
112
+ - 🛡️ Data Privacy Guarantees
113
+ - 🏷️ White-label Deployments
114
+
115
+ 📧 [partnership@kerdos.in](mailto:partnership@kerdos.in) | 🌐 [kerdos.in/contact](https://kerdos.in/contact)
116
+
117
+ ---
118
+
119
+ _© 2024–2025 Kerdos Infrasoft Private Limited | Bengaluru, Karnataka, India_
api.py ADDED
@@ -0,0 +1,366 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Kerdos AI — Custom LLM Chat REST API
3
+ FastAPI application exposing the full RAG pipeline as HTTP endpoints.
4
+ """
5
+
6
+ from __future__ import annotations
7
+
8
+ import asyncio
9
+ import logging
10
+ import os
11
+ import time
12
+ from contextlib import asynccontextmanager
13
+
14
+ from dotenv import load_dotenv
15
+ from fastapi import FastAPI, File, HTTPException, Path, UploadFile, status
16
+ from fastapi.middleware.cors import CORSMiddleware
17
+ from fastapi.responses import JSONResponse
18
+
19
+ from models import (
20
+ ChatRequest,
21
+ ChatResponse,
22
+ HealthResponse,
23
+ IndexResponse,
24
+ MessageResponse,
25
+ SessionCreateResponse,
26
+ SessionStatusResponse,
27
+ Source,
28
+ )
29
+ from rag_core import call_llm
30
+ from sessions import store
31
+
32
load_dotenv()  # pull HF_TOKEN / TTL / upload limits from a local .env when present

# Process-wide logging configuration; every module logger inherits this format.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)-8s | %(name)s — %(message)s",
)
logger = logging.getLogger("kerdos.api")

_START_TIME = time.time()  # recorded at import; /health reports uptime relative to this
API_VERSION = "1.0.0"
42
+
43
+
44
+ # ── Lifespan: background cleanup task ────────────────────────────────────────
45
+
46
@asynccontextmanager
async def lifespan(app: FastAPI):
    """Run a periodic purge of expired sessions for the app's lifetime.

    A background task wakes every 10 minutes and asks the session store to
    drop entries whose TTL has elapsed; the task is cancelled on shutdown.
    """

    async def _purge_forever():
        while True:
            await asyncio.sleep(600)  # 10-minute cadence
            purged = store.cleanup_expired()
            if purged:
                logger.info(f"Cleaned up {purged} expired session(s).")

    cleanup_task = asyncio.create_task(_purge_forever())
    logger.info("Kerdos AI RAG API started.")
    yield
    cleanup_task.cancel()
    logger.info("Kerdos AI RAG API shutting down.")
61
+
62
+
63
+ # ── App ───────────────────────────────────────────────────────────────────────
64
+
65
# Public FastAPI application object; this metadata feeds the generated /docs page.
app = FastAPI(
    title="Kerdos AI — Custom LLM RAG API",
    description=(
        "REST API for the Kerdos AI document Q&A system.\n\n"
        "Upload your documents, index them, and ask questions — "
        "answers are strictly grounded in your uploaded content.\n\n"
        "**LLM**: `meta-llama/Llama-3.1-8B-Instruct` via HuggingFace Inference API \n"
        "**Embeddings**: `sentence-transformers/all-MiniLM-L6-v2` \n"
        "**Vector Store**: FAISS (in-memory, per-session) \n\n"
        "© 2024–2025 [Kerdos Infrasoft Private Limited](https://kerdos.in)"
    ),
    version=API_VERSION,
    contact={
        "name": "Kerdos Infrasoft",
        "url": "https://kerdos.in/contact",
        "email": "partnership@kerdos.in",
    },
    license_info={"name": "MIT"},
    lifespan=lifespan,  # starts the background TTL-cleanup task
)

# Wide-open CORS for a public demo API.
# NOTE(review): allow_origins=["*"] combined with allow_credentials=True is
# rejected by browsers for credentialed requests — confirm credentials are
# actually needed here.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Per-file upload cap in bytes, configurable via the MAX_UPLOAD_MB env var.
MAX_UPLOAD_BYTES = int(os.getenv("MAX_UPLOAD_MB", "50")) * 1024 * 1024
# File extensions accepted by the /sessions/{id}/documents endpoint.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt", ".md", ".csv"}
96
+
97
+
98
+ # ── Helpers ───────────────────────────────────────────────────────────────────
99
+
100
def _get_session_or_404(session_id: str):
    """Fetch the store entry for *session_id*, translating a miss into HTTP 404."""
    try:
        entry = store.get(session_id)
    except KeyError:
        raise HTTPException(
            status_code=status.HTTP_404_NOT_FOUND,
            detail=f"Session '{session_id}' not found or has expired.",
        )
    return entry
108
+
109
+
110
+ # ── Routes ────────────────────────────────────────────────────────────────────
111
+
112
@app.get(
    "/",
    tags=["Info"],
    summary="API root",
    response_model=dict,
)
async def root():
    """Landing payload pointing clients at the docs and health endpoints."""
    return dict(
        name="Kerdos AI RAG API",
        version=API_VERSION,
        docs="/docs",
        health="/health",
        website="https://kerdos.in",
    )
126
+
127
+
128
@app.get(
    "/health",
    tags=["Info"],
    summary="Health check",
    response_model=HealthResponse,
)
async def health():
    """Liveness probe: reports version, uptime, and active session count."""
    uptime = round(time.time() - _START_TIME, 2)
    return HealthResponse(
        status="ok",
        version=API_VERSION,
        uptime_seconds=uptime,
        active_sessions=store.active_count,
    )
141
+
142
+
143
+ # ── Sessions ──────────────────────────────────────────────────────────────────
144
+
145
@app.post(
    "/sessions",
    tags=["Sessions"],
    summary="Create a new RAG session",
    response_model=SessionCreateResponse,
    status_code=status.HTTP_201_CREATED,
)
async def create_session():
    """
    Creates a new isolated session with its own FAISS index and conversation history.
    Returns a `session_id` that must be passed to all subsequent requests.
    """
    new_id = store.create()
    logger.info(f"Session created: {new_id}")
    return SessionCreateResponse(session_id=new_id)
160
+
161
+
162
@app.get(
    "/sessions/{session_id}",
    tags=["Sessions"],
    summary="Get session status",
    response_model=SessionStatusResponse,
)
async def get_session(session_id: str = Path(..., description="Session ID")):
    """Report document/chunk/history counts plus creation and expiry timestamps."""
    rag, _ = _get_session_or_404(session_id)
    meta = store.get_meta(session_id)
    return SessionStatusResponse(
        session_id=session_id,
        created_at=meta["created_at"],
        expires_at=meta["expires_at"],
        document_count=rag.document_count,
        chunk_count=rag.chunk_count,
        history_length=len(rag.history),
    )
180
+
181
+
182
@app.delete(
    "/sessions/{session_id}",
    tags=["Sessions"],
    summary="Delete a session",
    response_model=MessageResponse,
)
async def delete_session(session_id: str = Path(...)):
    """Immediately removes the session and frees all in-memory resources."""
    if not store.delete(session_id):
        raise HTTPException(status_code=404, detail=f"Session '{session_id}' not found.")
    logger.info(f"Session deleted: {session_id}")
    return MessageResponse(message=f"Session '{session_id}' deleted.")
195
+
196
+
197
+ # ── Documents ─────────────────────────────────────────────────────────────────
198
+
199
@app.post(
    "/sessions/{session_id}/documents",
    tags=["Documents"],
    summary="Upload and index documents",
    response_model=IndexResponse,
)
async def upload_documents(
    session_id: str = Path(..., description="Session ID"),
    files: list[UploadFile] = File(..., description="Files to index (PDF, DOCX, TXT, MD, CSV)"),
):
    """
    Upload one or more files to the session's FAISS index.

    Supported formats: PDF, DOCX, TXT, MD, CSV.
    Can be called multiple times to add more documents to an existing index.

    Error responses:
        404 — session unknown or expired (via _get_session_or_404)
        413 — at least one file exceeds the MAX_UPLOAD_MB limit
        415 — a file has an unsupported extension
        400 — no valid files remained after filtering
    """
    rag, lock = _get_session_or_404(session_id)

    file_pairs: list[tuple[str, bytes]] = []
    oversized: list[str] = []

    for upload in files:
        content = await upload.read()
        if len(content) > MAX_UPLOAD_BYTES:
            oversized.append(upload.filename or "unknown")
            continue
        # os.path.splitext replaces the original per-iteration
        # `from pathlib import Path as P` (pathlib.Path clashes with
        # fastapi.Path imported at module level).
        ext = os.path.splitext(upload.filename or "")[1].lower()
        if ext not in ALLOWED_EXTENSIONS:
            raise HTTPException(
                status_code=status.HTTP_415_UNSUPPORTED_MEDIA_TYPE,
                detail=f"File '{upload.filename}' has unsupported type '{ext}'. "
                f"Allowed: {', '.join(sorted(ALLOWED_EXTENSIONS))}",
            )
        file_pairs.append((upload.filename or "unnamed", content))

    if oversized:
        raise HTTPException(
            status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE,
            detail=f"Files exceed {os.getenv('MAX_UPLOAD_MB', '50')} MB limit: {oversized}",
        )

    if not file_pairs:
        raise HTTPException(
            status_code=status.HTTP_400_BAD_REQUEST,
            detail="No valid files provided.",
        )

    # Index in a thread so we don't block the event loop (FAISS + embeddings are
    # CPU-bound). get_running_loop() is the non-deprecated form inside a coroutine.
    loop = asyncio.get_running_loop()

    def _index():
        with lock:
            return rag.index_documents(file_pairs)

    indexed, failed = await loop.run_in_executor(None, _index)

    logger.info(f"[{session_id}] Indexed {len(indexed)} file(s), failed: {len(failed)}")

    return IndexResponse(
        session_id=session_id,
        indexed_files=indexed,
        failed_files=failed,
        chunk_count=rag.chunk_count,
    )
263
+
264
+
265
+ # ── Chat ──────────────────────────────────────────────────────────────────────
266
+
267
@app.post(
    "/sessions/{session_id}/chat",
    tags=["Chat"],
    summary="Ask a question about your documents",
    response_model=ChatResponse,
)
async def chat(
    session_id: str = Path(..., description="Session ID"),
    body: ChatRequest = ...,
):
    """
    Retrieves the most relevant document chunks and uses Llama 3.1 8B to generate
    an answer strictly grounded in those chunks.

    **Requires a HuggingFace token** with Write access and acceptance of the
    [Llama 3.1 license](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).
    """
    rag, lock = _get_session_or_404(session_id)

    loop = asyncio.get_event_loop()

    def _run_rag():
        # Executes on a worker thread: retrieval and the HF API call are
        # blocking, so they must stay off the event loop. HTTPExceptions raised
        # here propagate back through the awaited executor future.
        with lock:
            # 1. Retrieve relevant chunks
            try:
                top_chunks = rag.query(body.question, top_k=body.top_k)
            except RuntimeError as exc:
                # rag.query raises RuntimeError when no documents are indexed yet.
                raise HTTPException(
                    status_code=status.HTTP_400_BAD_REQUEST,
                    detail=str(exc),
                )

            # 2. Call LLM
            try:
                answer = call_llm(
                    context_chunks=top_chunks,
                    question=body.question,
                    history=rag.history,
                    hf_token=body.hf_token,
                    temperature=body.temperature,
                    max_new_tokens=body.max_new_tokens,
                )
            except ValueError as exc:
                # call_llm maps invalid/forbidden HF tokens to ValueError → 401.
                raise HTTPException(status_code=status.HTTP_401_UNAUTHORIZED, detail=str(exc))
            except RuntimeError as exc:
                # Upstream HF API / network failures surface as RuntimeError → 502.
                raise HTTPException(status_code=status.HTTP_502_BAD_GATEWAY, detail=str(exc))

            # 3. Persist to history
            rag.add_turn(body.question, answer)

            # 4. Build source citations (200-char excerpt per retrieved chunk)
            sources = [
                Source(
                    filename=c.filename,
                    chunk_index=c.chunk_index,
                    excerpt=c.text[:200] + ("…" if len(c.text) > 200 else ""),
                )
                for c in top_chunks
            ]

            return answer, sources

    answer, sources = await loop.run_in_executor(None, _run_rag)

    logger.info(f"[{session_id}] Q: {body.question[:60]}…")

    return ChatResponse(
        session_id=session_id,
        question=body.question,
        answer=answer,
        sources=sources,
    )
339
+
340
+
341
@app.delete(
    "/sessions/{session_id}/history",
    tags=["Chat"],
    summary="Clear conversation history",
    response_model=MessageResponse,
)
async def clear_history(session_id: str = Path(...)):
    """Drop the session's multi-turn history while keeping the FAISS index intact."""
    rag, session_lock = _get_session_or_404(session_id)
    with session_lock:
        rag.clear_history()
    return MessageResponse(message="Conversation history cleared.")
353
+
354
+
355
+ # ── Entry point ───────────────────────────────────────────────────────────────
356
+
357
if __name__ == "__main__":
    import uvicorn

    # Local development runner; in the Docker image the CMD launches uvicorn
    # directly on port 7860 instead.
    uvicorn.run(
        "api:app",
        host=os.getenv("HOST", "0.0.0.0"),
        port=int(os.getenv("PORT", "8000")),
        reload=False,
        log_level="info",
    )
models.py ADDED
@@ -0,0 +1,73 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Pydantic request/response models for the Kerdos AI RAG API.
3
+ """
4
+
5
+ from __future__ import annotations
6
+
7
+ from typing import List, Optional
8
+ from pydantic import BaseModel, Field
9
+
10
+
11
+ # ─── Session ────────────────────────────────────────────────────────────────
12
+
13
class SessionCreateResponse(BaseModel):
    """Response body for POST /sessions."""
    session_id: str = Field(..., description="Unique session identifier")
    message: str = Field(default="Session created successfully")
16
+
17
+
18
class SessionStatusResponse(BaseModel):
    """Response body for GET /sessions/{id}: index counts and TTL timestamps."""
    session_id: str
    document_count: int = Field(..., description="Number of uploaded documents")
    chunk_count: int = Field(..., description="Number of indexed text chunks")
    history_length: int = Field(..., description="Number of turns in conversation history")
    # Timestamp strings produced by the session store's get_meta();
    # presumably ISO-8601 — confirm against sessions.py.
    created_at: str
    expires_at: str
25
+
26
+
27
+ # ─── Documents ──────────────────────────────────────────────────────────────
28
+
29
class IndexResponse(BaseModel):
    """Response body for POST /sessions/{id}/documents."""
    session_id: str
    indexed_files: List[str] = Field(..., description="Names of successfully indexed files")
    failed_files: List[str] = Field(default_factory=list, description="Files that failed to parse")
    chunk_count: int = Field(..., description="Total chunks in FAISS index")
    message: str = Field(default="Documents indexed successfully")
35
+
36
+
37
+ # ─── Chat ────────────────────────────────────────────────────────────────────
38
+
39
class Source(BaseModel):
    """One citation: which chunk of which file supported the answer."""
    filename: str
    chunk_index: int
    excerpt: str = Field(..., description="Short preview of the retrieved chunk")
43
+
44
+
45
class ChatRequest(BaseModel):
    """Request body for POST /sessions/{id}/chat; caller supplies their own HF token."""
    question: str = Field(..., min_length=1, description="The question to ask about your documents")
    hf_token: str = Field(..., description="Hugging Face API token (Write access required for Llama 3)")
    top_k: int = Field(default=5, ge=1, le=20, description="Number of chunks to retrieve")
    temperature: float = Field(default=0.3, ge=0.0, le=1.0)
    max_new_tokens: int = Field(default=512, ge=64, le=2048)
51
+
52
+
53
class ChatResponse(BaseModel):
    """Response body for POST /sessions/{id}/chat: answer plus its citations."""
    session_id: str
    question: str
    answer: str
    sources: List[Source] = Field(default_factory=list)
    # Fixed model identifier; the API does not support model selection.
    model: str = Field(default="meta-llama/Llama-3.1-8B-Instruct")
59
+
60
+
61
+ # ─── Health ──────────────────────────────────────────────────────────────────
62
+
63
class HealthResponse(BaseModel):
    """Response body for GET /health."""
    status: str = "ok"
    version: str
    uptime_seconds: float
    active_sessions: int
68
+
69
+
70
+ # ─── Generic ─────────────────────────────────────────────────────────────────
71
+
72
class MessageResponse(BaseModel):
    """Generic single-message acknowledgement body."""
    message: str
rag_core.py ADDED
@@ -0,0 +1,313 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Core RAG engine: document parsing, chunking, embedding, FAISS indexing, and LLM querying.
3
+ No Gradio dependency — pure Python, importable by the FastAPI layer.
4
+ """
5
+
6
+ from __future__ import annotations
7
+
8
+ import io
9
+ import logging
10
+ import textwrap
11
+ from dataclasses import dataclass, field
12
+ from pathlib import Path
13
+ from typing import List, Optional, Tuple
14
+
15
+ import numpy as np
16
+ import requests
17
+
18
+ logger = logging.getLogger(__name__)
19
+
20
+ # ──────────────────────────────────────────────────────────────────────────────
21
+ # Lazy imports (heavy libraries loaded only once at first use)
22
+ # ──────────────────────────────────────────────────────────────────────────────
23
+
24
_embedding_model = None  # cached SentenceTransformer instance (lazy-loaded on first use)
_faiss = None  # cached faiss module (lazy-imported on first use)
26
+
27
+
28
def _get_embedding_model():
    """Return the process-wide SentenceTransformer, loading it on first use.

    The import and model download/load are deferred so importing this module
    stays cheap; the instance is cached in the module-level
    ``_embedding_model`` global and shared by all sessions.
    """
    global _embedding_model
    if _embedding_model is None:
        from sentence_transformers import SentenceTransformer
        logger.info("Loading SentenceTransformer model…")
        _embedding_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
        logger.info("Embedding model loaded.")
    return _embedding_model
36
+
37
+
38
def _get_faiss():
    """Return the faiss module, importing it lazily on first call."""
    global _faiss
    if _faiss is None:
        import faiss as _faiss_module
        _faiss = _faiss_module
    return _faiss
44
+
45
+
46
+ # ──────────────────────────────────────────────────────────────────────────────
47
+ # Document parsing
48
+ # ──────────────────────────────────────────────────────────────────────────────
49
+
50
def _parse_pdf(data: bytes) -> str:
    """Extract plain text from PDF bytes using PyMuPDF (fitz).

    Pages are joined with blank lines. The document is closed explicitly so
    the underlying buffer is released promptly rather than waiting for GC
    (the original leaked the open Document).
    """
    import fitz  # PyMuPDF
    doc = fitz.open(stream=data, filetype="pdf")
    try:
        return "\n\n".join(page.get_text() for page in doc)
    finally:
        doc.close()
54
+
55
+
56
def _parse_docx(data: bytes) -> str:
    """Extract non-empty paragraph text from DOCX bytes via python-docx.

    Paragraphs are joined with blank lines; whitespace-only paragraphs
    are skipped.
    """
    from docx import Document
    doc = Document(io.BytesIO(data))
    return "\n\n".join(p.text for p in doc.paragraphs if p.text.strip())
60
+
61
+
62
+ def _parse_txt(data: bytes) -> str:
63
+ for enc in ("utf-8", "latin-1", "cp1252"):
64
+ try:
65
+ return data.decode(enc)
66
+ except UnicodeDecodeError:
67
+ continue
68
+ return data.decode("utf-8", errors="replace")
69
+
70
+
71
def _parse_csv(data: bytes) -> str:
    """Flatten CSV bytes into text: one line per row, cells joined by ', '."""
    import csv
    decoded = _parse_txt(data)
    reader = csv.reader(io.StringIO(decoded))
    return "\n".join(", ".join(row) for row in reader)
78
+
79
+
80
# Extension → parser dispatch table consumed by parse_file().
# Markdown is treated as plain text; CSV is flattened row-by-row.
PARSERS = {
    ".pdf": _parse_pdf,
    ".docx": _parse_docx,
    ".txt": _parse_txt,
    ".md": _parse_txt,
    ".csv": _parse_csv,
}
87
+
88
+
89
def parse_file(filename: str, data: bytes) -> str:
    """
    Dispatch *data* to the parser registered for *filename*'s extension.

    Returns the extracted plain-text content.
    Raises ValueError when the extension has no registered parser.
    """
    ext = Path(filename).suffix.lower()
    if ext not in PARSERS:
        raise ValueError(
            f"Unsupported file type '{ext}'. "
            f"Supported: {', '.join(PARSERS)}"
        )
    return PARSERS[ext](data)
102
+
103
+
104
+ # ──────────────────────────────────────────────────────────────────────────────
105
+ # Text chunking
106
+ # ──────────────────────────────────────────────────────────────────────────────
107
+
108
def chunk_text(
    text: str,
    chunk_size: int = 512,
    overlap: int = 64,
) -> List[str]:
    """Split *text* into overlapping fixed-size character chunks.

    Args:
        text: Source text; stripped of surrounding whitespace first.
        chunk_size: Maximum characters per chunk; must be positive.
        overlap: Characters shared between consecutive chunks; must satisfy
            ``0 <= overlap < chunk_size`` so the scan always advances.

    Returns:
        List of non-empty chunk strings; ``[]`` for blank input.

    Raises:
        ValueError: if chunk_size <= 0 or overlap is out of range — the
            original silently looped forever when overlap >= chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    text = text.strip()
    if not text:
        return []
    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end == len(text):
            break
        # Advance by chunk_size - overlap; guaranteed > 0 by the guard above.
        start = end - overlap
    return chunks
128
+
129
+
130
+ # ──────────────────────────────────────────────────────────────────────────────
131
+ # RAG Session
132
+ # ──────────────────────────────────────────────────────────────────────────────
133
+
134
@dataclass
class IndexedChunk:
    # One embedded text chunk. Its position in RAGSession.chunks corresponds
    # to its row in the FAISS index (chunks are appended in embedding order).
    text: str
    filename: str
    chunk_index: int  # global index inside this session's chunk list
139
+
140
+
141
@dataclass
class RAGSession:
    """
    Holds the FAISS vector index and conversation history for a single API session.
    Thread-safety is the responsibility of the caller (sessions.py uses a per-session lock).
    """
    chunks: List[IndexedChunk] = field(default_factory=list)
    history: List[Tuple[str, str]] = field(default_factory=list)  # [(user, assistant), …]
    document_names: List[str] = field(default_factory=list)
    # faiss.IndexFlatL2 — deliberately a plain class attribute (no annotation),
    # so it is NOT a dataclass field; each instance sets it on first indexing.
    _index = None

    # ── Public helpers ────────────────────────────────────────────────────────

    @property
    def document_count(self) -> int:
        """Number of distinct filenames indexed so far."""
        return len(self.document_names)

    @property
    def chunk_count(self) -> int:
        """Total chunks across all indexed documents."""
        return len(self.chunks)

    def index_documents(self, files: List[Tuple[str, bytes]]) -> Tuple[List[str], List[str]]:
        """
        Parse, chunk, and embed a list of (filename, bytes) pairs into the FAISS index.

        Files that fail to parse are skipped and reported; successfully parsed
        files are embedded in one batch. Returns (indexed_names, failed_names).
        """
        model = _get_embedding_model()
        faiss = _get_faiss()

        new_chunks: List[IndexedChunk] = []
        indexed: List[str] = []
        failed: List[str] = []

        for filename, data in files:
            try:
                text = parse_file(filename, data)
                raw_chunks = chunk_text(text)
                # Global chunk index continues across prior uploads in this session.
                start_idx = len(self.chunks) + len(new_chunks)
                for i, c in enumerate(raw_chunks):
                    new_chunks.append(IndexedChunk(
                        text=c,
                        filename=filename,
                        chunk_index=start_idx + i,
                    ))
                indexed.append(filename)
                if filename not in self.document_names:
                    self.document_names.append(filename)
                # Fix: log the actual filename (original logged the literal '(unknown)').
                logger.info(f"Indexed '{filename}': {len(raw_chunks)} chunks")
            except Exception as exc:
                # Fix: same — report which file failed.
                logger.warning(f"Failed to parse '{filename}': {exc}")
                failed.append(filename)

        if not new_chunks:
            return indexed, failed

        # Embed all new chunks in a single batched encode call.
        texts = [c.text for c in new_chunks]
        vectors = model.encode(texts, show_progress_bar=False).astype(np.float32)

        dim = vectors.shape[1]
        if self._index is None:
            self._index = faiss.IndexFlatL2(dim)

        self._index.add(vectors)
        self.chunks.extend(new_chunks)
        return indexed, failed

    def query(self, question: str, top_k: int = 5) -> List[IndexedChunk]:
        """
        Run a similarity search and return the most relevant chunks.

        Raises RuntimeError if no documents have been indexed yet (the API
        layer maps this to HTTP 400).
        """
        if self._index is None or not self.chunks:
            raise RuntimeError("No documents indexed. Upload documents first.")

        model = _get_embedding_model()
        q_vec = model.encode([question], show_progress_bar=False).astype(np.float32)
        k = min(top_k, len(self.chunks))
        _, indices = self._index.search(q_vec, k)

        # Guard against any out-of-range ids returned by FAISS.
        return [self.chunks[i] for i in indices[0] if i < len(self.chunks)]

    def add_turn(self, question: str, answer: str) -> None:
        """Append one (user, assistant) exchange to the conversation history."""
        self.history.append((question, answer))

    def clear_history(self) -> None:
        """Forget the conversation history; the FAISS index is untouched."""
        self.history.clear()
228
+
229
+
230
+ # ──────────────────────────────────────────────────────────────────────────────
231
+ # LLM call (HuggingFace Inference API)
232
+ # ──────────────────────────────────────────────────────────────────────────────
233
+
234
# OpenAI-compatible chat-completions endpoint of the HF Inference Router.
_HF_API_URL = "https://router.huggingface.co/v1/chat/completions"

# System prompt enforcing document-grounded answers; sent as the first
# message of every chat request.
_SYSTEM_PROMPT = textwrap.dedent("""\
    You are Kerdos AI, an expert document assistant.
    Answer ONLY from the provided document excerpts.
    If the answer is not in the excerpts, say:
    "I could not find this information in the uploaded documents."
    Be concise, factual, and cite which document your answer comes from.
""")
243
+
244
+
245
def call_llm(
    context_chunks: List[IndexedChunk],
    question: str,
    history: List[Tuple[str, str]],
    hf_token: str,
    temperature: float = 0.3,
    max_new_tokens: int = 512,
) -> str:
    """
    Build a chat prompt and call the HF Inference API.

    Args:
        context_chunks: Retrieved chunks injected into the user message.
        question: The current user question.
        history: Prior (user, assistant) turns; only the last 6 are sent.
        hf_token: Caller-supplied HuggingFace bearer token.
        temperature: Sampling temperature forwarded to the API.
        max_new_tokens: Forwarded as the API's `max_tokens`.

    Returns:
        The assistant's reply as a stripped string.

    Raises:
        ValueError: on HTTP 401/403 (bad token or missing Llama license) —
            mapped to HTTP 401 by the API layer.
        RuntimeError: on any other HTTP error or network failure —
            mapped to HTTP 502 by the API layer.
    """
    # Build context block: each chunk tagged with its source filename.
    context_parts = []
    for chunk in context_chunks:
        context_parts.append(
            f"[Source: {chunk.filename}]\n{chunk.text}"
        )
    context_text = "\n\n---\n\n".join(context_parts)

    # Build messages for the chat template
    messages = [{"role": "system", "content": _SYSTEM_PROMPT}]

    # Add recent history (last 6 turns to stay within context window)
    for user_msg, asst_msg in history[-6:]:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": asst_msg})

    # Current turn with injected context
    user_content = (
        f"Document excerpts:\n\n{context_text}\n\n"
        f"Question: {question}"
    )
    messages.append({"role": "user", "content": user_content})

    payload = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_new_tokens,
    }

    headers = {
        "Authorization": f"Bearer {hf_token}",
        "Content-Type": "application/json",
    }

    try:
        response = requests.post(
            _HF_API_URL,
            json=payload,
            headers=headers,
            timeout=120,
        )
        response.raise_for_status()
        data = response.json()
        # OpenAI-style response shape: first choice's message content.
        return data["choices"][0]["message"]["content"].strip()
    except requests.HTTPError as exc:
        status = exc.response.status_code
        if status == 401:
            raise ValueError("Invalid HuggingFace token. Please check your HF_TOKEN.") from exc
        if status == 403:
            raise ValueError(
                "Access denied. Your HF token needs 'Write' permission and you must accept "
                "the Llama 3.1 license at https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct"
            ) from exc
        raise RuntimeError(f"HF API error {status}: {exc.response.text}") from exc
    except requests.RequestException as exc:
        raise RuntimeError(f"Network error calling HF API: {exc}") from exc
requirements.txt ADDED
@@ -0,0 +1,27 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Kerdos AI RAG API — Python dependencies
2
+
3
+ # Web framework & server
4
+ fastapi>=0.110.0
5
+ uvicorn[standard]>=0.27.0
6
+ python-multipart>=0.0.9
7
+
8
+ # Data validation
9
+ pydantic>=2.0.0
10
+ pydantic-settings>=2.0.0
11
+
12
+ # AI / ML
13
+ sentence-transformers>=2.6.0
14
+ faiss-cpu>=1.7.4
15
+
16
+ # Document parsing
17
+ pymupdf>=1.23.0 # PDF via fitz
18
+ python-docx>=1.1.0 # DOCX
19
+
20
+ # HTTP client (for HF Inference API)
21
+ requests>=2.31.0
22
+
23
+ # Config
24
+ python-dotenv>=1.0.0
25
+
26
+ # Numerical
27
+ numpy>=1.24.0
sessions.py ADDED
@@ -0,0 +1,102 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Thread-safe in-memory session store with TTL-based expiry.
3
+ """
4
+
5
+ from __future__ import annotations
6
+
7
+ import threading
8
+ import uuid
9
+ from dataclasses import dataclass, field
10
+ from datetime import datetime, timedelta
11
+ from typing import Dict, Optional
12
+
13
+ from rag_core import RAGSession
14
+
15
+
16
+ @dataclass
17
+ class _SessionEntry:
18
+ session: RAGSession
19
+ lock: threading.Lock = field(default_factory=threading.Lock)
20
+ created_at: datetime = field(default_factory=datetime.utcnow)
21
+ expires_at: datetime = field(default_factory=datetime.utcnow) # set in __post_init__
22
+
23
+ def __post_init__(self):
24
+ # will be overwritten by SessionStore with the real TTL
25
+ pass
26
+
27
+
28
class SessionStore:
    """
    Global in-memory store for RAG sessions.

    A single registry-wide lock guards the session map itself; each entry
    carries its own lock so requests against different sessions can proceed
    concurrently. Expiry is sliding: every successful ``get`` pushes the
    deadline out by the configured TTL.
    """

    def __init__(self, ttl_minutes: int = 60):
        self._entries: Dict[str, _SessionEntry] = {}
        self._registry_lock = threading.Lock()
        self._ttl = timedelta(minutes=ttl_minutes)

    # ── Public API ────────────────────────────────────────────────────────────

    def create(self) -> str:
        """Create a new session and return its ID."""
        issued_at = datetime.utcnow()
        new_entry = _SessionEntry(
            session=RAGSession(),
            created_at=issued_at,
            expires_at=issued_at + self._ttl,
        )
        session_id = str(uuid.uuid4())
        with self._registry_lock:
            self._entries[session_id] = new_entry
        return session_id

    def get(self, session_id: str) -> tuple[RAGSession, threading.Lock]:
        """
        Return (RAGSession, per-session Lock) or raise KeyError if not found/expired.
        Also refreshes the TTL on access.
        """
        with self._registry_lock:
            entry = self._entries.get(session_id)
            if entry is None:
                raise KeyError(session_id)
            if datetime.utcnow() > entry.expires_at:
                # Stale entry: evict eagerly, then report as missing.
                del self._entries[session_id]
                raise KeyError(session_id)
            # Sliding expiry — each successful access extends the deadline.
            entry.expires_at = datetime.utcnow() + self._ttl
            return entry.session, entry.lock

    def get_meta(self, session_id: str) -> dict:
        """Return metadata (created_at, expires_at) without refreshing TTL."""
        with self._registry_lock:
            entry = self._entries.get(session_id)
            if entry is None or entry.expires_at < datetime.utcnow():
                raise KeyError(session_id)
            return {
                "created_at": f"{entry.created_at.isoformat()}Z",
                "expires_at": f"{entry.expires_at.isoformat()}Z",
            }

    def delete(self, session_id: str) -> bool:
        """Delete a session. Returns True if it existed."""
        with self._registry_lock:
            removed = self._entries.pop(session_id, None)
        return removed is not None

    def cleanup_expired(self) -> int:
        """Remove all expired sessions. Returns the number removed."""
        cutoff = datetime.utcnow()
        removed = 0
        with self._registry_lock:
            stale = [sid for sid, entry in self._entries.items() if cutoff > entry.expires_at]
            for sid in stale:
                del self._entries[sid]
                removed += 1
        return removed

    @property
    def active_count(self) -> int:
        """Number of stored sessions that have not yet expired."""
        with self._registry_lock:
            reference = datetime.utcnow()
            return len([e for e in self._entries.values() if e.expires_at >= reference])
99
+
100
+
101
# Singleton — imported by api.py so every endpoint shares one registry.
# NOTE(review): constructed with the default 60-minute TTL here; presumably
# SESSION_TTL_MINUTES from the environment is applied elsewhere — confirm
# against api.py.
store = SessionStore()