aelgendy committed on Mar 26

Commit

0eb0a7a

1 Parent(s): 0d33f13

Upload folder using huggingface_hub

Files changed (8)

ARCHITECTURE.md +0 -334
DOCKER.md +0 -443
OPEN_WEBUI.md +0 -385
README.md +503 -80
SETUP.md +0 -590
app/routers/chat.py +43 -2
app/routers/ops.py +43 -4
main.py +4 -2

ARCHITECTURE.md DELETED

@@ -1,334 +0,0 @@
-# QModel v6 Architecture — Detailed System Design
-> For a quick overview, see [README.md](README.md#architecture-overview)
-## System Vision
-A RAG system specialized **exclusively** in authenticated Qur'an and Hadith. No hallucinations, no outside knowledge—only content from verified sources.
-## Core Capabilities
-### 1. **Quran Verse Lookup** (by partial text)
-- Text search: find any verse by typing part of its Arabic or English text
-- Exact substring + fuzzy word-overlap matching
-### 2. **Quran Topic Search**
-- Semantic hybrid search to find verses related to any topic
-- Full Tafsir-aware prompting
-### 3. **Quran Word Frequency & Analytics**
-- Count how many times a word appears across all 114 Surahs
-- Per-surah breakdown with example verses
-- Chapter-level analytics (verse count, revelation type)
-### 4. **Hadith Lookup** (by partial text)
-- Text search across 9 Hadith collections
-- Optional collection filter
-### 5. **Hadith Topic Search**
-- Semantic hybrid search for Hadiths by topic
-- Optional grade filter (sahih, hasan, etc.)
-### 6. **Hadith Authenticity Verification**
-- Dual-method verification: text search + semantic search
-- Grade inference from collection name when not explicitly provided
-- Sources: Bukhari, Muslim, Abu Dawud, Tirmidhi, Ibn Majah, Nasa'i, Malik, Ahmad, Darimi
-### 7. **Safety First**
-- **Confidence Gating**: Low-confidence queries return "not found" instead of LLM guess
-- **Source Attribution**: Every answer cites exact verse/Hadith with reference
-- **Grade Filtering**: Optional: only return Sahih-authenticated Hadiths
-- **Verbatim Quotes**: Copy text directly from data, no paraphrasing
-## Modular Architecture (v6)
-```
-main.py                    ← Thin launcher (73 lines)
-app/
-  config.py               ← Config class (env vars)
-  llm.py                  ← LLM providers (Ollama, HuggingFace)
-  cache.py                ← TTL-LRU async cache
-  arabic_nlp.py           ← Arabic normalisation, stemming, language detection
-  search.py               ← Hybrid FAISS+BM25, text search, query rewriting
-  analysis.py             ← Intent detection, analytics, counting
-  prompts.py              ← Prompt engineering (persona, task instructions)
-  models.py               ← Pydantic schemas
-  state.py                ← AppState, lifespan, RAG pipeline
-  routers/
-    quran.py              ← 6 Quran endpoints
-    hadith.py             ← 5 Hadith endpoints
-    chat.py               ← 2 OpenAI-compatible + inference endpoints
-    ops.py                ← 3 operational endpoints (health, models, debug)
-```
----
-## Data Pipeline
-The system follows a three-phase approach:
-**Metadata Schema** (47,179 entries: 6,236 Quran + 40,943 Hadith):
-```json
-{
-  "id": "surah:verse or hadith_prefix_number",
-  "arabic": "...",
-  "english": "...",
-  "source": "Surah Al-Baqarah 2:43 | Sahih al-Bukhari 1",
-  "type": "quran | hadith",
-  // Quran only
-  "surah_number": 2,
-  "surah_name_en": "Al-Baqarah",
-  "surah_name_ar": "البقرة",
-  "verse_number": 43,
-  // Hadith only
-  "collection": "Sahih al-Bukhari",
-  "grade": "Sahih",
-  "hadith_number": 1
-}
-```
-### Phase 2: Indexing
-```
-build_index.py
-├── Load Quran + Hadith JSON
-├── Encode all texts with multilingual-e5-large
-│   ├── Dual embeddings: Arabic + English per item
-│   └── Normalize before encoding
-└── Build FAISS IndexFlatIP for dense retrieval
-```
-### Phase 3: Retrieval & Ranking
-**Hybrid Search Algorithm** (`app/search.py`):
-1. Dense retrieval: FAISS semantic scoring
-2. Sparse retrieval: BM25 term-frequency ranking
-3. Fusion: 60% dense + 40% sparse
-4. Intent-aware boost: +0.08 to Hadith items when intent=hadith
-5. Type filter: Optional (quran_only / hadith_only / authenticated_only)
-6. Phrase matching: Exact phrase + word-overlap scoring for text search
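The fusion and boost steps in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `app/search.py`: the names `fuse_scores` and `_min_max`, the min-max normalisation, and the dictionary layout are hypothetical. Only the 60/40 weights and the +0.08 Hadith boost come from the text.

```python
from typing import Dict, List, Tuple

DENSE_WEIGHT = 0.6   # from the text: 60% dense
SPARSE_WEIGHT = 0.4  # from the text: 40% sparse
HADITH_BOOST = 0.08  # from the text: added to Hadith items when intent == "hadith"


def _min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to [0, 1] so FAISS and BM25 values are comparable (assumed step)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def fuse_scores(
    dense: Dict[str, float],
    sparse: Dict[str, float],
    types: Dict[str, str],
    intent: str = "general",
) -> List[Tuple[str, float]]:
    """Return (item_id, fused_score) pairs, best first."""
    dense_n, sparse_n = _min_max(dense), _min_max(sparse)
    fused: Dict[str, float] = {}
    for item_id in set(dense_n) | set(sparse_n):
        score = (DENSE_WEIGHT * dense_n.get(item_id, 0.0)
                 + SPARSE_WEIGHT * sparse_n.get(item_id, 0.0))
        # Intent-aware boost: only Hadith items, only for Hadith-intent queries.
        if intent == "hadith" and types.get(item_id) == "hadith":
            score += HADITH_BOOST
        fused[item_id] = score
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)
```

Normalising both score sets before mixing keeps an unbounded BM25 value from swamping a cosine similarity. Whether the project normalises this way is not stated in the source.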
----
-## Module Reference
-### `app/config.py` — Configuration
-- `Config` dataclass with all environment variables
-- Singleton `cfg` instance
-- Loads `.env` via dotenv
-### `app/llm.py` — LLM Providers
-- `LLMProvider` abstract base class
-- `OllamaProvider` — primary (3-model fallback chain)
-- `HuggingFaceProvider` — alternative local inference
-- `create_llm_provider()` factory dispatches on `LLM_BACKEND` env var
-### `app/cache.py` — TTL-LRU Cache
-- `TTLCache` with size limit (1024) and TTL (300s)
-- Pre-built instances: `search_cache`, `analysis_cache`, `rewrite_cache`
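A size-bounded cache with per-entry expiry, as described above, might look like the sketch below. It is synchronous for brevity and is not the project's `TTLCache`: the class name `TTLCacheSketch`, its method names, and the injectable `clock` are assumptions. The defaults (1024 entries, 300 seconds) come from the text.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class TTLCacheSketch:
    """LRU cache whose entries also expire after `ttl` seconds (illustrative only)."""

    def __init__(self, max_size: int = 1024, ttl: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_size = max_size
        self._ttl = ttl
        self._clock = clock
        # Maps key -> (expiry time, value); insertion order tracks recency.
        self._data = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]          # expired: drop and report a miss
            return None
        self._data.move_to_end(key)      # mark as most recently used
        return value

    def set(self, key: str, value: Any) -> None:
        self._data[key] = (self._clock() + self._ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self._max_size:
            self._data.popitem(last=False)  # evict the least recently used entry
```

Passing the clock in as a parameter makes expiry testable without sleeping.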
-### `app/arabic_nlp.py` — Arabic NLP
-- `normalize_arabic()` — tashkeel removal, hamza normalization
-- `light_stem()` — prefix/suffix stripping
-- `tokenize_ar()` — Arabic-aware tokenization
-- `detect_language()` / `language_instruction()` — route persona by language
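As a rough illustration of the normalisation step listed above, the sketch below strips tashkeel and folds hamza-carrying alef forms to a bare alef. It is not the project's `normalize_arabic()`: the function name `normalize_arabic_sketch` and the exact character sets are assumptions, and the real module may handle more cases.

```python
import re

# Tashkeel (U+064B..U+0652) plus superscript alef (U+0670).
# Tatweel (U+0640) is not a diacritic, but is commonly removed in the same pass.
_TASHKEEL = re.compile("[\u064B-\u0652\u0670\u0640]")

# Alef with madda (U+0622), hamza above (U+0623), hamza below (U+0625).
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")

_BARE_ALEF = "\u0627"


def normalize_arabic_sketch(text: str) -> str:
    """Remove diacritics and unify alef variants so spelling variants compare equal."""
    text = _TASHKEEL.sub("", text)
    return _ALEF_VARIANTS.sub(_BARE_ALEF, text)
```

Applying the same function to both the query and the indexed text is what lets a user type an unvowelled phrase and still match a fully vowelled verse.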
-### `app/search.py` — Retrieval Engine
-- `rewrite_query()` — dual-language normalization, LLM-assisted rewriting
-- `hybrid_search()` — FAISS + BM25 fusion with intent-aware boosting
-- `text_search()` — exact substring + word-overlap matching (for verse/hadith lookup by partial text)
-- `build_context()` — format retrieved items for LLM prompt
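The "exact substring + word-overlap" matching described for `text_search()` could be scored along these lines. This is a hedged sketch, not the project's implementation: the name `text_match_score`, the whitespace tokenisation, and the 0.9 cap on fuzzy matches are assumptions chosen so that an exact hit always outranks a partial one.

```python
FUZZY_CAP = 0.9  # assumed: keeps word-overlap matches below exact substring matches


def text_match_score(query: str, text: str) -> float:
    """Return 1.0 for an exact substring hit, else a capped word-overlap ratio."""
    q = query.casefold().strip()
    t = text.casefold()
    if not q:
        return 0.0
    if q in t:
        return 1.0
    query_words = set(q.split())
    text_words = set(t.split())
    overlap = len(query_words & text_words) / len(query_words)
    return FUZZY_CAP * overlap
```

For example, `text_match_score("judged by intentions", "Actions are judged by intentions")` is an exact hit, while the reordered query `"intentions judged"` falls through to the overlap branch.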
-### `app/analysis.py` — Analytics & Intent Detection
-- `detect_analysis_intent()` — identifies count / analytics / chapter queries
-- `count_occurrences()` — word frequency across all Surahs
-- `get_quran_analytics()` — chapter-level stats
-- `get_hadith_analytics()` — collection-level stats
-- `get_chapter_info()` — single Surah metadata
-- `get_verse()` — exact verse by surah:ayah
-- `detect_surah_info()` / `lookup_surah_info()` — Surah name resolution
-### `app/prompts.py` — Prompt Engineering
-- `PERSONA` — Islamic scholar persona definition
-- `TASK_INSTRUCTIONS` — verbatim-quoting, anti-hallucination rules
-- `FORMAT_RULES` — citation box format
-- `build_messages()` — intent-aware system + user message construction
-- `not_found_answer()` — safe "not in dataset" response
-### `app/models.py` — Pydantic Schemas
-All request/response models:
-- `ChatMessage`, `ChatCompletionRequest/Response/Choice` — OpenAI-compatible
-- `AskResponse`, `AnalysisResult`, `SourceItem` — RAG pipeline
-- `HadithVerifyResponse` — authenticity verification
-- `VerseItem`, `HadithItem`, `TextSearchResponse` — text search
-- `ChapterResponse`, `QuranAnalyticsResponse`, `HadithAnalyticsResponse` — analytics
-- `WordFrequencyResponse` — word counting
-- `ModelInfo`, `ModelsListResponse` — OpenAI models list
-### `app/state.py` — Application State & Lifecycle
-- `AppState` — holds FAISS index, metadata, embedder, LLM provider
-- `lifespan()` — async startup (loads index, model, metadata)
-- `check_ready()` — dependency guard for endpoints
-- `run_rag_pipeline()` — full RAG: rewrite → search → context → LLM → response
-- `infer_hadith_grade()` — grade detection from collection name
----
-## API Endpoints (16 total)
-### Quran Router (`/quran/...`) — 6 endpoints
-| Endpoint | Method | Description |
-|----------|--------|-------------|
-| `/quran/search?q=...` | GET | Text search: find verses by partial Arabic/English text |
-| `/quran/topic?q=...&top_k=5` | GET | Semantic search: find verses related to a topic |
-| `/quran/word-frequency?word=...` | GET | Count word occurrences across all Surahs |
-| `/quran/analytics` | GET | Overall Quran stats (total verses, Surahs, types) |
-| `/quran/chapter/{number}` | GET | Single Surah metadata (name, verse count, type) |
-| `/quran/verse/{surah}:{ayah}` | GET | Exact verse lookup by reference |
-### Hadith Router (`/hadith/...`) — 5 endpoints
-| Endpoint | Method | Description |
-|----------|--------|-------------|
-| `/hadith/search?q=...&collection=...` | GET | Text search across collections |
-| `/hadith/topic?q=...&top_k=5&grade=...` | GET | Semantic search by topic with optional grade filter |
-| `/hadith/verify?q=...` | GET | Authenticity verification (text + semantic search) |
-| `/hadith/collection/{name}?limit=20` | GET | Browse a specific collection |
-| `/hadith/analytics` | GET | Collection-level statistics |
-### Chat Router — 2 endpoints
-| Endpoint | Method | Description |
-|----------|--------|-------------|
-| `/v1/chat/completions` | POST | OpenAI-compatible chat (SSE streaming supported) |
-| `/ask?q=...&top_k=5` | GET | Direct RAG query with full source attribution |
-### Ops Router — 3 endpoints
-| Endpoint | Method | Description |
-|----------|--------|-------------|
-| `/health` | GET | Readiness check |
-| `/v1/models` | GET | OpenAI-compatible model listing |
-| `/debug/scores?q=...&top_k=10` | GET | Raw retrieval scores (no LLM call) |
----
-## Anti-Hallucination Measures
-- Few-shot examples including "not found" refusal path
-- Hardcoded format rules (box/citation format required)
-- Verbatim copy rules (no reconstruction from memory)
-- Confidence threshold gating (default: 0.30)
-- Grade inference for Hadith authenticity (collection-based)
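The confidence gate in the list above amounts to checking the top retrieval score before the LLM is ever called. A minimal sketch follows, assuming results arrive as `(item_id, score)` pairs sorted best first. The names `answer_with_gate`, `generate`, and `NOT_FOUND` are hypothetical. The 0.30 default comes from the text.

```python
from typing import Callable, List, Tuple

CONFIDENCE_THRESHOLD = 0.30          # default stated in the text
NOT_FOUND = "Not in available dataset"


def answer_with_gate(
    ranked: List[Tuple[str, float]],
    generate: Callable[[List[str]], str],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> str:
    """Call `generate` only when the best retrieval score clears the threshold."""
    top_score = ranked[0][1] if ranked else 0.0
    if top_score < threshold:
        # Refuse without invoking the LLM, so it has no chance to guess.
        return NOT_FOUND
    return generate([item_id for item_id, _ in ranked])
```

Because the refusal path never reaches `generate`, a low-scoring query costs no LLM latency either.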
----
-## Configuration
-**`.env` variables**:
-```
-OLLAMA_HOST              # Ollama server URL
-LLM_MODEL                # Primary model (e.g. minimax-m2.7:cloud)
-LLM_BACKEND              # "ollama" (default) or "huggingface"
-EMBED_MODEL              # Embedding model (intfloat/multilingual-e5-large)
-FAISS_INDEX              # Path to QModel.index
-METADATA_FILE            # Path to metadata.json
-CONFIDENCE_THRESHOLD     # Min hybrid score for LLM call (default: 0.30)
-HADITH_BOOST             # Intent-aware boost for Hadith (default: 0.08)
-TOP_K_SEARCH             # Retrieval candidate pool (default: 20)
-TOP_K_RETURN             # Results returned to user (default: 5)
-TEMPERATURE              # LLM creativity (default: 0.2 for factual)
-```
----
-## Deployment
-### Local Development
-```bash
-python main.py
-# API at http://localhost:8000
-# Docs at http://localhost:8000/docs
-```
-### Docker
-```bash
-docker-compose up
-# Ollama on port 11434
-# QModel on port 8000
-```
----
-## Testing Examples
-### 1. Quran Verse Lookup (Capability 1)
-```bash
-curl "http://localhost:8000/quran/search?q=bismillah"
-```
-### 2. Quran Topic Search (Capability 2)
-```bash
-curl "http://localhost:8000/quran/topic?q=patience&top_k=5"
-```
-### 3. Word Frequency (Capability 3)
-```bash
-curl "http://localhost:8000/quran/word-frequency?word=mercy"
-# → Returns: count per surah + total + examples
-```
-### 4. Quran Analytics (Capability 3)
-```bash
-curl "http://localhost:8000/quran/analytics"
-curl "http://localhost:8000/quran/chapter/2"
-```
-### 5. Hadith Text Search (Capability 4)
-```bash
-curl "http://localhost:8000/hadith/search?q=actions+are+judged+by+intentions"
-```
-### 6. Hadith Topic Search (Capability 5)
-```bash
-curl "http://localhost:8000/hadith/topic?q=fasting&grade=sahih"
-```
-### 7. Hadith Authenticity Verification (Capability 6)
-```bash
-curl "http://localhost:8000/hadith/verify?q=Actions+are+judged+by+intentions"
-# → Returns: found=true, grade="Sahih", source="Sahih al-Bukhari 1"
-```
-### 8. Confidence Gate in Action (Safety)
-```
-Q: "Who was Muhammad's 7th wife?" (not in dataset)
-→ Retrieval score: 0.15 (below 0.30 threshold)
-→ Returns: "Not in available dataset"
-→ LLM not called (prevents hallucination)
-```
-### 9. OpenAI-Compatible Chat (Streaming)
-```bash
-curl -X POST http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{"model":"qmodel","messages":[{"role":"user","content":"What does Islam say about charity?"}],"stream":true}'
-```
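On the client side, a streamed reply from the request above arrives as server-sent events. The sketch below reassembles the text, assuming the server follows the OpenAI chunk convention (`data: {...}` lines carrying `choices[].delta.content`, ended by `data: [DONE]`). The source says only that SSE streaming is supported, so the chunk layout and the function name `iter_stream_text` are assumptions.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines (assumed wire format)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue                      # skip blank keep-alives and other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break                         # conventional end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:
                yield piece
```

In practice `lines` would be the decoded line iterator of an HTTP response; here any iterable of strings works, which keeps the parser easy to test offline.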
----
-## Roadmap: v6+ Enhancements
-- [x] Grade-based filtering: `?grade=sahih` to return only authenticated Hadiths
-- [x] Streaming responses: SSE for long-form answers
-- [x] Modular architecture: Separate routers, models, and services
-- [x] Dual LLM backend: Ollama + HuggingFace support
-- [x] Text search: Exact substring + fuzzy word-overlap matching
-- [x] Expanded endpoints: 16 endpoints across 4 routers
-- [ ] Chain of narrators: Display Isnad with full narrator details
-- [ ] Synonym expansion: Better topic matching (e.g., "mercy" → "rahma, compassion")
-- [ ] Multi-Surah topics: Topics spanning multiple Surahs
-- [ ] Batch processing: Handle multiple questions in one request
-- [ ] Islamic calendar integration: Hijri date references
-- [ ] Tafsir integration: Dedicated Tafsir endpoint with scholar citations

DOCKER.md DELETED

@@ -1,443 +0,0 @@
-# QModel Docker Guide
-Complete guide for running QModel in Docker with both backend options.
-## Quick Start
-### Option 1: Docker Compose (Recommended)
-```bash
-# 1. Copy example config
-cp .env.example .env
-# 2. Edit .env and choose your backend (see below)
-nano .env
-# 3. Run with compose
-docker-compose up
-```
-API available at: `http://localhost:8000`
-### Option 2: Docker CLI
-```bash
-# Build image
-docker build -t qmodel .
-# Run with Ollama backend
-docker run -p 8000:8000 \
-  --env-file .env \
-  --add-host host.docker.internal:host-gateway \
-  qmodel
-# Or run with HuggingFace backend
-docker run -p 8000:8000 \
-  --env-file .env \
-  --env HF_TOKEN=your_token_here \
-  qmodel
-```
----
-## Backend Configuration
-Configure which backend to use via `.env` file:
-### Backend 1: Ollama (Local)
-**Best for**: Development, testing, Docker Desktop
-```bash
-# .env
-LLM_BACKEND=ollama
-OLLAMA_HOST=http://host.docker.internal:11434
-OLLAMA_MODEL=llama2
-```
-**Prerequisites**:
-- Ollama installed on host machine
-- Running: `ollama serve`
-- Model pulled: `ollama pull llama2`
-**Why**:
-- ✅ Fast setup
-- ✅ No GPU required
-- ✅ Works on Docker Desktop (Mac/Windows)
-- ❌ Requires host Ollama service
-### Backend 2: HuggingFace (Remote)
-**Best for**: Production, GPU servers, containerized environments
-```bash
-# .env
-LLM_BACKEND=hf
-HF_MODEL_NAME=Qwen/Qwen2-7B-Instruct
-HF_DEVICE=auto
-```
-**Prerequisites**:
-- GPU (recommended) OR significant RAM
-- HuggingFace token (for gated models)
-**Passing HF Token**:
-```bash
-# Via docker-compose
-export HF_TOKEN=your_token_here
-docker-compose up
-# Via docker run
-docker run -p 8000:8000 \
-  --env-file .env \
-  --env HF_TOKEN=your_token_here \
-  qmodel
-```
----
-## Docker Compose Configuration
-The `docker-compose.yml` includes:
-| Setting | Value | Description |
-|---------|-------|-------------|
-| **Image** | Builds from `Dockerfile` | Python 3.11 + dependencies |
-| **Port** | `8000:8000` | API port mapping |
-| **Env File** | `.env` | Configuration source |
-| **HF Token** | From `.env` or `${HF_TOKEN}` | For HuggingFace auth |
-| **Ollama Host** | `host.docker.internal:11434` | Connect to host Ollama |
-| **Volumes** | `.:/app` | Code changes sync (dev mode) |
-| **HF Cache** | `/root/.cache/huggingface` | Persistent model cache |
-| **Networks** | `qmodel-network` | Internal network |
-| **Health Check** | `/health` endpoint | Auto-restart on failure |
-### For Production
-Modify `docker-compose.yml`:
-```yaml
-services:
-  qmodel:
-    # ... (same as above)
-    volumes:
-      # Remove live code volume
-      - huggingface_cache:/root/.cache/huggingface
-    restart: on-failure:5
-```
----
-## Examples
-### Development with Ollama
-```bash
-# Terminal 1: Start Ollama
-ollama serve
-# Terminal 2: Run QModel
-cat > .env << EOF
-LLM_BACKEND=ollama
-OLLAMA_HOST=http://host.docker.internal:11434
-OLLAMA_MODEL=llama2
-TEMPERATURE=0.2
-CONFIDENCE_THRESHOLD=0.30
-EOF
-docker-compose up
-```
-Access: `http://localhost:8000`
-### Production with HuggingFace
-```bash
-# Create .env for production
-cat > .env << EOF
-LLM_BACKEND=hf
-HF_MODEL_NAME=Qwen/Qwen2-7B-Instruct
-HF_DEVICE=auto
-TEMPERATURE=0.1
-CONFIDENCE_THRESHOLD=0.35
-ALLOWED_ORIGINS=yourdomain.com
-EOF
-# Export HF token
-export HF_TOKEN=hf_xxxxxxxxxxxxx
-# Run
-docker-compose up -d
-docker-compose logs -f
-```
-### Detached Mode
-```bash
-# Run in background
-docker-compose up -d
-# View logs
-docker-compose logs -f
-# Check status
-docker-compose ps
-# Stop
-docker-compose down
-```
----
-## Troubleshooting
-### "Cannot connect to Ollama"
-**Symptom**: `ConnectionRefusedError` when using Ollama backend
-**Solution**:
-```bash
-# Ensure Ollama is running on host
-ollama serve
-# Verify in Docker container
-docker run --add-host host.docker.internal:host-gateway qmodel \
-  python -c "import requests; print(requests.get('http://host.docker.internal:11434/api/tags').json())"
-```
-### "HuggingFace model not found"
-**Symptom**: `OSError: ... not found`
-**Solution**:
-```bash
-# Check HF token is set
-echo $HF_TOKEN
-# If not set, export it
-export HF_TOKEN=hf_xxxxxxxxxxxxx
-docker-compose up
-```
-### "Out of memory"
-**Symptom**: Container exits with no error message
-**Solution**:
-- Use smaller model: `HF_MODEL_NAME=mistralai/Mistral-7B-Instruct-v0.2`
-- Use Ollama with `neural-chat` model
-- Increase Docker memory limits:
-```bash
-# Edit docker-compose.yml
-services:
-  qmodel:
-    deploy:
-      resources:
-        limits:
-          memory: 16G
-```
-### "Port already in use"
-**Symptom**: `Address already in use`
-**Solution**:
-```bash
-# Change port in docker-compose.yml
-ports:
-  - "8001:8000"
-# Or kill existing container
-docker-compose down
-docker system prune
-```
----
-## Building Custom Images
-### Build for Specific Backend
-No code changes needed - just use `.env` to configure.
-### Build with Custom Requirements
-```bash
-# Edit requirements.txt, then rebuild
-docker build -t qmodel:custom .
-```
-### Push to Registry
-```bash
-# Tag for registry
-docker tag qmodel myregistry/qmodel:v6.1
-# Push
-docker push myregistry/qmodel:v6.1
-# Run from registry
-docker run -p 8000:8000 \
-  --env-file .env \
-  myregistry/qmodel:v6.1
-```
----
-## Performance Tips
-### Docker Compose with GPU (Linux)
-```yaml
-services:
-  qmodel:
-    deploy:
-      resources:
-        reservations:
-          devices:
-            - driver: nvidia
-              count: 1
-              capabilities: [gpu]
-```
-Then set in `.env`:
-```bash
-HF_DEVICE=cuda
-```
-### Reduce Memory Usage
-```bash
-# In .env
-HF_MODEL_NAME=gpt2                  # Tiny model
-OLLAMA_MODEL=orca-mini              # Smaller Ollama model
-TOP_K_SEARCH=10                     # Fewer candidates
-```
-### Cache Management
-```bash
-# Clear HuggingFace cache
-docker-compose down
-docker volume rm qmodel_huggingface_cache
-# Or cleanup all
-docker system prune -a
-```
----
-## Docker Networking
-### Access QModel from Host
-```bash
-# Default (works)
-curl http://localhost:8000/health
-```
-### Custom Network
-```bash
-# Create network
-docker network create qmodel-net
-# Run with network
-docker-compose -f docker-compose.yml up
-```
-### Multiple Containers
-```yaml
-# docker-compose.yml
-services:
-  qmodel:
-    networks:
-      - custom-network
-  other-service:
-    networks:
-      - custom-network
-networks:
-  custom-network:
-    driver: bridge
-```
----
-## CI/CD Integration
-### GitHub Actions Example
-```yaml
-name: Deploy QModel
-on: [push]
-jobs:
-  deploy:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v2
-      - name: Build Docker image
-        run: docker build -t qmodel .
-      - name: Run tests
-        run: |
-          docker run -p 8000:8000 qmodel &
-          sleep 30
-          curl http://localhost:8000/health
-      - name: Push to registry
-        run: |
-          echo "${{ secrets.REGISTRY_TOKEN }}" | docker login -u "${{ secrets.REGISTRY_USER }}" --password-stdin
-          docker tag qmodel myregistry/qmodel:${{ github.sha }}
-          docker push myregistry/qmodel:${{ github.sha }}
-```
----
-## Security Considerations
-### Secrets Management
-```bash
-# Don't commit .env with real tokens
-echo ".env" >> .gitignore
-# Use Docker secrets (Swarm mode)
-docker secret create hf_token -
-# Then use in compose:
-# HF_TOKEN=${HF_TOKEN_FILE}
-```
-### CORS Configuration
-```bash
-# In .env (restrict in production)
-ALLOWED_ORIGINS=yourdomain.com,api.yourdomain.com
-```
-### Network Isolation
-```yaml
-# docker-compose.yml
-services:
-  qmodel:
-    networks:
-      - internal
-networks:
-  internal:
-    internal: true
-```
----
-## Reference
-- **Dockerfile**: Multi-stage build, health checks, proper layer caching
-- **docker-compose.yml**: Service definition, volumes, networking, health checks
-- **Environment**: Fully configurable via `.env`
-- **Backends**: Ollama (local) or HuggingFace (remote) via `LLM_BACKEND` variable

OPEN_WEBUI.md DELETED

@@ -1,385 +0,0 @@
-# Using QModel v6 with Open-WebUI
-QModel v6 is fully compatible with **Open-WebUI** thanks to its OpenAI-compatible API endpoints. This guide shows you how to integrate them.
-## Prerequisites
-1. **QModel running** on your local machine or server
-   ```bash
-   python main.py
-   # Runs on http://localhost:8000
-   ```
-2. **Open-WebUI installed** (Docker recommended)
-   ```bash
-   docker run -d -p 3000:8080 --name open-webui ghcr.io/open-webui/open-webui:latest
-   # Runs on http://localhost:3000
-   ```
----
-## Integration Steps
-### Step 1: Add QModel as a Custom OpenAI-Compatible Model
-In Open-WebUI:
-1. **Settings** → **Models** → **Manage Models**
-2. Click **"Connect to OpenAI-compatible API"**
-3. Enter:
-   - **API Base URL**: `http://localhost:8000/v1`
-   - **Model Name**: `QModel` (or `qmodel`)
-   - **API Key**: Leave blank (no auth required)
-4. Click **"Save & Test"**
-5. You should see: ✅ **Model connected successfully**
-### Step 2: Start Using QModel
-1. Open a **New Chat** in Open-WebUI
-2. Select **QModel** from the model dropdown
-3. Type your Islamic question:
-   ```
-   What does the Quran say about mercy?
-   ```
-4. Press Enter and get an Islamic-grounded RAG response with sources!
----
-## API Endpoints (OpenAI-Compatible)
-### POST `/v1/chat/completions`
-Standard OpenAI chat completions endpoint.
-**Request:**
-```json
-{
-  "model": "QModel",
-  "messages": [
-    {"role": "user", "content": "What does Islam say about patience?"}
-  ],
-  "temperature": 0.2,
-  "max_tokens": 2048,
-  "top_k": 5,
-  "stream": false
-}
-```
-**Response:**
-```json
-{
-  "id": "qmodel-1234567890",
-  "object": "chat.completion",
-  "created": 1234567890,
-  "model": "QModel",
-  "choices": [
-    {
-      "index": 0,
-      "message": {
-        "role": "assistant",
-        "content": "Islam emphasizes patience as a core virtue..."
-      },
-      "finish_reason": "stop"
-    }
-  ],
-  "x_metadata": {
-    "language": "english",
-    "intent": "general",
-    "top_score": 0.876,
-    "latency_ms": 342,
-    "sources": [
-      {
-        "source": "Surah Al-Imran 3:200",
-        "type": "quran",
-        "grade": null,
-        "score": 0.876
-      }
-    ]
-  }
-}
-```
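A client can surface the attribution carried in `x_metadata.sources` with a few lines of Python. The sketch below assumes the response has already been parsed into a dictionary shaped like the example above. The function name `format_sources` and the output layout are illustrative choices, not part of QModel.

```python
from typing import Any, Dict, List


def format_sources(response: Dict[str, Any]) -> List[str]:
    """Turn x_metadata.sources into one human-readable citation per source."""
    sources = response.get("x_metadata", {}).get("sources", [])
    lines = []
    for src in sources:
        # Quran entries carry grade=null in the example, so only show a grade if present.
        grade = f", grade: {src['grade']}" if src.get("grade") else ""
        lines.append(f"{src['source']} ({src['type']}{grade}, score {src['score']:.3f})")
    return lines
```

Since `x_metadata` is an extension field, standard OpenAI clients will ignore it. Reading it defensively with `.get()` keeps the helper safe against responses that omit it.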
-### GET `/v1/models`
-List available models.
-**Response:**
-```json
-{
-  "object": "list",
-  "data": [
-    {
-      "id": "QModel",
-      "object": "model",
-      "created": 1234567890,
-      "owned_by": "elgendy"
-    }
-  ]
-}
-```
----
-## Advanced Query Parameters (Open-WebUI Compatible)
-When using Open-WebUI, you can include special parameters:
-### Islamic-Specific Parameters
-**URL Query String:**
-```
-/v1/chat/completions?source_type=hadith&grade_filter=sahih&top_k=5
-```
-**Supported Parameters:**
-- `source_type`: `quran` | `hadith` | (both, default)
-- `grade_filter`: `sahih` | `hasan` | (all, default)
-- `top_k`: 1-20 (number of sources to retrieve)
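Validation of the three parameters above could be sketched as follows. The allowed values and the 1-20 range come from the list. The function name `parse_filters`, the `ValueError` behaviour, and the fallback of 5 for `top_k` (borrowed from the `TOP_K_RETURN` default mentioned elsewhere in this commit) are assumptions about how a server might handle them.

```python
from typing import Mapping, Optional, Tuple

SOURCE_TYPES = {"quran", "hadith"}   # omitted means both
GRADES = {"sahih", "hasan"}          # omitted means all grades


def parse_filters(
    params: Mapping[str, str], default_top_k: int = 5
) -> Tuple[Optional[str], Optional[str], int]:
    """Validate query-string filters and return (source_type, grade_filter, top_k)."""
    source_type = params.get("source_type") or None
    if source_type is not None and source_type not in SOURCE_TYPES:
        raise ValueError(f"source_type must be one of {sorted(SOURCE_TYPES)}")

    grade = params.get("grade_filter") or None
    if grade is not None and grade not in GRADES:
        raise ValueError(f"grade_filter must be one of {sorted(GRADES)}")

    top_k = int(params.get("top_k", default_top_k))
    if not 1 <= top_k <= 20:
        raise ValueError("top_k must be between 1 and 20")
    return source_type, grade, top_k
```

Returning `None` for an omitted filter lets downstream search code treat "no filter" uniformly instead of comparing against empty strings.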
-### Example Requests via curl
-```bash
-# 1. Basic query (both Quran + Hadith)
-curl -X POST http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "QModel",
-    "messages": [{"role": "user", "content": "What does Islam say about mercy?"}]
-  }'
-# 2. Quran-only query
-curl -X POST "http://localhost:8000/v1/chat/completions?source_type=quran" \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "QModel",
-    "messages": [{"role": "user", "content": "What does the Quran say about patience?"}]
-  }'
-# 3. Authenticated Hadiths only (Sahih grade)
-curl -X POST "http://localhost:8000/v1/chat/completions?source_type=hadith&grade_filter=sahih" \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "QModel",
-    "messages": [{"role": "user", "content": "Hadiths about prayer"}]
-  }'
-# 4. Streaming response
-curl -X POST http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "QModel",
-    "messages": [{"role": "user", "content": "Tell me about Zakat"}],
-    "stream": true
-  }'
-```
----
-## Open-WebUI Features Supported
-| Feature | Status | Notes |
-|---------|--------|-------|
-| **Chat** | ✅ Full support | Normal Q&A |
-| **Streaming** | ✅ Supported | Set `stream: true` in request |
-| **Context** | ✅ Multi-turn | Open-WebUI handles conversation history |
-| **Temperature** | ✅ Configurable | Via Open-WebUI settings |
-| **Token Limits** | ✅ Supported | Via `max_tokens` parameter |
-| **Model List** | ✅ Available | Via `/v1/models` endpoint |
-| **Source Attribution** | ✅ In metadata | Via `x_metadata.sources` |
----
-## Custom System Prompts in Open-WebUI
-To customize QModel for specific Islamic tasks, create a custom chatbot in Open-WebUI:
-1. **Home** → **+ New Chatbot**
-2. Configure:
-   - **Name**: "Islamic Scholar" (or your choice)
-   - **Model**: QModel
-   - **System Prompt**:
-     ```
-     You are an expert Islamic scholar specializing in Qur'an and Hadith.
-     Always cite sources exactly as provided.
-     Only answer from the provided Islamic context—never use outside knowledge.
-     If information is not in the dataset, say so clearly.
-     ```
-   - **Top K Sources**: 5
-   - **Temperature**: 0.1 (for consistency)
-3. **Save** and start chatting!
----
-## Troubleshooting
-### Issue: "Failed to connect to QModel"
-**Solutions:**
-1. Check QModel is running: `curl http://localhost:8000/health`
-2. Verify API Base URL is correct: `http://localhost:8000/v1`
-3. Check firewall: Port 8000 must be accessible
-4. Check logs: `python main.py` to see startup messages
-### Issue: "No sources in response"
-**Solutions:**
-1. Check `/debug/scores` endpoint directly:
-   ```bash
-   curl "http://localhost:8000/debug/scores?q=patience&top_k=10"
-   ```
-2. Adjust `CONFIDENCE_THRESHOLD` in `.env` if retrievals are low-quality
-3. Try synonyms: "mercy" instead of "compassion"
-### Issue: "Assistant returns 'Not found'"
-**This is expected behavior!** QModel has safety checks:
-1. If retrieval score is too low (< 0.30), it returns "not found"
-2. This prevents hallucinations
-3. Try more specific queries or adjust `CONFIDENCE_THRESHOLD`
----
-## Configuration for Open-WebUI
-### Recommended Settings
-For best results with Open-WebUI:
-```env
-# More conservative (fewer hallucinations)
-CONFIDENCE_THRESHOLD=0.40
-TEMPERATURE=0.1
-HADITH_BOOST=0.08
-# More liberal (more answers, higher hallucination risk)
-CONFIDENCE_THRESHOLD=0.20
-TEMPERATURE=0.3
-HADITH_BOOST=0.05
-```
-### Docker Compose Integration
-To run both QModel and Open-WebUI together:
-```yaml
-version: '3.8'
-services:
-  qmodel:
-    build: .
-    ports:
-      - "8000:8000"
-    environment:
-      - LLM_BACKEND=ollama
-      - OLLAMA_HOST=http://ollama:11434
-    depends_on:
-      - ollama
-  ollama:
-    image: ollama/ollama:latest
-    ports:
-      - "11434:11434"
-  web-ui:
-    image: ghcr.io/open-webui/open-webui:latest
-    ports:
-      - "3000:8080"
-    depends_on:
-      - qmodel
-```
-Run: `docker-compose up`
----
-## Using QModel in Open-WebUI Workflows
-### Example 1: Islamic Q&A Chatbot
-1. Create chatbot with system prompt about Islamic knowledge
-2. Select QModel as backend
-3. Set temperature to 0.1 for consistency
-4. Enable web search toggle (optional, for cross-verification)
-### Example 2: Hadith Research Tool
-1. Create chatbot: "Hadith Researcher"
-2. System prompt:
-   ```
-   You are a Hadith researcher. For each query:
-   1. Search authenticated Hadiths only (Sahih grade)
-   2. Display the full text with authenticity grade
-   3. Explain the Hadith's significance
-   4. Always cite the collection and number
-   ```
-3. Enable grade filtering: `grade_filter=sahih`
-### Example 3: Qur'anic Study Assistant
-1. Create chatbot: "Qur'an Tafsir"
-2. Set `source_type=quran` in parameters
-3. System prompt focusing on Qur'anic interpretation
-4. Enable multi-turn for deeper exploration
----
-## API Testing
-### Test with Open-WebUI's Developer Tools
-1. Open Open-WebUI console (F12)
-2. Go to **Network** tab
-3. Send a message to QModel
-4. Inspect the request/response to `/v1/chat/completions`
-### Test with cURL
-```bash
-# 1. Health check
-curl http://localhost:8000/health | jq
-# 2. List models
-curl http://localhost:8000/v1/models | jq
-# 3. Simple chat
-curl -X POST http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{"model":"QModel","messages":[{"role":"user","content":"Assalam alaikum"}]}' | jq
-```
----
-## Performance Tips
-### For Optimal Open-WebUI Experience
-1. **Use Ollama locally** for responsive chat (400-800ms per query)
-2. **Set `max_tokens=1024`** to avoid long waits
-3. **Use temperature=0.1** for reliable, consistent answers
-4. **Increase `CACHE_TTL`** for frequently asked questions
-5. **Reduce `TOP_K_SEARCH`** if queries are slow (default 20)
----
-## Security Notes
-### For Production Deployments
-1. **Restrict CORS**: Set `ALLOWED_ORIGINS=your-domain.com` in `.env`
-2. **Use HTTPS**: Proxy through nginx with TLS
-3. **Rate limit**: Add rate limiting middleware (not in v6, but recommended)
-4. **Authentication**: Consider adding API key validation layer
-5. **Network**: Don't expose QModel directly to the internet without auth
----
-## Support
-- 📖 Full setup guide: See `SETUP.md`
-- 🔍 Debugging: Use `/debug/scores` to inspect retrievals
-- 💬 Questions about Open-WebUI: See https://docs.openwebui.com
-- 🕌 Islamic knowledge: See `ARCHITECTURE.md` for system details
----
-**Happy chatting with QModel + Open-WebUI! 🕌**

README.md CHANGED

@@ -64,33 +64,218 @@ language:
 ---
-## Quick Start (5 minutes)
 ```bash
-# 1. Install
-git clone https://github.com/elgendy/QModel.git && cd QModel
 python3 -m venv .venv && source .venv/bin/activate
 pip install -r requirements.txt
-# 2. Configure (choose one)
-# For local development - Ollama:
 export LLM_BACKEND=ollama
 export OLLAMA_MODEL=llama2
 # Make sure Ollama is running: ollama serve
-# OR for production - HuggingFace:
 export LLM_BACKEND=hf
 export HF_MODEL_NAME=Qwen/Qwen2-7B-Instruct
-# 3. Run
 python main.py
-# 4. Query
 curl "http://localhost:8000/ask?q=What%20does%20Islam%20say%20about%20mercy?"
 ```
 API docs: http://localhost:8000/docs
 ---
 ## Example Queries
@@ -105,134 +290,372 @@ curl "http://localhost:8000/ask?q=How%20many%20times%20is%20mercy%20mentioned?"
 # Authentic Hadiths only
 curl "http://localhost:8000/ask?q=prayer&source_type=hadith&grade_filter=sahih"
-# Verify Hadith
-curl "http://localhost:8000/hadith/verify?q=Actions%20are%20judged%20by%20intentions"
-```
----
-## Documentation
-| Document | Purpose |
-|----------|---------|
-| **[SETUP.md](SETUP.md)** | Installation, configuration (both backends), API endpoints, examples |
-| **[DOCKER.md](DOCKER.md)** | Docker deployment, production setup, troubleshooting |
-| **[ARCHITECTURE.md](ARCHITECTURE.md)** | System design, data pipeline, core components |
-| **[OPEN_WEBUI.md](OPEN_WEBUI.md)** | Integration with Open-WebUI chat interface |
 ---
-## Key Decisions
 ### Backend Selection
-- **Ollama** — Fast setup, no GPU, great for development, `LLM_BACKEND=ollama`
-- **HuggingFace** — Production-grade, better quality, GPU recommended, `LLM_BACKEND=hf`
-Both are equally supported via the same `.env` configuration. Just set `LLM_BACKEND` and restart.
-### Data
-- **47,626 documents**: 6,236 Quranic verses + 41,390 hadiths from 9 canonical collections
-- **Pre-built**: `metadata.json` and `QModel.index` included, ready to use
-- **Dual-language**: Arabic and English support
----
-## Open-WebUI Integration
-QModel integrates seamlessly with Open-WebUI for a chat interface:
 ```bash
-# Start QModel
-python main.py
-# Start Open-WebUI (Docker)
-docker run -p 3000:8080 ghcr.io/open-webui/open-webui:latest
-# In Open-WebUI: Settings → Models → Add OpenAI-compatible
-# API Base: http://localhost:8000/v1
-# Model: QModel
 ```
-See [OPEN_WEBUI.md](OPEN_WEBUI.md) for detailed integration guide.
 ---
-## API Reference (Quick)
-### Main Query
 ```
-GET /ask?q=<question>&top_k=5&source_type=<quran|hadith>&grade_filter=<sahih|hasan>
 ```
-**Response includes:**
-- AI-generated answer
-- Listed sources with scores
-- Language detection (Arabic/English)
-- Query intent classification
-### Other Endpoints
-- `GET /debug/scores?q=<question>&top_k=10` — Inspect raw retrieval scores
-- `GET /hadith/verify?q=<hadith_text>` — Check hadith authenticity
-- `POST /v1/chat/completions` — OpenAI-compatible endpoint
-- `GET /health` — Health check
-See [SETUP.md](SETUP.md) for full endpoint documentation.
 ---
-## Configuration
-All configuration via environment variables (no code changes needed):
-```bash
-# Backend (required)
-LLM_BACKEND=ollama              # or: hf
-# Ollama settings
-OLLAMA_HOST=http://localhost:11434
-OLLAMA_MODEL=llama2             # or: mistral, neural-chat
-# HuggingFace settings
-HF_MODEL_NAME=Qwen/Qwen2-7B-Instruct
-HF_DEVICE=auto                  # auto, cuda, or cpu
-# Quality tuning
-TEMPERATURE=0.2                 # 0=deterministic, 1=creative
-CONFIDENCE_THRESHOLD=0.30       # Min score for LLM call
-TOP_K_RETURN=5                  # Results per query
 ```
-See [SETUP.md](SETUP.md) for comprehensive configuration reference.
 ---
-## Performance
 | Operation | Time | Backend |
 |-----------|------|---------|
 | Query (cached) | ~50ms | Both |
-| Query (Ollama) | 400-800ms | Ollama |
-| Query (HF GPU) | 500-1500ms | CUDA |
-| Query (HF CPU) | 2-5s | CPU |
 ---
-## Deployment
-### Local Development
 ```bash
-python main.py
 ```
-### Docker (with Ollama backend)
 ```bash
-docker-compose up
 ```
-### Docker (with HuggingFace backend)
-Set `LLM_BACKEND=hf` in `.env` then `docker-compose up`
-See [DOCKER.md](DOCKER.md) for production deployment, troubleshooting, and advanced configuration.
 ---

 ---
+## Quick Start
+### Prerequisites
+- Python 3.10+
+- 16 GB RAM minimum (for embeddings + LLM)
+- GPU recommended for HuggingFace backend
+- Ollama installed (for local development) OR internet access (for HuggingFace)
+### Installation
 ```bash
+# Clone and enter project
+git clone https://github.com/Logicsoft/QModel.git && cd QModel
 python3 -m venv .venv && source .venv/bin/activate
 pip install -r requirements.txt
+# Configure (choose one backend)
+# Option A — Ollama (local development):
 export LLM_BACKEND=ollama
 export OLLAMA_MODEL=llama2
 # Make sure Ollama is running: ollama serve
+# Option B — HuggingFace (production):
 export LLM_BACKEND=hf
 export HF_MODEL_NAME=Qwen/Qwen2-7B-Instruct
+# Run
 python main.py
+# Query
 curl "http://localhost:8000/ask?q=What%20does%20Islam%20say%20about%20mercy?"
 ```
 API docs: http://localhost:8000/docs
+### Data & Index
+Pre-built data files are included:
+- `metadata.json` — 47,626 documents (6,236 Quran verses + 41,390 hadiths from 9 canonical collections)
+- `QModel.index` — FAISS search index
+To rebuild after dataset changes:
+```bash
+python build_index.py
+```
+---
+## API Reference (18 endpoints)
+### Inference
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/ask?q=...&top_k=5&source_type=&grade_filter=` | GET | Direct RAG query with full source attribution |
+| `/v1/chat/completions` | POST | OpenAI-compatible chat (SSE streaming supported) |
+### Quran (`/quran/...`)
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/quran/search?q=...&limit=10` | GET | Text search: find verses by partial Arabic/English text |
+| `/quran/topic?topic=...&top_k=10` | GET | Semantic search: find verses related to a topic |
+| `/quran/word-frequency?word=...` | GET | Count word occurrences across all Surahs |
+| `/quran/analytics` | GET | Overall Quran stats (total verses, Surahs, revelation types) |
+| `/quran/chapter/{number}` | GET | All verses and metadata for a specific Surah |
+| `/quran/verse/{surah}:{ayah}` | GET | Exact verse lookup by reference (e.g. `/quran/verse/2:255`) |
+### Hadith (`/hadith/...`)
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/hadith/search?q=...&collection=&limit=10` | GET | Text search across collections |
+| `/hadith/topic?topic=...&top_k=10&grade_filter=` | GET | Semantic search by topic with optional grade filter |
+| `/hadith/verify?q=...&collection=` | GET | Authenticity verification (text + semantic search) |
+| `/hadith/collection/{name}?limit=20&offset=0` | GET | Browse a specific collection |
+| `/hadith/analytics` | GET | Collection-level statistics |
+### Operations
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/health` | GET | Readiness check |
+| `/v1/models` | GET | OpenAI-compatible model listing |
+| `/debug/scores?q=...&top_k=10&source_type=` | GET | Raw retrieval scores (no LLM call) |
+---
+### GET `/ask` — Main Query
+```bash
+curl "http://localhost:8000/ask?q=What%20does%20Islam%20say%20about%20mercy?&top_k=5"
+```
+**Parameters:**
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `q` | *(required)* | Your Islamic question |
+| `top_k` | `5` | Number of sources to retrieve (1–20) |
+| `source_type` | both | `quran` or `hadith` |
+| `grade_filter` | all | `sahih` or `hasan` |
+**Response:**
+```json
+{
+  "question": "What does Islam say about mercy?",
+  "answer": "Islam emphasizes mercy as a core value...",
+  "language": "english",
+  "intent": "general",
+  "analysis": null,
+  "sources": [
+    {
+      "source": "Surah Al-Baqarah 2:178",
+      "type": "quran",
+      "grade": null,
+      "arabic": "...",
+      "english": "...",
+      "_score": 0.876
+    }
+  ],
+  "top_score": 0.876,
+  "latency_ms": 342
+}
+```
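The same `/ask` call can be made from Python. This is a minimal standard-library sketch, assuming the server listens on `http://localhost:8000` as in the examples above. The helper names `build_ask_url`, `summarize_sources`, and `ask` are hypothetical and not part of QModel.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumption: default local address from this README


def build_ask_url(question, top_k=5, source_type=None, grade_filter=None):
    """Build a percent-encoded /ask URL; optional filters are omitted when None."""
    params = {"q": question, "top_k": top_k}
    if source_type is not None:
        params["source_type"] = source_type
    if grade_filter is not None:
        params["grade_filter"] = grade_filter
    return f"{BASE_URL}/ask?{urlencode(params)}"


def summarize_sources(response):
    """Return (source, score) pairs from an /ask response shaped like the JSON above."""
    return [(s["source"], s["_score"]) for s in response.get("sources", [])]


def ask(question, **filters):
    """Send the query and decode the JSON body (requires a running server)."""
    with urlopen(build_ask_url(question, **filters), timeout=60) as resp:
        return json.load(resp)
```

Letting `urlencode` build the query string avoids hand-written `%20` escapes and also encodes the trailing `?` in the question.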
+### POST `/v1/chat/completions` — OpenAI-Compatible
+```bash
+curl -X POST http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "QModel",
+    "messages": [{"role": "user", "content": "What does Islam say about patience?"}],
+    "temperature": 0.2,
+    "max_tokens": 2048,
+    "top_k": 5,
+    "stream": false
+  }'
+```
+**Response:**
+```json
+{
+  "id": "qmodel-1234567890",
+  "object": "chat.completion",
+  "created": 1234567890,
+  "model": "QModel",
+  "choices": [
+    {
+      "index": 0,
+      "message": { "role": "assistant", "content": "Islam emphasizes patience..." },
+      "finish_reason": "stop"
+    }
+  ],
+  "x_metadata": {
+    "language": "english",
+    "intent": "general",
+    "top_score": 0.876,
+    "latency_ms": 342,
+    "sources": [{ "source": "Surah Al-Imran 3:200", "type": "quran", "score": 0.876 }]
+  }
+}
+```
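A Python client for this endpoint might look like the sketch below. It assumes the request and response shapes shown above, including the non-standard `x_metadata` block. `build_chat_request`, `extract_answer`, and `chat` are hypothetical helper names.

```python
import json
from urllib.request import Request, urlopen

CHAT_URL = "http://localhost:8000/v1/chat/completions"  # assumption: default local address


def build_chat_request(question, temperature=0.2, max_tokens=2048):
    """Prepare (but do not send) a non-streaming chat completion request."""
    payload = {
        "model": "QModel",
        "messages": [{"role": "user", "content": question}],
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": False,
    }
    return Request(
        CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_answer(completion):
    """Return the assistant text and the cited source names from a response dict."""
    text = completion["choices"][0]["message"]["content"]
    sources = completion.get("x_metadata", {}).get("sources", [])
    return text, [s["source"] for s in sources]


def chat(question):
    """Send the request and decode the reply (requires a running server)."""
    with urlopen(build_chat_request(question), timeout=120) as resp:
        return extract_answer(json.load(resp))
```

Reading `x_metadata` with `.get` keeps the helper usable against other OpenAI-compatible servers that do not return source attribution.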
+### GET `/hadith/verify` — Authenticity Check
+```bash
+curl "http://localhost:8000/hadith/verify?q=Actions%20are%20judged%20by%20intentions"
+```
+**Response:**
+```json
+{
+  "query": "Actions are judged by intentions",
+  "found": true,
+  "collection": "Sahih al-Bukhari",
+  "grade": "Sahih",
+  "reference": "Sahih al-Bukhari 1",
+  "arabic": "إنما الأعمال بالنيات",
+  "english": "Verily, actions are judged by intentions...",
+  "latency_ms": 156
+}
+```
+### GET `/debug/scores` — Retrieval Inspection
+```bash
+curl "http://localhost:8000/debug/scores?q=patience&top_k=10"
+```
+Use this to calibrate `CONFIDENCE_THRESHOLD`. If queries you expect to work have `_score < threshold`, lower the threshold.
+**Response:**
+```json
+{
+  "query": "patience",
+  "intent": "general",
+  "threshold": 0.3,
+  "count": 10,
+  "results": [
+    {
+      "rank": 1,
+      "source": "Surah Al-Baqarah 2:45",
+      "type": "quran",
+      "_dense": 0.8234,
+      "_sparse": 0.5421,
+      "_score": 0.7234
+    }
+  ]
+}
+```
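The calibration check described above can be scripted. The sketch below works on a response dict shaped like the JSON above and assumes that the gate compares the best `_score` against `threshold`. `threshold_report` is a hypothetical helper, not a QModel function.

```python
def threshold_report(debug_response):
    """Summarize how /debug/scores results relate to the reported threshold."""
    threshold = debug_response["threshold"]
    results = debug_response["results"]
    # Best fused score among the returned candidates (0.0 when nothing was retrieved).
    top_score = max((r["_score"] for r in results), default=0.0)
    return {
        "top_score": top_score,
        # Assumption: the LLM is only called when the top score reaches the threshold.
        "passes_gate": top_score >= threshold,
        "below_threshold": [r["source"] for r in results if r["_score"] < threshold],
    }
```

Running this over a handful of queries that should succeed gives a quick view of whether the current threshold is rejecting any of them.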
 ---
 ## Example Queries
 # Authentic Hadiths only
 curl "http://localhost:8000/ask?q=prayer&source_type=hadith&grade_filter=sahih"
+# Quran text search
+curl "http://localhost:8000/quran/search?q=bismillah"
+# Quran topic search
+curl "http://localhost:8000/quran/topic?topic=patience&top_k=5"
+# Quran word frequency
+curl "http://localhost:8000/quran/word-frequency?word=mercy"
+# Single chapter
+curl "http://localhost:8000/quran/chapter/2"
+# Exact verse
+curl "http://localhost:8000/quran/verse/2:255"
+# Hadith text search
+curl "http://localhost:8000/hadith/search?q=actions+are+judged+by+intentions"
+# Hadith topic search (Sahih only)
+curl "http://localhost:8000/hadith/topic?topic=fasting&grade_filter=sahih"
+# Verify Hadith authenticity
+curl "http://localhost:8000/hadith/verify?q=Actions%20are%20judged%20by%20intentions"
+# Browse a collection
+curl "http://localhost:8000/hadith/collection/bukhari?limit=5"
+# Streaming (OpenAI-compatible)
+curl -X POST http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model":"QModel","messages":[{"role":"user","content":"What does Islam say about charity?"}],"stream":true}'
+```
 ---
+## Configuration
+All configuration via environment variables (`.env` file or exported directly):
 ### Backend Selection
+| Backend | Pros | Cons | When to Use |
+|---------|------|------|------------|
+| **Ollama** | Fast setup, no GPU, free | Smaller models | Development, testing |
+| **HuggingFace** | Larger models, better quality | Requires GPU or significant RAM | Production |
+### Ollama Backend (Development)
+```bash
+LLM_BACKEND=ollama
+OLLAMA_HOST=http://localhost:11434
+OLLAMA_MODEL=llama2              # or: mistral, neural-chat, orca-mini
+```
+Requires: `ollama serve` running and model pulled (`ollama pull llama2`).
+### HuggingFace Backend (Production)
 ```bash
+LLM_BACKEND=hf
+HF_MODEL_NAME=Qwen/Qwen2-7B-Instruct
+HF_DEVICE=auto                   # auto | cuda | cpu
+HF_MAX_NEW_TOKENS=2048
+```
+### All Environment Variables
+| Variable | Default | Description |
+|----------|---------|-------------|
+| **Backend** | | |
+| `LLM_BACKEND` | `hf` | `ollama` or `hf` |
+| `OLLAMA_HOST` | `http://localhost:11434` | Ollama server URL |
+| `OLLAMA_MODEL` | `llama2` | Ollama model name |
+| `HF_MODEL_NAME` | `Qwen/Qwen2-7B-Instruct` | HuggingFace model ID |
+| `HF_DEVICE` | `auto` | `auto`, `cuda`, or `cpu` |
+| `HF_MAX_NEW_TOKENS` | `2048` | Max output length |
+| **Embedding & Data** | | |
+| `EMBED_MODEL` | `intfloat/multilingual-e5-large` | Embedding model |
+| `FAISS_INDEX` | `QModel.index` | Index file path |
+| `METADATA_FILE` | `metadata.json` | Dataset file |
+| **Retrieval** | | |
+| `TOP_K_SEARCH` | `20` | Candidate pool (5–100) |
+| `TOP_K_RETURN` | `5` | Results shown to user (1–20) |
+| `RERANK_ALPHA` | `0.6` | Weight of the dense score in fusion (0.0–1.0); the sparse score gets the remainder |
+| **Generation** | | |
+| `TEMPERATURE` | `0.2` | Creativity (0.0–1.0; use 0.1–0.2 for religious content) |
+| `MAX_TOKENS` | `2048` | Max response length |
+| **Safety** | | |
+| `CONFIDENCE_THRESHOLD` | `0.30` | Min score to call LLM (higher = fewer hallucinations) |
+| `HADITH_BOOST` | `0.08` | Score boost for hadith on hadith queries |
+| **Other** | | |
+| `CACHE_SIZE` | `512` | Query response cache entries |
+| `CACHE_TTL` | `3600` | Cache expiry in seconds |
+| `ALLOWED_ORIGINS` | `*` | CORS origins |
+| `MAX_EXAMPLES` | `3` | Few-shot examples in system prompt |
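To illustrate how these variables could be read, here is a small sketch that loads a subset of them with the defaults from the table. It is not the project's `app/config.py`. The `Settings` class, the `load_settings` function, and the comma-splitting of `ALLOWED_ORIGINS` are assumptions made for the example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    llm_backend: str
    top_k_search: int
    top_k_return: int
    rerank_alpha: float
    temperature: float
    confidence_threshold: float
    allowed_origins: list


def load_settings(env=None):
    """Read settings from a mapping (defaults to os.environ), using the table's defaults."""
    env = os.environ if env is None else env
    backend = env.get("LLM_BACKEND", "hf")
    if backend not in ("ollama", "hf"):
        raise ValueError(f"LLM_BACKEND must be 'ollama' or 'hf', got {backend!r}")
    # Assumption: multiple CORS origins are given as a comma-separated list.
    origins = [o.strip() for o in env.get("ALLOWED_ORIGINS", "*").split(",") if o.strip()]
    return Settings(
        llm_backend=backend,
        top_k_search=int(env.get("TOP_K_SEARCH", "20")),
        top_k_return=int(env.get("TOP_K_RETURN", "5")),
        rerank_alpha=float(env.get("RERANK_ALPHA", "0.6")),
        temperature=float(env.get("TEMPERATURE", "0.2")),
        confidence_threshold=float(env.get("CONFIDENCE_THRESHOLD", "0.30")),
        allowed_origins=origins,
    )
```

Passing a plain dict as `env` makes the loader easy to exercise without touching the real process environment.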
+### Configuration Examples
+**Development (Ollama)**
+```bash
+LLM_BACKEND=ollama
+OLLAMA_HOST=http://localhost:11434
+OLLAMA_MODEL=llama2
+TEMPERATURE=0.2
+CONFIDENCE_THRESHOLD=0.30
+ALLOWED_ORIGINS=*
+```
+**Production (HuggingFace + GPU)**
+```bash
+LLM_BACKEND=hf
+HF_MODEL_NAME=Qwen/Qwen2-7B-Instruct
+HF_DEVICE=cuda
+TOP_K_SEARCH=30
+TEMPERATURE=0.1
+CONFIDENCE_THRESHOLD=0.35
+ALLOWED_ORIGINS=https://yourdomain.com,https://api.yourdomain.com
 ```
+### Tuning Tips
+- **Better results**: Increase `TOP_K_SEARCH`, use `TEMPERATURE=0.1`, and lower `CONFIDENCE_THRESHOLD` to answer more queries (at some risk of hallucination)
+- **Faster performance**: Lower `TOP_K_SEARCH` and `TOP_K_RETURN`, reduce `MAX_TOKENS`, use Ollama
+- **More conservative**: Increase `CONFIDENCE_THRESHOLD`, lower `TEMPERATURE`
 ---
+## Docker Deployment
+### Docker Compose (Recommended)
+```bash
+cp .env.example .env   # Configure backend (see Configuration section)
+docker-compose up
 ```
+### Docker CLI
+```bash
+docker build -t qmodel .
+# With Ollama backend
+docker run -p 8000:8000 \
+  --env-file .env \
+  --add-host host.docker.internal:host-gateway \
+  qmodel
+# With HuggingFace backend
+docker run -p 8000:8000 \
+  --env-file .env \
+  --env HF_TOKEN=your_token_here \
+  qmodel
+```
+### Docker with Ollama
+```bash
+# .env
+LLM_BACKEND=ollama
+OLLAMA_HOST=http://host.docker.internal:11434
+OLLAMA_MODEL=llama2
 ```
+Requires Ollama running on the host (`ollama serve`).
+### Docker with HuggingFace
+```bash
+# .env
+LLM_BACKEND=hf
+HF_MODEL_NAME=Qwen/Qwen2-7B-Instruct
+HF_DEVICE=auto
+# Pass HF token
+export HF_TOKEN=hf_xxxxxxxxxxxxx
+docker-compose up
+```
+### Docker Compose with GPU (Linux)
+```yaml
+services:
+  qmodel:
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: 1
+              capabilities: [gpu]
+```
+### Production Tips
+- Remove dev volume mount (`.:/app`) in `docker-compose.yml`
+- Set `restart: on-failure:5`
+- Use specific `ALLOWED_ORIGINS` instead of `*`
 ---
+## Open-WebUI Integration
+QModel is fully OpenAI-compatible and works out of the box with Open-WebUI.
+### Setup
+```bash
+# Start QModel
+python main.py
+# Start Open-WebUI
+docker run -d -p 3000:8080 --name open-webui ghcr.io/open-webui/open-webui:latest
+```
+### Connect
+1. **Settings** → **Models** → **Manage Models**
+2. Click **"Connect to OpenAI-compatible API"**
+3. **API Base URL**: `http://localhost:8000/v1`
+4. **Model Name**: `QModel`
+5. **API Key**: Leave blank
+6. **Save & Test** → ✅ Connected
+### Docker Compose (QModel + Ollama + Open-WebUI)
+```yaml
+version: '3.8'
+services:
+  qmodel:
+    build: .
+    ports:
+      - "8000:8000"
+    environment:
+      - LLM_BACKEND=ollama
+      - OLLAMA_HOST=http://ollama:11434
+  ollama:
+    image: ollama/ollama:latest
+    ports:
+      - "11434:11434"
+  web-ui:
+    image: ghcr.io/open-webui/open-webui:latest
+    ports:
+      - "3000:8080"
+    depends_on:
+      - qmodel
 ```
+### Supported Features
+| Feature | Status |
+|---------|--------|
+| Chat | ✅ Full support |
+| Streaming | ✅ `stream: true` |
+| Multi-turn context | ✅ Handled by Open-WebUI |
+| Temperature | ✅ Configurable |
+| Token limits | ✅ `max_tokens` |
+| Model listing | ✅ `/v1/models` |
+| Source attribution | ✅ `x_metadata.sources` |
 ---
+## Architecture
+### Module Structure
+```
+main.py                    ← FastAPI app + router registration
+app/
+  config.py               ← Config class (env vars)
+  llm.py                  ← LLM providers (Ollama, HuggingFace)
+  cache.py                ← TTL-LRU async cache
+  arabic_nlp.py           ← Arabic normalization, stemming, language detection
+  search.py               ← Hybrid FAISS+BM25, text search, query rewriting
+  analysis.py             ← Intent detection, analytics, counting
+  prompts.py              ← Prompt engineering (persona, anti-hallucination)
+  models.py               ← Pydantic schemas
+  state.py                ← AppState, lifespan, RAG pipeline
+  routers/
+    quran.py              ← 6 Quran endpoints
+    hadith.py             ← 5 Hadith endpoints
+    chat.py               ← /ask + OpenAI-compatible chat
+    ops.py                ← health, models, debug scores
+```
+### Data Pipeline
+1. **Ingest**: 47,626 documents (6,236 Quran verses + 41,390 Hadiths from 9 collections)
+2. **Embed**: Encode with `multilingual-e5-large` (Arabic + English dual embeddings)
+3. **Index**: FAISS `IndexFlatIP` for dense retrieval
+### Retrieval & Ranking
+1. Dense retrieval (FAISS semantic scoring)
+2. Sparse retrieval (BM25 term-frequency)
+3. Fusion: 60% dense + 40% sparse
+4. Intent-aware boost (+0.08 to Hadith when intent=hadith)
+5. Type filter (quran_only / hadith_only / authenticated_only)
+6. Text search fallback (exact phrase + word-overlap)
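Steps 3 and 4 above can be sketched as a weighted sum followed by a conditional boost. This is an illustration rather than the code in `app/search.py`. It assumes that dense and sparse scores are already normalized to comparable ranges, and the dict keys `dense`, `sparse`, `type`, and `score` are hypothetical.

```python
def fuse(dense, sparse, alpha=0.6):
    """Weighted fusion: alpha on the dense score, the remainder on the sparse score."""
    return alpha * dense + (1.0 - alpha) * sparse


def rank(candidates, intent, alpha=0.6, hadith_boost=0.08):
    """Score candidates, boost hadith results on hadith-intent queries, sort best first."""
    scored = []
    for cand in candidates:
        score = fuse(cand["dense"], cand["sparse"], alpha)
        if intent == "hadith" and cand["type"] == "hadith":
            score += hadith_boost  # intent-aware boost from step 4
        scored.append({**cand, "score": score})
    return sorted(scored, key=lambda c: c["score"], reverse=True)
```

With the defaults, a hadith candidate that trails a Quran verse by less than 0.08 after fusion moves ahead of it when the query intent is `hadith`, and stays behind it otherwise.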
+### Anti-Hallucination Measures
+- Few-shot examples including "not found" refusal path
+- Hardcoded citation format rules
+- Verbatim copy rules (no text reconstruction)
+- Confidence threshold gating (default: 0.30)
+- Post-generation citation verification
+- Grade inference from collection name
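The confidence-threshold gate in the list above might work roughly as follows. This is a sketch under stated assumptions: the gate looks only at the best retrieval score, `generate` stands in for whatever function calls the LLM, and the refusal wording is a placeholder.

```python
NOT_FOUND = "Not found in the indexed sources."  # placeholder wording, not QModel's actual message


def answer_or_refuse(ranked, generate, threshold=0.30):
    """Call the LLM only when the best retrieval score reaches the threshold."""
    top_score = ranked[0]["score"] if ranked else 0.0
    if top_score < threshold:
        # Refusing here means the LLM is never asked to answer without support.
        return NOT_FOUND
    return generate(ranked)
```

Because the check runs before generation, a rejected query costs only the retrieval step.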
+### Performance
 | Operation | Time | Backend |
 |-----------|------|---------|
 | Query (cached) | ~50ms | Both |
+| Query (Ollama) | 400–800ms | Ollama |
+| Query (HF GPU) | 500–1500ms | CUDA |
+| Query (HF CPU) | 2–5s | CPU |
 ---
+## Troubleshooting
+### "Cannot connect to Ollama"
 ```bash
+ollama serve                      # Ensure Ollama is running on host
+# In Docker, use OLLAMA_HOST=http://host.docker.internal:11434
 ```
+### "HuggingFace model not found"
 ```bash
+export HF_TOKEN=hf_xxxxxxxxxxxxx  # Set token for gated models
 ```
+### "Out of memory"
+- Use smaller model: `HF_MODEL_NAME=mistralai/Mistral-7B-Instruct-v0.2`
+- Use Ollama with `neural-chat`
+- Reduce `MAX_TOKENS` to 1024
+- Increase Docker memory limit in `docker-compose.yml`
+### "Assistant returns 'Not found'"
+This is expected — QModel rejects low-confidence queries. Try:
+- More specific queries
+- Lower `CONFIDENCE_THRESHOLD` in `.env`
+- Check raw scores: `GET /debug/scores?q=your+query`
+### "Port already in use"
+```bash
+docker-compose down && docker system prune
+# Or change port: ports: ["8001:8000"]
+```
+---
+## Roadmap
+- [x] Grade-based filtering
+- [x] Streaming responses (SSE)
+- [x] Modular architecture (4 routers, 18 endpoints)
+- [x] Dual LLM backend (Ollama + HuggingFace)
+- [x] Text search (exact substring + fuzzy matching)
+- [ ] Chain of narrators (Isnad display)
+- [ ] Synonym expansion (mercy → rahma, compassion)
+- [ ] Batch processing (multiple questions per request)
+- [ ] Islamic calendar integration (Hijri dates)
+- [ ] Tafsir endpoint with scholar citations
 ---

SETUP.md DELETED

@@ -1,590 +0,0 @@
-# QModel v6 Setup & Deployment Guide
-## Quick Start
-### 1. Prerequisites
-- Python 3.10+
-- 16 GB RAM minimum (for embeddings + LLM)
-- GPU recommended for HuggingFace backend
-- Ollama installed (for local development) OR internet access (for HuggingFace)
-### 2. Installation
-```bash
-# Clone and enter project
-cd /Users/elgendy/Projects/QModel
-# Create virtual environment
-python3 -m venv .venv
-source .venv/bin/activate
-# Install dependencies
-pip install -r requirements.txt
-```
-### 3. Data & Index
-The project includes pre-built data files:
-- `metadata.json` — 47,626 documents (6,236 Quran verses + 41,390 hadiths from 9 canonical collections)
-- `QModel.index` — FAISS search index (pre-generated)
-If you need to rebuild the index after dataset changes:
-```bash
-python build_index.py
-```
----
-## Backend Configuration
-QModel supports two LLM backends. Choose based on your environment:
-| Backend | Pros | Cons | When to Use |
-|---------|------|------|------------|
-| **Ollama** (local) | Fast setup, no GPU needed, no model downloads, free | Smaller models, limited customization | Development, testing, resource-constrained |
-| **HuggingFace** (remote) | Larger models, better quality, full control | Requires GPU or significant RAM, slower downloads | Production, high-quality responses |
-### LLM Backend Selection
-**Option 1: Local Ollama (Development)**
-For development, testing, and when you already have Ollama running locally:
-```bash
-LLM_BACKEND=ollama
-OLLAMA_HOST=http://localhost:11434
-OLLAMA_MODEL=llama2              # or: mistral, neural-chat, orca-mini
-```
-**Available Ollama Models:**
-- `llama2` — Fast, good quality (default, recommended)
-- `mistral` — Better Arabic support
-- `neural-chat` — Good balance
-- `openchat` — Good instruction following
-- `orca-mini` — Lightweight
-**Option 2: Remote HuggingFace (Production)**
-For production deployments with better quality and control:
-```bash
-LLM_BACKEND=hf
-HF_MODEL_NAME=Qwen/Qwen2-7B-Instruct    # Excellent Arabic support
-HF_DEVICE=auto                           # auto | cuda | cpu
-HF_MAX_NEW_TOKENS=2048
-```
-**Recommended HuggingFace Models:**
-- `Qwen/Qwen2-7B-Instruct` — Excellent Arabic, strong reasoning (default)
-- `mistralai/Mistral-7B-Instruct-v0.2` — Very capable, fast
-- `meta-llama/Llama-2-13b-chat-hf` — Larger, needs HF token
-**Device Options:**
-- `auto` — Auto-detect (GPU if available, else CPU)
-- `cuda` — Force GPU (requires NVIDIA GPU)
-- `cpu` — Force CPU (slower, but works everywhere)
-### Complete Environment Variables Reference
-#### Backend Selection
-| Variable | Default | Options | Example |
-|----------|---------|---------|---------|
-| `LLM_BACKEND` | `hf` | `ollama`, `hf` | `ollama` |
-#### Ollama Backend
-| Variable | Default | Description | Example |
-|----------|---------|-------------|---------|
-| `OLLAMA_HOST` | `http://localhost:11434` | Ollama server URL | `http://localhost:11434` |
-| `OLLAMA_MODEL` | `llama2` | Model name | `mistral` |
-#### HuggingFace Backend
-| Variable | Default | Description | Example |
-|----------|---------|-------------|---------|
-| `HF_MODEL_NAME` | `Qwen/Qwen2-7B-Instruct` | Model ID | `Qwen/Qwen2-7B-Instruct` |
-| `HF_DEVICE` | `auto` | Device to use | `cuda` |
-| `HF_MAX_NEW_TOKENS` | `2048` | Max output length | `2048` |
-#### Embedding & Data
-| Variable | Default | Description |
-|----------|---------|-------------|
-| `EMBED_MODEL` | `intfloat/multilingual-e5-large` | Embedding model (keep default) |
-| `FAISS_INDEX` | `QModel.index` | Index file path |
-| `METADATA_FILE` | `metadata.json` | Dataset file |
-#### Retrieval & Ranking
-| Variable | Default | Range | Purpose |
-|----------|---------|-------|---------|
-| `TOP_K_SEARCH` | `20` | 5-100 | Candidate pool (⬆️ = slower but more coverage) |
-| `TOP_K_RETURN` | `5` | 1-20 | Results shown to user |
-| `RERANK_ALPHA` | `0.6` | 0.0-1.0 | Dense (0.6) vs Sparse (0.4) weighting |
-#### Generation
-| Variable | Default | Range | Purpose |
-|----------|---------|-------|---------|
-| `TEMPERATURE` | `0.2` | 0.0-1.0 | 0.0=deterministic, 1.0=creative (use 0.1-0.2 for religious) |
-| `MAX_TOKENS` | `2048` | 512-4096 | Max response length |
-#### Safety & Quality
-| Variable | Default | Range | Purpose |
-|----------|---------|-------|---------|
-| `CONFIDENCE_THRESHOLD` | `0.30` | 0.0-1.0 | Min score to call LLM (⬆️ = fewer hallucinations) |
-| `HADITH_BOOST` | `0.08` | 0.0-1.0 | Score boost for hadith on hadith queries |
-#### Other Settings
-| Variable | Default | Description |
-|----------|---------|-------------|
-| `CACHE_SIZE` | `512` | Query response cache entries |
-| `CACHE_TTL` | `3600` | Cache expiry in seconds |
-| `ALLOWED_ORIGINS` | `*` | CORS origins (use specific domains in production) |
-| `MAX_EXAMPLES` | `3` | Few-shot examples in system prompt |
-### Configuration Examples
-**Development (Ollama) - Recommended for getting started**
-```bash
-LLM_BACKEND=ollama
-OLLAMA_HOST=http://localhost:11434
-OLLAMA_MODEL=llama2
-EMBED_MODEL=intfloat/multilingual-e5-large
-FAISS_INDEX=QModel.index
-METADATA_FILE=metadata.json
-TOP_K_SEARCH=20
-TOP_K_RETURN=5
-TEMPERATURE=0.2
-CONFIDENCE_THRESHOLD=0.30
-ALLOWED_ORIGINS=*
-```
-**Production (HuggingFace + GPU) - Best quality, uses GPU**
-```bash
-LLM_BACKEND=hf
-HF_MODEL_NAME=Qwen/Qwen2-7B-Instruct
-HF_DEVICE=cuda
-EMBED_MODEL=intfloat/multilingual-e5-large
-FAISS_INDEX=QModel.index
-METADATA_FILE=metadata.json
-TOP_K_SEARCH=30         # More candidates for better quality
-TOP_K_RETURN=5
-TEMPERATURE=0.1         # More deterministic
-CONFIDENCE_THRESHOLD=0.35
-ALLOWED_ORIGINS=yourdomain.com,api.yourdomain.com
-```
-**Production (HuggingFace + CPU) - CPU-only, slower but no GPU required**
-```bash
-LLM_BACKEND=hf
-HF_MODEL_NAME=Qwen/Qwen2-7B-Instruct
-HF_DEVICE=cpu
-TEMPERATURE=0.1
-MAX_TOKENS=1024         # Reduce for faster responses
-CONFIDENCE_THRESHOLD=0.35
-```
-### Tuning Tips
-**For Better Results:**
-- Increase `TOP_K_SEARCH` (costs slightly more compute)
-- Lower `CONFIDENCE_THRESHOLD` (may get some hallucinations)
-- Use larger model with more parameters
-- Set `TEMPERATURE=0.1` for most consistent answers
-**For Faster Performance:**
-- Lower `TOP_K_SEARCH` and `TOP_K_RETURN`
-- Use Ollama backend (faster inference)
-- Reduce `MAX_TOKENS`
-- Set `HF_DEVICE=cpu` if using HF (faster than auto-selecting)
-**For More Accurate/Conservative Answers:**
-- Increase `CONFIDENCE_THRESHOLD` (skip borderline queries)
-- Lower `TEMPERATURE` (more deterministic)
-- Use larger model (7B+ parameters)
-**For CPU-Only (No GPU Available):**
-- Use Ollama backend with `neural-chat` model
-- Set `HF_DEVICE=cpu` if using HF
-- Reduce `MAX_TOKENS` to 1024
----
-## Running QModel
-### Step-by-Step: Starting the API
-1. **Create `.env` file**:
-   ```bash
-   cp .env.example .env
-   # Edit .env and choose your backend (see Configuration section above)
-   ```
-2. **Start the backend service**:
-   **If using Ollama:**
-   ```bash
-   # Terminal 1: Start Ollama daemon
-   ollama serve
-   # Terminal 2: Pull a model (first time only)
-   ollama pull llama2    # or: mistral, neural-chat
-   ```
-   **If using HuggingFace:**
-   - No separate service needed, models download automatically
-3. **Start QModel API**:
-   ```bash
-   python main.py
-   ```
-API available at `http://localhost:8000`
-View interactive docs: `http://localhost:8000/docs`
-### Docker Option
-```bash
-# Configure your backend in .env (see Configuration section)
-cp .env.example .env
-nano .env               # Choose LLM_BACKEND=ollama or hf
-# Run with Docker Compose
-docker-compose up
-```
-For full Docker documentation (including production deployment, troubleshooting, and multi-container setup), see **[DOCKER.md](DOCKER.md)**.
----
-## API Endpoints
-### Main Query Endpoint
-```bash
-GET /ask?q=<question>&top_k=5&source_type=<filter>&grade_filter=<filter>
-```
-**Parameters:**
-- `q` (required): Your Islamic question
-- `top_k`: Number of sources to retrieve (1-20, default: 5)
-- `source_type`: Filter by source type
-  - `quran` — Quranic verses only
-  - `hadith` — Hadiths only
-  - `null` (default) — Both
-- `grade_filter`: Filter Hadith by authenticity grade
-  - `sahih` — Only Sahih-graded Hadiths
-  - `hasan` — Sahih + Hasan
-  - `null` (default) — All grades
-**Example Requests:**
-```bash
-# General question
-curl "http://localhost:8000/ask?q=What%20does%20Islam%20say%20about%20mercy?"
-# Quran-only with word frequency
-curl "http://localhost:8000/ask?q=How%20many%20times%20is%20mercy%20mentioned?&source_type=quran"
-# Authentic Hadiths only
-curl "http://localhost:8000/ask?q=Hadiths%20about%20prayer&source_type=hadith&grade_filter=sahih"
-```
-**Response:**
-```json
-{
-  "question": "What does Islam say about mercy?",
-  "answer": "Islam emphasizes mercy as a core value...",
-  "language": "english",
-  "intent": "general",
-  "analysis": null,
-  "sources": [
-    {
-      "source": "Surah Al-Baqarah 2:178",
-      "type": "quran",
-      "grade": null,
-      "arabic": "...",
-      "english": "...",
-      "_score": 0.876
-    }
-  ],
-  "top_score": 0.876,
-  "latency_ms": 342
-}
-```
----
-### Hadith Verification Endpoint
-```bash
-GET /hadith/verify?q=<hadith_text>&collection=<filter>
-```
-**Purpose:** Quick authenticity check for a Hadith
-**Example:**
-```bash
-curl "http://localhost:8000/hadith/verify?q=Actions%20are%20judged%20by%20intentions"
-```
-**Response:**
-```json
-{
-  "query": "Actions are judged by intentions",
-  "found": true,
-  "collection": "Sahih al-Bukhari",
-  "grade": "Sahih",
-  "reference": "Sahih al-Bukhari 1",
-  "arabic": "إنما الأعمال بالنيات",
-  "english": "Verily, actions are judged by intentions...",
-  "latency_ms": 156
-}
-```
----
-### Debug Endpoint
-```bash
-GET /debug/scores?q=<question>&top_k=10
-```
-**Purpose:** Inspect raw retrieval scores without LLM call. Use to calibrate `CONFIDENCE_THRESHOLD`.
-**Example:**
-```bash
-curl "http://localhost:8000/debug/scores?q=patience&top_k=10"
-```
-**Response:**
-```json
-{
-  "intent": "general",
-  "threshold": 0.3,
-  "results": [
-    {
-      "rank": 1,
-      "source": "Surah Al-Baqarah 2:45",
-      "type": "quran",
-      "grade": null,
-      "_dense": 0.8234,
-      "_sparse": 0.5421,
-      "_score": 0.7234
-    }
-  ]
-}
-```
-Use this to fine-tune `CONFIDENCE_THRESHOLD`. If queries you expect to work have `_score < threshold`, lower the threshold.
----
-### Health & Metadata
-```bash
-# Health check
-curl http://localhost:8000/health
-# List available models
-curl http://localhost:8000/v1/models
-# Interactive API docs
-# Open in a browser: http://localhost:8000/docs
-```
----
-## Query Examples
-### 1. Word Frequency Analysis
-**Question:** "How many times is the word 'mercy' mentioned in the Quran?"
-**System detects:** `intent=count`
-**Response includes:**
-```json
-{
-  "analysis": {
-    "keyword": "mercy",
-    "total_count": 87,
-    "by_surah": {
-      "2": {"name": "Al-Baqarah", "count": 12},
-      "7": {"name": "Al-A'raf", "count": 8},
-      ...
-    }
-  }
-}
-```
----
-### 2. Topic-Based Aya Retrieval
-**Question:** "What does the Quran say about patience?"
-**System detects:** `intent=tafsir`
-**Response:**
-- Retrieves top 5 verses about patience
-- LLM explains each with Tafsir
-- Shows interconnections between verses
----
-### 3. Hadith Authentication
-**Question:** "Is the Hadith 'Actions are judged by intentions' authentic?"
-**System detects:** `intent=auth`
-**LLM response:**
-- "Yes, this is found in Sahih al-Bukhari 1"
-- "Grade: Sahih (authentic)"
-- "Explanation: This Hadith establishes the principle of intention..."
----
-### 4. Bilingual Support
-**Arabic Question:** "ما أهمية الصبر في الإسلام؟"
-**System detects:** Language = arabic
-**Response:** Full Arabic response with proper vocalization
----
-## Tuning & Optimization
-### Confidence Threshold
-The `CONFIDENCE_THRESHOLD` (default 0.30) controls when to call the LLM:
-- **Too high (e.g., 0.70)**: Many queries rejected as "not found" (safer but less helpful)
-- **Too low (e.g., 0.10)**: LLM called on weak matches (more hallucinations)
-- **Sweet spot (0.30-0.50)**: Most queries get through, but low-quality matches rejected
-**To calibrate:**
-1. Run `/debug/scores` on representative queries
-2. Check what `_score` values are returned
-3. Adjust `CONFIDENCE_THRESHOLD` in `.env`
-4. Restart service
----
-### Temperature
-- **0.0**: Deterministic (best for factual Islamic answers)
-- **0.2**: Slightly creative (default)
-- **0.5+**: More creative (not recommended for religious content)
----
-### Model Selection
-#### For Development (Ollama)
-- **llama2** — Fastest, good quality, easy setup
-- **mistral** — Better Arabic, slightly slower
-- **neural-chat** — Good balance
-```bash
-ollama pull llama2
-OLLAMA_MODEL=llama2 python main.py
-```
-#### For Production (HuggingFace)
-- **Qwen/Qwen2-7B-Instruct** — Strong Arabic, 7B params
-- **mistralai/Mistral-7B-Instruct-v0.2** — Very capable
-- **meta-llama/Llama-2-13b-chat-hf** — Larger, better quality (requires HF token)
-```bash
-HF_MODEL_NAME=Qwen/Qwen2-7B-Instruct python main.py
-```
----
-## Troubleshooting
-### Issue: "Service is still initialising"
-**Solution:** Wait 60-90 seconds for the embedding model to load, then check the logs:
-```bash
-tail -f <logfile>
-```
-### Issue: Low retrieval scores
-**Cause:** Queries don't match the language of the dataset well
-**Solution:**
-1. Check `/debug/scores` output
-2. Ensure query is in Arabic or clear English
-3. Try synonyms (e.g., "mercy" vs "compassion")
-4. Lower `CONFIDENCE_THRESHOLD` in `.env`
-### Issue: LLM model not found (HF backend)
-**Solution:**
-```bash
-huggingface-cli login
-export HF_TOKEN=<your_token>
-```
-### Issue: Out of memory
-**Solution:**
-- Use `OLLAMA_MODEL=neural-chat` (smaller)
-- Set `HF_DEVICE=cpu` (slower but uses RAM instead of VRAM)
-- Reduce `TOP_K_SEARCH` in `.env`
----
-## Production Checklist
-- [ ] Test with at least 10 representative queries
-- [ ] Verify `/debug/scores` on low-confidence queries
-- [ ] Adjust `CONFIDENCE_THRESHOLD` to acceptable false-positive rate
-- [ ] Set `ALLOWED_ORIGINS` to your domain only (security)
-- [ ] Use production-grade LLM model (Qwen 7B+ or Mistral)
-- [ ] Set `TEMPERATURE=0.1` for maximum consistency
-- [ ] Monitor first 100 queries for quality
-- [ ] Enable access logging and error tracking
----
-## Architecture Files
-- **main.py** — Core API + RAG pipeline (LLM backend abstraction, retrieval, generation)
-- **build_index.py** — FAISS index generation from metadata
-- **enrich_dataset.py** — Dataset enrichment script (fetch hadith collections, deduplicate)
-- **metadata.json** — Combined dataset: 6,236 Quran verses + 41,390 hadiths
-- **QModel.index** — FAISS vector index (pre-built, ready to use)
-- **ARCHITECTURE.md** — Detailed system design
-- **requirements.txt** — Python dependencies
----
-## Next Steps
-After setup, consider:
-1. Grade filtering: Try `?grade_filter=sahih` for authenticated-only results
-2. Source filtering: Use `?source_type=quran` vs `?source_type=hadith`
-3. Batch processing: Add endpoint for multiple questions
-4. Webhook integration: Stream answers as they generate
-5. Caching improvements: Persistent Redis cache for production
----
-## Support
-For issues:
-1. Check logs: `python main.py` (stdout)
-2. Test endpoints: http://localhost:8000/docs
-3. Review `/debug/scores` for retrieval quality
-4. Check `.env` configuration
-Happy querying! 🕌

app/routers/chat.py CHANGED

@@ -1,16 +1,18 @@
-"""Chat / inference endpoints — OpenAI-compatible."""
 from __future__ import annotations
 import json
 import logging
 import time
-from fastapi import APIRouter, HTTPException
 from fastapi.responses import StreamingResponse
 from app.config import cfg
 from app.models import (
     ChatCompletionChoice,
     ChatCompletionMessage,
     ChatCompletionRequest,
@@ -23,6 +25,45 @@ logger = logging.getLogger("qmodel.chat")
 router = APIRouter(tags=["inference"])
 # ───────────────────────────────────────────────────────
 # POST /v1/chat/completions — OpenAI-compatible
 # ───────────────────────────────────────────────────────

+"""Chat / inference endpoints — OpenAI-compatible + convenience /ask."""
 from __future__ import annotations
 import json
 import logging
 import time
+from typing import Literal, Optional
+from fastapi import APIRouter, HTTPException, Query
 from fastapi.responses import StreamingResponse
 from app.config import cfg
 from app.models import (
+    AskResponse,
     ChatCompletionChoice,
     ChatCompletionMessage,
     ChatCompletionRequest,
 router = APIRouter(tags=["inference"])
+# ───────────────────────────────────────────────────────
+# GET /ask — convenience RAG query endpoint
+# ───────────────────────────────────────────────────────
+@router.get("/ask", response_model=AskResponse)
+async def ask(
+    q: str = Query(..., min_length=1, max_length=500, description="Your Islamic question"),
+    top_k: int = Query(5, ge=1, le=20, description="Number of sources to retrieve"),
+    source_type: Optional[Literal["quran", "hadith"]] = Query(None, description="Filter: quran | hadith"),
+    grade_filter: Optional[str] = Query(None, description="Hadith grade filter: sahih | hasan"),
+):
+    """Direct RAG query with full source attribution.
+    Returns an AI-generated answer grounded in Quran and Hadith sources,
+    with language detection, intent classification, and scored references.
+    """
+    check_ready()
+    result = await run_rag_pipeline(q, top_k=top_k, source_type=source_type, grade_filter=grade_filter)
+    return AskResponse(
+        question=q,
+        answer=result["answer"],
+        language=result["language"],
+        intent=result["intent"],
+        analysis=result.get("analysis"),
+        sources=[
+            {
+                "source":  s.get("source") or s.get("reference", ""),
+                "type":    s.get("type", ""),
+                "grade":   s.get("grade"),
+                "arabic":  s.get("arabic", ""),
+                "english": s.get("english", ""),
+                "_score":  round(s.get("_score", 0), 4),
+            }
+            for s in result.get("sources", [])
+        ],
+        top_score=round(result["top_score"], 4),
+        latency_ms=result["latency_ms"],
+    )
 # ───────────────────────────────────────────────────────
 # POST /v1/chat/completions — OpenAI-compatible
 # ───────────────────────────────────────────────────────
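For reference, a client-side sketch of how a request to the new `GET /ask` route could be assembled. It mirrors the `Query(...)` constraints declared above (`q` of 1-500 characters, `top_k` in 1-20, `source_type` limited to `quran` or `hadith`). The helper name `build_ask_url` and the `BASE_URL` default are assumptions made for illustration and are not part of this change.

```python
from urllib.parse import urlencode

# Assumption: a local development server on port 8000.
BASE_URL = "http://localhost:8000"


def build_ask_url(q, top_k=5, source_type=None, grade_filter=None, base_url=BASE_URL):
    """Build a GET /ask URL (hypothetical helper, not part of the repository).

    The checks repeat the server-side Query(...) constraints so that an
    invalid request fails locally instead of returning a 422 response.
    """
    if not 1 <= len(q) <= 500:
        raise ValueError("q must be 1-500 characters")
    if not 1 <= top_k <= 20:
        raise ValueError("top_k must be between 1 and 20")
    if source_type not in (None, "quran", "hadith"):
        raise ValueError("source_type must be 'quran', 'hadith', or None")

    params = {"q": q, "top_k": top_k}
    # Optional filters are left out entirely when unset, so the server
    # falls back to its defaults (both source types, all grades).
    if source_type is not None:
        params["source_type"] = source_type
    if grade_filter is not None:
        params["grade_filter"] = grade_filter
    return f"{base_url}/ask?{urlencode(params)}"


if __name__ == "__main__":
    print(build_ask_url("Hadiths about prayer", source_type="hadith", grade_filter="sahih"))
```

`urlencode` percent-encodes the question text, which avoids hand-escaping spaces and non-ASCII characters in Arabic queries.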

app/routers/ops.py CHANGED

@@ -1,14 +1,16 @@
-"""Operational endpoints — health, models."""
 from __future__ import annotations
 import time
-from fastapi import APIRouter
 from app.config import cfg
 from app.models import ModelInfo, ModelsListResponse
-from app.state import state
 router = APIRouter(tags=["ops"])
@@ -18,7 +20,7 @@ def health():
     """Health check endpoint."""
     return {
         "status":               "ok" if state.ready else "initialising",
-        "version":              "5.0.0",
         "llm_backend":          cfg.LLM_BACKEND,
         "dataset_size":         len(state.dataset) if state.dataset else 0,
         "faiss_total":          state.faiss_index.ntotal if state.faiss_index else 0,
@@ -35,3 +37,40 @@ def list_models():
             ModelInfo(id="qmodel",  created=int(time.time()), owned_by="elgendy"),
         ]
     )

+"""Operational endpoints — health, models, debug."""
 from __future__ import annotations
 import time
+from typing import Literal, Optional
+from fastapi import APIRouter, Query
 from app.config import cfg
 from app.models import ModelInfo, ModelsListResponse
+from app.search import hybrid_search, rewrite_query
+from app.state import check_ready, state
 router = APIRouter(tags=["ops"])
     """Health check endpoint."""
     return {
         "status":               "ok" if state.ready else "initialising",
+        "version":              "6.0.0",
         "llm_backend":          cfg.LLM_BACKEND,
         "dataset_size":         len(state.dataset) if state.dataset else 0,
         "faiss_total":          state.faiss_index.ntotal if state.faiss_index else 0,
             ModelInfo(id="qmodel",  created=int(time.time()), owned_by="elgendy"),
         ]
     )
+@router.get("/debug/scores", tags=["debug"])
+async def debug_scores(
+    q: str = Query(..., min_length=1, max_length=500, description="Query to inspect"),
+    top_k: int = Query(10, ge=1, le=50, description="Number of results"),
+    source_type: Optional[Literal["quran", "hadith"]] = Query(None, description="Filter: quran | hadith"),
+):
+    """Inspect raw retrieval scores without calling the LLM.
+    Use this to calibrate CONFIDENCE_THRESHOLD and debug search quality.
+    """
+    check_ready()
+    rewrite = await rewrite_query(q, state.llm)
+    results = await hybrid_search(
+        q, rewrite,
+        state.embed_model, state.faiss_index, state.dataset,
+        top_n=top_k, source_type=source_type,
+    )
+    return {
+        "query":     q,
+        "intent":    rewrite.get("intent", "general"),
+        "threshold": cfg.CONFIDENCE_THRESHOLD,
+        "count":     len(results),
+        "results": [
+            {
+                "rank":    i + 1,
+                "source":  r.get("source") or r.get("reference", ""),
+                "type":    r.get("type", ""),
+                "grade":   r.get("grade"),
+                "_dense":  round(r.get("_dense", 0), 4),
+                "_sparse": round(r.get("_sparse", 0), 4),
+                "_score":  round(r.get("_score", 0), 4),
+            }
+            for i, r in enumerate(results)
+        ],
+    }
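One possible way to use the `/debug/scores` payload for calibrating `CONFIDENCE_THRESHOLD` is sketched below. It takes a list of already-fetched responses in the shape returned above and reports, for each candidate threshold, the fraction of queries whose best `_score` would clear it. The function names, the `>=` comparison, and the sample scores are assumptions for illustration only; the pipeline's actual comparison is not shown in this diff.

```python
def pass_rate(responses, threshold):
    """Fraction of queries whose highest `_score` is at or above the threshold.

    Assumption: the pipeline accepts a query when its top score is >= the
    threshold. Adjust the comparison if the real check differs.
    """
    if not responses:
        return 0.0
    passed = 0
    for resp in responses:
        scores = [r.get("_score", 0.0) for r in resp.get("results", [])]
        if scores and max(scores) >= threshold:
            passed += 1
    return passed / len(responses)


def sweep(responses, candidates=(0.2, 0.3, 0.4, 0.5)):
    """Map each candidate threshold to its pass rate."""
    return {t: pass_rate(responses, t) for t in candidates}


if __name__ == "__main__":
    # Made-up payloads in the /debug/scores shape, not real measurements.
    sample = [
        {"query": "patience", "results": [{"_score": 0.72}, {"_score": 0.41}]},
        {"query": "mercy", "results": [{"_score": 0.35}]},
        {"query": "unrelated", "results": [{"_score": 0.18}]},
    ]
    for threshold, rate in sweep(sample).items():
        print(f"threshold={threshold:.2f} pass_rate={rate:.2f}")
```

Running the sweep over a set of representative queries shows how many of them a given threshold would reject before the setting is changed in `.env`.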

main.py CHANGED

@@ -33,7 +33,7 @@ logging.basicConfig(
 from app.config import cfg
 from app.state import lifespan
-from app.routers import chat, ops
 # ═══════════════════════════════════════════════════════════════════════
 # FASTAPI APP
@@ -47,7 +47,7 @@ app = FastAPI(
         "- Streaming support\n"
         "- Islamic knowledge RAG pipeline"
     ),
-    version="5.0.0",
     lifespan=lifespan,
 )
@@ -62,6 +62,8 @@ app.add_middleware(
 # Register routers
 app.include_router(ops.router)
 app.include_router(chat.router)
 if __name__ == "__main__":

 from app.config import cfg
 from app.state import lifespan
+from app.routers import chat, hadith, ops, quran
 # ═══════════════════════════════════════════════════════════════════════
 # FASTAPI APP
         "- Streaming support\n"
         "- Islamic knowledge RAG pipeline"
     ),
+    version="6.0.0",
     lifespan=lifespan,
 )
 # Register routers
 app.include_router(ops.router)
 app.include_router(chat.router)
+app.include_router(quran.router)
+app.include_router(hadith.router)
 if __name__ == "__main__":