---
title: QModel
emoji: 📖
colorFrom: green
colorTo: blue
sdk: docker
app_port: 8000
license: mit
tags:
  - quran
  - hadith
  - islamic
  - rag
  - faiss
  - nlp
  - arabic
language:
  - ar
  - en
---
# QModel v6 – Islamic RAG System

**Specialized Qur'an & Hadith Knowledge System with Dual LLM Support**

A production-ready Retrieval-Augmented Generation system specialized exclusively in authenticated Islamic knowledge. No hallucinations, no outside knowledge – only content from verified sources.
## Features

### 📖 Qur'an Capabilities

- **Verse Lookup**: Find verses by topic or keyword
- **Word Frequency**: Count occurrences with a per-Surah breakdown
- **Bilingual**: Full Arabic + English translation support
- **Tafsir Integration**: AI-powered contextual interpretation
### 📜 Hadith Capabilities

- **Authenticity Verification**: Check whether a Hadith appears in the authenticated collections
- **Grade Display**: Show Sahih/Hasan/Da'if authenticity levels
- **Topic Search**: Find relevant Hadiths across 9 major collections
- **Collection Navigation**: Filter by Bukhari, Muslim, Abu Dawud, etc.
### 🛡️ Safety Features

- **Confidence Gating**: Low-confidence queries return "not found" instead of guesses
- **Source Attribution**: Every answer cites the exact verse/Hadith reference
- **Verbatim Quotes**: Text is copied directly from the data, never paraphrased
- **Anti-Hallucination**: Hardened prompts with few-shot "not found" examples
### 🔌 Integration

- **OpenAI-Compatible API**: Use with Open-WebUI, LangChain, or any OpenAI client
- **OpenAI Schema**: Full support for `/v1/chat/completions` and `/v1/models`
- **Streaming Responses**: SSE streaming for long-form answers
### ⚙️ Technical

- **Dual LLM Backend**: Ollama (dev) + HuggingFace (prod)
- **Hybrid Search**: Dense (FAISS) + sparse (BM25) scoring
- **Async API**: FastAPI with async/await throughout
- **Caching**: TTL-based LRU cache for frequent queries
- **Scale**: 6,236 Qur'anic verses + 41,390 Hadiths indexed
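The TTL-based LRU cache mentioned above can be sketched as follows. This is a minimal synchronous illustration of the idea (least-recently-used eviction plus per-entry expiry), not the project's actual async `app/cache.py`:

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """LRU cache whose entries also expire after a fixed time-to-live."""

    def __init__(self, max_size=512, ttl=3600):
        self.max_size = max_size
        self.ttl = ttl
        self._data = OrderedDict()  # key -> (value, insertion timestamp)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, ts = item
        if time.time() - ts > self.ttl:  # entry expired: drop it
            del self._data[key]
            return None
        self._data.move_to_end(key)      # mark as recently used
        return value

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = (value, time.time())
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict least recently used
```

The defaults mirror the documented `CACHE_SIZE=512` and `CACHE_TTL=3600` settings.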
## Quick Start

### Prerequisites

- Python 3.10+
- 16 GB RAM minimum (for embeddings + LLM)
- GPU recommended for the HuggingFace backend
- Ollama installed (for local development) OR internet access (for HuggingFace)

### Installation

```bash
# Clone and enter the project
git clone https://github.com/Logicsoft/QModel.git && cd QModel
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Configure (choose one backend)
# Option A – Ollama (local development):
export LLM_BACKEND=ollama
export OLLAMA_MODEL=llama2
# Make sure Ollama is running: ollama serve

# Option B – HuggingFace (production):
export LLM_BACKEND=hf
export HF_MODEL_NAME=Qwen/Qwen2-7B-Instruct

# Run
python main.py

# Query
curl "http://localhost:8000/ask?q=What%20does%20Islam%20say%20about%20mercy?"
```

API docs: http://localhost:8000/docs
## Data & Index

Pre-built data files are included:

- `metadata.json` – 47,626 documents (6,236 Qur'an verses + 41,390 Hadiths from 9 canonical collections)
- `QModel.index` – FAISS search index

To rebuild after dataset changes:

```bash
python build_index.py
```
## Example Queries

```bash
# Basic question
curl "http://localhost:8000/ask?q=What%20does%20Islam%20say%20about%20mercy?"

# Word frequency
curl "http://localhost:8000/ask?q=How%20many%20times%20is%20mercy%20mentioned?"

# Authentic Hadiths only
curl "http://localhost:8000/ask?q=prayer&source_type=hadith&grade_filter=sahih"

# Quran text search
curl "http://localhost:8000/quran/search?q=bismillah"

# Quran topic search
curl "http://localhost:8000/quran/topic?topic=patience&top_k=5"

# Quran word frequency
curl "http://localhost:8000/quran/word-frequency?word=mercy"

# Single chapter
curl "http://localhost:8000/quran/chapter/2"

# Exact verse
curl "http://localhost:8000/quran/verse/2:255"

# Hadith text search
curl "http://localhost:8000/hadith/search?q=actions+are+judged+by+intentions"

# Hadith topic search (Sahih only)
curl "http://localhost:8000/hadith/topic?topic=fasting&grade_filter=sahih"

# Verify Hadith authenticity
curl "http://localhost:8000/hadith/verify?q=Actions%20are%20judged%20by%20intentions"

# Browse a collection
curl "http://localhost:8000/hadith/collection/bukhari?limit=5"

# Streaming (OpenAI-compatible)
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"QModel","messages":[{"role":"user","content":"What does Islam say about charity?"}],"stream":true}'
```
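The same streaming endpoint can be consumed from Python. A sketch using `requests` (assuming the server is running locally and emits standard OpenAI-style SSE frames; the helper names are illustrative, not part of QModel):

```python
import json
import requests

def parse_sse_delta(line):
    """Extract the text delta from one SSE 'data:' line, or None."""
    if not line or not line.startswith("data:"):
        return None
    data = line[len("data:"):].strip()
    if data == "[DONE]":          # OpenAI-style end-of-stream sentinel
        return None
    delta = json.loads(data)["choices"][0]["delta"]
    return delta.get("content")

def ask_qmodel(question, base_url="http://localhost:8000"):
    """Stream an answer from QModel's OpenAI-compatible endpoint."""
    payload = {
        "model": "QModel",
        "messages": [{"role": "user", "content": question}],
        "stream": True,
    }
    parts = []
    with requests.post(f"{base_url}/v1/chat/completions",
                       json=payload, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines(decode_unicode=True):
            chunk = parse_sse_delta(line)
            if chunk:
                parts.append(chunk)
    return "".join(parts)
```

Any OpenAI-compatible client library should work the same way by pointing its base URL at `http://localhost:8000/v1`.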
## Configuration

All configuration is via environment variables (`.env` file or exported directly).

### Backend Selection

| Backend | Pros | Cons | When to Use |
|---|---|---|---|
| Ollama | Fast setup, no GPU, free | Smaller models | Development, testing |
| HuggingFace | Larger models, better quality | Requires GPU or significant RAM | Production |

### Ollama Backend (Development)

```bash
LLM_BACKEND=ollama
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=llama2   # or: mistral, neural-chat, orca-mini
```

Requires `ollama serve` running and the model pulled (`ollama pull llama2`).

### HuggingFace Backend (Production)

```bash
LLM_BACKEND=hf
HF_MODEL_NAME=Qwen/Qwen2-7B-Instruct
HF_DEVICE=auto        # auto | cuda | cpu
HF_MAX_NEW_TOKENS=2048
```
### All Environment Variables

| Variable | Default | Description |
|---|---|---|
| **Backend** | | |
| `LLM_BACKEND` | `hf` | `ollama` or `hf` |
| `OLLAMA_HOST` | `http://localhost:11434` | Ollama server URL |
| `OLLAMA_MODEL` | `llama2` | Ollama model name |
| `HF_MODEL_NAME` | `Qwen/Qwen2-7B-Instruct` | HuggingFace model ID |
| `HF_DEVICE` | `auto` | `auto`, `cuda`, or `cpu` |
| `HF_MAX_NEW_TOKENS` | `2048` | Max output length |
| **Embedding & Data** | | |
| `EMBED_MODEL` | `intfloat/multilingual-e5-large` | Embedding model |
| `FAISS_INDEX` | `QModel.index` | Index file path |
| `METADATA_FILE` | `metadata.json` | Dataset file |
| **Retrieval** | | |
| `TOP_K_SEARCH` | `20` | Candidate pool (5–100) |
| `TOP_K_RETURN` | `5` | Results shown to the user (1–20) |
| `RERANK_ALPHA` | `0.6` | Dense vs. sparse weight (0.0–1.0) |
| **Generation** | | |
| `TEMPERATURE` | `0.2` | Creativity (0.0–1.0; use 0.1–0.2 for religious content) |
| `MAX_TOKENS` | `2048` | Max response length |
| **Safety** | | |
| `CONFIDENCE_THRESHOLD` | `0.30` | Min score to call the LLM (higher = fewer hallucinations) |
| `HADITH_BOOST` | `0.08` | Score boost for Hadiths on Hadith queries |
| **Other** | | |
| `CACHE_SIZE` | `512` | Query response cache entries |
| `CACHE_TTL` | `3600` | Cache expiry in seconds |
| `ALLOWED_ORIGINS` | `*` | CORS origins |
| `MAX_EXAMPLES` | `3` | Few-shot examples in the system prompt |
### Configuration Examples

#### Development (Ollama)

```bash
LLM_BACKEND=ollama
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=llama2
TEMPERATURE=0.2
CONFIDENCE_THRESHOLD=0.30
ALLOWED_ORIGINS=*
```

#### Production (HuggingFace + GPU)

```bash
LLM_BACKEND=hf
HF_MODEL_NAME=Qwen/Qwen2-7B-Instruct
HF_DEVICE=cuda
TOP_K_SEARCH=30
TEMPERATURE=0.1
CONFIDENCE_THRESHOLD=0.35
ALLOWED_ORIGINS=yourdomain.com,api.yourdomain.com
```
### Tuning Tips

- **Better results**: Increase `TOP_K_SEARCH`, lower `CONFIDENCE_THRESHOLD`, use `TEMPERATURE=0.1`
- **Faster performance**: Lower `TOP_K_SEARCH` and `TOP_K_RETURN`, reduce `MAX_TOKENS`, use Ollama
- **More conservative**: Increase `CONFIDENCE_THRESHOLD`, lower `TEMPERATURE`
## Docker Deployment

### Docker Compose (Recommended)

```bash
cp .env.example .env   # Configure backend (see Configuration section)
docker-compose up
```

### Docker CLI

```bash
docker build -t qmodel .

# With Ollama backend
docker run -p 8000:8000 \
  --env-file .env \
  --add-host host.docker.internal:host-gateway \
  qmodel

# With HuggingFace backend
docker run -p 8000:8000 \
  --env-file .env \
  --env HF_TOKEN=your_token_here \
  qmodel
```
### Docker with Ollama

```bash
# .env
LLM_BACKEND=ollama
OLLAMA_HOST=http://host.docker.internal:11434
OLLAMA_MODEL=llama2
```

Requires Ollama running on the host (`ollama serve`).

### Docker with HuggingFace

```bash
# .env
LLM_BACKEND=hf
HF_MODEL_NAME=Qwen/Qwen2-7B-Instruct
HF_DEVICE=auto
```

```bash
# Pass the HF token
export HF_TOKEN=hf_xxxxxxxxxxxxx
docker-compose up
```
### Docker Compose with GPU (Linux)

```yaml
services:
  qmodel:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```
### Production Tips

- Remove the dev volume mount (`.:/app`) in `docker-compose.yml`
- Set `restart: on-failure:5`
- Use specific `ALLOWED_ORIGINS` instead of `*`
## Open-WebUI Integration

QModel is fully OpenAI-compatible and works out of the box with Open-WebUI.

### Setup

```bash
# Start QModel
python main.py

# Start Open-WebUI
docker run -d -p 3000:8080 --name open-webui ghcr.io/open-webui/open-webui:latest
```

### Connect

1. Settings → Models → Manage Models
2. Click "Connect to OpenAI-compatible API"
3. API Base URL: `http://localhost:8000/v1`
4. Model Name: `QModel`
5. API Key: leave blank
6. Save & Test → ✅ Connected
Docker Compose (QModel + Ollama + Open-WebUI)
version: '3.8'
services:
qmodel:
build: .
ports:
- "8000:8000"
environment:
- LLM_BACKEND=ollama
- OLLAMA_HOST=http://ollama:11434
ollama:
image: ollama/ollama:latest
ports:
- "11434:11434"
web-ui:
image: ghcr.io/open-webui/open-webui:latest
ports:
- "3000:8080"
depends_on:
- qmodel
### Supported Features

| Feature | Status |
|---|---|
| Chat | ✅ Full support |
| Streaming | ✅ `stream: true` |
| Multi-turn context | ✅ Handled by Open-WebUI |
| Temperature | ✅ Configurable |
| Token limits | ✅ `max_tokens` |
| Model listing | ✅ `/v1/models` |
| Source attribution | ✅ `x_metadata.sources` |
## Architecture

### Module Structure

```text
main.py              → FastAPI app + router registration
app/
  config.py          → Config class (env vars)
  llm.py             → LLM providers (Ollama, HuggingFace)
  cache.py           → TTL-LRU async cache
  arabic_nlp.py      → Arabic normalization, stemming, language detection
  search.py          → Hybrid FAISS+BM25, text search, query rewriting
  analysis.py        → Intent detection, analytics, counting
  prompts.py         → Prompt engineering (persona, anti-hallucination)
  models.py          → Pydantic schemas
  state.py           → AppState, lifespan, RAG pipeline
  routers/
    quran.py         → 6 Qur'an endpoints
    hadith.py        → 5 Hadith endpoints
    chat.py          → /ask + OpenAI-compatible chat
    ops.py           → health, models, debug scores
```
### Data Pipeline

1. **Ingest**: 47,626 documents (6,236 Qur'an verses + 41,390 Hadiths from 9 collections)
2. **Embed**: Encode with `multilingual-e5-large` (Arabic + English dual embeddings)
3. **Index**: FAISS `IndexFlatIP` for dense retrieval
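`IndexFlatIP` performs exact (brute-force) inner-product search, which equals cosine similarity once the embeddings are L2-normalized. Its behavior can be illustrated with a NumPy equivalent (a sketch of the retrieval math only; the real system uses the FAISS index built by `build_index.py`):

```python
import numpy as np

def build_index(embeddings):
    """L2-normalize rows so inner product equals cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.clip(norms, 1e-12, None)

def search(index, query_vec, top_k=5):
    """Exact inner-product search, mimicking faiss.IndexFlatIP.search()."""
    q = query_vec / max(np.linalg.norm(query_vec), 1e-12)
    scores = index @ q                   # one dot product per document
    order = np.argsort(-scores)[:top_k]  # highest similarity first
    return scores[order], order
```

FAISS's flat index does exactly this scan in optimized C++; at 47,626 documents an exhaustive scan is still fast, which is why no approximate index (IVF/HNSW) is needed here.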
### Retrieval & Ranking

1. Dense retrieval (FAISS semantic scoring)
2. Sparse retrieval (BM25 term frequency)
3. Fusion: 60% dense + 40% sparse
4. Intent-aware boost (+0.08 for Hadiths when intent = hadith)
5. Type filter (quran_only / hadith_only / authenticated_only)
6. Text-search fallback (exact phrase + word overlap)
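With the default `RERANK_ALPHA=0.6`, the fusion step is a weighted sum of the two score lists after putting them on a common scale. Roughly (a sketch of the idea with min-max normalization assumed; the project's actual `app/search.py` may normalize differently):

```python
def fuse_scores(dense, sparse, alpha=0.6):
    """Blend dense (FAISS) and sparse (BM25) scores per document id.

    Both inputs are dicts of doc_id -> raw score. Each list is min-max
    normalized first so the two scales are comparable; alpha weights
    the dense side (0.6 dense / 0.4 sparse by default).
    """
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero on ties
        return {d: (s - lo) / span for d, s in scores.items()}

    d, s = normalize(dense), normalize(sparse)
    ids = set(d) | set(s)  # a doc found by only one retriever still scores
    return {i: alpha * d.get(i, 0.0) + (1 - alpha) * s.get(i, 0.0)
            for i in ids}
```

Raising `alpha` toward 1.0 favors semantic matches; lowering it favors exact term overlap, which helps for Arabic keyword queries.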
### Anti-Hallucination Measures

- Few-shot examples including a "not found" refusal path
- Hardcoded citation-format rules
- Verbatim copy rules (no text reconstruction)
- Confidence-threshold gating (default: 0.30)
- Post-generation citation verification
- Grade inference from the collection name
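The confidence gate is the simplest of these measures: if the best retrieval score falls below `CONFIDENCE_THRESHOLD`, the LLM is never called. A sketch of the idea (function and message names are illustrative, not QModel's actual code):

```python
def answer_or_refuse(ranked_results, generate, threshold=0.30):
    """Call the LLM only when retrieval is confident enough.

    ranked_results: list of (score, document) pairs sorted best-first.
    generate: callable that turns the retrieved documents into an answer.
    """
    if not ranked_results or ranked_results[0][0] < threshold:
        # Refuse rather than let the model guess without grounding.
        return "Not found in the authenticated sources."
    docs = [doc for _, doc in ranked_results]
    return generate(docs)
```

Because the gate sits before generation, a failed lookup costs only the retrieval step, which also explains the fast "not found" responses.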
## Performance

| Operation | Time | Backend |
|---|---|---|
| Query (cached) | ~50 ms | Both |
| Query (Ollama) | 400–800 ms | Ollama |
| Query (HF GPU) | 500–1500 ms | CUDA |
| Query (HF CPU) | 2–5 s | CPU |
## Troubleshooting

### "Cannot connect to Ollama"

```bash
ollama serve   # Ensure Ollama is running on the host
# In Docker, use OLLAMA_HOST=http://host.docker.internal:11434
```

### "HuggingFace model not found"

```bash
export HF_TOKEN=hf_xxxxxxxxxxxxx   # Set a token for gated models
```

### "Out of memory"

- Use a smaller model: `HF_MODEL_NAME=mistralai/Mistral-7B-Instruct-v0.2`
- Use Ollama with `neural-chat`
- Reduce `MAX_TOKENS` to 1024
- Increase the Docker memory limit in `docker-compose.yml`

### "Assistant returns 'Not found'"

This is expected – QModel rejects low-confidence queries. Try:

- More specific queries
- Lowering `CONFIDENCE_THRESHOLD` in `.env`
- Checking raw scores: `GET /debug/scores?q=your+query`

### "Port already in use"

```bash
docker-compose down && docker system prune
# Or change the port: ports: ["8001:8000"]
```
## Roadmap

- [x] Grade-based filtering
- [x] Streaming responses (SSE)
- [x] Modular architecture (4 routers, 16 endpoints)
- [x] Dual LLM backend (Ollama + HuggingFace)
- [x] Text search (exact substring + fuzzy matching)
- [ ] Chain of narrators (Isnad display)
- [ ] Synonym expansion (mercy → rahma, compassion)
- [ ] Batch processing (multiple questions per request)
- [ ] Islamic calendar integration (Hijri dates)
- [ ] Tafsir endpoint with scholar citations
## Data Sources

- **Qur'an**: `risan/quran-json` – 114 Surahs, 6,236 verses
- **Hadith**: `AhmedBaset/hadith-json` – 9 canonical collections, 41,390 Hadiths
## Architecture Overview

```text
User Query
    ↓
Query Rewriting & Intent Detection
    ↓
Hybrid Search (FAISS dense + BM25 sparse)
    ↓
Filtering & Ranking
    ↓
Confidence Gate (skip LLM if low-scoring)
    ↓
LLM Generation (Ollama or HuggingFace)
    ↓
Formatted Response with Sources
```

See `ARCHITECTURE.md` for the detailed system design.
## Troubleshooting Quick Reference

| Issue | Solution |
|---|---|
| "Service is initialising" | Wait 60–90 s for the embedding model to load |
| Low retrieval scores | Check `/debug/scores`, try synonyms, lower the threshold |
| "Model not found" (HF) | Run `huggingface-cli login` |
| Out of memory | Use a smaller model or the CPU backend |
| No results | Verify the data files exist: `metadata.json` and `QModel.index` |

See `SETUP.md` and `DOCKER.md` for more detailed troubleshooting.
## What's New in v6

- ✨ **Dual LLM Backend** – Ollama (dev) + HuggingFace (prod)
- ✨ **Grade Filtering** – Return only Sahih/Hasan authenticated Hadiths
- ✨ **Source Filtering** – Qur'an-only or Hadith-only queries
- ✨ **Hadith Verification** – `/hadith/verify` endpoint
- ✨ **Enhanced Frequency** – Word counts by Surah
- ✨ **OpenAI Compatible** – Use with any OpenAI client
- ✨ **Production Ready** – Structured logging, error handling, async throughout
## Next Steps

- **Get Started**: See `SETUP.md`
- **Integrate with Open-WebUI**: See `OPEN_WEBUI.md`
- **Deploy with Docker**: See `DOCKER.md`
- **Understand the Architecture**: See `ARCHITECTURE.md`
## License

This project uses open-source data from:

- Qur'an JSON – open source
- Hadith API – open source

See the individual repositories for license details.

---

Made with ❤️ for Islamic scholarship.

**Version 6.0.0 | March 2025 | Production-Ready**