---
title: QModel
emoji: 🕌
colorFrom: green
colorTo: blue
sdk: docker
app_port: 8000
license: mit
tags:
- quran
- hadith
- islamic
- rag
- faiss
- nlp
- arabic
language:
- ar
- en
---
# QModel v6 — Islamic RAG System
**Specialized Qur'an & Hadith Knowledge System with Dual LLM Support**
> A production-ready Retrieval-Augmented Generation system specialized exclusively in authenticated Islamic knowledge. No hallucinations, no outside knowledge — only content from verified sources.
![Version](https://img.shields.io/badge/version-6.0.0-blue)
![Backend](https://img.shields.io/badge/backend-ollama%20%7C%20huggingface-green)
![Status](https://img.shields.io/badge/status-production--ready-success)
---
## Features
### 📖 Qur'an Capabilities
- **Verse Lookup**: Find verses by topic or keyword
- **Word Frequency**: Count occurrences with Surah breakdown
- **Bilingual**: Full Arabic + English translation support
- **Tafsir Integration**: AI-powered contextual interpretation
### 📚 Hadith Capabilities
- **Authenticity Verification**: Check if Hadith is in authenticated collections
- **Grade Display**: Show Sahih/Hasan/Da'if authenticity levels
- **Topic Search**: Find relevant Hadiths across 9 major collections
- **Collection Navigation**: Filter by Bukhari, Muslim, Abu Dawud, etc.
### 🛡️ Safety Features
- **Confidence Gating**: Low-confidence queries return "not found" instead of guesses
- **Source Attribution**: Every answer cites exact verse/Hadith reference
- **Verbatim Quotes**: Text copied directly from data, never paraphrased
- **Anti-Hallucination**: Hardened prompts with few-shot "not found" examples
### 🚀 Integration
- **OpenAI-Compatible API**: Use with Open-WebUI, Langchain, or any OpenAI client
- **OpenAI Schema**: Full support for `/v1/chat/completions` and `/v1/models`
- **Streaming Responses**: SSE streaming for long-form answers
### ⚙️ Technical
- **Dual LLM Backend**: Ollama (dev) + HuggingFace (prod)
- **Hybrid Search**: Dense (FAISS) + Sparse (BM25) scoring
- **Async API**: FastAPI with async/await throughout
- **Caching**: TTL-based LRU cache for frequent queries
- **Scale**: 6,236 Quranic verses + 41,390 Hadiths indexed
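The TTL-based LRU cache can be pictured as follows (a minimal synchronous sketch; class and method names are illustrative and not the actual `app/cache.py` implementation, which is async):

```python
# Illustrative TTL + LRU cache: entries expire after `ttl` seconds, and the
# least recently used entry is evicted once `max_size` is exceeded.
import time
from collections import OrderedDict

class TTLLRUCache:
    def __init__(self, max_size=512, ttl=3600):
        self.max_size = max_size
        self.ttl = ttl
        self._store = OrderedDict()  # key -> (value, expiry time)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._store[key]          # expired: evict and report a miss
            return None
        self._store.move_to_end(key)      # mark as most recently used
        return value

    def put(self, key, value):
        if key in self._store:
            self._store.move_to_end(key)
        self._store[key] = (value, time.monotonic() + self.ttl)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict least recently used

cache = TTLLRUCache(max_size=2, ttl=60)
cache.put("q1", "answer 1")
cache.put("q2", "answer 2")
cache.get("q1")              # touch q1 so q2 becomes least recently used
cache.put("q3", "answer 3")  # evicts q2
```

The defaults mirror `CACHE_SIZE=512` and `CACHE_TTL=3600` from the configuration table below.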
---
## Quick Start
### Prerequisites
- Python 3.10+
- 16 GB RAM minimum (for embeddings + LLM)
- GPU recommended for HuggingFace backend
- Ollama installed (for local development) OR internet access (for HuggingFace)
### Installation
```bash
# Clone and enter project
git clone https://github.com/Logicsoft/QModel.git && cd QModel
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# Configure (choose one backend)
# Option A — Ollama (local development):
export LLM_BACKEND=ollama
export OLLAMA_MODEL=llama2
# Make sure Ollama is running: ollama serve
# Option B — HuggingFace (production):
export LLM_BACKEND=hf
export HF_MODEL_NAME=Qwen/Qwen2-7B-Instruct
# Run
python main.py
# Query
curl "http://localhost:8000/ask?q=What%20does%20Islam%20say%20about%20mercy?"
```
API docs: http://localhost:8000/docs
### Data & Index
Pre-built data files are included:
- `metadata.json` — 47,626 documents (6,236 Quran verses + 41,390 hadiths from 9 canonical collections)
- `QModel.index` — FAISS search index
To rebuild after dataset changes:
```bash
python build_index.py
```
---
## Example Queries
```bash
# Basic question
curl "http://localhost:8000/ask?q=What%20does%20Islam%20say%20about%20mercy?"
# Word frequency
curl "http://localhost:8000/ask?q=How%20many%20times%20is%20mercy%20mentioned?"
# Authentic Hadiths only
curl "http://localhost:8000/ask?q=prayer&source_type=hadith&grade_filter=sahih"
# Quran text search
curl "http://localhost:8000/quran/search?q=bismillah"
# Quran topic search
curl "http://localhost:8000/quran/topic?topic=patience&top_k=5"
# Quran word frequency
curl "http://localhost:8000/quran/word-frequency?word=mercy"
# Single chapter
curl "http://localhost:8000/quran/chapter/2"
# Exact verse
curl "http://localhost:8000/quran/verse/2:255"
# Hadith text search
curl "http://localhost:8000/hadith/search?q=actions+are+judged+by+intentions"
# Hadith topic search (Sahih only)
curl "http://localhost:8000/hadith/topic?topic=fasting&grade_filter=sahih"
# Verify Hadith authenticity
curl "http://localhost:8000/hadith/verify?q=Actions%20are%20judged%20by%20intentions"
# Browse a collection
curl "http://localhost:8000/hadith/collection/bukhari?limit=5"
# Streaming (OpenAI-compatible)
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"QModel","messages":[{"role":"user","content":"What does Islam say about charity?"}],"stream":true}'
```
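For non-shell clients, the streaming endpoint can be consumed with any SSE-capable HTTP client. A stdlib-only sketch (the chunk layout follows the OpenAI streaming format; `parse_sse_line` and `stream_answer` are illustrative names, not part of QModel):

```python
# Consume QModel's OpenAI-style SSE stream with only the standard library.
import json
import urllib.request

def parse_sse_line(line: str):
    """Return the text delta from one 'data: {...}' SSE line, else None."""
    if not line.startswith("data: "):
        return None
    data = line[len("data: "):].strip()
    if data == "[DONE]":                  # OpenAI-style end-of-stream marker
        return None
    chunk = json.loads(data)
    return chunk["choices"][0]["delta"].get("content")

def stream_answer(question: str, base_url: str = "http://localhost:8000"):
    payload = {
        "model": "QModel",
        "messages": [{"role": "user", "content": question}],
        "stream": True,
    }
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:                  # SSE arrives line by line
            delta = parse_sse_line(raw.decode("utf-8").rstrip("\r\n"))
            if delta:
                print(delta, end="", flush=True)
```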
---
## Configuration
All configuration via environment variables (`.env` file or exported directly):
### Backend Selection
| Backend | Pros | Cons | When to Use |
|---------|------|------|------------|
| **Ollama** | Fast setup, no GPU, free | Smaller models | Development, testing |
| **HuggingFace** | Larger models, better quality | Requires GPU or significant RAM | Production |
### Ollama Backend (Development)
```bash
LLM_BACKEND=ollama
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=llama2 # or: mistral, neural-chat, orca-mini
```
Requires: `ollama serve` running and model pulled (`ollama pull llama2`).
### HuggingFace Backend (Production)
```bash
LLM_BACKEND=hf
HF_MODEL_NAME=Qwen/Qwen2-7B-Instruct
HF_DEVICE=auto # auto | cuda | cpu
HF_MAX_NEW_TOKENS=2048
```
### All Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| **Backend** | | |
| `LLM_BACKEND` | `hf` | `ollama` or `hf` |
| `OLLAMA_HOST` | `http://localhost:11434` | Ollama server URL |
| `OLLAMA_MODEL` | `llama2` | Ollama model name |
| `HF_MODEL_NAME` | `Qwen/Qwen2-7B-Instruct` | HuggingFace model ID |
| `HF_DEVICE` | `auto` | `auto`, `cuda`, or `cpu` |
| `HF_MAX_NEW_TOKENS` | `2048` | Max output length |
| **Embedding & Data** | | |
| `EMBED_MODEL` | `intfloat/multilingual-e5-large` | Embedding model |
| `FAISS_INDEX` | `QModel.index` | Index file path |
| `METADATA_FILE` | `metadata.json` | Dataset file |
| **Retrieval** | | |
| `TOP_K_SEARCH` | `20` | Candidate pool (5–100) |
| `TOP_K_RETURN` | `5` | Results shown to user (1–20) |
| `RERANK_ALPHA` | `0.6` | Dense vs Sparse weight (0.0–1.0) |
| **Generation** | | |
| `TEMPERATURE` | `0.2` | Creativity (0.0–1.0; use 0.1–0.2 for religious content) |
| `MAX_TOKENS` | `2048` | Max response length |
| **Safety** | | |
| `CONFIDENCE_THRESHOLD` | `0.30` | Min score to call LLM (higher = fewer hallucinations) |
| `HADITH_BOOST` | `0.08` | Score boost for hadith on hadith queries |
| **Other** | | |
| `CACHE_SIZE` | `512` | Query response cache entries |
| `CACHE_TTL` | `3600` | Cache expiry in seconds |
| `ALLOWED_ORIGINS` | `*` | CORS origins |
| `MAX_EXAMPLES` | `3` | Few-shot examples in system prompt |
### Configuration Examples
**Development (Ollama)**
```bash
LLM_BACKEND=ollama
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=llama2
TEMPERATURE=0.2
CONFIDENCE_THRESHOLD=0.30
ALLOWED_ORIGINS=*
```
**Production (HuggingFace + GPU)**
```bash
LLM_BACKEND=hf
HF_MODEL_NAME=Qwen/Qwen2-7B-Instruct
HF_DEVICE=cuda
TOP_K_SEARCH=30
TEMPERATURE=0.1
CONFIDENCE_THRESHOLD=0.35
ALLOWED_ORIGINS=yourdomain.com,api.yourdomain.com
```
### Tuning Tips
- **Better results**: Increase `TOP_K_SEARCH`, lower `CONFIDENCE_THRESHOLD`, use `TEMPERATURE=0.1`
- **Faster performance**: Lower `TOP_K_SEARCH` and `TOP_K_RETURN`, reduce `MAX_TOKENS`, use Ollama
- **More conservative**: Increase `CONFIDENCE_THRESHOLD`, lower `TEMPERATURE`
---
## Docker Deployment
### Docker Compose (Recommended)
```bash
cp .env.example .env # Configure backend (see Configuration section)
docker-compose up
```
### Docker CLI
```bash
docker build -t qmodel .
# With Ollama backend
docker run -p 8000:8000 \
--env-file .env \
--add-host host.docker.internal:host-gateway \
qmodel
# With HuggingFace backend
docker run -p 8000:8000 \
--env-file .env \
--env HF_TOKEN=your_token_here \
qmodel
```
### Docker with Ollama
```bash
# .env
LLM_BACKEND=ollama
OLLAMA_HOST=http://host.docker.internal:11434
OLLAMA_MODEL=llama2
```
Requires Ollama running on the host (`ollama serve`).
### Docker with HuggingFace
```bash
# .env
LLM_BACKEND=hf
HF_MODEL_NAME=Qwen/Qwen2-7B-Instruct
HF_DEVICE=auto
# Pass HF token
export HF_TOKEN=hf_xxxxxxxxxxxxx
docker-compose up
```
### Docker Compose with GPU (Linux)
```yaml
services:
qmodel:
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
```
### Production Tips
- Remove dev volume mount (`.:/app`) in `docker-compose.yml`
- Set `restart: on-failure:5`
- Use specific `ALLOWED_ORIGINS` instead of `*`
---
## Open-WebUI Integration
QModel is fully OpenAI-compatible and works out of the box with Open-WebUI.
### Setup
```bash
# Start QModel
python main.py
# Start Open-WebUI
docker run -d -p 3000:8080 --name open-webui ghcr.io/open-webui/open-webui:latest
```
### Connect
1. **Settings** → **Models** → **Manage Models**
2. Click **"Connect to OpenAI-compatible API"**
3. **API Base URL**: `http://localhost:8000/v1`
4. **Model Name**: `QModel`
5. **API Key**: Leave blank
6. **Save & Test** → ✅ Connected
### Docker Compose (QModel + Ollama + Open-WebUI)
```yaml
version: '3.8'
services:
qmodel:
build: .
ports:
- "8000:8000"
environment:
- LLM_BACKEND=ollama
- OLLAMA_HOST=http://ollama:11434
ollama:
image: ollama/ollama:latest
ports:
- "11434:11434"
web-ui:
image: ghcr.io/open-webui/open-webui:latest
ports:
- "3000:8080"
depends_on:
- qmodel
```
### Supported Features
| Feature | Status |
|---------|--------|
| Chat | ✅ Full support |
| Streaming | ✅ `stream: true` |
| Multi-turn context | ✅ Handled by Open-WebUI |
| Temperature | ✅ Configurable |
| Token limits | ✅ `max_tokens` |
| Model listing | ✅ `/v1/models` |
| Source attribution | ✅ `x_metadata.sources` |
---
## Architecture
### Module Structure
```
main.py ← FastAPI app + router registration
app/
config.py ← Config class (env vars)
llm.py ← LLM providers (Ollama, HuggingFace)
cache.py ← TTL-LRU async cache
arabic_nlp.py ← Arabic normalization, stemming, language detection
search.py ← Hybrid FAISS+BM25, text search, query rewriting
analysis.py ← Intent detection, analytics, counting
prompts.py ← Prompt engineering (persona, anti-hallucination)
models.py ← Pydantic schemas
state.py ← AppState, lifespan, RAG pipeline
routers/
quran.py ← 6 Quran endpoints
hadith.py ← 5 Hadith endpoints
chat.py ← /ask + OpenAI-compatible chat
ops.py ← health, models, debug scores
```
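By way of illustration, Arabic normalization of the kind `arabic_nlp.py` performs typically strips diacritics and unifies letter variants before matching (a sketch under common conventions, not the module's actual code):

```python
# Common Arabic text normalization steps for search/matching.
import re

# Tashkeel marks (U+064B-U+0652) plus the dagger alef (U+0670)
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

def normalize_arabic(text: str) -> str:
    text = DIACRITICS.sub("", text)      # strip vowel marks
    text = re.sub("[إأآ]", "ا", text)    # unify hamza/madda alef variants
    text = text.replace("ى", "ي")        # alef maqsura -> ya
    text = text.replace("ة", "ه")        # ta marbuta -> ha
    return text
```

Normalizing both the indexed text and the query this way lets vowelled Qur'anic script match unvowelled user input.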
### Data Pipeline
1. **Ingest**: 47,626 documents (6,236 Quran verses + 41,390 Hadiths from 9 collections)
2. **Embed**: Encode with `multilingual-e5-large` (Arabic + English dual embeddings)
3. **Index**: FAISS `IndexFlatIP` for dense retrieval
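`IndexFlatIP` is an exact, brute-force inner-product index: every query is scored against every stored vector (and with L2-normalized embeddings, inner product equals cosine similarity). Conceptually it amounts to this pure-Python sketch of the idea, not how FAISS itself is implemented:

```python
# Toy exact inner-product index mimicking FAISS IndexFlatIP behaviour.
class FlatIPIndex:
    def __init__(self):
        self.vectors = []

    def add(self, vecs):
        self.vectors.extend(vecs)

    def search(self, query, k):
        # Score every stored vector by inner product, highest first
        scored = sorted(
            ((sum(q * v for q, v in zip(query, vec)), i)
             for i, vec in enumerate(self.vectors)),
            reverse=True,
        )
        return scored[:k]

index = FlatIPIndex()
index.add([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
top = index.search([1.0, 0.0], k=2)  # [(1.0, 0), (0.5, 2)]
```

FAISS performs the same exact search over the 47,626 document embeddings, just with vectorized math instead of Python loops.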
### Retrieval & Ranking
1. Dense retrieval (FAISS semantic scoring)
2. Sparse retrieval (BM25 term-frequency)
3. Fusion: 60% dense + 40% sparse
4. Intent-aware boost (+0.08 to Hadith when intent=hadith)
5. Type filter (quran_only / hadith_only / authenticated_only)
6. Text search fallback (exact phrase + word-overlap)
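Steps 1–4 above can be sketched as follows (illustrative only: the min-max normalization is an assumption, and the real logic lives in `app/search.py`). `RERANK_ALPHA` weights the dense score and `HADITH_BOOST` is the intent-aware bump:

```python
def normalize(scores):
    """Min-max normalize a score list into [0, 1] (an assumed scheme)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(dense, sparse, doc_types, alpha=0.6, intent=None, hadith_boost=0.08):
    """Fuse dense/sparse scores (alpha = RERANK_ALPHA), then boost by intent."""
    d, s = normalize(dense), normalize(sparse)
    fused = [alpha * di + (1 - alpha) * si for di, si in zip(d, s)]
    if intent == "hadith":
        fused = [f + hadith_boost if t == "hadith" else f
                 for f, t in zip(fused, doc_types)]
    return fused

scores = fuse(dense=[0.9, 0.5, 0.1],    # FAISS similarities
              sparse=[2.0, 8.0, 4.0],   # BM25 scores
              doc_types=["quran", "hadith", "hadith"],
              intent="hadith")
```

With the defaults this is the "60% dense + 40% sparse" split from step 3; raising `RERANK_ALPHA` shifts ranking toward semantic similarity, lowering it toward exact term overlap.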
### Anti-Hallucination Measures
- Few-shot examples including "not found" refusal path
- Hardcoded citation format rules
- Verbatim copy rules (no text reconstruction)
- Confidence threshold gating (default: 0.30)
- Post-generation citation verification
- Grade inference from collection name
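The confidence gate itself is simple to picture (a sketch; the function names and refusal message here are illustrative, not QModel's actual strings):

```python
# If the best retrieval score falls below CONFIDENCE_THRESHOLD, the LLM is
# never called and a fixed refusal is returned instead of a guess.
NOT_FOUND = "I could not find this in the authenticated sources."

def answer(query, retrieve, generate, threshold=0.30):
    results = retrieve(query)            # list of (score, doc), best first
    if not results or results[0][0] < threshold:
        return NOT_FOUND                 # refuse rather than hallucinate
    return generate(query, [doc for _, doc in results])

# Toy retriever/generator pairs showing both sides of the gate:
refused = answer("obscure question",
                 retrieve=lambda q: [(0.12, "weak match")],
                 generate=lambda q, docs: "should never run")
grounded = answer("well-covered question",
                  retrieve=lambda q: [(0.62, "Quran 2:255")],
                  generate=lambda q, docs: f"grounded in {len(docs)} source(s)")
```

Raising `CONFIDENCE_THRESHOLD` trades recall for safety: more queries get the refusal, fewer get a weakly grounded generation.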
### Performance
| Operation | Time | Backend |
|-----------|------|---------|
| Query (cached) | ~50ms | Both |
| Query (Ollama) | 400–800ms | Ollama |
| Query (HF GPU) | 500–1500ms | CUDA |
| Query (HF CPU) | 2–5s | CPU |
---
## Troubleshooting
### "Cannot connect to Ollama"
```bash
ollama serve # Ensure Ollama is running on host
# In Docker, use OLLAMA_HOST=http://host.docker.internal:11434
```
### "HuggingFace model not found"
```bash
export HF_TOKEN=hf_xxxxxxxxxxxxx # Set token for gated models
```
### "Out of memory"
- Use smaller model: `HF_MODEL_NAME=mistralai/Mistral-7B-Instruct-v0.2`
- Use Ollama with `neural-chat`
- Reduce `MAX_TOKENS` to 1024
- Increase Docker memory limit in `docker-compose.yml`
### "Assistant returns 'Not found'"
This is expected — QModel rejects low-confidence queries. Try:
- More specific queries
- Lower `CONFIDENCE_THRESHOLD` in `.env`
- Check raw scores: `GET /debug/scores?q=your+query`
### "Port already in use"
```bash
docker-compose down && docker system prune
# Or change port: ports: ["8001:8000"]
```
---
## Roadmap
- [x] Grade-based filtering
- [x] Streaming responses (SSE)
- [x] Modular architecture (4 routers, 16 endpoints)
- [x] Dual LLM backend (Ollama + HuggingFace)
- [x] Text search (exact substring + fuzzy matching)
- [ ] Chain of narrators (Isnad display)
- [ ] Synonym expansion (mercy β†’ rahma, compassion)
- [ ] Batch processing (multiple questions per request)
- [ ] Islamic calendar integration (Hijri dates)
- [ ] Tafsir endpoint with scholar citations
---
## Data Sources
- **Qur'an**: [risan/quran-json](https://github.com/risan/quran-json) — 114 Surahs, 6,236 verses
- **Hadith**: [AhmedBaset/hadith-json](https://github.com/AhmedBaset/hadith-json) — 9 canonical collections, 41,390 hadiths
---
## Request Flow
```
User Query
↓
Query Rewriting & Intent Detection
↓
Hybrid Search (FAISS dense + BM25 sparse)
↓
Filtering & Ranking
↓
Confidence Gate (skip LLM if low-scoring)
↓
LLM Generation (Ollama or HuggingFace)
↓
Formatted Response with Sources
```
See [ARCHITECTURE.md](ARCHITECTURE.md) for detailed system design.
---
## Troubleshooting Quick Reference
| Issue | Solution |
|-------|----------|
| "Service is initialising" | Wait 60–90s for the embedding model to load |
| Low retrieval scores | Check `/debug/scores`, try synonyms, lower threshold |
| "Model not found" (HF) | Run `huggingface-cli login` |
| Out of memory | Use smaller model or CPU backend |
| No results | Verify data files exist: `metadata.json` and `QModel.index` |
See [SETUP.md](SETUP.md) and [DOCKER.md](DOCKER.md) for more detailed troubleshooting.
---
## What's New in v6
✨ **Dual LLM Backend** — Ollama (dev) + HuggingFace (prod)
✨ **Grade Filtering** — Return only Sahih/Hasan authenticated Hadiths
✨ **Source Filtering** — Quran-only or Hadith-only queries
✨ **Hadith Verification** — `/hadith/verify` endpoint
✨ **Enhanced Frequency** — Word counts by Surah
✨ **OpenAI Compatible** — Use with any OpenAI client
✨ **Production Ready** — Structured logging, error handling, async throughout
---
## Next Steps
1. **Get Started**: See [SETUP.md](SETUP.md)
2. **Integrate with Open-WebUI**: See [OPEN_WEBUI.md](OPEN_WEBUI.md)
3. **Deploy with Docker**: See [DOCKER.md](DOCKER.md)
4. **Understand Architecture**: See [ARCHITECTURE.md](ARCHITECTURE.md)
---
## License
This project uses open-source data from:
- [Qur'an JSON](https://github.com/risan/quran-json) — Open source
- [Hadith API](https://github.com/AhmedBaset/hadith-json) — Open source
See individual repositories for license details.
---
**Made with ❤️ for Islamic scholarship.**
Version 6.0.0 | March 2025 | Production-Ready