---
title: Hadith Search
emoji: 📜
colorFrom: indigo
colorTo: green
sdk: docker
pinned: false
license: mit
---
<div align="center">
# 📜 Hadith Search
**Semantic search across thousands of Prophetic traditions: find the Hadith closest to your question by meaning, not just keywords.**
[![HuggingFace Space](https://img.shields.io/badge/🤗%20HuggingFace-Live%20Demo-yellow?style=for-the-badge)](https://huggingface.co/spaces/NightPrince/Hadith_Search)
[![License: MIT](https://img.shields.io/badge/License-MIT-green?style=for-the-badge)](LICENSE)
[![Python 3.10](https://img.shields.io/badge/Python-3.10-blue?style=for-the-badge&logo=python)](https://python.org)
[![FastAPI](https://img.shields.io/badge/FastAPI-teal?style=for-the-badge&logo=fastapi)](https://fastapi.tiangolo.com)
</div>
---
## What Is This?
Hadith Search is a hybrid search engine over a comprehensive corpus of Islamic Hadith (Prophetic traditions). It combines neural semantic embeddings with classical BM25 and anchor-based retrieval to surface the most relevant traditions, even when your query is worded differently from the Hadith itself.
Each result includes the full Hadith text, its chain of narration (Isnad), topic classification, and a direct source link.
---
## Demo
🔗 **[Live on HuggingFace Spaces →](https://huggingface.co/spaces/NightPrince/Hadith_Search)**
---
## How It Works
```
              User Query (Arabic)
                       │
                       ▼
┌─────────────────────────────────────────────┐
│  Arabic Preprocessing                       │
│  Remove tashkeel · Normalize letters        │
│  Unicode variant unification                │
└──────────────────────┬──────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────┐
│  Hybrid Search (3 signals)                  │
│                                             │
│  ① Anchor    40%  hadith entity match       │
│  ② Semantic  35%  neural meaning match      │
│  ③ BM25      25%  keyword precision         │
│                                             │
│  Model: paraphrase-multilingual-MiniLM-L12  │
└──────────────────────┬──────────────────────┘
                       │
                       ▼
             Top-K ranked Hadiths
      (text · isnad · topic · source URL)
```
The **anchor signal** carries the largest weight (40%) because Hadith contain strong named-entity anchors (narrators, topics, keywords) that are highly discriminative, which makes entity-aware matching the dominant signal.
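
The weighting described above can be sketched as a simple weighted-sum fusion. This is a minimal illustration, not the code in `retrieval.py`: the function names, the min-max normalization step, and the sample scores are assumptions; only the 40/35/25 weights come from the diagram.

```python
from typing import Dict, List, Tuple

# Weights taken from the diagram above; everything else is illustrative.
WEIGHTS = {"anchor": 0.40, "semantic": 0.35, "bm25": 0.25}


def min_max(scores: Dict[int, float]) -> Dict[int, float]:
    """Rescale raw scores to [0, 1] so the three signals are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(signals: Dict[str, Dict[int, float]], top_k: int = 5) -> List[Tuple[int, float]]:
    """Combine per-signal scores into one ranking by weighted sum."""
    combined: Dict[int, float] = {}
    for name, raw in signals.items():
        for doc, score in min_max(raw).items():
            combined[doc] = combined.get(doc, 0.0) + WEIGHTS[name] * score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical raw scores keyed by Hadith row index.
ranked = fuse({
    "anchor": {0: 1.0, 1: 0.0},
    "semantic": {0: 0.2, 1: 0.9, 2: 0.5},
    "bm25": {1: 3.0, 2: 1.0},
})
print(ranked)
```

In this sketch a Hadith that is missing from one signal simply contributes zero for that signal, so a strong anchor match alone can still outrank a weak semantic-only match.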
---
## Features
- **Anchor-weighted hybrid**: prioritizes entity matching (40%) over pure semantics
- **Full Hadith metadata**: text, Isnad chain, topic classification, source URL
- **Arabic-native**: built for Arabic queries with proper diacritic handling
- **RTL Arabic UI**: responsive glassmorphism design
- **Fast cold start**: model baked into Docker image at build time
- **Cached embeddings**: TTL-based in-memory cache for repeated queries
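
The cached-embeddings feature can be pictured as a small time-to-live cache keyed by the query string. The sketch below is an assumption about the general shape, not the contents of `hf_model.py`: the `TTLCache` class, the `cached_embed` helper, the 300-second default, and the injectable clock are all hypothetical.

```python
import threading
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLCache:
    """Minimal thread-safe in-memory cache whose entries expire after `ttl` seconds."""

    def __init__(self, ttl: float = 300.0, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl
        self._clock = clock  # injectable so expiry can be exercised without sleeping
        self._lock = threading.Lock()
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        with self._lock:
            entry = self._store.get(key)
            if entry is None:
                return None
            expires_at, value = entry
            if self._clock() >= expires_at:
                del self._store[key]  # drop the stale entry lazily on read
                return None
            return value

    def set(self, key: str, value: Any) -> None:
        with self._lock:
            self._store[key] = (self._clock() + self._ttl, value)


def cached_embed(query: str, cache: TTLCache, embed: Callable[[str], Any]) -> Any:
    """Return a fresh cached embedding if one exists, otherwise compute and store it."""
    hit = cache.get(query)
    if hit is not None:
        return hit
    vector = embed(query)
    cache.set(query, vector)
    return vector
```

Here `embed` stands in for whatever function encodes a query (for example a SentenceTransformer's `encode`); repeated queries within the TTL skip the model call entirely.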
---
## Tech Stack
| Layer | Technology |
|---|---|
| Backend | FastAPI + Uvicorn |
| Embeddings | `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` |
| Vector Search | FAISS (CPU) |
| Keyword Search | BM25 (`rank-bm25`) |
| Frontend | Vanilla HTML/CSS/JS, RTL Arabic |
| Deployment | Docker on HuggingFace Spaces |
---
## Project Structure
```
├── app.py                       # FastAPI entrypoint, /api/search endpoint
├── hadith_mcp.py                # Search orchestrator, RAG initialization
├── retrieval.py                 # Hybrid search: BM25 + semantic + anchor
├── hf_model.py                  # Thread-safe SentenceTransformer + TTL cache
├── utils.py                     # Arabic text utilities (tashkeel, normalization)
├── index.html                   # Frontend UI
├── assets/
│   ├── script.js                # Fetch + render result cards
│   └── style.css                # Glassmorphism RTL design
├── data/
│   ├── hadith.csv               # Hadith corpus (text, isnad, title, topic, url)
│   ├── hadith_embeddings.npy    # Pre-computed embeddings
│   ├── bm25.pkl                 # BM25 index
│   ├── faiss_anchor.index       # FAISS anchor index
│   ├── anchor_dict.pkl          # anchor → hadith row indices
│   └── unique_anchor_texts.pkl  # Ordered anchor list
└── Dockerfile
```
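
`utils.py` is described as holding the Arabic text utilities. As a rough idea of what tashkeel removal and letter normalization can look like, here is a standard-library sketch. The function name `normalize_arabic` and the specific letter folding (hamza forms of alef, alef maqsura, teh marbuta) are assumptions; the project's actual rules may differ.

```python
import re
import unicodedata

# Arabic diacritics (tashkeel) U+064B..U+0652, superscript alef U+0670, tatweel U+0640.
_TASHKEEL = re.compile(r"[\u064B-\u0652\u0670\u0640]")

# Assumed letter folding; the mapping used by the project may differ.
_LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maqsura -> yeh
    "\u0629": "\u0647",  # teh marbuta -> heh
})


def normalize_arabic(text: str) -> str:
    """Unify Unicode variants, strip tashkeel, and fold common letter variants."""
    text = unicodedata.normalize("NFKC", text)  # fold presentation forms
    text = _TASHKEEL.sub("", text)              # remove diacritics and tatweel
    text = text.translate(_LETTER_MAP)          # fold letter variants
    return " ".join(text.split())               # collapse whitespace


print(normalize_arabic("عَنْ عُمَرَ بْنِ الْخَطَّابِ"))
```

Applying the same function to both the corpus and the query is what lets a fully vocalized Hadith match an unvocalized query. Folding teh marbuta into heh is lossy, so it is a trade-off between recall and precision.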
---
## API
### `POST /api/search`
```json
// Request
{ "query": "إنما الأعمال بالنيات", "top_k": 5 }
// Response
{
  "results": [
    {
      "rank": 1,
      "title": "حديث النية",
      "text": "عَنْ عُمَرَ بْنِ الْخَطَّابِ قَالَ سَمِعْتُ رَسُولَ اللَّهِ...",
      "topic": "النية والإخلاص",
      "source_url": "https://..."
    }
  ]
}
```
`top_k` accepts 1–10. The sample query is the opening of the hadith "Actions are judged by intentions"; the sample title and topic read "Hadith of Intention" and "Intention and Sincerity".
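
A client for this endpoint can be written with the standard library alone. The sketch below assumes the server is running locally on port 7860 (as in Local Setup) and that the response has the `results` shape shown above; the helper names `build_search_request` and `search` are hypothetical, and the client-side range check only mirrors the documented 1–10 limit.

```python
import json
import urllib.request

# Assumed base URL for a local run; adjust for a deployed Space.
BASE_URL = "http://localhost:7860"


def build_search_request(query: str, top_k: int = 5, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /api/search request; top_k must be within 1..10."""
    if not 1 <= top_k <= 10:
        raise ValueError("top_k must be between 1 and 10")
    # ensure_ascii=False keeps the Arabic query readable in the request body.
    body = json.dumps({"query": query, "top_k": top_k}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/search",
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def search(query: str, top_k: int = 5, base_url: str = BASE_URL) -> list:
    """Send the request and return the `results` list from the JSON response."""
    request = build_search_request(query, top_k, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["results"]
```

With the server running, `search("إنما الأعمال بالنيات", top_k=3)` should return up to three result objects, each carrying the `rank`, `title`, `text`, `topic`, and `source_url` fields from the example response.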
---
## Local Setup
```bash
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860 --reload
# open http://localhost:7860
```
---
## Built by
**يحيى النوساني** · [HuggingFace](https://huggingface.co/NightPrince)
---
*Part of a series of Islamic knowledge retrieval engines. See also: [Tafsir Search](https://github.com/NightPrinceY/Tafsir_Search) · [Quran Semantic Retrieval](https://github.com/NightPrinceY/Quran-Semantic-Retrieval)*