File size: 9,997 Bytes
36ed911
 
 
 
 
 
e17ca44
 
36ed911
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
e17ca44
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
---
title: Hadith Semantic Search
emoji: ๐Ÿ“š
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 6.9.0
python_version: '3.10'
app_file: app.py
pinned: false
---

# Hadith Semantic Search Project

## Overview

This project implements an AI-powered semantic search engine for Hadith (Islamic traditions). Unlike traditional keyword-based search tools that match exact words, this system understands the **meaning** behind queries and returns relevant Hadiths even when different wording is used.

The project uses advanced natural language processing (NLP) techniques including:
- **Semantic embeddings** using multilingual sentence transformers
- **BM25 ranking** for keyword relevance
- **Hybrid search** combining semantic and keyword approaches
- **Anchor-based retrieval** for improved accuracy
- **FAISS** for efficient similarity search

## Table of Contents

- [Features](#features)
- [Installation](#installation)
- [Dataset](#dataset)
- [Project Structure](#project-structure)
- [Methodology](#methodology)
- [Usage](#usage)
- [Evaluation](#evaluation)
- [Deployment](#deployment)
- [Technologies Used](#technologies-used)
- [Results](#results)
- [Future Improvements](#future-improvements)
- [Contributing](#contributing)
- [License](#license)

## Features

- **Semantic Understanding**: Retrieves Hadiths based on meaning, not just exact word matches
- **Multilingual Support**: Works with Arabic text using multilingual models
- **Hybrid Search**: Combines semantic similarity with BM25 keyword matching for optimal results
- **Anchor-based Enhancement**: Uses subject-based anchors to improve retrieval accuracy
- **Web Interface**: Gradio-based interface for easy interaction
- **Efficient Search**: Uses FAISS for fast similarity search on large datasets
- **Evaluation Metrics**: Includes Precision@K and Recall@K for performance measurement

## Installation

### Prerequisites

- Python 3.8 or higher
- pip package manager

### Setup

1. Clone the repository:
```bash
git clone <repository-url>
cd hadith-semantic-search
```

2. Install required packages:
```bash
pip install -r requirements.txt
```

### Required Libraries

```
sentence-transformers==2.2.2
transformers>=4.36.0
torch>=2.0.0
faiss-cpu
rank-bm25
numpy
pandas
gradio
scikit-learn
matplotlib
seaborn
```

## Dataset

The project uses the `hadith_by_book.csv` dataset containing:
- **Hadith text** (matn_text)
- **Subject classifications** (main_subj)
- **Reference URLs** (xref_url)
- **Ayat IDs** (ayat_ids)
- **Book metadata**

### Data Processing Steps

1. **Loading**: Import data from CSV
2. **Cleaning**: Remove duplicate entries and unnecessary columns
3. **Preprocessing**: Remove Arabic diacritics (tashkeel) for better matching
4. **Analysis**: Visualize text length distribution and subject categories

## Project Structure

```
hadith-semantic-search/
โ”‚
โ”œโ”€โ”€ hadith.ipynb              # Main Jupyter notebook
โ”œโ”€โ”€ README.md                 # This file
โ”œโ”€โ”€ requirements.txt          # Python dependencies
โ”‚
โ”œโ”€โ”€ app.py                    # Gradio web application
โ”œโ”€โ”€ retrieval.py              # Search retrieval functions
โ”œโ”€โ”€ utils.py                  # Utility functions
โ”‚
โ”œโ”€โ”€ data/                     # Data directory
โ”‚   โ”œโ”€โ”€ hadith_embeddings.npy # Pre-computed embeddings
โ”‚   โ”œโ”€โ”€ bm25.pkl             # BM25 model
โ”‚   โ””โ”€โ”€ anchor_index.faiss   # Anchor embeddings index
โ”‚
โ””โ”€โ”€ hadith_by_book.csv       # Dataset
```

## Methodology

### 1. Text Preprocessing

- Remove Arabic diacritics (tashkeel) to normalize text
- Clean special characters while preserving Arabic script
- Tokenize text for BM25 processing

### 2. Embedding Generation

Uses **paraphrase-multilingual-MiniLM-L12-v2** model to create 384-dimensional embeddings that capture semantic meaning of Hadith text.

### 3. Search Approaches

#### a) Pure Semantic Search (FAISS)
- Encodes query into embedding
- Uses FAISS IndexFlatIP for cosine similarity search
- Returns top-K most similar Hadiths

#### b) Hybrid Search (BM25 + Semantic)
1. **BM25 Retrieval**: Get top-50 candidates using keyword matching
2. **Semantic Re-ranking**: Re-rank candidates using semantic similarity
3. **Score Fusion**: Combine BM25 and semantic scores with weighted average (alpha=0.8)

#### c) Enhanced Hybrid Search with Anchors
1. **Anchor Creation**: Create subject-based anchors from main topics
2. **Query-Anchor Matching**: Find relevant subject anchors for query
3. **Candidate Expansion**: Include Hadiths from relevant subjects
4. **Hybrid Scoring**: Combine BM25, semantic, and anchor signals

### 4. Evaluation

Performance measured using:
- **Precision@K**: Proportion of relevant results in top-K
- **Recall@K**: Proportion of all relevant Hadiths retrieved in top-K

Test queries cover various topics:
- Importance of intention in deeds
- Virtues of prayer
- Rights of neighbors
- Seeking knowledge
- Charity and giving

## Usage

### Running the Notebook

1. Open the Jupyter notebook:
```bash
jupyter notebook hadith.ipynb
```

2. Execute cells sequentially to:
   - Load and preprocess data
   - Generate embeddings
   - Build search indices
   - Test queries
   - Evaluate performance

### Using the Web Interface

1. Generate required data files by running the notebook
2. Launch the Gradio app:
```bash
python app.py
```

3. Open the provided URL in your browser
4. Enter Arabic queries to search Hadiths

### Example Queries

```python
# Example 1: Query about intention
query = "ู…ุง ู‡ูˆ ุงู„ุญุฏูŠุซ ุงู„ุฐูŠ ูŠุดุฑุญ ุฃู‡ู…ูŠุฉ ุงู„ู†ูŠุฉ ูˆุฃุซุฑู‡ุง ููŠ ู‚ุจูˆู„ ุงู„ุฃุนู…ุงู„ ุนู†ุฏ ุงู„ู„ู‡"

# Example 2: Query about charity
query = "ูุถู„ ุงู„ุตุฏู‚ุฉ ูˆุงู„ุฅู†ูุงู‚ ููŠ ุณุจูŠู„ ุงู„ู„ู‡"

# Example 3: Query about knowledge
query = "ุฃู‡ู…ูŠุฉ ุทู„ุจ ุงู„ุนู„ู… ูˆูุถู„ ุงู„ุนุงู„ู…"
```

## Evaluation

The project includes a comprehensive evaluation framework:

### Evaluation Queries

5 carefully crafted queries with known relevant Hadith IDs:
1. **Intention (Niyyah)**: Importance of intention in accepting deeds
2. **Prayer virtues**: Excellence of prayer and its rewards
3. **Neighbor rights**: Rights and treatment of neighbors
4. **Seeking knowledge**: Importance and virtue of knowledge
5. **Charity**: Giving in the path of Allah

### Metrics

- **Precision@5**: Accuracy of top 5 results
- **Recall@5**: Coverage of relevant results in top 5
- **Average scores** across all queries

### Results Comparison

| Method | Precision@5 | Recall@5 |
|--------|-------------|----------|
| Pure Semantic (FAISS) | ~0.XX | ~0.XX |
| Hybrid (BM25 + Semantic) | ~0.XX | ~0.XX |
| Enhanced (with Anchors) | ~0.XX | ~0.XX |

## Deployment

The project includes deployment-ready files:

### Files Created

1. **app.py**: Main Gradio application
2. **retrieval.py**: Core search functions
3. **utils.py**: Preprocessing utilities
4. **requirements.txt**: Dependencies

### Deployment Steps

1. Ensure all data files are in the `data/` directory
2. Install dependencies: `pip install -r requirements.txt`
3. Run: `python app.py`
4. For production, consider using:
   - Docker containers
   - Cloud platforms (AWS, GCP, Azure)
   - Gradio Spaces for easy hosting

## Technologies Used

### Core Libraries

- **sentence-transformers**: Multilingual semantic embeddings
- **transformers**: Hugging Face transformer models
- **torch**: PyTorch deep learning framework
- **faiss-cpu**: Fast similarity search and clustering
- **rank-bm25**: BM25 ranking algorithm

### Data & Analysis

- **pandas**: Data manipulation and analysis
- **numpy**: Numerical computing
- **matplotlib**: Data visualization
- **seaborn**: Statistical visualization

### Web Interface

- **gradio**: Interactive web interface
- **scikit-learn**: Machine learning utilities

## Results

### Key Findings

1. **Hybrid approach outperforms** pure semantic or keyword-only search
2. **Anchor-based enhancement** improves precision for subject-specific queries
3. **Arabic text preprocessing** (removing diacritics) improves matching
4. **Multilingual models** effectively capture Arabic semantic meaning

### Performance Insights

- Average query time: ~0.1-0.5 seconds
- Index size: Efficient for datasets up to 100K+ Hadiths
- Embedding dimension: 384 (balanced between accuracy and speed)

## Future Improvements

1. **Cross-encoder Re-ranking**: Add a second-stage cross-encoder for final ranking
2. **Query Expansion**: Automatically expand queries with synonyms
3. **Multi-language Support**: Add English and other language interfaces
4. **Advanced Filtering**: Filter by book, narrator, or authenticity grade
5. **Feedback Loop**: Incorporate user feedback to improve rankings
6. **GPU Acceleration**: Use FAISS GPU for faster search on large datasets
7. **Context Window**: Show surrounding Hadiths for better understanding
8. **Citation Network**: Leverage hadith-to-hadith references

## Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

### Areas for Contribution

- Improving Arabic text preprocessing
- Adding new evaluation queries
- Optimizing search algorithms
- Enhancing the web interface
- Documentation improvements

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- **Sentence Transformers** team for multilingual models
- **FAISS** developers for efficient similarity search
- Hadith dataset providers
- Islamic scholars for categorization and verification

## Contact

For questions, suggestions, or collaboration:
- Open an issue on GitHub
- Contact: [Your Email]

---

**Note**: This is an educational project for demonstrating semantic search techniques on Islamic texts. For religious guidance, always consult qualified Islamic scholars.