---
license: mit
language:
- ar
base_model:
- aubmindlab/bert-base-arabertv02
pipeline_tag: token-classification
tags:
- hadith
- sanad
- matn
- hadith-separator
- hadith_separator
- islam
- hadithBERT
library_name: transformers
---
# Arabic Hadith Segmentation BERT (Sanad & Matn Parser)
This model is a fine-tuned version of **AraBERT** (`aubmindlab/bert-base-arabertv02`) optimized for structural token classification in classical Islamic texts. Its primary task is sequence labeling: identifying the boundary between the **Sanad** (سند, the chain of narrators) and the **Matn** (متن, the text of the prophetic saying itself) within a raw, unsegmented Hadith string.
## Model Description
Classical Arabic prophetic texts lack native punctuation marks or structural delimiters to explicitly isolate who narrated a saying from the saying itself. This model treats boundary segmentation as a **Named Entity Recognition (NER) / Token Classification** task using custom-mapped IOB tags.
Given a sequence of words, the model classifies each token into one of the following category IDs:
* `0`: `B-SANAD` (Beginning of the Narrator Chain)
* `1`: `I-SANAD` (Inside the Narrator Chain)
* `2`: `B-MATN` (Beginning of the Core Saying)
* `3`: `I-MATN` (Inside the Core Saying)
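As a rough illustration of how these IDs translate into a segmentation, the sketch below splits a word list at the first word labelled as Matn. The helper name `split_at_matn` and the sample input are hypothetical, and cutting at the first `MATN` label is only one possible post-processing choice, not necessarily the one used by the author.

```python
# Label IDs as listed above; the helper and the sample input are illustrative.
ID2LABEL = {0: "B-SANAD", 1: "I-SANAD", 2: "B-MATN", 3: "I-MATN"}


def split_at_matn(words, label_ids):
    """Split words into (sanad, matn) at the first MATN-labelled word.

    If no word is labelled MATN, every word is returned as sanad.
    """
    labels = [ID2LABEL[i] for i in label_ids]
    boundary = next(
        (idx for idx, label in enumerate(labels) if label.endswith("MATN")),
        len(words),
    )
    return words[:boundary], words[boundary:]


# Hypothetical per-word predictions for a five-word input.
words = ["w1", "w2", "w3", "w4", "w5"]
label_ids = [0, 1, 1, 2, 3]
sanad, matn = split_at_matn(words, label_ids)
print(sanad)  # ['w1', 'w2', 'w3']
print(matn)   # ['w4', 'w5']
```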
---
## Intended Uses & Limitations
### How to Use
You can easily download and use this model directly in your Python projects using the Hugging Face `transformers` library.
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
# 1. Load model and tokenizer directly from the Hub
model_name = "SHK4K/hadith-segmentation-bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
# Set model to evaluation mode
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()
# ID mapping dict matching model configuration
id2label = {0: "B-SANAD", 1: "I-SANAD", 2: "B-MATN", 3: "I-MATN"}
# 2. Input your raw, unsegmented Hadith text
raw_hadith = 'حدثنا الحميدي عبد الله بن الزبير ، قال : حدثنا سفيان ، قال : حدثنا يحيى بن سعيد الأنصاري ، قال : أخبرني محمد بن إبراهيم التيمي ، أنه سمع علقمة بن وقاص الليثي ، يقول : سمعت عمر بن الخطاب رضي الله عنه على المنبر، قال : سمعت رسول الله صلى الله عليه وسلم ، يقول : " إنما الأعمال بالنيات، وإنما لكل امرئ ما نوى، فمن كانت هجرته إلى دنيا يصيبها أو إلى امرأة ينكحها، فهجرته إلى ما هاجر إليه'
# Tokenize raw text string
inputs = tokenizer(raw_hadith, return_tensors="pt", truncation=True, max_length=512)
inputs = {k: v.to(device) for k, v in inputs.items()}
# 3. Predict Token Categories
with torch.no_grad():
    outputs = model(**inputs)
# Pick the highest-scoring label ID for every token of the single input sequence
predictions = torch.argmax(outputs.logits, dim=-1)[0].cpu().tolist()
input_tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# 4. Extract and Group Tokens based on Predicted Labels
sanad_tokens = []
matn_tokens = []
for token, pred_id in zip(input_tokens, predictions):
    # Skip BERT special tokens; they carry no text
    if token in ["[CLS]", "[SEP]", "[PAD]"]:
        continue
    label = id2label.get(pred_id, "O")
    if "SANAD" in label:
        sanad_tokens.append(token)
    elif "MATN" in label:
        matn_tokens.append(token)
# Reconstruct clean component strings
final_sanad = tokenizer.convert_tokens_to_string(sanad_tokens)
final_matn = tokenizer.convert_tokens_to_string(matn_tokens)
print("--- Extracted Components ---")
print(f"SANAD: {final_sanad.strip()}\n")
print(f"MATN: {final_matn.strip()}")
```
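The example above assigns each subword token to a component on its own, so a word whose pieces receive different labels can end up split between the two outputs. One way to avoid this, sketched here under assumptions, is to keep only the label of each word's first subword. The function name is hypothetical; the `word_ids` list is assumed to come from a fast tokenizer's `BatchEncoding.word_ids()`, called before `inputs` is converted to a plain dict, and "word" means whatever unit the tokenizer's pre-tokenization produces.

```python
def first_subword_labels(word_ids, predictions):
    """Keep one predicted label ID per word: that of its first subword.

    word_ids: one entry per token, None for special tokens such as [CLS].
    predictions: one predicted label ID per token, same length as word_ids.
    """
    labels_per_word = []
    previous = None
    for word_id, pred in zip(word_ids, predictions):
        # A new word starts when the word index changes to a non-None value.
        if word_id is not None and word_id != previous:
            labels_per_word.append(pred)
        previous = word_id
    return labels_per_word


# Hypothetical values: [CLS], a two-piece word, a one-piece word, [SEP].
word_ids = [None, 0, 0, 1, None]
predictions = [0, 0, 3, 2, 3]
print(first_subword_labels(word_ids, predictions))  # [0, 2]
```

The resulting per-word label IDs can then be mapped through `id2label` and grouped in the same way as the token loop above.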
### Limitations & Biases
* **Vocalization (Harakat):** Performance may vary slightly depending on whether the input is fully vocalized or has had its diacritics removed. If results are inconsistent, normalize the text (for example, by stripping tashkeel) before inference.
* **Length Constraints:** The model accepts at most 512 subword tokens per sequence, a limit inherited from the BERT base architecture. The example above truncates longer inputs (`truncation=True, max_length=512`).
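A minimal sketch of the normalization step mentioned above, using only the standard library. The function name is hypothetical, and the choice of characters to strip (the marks U+064B to U+0652, the superscript alef U+0670, and the tatweel U+0640) is an assumption; the preprocessing applied to the training data is not documented here and may differ.

```python
import re

# Assumed character set: Arabic diacritics from fathatan to sukun,
# the superscript alef, and the tatweel elongation character.
TASHKEEL = re.compile(r"[\u064B-\u0652\u0670\u0640]")


def strip_tashkeel(text: str) -> str:
    """Remove diacritics and tatweel, leaving base letters and spacing intact."""
    return TASHKEEL.sub("", text)


# "Muhammad" written with full diacritics, given as escapes for clarity.
vocalized = "\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F"
print(strip_tashkeel(vocalized))  # محمد
```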
---
## Training Data & Methodology
* **Base Pretrained Architecture:** `aubmindlab/bert-base-arabertv02`
* **Task:** Token Classification (NER-style sequence labeling)
* **Optimization Framework:** Hugging Face `Trainer` API used with `DataCollatorForTokenClassification`, which pads label sequences with `-100` so that padded positions are ignored by the loss (`ignore_index=-100`).
* **Hyperparameters:**
    * Learning Rate: `2e-5`
    * Weight Decay: `0.01`
    * Batch Size: `16`
    * Training Epochs: `3`
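To show what the `-100` convention looks like in practice, here is a sketch of the common label-alignment step for token classification. It assumes word-level labels and that only the first subword of each word carries a label; whether this model was trained that way is not stated, and the function name is hypothetical.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(word_labels, word_ids):
    """Expand word-level label IDs to token level.

    Special tokens (word_id None) and continuation subwords get IGNORE_INDEX;
    the first subword of each word gets that word's label.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)          # continuation subword
        previous = word_id
    return aligned


# Hypothetical input: three words, the second split into two subwords.
word_labels = [0, 1, 2]
word_ids = [None, 0, 1, 1, 2, None]
print(align_labels(word_labels, word_ids))  # [-100, 0, 1, -100, 2, -100]
```

When batches are built, the data collator pads shorter label lists with the same value, so padding never contributes to the loss.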
---
## Technical Specifications & Requirements
To set up a local development environment or fine-tune this model further, make sure the following packages are installed and up to date:
```bash
pip install torch transformers datasets accelerate
```