	---
	license: mit
	language:
	- ar
	base_model:
	- aubmindlab/bert-base-arabertv02
	pipeline_tag: token-classification
	tags:
	- hadith
	- sanad
	- matn
	- hadith-separator
	- hadith_separator
	- islam
	- hadithBERT
	library_name: transformers
	---
	# Arabic Hadith Segmentation BERT (Sanad & Matn Parser)

	This model is a fine-tuned version of AraBERT (`aubmindlab/bert-base-arabertv02`) for structural token classification in classical Islamic texts. Its task is sequence labeling: identifying the boundary between the Sanad (سند - the chain of narrators) and the Matn (متن - the text of the prophetic saying) within a raw, unsegmented Hadith string.

	## Model Description

	Classical Arabic Hadith texts carry no native punctuation or structural delimiters that separate the chain of narrators from the saying itself. This model treats boundary segmentation as a Named Entity Recognition (NER) style token classification task using custom-mapped IOB-style tags.

	Given a tokenized Hadith, the model assigns each token one of the following label IDs:

	* `0`: `B-SANAD` (Beginning of the Narrator Chain)
	* `1`: `I-SANAD` (Inside the Narrator Chain)
	* `2`: `B-MATN` (Beginning of the Core Saying)
	* `3`: `I-MATN` (Inside the Core Saying)
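
As a minimal sketch of how these label IDs can be turned into a boundary, the snippet below splits a word list at the first `B-MATN` tag. The helper name `split_at_matn` and the toy labels are illustrative assumptions, not part of the model or its output.

```python
# Label IDs as listed above.
ID2LABEL = {0: "B-SANAD", 1: "I-SANAD", 2: "B-MATN", 3: "I-MATN"}


def split_at_matn(words, label_ids):
    """Split words into (sanad, matn) at the first B-MATN tag.

    Hypothetical helper: if no B-MATN is present, every word is
    treated as part of the Sanad.
    """
    labels = [ID2LABEL[i] for i in label_ids]
    boundary = labels.index("B-MATN") if "B-MATN" in labels else len(words)
    return words[:boundary], words[boundary:]


# Toy, hand-written labels (not real model predictions).
words = ["حدثنا", "سفيان", "قال", "إنما", "الأعمال", "بالنيات"]
label_ids = [0, 1, 1, 2, 3, 3]

sanad, matn = split_at_matn(words, label_ids)
print("SANAD:", " ".join(sanad))
print("MATN:", " ".join(matn))
```

Splitting at a single boundary, rather than grouping tokens by label, keeps each segment contiguous even if a few individual tokens are mislabeled.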

	---

	## Intended Uses & Limitations

	### How to Use

	You can load the model directly from the Hub and run it in Python with the Hugging Face `transformers` library.

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForTokenClassification

	# 1. Load model and tokenizer directly from the Hub
	model_name = "SHK4K/hadith-segmentation-bert"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForTokenClassification.from_pretrained(model_name)

	# Set model to evaluation mode
	device = "cuda" if torch.cuda.is_available() else "cpu"
	model.to(device)
	model.eval()

	# ID mapping dict matching model configuration
	id2label = {0: "B-SANAD", 1: "I-SANAD", 2: "B-MATN", 3: "I-MATN"}

	# 2. Input your raw, unsegmented Hadith text
	raw_hadith = 'حدثنا الحميدي عبد الله بن الزبير ، قال : حدثنا سفيان ، قال : حدثنا يحيى بن سعيد الأنصاري ، قال : أخبرني محمد بن إبراهيم التيمي ، أنه سمع علقمة بن وقاص الليثي ، يقول : سمعت عمر بن الخطاب رضي الله عنه على المنبر، قال : سمعت رسول الله صلى الله عليه وسلم، يقول : " إنما الأعمال بالنيات، وإنما لكل امرئ ما نوى، فمن كانت هجرته إلى دنيا يصيبها أو إلى امرأة ينكحها، فهجرته إلى ما هاجر إليه'

	# Tokenize raw text string
	inputs = tokenizer(raw_hadith, return_tensors="pt", truncation=True, max_length=512)
	inputs = {k: v.to(device) for k, v in inputs.items()}

	# 3. Predict Token Categories
	with torch.no_grad():
	    outputs = model(**inputs)

	predictions = torch.argmax(outputs.logits, dim=-1)[0].cpu().tolist()
	input_tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

	# 4. Extract and Group Tokens based on Predicted Labels
	sanad_tokens = []
	matn_tokens = []

	for token, pred_id in zip(input_tokens, predictions):
	    # Skip special tokens added by the tokenizer
	    if token in ["[CLS]", "[SEP]", "[PAD]"]:
	        continue

	    label = id2label.get(pred_id, "O")

	    if "SANAD" in label:
	        sanad_tokens.append(token)
	    elif "MATN" in label:
	        matn_tokens.append(token)

	# Reconstruct clean component strings
	final_sanad = tokenizer.convert_tokens_to_string(sanad_tokens)
	final_matn = tokenizer.convert_tokens_to_string(matn_tokens)

	print("--- Extracted Components ---")
	print(f"SANAD: {final_sanad.strip()}\n")
	print(f"MATN: {final_matn.strip()}")

	```
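
The example above works on subword tokens. If word-level labels are preferred, one common option is to keep the prediction of the first subword of each word. The sketch below assumes a fast tokenizer, whose `BatchEncoding.word_ids()` returns one word index per token and `None` for special tokens. The helper name `word_level_labels` and the toy inputs are hypothetical.

```python
def word_level_labels(word_ids, token_label_ids):
    """Collapse per-token label IDs to one label ID per word.

    word_ids: per-token word index, None for special tokens
              (the shape returned by BatchEncoding.word_ids()).
    token_label_ids: predicted label ID for every token.
    Keeps the label of the first subword of each word (an assumption;
    majority voting over subwords is another option).
    """
    labels = []
    previous = None
    for word_id, label_id in zip(word_ids, token_label_ids):
        if word_id is None:
            continue  # special token such as [CLS] or [SEP]
        if word_id != previous:
            labels.append(label_id)  # first subword of a new word
        previous = word_id
    return labels


# Toy inputs: [CLS], a two-piece word, a one-piece word, a three-piece word, [SEP].
toy_word_ids = [None, 0, 0, 1, 2, 2, 2, None]
toy_predictions = [0, 0, 1, 1, 2, 3, 3, 3]
print(word_level_labels(toy_word_ids, toy_predictions))
```

Note that the usage example converts `inputs` into a plain dict when moving tensors to the device, so `word_ids()` would need to be read from the original tokenizer output before that step.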

	### Limitations & Biases

	* Vocalization (Harakat): Performance may vary depending on whether the input is fully vocalized or normalized text. If results degrade on heavily vocalized input, normalize the text (for example, by stripping tashkeel) before inference.
	* Length Constraints: Input is limited to a maximum context length of 512 subword tokens, a limit inherited from the BERT base architecture. Longer inputs are truncated in the usage example above.
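
Both limitations can be worked around in preprocessing. The sketch below shows one possible way to strip tashkeel with a regular expression and to cut a long token sequence into overlapping windows. The function names, the window size of 510 (leaving room for `[CLS]` and `[SEP]`), and the stride are illustrative assumptions, not settings used by the model author.

```python
import re

# Arabic diacritics U+064B-U+0652 (tanween, short vowels, shadda, sukun)
# and the superscript alef U+0670. Also removing tatweel (U+0640, the
# elongation character) is an extra assumption of this sketch.
TASHKEEL = re.compile("[\u064B-\u0652\u0670\u0640]")


def strip_tashkeel(text):
    """Return text with diacritics and tatweel removed."""
    return TASHKEEL.sub("", text)


def sliding_windows(items, size=510, stride=384):
    """Split a sequence into overlapping windows of at most `size` items."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(items[start:start + size])
        if start + size >= len(items):
            break
        start += stride
    return windows


print(strip_tashkeel("إِنَّمَا الْأَعْمَالُ بِالنِّيَّاتِ"))
print([len(w) for w in sliding_windows(list(range(1000)))])
```

How predictions from overlapping windows are merged (for example, preferring the window in which a token is furthest from the edge) is left open here.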

	---

	## Training Data & Methodology

	* Base Pretrained Architecture: `aubmindlab/bert-base-arabertv02`
	* Task: Token Classification (NER-style sequence labeling)
	* Optimization Framework: Hugging Face `Trainer` API with `DataCollatorForTokenClassification`, which pads labels with `-100` so that padded positions are ignored by the loss (`ignore_index=-100`).
	* Hyperparameters:
	  * Learning Rate: `2e-5`
	  * Weight Decay: `0.01`
	  * Batch Size: `16`
	  * Training Epochs: `3`
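
The `-100` convention mentioned above is usually applied when word-level labels are spread over subword tokens. The card does not say how this was done for this model, so the following is only a sketch of the common recipe: label the first subword of each word and mask special tokens and continuation subwords with `-100`. The helper name `align_labels` and the toy inputs are hypothetical.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(word_ids, word_labels):
    """Expand word-level label IDs to token level.

    word_ids: per-token word index, None for special tokens.
    word_labels: one label ID per word.
    Assumption: only the first subword of each word carries the label.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # [CLS], [SEP], padding
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword
        else:
            aligned.append(IGNORE_INDEX)  # continuation subword
        previous = word_id
    return aligned


# Toy inputs: [CLS], a two-piece word, two one-piece words, [SEP].
toy_ids = [None, 0, 0, 1, 2, None]
toy_labels = [0, 1, 2]
print(align_labels(toy_ids, toy_labels))
```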



	---

	## Technical Specifications & Requirements

	To run the model locally or fine-tune it further, make sure the following packages are installed and up to date:

	```bash
	pip install torch transformers datasets accelerate

	```