AhmedNabil1
/

arabic_ner_qwen_model

@@ -5,19 +5,180 @@ tags:
 - text-generation-inference
 - transformers
 - unsloth
-- qwen2
 - trl
 license: apache-2.0
 language:
 - ar
 ---
-# Uploaded  model
-- **Developed by:** AhmedNabil1
-- **License:** apache-2.0
-- **Finetuned from model :** unsloth/qwen2.5-0.5b-instruct
-This qwen2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.
-[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

 - text-generation-inference
 - transformers
 - unsloth
 - trl
+- NER
+- qwen2.5
+- QLoRA
 license: apache-2.0
 language:
 - ar
 ---
+# Arabic NER Model - Qwen2.5-0.5B Fine-tuned on Wojood Dataset
+## Model Description
+This model is a fine-tuned version of [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) for Arabic Named Entity Recognition (NER). It was trained on a sample of the **Wojood dataset** provided by SinaLab.
+## Dataset
+**Original Source**: [SinaLab/ArabicNER](https://github.com/SinaLab/ArabicNER)<br>
+**Important**: The training data is only a sample of the full Wojood dataset, because SinaLab has not released the complete dataset publicly.<br>
+**Processed Dataset**: [AhmedNabil1/wojood-arabic-ner](https://huggingface.co/datasets/AhmedNabil1/wojood-arabic-ner)<br>
+The data has been converted to JSON, then formatted and tokenized for NER fine-tuning.
+## Supported Entity Types
+**PERS** (Person), **ORG** (Organizations), **GPE** (Geopolitical entities: countries, cities), **LOC** (Locations), **DATE**, **TIME**, **CARDINAL**, **ORDINAL**, **PERCENT**, **MONEY**, **QUANTITY**, **EVENT**, **FAC** (Facilities), **NORP** (Nationalities, religious/political groups), **OCC** (Occupations), **LANGUAGE**, **WEBSITE**, **UNIT** (Units of measurement), **LAW** (Legal documents), **PRODUCT**, **CURR** (Currencies)
+## Training Details
+**Base Model**: [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)<br>
+Fine-tuned using [**Unsloth**](https://github.com/unslothai/unsloth) with **QLoRA**.
+## Usage
+### Installation
+```bash
+pip install torch transformers unsloth pydantic
+```
+### Loading the Model
+```python
+from unsloth import FastLanguageModel
+# Load model and tokenizer
+model, tokenizer = FastLanguageModel.from_pretrained(
+    model_name="AhmedNabil1/arabic_ner_qwen_model",
+    max_seq_length=2048,
+    dtype=None,
+    load_in_4bit=True,
+)
+# Enable inference mode (applied to the model in place)
+FastLanguageModel.for_inference(model)
+```
+### Entity Extraction Function
+```python
+# Define entity types and schema
+import json  # needed for json.dumps when building the prompt below
+from typing import List, Literal
+from pydantic import BaseModel, Field
+EntityType = Literal[
+    "PERS", "NORP", "OCC", "ORG", "GPE", "LOC", "FAC", "EVENT",
+    "DATE", "TIME", "CARDINAL", "ORDINAL", "PERCENT", "LANGUAGE",
+    "QUANTITY", "WEBSITE", "UNIT", "LAW", "MONEY", "PRODUCT", "CURR"
+]
+class NEREntity(BaseModel):
+    entity_value: str = Field(..., description="The actual named entity found in the text.")
+    entity_type: EntityType = Field(..., description="The entity type")
+class NERData(BaseModel):
+    story_entities: List[NEREntity] = Field(..., description="A list of entities found in the text.")
+def extract_entities_from_story(story, model, tokenizer):
+    """
+    Extract named entities from Arabic text.
+    This function demonstrates the recommended approach for optimal results.
+    """
+    entities_extraction_messages = [
+        {
+            "role": "system",
+            "content": "\n".join([
+                "You are an advanced NLP entity extraction assistant.",
+                "Your task is to extract named entities from Arabic text according to a given Pydantic schema.",
+                "Ensure that the extracted entities exactly match how they appear in the text, without modifications.",
+                "Follow the schema strictly, maintaining the correct entity types and structure.",
+                "Output the extracted entities in JSON format, structured according to the provided Pydantic schema.",
+                "Do not add explanations, introductions, or extra text, Only return the formatted JSON output."
+            ])
+        },
+        {
+            "role": "user",
+            "content": "\n".join([
+                "## Text:",
+                story.strip(),
+                "",
+                "## Pydantic Schema:",
+                json.dumps(NERData.model_json_schema(), ensure_ascii=False, indent=2),
+                "",
+                "## Text Entities:",
+                "```json"
+            ])
+        }
+    ]
+    # Apply chat template
+    text = tokenizer.apply_chat_template(
+        entities_extraction_messages,
+        tokenize=False,
+        add_generation_prompt=True
+    )
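+    # Generate response with greedy decoding, on the same device as the model
+    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+    generated_ids = model.generate(
+        **model_inputs,  # passes input_ids together with the attention mask
+        max_new_tokens=1024,
+        do_sample=False,
+    )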
+    # Decode response
+    generated_ids = [
+        output_ids[len(input_ids):]
+        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
+    ]
+    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+    return response
+```
+### Example Usage
+```python
+# Example Arabic text
+story = """
+مضابط بلدية نابلس عام ( 1308 ) هجري مضبط رقم 435 .
+"""
+# Extract entities
+response = extract_entities_from_story(story, model, tokenizer)
+print(response)
+# Parse JSON response
+import json
+entities = json.loads(response)
+print(entities)
+```
+**Output:**
+```json
+{
+  "story_entities": [
+    {"entity_value": "بلدية نابلس", "entity_type": "ORG"},
+    {"entity_value": "نابلس", "entity_type": "GPE"},
+    {"entity_value": "عام ( 1308 ) هجري", "entity_type": "DATE"},
+    {"entity_value": "435", "entity_type": "ORDINAL"}
+  ]
+}
+```
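Because the prompt ends with an opening `json` code fence, the generated text may carry a closing fence or other stray text after the JSON, in which case a bare `json.loads(response)` would raise. The sketch below is one hedged way to handle that. `parse_ner_response` is a hypothetical helper, not part of this model's repository; it trims fence markers, parses the JSON, and keeps only entities whose type is in the supported list. It uses only the standard library and assumes the `story_entities` structure shown in the output above.

```python
import json

# Entity types copied from the schema in this model card.
ALLOWED_TYPES = {
    "PERS", "NORP", "OCC", "ORG", "GPE", "LOC", "FAC", "EVENT",
    "DATE", "TIME", "CARDINAL", "ORDINAL", "PERCENT", "LANGUAGE",
    "QUANTITY", "WEBSITE", "UNIT", "LAW", "MONEY", "PRODUCT", "CURR",
}


def parse_ner_response(response):
    """Hypothetical helper: return (entity_value, entity_type) tuples.

    Returns an empty list when the text is not valid JSON or does not
    follow the assumed {"story_entities": [...]} structure.
    """
    cleaned = response.strip()
    # Drop an opening fence line (for example a line starting with three backticks).
    if cleaned.startswith("```"):
        _, _, cleaned = cleaned.partition("\n")
    # Keep only the text before a closing fence, if one is present.
    cleaned = cleaned.split("```", 1)[0].strip()

    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return []

    items = data.get("story_entities") if isinstance(data, dict) else None
    if not isinstance(items, list):
        return []

    # Keep well-formed entries whose type is one of the supported labels.
    return [
        (item["entity_value"], item["entity_type"])
        for item in items
        if isinstance(item, dict)
        and isinstance(item.get("entity_value"), str)
        and isinstance(item.get("entity_type"), str)
        and item["entity_type"] in ALLOWED_TYPES
    ]
```

Called as `parse_ner_response(response)`, it returns a list of `(entity_value, entity_type)` tuples and silently drops entries with an unsupported type, so callers that need strict validation may prefer to raise instead.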
+## Model Performance
+The model performs well on Arabic NER within the scope of its training data, which is a limited sample of the Wojood dataset.
+That sample is imbalanced across entity types, so recognition accuracy may vary from one entity type to another.
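Since accuracy may vary by entity type, per-type scores are more informative than a single aggregate number. The sketch below is a minimal illustration, assuming gold and predicted entities are available as `(entity_value, entity_type)` tuples for the same text. The source gives no evaluation script, so the helper name `per_type_scores` and the exact-match criterion are assumptions.

```python
from collections import Counter


def per_type_scores(gold, predicted):
    """Exact-match precision, recall and F1 for each entity type.

    gold, predicted: lists of (entity_value, entity_type) tuples.
    A prediction counts as correct only if both value and type match.
    """
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Multiset intersection gives the true positives per (value, type) pair.
    overlap = gold_counts & pred_counts

    types = {t for _, t in gold_counts} | {t for _, t in pred_counts}
    scores = {}
    for etype in sorted(types):
        tp = sum(n for (_, t), n in overlap.items() if t == etype)
        n_gold = sum(n for (_, t), n in gold_counts.items() if t == etype)
        n_pred = sum(n for (_, t), n in pred_counts.items() if t == etype)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold if n_gold else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[etype] = {"precision": precision, "recall": recall, "f1": f1}
    return scores
```

Summing the counts over a held-out set before computing the ratios would give corpus-level per-type scores, which makes weak entity types easier to spot than a single overall figure.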
+## Citation
+- Wojood dataset: [SinaLab/ArabicNER](https://github.com/SinaLab/ArabicNER)
+- Base Qwen2.5 model: [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)
+## License
+This model follows the license terms of the base Qwen2.5 model and the Wojood dataset.