AhmedNabil1 committed on
Commit bb9ec05 · verified · 1 Parent(s): 0eb4882

Update README.md

Files changed (1): README.md +168 −7
README.md CHANGED
@@ -5,19 +5,180 @@ tags:
  - text-generation-inference
  - transformers
  - unsloth
- - qwen2
  - trl
  license: apache-2.0
  language:
  - ar
  ---
 
- # Uploaded model
 
- - **Developed by:** AhmedNabil1
- - **License:** apache-2.0
- - **Finetuned from model :** unsloth/qwen2.5-0.5b-instruct
 
- This qwen2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.
 
- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
  - text-generation-inference
  - transformers
  - unsloth
  - trl
+ - NER
+ - qwen2.5
+ - QLoRA
  license: apache-2.0
  language:
  - ar
  ---
 
+ # Arabic NER Model - Qwen2.5-0.5B Fine-tuned on Wojood Dataset
+
+ ## Model Description
+
+ This model is a fine-tuned version of [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) for Arabic Named Entity Recognition (NER). It was trained on a sample of the **Wojood dataset** provided by SinaLab.
+
+ ## Dataset
+
+ **Original Source**: [SinaLab/ArabicNER](https://github.com/SinaLab/ArabicNER)<br>
+ **Important**: This dataset is only a sample of the full Wojood dataset, as SinaLab has not released the complete dataset publicly.
+
+ **Processed Dataset**: [AhmedNabil1/wojood-arabic-ner](https://huggingface.co/datasets/AhmedNabil1/wojood-arabic-ner)<br>
+ The data has been processed and converted into JSON format, structured for fine-tuning on NER tasks with proper formatting and tokenization.
+
+ ## Supported Entity Types
+
+ **PERS** (Person), **ORG** (Organization), **GPE** (Geopolitical entities: countries, cities), **LOC** (Locations), **DATE**, **TIME**, **CARDINAL**, **ORDINAL**, **PERCENT**, **MONEY**, **QUANTITY**, **EVENT**, **FAC** (Facilities), **NORP** (Nationalities, religious/political groups), **OCC** (Occupations), **LANGUAGE**, **WEBSITE**, **UNIT** (Units of measurement), **LAW** (Legal documents), **PRODUCT**, **CURR** (Currencies)
+
+ ## Training Details
+
+ **Base Model**: [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)<br>
+ Fine-tuned using [**Unsloth**](https://github.com/unslothai/unsloth) with **QLoRA**.
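The card does not publish the exact LoRA hyperparameters. The sketch below shows what a typical Unsloth QLoRA setup looks like; the rank, alpha, and target modules are illustrative assumptions, not the author's actual configuration:

```python
# Hypothetical QLoRA settings -- illustrative assumptions, NOT the
# configuration actually used to train this model.
QLORA_CONFIG = {
    "r": 16,                # LoRA rank
    "lora_alpha": 16,       # LoRA scaling factor
    "lora_dropout": 0.0,
    "target_modules": [     # attention and MLP projections commonly adapted
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

def apply_qlora(model):
    """Attach LoRA adapters to a 4-bit base model loaded with Unsloth."""
    from unsloth import FastLanguageModel  # deferred import: requires a GPU environment
    return FastLanguageModel.get_peft_model(model, **QLORA_CONFIG)
```

In QLoRA the frozen base weights stay in 4-bit precision while only the small LoRA adapter matrices are trained, which is what makes fine-tuning a 0.5B model feasible on modest hardware.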
+
+ ## Usage
+
+ ### Installation
+
+ ```bash
+ pip install torch transformers unsloth
+ ```
+
+ ### Loading the Model
+
+ ```python
+ from unsloth import FastLanguageModel
+
+ # Load model and tokenizer
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name="AhmedNabil1/arabic_ner_qwen_model",
+     max_seq_length=2048,
+     dtype=None,          # auto-detect dtype
+     load_in_4bit=True,   # 4-bit quantization to reduce memory use
+ )
+
+ # Enable inference mode
+ model = FastLanguageModel.for_inference(model)
+ ```
+
+ ### Entity Extraction Function
+
+ ```python
+ import json
+ from typing import List, Literal
+
+ from pydantic import BaseModel, Field
+
+ # Define entity types and schema
+ EntityType = Literal[
+     "PERS", "NORP", "OCC", "ORG", "GPE", "LOC", "FAC", "EVENT",
+     "DATE", "TIME", "CARDINAL", "ORDINAL", "PERCENT", "LANGUAGE",
+     "QUANTITY", "WEBSITE", "UNIT", "LAW", "MONEY", "PRODUCT", "CURR"
+ ]
+
+ class NEREntity(BaseModel):
+     entity_value: str = Field(..., description="The actual named entity found in the text.")
+     entity_type: EntityType = Field(..., description="The entity type")
+
+ class NERData(BaseModel):
+     story_entities: List[NEREntity] = Field(..., description="A list of entities found in the text.")
+
+ def extract_entities_from_story(story, model, tokenizer):
+     """
+     Extract named entities from Arabic text.
+     This function demonstrates the recommended approach for optimal results.
+     """
+     entities_extraction_messages = [
+         {
+             "role": "system",
+             "content": "\n".join([
+                 "You are an advanced NLP entity extraction assistant.",
+                 "Your task is to extract named entities from Arabic text according to a given Pydantic schema.",
+                 "Ensure that the extracted entities exactly match how they appear in the text, without modifications.",
+                 "Follow the schema strictly, maintaining the correct entity types and structure.",
+                 "Output the extracted entities in JSON format, structured according to the provided Pydantic schema.",
+                 "Do not add explanations, introductions, or extra text. Only return the formatted JSON output."
+             ])
+         },
+         {
+             "role": "user",
+             "content": "\n".join([
+                 "## Text:",
+                 story.strip(),
+                 "",
+                 "## Pydantic Schema:",
+                 json.dumps(NERData.model_json_schema(), ensure_ascii=False, indent=2),
+                 "",
+                 "## Text Entities:",
+                 "```json"
+             ])
+         }
+     ]
+
+     # Apply chat template
+     text = tokenizer.apply_chat_template(
+         entities_extraction_messages,
+         tokenize=False,
+         add_generation_prompt=True
+     )
+
+     # Generate response
+     model_inputs = tokenizer([text], return_tensors="pt").to("cuda")
+     generated_ids = model.generate(
+         model_inputs.input_ids,
+         max_new_tokens=1024,
+         do_sample=False,
+     )
+
+     # Decode only the newly generated tokens
+     generated_ids = [
+         output_ids[len(input_ids):]
+         for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
+     ]
+     response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+
+     return response
+ ```
+
+ ### Example Usage
+
+ ```python
+ import json
+
+ # Example Arabic text
+ story = """
+ مضابط بلدية نابلس عام ( 1308 ) هجري مضبط رقم 435 .
+ """
+
+ # Extract entities
+ response = extract_entities_from_story(story, model, tokenizer)
+ print(response)
+
+ # Parse JSON response
+ entities = json.loads(response)
+ print(entities)
+ ```
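Because the user prompt primes the model with an opening JSON code fence, the raw response may come back wrapped in Markdown fences, in which case a bare `json.loads(response)` raises a `JSONDecodeError`. A small defensive parser (a sketch added here, not part of the original card) tolerates both cases:

```python
import json

def parse_ner_response(response: str) -> dict:
    """Parse the model's JSON answer, tolerating optional Markdown code fences."""
    cleaned = response.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence line, e.g. ```json
        cleaned = cleaned.split("\n", 1)[1]
    if cleaned.endswith("```"):
        # Drop the closing fence
        cleaned = cleaned[:-3].rstrip()
    return json.loads(cleaned)
```

`entities = parse_ner_response(response)` then works whether or not the model emitted the fences.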
+
+ **Output:**
+
+ ```json
+ {
+   "story_entities": [
+     {"entity_value": "بلدية نابلس", "entity_type": "ORG"},
+     {"entity_value": "نابلس", "entity_type": "GPE"},
+     {"entity_value": "عام ( 1308 ) هجري", "entity_type": "DATE"},
+     {"entity_value": "435", "entity_type": "ORDINAL"}
+   ]
+ }
+ ```
+
+ ## Model Performance
+
+ The model performs well on Arabic NER tasks within the scope of the available training data.
+ It was trained on a limited sample of the Wojood dataset, and that sample exhibits class imbalance across entity types, which may lead to varying recognition accuracy for certain entity types.
+
+ ## Citation
+
+ - Wojood dataset: [SinaLab/ArabicNER](https://github.com/SinaLab/ArabicNER)
+ - Base Qwen2.5 model: [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)
+
+ ## License
+
+ This model follows the license terms of the base Qwen2.5 model and the Wojood dataset.