Muhsabrys committed on
Commit
8c6acc1
·
verified ·
1 Parent(s): 1791be7

Update README.md

Files changed (1)
  1. README.md +221 -0
README.md CHANGED
@@ -41,3 +41,224 @@ model-index:
41
  year: 2025
42
  url: https://aclanthology.org/2025.finnlp-1.20
43
  ---
44
+
45
+
46
+
47
+ # AMWAL: Arabic Financial Named Entity Recognition (NER)
48
+
49
+ ## Quick Start
50
+
51
+ ### Install (recommended)
52
+
53
+ ```bash
54
+ pip install git+https://huggingface.co/Muhsabrys/AMWAL-ner-arabic
55
+ ```
56
+
57
+ ```python
58
+ from amwal import load_ner
59
+
60
+ ner = load_ner()
61
+
62
+ text = "أعلن صندوق قطر السيادي عن استثمار بقيمة 500 مليون دولار أمريكي في سندات حكومية يابانية مقومة بالين في طوكيو."  # English: "Qatar's sovereign fund announced an investment of 500 million US dollars in yen-denominated Japanese government bonds in Tokyo."
63
+ result = ner(text)
64
+
65
+ print(result["entities"])
66
+ ```
67
+
68
+ ---
69
+
70
+ ## Model Summary
71
+
72
+ **AMWAL** is a **spaCy-based Named Entity Recognition (NER) system** designed for extracting **financial entities from Arabic text**, with a primary focus on **Arabic financial news and reports**.
73
+
74
+ The model addresses challenges specific to Arabic financial NLP, including orthographic variation, domain-specific terminology, and the scarcity of annotated financial resources for Arabic.
75
+
76
+ ---
77
+
78
+ ## Intended Use
79
+
80
+ AMWAL is intended for:
81
+
82
+ * Arabic financial news analysis
83
+ * Information extraction from financial reports
84
+ * Financial text preprocessing
85
+ * Academic research in Arabic NLP and finance
86
+ * Data enrichment for financial knowledge graphs
87
+
88
+ It is **not intended** for:
89
+
90
+ * General-purpose Arabic NER
91
+ * Non-financial domains
92
+ * Direct use with Hugging Face Transformers APIs
93
+
94
+ ---
95
+
96
+ ## Data Collection and Annotation
97
+
98
+ A specialized Arabic financial corpus was constructed from **three major Arabic financial newspapers**, covering the period **2000–2023**.
99
+
100
+ The annotation process followed a **semi-automatic workflow**:
101
+
102
+ 1. Automatic candidate entity extraction
103
+ 2. Manual annotation
104
+ 3. Expert review and correction
105
+
106
+ The final dataset contains:
107
+
108
+ * **17.1K annotated entity tokens**
109
+ * **21 financial entity categories**
110
+ * Consistent domain coverage across multiple time periods
111
+
112
+ ---
113
+
114
+ ## Entity Schema and Standardization
115
+
116
+ Entity categories were standardized using concepts from the
117
+ **Financial Industry Business Ontology (FIBO, 2020)** to ensure conceptual consistency and compatibility with structured financial representations.
118
+
119
+ ---
120
+
121
+ ## Model Architecture and Training
122
+
123
+ * **Framework:** spaCy
124
+ * **Pipeline:** Custom Named Entity Recognition (NER)
125
+ * **Domain:** Arabic financial text
126
+
127
+ The model was trained on the annotated corpus using spaCy's NER pipeline.
128
+ To mitigate sparsity caused by Arabic orthographic variation, normalization was applied consistently during training and inference.
129
+
130
+ ---
131
+
132
+ ## Arabic Normalization
133
+
134
+ The following normalization steps are applied **internally during inference**, matching the training setup:
135
+
136
+ * Removal of all diacritics
137
+ * Character normalization:
138
+
139
+   * `إ`, `أ`, `آ` → `ا`
140
+   * `ؤ`, `ئ` → `ء`
141
+   * `ة` → `ه`
142
+   * `ى` → `ي`
143
+
144
+ The original input text is always preserved and returned as `raw_text`.
145
+
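The normalization rules listed above can be sketched in a few lines of Python. This is a minimal illustration, not the package's actual code: the function name `normalize_arabic` and the exact diacritic range (U+064B to U+0652, plus U+0670) are assumptions, since the README only states that all diacritics are removed.

```python
import re

# Assumed diacritic set: fathatan..sukun (U+064B-U+0652) and superscript alef (U+0670).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

# Character mappings taken from the list in the README.
_CHAR_MAP = str.maketrans({
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0624": "\u0621",  # waw with hamza -> hamza
    "\u0626": "\u0621",  # yeh with hamza -> hamza
    "\u0629": "\u0647",  # teh marbuta -> heh
    "\u0649": "\u064A",  # alef maksura -> yeh
})


def normalize_arabic(text: str) -> str:
    """Strip diacritics, then apply the one-to-one character mappings."""
    return _DIACRITICS.sub("", text).translate(_CHAR_MAP)


# Hypothetical usage: keep the raw text next to the normalized form.
raw_text = "\u0623\u0639\u0644\u0646"  # the first word of the Quick Start sentence
normalized_text = normalize_arabic(raw_text)
```

The character mappings are one-to-one, but diacritic removal shortens the string, so character offsets in the normalized text can differ from offsets in the raw text whenever the input is vocalized.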
146
+ ---
147
+
148
+ ## Entity Types
149
+
150
+ The model recognizes **21 financial entity types**, including (but not limited to):
151
+
152
+ * `COUNTRY`
153
+ * `CITY`
154
+ * `CURRENCY`
155
+ * `FINANCIAL_INSTRUMENT`
156
+ * `BANK`
157
+ * `ORGANIZATION`
158
+ * `NATIONALITY`
159
+ * `EVENT`
160
+ * `TIME`
161
+ * `QUANTITY_OR_UNIT`
162
+
163
+ ---
164
+
165
+ ## Evaluation Results
166
+
167
+ The model was evaluated on a held-out test set using standard NER metrics:
168
+
169
+ | Metric | Score |
170
+ | --------- | ---------- |
171
+ | Precision | **96.08%** |
172
+ | Recall | **95.87%** |
173
+ | F1-score | **95.97%** |
174
+
175
+ These results are competitive with reported financial NER systems in other languages, despite the additional challenges posed by Arabic morphology and orthography.
176
+
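As a quick consistency check on the table above, the F1-score can be recomputed from the reported precision and recall, assuming the usual definition of F1 as their harmonic mean. The helper name `f1_score` is illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same unit, here percent)."""
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation table.
f1 = f1_score(96.08, 95.87)
print(round(f1, 2))  # 95.97, matching the reported F1-score
```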
177
+ ---
178
+
179
+ ## Usage
180
+
181
+ AMWAL has **two officially supported usage modes**.
182
+
183
+ ### Option 1: Install via `pip` (recommended)
184
+
185
+ ```bash
186
+ pip install git+https://huggingface.co/Muhsabrys/AMWAL-ner-arabic
187
+ ```
188
+
189
+ ```python
190
+ from amwal import load_ner
191
+
192
+ ner = load_ner()
193
+ result = ner("نص عربي مالي")  # "Arabic financial text"
194
+ ```
195
+
196
+ ---
197
+
198
+ ### Option 2: Use directly from Hugging Face (without installing the package)
199
+
200
+ ```python
201
+ from huggingface_hub import snapshot_download
202
+ import sys
203
+
204
+ repo_path = snapshot_download("Muhsabrys/AMWAL-ner-arabic")
205
+ sys.path.append(repo_path)
206
+
207
+ from amwal import load_ner
208
+
209
+ ner = load_ner(local_path=repo_path)
210
+ result = ner("نص عربي مالي")  # "Arabic financial text"
211
+ ```
212
+
213
+ ---
214
+
215
+ ## Output Format
216
+
217
+ ```json
218
+ {
219
+   "raw_text": "...",
220
+   "normalized_text": "...",
221
+   "entities": [
222
+     {
223
+       "text": "قطر",
224
+       "label": "COUNTRY",
225
+       "start": 11,
226
+       "end": 14
227
+     }
228
+   ]
229
+ }
230
+ ```
231
+
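A result in the format above is a plain dictionary, so it can be post-processed without any AMWAL-specific code. The sketch below groups entity strings by label and slices one entity back out of the text by its offsets. The helper `group_by_label` and the hand-built `sample` are hypothetical, and treating `start`/`end` as character offsets into `normalized_text` is an assumption; the README does not say which text the offsets refer to.

```python
from collections import defaultdict


def group_by_label(result: dict) -> dict:
    """Map each entity label to the list of entity strings found for it."""
    grouped = defaultdict(list)
    for ent in result["entities"]:
        grouped[ent["label"]].append(ent["text"])
    return dict(grouped)


# Hand-built sample following the documented output format,
# using the first three words of the Quick Start sentence.
sample = {
    "raw_text": "أعلن صندوق قطر",
    "normalized_text": "اعلن صندوق قطر",
    "entities": [
        {"text": "قطر", "label": "COUNTRY", "start": 11, "end": 14},
    ],
}

grouped = group_by_label(sample)

# Assumption: offsets index into the normalized text, end-exclusive.
first = sample["entities"][0]
span = sample["normalized_text"][first["start"]:first["end"]]
```

In this sample the raw and normalized strings have the same length, so the slice gives the same span on either one.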
232
+ ---
233
+
234
+ ## Limitations
235
+
236
+ * Domain-specific to financial text
237
+ * Not suitable for general-purpose Arabic NER
238
+ * Does not model relations between entities
239
+ * Not compatible with Hugging Face Transformers APIs
240
+
241
+ ---
242
+
243
+ ## Future Work
244
+
245
+ Planned future directions include:
246
+
247
+ * Expanding the annotated corpus
248
+ * Introducing hierarchical entity structures
249
+ * Modeling relations between financial entities
250
+ * Constructing an Arabic financial knowledge graph
251
+
252
+ ---
253
+
254
+ ## Citation
255
+
256
+ ```bibtex
257
+ @inproceedings{abdo2025amwal,
258
+ title={AMWAL: Named Entity Recognition for Arabic Financial News},
259
+ author={Abdo, Muhammad S and Hatekar, Yash and {\'C}avar, Damir},
260
+ booktitle={Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal)},
261
+ pages={207--213},
262
+ year={2025}
263
+ }
264
+ ```