Commit 3c70de3 by Muhsabrys (verified) · 1 parent: b4b63ae

Update README.md

  1. README.md +184 -97
README.md CHANGED
@@ -1,180 +1,266 @@
1
 
2
  # AMWAL: Arabic Financial Named Entity Recognition (NER)
3
 
4
- ## Overview
 
 
5
 
6
- **AMWAL** is a **spaCy-based Named Entity Recognition (NER) system** designed specifically for **Arabic financial news and reports**.
7
- It targets the extraction of structured financial entities from unstructured Arabic text, addressing the lack of high-quality Arabic financial NLP resources.
 
8
 
9
- This repository provides:
 
10
 
11
- * A trained **spaCy NER pipeline**
12
- * An integrated **Arabic normalization layer**
13
- * A simple Python API for inference
 
 
 
 
14
 
15
- > ⚠️ This is **not a Transformers / BERT model**.
16
- > Usage is via **spaCy**, not `AutoModelForTokenClassification`.
17
 
18
  ---
19
 
20
- ## Key Features
 
 
21
 
22
- * Domain-specific Arabic **financial entity recognition**
23
- * Robust handling of **Arabic orthographic variation**
24
- * Fine-grained financial entity schema (21 types)
25
- * Ready-to-use inference via Hugging Face
26
- * Suitable for research and downstream financial NLP tasks
27
 
28
  ---
29
 
30
- ## Installation
31
 
32
- ```bash
33
- pip install spacy huggingface_hub
34
- ```
 
 
 
 
 
 
 
 
 
 
35
 
36
  ---
37
 
38
- ## Usage (Recommended)
39
 
40
- ### Load from Hugging Face
41
 
42
- ```python
43
- from amwal import load_ner
44
 
45
- ner = load_ner() # downloads model from Hugging Face
 
 
46
 
47
- text = "أعلن صندوق قطر السيادي عن استثمار بقيمة 500 مليون دولار أمريكي في سندات حكومية يابانية مقومة بالين في طوكيو."
48
- output = ner(text)
49
 
50
- print(output)
51
- ```
 
52
 
53
- ### Output Format
54
 
55
- ```json
- {
-   "raw_text": "...",
-   "normalized_text": "...",
-   "entities": [
-     {
-       "text": "قطر",
-       "label": "COUNTRY",
-       "start": 11,
-       "end": 14
-     },
-     {
-       "text": "دولار",
-       "label": "CURRENCY",
-       "start": 50,
-       "end": 55
-     }
-   ]
- }
- ```
75
 
76
  ---
77
 
78
  ## Arabic Normalization
79
 
80
- Inference applies **the same normalization used during training**, including:
81
 
82
- * Removal of diacritics
83
- * Orthographic normalization:
84
 
85
-   * `إ، أ، آ` → `ا`
-   * `ؤ، ئ` → `ء`
-   * `ة` → `ه`
-   * `ى` → `ي`
89
 
90
- Normalization is applied **internally only**.
91
- The original input text is always preserved in `raw_text`.
92
 
93
  ---
94
 
95
  ## Entity Types
96
 
97
- The model recognizes **21 financial entity categories**, including:
98
 
99
  * `COUNTRY`
100
  * `CITY`
101
  * `CURRENCY`
102
  * `FINANCIAL_INSTRUMENT`
103
- * `ORGANIZATION`
104
  * `BANK`
 
105
  * `NATIONALITY`
106
  * `EVENT`
107
  * `TIME`
108
  * `QUANTITY_OR_UNIT`
109
- * *(and others)*
110
 
111
  ---
112
 
113
- ## Data Collection and Annotation
 
 
114
 
115
- We constructed a **specialized Arabic financial corpus** sourced from **three major Arabic financial newspapers** covering the period **2000–2023**.
 
 
 
 
116
 
117
- Entity annotation followed a **semi-automatic workflow**:
118
 
119
- 1. Automatic candidate extraction
120
- 2. Manual annotation
121
- 3. Expert review and correction
122
 
123
- The final dataset contains:
124
 
125
- * **17.1K annotated entity tokens**
126
- * **21 entity categories**
127
- * High inter-annotator consistency
128
 
129
  ---
130
 
131
- ## Entity Standardization
 
 
 
 
 
 
 
132
 
133
- Entity categories were aligned with concepts from the **Financial Industry Business Ontology (FIBO, 2020)** to ensure conceptual consistency and compatibility with financial knowledge systems.
 
 
134
 
135
  ---
136
 
137
- ## Model Development
138
 
139
- * Framework: **spaCy (custom NER pipeline)**
140
- * Architecture: **spaCy NER with contextual embeddings**
141
- * Training focused on **domain-specific financial language**
142
- * Integrated normalization to reduce Arabic sparsity effects
143
 
144
- > Note: While AraBERT resources informed preprocessing decisions, this release is a **spaCy pipeline**, not a Transformers model.
 
145
 
146
- ---
147
 
148
- ## Evaluation
 
 
149
 
150
- The model was evaluated on a held-out test set with the following results:
151
 
152
- | Metric | Score |
153
- | --------- | ---------- |
154
- | Precision | **96.08%** |
155
- | Recall | **95.87%** |
156
- | F1-score | **95.97%** |
157
 
158
- These results are competitive with, and in some cases exceed, reported financial NER systems in other languages.
 
 
 
 
 
 
 
 
 
 
 
 
 
159
 
160
  ---
161
 
162
  ## Limitations
163
 
164
- * The model is **domain-specific** (financial news and reports)
165
- * It is **not suitable for general-purpose Arabic NER**
166
- * Not compatible with `transformers.AutoModel*` APIs
 
167
 
168
  ---
169
 
170
  ## Future Work
171
 
172
- Planned extensions include:
173
 
174
- * Expanding the corpus size and temporal coverage
175
- * Introducing **hierarchical entity structures**
176
- * Modeling **relations between financial entities**
177
- * Developing an **Arabic financial knowledge graph**
178
 
179
  ---
180
 
@@ -191,4 +277,5 @@ If you use AMWAL in your research, please cite:
191
  year={2025}
192
  }
193
  ```
194
-
 
 
1
+ ---
2
+ language:
3
+ - ar
4
+ license: apache-2.0 # change if needed
5
+ pipeline_tag: token-classification
6
+ library_name: spacy
7
+ tags:
8
+ - arabic
9
+ - named-entity-recognition
10
+ - ner
11
+ - finance
12
+ - financial-ner
13
+ - spacy
14
+ - information-extraction
15
+ - ontology-aligned
16
+ datasets:
17
+ - custom
18
+ ---
19
+
20
56
 
57
  # AMWAL: Arabic Financial Named Entity Recognition (NER)
58
 
59
+ ## Quick Start
60
+
61
+ ### Install (recommended)
62
 
63
+ ```bash
64
+ pip install git+https://huggingface.co/Muhsabrys/AMWAL-ner-arabic
65
+ ```
66
 
67
+ ```python
68
+ from amwal import load_ner
69
 
70
+ ner = load_ner()
71
+
72
+ text = "أعلن صندوق قطر السيادي عن استثمار بقيمة 500 مليون دولار أمريكي في سندات حكومية يابانية مقومة بالين في طوكيو."
73
+ result = ner(text)
74
+
75
+ print(result["entities"])
76
+ ```
77
 
 
 
78
 
79
  ---
80
 
81
+ ## Model Summary
82
+
83
+ **AMWAL** is a **spaCy-based Named Entity Recognition (NER) system** designed for extracting **financial entities from Arabic text**, with a primary focus on **Arabic financial news and reports**.
84
 
85
+ The model addresses challenges specific to Arabic financial NLP, including orthographic variation, domain-specific terminology, and the scarcity of annotated financial resources for Arabic.
 
 
 
 
86
 
87
  ---
88
 
89
+ ## Intended Use
90
 
91
+ AMWAL is intended for:
92
+
93
+ * Arabic financial news analysis
94
+ * Information extraction from financial reports
95
+ * Financial text preprocessing
96
+ * Academic research in Arabic NLP and finance
97
+ * Data enrichment for financial knowledge graphs
98
+
99
+ It is **not intended** for:
100
+
101
+ * General-purpose Arabic NER
102
+ * Non-financial domains
103
+ * Direct use with Hugging Face Transformers APIs
104
 
105
  ---
106
 
107
+ ## Data Collection and Annotation
108
 
109
+ A specialized Arabic financial corpus was constructed from **three major Arabic financial newspapers**, covering the period **2000–2023**.
110
 
111
+ The annotation process followed a **semi-automatic workflow**:
 
112
 
113
+ 1. Automatic candidate entity extraction
114
+ 2. Manual annotation
115
+ 3. Expert review and correction
116
 
117
+ The final dataset contains:
 
118
 
119
+ * **17.1K annotated entity tokens**
120
+ * **21 financial entity categories**
121
+ * Consistent domain coverage across multiple time periods
122
 
123
+ ---
124
 
125
+ ## Entity Schema and Standardization
126
+
127
+ Entity categories were standardized using concepts from the
128
+ **Financial Industry Business Ontology (FIBO, 2020)** to ensure conceptual consistency and compatibility with structured financial representations.
129
+
130
+ ---
131
+
132
+ ## Model Architecture and Training
133
+
134
+ * **Framework:** spaCy
135
+ * **Pipeline:** Custom Named Entity Recognition (NER)
136
+ * **Domain:** Arabic financial text
137
+
138
+ The model was trained on the annotated corpus using spaCy’s NER pipeline.
139
+ To mitigate sparsity caused by Arabic orthographic variation, normalization was applied consistently during training and inference.
 
 
 
 
 
140
 
141
  ---
142
 
143
  ## Arabic Normalization
144
 
145
+ The following normalization steps are applied **internally during inference**, matching the training setup:
146
 
147
+ * Removal of all diacritics
148
+ * Character normalization:
149
 
150
+   * `إ`, `أ`, `آ` → `ا`
+   * `ؤ`, `ئ` → `ء`
+   * `ة` → `ه`
+   * `ى` → `ي`
154
 
155
+ The original input text is always preserved and returned as `raw_text`.
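
As a rough illustration of the steps listed above, the sketch below implements them in plain Python. It is not AMWAL's actual normalization code: the function name `normalize_arabic` and the exact diacritic range (U+064B to U+0652 plus the superscript alef U+0670) are assumptions made for this example.

```python
import re

# Assumed diacritic set: fathatan..sukun (U+064B-U+0652) and superscript alef (U+0670).
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

# One-to-one character substitutions taken from the list above.
_CHAR_MAP = str.maketrans({
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0624": "\u0621",  # waw with hamza        -> hamza
    "\u0626": "\u0621",  # yeh with hamza        -> hamza
    "\u0629": "\u0647",  # teh marbuta           -> heh
    "\u0649": "\u064A",  # alef maksura          -> yeh
})


def normalize_arabic(text: str) -> str:
    """Strip diacritics, then apply the character substitutions."""
    without_diacritics = _DIACRITICS.sub("", text)
    return without_diacritics.translate(_CHAR_MAP)
```

The substitutions are one-to-one, so they keep the string length unchanged. Removing diacritics shortens the text, so character offsets in a diacritized input may differ between the raw and normalized strings.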
 
156
 
157
  ---
158
 
159
  ## Entity Types
160
 
161
+ The model recognizes **21 financial entity types**, including (but not limited to):
162
 
163
  * `COUNTRY`
164
  * `CITY`
165
  * `CURRENCY`
166
  * `FINANCIAL_INSTRUMENT`
 
167
  * `BANK`
168
+ * `ORGANIZATION`
169
  * `NATIONALITY`
170
  * `EVENT`
171
  * `TIME`
172
  * `QUANTITY_OR_UNIT`
 
173
 
174
  ---
175
 
176
+ ## Evaluation Results
177
+
178
+ The model was evaluated on a held-out test set using standard NER metrics:
179
 
180
+ | Metric | Score |
181
+ | --------- | ---------- |
182
+ | Precision | **96.08%** |
183
+ | Recall | **95.87%** |
184
+ | F1-score | **95.97%** |
185
 
186
+ These results are competitive with reported financial NER systems in other languages, despite the additional challenges posed by Arabic morphology and orthography.
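
The reported F1-score can be checked against the precision and recall in the table, since F1 is the harmonic mean of the two. The helper name `f1_score` below is only illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values from the evaluation table, in percent.
f1 = f1_score(96.08, 95.87)
print(f"{f1:.2f}")  # prints 95.97
```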
187
 
188
+ ---
 
 
189
 
190
+ ## Usage
191
 
192
+ AMWAL can be used in **two supported ways**.
 
 
193
 
194
  ---
195
 
196
+ ### Option 1 — Install via `pip` (recommended)
197
+
198
+ ```bash
199
+ pip install git+https://huggingface.co/Muhsabrys/AMWAL-ner-arabic
200
+ ```
201
+
202
+ ```python
203
+ from amwal import load_ner
204
 
205
+ ner = load_ner()
206
+ result = ner("نص عربي مالي")
207
+ ```
208
 
209
  ---
210
 
211
+ ### Option 2 — Use directly from Hugging Face (no installation)
212
 
213
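+ ```python
+ # Requires huggingface_hub (and spaCy) to be installed in the environment.
+ import sys
+
+ from huggingface_hub import snapshot_download
+
+ # Download the repository snapshot and make its `amwal` package importable.
+ repo_path = snapshot_download("Muhsabrys/AMWAL-ner-arabic")
+ sys.path.append(repo_path)
+
+ from amwal import load_ner
+
+ # Point load_ner at the downloaded snapshot.
+ ner = load_ner(local_path=repo_path)
+ result = ner("نص عربي مالي")
+ ```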
225
 
226
+ ---
227
 
228
+ ## Output Format
 
 
 
 
229
 
230
+ ```json
+ {
+   "raw_text": "...",
+   "normalized_text": "...",
+   "entities": [
+     {
+       "text": "قطر",
+       "label": "COUNTRY",
+       "start": 11,
+       "end": 14
+     }
+   ]
+ }
+ ```
244
 
245
  ---
246
 
247
  ## Limitations
248
 
249
+ * Domain-specific to financial text
250
+ * Not suitable for general-purpose Arabic NER
251
+ * Does not model relations between entities
252
+ * Not compatible with Hugging Face Transformers APIs
253
 
254
  ---
255
 
256
  ## Future Work
257
 
258
+ Planned future directions include:
259
 
260
+ * Expanding the annotated corpus
261
+ * Introducing hierarchical entity structures
262
+ * Modeling relations between financial entities
263
+ * Constructing an Arabic financial knowledge graph
264
 
265
  ---
266
 
 
277
  year={2025}
278
  }
279
  ```
280
+
281
+ ---