Muhsabrys/AMWAL_ArFinNER

@@ -1,180 +1,266 @@
 # AMWAL: Arabic Financial Named Entity Recognition (NER)
-## Overview
-**AMWAL** is a **spaCy-based Named Entity Recognition (NER) system** designed specifically for **Arabic financial news and reports**.
-It targets the extraction of structured financial entities from unstructured Arabic text, addressing the lack of high-quality Arabic financial NLP resources.
-This repository provides:
-* A trained **spaCy NER pipeline**
-* An integrated **Arabic normalization layer**
-* A simple Python API for inference
-> ⚠️ This is **not a Transformers / BERT model**.
-> Usage is via **spaCy**, not `AutoModelForTokenClassification`.
 ---
-## Key Features
-* Domain-specific Arabic **financial entity recognition**
-* Robust handling of **Arabic orthographic variation**
-* Fine-grained financial entity schema (21 types)
-* Ready-to-use inference via Hugging Face
-* Suitable for research and downstream financial NLP tasks
 ---
-## Installation
-```bash
-pip install spacy huggingface_hub
-```
 ---
-## Usage (Recommended)
-### Load from Hugging Face
-```python
-from amwal import load_ner
-ner = load_ner()  # downloads model from Hugging Face
-text = "أعلن صندوق قطر السيادي عن استثمار بقيمة 500 مليون دولار أمريكي في سندات حكومية يابانية مقومة بالين في طوكيو."
-output = ner(text)
-print(output)
-```
-### Output Format
-```json
-{
-  "raw_text": "...",
-  "normalized_text": "...",
-  "entities": [
-    {
-      "text": "قطر",
-      "label": "COUNTRY",
-      "start": 11,
-      "end": 14
-    },
-    {
-      "text": "دولار",
-      "label": "CURRENCY",
-      "start": 50,
-      "end": 55
-    }
-  ]
-}
-```
 ---
 ## Arabic Normalization
-Inference applies **the same normalization used during training**, including:
-* Removal of diacritics
-* Orthographic normalization:
-  * `إ، أ، آ → ا`
-  * `ؤ، ئ → ء`
-  * `ة → ه`
-  * `ى → ي`
-Normalization is applied **internally only**.
-The original input text is always preserved in `raw_text`.
 ---
 ## Entity Types
-The model recognizes **21 financial entity categories**, including:
 * `COUNTRY`
 * `CITY`
 * `CURRENCY`
 * `FINANCIAL_INSTRUMENT`
-* `ORGANIZATION`
 * `BANK`
 * `NATIONALITY`
 * `EVENT`
 * `TIME`
 * `QUANTITY_OR_UNIT`
-* *(and others)*
 ---
-## Data Collection and Annotation
-We constructed a **specialized Arabic financial corpus** sourced from **three major Arabic financial newspapers** covering the period **2000–2023**.
-Entity annotation followed a **semi-automatic workflow**:
-1. Automatic candidate extraction
-2. Manual annotation
-3. Expert review and correction
-The final dataset contains:
-* **17.1K annotated entity tokens**
-* **21 entity categories**
-* High inter-annotator consistency
 ---
-## Entity Standardization
-Entity categories were aligned with concepts from the **Financial Industry Business Ontology (FIBO, 2020)** to ensure conceptual consistency and compatibility with financial knowledge systems.
 ---
-## Model Development
-* Framework: **spaCy (custom NER pipeline)**
-* Architecture: **spaCy NER with contextual embeddings**
-* Training focused on **domain-specific financial language**
-* Integrated normalization to reduce Arabic sparsity effects
-> Note: While AraBERT resources informed preprocessing decisions, this release is a **spaCy pipeline**, not a Transformers model.
----
-## Evaluation
-The model was evaluated on a held-out test set with the following results:
-| Metric    | Score      |
-| --------- | ---------- |
-| Precision | **96.08%** |
-| Recall    | **95.87%** |
-| F1-score  | **95.97%** |
-These results are competitive with, and in some cases exceed, reported financial NER systems in other languages.
 ---
 ## Limitations
-* The model is **domain-specific** (financial news and reports)
-* It is **not suitable for general-purpose Arabic NER**
-* Not compatible with `transformers.AutoModel*` APIs
 ---
 ## Future Work
-Planned extensions include:
-* Expanding the corpus size and temporal coverage
-* Introducing **hierarchical entity structures**
-* Modeling **relations between financial entities**
-* Developing an **Arabic financial knowledge graph**
 ---
@@ -191,4 +277,5 @@ If you use AMWAL in your research, please cite:
   year={2025}
 }
 ```

+---
+language:
+  - ar
+license: apache-2.0   # change if needed
+pipeline_tag: token-classification
+library_name: spacy
+tags:
+  - arabic
+  - named-entity-recognition
+  - ner
+  - finance
+  - financial-ner
+  - spacy
+  - information-extraction
+  - ontology-aligned
+datasets:
+  - custom
+---
 # AMWAL: Arabic Financial Named Entity Recognition (NER)
+## Quick Start
+### Install (recommended)
+```bash
+pip install git+https://huggingface.co/Muhsabrys/AMWAL-ner-arabic
+```
+```python
+from amwal import load_ner
+ner = load_ner()
+text = "أعلن صندوق قطر السيادي عن استثمار بقيمة 500 مليون دولار أمريكي في سندات حكومية يابانية مقومة بالين في طوكيو."
+result = ner(text)
+print(result["entities"])
+```
 ---
+## Model Summary
+**AMWAL** is a **spaCy-based Named Entity Recognition (NER) system** designed for extracting **financial entities from Arabic text**, with a primary focus on **Arabic financial news and reports**.
+The model addresses challenges specific to Arabic financial NLP, including orthographic variation, domain-specific terminology, and the scarcity of annotated financial resources for Arabic.
 ---
+## Intended Use
+AMWAL is intended for:
+* Arabic financial news analysis
+* Information extraction from financial reports
+* Financial text preprocessing
+* Academic research in Arabic NLP and finance
+* Data enrichment for financial knowledge graphs
+It is **not intended** for:
+* General-purpose Arabic NER
+* Non-financial domains
+* Direct use with Hugging Face Transformers APIs
 ---
+## Data Collection and Annotation
+A specialized Arabic financial corpus was constructed from **three major Arabic financial newspapers**, covering the period **2000–2023**.
+The annotation process followed a **semi-automatic workflow**:
+1. Automatic candidate entity extraction
+2. Manual annotation
+3. Expert review and correction
+The final dataset contains:
+* **17.1K annotated entity tokens**
+* **21 financial entity categories**
+* Consistent domain coverage across multiple time periods
+---
+## Entity Schema and Standardization
+Entity categories were standardized using concepts from the **Financial Industry Business Ontology (FIBO, 2020)** to ensure conceptual consistency and compatibility with structured financial representations.
+---
+## Model Architecture and Training
+* **Framework:** spaCy
+* **Pipeline:** Custom Named Entity Recognition (NER)
+* **Domain:** Arabic financial text
+The model was trained on the annotated corpus using spaCy’s NER pipeline.
+To mitigate sparsity caused by Arabic orthographic variation, normalization was applied consistently during training and inference.
 ---
 ## Arabic Normalization
+The following normalization steps are applied **internally during inference**, matching the training setup:
+* Removal of all diacritics
+* Character normalization:
+  * `إ`, `أ`, `آ` → `ا`
+  * `ؤ`, `ئ` → `ء`
+  * `ة` → `ه`
+  * `ى` → `ي`
+The original input text is always preserved and returned as `raw_text`.
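The normalization steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the code shipped in the `amwal` package: the function name `normalize_arabic` is hypothetical, and the diacritic range used here (U+064B to U+0652 plus the superscript alef U+0670) is an assumption, since the card does not say exactly which characters are stripped.

```python
import re

# Assumed diacritic set: tanween, short vowels, shadda, sukun (U+064B-U+0652)
# plus superscript alef (U+0670). The card does not specify the exact set.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

# Character mappings taken from the list above.
_CHAR_MAP = str.maketrans({
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0624": "\u0621",  # waw with hamza        -> hamza
    "\u0626": "\u0621",  # yeh with hamza        -> hamza
    "\u0629": "\u0647",  # teh marbuta           -> heh
    "\u0649": "\u064A",  # alef maksura          -> yeh
})


def normalize_arabic(text: str) -> str:
    """Strip diacritics, then apply the one-to-one character mappings."""
    return _DIACRITICS.sub("", text).translate(_CHAR_MAP)
```

Because removing diacritics shortens the string, character offsets computed on normalized text will not in general line up with the raw input.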
 ---
 ## Entity Types
+The model recognizes **21 financial entity types**, including (but not limited to):
 * `COUNTRY`
 * `CITY`
 * `CURRENCY`
 * `FINANCIAL_INSTRUMENT`
 * `BANK`
+* `ORGANIZATION`
 * `NATIONALITY`
 * `EVENT`
 * `TIME`
 * `QUANTITY_OR_UNIT`
 ---
+## Evaluation Results
+The model was evaluated on a held-out test set using standard NER metrics:
+| Metric    | Score      |
+| --------- | ---------- |
+| Precision | **96.08%** |
+| Recall    | **95.87%** |
+| F1-score  | **95.97%** |
+These results are competitive with reported financial NER systems in other languages, despite the additional challenges posed by Arabic morphology and orthography.
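As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall. The helper below is a generic sketch (the name `f1_score` is illustrative and not part of the AMWAL API); it only shows that the reported F1 follows from the reported precision and recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values from the evaluation table, in percent.
f1 = f1_score(96.08, 95.87)
print(f"{f1:.2f}")  # prints 95.97
```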
+---
+## Usage
+AMWAL can be used in **two officially supported ways**.
 ---
+### Option 1 — Install via `pip` (recommended)
+```bash
+pip install git+https://huggingface.co/Muhsabrys/AMWAL-ner-arabic
+```
+```python
+from amwal import load_ner
+ner = load_ner()
+result = ner("نص عربي مالي")
+```
 ---
+### Option 2 — Use directly from Hugging Face (no installation)
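+```python
+import sys
+
+from huggingface_hub import snapshot_download
+
+# Download the model repository (or reuse the locally cached copy).
+repo_path = snapshot_download("Muhsabrys/AMWAL-ner-arabic")
+
+# Make the `amwal` package inside the snapshot importable.
+sys.path.append(repo_path)
+from amwal import load_ner
+
+# Load the pipeline from the local snapshot.
+ner = load_ner(local_path=repo_path)
+result = ner("نص عربي مالي")
+```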
+---
+## Output Format
+```json
+{
+  "raw_text": "...",
+  "normalized_text": "...",
+  "entities": [
+    {
+      "text": "قطر",
+      "label": "COUNTRY",
+      "start": 11,
+      "end": 14
+    }
+  ]
+}
+```
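A result in this shape is a plain dictionary, so it can be post-processed with ordinary Python. The sketch below uses a hand-written sample result rather than real model output, and the helper names (`group_by_label`, `offsets_match`) are hypothetical. It also assumes that `start` and `end` are end-exclusive character offsets; the card does not state whether they index `raw_text` or `normalized_text`, so the checked key is left as a parameter.

```python
from collections import defaultdict

# Hand-written sample in the documented shape (not real model output).
result = {
    "raw_text": "أعلن صندوق قطر السيادي",
    "normalized_text": "اعلن صندوق قطر السيادي",
    "entities": [
        {"text": "قطر", "label": "COUNTRY", "start": 11, "end": 14},
    ],
}


def group_by_label(result: dict) -> dict:
    """Collect entity surface forms under their labels."""
    grouped = defaultdict(list)
    for ent in result["entities"]:
        grouped[ent["label"]].append(ent["text"])
    return dict(grouped)


def offsets_match(result: dict, key: str = "normalized_text") -> bool:
    """Check that each entity's [start, end) slice of result[key] equals its text."""
    text = result[key]
    return all(text[e["start"]:e["end"]] == e["text"] for e in result["entities"])


print(group_by_label(result))
print(offsets_match(result))
```

Running `offsets_match` against both keys on real output is one way to find out which text the offsets refer to when the input contains diacritics.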
 ---
 ## Limitations
+* Domain-specific to financial text
+* Not suitable for general-purpose Arabic NER
+* Does not model relations between entities
+* Not compatible with Hugging Face Transformers APIs
 ---
 ## Future Work
+Planned future directions include:
+* Expanding the annotated corpus
+* Introducing hierarchical entity structures
+* Modeling relations between financial entities
+* Constructing an Arabic financial knowledge graph
 ---
   year={2025}
 }
 ```
+---