---
language:
- ar
tags:
- word2vec
- darija
- moroccan-arabic
- nlp
- embeddings
license: apache-2.0
datasets:
- imomayiz/darija-english
- atlasia/darija_sentiment
---

# Darija2Vec SOTA (300D)

**Darija2Vec-SOTA-300D** is a high-performance Word2Vec embedding model specifically engineered for the Moroccan dialect (**Darija**). Developed as a State-of-the-Art (SOTA) resource, it addresses the unique challenges of Moroccan Arabic NLP, particularly the heavy use of code-switching and diverse orthographic scripts (Arabic and Latin/Arabizi).

## 🌟 Key Technical Innovations (SOTA)

Unlike standard embeddings that treat different scripts as separate languages, this model implements a **Script Unification Pipeline**:

* **Script Unification**: Systematic mapping of high-frequency Arabizi terms to their Arabic script equivalents (e.g., `ana` → `أنا`, `ghadi` → `غادي`). This doubles the statistical density for core semantic concepts.
* **English Noise Filtering**: A custom heuristic filter purges English segments often found in bilingual datasets like DODa, ensuring the semantic space is purely Darija-centric.
* **High-Dimensionality (300D)**: Trained with a 300-dimension Skip-gram architecture to capture complex Moroccan morphological and semantic nuances.
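As a rough illustration, the script-unification step can be thought of as a token-level lookup. The sketch below is illustrative, not the model's actual pipeline: only `ana` → `أنا` and `ghadi` → `غادي` come from the list above, and the function name is invented here.

```python
# Illustrative sketch of the Arabizi -> Arabic script-unification step.
# Only `ana` and `ghadi` are mappings stated in this card; a real table
# would cover many more high-frequency terms.
ARABIZI_TO_ARABIC = {
    "ana": "أنا",
    "ghadi": "غادي",
}

def unify_script(tokens):
    """Map known Arabizi tokens to Arabic script; leave others untouched."""
    return [ARABIZI_TO_ARABIC.get(token.lower(), token) for token in tokens]

print(unify_script(["ana", "ghadi", "nmchi"]))  # → ['أنا', 'غادي', 'nmchi']
```

Because the lookup is case-insensitive and leaves unknown tokens alone, it can run safely before tokenized text is fed to Word2Vec.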

---

## 📊 Model Specifications

| Parameter | Configuration |
| :--- | :--- |
| **Model Type** | Word2Vec Skip-gram (`sg=1`) |
| **Vector Dimensions** | 300 |
| **Window Size** | 7 (optimized for Darija syntax) |
| **Corpus Size** | ~317,141 unique sentences |
| **Min Word Count** | 5 |
| **Training Epochs** | 15 |

---

## 📥 Dataset Sources

The model was trained on a consolidated corpus combining the best available public resources:

1. **Darija Open Dataset (DODa)**: Recursive scan of translated sentences.
2. **Goud.ma News**: For formal and journalistic Darija vocabulary.
3. **Atlasia/Bounhar Sentiment**: For authentic social media and conversational data.

---

## 💻 Usage (Gensim)

```python
from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

# Download the SOTA vectors
repo_id = "halimbahae/Darija2Vec-SOTA-300D"
vector_file = hf_hub_download(repo_id=repo_id, filename="darija2vec_sota_vectors.txt")

# Load into Gensim (plain-text word2vec format)
wv = KeyedVectors.load_word2vec_format(vector_file, binary=False)

# Explore similarities
print(wv.most_similar("مزيان", topn=5))    # "mzyan" = good
print(wv.most_similar("طوموبيل", topn=5))  # "tomobil" = car
```