---
language:
- ar
tags:
- word2vec
- darija
- moroccan-arabic
- nlp
- embeddings
license: apache-2.0
datasets:
- imomayiz/darija-english
- atlasia/darija_sentiment
---

# Darija2Vec SOTA (300D)

**Darija2Vec-SOTA-300D** is a high-performance Word2Vec embedding model specifically engineered for the Moroccan dialect (**Darija**). Developed as a State-of-the-Art (SOTA) resource, it addresses the unique challenges of Moroccan Arabic NLP, particularly the heavy use of code-switching and diverse orthographic scripts (Arabic and Latin/Arabizi).

## 🌟 Key Technical Innovations (SOTA)

Unlike standard embeddings that treat different scripts as separate languages, this model implements a **Script Unification Pipeline**:

* **Script Unification**: Systematic mapping of high-frequency Arabizi terms to their Arabic script equivalents (e.g., `ana` → `أنا`, `ghadi` → `غادي`). This doubles the statistical density for core semantic concepts.
* **English Noise Filtering**: A custom heuristic filter purges English segments often found in bilingual datasets like DODa, ensuring the semantic space is purely Darija-centric.
* **High-Dimensionality (300D)**: Trained with a 300-dimension Skip-gram architecture to capture complex Moroccan morphological and semantic nuances.
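As a rough illustration, the script-unification step can be thought of as a token-level lookup. The sketch below is illustrative, not the model's actual pipeline: only `ana` → `أنا` and `ghadi` → `غادي` come from the list above, and the function name is invented here.

```python
# Illustrative sketch of the Arabizi -> Arabic script-unification step.
# Only `ana` and `ghadi` are mappings stated in this card; a real table
# would cover many more high-frequency terms.
ARABIZI_TO_ARABIC = {
    "ana": "أنا",
    "ghadi": "غادي",
}

def unify_script(tokens):
    """Map known Arabizi tokens to Arabic script; leave others untouched."""
    return [ARABIZI_TO_ARABIC.get(token.lower(), token) for token in tokens]

print(unify_script(["ana", "ghadi", "nmchi"]))  # → ['أنا', 'غادي', 'nmchi']
```

Because the lookup is case-insensitive and leaves unknown tokens alone, it can run safely before tokenized text is fed to Word2Vec.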

---

## 📊 Model Specifications

| Parameter | Configuration |
| :--- | :--- |
| **Model Type** | Word2Vec Skip-gram (`sg=1`) |
| **Vector Dimensions** | 300 |
| **Window Size** | 7 (optimized for Darija syntax) |
| **Corpus Size** | ~317,141 unique sentences |
| **Min Word Count** | 5 |
| **Training Epochs** | 15 |

---

## 📥 Dataset Sources

The model was trained on a consolidated corpus combining the best available public resources:

1. **Darija Open Dataset (DODa)**: Recursive scan of translated sentences.
2. **Goud.ma News**: For formal and journalistic Darija vocabulary.
3. **Atlasia/Bounhar Sentiment**: For authentic social media and conversational data.

---

## 💻 Usage (Gensim)

```python
from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

# Download the SOTA vectors
repo_id = "halimbahae/Darija2Vec-SOTA-300D"
vector_file = hf_hub_download(repo_id=repo_id, filename="darija2vec_sota_vectors.txt")

# Load into Gensim (plain-text word2vec format)
wv = KeyedVectors.load_word2vec_format(vector_file, binary=False)

# Explore similarities
print(wv.most_similar("مزيان", topn=5))    # "mzyan" = good
print(wv.most_similar("طوموبيل", topn=5))  # "tomobil" = car
```