halimbahae committed 94654ed (verified) · Parent(s): 8d8cfe2

Update README.md
---
language:
- ar
tags:
- word2vec
- darija
- moroccan-arabic
- nlp
- embeddings
license: apache-2.0
datasets:
- imomayiz/darija-english
- atlasia/darija_sentiment
---

# Darija2Vec SOTA (300D)

**Darija2Vec-SOTA-300D** is a high-performance Word2Vec embedding model engineered specifically for the Moroccan dialect (**Darija**). Built as a state-of-the-art (SOTA) resource, it addresses the distinctive challenges of Moroccan Arabic NLP, in particular heavy code-switching and the coexistence of two orthographic scripts (Arabic and Latin/Arabizi).

## 🌟 Key Technical Innovations (SOTA)

Unlike standard embeddings that treat different scripts as separate languages, this model implements a **Script Unification Pipeline**:

* **Script Unification**: Systematic mapping of high-frequency Arabizi terms to their Arabic-script equivalents (e.g., `ana` → `أنا`, `ghadi` → `غادي`). Merging both spellings of a word roughly doubles the statistical evidence available for each core semantic concept.
* **English Noise Filtering**: A custom heuristic filter purges the English segments often found in bilingual datasets such as DODa, keeping the semantic space purely Darija-centric.
* **High Dimensionality (300D)**: A 300-dimension Skip-gram architecture captures complex Moroccan morphological and semantic nuances.

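As a rough illustration, the unification step can be sketched as a token-level lookup; the mapping below is a hypothetical subset for demonstration, not the actual table used in training:

```python
# Illustrative Arabizi → Arabic-script mapping (hypothetical subset, not the
# model's actual unification table).
ARABIZI_TO_ARABIC = {
    "ana": "أنا",      # "I"
    "ghadi": "غادي",   # future-tense marker
    "mezyan": "مزيان",  # "good"
}

def unify_script(tokens):
    """Replace known Arabizi tokens with their Arabic-script equivalents."""
    return [ARABIZI_TO_ARABIC.get(tok.lower(), tok) for tok in tokens]

print(unify_script(["ana", "ghadi", "nmchi"]))  # → ['أنا', 'غادي', 'nmchi']
```

Unmapped tokens (like `nmchi` above) pass through unchanged, so the step is safe to run over the whole corpus.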
---

## 📊 Model Specifications

| Parameter | Configuration |
| :--- | :--- |
| **Model Type** | Word2Vec Skip-gram (`sg=1`) |
| **Vector Dimensions** | 300 |
| **Window Size** | 7 (optimized for Darija syntax) |
| **Corpus Size** | ~317,141 unique sentences |
| **Min Word Count** | 5 |
| **Training Epochs** | 15 |

---

## 📥 Dataset Sources

The model was trained on a consolidated corpus combining the best available public resources:

1. **Darija Open Dataset (DODa)**: recursive scan of translated sentences.
2. **Goud.ma News**: formal and journalistic Darija vocabulary.
3. **Atlasia/Bounhar Sentiment**: authentic social media and conversational data.

---

## 💻 Usage (Gensim)

```python
from gensim.models import KeyedVectors
from huggingface_hub import hf_hub_download

# Download the SOTA vectors
repo_id = "halimbahae/Darija2Vec-SOTA-300D"
vector_file = hf_hub_download(repo_id=repo_id, filename="darija2vec_sota_vectors.txt")

# Load into Gensim
wv = KeyedVectors.load_word2vec_format(vector_file, binary=False)

# Explore similarities: mezyan ("good") and tomobil ("car")
print(wv.most_similar("مزيان", topn=5))
print(wv.most_similar("طوموبيل", topn=5))
```