Commit ed9dfba · mamed0v
1 Parent(s): 2aa8b90
Loaded W2V model
README.md CHANGED
@@ -18,6 +18,55 @@ To use this project, you'll need:
|
|
| 18 |
- Gensim
|
| 19 |
- tqdm
|
| 20 |
|
| 21 |
## Installation 🔧
|
| 22 |
|
| 23 |
1. Clone this repository:
|
|
|
|
| 18 |
- Gensim
|
| 19 |
- tqdm
|
| 20 |
|
| 21 |
+
## Metadata
|
| 22 |
+
```
|
| 23 |
+
Model: turkmen_word2vec
|
| 24 |
+
Vocabulary size: 153695
|
| 25 |
+
Vector size: 300
|
| 26 |
+
Window size: 5
|
| 27 |
+
Min count: 15
|
| 28 |
+
Training epochs: 10
|
| 29 |
+
Final training loss: 80079792.0
|
| 30 |
+
```
|
| 31 |
+
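As a rough illustration of how `Min count` relates to `Vocabulary size`, the sketch below keeps only tokens that occur at least `min_count` times, which is how Gensim documents its `min_count` parameter. The function name `build_vocab` and the toy corpus are assumptions made for this example; they are not part of this project.

```python
from collections import Counter
from typing import Iterable, List, Set


def build_vocab(sentences: Iterable[List[str]], min_count: int) -> Set[str]:
    """Return the tokens whose total frequency is at least min_count."""
    # Count every token across all tokenized sentences.
    counts = Counter(token for sentence in sentences for token in sentence)
    # Drop tokens that are rarer than the threshold.
    return {word for word, n in counts.items() if n >= min_count}


# Toy corpus (hypothetical): "salam" occurs twice, "dunya" once.
corpus = [["salam", "dunya"], ["salam"]]
vocab = build_vocab(corpus, min_count=2)
```

With `min_count=15`, as in the metadata above, only words seen at least 15 times in the training corpus would count toward the reported vocabulary size.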
|
| 32 |
+
## Turkmen-Specific Character Replacement 🔤
|
| 33 |
+
|
| 34 |
+
One of the key features of this project is its handling of Turkmen-specific characters. The Turkmen alphabet includes several letters that fall outside the basic (ASCII) Latin alphabet. To improve compatibility with standard tooling, the preprocessing step applies a custom character replacement map.
|
| 35 |
+
|
| 36 |
+
### Replacement Map
|
| 37 |
+
|
| 38 |
+
Here's the character replacement map used in the preprocessing step:
|
| 39 |
+
|
| 40 |
+
```python
|
| 41 |
+
REPLACEMENTS = {
|
| 42 |
+
'ä': 'a', 'ç': 'ch', 'ö': 'o', 'ü': 'u', 'ň': 'n', 'ý': 'y', 'ğ': 'g', 'ş': 's',
|
| 43 |
+
'Ç': 'Ch', 'Ö': 'O', 'Ü': 'U', 'Ä': 'A', 'Ň': 'N', 'Ş': 'S', 'Ý': 'Y', 'Ğ': 'G'
|
| 44 |
+
}
|
| 45 |
+
```
|
| 46 |
+
|
| 47 |
+
This mapping ensures that:
|
| 48 |
+
- Special Turkmen characters are converted to their closest ASCII equivalents.
|
| 49 |
+
- Words remain recognizable while becoming easier to process with standard NLP tools.
|
| 50 |
+
- Both lowercase and uppercase variants are handled appropriately.
|
| 51 |
+
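A minimal sketch of applying the map in a single pass, assuming the `REPLACEMENTS` dictionary shown above. The helper name `replace_turkmen_chars` and the NFC normalization step are assumptions added for illustration; the project itself uses `str.replace` in a loop.

```python
import unicodedata

REPLACEMENTS = {
    'ä': 'a', 'ç': 'ch', 'ö': 'o', 'ü': 'u', 'ň': 'n', 'ý': 'y', 'ğ': 'g', 'ş': 's',
    'Ç': 'Ch', 'Ö': 'O', 'Ü': 'U', 'Ä': 'A', 'Ň': 'N', 'Ş': 'S', 'Ý': 'Y', 'Ğ': 'G'
}

# str.maketrans accepts a dict whose keys are single characters and whose
# values may be strings of any length (for example 'ç' -> 'ch').
_TABLE = str.maketrans(REPLACEMENTS)


def replace_turkmen_chars(text: str) -> str:
    """Replace Turkmen-specific letters using the REPLACEMENTS map."""
    # Compose decomposed sequences (such as 'u' + combining diaeresis) first,
    # so they match the precomposed keys in the map.
    text = unicodedata.normalize("NFC", text)
    return text.translate(_TABLE)


print(replace_turkmen_chars("Türkmençe"))  # Turkmenche
```

Characters without an entry in the map pass through unchanged. The map as written has no entry for `ž`/`Ž`, so those letters would be left as they are.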
|
| 52 |
+
### Implementation
|
| 53 |
+
|
| 54 |
+
The replacement is implemented in the `preprocess_sentence` function:
|
| 55 |
+
|
| 56 |
+
```python
|
| 57 |
+
def preprocess_sentence(sentence: str) -> List[str]:
|
| 58 |
+
    for original, replacement in REPLACEMENTS.items():
|
| 59 |
+
        sentence = sentence.replace(original, replacement)
|
| 60 |
+
    # ... (rest of the preprocessing steps)
|
| 61 |
+
```
|
| 62 |
+
|
| 63 |
+
This step is crucial as it:
|
| 64 |
+
1. Standardizes the text, making it easier to process and analyze.
|
| 65 |
+
2. Keeps words recognizable while mapping them to an ASCII-only character set, at the cost of merging some distinct letters (for example, `ý` and `y`).
|
| 66 |
+
3. Improves compatibility with existing NLP tools and libraries that might not natively support Turkmen characters.
|
| 67 |
+
|
| 68 |
+
With this character replacement in place, the Word2Vec model is trained on a consistent ASCII representation of Turkmen text.
|
| 69 |
+
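The snippet above elides the remaining preprocessing steps. The following is one possible end-to-end sketch, not the project's actual implementation: the lowercasing step, the `[a-z]+` tokenization rule, and the name `preprocess_sentence_sketch` are all assumptions chosen for illustration.

```python
import re
from typing import List

REPLACEMENTS = {
    'ä': 'a', 'ç': 'ch', 'ö': 'o', 'ü': 'u', 'ň': 'n', 'ý': 'y', 'ğ': 'g', 'ş': 's',
    'Ç': 'Ch', 'Ö': 'O', 'Ü': 'U', 'Ä': 'A', 'Ň': 'N', 'Ş': 'S', 'Ý': 'Y', 'Ğ': 'G'
}

# Assumed tokenization rule: runs of ASCII letters; everything else is a separator.
TOKEN_RE = re.compile(r"[a-z]+")


def preprocess_sentence_sketch(sentence: str) -> List[str]:
    """Replace Turkmen-specific letters, lowercase, and split into tokens."""
    # Step taken from the README: apply the replacement map.
    for original, replacement in REPLACEMENTS.items():
        sentence = sentence.replace(original, replacement)
    # Assumed steps: lowercase, then keep only alphabetic tokens.
    sentence = sentence.lower()
    return TOKEN_RE.findall(sentence)


tokens = preprocess_sentence_sketch("Men Aşgabatda ýaşaýaryn.")
# tokens == ['men', 'asgabatda', 'yasayaryn']
```

A list of such token lists is the input shape that Gensim's `Word2Vec` expects for its `sentences` argument.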
|
| 70 |
## Installation 🔧
|
| 71 |
|
| 72 |
1. Clone this repository:
|