mamed0v/turkmen-word2vec

mamed0v committed on Jun 24, 2024

Commit ed9dfba · 1 Parent(s): 2aa8b90

Loaded W2V model

Files changed (1)

README.md +49 -0

@@ -18,6 +18,55 @@ To use this project, you'll need:
 - Gensim
 - tqdm
+## Metadata
+```
+Model: turkmen_word2vec
+Vocabulary size: 153695
+Vector size: 300
+Window size: 5
+Min count: 15
+Training epochs: 10
+Final training loss: 80079792.0
+```
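The metadata above maps naturally onto the keyword arguments of Gensim's `Word2Vec` constructor. The following is a minimal sketch under that assumption, not the training script behind this commit: `W2VConfig`, `to_gensim_kwargs`, and `train` are hypothetical names, and the Gensim 4.x parameter names (`vector_size`, `epochs`) are assumed.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class W2VConfig:
    """Hyperparameters copied from the metadata block (hypothetical container)."""
    vector_size: int = 300
    window: int = 5
    min_count: int = 15
    epochs: int = 10


def to_gensim_kwargs(cfg: W2VConfig) -> Dict[str, Any]:
    """Translate the config into keyword arguments for gensim.models.Word2Vec."""
    return {
        "vector_size": cfg.vector_size,
        "window": cfg.window,
        "min_count": cfg.min_count,
        "epochs": cfg.epochs,
        # Loss tracking must be enabled for get_latest_training_loss().
        "compute_loss": True,
    }


def train(sentences: List[List[str]], cfg: W2VConfig = W2VConfig()):
    """Train on tokenized sentences and return (model, latest training loss).

    Gensim is imported lazily so the config helpers above work without it.
    """
    from gensim.models import Word2Vec

    model = Word2Vec(sentences=sentences, **to_gensim_kwargs(cfg))
    return model, model.get_latest_training_loss()
```

With `min_count=15`, a word must occur at least 15 times to enter the vocabulary, so `train` needs a reasonably large corpus; on a toy corpus Gensim ends up with an empty vocabulary and raises an error.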
+## Turkmen-Specific Character Replacement 🔤
+A key preprocessing step in this project is the handling of Turkmen-specific characters. The Turkmen alphabet includes several letters, such as ä, ç, ň, and ş, that are not part of the basic 26-letter Latin alphabet. To stay compatible with standard tooling, preprocessing replaces each of them with a plain Latin equivalent.
+### Replacement Map
+Here's the character replacement map used in the preprocessing step:
+```python
+REPLACEMENTS = {
+    'ä': 'a', 'ç': 'ch', 'ö': 'o', 'ü': 'u', 'ň': 'n', 'ý': 'y', 'ğ': 'g', 'ş': 's',
+    'Ç': 'Ch', 'Ö': 'O', 'Ü': 'U', 'Ä': 'A', 'Ň': 'N', 'Ş': 'S', 'Ý': 'Y', 'Ğ': 'G'
+}
+```
+This mapping ensures that:
+- Turkmen-specific characters are converted to their closest plain Latin equivalents (for example, `ç` becomes `ch`).
+- Words stay recognizable while becoming easier for standard NLP tools to process.
+- Both lowercase and uppercase variants are handled appropriately.
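To make the effect of the map concrete, here is a small sketch that applies it in a single pass with `str.translate`. This is an alternative to the README's `str.replace` loop, not the author's implementation; `transliterate` and `_TABLE` are hypothetical names. It also shows two properties worth keeping in mind: the map is lossy, and characters outside the map (such as the Turkmen letter `ž`) pass through unchanged.

```python
REPLACEMENTS = {
    'ä': 'a', 'ç': 'ch', 'ö': 'o', 'ü': 'u', 'ň': 'n', 'ý': 'y', 'ğ': 'g', 'ş': 's',
    'Ç': 'Ch', 'Ö': 'O', 'Ü': 'U', 'Ä': 'A', 'Ň': 'N', 'Ş': 'S', 'Ý': 'Y', 'Ğ': 'G'
}

# str.maketrans accepts a dict whose keys are single characters and whose
# values are replacement strings of any length.
_TABLE = str.maketrans(REPLACEMENTS)


def transliterate(text: str) -> str:
    """Apply the replacement map in one pass (hypothetical helper)."""
    return text.translate(_TABLE)


print(transliterate("Türkmençe"))  # Turkmenche

# Lossy: 'ä' and 'a' end up identical after replacement.
print(transliterate("ä") == transliterate("a"))  # True

# Characters that are not keys of the map are left as they are.
print(transliterate("ž"))  # ž
```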
+### Implementation
+The replacement is implemented in the `preprocess_sentence` function:
+```python
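+from typing import List
+
+def preprocess_sentence(sentence: str) -> List[str]:
+    # Replace each Turkmen-specific character with its plain Latin equivalent.
+    for original, replacement in REPLACEMENTS.items():
+        sentence = sentence.replace(original, replacement)
+    # ... (rest of the preprocessing steps)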
+```
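The README elides the rest of `preprocess_sentence`, so the following is only a sketch of how a complete version might look. Everything after the replacement loop (NFC normalization beforehand, lowercasing, and regex tokenization) is an assumption rather than the author's code, and the function is named `preprocess_sentence_sketch` to keep that distinction visible.

```python
import re
import unicodedata
from typing import List

REPLACEMENTS = {
    'ä': 'a', 'ç': 'ch', 'ö': 'o', 'ü': 'u', 'ň': 'n', 'ý': 'y', 'ğ': 'g', 'ş': 's',
    'Ç': 'Ch', 'Ö': 'O', 'Ü': 'U', 'Ä': 'A', 'Ň': 'N', 'Ş': 'S', 'Ý': 'Y', 'Ğ': 'G'
}

# Runs of Unicode letters: word characters minus digits and underscore.
_LETTERS = re.compile(r"[^\W\d_]+")


def preprocess_sentence_sketch(sentence: str) -> List[str]:
    """Replace Turkmen-specific characters, then tokenize (assumed steps)."""
    # Assumption: normalize to NFC so that letters written as a base letter
    # plus a combining mark match the precomposed keys of REPLACEMENTS.
    sentence = unicodedata.normalize("NFC", sentence)

    # Replacement step as shown in the README.
    for original, replacement in REPLACEMENTS.items():
        sentence = sentence.replace(original, replacement)

    # Assumption: lowercase and keep only runs of letters as tokens.
    return _LETTERS.findall(sentence.lower())


print(preprocess_sentence_sketch("Türkmen dili, Aşgabat!"))
# ['turkmen', 'dili', 'asgabat']
```

The returned token lists are in the shape Gensim's `Word2Vec` expects for its `sentences` argument (an iterable of lists of strings).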
+This step matters because it:
+1. Standardizes the text, reducing the set of characters the model and downstream tools have to handle.
+2. Keeps words recognizable while adapting them to a more universal character set. The mapping is lossy, however: `ä` and `a` both become `a`, so words that differ only in such letters are merged into one token.
+3. Improves compatibility with existing NLP tools and libraries that might not natively support Turkmen characters.
+With this character replacement in place, the Word2Vec model can learn from and represent Turkmen text despite the non-ASCII letters of the Turkmen alphabet.
 ## Installation 🔧
 1. Clone this repository: