mamed0v committed on
Commit
ed9dfba
·
1 Parent(s): 2aa8b90

Loaded W2V model

Files changed (1)
  1. README.md +49 -0
README.md CHANGED
@@ -18,6 +18,55 @@ To use this project, you'll need:
  - Gensim
  - tqdm
 
+ ## Metadata
+ ```
+ Model: turkmen_word2vec
+ Vocabulary size: 153695
+ Vector size: 300
+ Window size: 5
+ Min count: 15
+ Training epochs: 10
+ Final training loss: 80079792.0
+ ```
+
+ ## Turkmen-Specific Character Replacement 🔤
+
+ One of the key features of this project is its handling of Turkmen-specific characters. The Turkmen alphabet includes several characters that are not present in the standard Latin alphabet. To ensure compatibility and improve processing, I implement a custom character replacement system.
+
+ ### Replacement Map
+
+ Here's the character replacement map used in the preprocessing step:
+
+ ```python
+ REPLACEMENTS = {
+     'ä': 'a', 'ç': 'ch', 'ö': 'o', 'ü': 'u', 'ň': 'n', 'ý': 'y', 'ğ': 'g', 'ş': 's',
+     'Ç': 'Ch', 'Ö': 'O', 'Ü': 'U', 'Ä': 'A', 'Ň': 'N', 'Ş': 'S', 'Ý': 'Y', 'Ğ': 'G'
+ }
+ ```
+
+ This mapping ensures that:
+ - Special Turkmen characters are converted to their closest Latin alphabet equivalents.
+ - The essence of the original text is preserved while making it more processable for standard NLP tools.
+ - Both lowercase and uppercase variants are handled appropriately.
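
To make the effect of the map concrete, here is a minimal sketch that applies it to two short strings. The helper name `transliterate` is hypothetical and not part of the commit; only the `REPLACEMENTS` dictionary is taken from the diff. The second call illustrates that the conversion is lossy: two different spellings can collapse to the same ASCII form.

```python
# REPLACEMENTS is copied from the diff; everything else is illustrative.
REPLACEMENTS = {
    'ä': 'a', 'ç': 'ch', 'ö': 'o', 'ü': 'u', 'ň': 'n', 'ý': 'y', 'ğ': 'g', 'ş': 's',
    'Ç': 'Ch', 'Ö': 'O', 'Ü': 'U', 'Ä': 'A', 'Ň': 'N', 'Ş': 'S', 'Ý': 'Y', 'Ğ': 'G',
}


def transliterate(text: str) -> str:
    """Apply every entry of REPLACEMENTS to text (hypothetical helper)."""
    for original, replacement in REPLACEMENTS.items():
        text = text.replace(original, replacement)
    return text


print(transliterate("Türkmençe"))                    # Turkmenche
print(transliterate("ýaş") == transliterate("ýas"))  # True: both become "yas"
```

Whether such collisions matter for the trained vectors depends on the corpus; the sketch only shows that they can occur.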
+
+ ### Implementation
+
+ The replacement is implemented in the `preprocess_sentence` function:
+
+ ```python
+ def preprocess_sentence(sentence: str) -> List[str]:
+     for original, replacement in REPLACEMENTS.items():
+         sentence = sentence.replace(original, replacement)
+     # ... (rest of the preprocessing steps)
+ ```
+
+ This step is crucial as it:
+ 1. Standardizes the text, making it easier to process and analyze.
+ 2. Maintains the semantic meaning of words while adapting them to a more universal character set.
+ 3. Improves compatibility with existing NLP tools and libraries that might not natively support Turkmen characters.
+
+ By implementing this character replacement, we ensure that our Word2Vec model can effectively learn from and represent Turkmen text, despite the unique characteristics of the Turkmen alphabet.
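
The function in the diff elides everything after the replacement loop, so the following is only a self-contained sketch of what an end-to-end version might look like, not the author's implementation. The name `preprocess_sentence_sketch`, the lowercasing, and the `[a-z]+` tokenization are all assumptions made for illustration. It also swaps the loop of `str.replace` calls for a single `str.translate` pass, which gives the same result for this map because every key is a single character and no replacement string contains a key.

```python
import re
from typing import List

# REPLACEMENTS is copied from the diff.
REPLACEMENTS = {
    'ä': 'a', 'ç': 'ch', 'ö': 'o', 'ü': 'u', 'ň': 'n', 'ý': 'y', 'ğ': 'g', 'ş': 's',
    'Ç': 'Ch', 'Ö': 'O', 'Ü': 'U', 'Ä': 'A', 'Ň': 'N', 'Ş': 'S', 'Ý': 'Y', 'Ğ': 'G',
}

# Build the translation table once instead of looping per sentence.
_TABLE = str.maketrans(REPLACEMENTS)

# Assumption: a token is a run of ASCII letters; digits and punctuation are dropped.
_TOKEN_RE = re.compile(r"[a-z]+")


def preprocess_sentence_sketch(sentence: str) -> List[str]:
    """Replace Turkmen-specific characters, lowercase, and split into tokens."""
    ascii_text = sentence.translate(_TABLE)
    return _TOKEN_RE.findall(ascii_text.lower())


print(preprocess_sentence_sketch("Türkmenistanyň paýtagty Aşgabat."))
# ['turkmenistanyn', 'paytagty', 'asgabat']
```

A list of token lists produced this way has the shape Gensim's `Word2Vec` expects for its `sentences` argument.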
+
  ## Installation 🔧
 
  1. Clone this repository: