---

license: cc-by-4.0
---

# Turkmen Word2Vec Model 🇹🇲💬

Welcome to the Turkmen Word2Vec Model project! This open-source initiative aims to provide a robust word embedding solution for the Turkmen language. 🚀

## Introduction 🌟

The Turkmen Word2Vec Model is designed to create high-quality word embeddings for the Turkmen language. By leveraging the power of Word2Vec and incorporating Turkmen-specific preprocessing, this project offers a valuable resource for various natural language processing tasks in Turkmen.

## Requirements 📋

To use this project, you'll need:

- Python 3.6+
- NLTK
- Gensim
- tqdm

## Metadata
```
Model: turkmen_word2vec
Vocabulary size: 153695
Vector size: 300
Window size: 5
Min count: 15
Training epochs: 10
Final training loss: 80079792.0
```

## Turkmen-Specific Character Replacement 🔤

One of the key features of this project is its handling of Turkmen-specific characters. The Turkmen alphabet includes several characters that are not present in the standard Latin alphabet. To ensure compatibility and improve processing, the project implements a custom character replacement system.

### Replacement Map

Here's the character replacement map used in the preprocessing step:

```python
REPLACEMENTS = {
    'ä': 'a', 'ç': 'ch', 'ö': 'o', 'ü': 'u', 'ň': 'n', 'ý': 'y', 'ğ': 'g', 'ş': 's',
    'Ç': 'Ch', 'Ö': 'O', 'Ü': 'U', 'Ä': 'A', 'Ň': 'N', 'Ş': 'S', 'Ý': 'Y', 'Ğ': 'G'
}
```

This mapping ensures that:
- Special Turkmen characters are converted to their closest Latin alphabet equivalents.
- The essence of the original text is preserved while making it more processable for standard NLP tools.
- Both lowercase and uppercase variants are handled appropriately.
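As a side note on implementation choices, the same mapping can also be applied in a single pass with `str.translate`, since `str.maketrans` accepts multi-character strings (like `'ch'`) as replacement values. This is a sketch of a design alternative to looping over `str.replace`, not the project's actual code:

```python
# Character replacement map from the project (see above).
REPLACEMENTS = {
    'ä': 'a', 'ç': 'ch', 'ö': 'o', 'ü': 'u', 'ň': 'n', 'ý': 'y', 'ğ': 'g', 'ş': 's',
    'Ç': 'Ch', 'Ö': 'O', 'Ü': 'U', 'Ä': 'A', 'Ň': 'N', 'Ş': 'S', 'Ý': 'Y', 'Ğ': 'G'
}

# str.maketrans builds a single lookup table, so each input string is
# scanned once instead of once per mapping entry.
TABLE = str.maketrans(REPLACEMENTS)

def transliterate(text: str) -> str:
    """Replace Turkmen-specific characters with Latin equivalents."""
    return text.translate(TABLE)
```

On large corpora this tends to be faster than chained `replace` calls, and it guarantees each character is examined exactly once.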

### Implementation

The replacement is implemented in the `preprocess_sentence` function:

```python
from typing import List

def preprocess_sentence(sentence: str) -> List[str]:
    # Transliterate Turkmen-specific characters to Latin equivalents.
    for original, replacement in REPLACEMENTS.items():
        sentence = sentence.replace(original, replacement)
    # ... (rest of the preprocessing steps)
```

This step is crucial as it:
1. Standardizes the text, making it easier to process and analyze.
2. Maintains the semantic meaning of words while adapting them to a more universal character set.
3. Improves compatibility with existing NLP tools and libraries that might not natively support Turkmen characters.

By implementing this character replacement, we ensure that our Word2Vec model can effectively learn from and represent Turkmen text, despite the unique characteristics of the Turkmen alphabet.
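The remaining preprocessing steps are elided in the excerpt above. A minimal, self-contained sketch of what such a function could look like follows; the lowercasing, punctuation stripping, and whitespace tokenization here are illustrative assumptions, not the project's exact pipeline:

```python
import re
from typing import List

# Character replacement map from the project (see above).
REPLACEMENTS = {
    'ä': 'a', 'ç': 'ch', 'ö': 'o', 'ü': 'u', 'ň': 'n', 'ý': 'y', 'ğ': 'g', 'ş': 's',
    'Ç': 'Ch', 'Ö': 'O', 'Ü': 'U', 'Ä': 'A', 'Ň': 'N', 'Ş': 'S', 'Ý': 'Y', 'Ğ': 'G'
}

def preprocess_sentence(sentence: str) -> List[str]:
    # Step 1 (from the project): transliterate Turkmen-specific characters.
    for original, replacement in REPLACEMENTS.items():
        sentence = sentence.replace(original, replacement)
    # Steps 2-4 (assumed): lowercase, drop non-letter characters, tokenize.
    sentence = sentence.lower()
    sentence = re.sub(r"[^a-z\s]", " ", sentence)
    return sentence.split()
```

Each sentence of the corpus would be passed through a function like this before being fed to gensim's `Word2Vec` as a list of tokens.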

## Installation 🔧

1. Clone this repository:
   ```
   git clone https://github.com/yourusername/turkmen-word2vec.git
   cd turkmen-word2vec
   ```

2. Create a virtual environment (optional but recommended):
   ```
   python -m venv venv
   source venv/bin/activate
   ```

3. Install the required packages:
   ```
   pip install -r requirements.txt
   ```

## Usage 🚀

1. Prepare your Turkmen text data in a file (one sentence per line).

2. Update the `CONFIG` dictionary in `train_turkmen_word2vec.py` with your desired parameters and file paths.

3. Run the script:
   ```
   python train_turkmen_word2vec.py
   ```

4. The script will preprocess the data, train the model, and save it along with its metadata.

5. You can then use the trained model in your projects:
   ```python
   from gensim.models import Word2Vec

   model = Word2Vec.load("tkm_w2v/turkmen_word2vec.model")
   similar_words = model.wv.most_similar("salam", topn=10)  # Example usage
   ```
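Since the vocabulary was built from transliterated text, query words containing Turkmen-specific characters should pass through the same character replacement before lookup. The sketch below uses a hypothetical `normalize_query` helper (the lowercasing step is an assumption about the training pipeline, not confirmed by the source):

```python
# Character replacement map from the project (see above).
REPLACEMENTS = {
    'ä': 'a', 'ç': 'ch', 'ö': 'o', 'ü': 'u', 'ň': 'n', 'ý': 'y', 'ğ': 'g', 'ş': 's',
    'Ç': 'Ch', 'Ö': 'O', 'Ü': 'U', 'Ä': 'A', 'Ň': 'N', 'Ş': 'S', 'Ý': 'Y', 'Ğ': 'G'
}

def normalize_query(word: str) -> str:
    """Apply the training-time character replacement to a query word."""
    for original, replacement in REPLACEMENTS.items():
        word = word.replace(original, replacement)
    return word.lower()  # assumption: the corpus was lowercased in training

# With a loaded model (paths as in the Usage section above):
# model.wv.most_similar(normalize_query("dünýä"), topn=10)
```

Skipping this step would raise a `KeyError` for any word spelled with the original Turkmen characters, since those spellings never enter the vocabulary.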

## Configuration ⚙️

You can customize the model's behavior by modifying the `CONFIG` dictionary in `train_turkmen_word2vec.py`. Here are the available options:

- `input_file`: Path to your input text file
- `output_dir`: Directory to save the model and metadata
- `model_name`: Name of your model
- `vector_size`: Dimensionality of the word vectors
- `window`: Maximum distance between the current and predicted word
- `min_count`: Minimum frequency of words to be included in the model
- `sg`: Training algorithm (1 for skip-gram, 0 for CBOW)
- `epochs`: Number of training epochs
- `negative`: Number of negative samples for negative sampling
- `sample`: Threshold for downsampling higher-frequency words

## Contact 📬

Bahtiyar Mamedov - [@gargamelix](https://t.me/gargamelix) - 31mb41@gmail.com

Project Link: [https://huggingface.co/mamed0v/turkmen-word2vec](https://huggingface.co/mamed0v/turkmen-word2vec)

---

Happy embedding! 🎉 If you find this project useful, please give it a star ⭐️ and share it with others who might benefit from it.