---
license: cc-by-4.0
---

# Turkmen Word2Vec Model 🇹🇲

Welcome to the Turkmen Word2Vec Model project! This open-source initiative aims to provide a robust word-embedding solution for the Turkmen language.

## Introduction

The Turkmen Word2Vec Model creates high-quality word embeddings for the Turkmen language. By combining Word2Vec with Turkmen-specific preprocessing, this project offers a valuable resource for a wide range of natural language processing tasks in Turkmen.

## Requirements

To use this project, you'll need:

- Python 3.6+
- NLTK
- Gensim
- tqdm

## Metadata

```
Model: turkmen_word2vec
Vocabulary size: 153695
Vector size: 300
Window size: 5
Min count: 15
Training epochs: 10
Final training loss: 80079792.0
```

## Turkmen-Specific Character Replacement

A key feature of this project is its handling of Turkmen-specific characters. The Turkmen alphabet includes several letters that are not present in the basic Latin alphabet. To ensure compatibility and improve processing, the preprocessing step applies a custom character-replacement scheme.

### Replacement Map

Here's the character replacement map used in the preprocessing step:

```python
REPLACEMENTS = {
    'ä': 'a', 'ç': 'ch', 'ö': 'o', 'ü': 'u', 'ň': 'n', 'ý': 'y', 'ğ': 'g', 'ş': 's',
    'Ç': 'Ch', 'Ö': 'O', 'Ü': 'U', 'Ä': 'A', 'Ň': 'N', 'Ş': 'S', 'Ý': 'Y', 'Ğ': 'G'
}
```

This mapping ensures that:

- Special Turkmen characters are converted to their closest Latin-alphabet equivalents.
- The essence of the original text is preserved while making it processable by standard NLP tools.
- Both lowercase and uppercase variants are handled appropriately.

### Implementation

The replacement is implemented in the `preprocess_sentence` function:

```python
from typing import List

def preprocess_sentence(sentence: str) -> List[str]:
    # Map Turkmen-specific characters to their Latin equivalents.
    for original, replacement in REPLACEMENTS.items():
        sentence = sentence.replace(original, replacement)
    # ... (rest of the preprocessing steps)
```

This step is crucial because it:

1. Standardizes the text, making it easier to process and analyze.
2. Preserves the semantic content of words while adapting them to a more universal character set.
3. Improves compatibility with existing NLP tools and libraries that do not natively support Turkmen characters.

This character replacement lets the Word2Vec model learn from and represent Turkmen text effectively, despite the unique characteristics of the Turkmen alphabet.
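The remaining preprocessing steps are elided above. As a minimal, runnable sketch, the following assumes those steps are lowercasing, punctuation stripping, and whitespace tokenization; the project's actual cleanup may differ:

```python
import re
from typing import List

# Character-replacement map for Turkmen-specific letters.
REPLACEMENTS = {
    'ä': 'a', 'ç': 'ch', 'ö': 'o', 'ü': 'u', 'ň': 'n', 'ý': 'y', 'ğ': 'g', 'ş': 's',
    'Ç': 'Ch', 'Ö': 'O', 'Ü': 'U', 'Ä': 'A', 'Ň': 'N', 'Ş': 'S', 'Ý': 'Y', 'Ğ': 'G'
}

def preprocess_sentence(sentence: str) -> List[str]:
    # 1. Map Turkmen-specific characters to Latin equivalents.
    for original, replacement in REPLACEMENTS.items():
        sentence = sentence.replace(original, replacement)
    # 2. Lowercase and drop everything except ASCII letters, digits, and
    #    whitespace (assumed cleanup, not necessarily the project's exact code).
    sentence = re.sub(r"[^a-z0-9\s]", " ", sentence.lower())
    # 3. Tokenize on whitespace.
    return sentence.split()

print(preprocess_sentence("Türkmenistanyň paýtagty Aşgabat."))
# ['turkmenistanyn', 'paytagty', 'asgabat']
```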

## Installation

1. Clone this repository:
```
git clone https://github.com/yourusername/turkmen-word2vec.git
cd turkmen-word2vec
```

2. Create a virtual environment (optional but recommended):
```
python -m venv venv
source venv/bin/activate
```

3. Install the required packages:
```
pip install -r requirements.txt
```

## Usage

1. Prepare your Turkmen text data in a file (one sentence per line).

2. Update the `CONFIG` dictionary in `train_turkmen_word2vec.py` with your desired parameters and file paths.

3. Run the script:
```
python train_turkmen_word2vec.py
```

4. The script will preprocess the data, train the model, and save it along with its metadata.

5. You can then use the trained model in your projects:
```python
from gensim.models import Word2Vec

model = Word2Vec.load("tkm_w2v/turkmen_word2vec.model")
similar_words = model.wv.most_similar("salam", topn=10)  # 10 nearest neighbours of "salam"
```

## Configuration

You can customize the model's behavior by modifying the `CONFIG` dictionary in `train_turkmen_word2vec.py`. The available options are:

- `input_file`: Path to your input text file
- `output_dir`: Directory to save the model and metadata
- `model_name`: Name of your model
- `vector_size`: Dimensionality of the word vectors
- `window`: Maximum distance between the current and predicted word
- `min_count`: Minimum frequency a word needs in order to be included in the vocabulary
- `sg`: Training algorithm (1 for skip-gram, 0 for CBOW)
- `epochs`: Number of training epochs
- `negative`: Number of negative samples for negative sampling
- `sample`: Threshold for downsampling higher-frequency words
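For reference, a `CONFIG` matching the released model might look like the sketch below. The `vector_size`, `window`, `min_count`, and `epochs` values come from the Metadata section above; the file paths and the `sg`, `negative`, and `sample` values are illustrative placeholders (Gensim's defaults), so check the script for the actual settings:

```python
CONFIG = {
    "input_file": "data/turkmen_sentences.txt",  # placeholder path
    "output_dir": "tkm_w2v",
    "model_name": "turkmen_word2vec",
    "vector_size": 300,  # from the metadata above
    "window": 5,         # from the metadata above
    "min_count": 15,     # from the metadata above
    "sg": 1,             # assumed skip-gram; 0 would select CBOW
    "epochs": 10,        # from the metadata above
    "negative": 5,       # Gensim's default; assumed
    "sample": 1e-3,      # Gensim's default; assumed
}
```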

## Contact

Bahtiyar Mamedov - [@gargamelix](https://t.me/gargamelix) - 31mb41@gmail.com

Project Link: [https://huggingface.co/mamed0v/turkmen-word2vec](https://huggingface.co/mamed0v/turkmen-word2vec)

---

Happy embedding! If you find this project useful, please give it a star and share it with others who might benefit from it.