---
license: apache-2.0
datasets:
- Hailay/TigQA
pipeline_tag: sentence-similarity
---

# Geez Word2Vec Skipgram Model

This repository contains a Word2Vec model trained on the TIGQA dataset with a custom spaCy-based tokenizer. The model can be used as part of a language-adaptive pretraining process and for embedding initialization.
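
Below is a minimal sketch of the embedding-initialization idea, not this repository's own pipeline: it seeds a PyTorch embedding layer with these Word2Vec vectors, falling back to a small random draw for tokens missing from the Word2Vec vocabulary. The `vocab` mapping (token to row index, e.g. from your target tokenizer) is a hypothetical placeholder, and `model` is the loaded Word2Vec model shown under Usage below.

```python
import numpy as np
import torch

def build_embedding(wv, vocab):
    """Seed an embedding matrix with Word2Vec vectors where available.

    `vocab` is a hypothetical {token: row_index} mapping from the target
    model's tokenizer; rows for out-of-vocabulary tokens stay random.
    """
    matrix = np.random.normal(0.0, 0.02, (len(vocab), wv.vector_size)).astype(np.float32)
    for token, idx in vocab.items():
        if token in wv:  # copy the pretrained static vector when available
            matrix[idx] = wv[token]
    return torch.nn.Embedding.from_pretrained(torch.from_numpy(matrix), freeze=False)
```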

## Usage
Word2Vec static embeddings can be used to align the semantic space of a pretrained model's embeddings with the target-language embeddings (specifically, embeddings produced with the Tigrinya tokenizer); a minimal alignment sketch is shown after the loading example below.
You can download and use the model in your Python code as follows:

```python
from gensim.models import Word2Vec
from huggingface_hub import hf_hub_download

# Word2Vec.load expects a local path, so first fetch the model file
# from the Hugging Face Hub (this caches it locally)
model_path = hf_hub_download(
    repo_id="Hailay/Geez_word2vec_skipgram.model",
    filename="Geez_word2vec_skipgram.model",
)

# Load the trained Word2Vec model from the downloaded file
model = Word2Vec.load(model_path)

# Get a vector for a word
word_vector = model.wv['ሰα‰₯']
print(f"Vector for 'ሰα‰₯': {word_vector}")

# Find the most similar words
similar_words = model.wv.most_similar('ሰα‰₯')
print(f"Words similar to 'ሰα‰₯': {similar_words}")
```
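
As noted above, one way to align a pretrained model's embedding space with these Tigrinya vectors is orthogonal Procrustes over a set of shared anchor words. The sketch below is an illustration under assumptions, not this repository's method: both spaces are assumed to have the same dimensionality, and `pretrained_vectors` and the anchor list are hypothetical placeholders.

```python
import numpy as np

def procrustes_align(source, target):
    """Orthogonal matrix W minimizing ||source @ W - target||_F."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

# Hypothetical anchor words assumed to exist in both vocabularies
anchors = ['ሰብ', 'ዓለም', 'ሰላም', 'ሓይሊ']
source = np.stack([model.wv[w] for w in anchors])            # Word2Vec space
target = np.stack([pretrained_vectors[w] for w in anchors])  # pretrained-model space

W = procrustes_align(source, target)
mapped = model.wv['ሰብ'] @ W  # Word2Vec vector mapped into the pretrained space
```
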
## Visualizing Word Vectors

You can visualize the word vectors using t-SNE:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import numpy as np

# Words to visualize; substitute any words from the trained vocabulary
words = ['ሰα‰₯', 'α‹“αˆˆαˆ', 'αˆ°αˆ‹αˆ', 'αˆ“α‹­αˆŠ','αŒŠα‹œ', 'α‰£αˆ…αˆͺ']

# Get the vectors for the words
word_vectors = np.array([model.wv[word] for word in words])

# Reduce dimensionality using t-SNE with a lower perplexity value
perplexity_value = min(5, len(words) - 1)
tsne = TSNE(n_components=2, perplexity=perplexity_value, random_state=0)
word_vectors_2d = tsne.fit_transform(word_vectors)

# Create a scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(word_vectors_2d[:, 0], word_vectors_2d[:, 1], edgecolors='k', c='r')

# Add annotations to the points
for i, word in enumerate(words):
    plt.annotate(word, xy=(word_vectors_2d[i, 0], word_vectors_2d[i, 1]), xytext=(5, 2),
                 textcoords='offset points', ha='right', va='bottom')

plt.title('2D Visualization of Word2Vec Embeddings')
plt.xlabel('TSNE Component 1')
plt.ylabel('TSNE Component 2')
plt.grid(True)
plt.show()
```
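
If you need the raw vectors outside gensim (for example, for an external alignment or evaluation tool), you can export them in the standard word2vec text format; the output filename here is just an example:

```python
# One "word v1 v2 ... vn" line per vocabulary entry
model.wv.save_word2vec_format("geez_word2vec.txt", binary=False)
```
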
## Dataset Source
The dataset used to train this model contains text in the Geez script of the Tigrinya language. It is publicly available as an NLP resource for research and development on low-resource languages.

For more information about the TIGQA dataset, see https://zenodo.org/records/11423987 and the HornMT project.

## License

This Word2Vec model and its associated files are released under the Apache 2.0 License.