Update README.md
README.md
CHANGED
|
@@ -1,11 +1,21 @@
|
|
| 1 |
-
|
| 2 |
# Geez Word2Vec Model
|
| 3 |
|
| 4 |
This repository contains a Word2Vec model trained on the TIGQA dataset using a custom tokenizer with SpaCy.
|
| 5 |
|
| 6 |
## Model Description
|
| 7 |
|
| 8 |
-
The Word2Vec model in this repository has been trained to generate word embeddings for Geez script Tigrinya text
|
| 9 |
|
| 10 |
## Usage
|
| 11 |
|
|
@@ -24,18 +34,20 @@ from gensim.models import Word2Vec
|
|
| 24 |
# Load the trained Word2Vec model
|
| 25 |
model = Word2Vec.load("Geez_word2vec_skipgram.model")
|
| 26 |
|
| 27 |
-
# Get vector for a word
|
| 28 |
word_vector = model.wv['ሰብ']
|
| 29 |
print(f"Vector for 'ሰብ': {word_vector}")
|
| 30 |
|
| 31 |
-
# Find most similar words
|
| 32 |
similar_words = model.wv.most_similar('ሰብ')
|
| 33 |
print(f"Words similar to 'ሰብ': {similar_words}")
|
| 34 |
|
| 35 |
Dataset Source
|
| 36 |
-
|
| 37 |
|
| 38 |
-
For more information about the TIGQA dataset, visit this link.
|
| 39 |
|
| 40 |
License
|
| 41 |
-
This Word2Vec model and its associated files are released under the MIT License.
|
|
|
|
| 1 |
+
---
|
| 2 |
+
datasets:
|
| 3 |
+
- Hailay/TigQA
|
| 4 |
+
language:
|
| 5 |
+
- ti
|
| 6 |
+
---
|
| 7 |
+
datasets:
|
| 8 |
+
- Hailay/TigQA
|
| 9 |
+
language:
|
| 10 |
+
- ti
|
| 11 |
+
---
|
| 12 |
# Geez Word2Vec Model
|
| 13 |
|
| 14 |
This repository contains a Word2Vec model trained on the TIGQA dataset using a custom tokenizer with SpaCy.
|
| 15 |
|
| 16 |
## Model Description
|
| 17 |
|
| 18 |
+
The Word2Vec model in this repository has been trained to generate word embeddings for Geez script Tigrinya text. The model captures semantic relationships between Tigrinya words based on their context in the TIGQA dataset.
|
| 19 |
|
| 20 |
## Usage
|
| 21 |
|
|
|
|
| 34 |
# Load the trained Word2Vec model
|
| 35 |
model = Word2Vec.load("Geez_word2vec_skipgram.model")
|
| 36 |
|
| 37 |
+
# Get a vector for a word
|
| 38 |
word_vector = model.wv['ሰብ']
|
| 39 |
print(f"Vector for 'ሰብ': {word_vector}")
|
| 40 |
|
| 41 |
+
# Find the most similar words
|
| 42 |
similar_words = model.wv.most_similar('ሰብ')
|
| 43 |
print(f"Words similar to 'ሰብ': {similar_words}")
|
| 44 |
|
| 45 |
Dataset Source
|
| 46 |
+
|
| 47 |
+
The TIGQA dataset used to train this model contains text data in the Geez script of the Tigrinya language.
|
| 48 |
+
It is a publicly available dataset widely used for research and development of NLP models for the Tigrinya language.
|
| 49 |
|
| 50 |
+
For more information about the TIGQA dataset, visit https://zenodo.org/records/11423987.
|
| 51 |
|
| 52 |
License
|
| 53 |
+
This Word2Vec model and its associated files are released under the MIT License.
|
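The Usage snippet in this README calls `model.wv.most_similar('ሰብ')`, which in gensim ranks vocabulary entries by cosine similarity to the query vector. As a rough illustration of that ranking, here is a minimal pure-Python sketch. The two-dimensional vectors and the names `token_a`, `token_b`, `token_c`, `cosine_similarity`, and `rank_by_similarity` are hypothetical stand-ins invented for this example; they are not taken from the trained model or from gensim.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, vectors, topn=3):
    """Return up to topn (word, score) pairs, most similar to `query` first."""
    query_vector = vectors[query]
    scored = [
        (word, cosine_similarity(query_vector, vector))
        for word, vector in vectors.items()
        if word != query  # the query word itself is excluded from the results
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy 2-dimensional vectors (made up for illustration, not real embeddings).
toy_vectors = {
    "ሰብ": [1.0, 0.0],
    "token_a": [0.9, 0.1],
    "token_b": [0.0, 1.0],
    "token_c": [-1.0, 0.0],
}

for word, score in rank_by_similarity("ሰብ", toy_vectors):
    print(f"{word}: {score:.3f}")
```

With the real model, a lookup such as `model.wv['ሰብ']` raises `KeyError` when the word is not in the vocabulary, so it can help to check `'ሰብ' in model.wv` before querying.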