---
license: apache-2.0
datasets:
- Hailay/TigQA
pipeline_tag: sentence-similarity
---

# Geez Word2Vec Skip-gram Model

This repository contains a Word2Vec model trained on the TIGQA dataset using a custom SpaCy-based tokenizer. The model can be used as part of a language-adaptive pretraining process and for embedding initialization.

## Usage

The static Word2Vec embeddings can be used to align or map the semantic relationships of pretrained model embeddings with the target-language embeddings (specifically, embeddings generated by the Tigrinya tokenizer).

You can download and use the model in your Python code as follows:

```python
from huggingface_hub import hf_hub_download
from gensim.models import Word2Vec

# Download the model file from the Hugging Face Hub.
# (gensim cannot load a model directly from a URL, so fetch it locally first.)
model_path = hf_hub_download(
    repo_id="Hailay/Geez_word2vec_skipgram.model",
    filename="Geez_word2vec_skipgram.model",
)

# Load the trained Word2Vec model
model = Word2Vec.load(model_path)

# Get the vector for a word
word_vector = model.wv['ሰብ']
print(f"Vector for 'ሰብ': {word_vector}")

# Find the most similar words
similar_words = model.wv.most_similar('ሰብ')
print(f"Words similar to 'ሰብ': {similar_words}")
```

## Visualizing Word Vectors

You can visualize the word vectors using t-SNE:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import numpy as np

# Words to visualize; you can replace these with any words from the trained vocabulary
words = ['ሰብ', 'ዓለም', 'ሰላም', 'ሓይሊ', 'ጊዜ', 'ባህሪ']

# Get the vectors for the words
word_vectors = np.array([model.wv[word] for word in words])

# Reduce dimensionality using t-SNE; perplexity must be smaller than the number of samples
perplexity_value = min(5, len(words) - 1)
tsne = TSNE(n_components=2, perplexity=perplexity_value, random_state=0)
word_vectors_2d = tsne.fit_transform(word_vectors)

# Create a scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(word_vectors_2d[:, 0], word_vectors_2d[:, 1], edgecolors='k', c='r')

# Annotate each point with its word
for i, word in enumerate(words):
    plt.annotate(word,
                 xy=(word_vectors_2d[i, 0], word_vectors_2d[i, 1]),
                 xytext=(5, 2),
                 textcoords='offset points',
                 ha='right', va='bottom')

plt.title('2D Visualization of Word2Vec Embeddings')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.grid(True)
plt.show()
```

## Dataset Source

The dataset used to train this model contains text in the Geez script of the Tigrinya language. It is publicly available as an NLP resource for low-resource languages, intended for research and development. For more information about the TIGQA dataset, see https://zenodo.org/records/11423987; additional data comes from HornMT.

## License

This Word2Vec model and its associated files are released under the MIT License.