Odia Word Embeddings
This project implements and compares different word embedding models (Word2Vec and GloVe) for the Odia language. The models are trained on a combination of Odia literature and Wikipedia data.
Project Structure
├── data/
│   ├── raw/          # Raw data files (not in repo)
│   ├── processed/    # Processed data files (not in repo)
│   └── interim/      # Intermediate data files (not in repo)
├── models/           # Trained models (not in repo)
├── notebooks/        # Jupyter notebooks for analysis
├── scripts/          # Utility scripts
└── src/              # Source code
    ├── data/         # Data processing modules
    └── models/       # Model training modules
Data and Models
Due to size limitations, the data and model files are hosted on Hugging Face:
Data
All data is available in the odia-word-embeddings-data dataset:
Raw Data
- odia_wiki_scraped.txt - Scraped Odia Wikipedia articles
- odia_literature.txt - Odia literature corpus
Processed Data
- odia_wiki_scraped.csv - Processed Wikipedia articles
- odia_literature.csv - Processed literature corpus
Additional Data Used
- Monolingual Odia corpus from OdiEnCorp 1.0 (used for training)
Models
All models are available in the odia-word-embeddings repository:
- Word2Vec Model
- GloVe Model
Setup
- Clone the repository:
git clone https://github.com/VanshajR/Odia-Word-Embeddings.git
cd Odia-Word-Embeddings
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Download data and models:
# Install huggingface-hub
pip install huggingface-hub
# Login to Hugging Face
huggingface-cli login
# Download data
huggingface-cli download VanshajR/odia-word-embeddings-data --repo-type dataset --local-dir data/
# Download models
huggingface-cli download VanshajR/odia-word-embeddings --local-dir models/
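The same two downloads can also be scripted from Python. The sketch below assumes the `huggingface_hub` package from the step above is installed, and that the data repository is a dataset repo while the models repository is a model repo, as this README describes them. The `DOWNLOADS` table and the `download_all` helper are illustrative names, not part of this project.

```python
# Sketch: fetch the data and model repositories with huggingface_hub.
# The repo types below are assumptions based on how this README describes them.
DOWNLOADS = [
    # (repo_id, repo_type, local_dir)
    ("VanshajR/odia-word-embeddings-data", "dataset", "data/"),
    ("VanshajR/odia-word-embeddings", "model", "models/"),
]


def download_all(downloads=DOWNLOADS):
    """Download each repository snapshot into its local directory."""
    # Imported lazily so this file can be loaded without the package installed.
    from huggingface_hub import snapshot_download

    paths = []
    for repo_id, repo_type, local_dir in downloads:
        path = snapshot_download(
            repo_id=repo_id, repo_type=repo_type, local_dir=local_dir
        )
        paths.append(path)
    return paths
```

Calling `download_all()` after logging in should leave the files under `data/` and `models/`, matching the layout in Project Structure.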
Usage
Training Models
- Train Word2Vec:
python scripts/train_word2vec.py
- Train GloVe:
python scripts/train_glove.py
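What the Word2Vec training script does internally is not documented here. As a rough illustration only, a skip-gram model could be trained with gensim along the following lines. The use of gensim, the one-sentence-per-line input format, the whitespace tokenization, the hyperparameters, and the output path are all assumptions rather than this project's actual settings.

```python
# Hypothetical sketch; this is not the contents of scripts/train_word2vec.py.
from pathlib import Path


def read_sentences(path):
    """Return one whitespace-tokenized list per non-empty line of a UTF-8 file."""
    sentences = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            tokens = line.split()
            if tokens:
                sentences.append(tokens)
    return sentences


def train_word2vec(sentences, out_path="models/word2vec_sketch.model"):
    """Train a skip-gram Word2Vec model with gensim (an assumed dependency)."""
    # Lazy import: requires `pip install gensim`.
    from gensim.models import Word2Vec

    model = Word2Vec(
        sentences=sentences,
        vector_size=100,  # assumed embedding dimensionality
        window=5,         # assumed context window
        min_count=2,      # drop words seen fewer than 2 times
        sg=1,             # 1 = skip-gram, 0 = CBOW
        workers=4,
    )
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    model.save(out_path)
    return model
```

In this sketch, `read_sentences` would be pointed at one of the raw text files and its result passed to `train_word2vec`.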
Evaluating Models
Run the evaluate_embeddings.ipynb notebook
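Embedding evaluations typically rest on cosine similarity between word vectors. The notebook's exact procedure is not described here, so the following is only a minimal sketch of that idea, using made-up three-dimensional vectors and placeholder words. `cosine` and `nearest` are illustrative helpers, not functions from this project.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest(word, vectors, top_n=3):
    """Return the top_n (word, similarity) pairs closest to `word`, best first."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors for illustration only; real vectors come from the trained models.
toy_vectors = {
    "word_a": [1.0, 0.0, 0.0],
    "word_b": [0.9, 0.1, 0.0],
    "word_c": [0.0, 1.0, 0.0],
}

print(nearest("word_a", toy_vectors, top_n=2))
```

With real embeddings, the same ranking applied to both the Word2Vec and GloVe vectors gives a quick qualitative comparison of which neighbors each model considers closest.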
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- The OdiEnCorp team for providing the monolingual Odia corpus
- Wikipedia for their public articles