Odia Word Embeddings
This project implements and compares two word embedding models, Word2Vec and GloVe, for the Odia language. The models are trained on a combination of Odia literature and Wikipedia data.
Project Structure
├── data/
│   ├── raw/          # Raw data files (not in repo)
│   ├── processed/    # Processed data files (not in repo)
│   └── interim/      # Intermediate data files (not in repo)
├── models/           # Trained models (not in repo)
├── notebooks/        # Jupyter notebooks for analysis
├── scripts/          # Utility scripts
└── src/              # Source code
    ├── data/         # Data processing modules
    └── models/       # Model training modules
Data and Models
Due to size limitations, the data and model files are hosted on Hugging Face:
Data
All data is available in the odia-word-embeddings-data dataset:
Raw Data
- odia_wiki_scraped.txt - Scraped Odia Wikipedia articles
- odia_literature.txt - Odia literature corpus
Processed Data
- odia_wiki_scraped.csv - Processed Wikipedia articles
- odia_literature.csv - Processed literature corpus
Additional Data Used
- Monolingual Odia corpus from OdiEnCorp 1.0 (used for training)
Models
All models are available in the odia-word-embeddings repository:
- Word2Vec Model
- GloVe Model
Setup
- Clone the repository:
git clone https://github.com/VanshajR/Odia-Word-Embeddings.git
cd Odia-Word-Embeddings
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Download data and models:
# Install huggingface-hub
pip install huggingface-hub
# Login to Hugging Face
huggingface-cli login
# Download data
huggingface-cli download VanshajR/odia-word-embeddings-data --repo-type dataset --local-dir data/
# Download models
huggingface-cli download VanshajR/odia-word-embeddings --local-dir models/
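The same downloads can also be scripted from Python. The following is a minimal sketch, not part of this repository: it assumes the huggingface-hub package installed above, and it assumes the data repository is hosted as a dataset repo (as described in the Data section) while the models live in a regular model repo. The names DOWNLOADS and download_all are hypothetical.

```python
from pathlib import Path

# Repository IDs and target folders are taken from the CLI commands above.
# The "dataset" repo type for the data repository is an assumption.
DOWNLOADS = [
    {
        "repo_id": "VanshajR/odia-word-embeddings-data",
        "repo_type": "dataset",
        "local_dir": Path("data"),
    },
    {
        "repo_id": "VanshajR/odia-word-embeddings",
        "repo_type": "model",
        "local_dir": Path("models"),
    },
]


def download_all(downloads=DOWNLOADS):
    """Fetch every repository listed in `downloads` into its local folder."""
    # Imported lazily so this file can be read and imported even when
    # huggingface-hub is not installed yet.
    from huggingface_hub import snapshot_download

    local_paths = []
    for item in downloads:
        item["local_dir"].mkdir(parents=True, exist_ok=True)
        local_paths.append(
            snapshot_download(
                repo_id=item["repo_id"],
                repo_type=item["repo_type"],
                local_dir=str(item["local_dir"]),
            )
        )
    return local_paths
```

Calling download_all() from the project root should populate data/ and models/ in the layout shown under Project Structure, provided network access and any required login are in place.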
Usage
Training Models
- Train Word2Vec:
python scripts/train_word2vec.py
- Train GloVe:
python scripts/train_glove.py
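The training scripts themselves are not reproduced here. As a rough illustration of what a Word2Vec training step can look like, the sketch below assumes the gensim library and a plain-text corpus with one sentence per line; neither assumption is confirmed by this project, the hyperparameters are illustrative only, and iter_sentences and train_word2vec are hypothetical names. GloVe training is not covered by this sketch.

```python
import tempfile
from pathlib import Path


def iter_sentences(corpus_path):
    """Yield one whitespace-tokenized sentence per non-empty line."""
    with Path(corpus_path).open(encoding="utf-8") as handle:
        for line in handle:
            tokens = line.split()
            if tokens:
                yield tokens


def train_word2vec(corpus_path, output_path, vector_size=100):
    """Train a Word2Vec model with gensim (assumed library) and save it."""
    # Imported lazily: gensim is a third-party dependency.
    from gensim.models import Word2Vec

    sentences = list(iter_sentences(corpus_path))
    model = Word2Vec(
        sentences=sentences,
        vector_size=vector_size,  # illustrative value, not the project's setting
        window=5,
        min_count=1,
        workers=1,
    )
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    model.save(str(output_path))
    return model


# Tiny demo of the tokenizer on a throwaway file.
# The two lines are placeholder text, not taken from the project's corpus.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("ଓଡ଼ିଆ ଭାଷା\n\nଓଡ଼ିଆ ଶବ୍ଦ ଭାଷା\n", encoding="utf-8")
    demo_sentences = list(iter_sentences(sample))

print(demo_sentences)
```

Whitespace splitting is the simplest possible tokenizer; the project's own preprocessing in src/data/ may well differ.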
Evaluating Models
Run the evaluate_embeddings.ipynb notebook.
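The notebook's contents are not shown here. One common intrinsic check for word embeddings is to rank words by cosine similarity to a query word, and the sketch below illustrates that idea in plain Python. It is an assumption about the kind of evaluation involved, not a copy of the notebook; the vectors are made-up toy values and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def nearest_neighbours(word, embeddings, topn=3):
    """Return the `topn` words most similar to `word`, best first."""
    query = embeddings[word]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors for illustration only; real vectors would come from the
# trained Word2Vec or GloVe models.
toy_embeddings = {
    "word_a": [1.0, 0.0, 0.0],
    "word_b": [0.9, 0.1, 0.0],
    "word_c": [0.0, 1.0, 0.0],
}

print(nearest_neighbours("word_a", toy_embeddings, topn=2))
```

Running the same ranking against both models for a fixed list of query words is one simple way to compare them side by side.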
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- The OdiEnCorp team for providing the monolingual Odia corpus
- Wikipedia for its public articles