Odia Word Embeddings

This project implements and compares different word embedding models (Word2Vec and GloVe) for the Odia language. The models are trained on a combination of Odia literature and Wikipedia data.

Project Structure

├── data/
│   ├── raw/          # Raw data files (not in repo)
│   ├── processed/    # Processed data files (not in repo)
│   └── interim/      # Intermediate data files (not in repo)
├── models/           # Trained models (not in repo)
├── notebooks/        # Jupyter notebooks for analysis
├── scripts/          # Utility scripts
└── src/              # Source code
    ├── data/         # Data processing modules
    └── models/       # Model training modules

Data and Models

Due to size limitations, the data and model files are hosted on Hugging Face:

Data

All data is available in the odia-word-embeddings-data dataset:

Raw Data

  • odia_wiki_scraped.txt - Scraped Odia Wikipedia articles
  • odia_literature.txt - Odia literature corpus

Processed Data

  • odia_wiki_scraped.csv - Processed Wikipedia articles
  • odia_literature.csv - Processed literature corpus

Additional Data Used

Models

All models are available in the odia-word-embeddings repository:

  • Word2Vec Model
  • GloVe Model

Setup

  1. Clone the repository:
git clone https://github.com/VanshajR/Odia-Word-Embeddings.git
cd Odia-Word-Embeddings
  2. Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Download data and models:
# Install huggingface-hub
pip install huggingface-hub

# Login to Hugging Face
huggingface-cli login

# Download data
huggingface-cli download VanshajR/odia-word-embeddings-data --repo-type dataset --local-dir data/

# Download models
huggingface-cli download VanshajR/odia-word-embeddings --local-dir models/
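
The same downloads can also be scripted from Python. The following is a minimal sketch, assuming the `snapshot_download` function from the `huggingface_hub` package, and assuming the data repository is of type `dataset` and the models repository of type `model`. The helper name `download_assets` and the `ASSETS` table are illustrative and not part of this project.

```python
from pathlib import Path

# Repo IDs and target directories are taken from the CLI commands above.
# The repo types are assumptions based on how each repo is described.
ASSETS = [
    {"repo_id": "VanshajR/odia-word-embeddings-data", "repo_type": "dataset", "local_dir": "data"},
    {"repo_id": "VanshajR/odia-word-embeddings", "repo_type": "model", "local_dir": "models"},
]


def download_assets(assets=ASSETS):
    """Download each repository snapshot into its local directory."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    paths = []
    for asset in assets:
        Path(asset["local_dir"]).mkdir(parents=True, exist_ok=True)
        # snapshot_download returns the local folder holding the downloaded files.
        paths.append(snapshot_download(**asset))
    return paths
```

Calling `download_assets()` after `huggingface-cli login` should populate `data/` and `models/` in one step.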

Usage

Training Models

  1. Train Word2Vec:
python scripts/train_word2vec.py
  2. Train GloVe:
python scripts/train_glove.py
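
To make the difference between the two models concrete, here is a small, self-contained sketch of the training inputs each one conventionally consumes: skip-gram Word2Vec learns from (center, context) word pairs, while GloVe learns from aggregated co-occurrence counts. This is an illustration only and is not the code in scripts/train_word2vec.py or scripts/train_glove.py. The function names, window sizes, and placeholder tokens are assumptions.

```python
from collections import defaultdict


def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs, the training examples for skip-gram Word2Vec."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


def cooccurrence_counts(tokens, window=2):
    """Build a symmetric co-occurrence table of the kind GloVe is fitted to.

    Each co-occurrence is weighted by 1/distance, so adjacent words count
    more than words at the edge of the window.
    """
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), i):
            weight = 1.0 / (i - j)
            counts[(word, tokens[j])] += weight
            counts[(tokens[j], word)] += weight
    return dict(counts)


# Placeholder tokens standing in for one tokenized Odia sentence.
sentence = ["w1", "w2", "w3", "w1"]
pairs = list(skipgram_pairs(sentence, window=1))
counts = cooccurrence_counts(sentence, window=1)
```

Word2Vec visits the pairs one at a time during training, whereas GloVe first scans the whole corpus to build the table and then fits vectors to it.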

Evaluating Models

Open and run the evaluate_embeddings.ipynb notebook (Jupyter notebooks are kept under notebooks/).
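
A common way to compare embedding models is to inspect nearest neighbours by cosine similarity. The sketch below shows that computation on toy vectors. It is not taken from the notebook, and the function names and placeholder vocabulary are assumptions for illustration; in practice the vectors would come from the trained Word2Vec or GloVe model.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Treat an all-zero vector as similar to nothing.
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings, topn=3):
    """Rank every other word in `embeddings` by cosine similarity to `word`."""
    target = embeddings[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:topn]


# Toy 2-dimensional vectors; real embeddings have many more dimensions.
toy_embeddings = {
    "word_a": [1.0, 0.0],
    "word_b": [0.9, 0.1],
    "word_c": [0.0, 1.0],
}
neighbours = most_similar("word_a", toy_embeddings, topn=2)
```

Running the same query against both models and comparing the neighbour lists gives a quick qualitative check alongside any quantitative evaluation.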

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • OdiEnCorp team for providing the monolingual Odia corpus
  • Wikipedia for their public articles