Odia Word Embeddings

This project implements and compares two word embedding models, Word2Vec and GloVe, for the Odia language. Both models are trained on a combination of Odia literature and Odia Wikipedia data.

Project Structure

├── data/
│   ├── raw/          # Raw data files (not in repo)
│   ├── processed/    # Processed data files (not in repo)
│   └── interim/      # Intermediate data files (not in repo)
├── models/           # Trained models (not in repo)
├── notebooks/        # Jupyter notebooks for analysis
├── scripts/          # Utility scripts
└── src/              # Source code
    ├── data/         # Data processing modules
    └── models/       # Model training modules

Data and Models

Due to size limitations, the data and model files are hosted on Hugging Face:

Data

All data is available in the odia-word-embeddings-data dataset:

Raw Data

  • odia_wiki_scraped.txt - Scraped Odia Wikipedia articles
  • odia_literature.txt - Odia literature corpus

Processed Data

  • odia_wiki_scraped.csv - Processed Wikipedia articles
  • odia_literature.csv - Processed literature corpus

Additional Data Used

Models

All models are available in the odia-word-embeddings repository:

  • Word2Vec Model
  • GloVe Model

Setup

  1. Clone the repository:
git clone https://github.com/VanshajR/Odia-Word-Embeddings.git
cd Odia-Word-Embeddings
  2. Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt
  4. Download data and models:
# Install huggingface-hub
pip install huggingface-hub

# Login to Hugging Face
huggingface-cli login

# Download data
huggingface-cli download VanshajR/odia-word-embeddings-data --repo-type dataset --local-dir data/

# Download models
huggingface-cli download VanshajR/odia-word-embeddings --local-dir models/
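
After the downloads finish, it can help to confirm that the directories named in the project structure are actually present. The following is a minimal sketch, not part of the repository: `EXPECTED_DIRS` and `missing_dirs` are hypothetical names, and whether the Hugging Face repositories unpack into exactly this layout is an assumption.

```python
from pathlib import Path

# Directories listed in the Project Structure section. That the downloads
# populate exactly these paths is an assumption, not something the
# repository guarantees.
EXPECTED_DIRS = ["data/raw", "data/processed", "data/interim", "models"]


def missing_dirs(root="."):
    """Return the expected directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


# Run from the repository root; an empty list means every directory was found.
print(missing_dirs())
```

If the list is not empty, the download may have used a different layout, in which case the files can be moved to the paths the training scripts expect.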

Usage

Training Models

  1. Train Word2Vec:
python scripts/train_word2vec.py
  2. Train GloVe:
python scripts/train_glove.py
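
To make the difference between the two models concrete, here is a rough, standard-library sketch of the training signal each one consumes. It is not taken from the training scripts: the function names, the window size, and the placeholder tokens are all assumptions for illustration, and whether `train_word2vec.py` uses skip-gram or CBOW is not stated in this README. Skip-gram Word2Vec learns from local (center, context) pairs, while GloVe fits vectors to corpus-wide co-occurrence counts, commonly weighted by the inverse of the distance between the two words.

```python
from collections import Counter


def skipgram_pairs(tokens, window=2):
    """Collect (center, context) pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cooccurrence_counts(tokens, window=2):
    """Accumulate symmetric co-occurrence counts weighted by 1 / distance."""
    counts = Counter()
    for i, center in enumerate(tokens):
        hi = min(len(tokens), i + window + 1)
        for j in range(i + 1, hi):
            weight = 1.0 / (j - i)
            counts[(center, tokens[j])] += weight
            counts[(tokens[j], center)] += weight
    return counts


# Placeholder tokens standing in for a tokenized Odia sentence.
tokens = ["w1", "w2", "w3", "w4"]
print(skipgram_pairs(tokens, window=1))
print(cooccurrence_counts(tokens, window=2))
```

In practice the pairs or counts would be computed over the whole corpus rather than one sentence, and the resulting vectors would come from a library implementation rather than this sketch.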

Evaluating Models

Run the evaluate_embeddings.ipynb notebook in Jupyter.
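
The notebook's contents are not reproduced here. As an illustration of one common way to compare embeddings, the sketch below ranks words by cosine similarity using only the standard library. The helper names and the two-dimensional toy vectors are made up for the example; real vectors would be loaded from the trained models.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Treat a zero vector as dissimilar to everything.
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Return up to topn (word, similarity) pairs, most similar first."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:topn]


# Toy vectors for illustration only; they are not outputs of either model.
toy_vectors = {"w1": [1.0, 0.0], "w2": [0.9, 0.1], "w3": [0.0, 1.0]}
print(most_similar("w1", toy_vectors, topn=2))
```

Running the same query against the Word2Vec and GloVe vectors and comparing the neighbor lists is a simple qualitative check of how the two models differ.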

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • OdiEnCorp team for providing the monolingual Odia corpus
  • Wikipedia for their public articles