VanshajR committed on
Commit ad35cdc · verified · 1 Parent(s): 1ea232a

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +106 -0
README.md ADDED
@@ -0,0 +1,106 @@
1
+ # Odia Word Embeddings
2
+
3
+ This project implements and compares two word embedding models, Word2Vec and GloVe, for the Odia language. The models are trained on a combination of Odia literature and Wikipedia data.
4
+
5
+ ## Project Structure
6
+
7
+ ```
8
+ ├── data/
9
+ │   ├── raw/          # Raw data files (not in repo)
10
+ │   ├── processed/    # Processed data files (not in repo)
11
+ │   └── interim/      # Intermediate data files (not in repo)
12
+ ├── models/           # Trained models (not in repo)
13
+ ├── notebooks/        # Jupyter notebooks for analysis
14
+ ├── scripts/          # Utility scripts
15
+ └── src/              # Source code
16
+     ├── data/         # Data processing modules
17
+     └── models/       # Model training modules
18
+ ```
19
+
20
+ ## Data and Models
21
+
22
+ Due to size limitations, the data and model files are hosted on Hugging Face:
23
+
24
+ ### Data
25
+ All data is available in the [odia-word-embeddings-data](https://huggingface.co/datasets/VanshajR/odia-word-embeddings-data) dataset:
26
+
27
+ #### Raw Data
28
+ - `odia_wiki_scraped.txt` - Scraped Odia Wikipedia articles
29
+ - `odia_literature.txt` - Odia literature corpus
30
+
31
+ #### Processed Data
32
+ - `odia_wiki_scraped.csv` - Processed Wikipedia articles
33
+ - `odia_literature.csv` - Processed literature corpus
34
+
35
+ #### Additional Data Used
36
+ - Monolingual Odia corpus from [OdiEnCorp 1.0](https://github.com/odiencorp/OdiEnCorp) (used for training)
37
+
38
+ ### Models
39
+ All models are available in the [odia-word-embeddings](https://huggingface.co/VanshajR/odia-word-embeddings) repository:
40
+ - Word2Vec Model
41
+ - GloVe Model
42
+
43
+ ## Setup
44
+
45
+ 1. Clone the repository:
46
+ ```bash
47
+ git clone https://github.com/VanshajR/Odia-Word-Embeddings.git
48
+ cd Odia-Word-Embeddings
49
+ ```
50
+
51
+ 2. Create and activate a virtual environment:
52
+ ```bash
53
+ python -m venv venv
54
+ source venv/bin/activate # On Windows: venv\Scripts\activate
55
+ ```
56
+
57
+ 3. Install dependencies:
58
+ ```bash
59
+ pip install -r requirements.txt
60
+ ```
61
+
62
+ 4. Download data and models:
63
+ ```bash
64
+ # Install huggingface-hub
65
+ pip install huggingface-hub
66
+
67
+ # Login to Hugging Face
68
+ huggingface-cli login
69
+
70
+ # Download data
71
+ huggingface-cli download VanshajR/odia-word-embeddings-data --repo-type dataset --local-dir data/
72
+
73
+ # Download models
74
+ huggingface-cli download VanshajR/odia-word-embeddings --local-dir models/
75
+ ```
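The same download can also be scripted from Python. The following is a minimal sketch, assuming the `huggingface_hub` package is installed, that the data lives in a dataset repository, and that the models live in a model repository (both inferred from the links in this README). The `REPOS` table and the `download_all` helper are hypothetical names introduced here for illustration.

```python
from pathlib import Path

# (repo_id, repo_type, local_dir) for each repository named in this README.
# The repo types are assumptions inferred from the Hugging Face URLs.
REPOS = [
    ("VanshajR/odia-word-embeddings-data", "dataset", Path("data")),
    ("VanshajR/odia-word-embeddings", "model", Path("models")),
]


def download_all(repos=REPOS):
    """Download a snapshot of every repository into its local directory."""
    # Imported lazily so this module can be loaded without the dependency.
    from huggingface_hub import snapshot_download

    paths = []
    for repo_id, repo_type, local_dir in repos:
        local_dir.mkdir(parents=True, exist_ok=True)
        paths.append(
            snapshot_download(
                repo_id=repo_id,
                repo_type=repo_type,
                local_dir=local_dir,
            )
        )
    return paths
```

Calling `download_all()` from the repository root should populate `data/` and `models/`, provided the repositories are reachable with the current credentials.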
76
+
77
+ ## Usage
78
+
79
+ ### Training Models
80
+
81
+ 1. Train Word2Vec:
82
+ ```bash
83
+ python scripts/train_word2vec.py
84
+ ```
85
+
86
+ 2. Train GloVe:
87
+ ```bash
88
+ python scripts/train_glove.py
89
+ ```
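The training scripts themselves are not shown in this README. As a rough illustration of what a Word2Vec training step can look like, here is a minimal sketch that assumes `gensim` (version 4 or later) as the library and a plain-text corpus with one sentence per line. The function names, the whitespace tokenization, and the hyperparameters are all assumptions, not the contents of `scripts/train_word2vec.py`.

```python
from pathlib import Path


def tokenize_lines(lines):
    """Yield one whitespace-tokenized sentence per non-empty line."""
    for line in lines:
        tokens = line.split()
        if tokens:
            yield tokens


def train_word2vec(corpus_path, output_path, vector_size=100):
    """Train a skip-gram Word2Vec model and save it to output_path."""
    # Lazy import: gensim is an assumed dependency, not confirmed by the README.
    from gensim.models import Word2Vec

    with open(corpus_path, encoding="utf-8") as handle:
        sentences = list(tokenize_lines(handle))

    model = Word2Vec(
        sentences=sentences,
        vector_size=vector_size,  # embedding dimensionality
        window=5,                 # context words considered on each side
        min_count=2,              # ignore words seen fewer than 2 times
        sg=1,                     # 1 selects skip-gram, 0 selects CBOW
        workers=4,
    )
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    model.save(str(output_path))
    return model
```

Under these assumptions, `train_word2vec("data/raw/odia_literature.txt", "models/word2vec.model")` would train on the literature corpus. The output filename is a placeholder.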
90
+
91
+ ### Evaluating Models
92
+
93
+ Run the `evaluate_embeddings.ipynb` notebook.
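The README does not describe what the notebook measures. One common sanity check when comparing embedding models is to inspect each word's nearest neighbours by cosine similarity. The sketch below uses only the standard library and made-up two-dimensional toy vectors; the function names and the vectors are illustrative assumptions, not the notebook's code or real model output.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def nearest_neighbors(word, vectors, top_n=3):
    """Rank every other word in `vectors` by similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy vectors, invented for illustration only.
toy_vectors = {
    "ରାଜା": [1.0, 0.0],
    "ରାଣୀ": [0.9, 0.1],
    "ନଦୀ": [0.0, 1.0],
}
```

Running `nearest_neighbors` on the same query words against each trained model gives a quick qualitative comparison of the two embedding spaces.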
94
+
95
+ ## Contributing
96
+
97
+ Contributions are welcome! Please feel free to submit a Pull Request.
98
+
99
+ ## License
100
+
101
+ This project is licensed under the MIT License; see the LICENSE file for details.
102
+
103
+ ## Acknowledgments
104
+
105
+ - The OdiEnCorp team for providing the monolingual Odia corpus, available [here](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2879)
106
+ - Wikipedia for their public articles