Update README.md
---
tags:
- multilingual
- sequence-to-sequence
---

# Malayalam → Hindi Translation Model (Fairseq)

This is a **Neural Machine Translation (NMT)** model trained to translate between **Malayalam (ml)** and **Hindi (hi)** using the **Fairseq** framework. It was trained on a custom, curated low-resource parallel corpus.

## Model Architecture

- Framework: Fairseq (PyTorch)
- Architecture: Transformer
- Type: Sequence-to-sequence
- Layers: 6 encoder / 6 decoder
- Embedding size: 512
- FFN size: 2048
- Tokenizer: SentencePiece (trained jointly on ml-hi)
- Vocabulary size: 32,000 (joint BPE)
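
The dimensions listed above match Fairseq's base Transformer preset. For orientation, a training command that realizes these settings might look as follows; only the architecture dimensions come from this card, while the paths, optimizer, and schedule flags are illustrative assumptions:

```bash
# Sketch only: data-bin/, checkpoints/ and all optimizer/scheduler settings
# below are placeholder assumptions, not values stated in the model card.
fairseq-train data-bin \
    --source-lang ml --target-lang hi \
    --arch transformer \
    --encoder-layers 6 --decoder-layers 6 \
    --encoder-embed-dim 512 --decoder-embed-dim 512 \
    --encoder-ffn-embed-dim 2048 --decoder-ffn-embed-dim 2048 \
    --optimizer adam --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 --save-dir checkpoints
```
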

## Training Details

| Setting | Value |
|----------------------|------------------------|
| Hardware | 1 x V100 32GB GPU |
| Training time | ~16 hours |

## Evaluation

The model was evaluated on a manually annotated Malayalam-Hindi test set consisting of 10,000 sentence pairs.

| BLEU | 11.08 | 29.56 |
| COMET | 0.76 | 0.62 |
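
For readers unfamiliar with the metric: BLEU is a clipped n-gram precision combined with a brevity penalty. A minimal, self-contained illustration of the computation (whitespace tokenization, one reference per sentence; real evaluations normally use `sacrebleu`, so this toy score will not match the table exactly):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU: clipped n-gram precisions (n=1..4) + brevity penalty."""
    match, total = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hg, rg = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            match[n - 1] += sum(min(c, rg[g]) for g, c in hg.items())
            total[n - 1] += max(len(h) - n + 1, 0)
    if min(match) == 0:
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(match, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)

# Identical hypothesis and reference gives the maximum score of 100.
print(corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"]))
```
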

## Usage

### In Fairseq (CLI)

```bash
fairseq-interactive /data-bin \
```
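
The command above is truncated in the card. A complete invocation typically adds the language pair, checkpoint path, and tokenizer flags; the file names here (`checkpoint_best.pt`, `spm.model`) echo the note further down but are assumptions, not confirmed paths:

```bash
# Sketch only: every path below is a placeholder.
echo "<Malayalam sentence>" | fairseq-interactive /data-bin \
    --source-lang ml --target-lang hi \
    --path checkpoints/checkpoint_best.pt \
    --bpe sentencepiece --sentencepiece-model spm.model \
    --beam 5 --remove-bpe sentencepiece
```
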

### In Python (Torch-based loading)

```python
import torch
# (model loading lines truncated in the original card)
model.eval()
```

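The Python block above is truncated after `import torch`. One common way to load a Fairseq checkpoint in Python is the `from_pretrained` hub interface; this is a sketch under the assumption that the checkpoint, dictionaries, and `spm.model` are laid out as the note below describes, with every path a placeholder:

```python
# Sketch only: requires fairseq + torch and the released model files.
# All paths/file names are placeholder assumptions, not confirmed by the card.
from fairseq.models.transformer import TransformerModel

ml2hi = TransformerModel.from_pretrained(
    "/path/to/model",                  # directory containing the checkpoint
    checkpoint_file="checkpoint_best.pt",
    data_name_or_path="/data-bin",     # holds dict.ml.txt / dict.hi.txt
    bpe="sentencepiece",
    sentencepiece_model="/path/to/spm.model",
)
ml2hi.eval()
print(ml2hi.translate("<Malayalam sentence>"))
```
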
> Note: To use this model effectively, you need the SentencePiece model (`spm.model`) and the exact Fairseq dictionary files (`dict.ml.txt`, `dict.hi.txt`).

## Dataset

This model was trained on a custom dataset compiled from:
* [AI4Bharat OPUS Corpus](https://github.com/AI4Bharat/IndicTrans)
* Manually aligned Malayalam-Hindi sentences from news and educational data
* Crawled parallel content from Indian government websites (under open license)

Preprocessing was done with:

* Normalization
* Language ID filtering
* Sentence length and alignment heuristics
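
The card does not spell out the length and alignment heuristics. A typical length-ratio filter for parallel corpora looks something like this; the thresholds are illustrative, not the values actually used:

```python
def keep_pair(src: str, tgt: str,
              min_len: int = 1, max_len: int = 250,
              max_ratio: float = 2.5) -> bool:
    """Keep a sentence pair only if both sides have a sane token count and
    their length ratio (in either direction) stays below max_ratio."""
    s, t = src.split(), tgt.split()
    if not (min_len <= len(s) <= max_len and min_len <= len(t) <= max_len):
        return False
    ratio = max(len(s), len(t)) / min(len(s), len(t))
    return ratio <= max_ratio

# A balanced pair survives; a wildly mismatched one is dropped.
pairs = [("one two", "एक दो"), ("a b c d e f g h", "x")]
kept = [p for p in pairs if keep_pair(*p)]
print(len(kept))
```
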

## License

## Citation

```bibtex
@misc{malayalam-hindi-nmt,
  author       = {Navaneeth Sreedharan and Sneha S and Renimol V R},
  title        = {Malayalam-Hindi Neural Machine Translation using Fairseq},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/icfoss/Malayalam-Hindi-Translation-Model-fairseq}}
}
```

## Contact / Contributions

For queries or collaboration, contact `navaneeth@icfoss.com`. Contributions are welcome via pull requests or issues.