---
language:
- bn
metrics:
- bleu
- cer
- wer
- meteor
library_name: transformers
pipeline_tag: text2text-generation
tags:
- text-generation-inference
---

# Bengali Text Correction Overview:
The goal of this project was to develop a model that can fix grammatical and syntax errors in Bengali text. The approach is similar to machine translation: an incorrect sentence is transformed into a correct one. We fine-tuned a pretrained model, [mBart50](https://huggingface.co/facebook/mbart-large-50), on a [dataset](https://github.com/hishab-nlp/BNSECData) of 1.3M samples for 6,500 steps and achieved a score of `{BLEU: 0.443, CER: 0.159, WER: 0.406, METEOR: 0.655}` when tested on unseen data. To try it, clone or download this [repo](https://github.com/himisir/Bengali-Sentence-Error-Correction), run the `correction.py` script, and type a sentence at the prompt.
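For context on the reported numbers: CER and WER are normalized edit distances at the character and word level (lower is better), while BLEU and METEOR are overlap scores (higher is better). A minimal, illustrative sketch of CER/WER in plain Python (not the evaluation code used for this model):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or words)."""
    dp = list(range(len(hyp) + 1))  # distances for the previous row
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,            # deletion
                dp[j - 1] + 1,        # insertion
                prev + (r != h),      # substitution (free if equal)
            )
    return dp[-1]

def cer(reference, hypothesis):
    """Character Error Rate: character edits / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word Error Rate: word edits / number of reference words."""
    return edit_distance(reference.split(), hypothesis.split()) / len(reference.split())
```

For example, `wer("আমি ভালো আছি", "আমি ভাল আছি")` is 1/3: one of the three reference words differs.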
## Initial Testing:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "..."  # replace with this model's Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, use_safetensors=True)

# Example: informal verb form ("আছো") paired with the formal pronoun "আপনি"
inputs = tokenizer("আপনি কেমন আছো?", return_tensors="pt")
outputs = model.generate(**inputs)
correct_bengali_sentence = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(correct_bengali_sentence)
# আপনি কেমন আছেন?
```

To test the model from the terminal, run `python correction.py` and type a sentence at the prompt. The script needs the `transformers` library; install it with `pip install -q transformers[torch] -U`.

#### Important note: Make sure you pass `use_safetensors=True` when loading the model.
# General issues faced during the entire journey:

- Issue: The system is not printing any evaluation metrics.
  Solution: The GPU I was training on doesn't support FP16/BF16 precision. Commenting out `fp16=True` in `Seq2SeqTrainingArguments` solved the issue.
- Issue: Training on TPU crashes on both Colab and Kaggle.

The model is clearly overfitting, and we can reduce that.

I'm also planning to run a 4-bit quantization on the same model to see how it performs against the base model. It should be a fun experiment.
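For that experiment, the usual route in `transformers` is a `BitsAndBytesConfig`. A configuration sketch of what the 4-bit load might look like, assuming `bitsandbytes` is installed and a CUDA GPU is available (the model id is a placeholder):

```python
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

model_id = "..."  # replace with this model's Hub id

# 4-bit NF4 quantization via bitsandbytes (requires a CUDA GPU)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    use_safetensors=True,
)
```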
## Resources and References:
[Dataset Source](https://github.com/hishab-nlp/BNSECData)
[Model Documentation and Troubleshooting](https://huggingface.co/docs/transformers/model_doc/mbart)