---
license: apache-2.0
datasets:
- sumitaryal/nepali_grammatical_error_detection
language:
- ne
metrics:
- accuracy
base_model:
- google/muril-base-cased
pipeline_tag: text-classification
widget:
- src: रामले भात खायो ।
  example_title: Sample 1
new_version: sumitaryal/Nepali_Grammatical_Error_Detection_MuRIL
library_name: transformers
---

# Model Card for Nepali Grammatical Error Detection (MuRIL)

This model is designed for the **Nepali Grammatical Error Detection (GED)** task. It uses the BERT-based MuRIL model to detect grammatical errors in Nepali text.

## Model Details

### Model Description

- **Developed by:** Sumit Aryal
- **Model type:** BERT (MuRIL-based)
- **Language(s):** Nepali
- **License:** Apache 2.0
- **Finetuned from model:** google/muril-base-cased

### Dataset

- **Dataset Name:** [Nepali Grammatical Error Detection Dataset](https://huggingface.co/datasets/sumitaryal/nepali_grammatical_error_detection)
- **Description:** The training set pairs **2,568,682** correctly constructed sentences with their erroneous counterparts, for a total of **7,514,122** samples. The validation set contains **365,606** correct sentences and **405,905** incorrect sentences. The errors cover several types, including verb inflections, homophones, punctuation errors, and sentence structure issues, so the data can be used for both training and evaluating grammatical error detection models.

### Model Sources

- **Repository:** [Nepali Grammatical Error Detection MuRIL](https://huggingface.co/sumitaryal/Nepali_Grammatical_Error_Detection_MuRIL)
- **Paper:** "BERT-Based Nepali Grammatical Error Detection and Correction Leveraging a New Corpus" (INSPECT-2024)

## Uses

### Direct Use

- Grammar checking for written Nepali text.

## Evaluation Metrics

- **Accuracy:** 91.1515%
- **Training Loss:** 0.242700
- **Validation Loss:** 0.217756
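
To put the accuracy figure in context, it can help to compare it against a trivial baseline. The sketch below uses the validation counts given in the Dataset section and assumes, since this card does not say so explicitly, that the reported accuracy was measured on that validation split. It computes the accuracy a classifier would reach by always predicting the more frequent class.

```python
# Validation-set class counts as stated in the Dataset section.
VALID_CORRECT = 365_606
VALID_INCORRECT = 405_905

# Accuracy reported in this card (assumed to refer to the validation split).
REPORTED_ACCURACY = 0.911515


def majority_baseline(correct: int, incorrect: int) -> float:
    """Accuracy of always predicting the more frequent class."""
    return max(correct, incorrect) / (correct + incorrect)


baseline = majority_baseline(VALID_CORRECT, VALID_INCORRECT)
gain_points = (REPORTED_ACCURACY - baseline) * 100

print(f"Validation samples: {VALID_CORRECT + VALID_INCORRECT:,}")
print(f"Majority-class baseline: {baseline:.2%}")
print(f"Gain over baseline: {gain_points:.2f} percentage points")
```

Because the two classes are close to balanced, the baseline lands only slightly above 50%, so most of the reported accuracy reflects what the model has learned rather than class imbalance.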

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from transformers import BertForSequenceClassification, AutoTokenizer

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub.
model = BertForSequenceClassification.from_pretrained("sumitaryal/Nepali_Grammatical_Error_Detection_MuRIL")
tokenizer = AutoTokenizer.from_pretrained("sumitaryal/Nepali_Grammatical_Error_Detection_MuRIL", do_lower_case=False)

input_sentence = "रामले भात खायो ।"
inputs = tokenizer(input_sentence, return_tensors="pt")

# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring class index to its label name.
predicted_class_id = logits.argmax().item()
predicted_class = model.config.id2label[predicted_class_id]
print(f'The sentence "{input_sentence}" is "{predicted_class}"')
```
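
The snippet above reports only the top-scoring label. If a confidence score is also wanted, the logits can be turned into probabilities with a softmax. The sketch below shows that computation in plain Python on a made-up logits row; the values are illustrative and are not output from this model. With the tensors from the snippet above, `torch.softmax(logits, dim=-1)` performs the same operation.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtract the maximum first so exp() cannot overflow.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a two-class model; not real output from this model.
example_logits = [-1.2, 2.3]

probs = softmax(example_logits)
# Index of the most probable class, analogous to logits.argmax().
predicted_class_id = max(range(len(probs)), key=probs.__getitem__)
confidence = probs[predicted_class_id]
print(f"class id {predicted_class_id} with probability {confidence:.3f}")
```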

## Training Details

- Framework: PyTorch
- Hyperparameters:
  - Epochs = 1
  - Train Batch Size = 256
  - Valid Batch Size = 256
  - Loss Function = Cross Entropy Loss
  - Optimizer = AdamW
  - Optimizer Parameters:
    - Learning Rate = 5e-5
    - β1 = 0.9
    - β2 = 0.999
    - ϵ = 1e-8
- GPU = NVIDIA® GeForce® RTX™ 4060 GPU, 8GB VRAM
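
The batch size listed above, together with the dataset sizes from the Dataset section, determines how many batches a single epoch involves. The following back-of-the-envelope sketch assumes one optimizer step per batch (no gradient accumulation) and that the final, smaller batch is kept; neither detail is stated in this card.

```python
import math

# Figures taken from this card.
TRAIN_SAMPLES = 7_514_122
VALID_SAMPLES = 365_606 + 405_905
BATCH_SIZE = 256
EPOCHS = 1


def batches_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(num_samples / batch_size)


# Assumption: one optimizer step per training batch.
train_steps = batches_per_epoch(TRAIN_SAMPLES, BATCH_SIZE) * EPOCHS
valid_batches = batches_per_epoch(VALID_SAMPLES, BATCH_SIZE)

print(f"Training steps: {train_steps:,}")
print(f"Validation batches: {valid_batches:,}")
```

Under these assumptions the single training epoch corresponds to 29,353 optimizer steps, and one pass over the validation set takes 3,014 batches.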