Yam committed on
Commit
1dd72ef
1 Parent(s): 6389ae0

upload model

README.md CHANGED
@@ -1,3 +1,173 @@
- ---
- license: mit
- ---
+ ---
+ base_model: vinai/bartpho-syllable
+ library_name: peft
+ tags:
+ - base_model:adapter:vinai/bartpho-syllable
+ - lora
+ - transformers
+ - seq2seq
+ - vietnamese
+ - error-correction
+ - spell-checking
+ - text-generation
+ license: mit
+ language:
+ - vi
+ metrics:
+ - bleu
+ - wer
+ - cer
+ - accuracy
+ ---
+
+ # BartPho-Syllable - Vietnamese Error Correction (LoRA)
+
+ ## Model Details
+
+ ### Model Description
+
+ This model is a fine-tuned version of **[vinai/bartpho-syllable](https://huggingface.co/vinai/bartpho-syllable)** using **LoRA (Low-Rank Adaptation)**. It is specifically designed for **Vietnamese Error Correction (VEC)** tasks.
+
+ Unlike simple diacritic restoration models, this model aims to correct:
+ 1. **Missing Diacritics:** (e.g., "trang phuc" -> "trang phục")
+ 2. **Spelling Errors:** (e.g., "bài toan" -> "bài toán")
+ 3. **Teencode & Informal Variants:** Normalizing teencode, slang, and informal online writing into standard Vietnamese (e.g., "zui wa" -> "vui quá", "iu vk" -> "yêu vợ").
+ 4. **Basic Grammar/Contextual Errors:** Corrections based on syllable-level understanding.
+
+ The model was trained on a dataset of approximately **50,000 sentences across the training, validation, and test splits**, collected from **crawled Vietnamese social media comments** and **automatically labeled using a large language model**. Due to the nature of social media data, the dataset may contain noise or labeling imperfections; however, it is **not intended to include any offensive content or to target any individual or organization**.
+
+ - **Developed by:** Thanh-Dan Bui
+ - **Model type:** Seq2Seq (Encoder-Decoder) with LoRA Adapter
+ - **Language(s):** Vietnamese
+ - **License:** MIT
+ - **Finetuned from model:** `vinai/bartpho-syllable`
+
+ ## Uses
+
+ ### Direct Use
+
+ The model is designed for Vietnamese text error correction. It takes noisy Vietnamese text as input, including missing diacritics, spelling mistakes, and informal or teencode expressions, and produces grammatically correct and orthographically normalized Vietnamese text as output.
+
+ **Example:**
+ * **Input:** "t đang xu ly 1 bai toán la sưa lỗi cho tieng viet"
+ * **Output:** "tôi đang xử lý 1 bài toán là sửa lỗi cho tiếng Việt"
+
+ ### Out-of-Scope Use
+
+ * Translation from other languages to Vietnamese.
+ * Generating text from scratch (open-ended generation).
+ * Correcting highly specialized technical jargon not present in general Vietnamese corpora.
+
+ ## Bias, Risks, and Limitations
+
+ * **Context Length:** The model is optimized for sentence-level correction (max length ~256 tokens). Very long paragraphs should be split before processing (see the sketch after this list).
+ * **Ambiguity:** When a noisy or abbreviated form can correspond to multiple valid standard forms (for example, variants of "không" such as "k", "ko", "hong", or "hông"), the model relies on context to infer the most likely correction and may occasionally predict the wrong one.
+ * **Proper Nouns:** The model might attempt to "correct" foreign names or uncommon proper nouns if they resemble Vietnamese words.
+
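+ A minimal sketch for pre-splitting long inputs before correction; `correct_long_text` is a hypothetical helper, `pipe` is the pipeline from the quick-start section below, and the naive punctuation-based split is only a starting point for informal text:
+
+ ```python
+ import re
+
+ def correct_long_text(pipe, text, max_new_tokens=256):
+     # Naive split on sentence-final punctuation; informal comments may need a smarter splitter.
+     sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
+     corrected = [pipe(s, max_new_tokens=max_new_tokens)[0]["generated_text"] for s in sentences]
+     return " ".join(corrected)
+ ```
+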
+ ## How to Get Started with the Model
+
+ You can use this model with the `transformers` library.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
+
+ path = "yammdd/vietnamese-error-correction"
+
+ tokenizer = AutoTokenizer.from_pretrained(path)
+ model = AutoModelForSeq2SeqLM.from_pretrained(path)
+
+ # Wrap model and tokenizer in a text2text-generation pipeline for correction.
+ pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
+
+ text = "hum ni a bùn wá bé iu ưi"
+ out = pipe(text, max_new_tokens=256)
+
+ print(out[0]["generated_text"])
+ # Output: hôm nay anh buồn quá bé yêu ơi
+ ```
+
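+ Since the card describes a LoRA adapter (`library_name: peft`), loading from an adapter-only checkpoint is also possible; a sketch, assuming the adapter files (`adapter_config.json` and adapter weights) are present at the repo id:
+
+ ```python
+ from peft import AutoPeftModelForSeq2SeqLM
+
+ # Loads the base model declared in the adapter config and applies the LoRA weights.
+ model = AutoPeftModelForSeq2SeqLM.from_pretrained("yammdd/vietnamese-error-correction")
+ model = model.merge_and_unload()  # optionally fold the adapter into the base weights
+ ```
+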
+ ## Training Details
+
+ ### Training Data
+ * **Source:** Aggregated Vietnamese text corpus.
+ * **Task:** Vietnamese text correction (diacritic restoration and error correction).
+ * **Size:** Approximately 50,000 sentence pairs (split into Train/Validation/Test sets).
+ * **Data Format:**
+   * **Input:** Text with removed diacritics or synthetically induced spelling errors.
+   * **Target:** Original, grammatically correct Vietnamese text.
+ * **Sequence Length:** Maximum input and output length of 256 tokens.
+
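+ For illustration, a minimal sketch of how the synthetic "removed diacritics" inputs can be derived from clean targets (the LLM-labeled social media pairs are produced differently; this covers only the synthetic portion of the format above):
+
+ ```python
+ import unicodedata
+
+ def strip_diacritics(text: str) -> str:
+     # Decompose characters, then drop combining marks ("tiếng Việt" -> "tieng Viet").
+     decomposed = unicodedata.normalize("NFD", text)
+     stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
+     # "đ"/"Đ" are standalone letters, not combining marks, so map them explicitly.
+     return stripped.replace("đ", "d").replace("Đ", "D")
+
+ target = "tôi đang xử lý một bài toán sửa lỗi"
+ source = strip_diacritics(target)  # "toi dang xu ly mot bai toan sua loi"
+ ```
+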
+ ### Training Procedure
+ * **Base Model:** `vinai/bartpho-syllable`
+ * **Technique:** Parameter-Efficient Fine-Tuning (PEFT) using **LoRA** (Low-Rank Adaptation).
+ * **LoRA Configuration:**
+   * **Target Modules:** `q_proj`, `v_proj`, `out_proj`, `fc1`, `fc2` (covering both attention and feed-forward layers).
+   * **Rank (r):** 32
+   * **Alpha:** 64
+   * **Dropout:** 0.1
+ * **Precision:** FP16 (Mixed Precision) for optimized memory usage and speed.
+
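+ The stated configuration maps directly onto peft's `LoraConfig`; a minimal sketch (training loop omitted):
+
+ ```python
+ from peft import LoraConfig, TaskType, get_peft_model
+ from transformers import AutoModelForSeq2SeqLM
+
+ base = AutoModelForSeq2SeqLM.from_pretrained("vinai/bartpho-syllable")
+
+ # LoRA settings from the card: r=32, alpha=64, dropout=0.1,
+ # applied to both attention and feed-forward projections.
+ lora_config = LoraConfig(
+     task_type=TaskType.SEQ_2_SEQ_LM,
+     r=32,
+     lora_alpha=64,
+     lora_dropout=0.1,
+     target_modules=["q_proj", "v_proj", "out_proj", "fc1", "fc2"],
+ )
+ model = get_peft_model(base, lora_config)
+ model.print_trainable_parameters()  # only the adapter weights are trainable
+ ```
+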
+ #### Training Hyperparameters
+ * **Optimizer:** AdamW with weight decay of 0.01.
+ * **Batch Size:** 16 per device (total effective batch size depends on GPU count, typically 32 on 2x T4).
+ * **Learning Rate:** 5e-4.
+ * **Training Epochs:** 5.
+ * **Evaluation Strategy:** Every 2,000 steps.
+ * **Loss Masking:** Padding positions in the labels are excluded from the loss via `DataCollatorForSeq2Seq` with `label_pad_token_id=-100`.
+
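+ A sketch of how these hyperparameters translate to `Seq2SeqTrainingArguments` and the collator (names such as `output_dir` are placeholders, not values from the card):
+
+ ```python
+ from transformers import (AutoTokenizer, DataCollatorForSeq2Seq,
+                           Seq2SeqTrainingArguments)
+
+ tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")
+
+ args = Seq2SeqTrainingArguments(
+     output_dir="bartpho-vec-lora",      # hypothetical output path
+     per_device_train_batch_size=16,
+     learning_rate=5e-4,
+     num_train_epochs=5,
+     weight_decay=0.01,
+     eval_strategy="steps",
+     eval_steps=2000,
+     fp16=True,
+     predict_with_generate=True,         # generate during eval for BLEU/WER/CER
+ )
+
+ # Label padding positions are set to -100 so the loss ignores them.
+ collator = DataCollatorForSeq2Seq(tokenizer, label_pad_token_id=-100)
+ ```
+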
+ #### Speeds, Sizes, Times
+ * **Hardware:** 2x NVIDIA T4 GPUs (Kaggle environment).
+ * **Checkpoint Size:** The adapter weights are lightweight (only several megabytes), significantly smaller than the full BARTpho base model.
+ * **Training Dynamics:** Managed via the Hugging Face `Seq2SeqTrainer` with `predict_with_generate` enabled for validation metrics.
+
+ ## Evaluation
+
+ ### Testing Data, Factors & Metrics
+
+ The model was evaluated on a held-out test set of **5,081 samples**, covering a diverse range of Vietnamese sentence structures and lengths.
+
+ #### Metrics
+ * **BLEU Score:** Measures the n-gram overlap between the predicted and target text.
+ * **Word Error Rate (WER):** Measures the ratio of errors (substitutions, deletions, insertions) at the word level.
+ * **Character Error Rate (CER):** Measures accuracy at the character level, which is critical for verifying diacritic placement.
+ * **Exact Match Accuracy:** The percentage of sentences where every single character matches the ground truth.
+ * **Word Accuracy:** The percentage of individual words correctly predicted (excluding length mismatches).
+
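+ The exact evaluation script is not part of this repo; a minimal sketch of how these metrics can be computed with the `evaluate` library (assuming `sacrebleu` and `jiwer` are installed), not the author's original code:
+
+ ```python
+ import evaluate  # pip install evaluate sacrebleu jiwer
+
+ bleu = evaluate.load("sacrebleu")
+ wer = evaluate.load("wer")
+ cer = evaluate.load("cer")
+
+ # Toy prediction/reference pair for illustration.
+ preds = ["tôi đang xử lý một bài toán"]
+ refs = ["tôi đang xử lý một bài toán"]
+
+ print(bleu.compute(predictions=preds, references=[[r] for r in refs])["score"])
+ print(wer.compute(predictions=preds, references=refs))
+ print(cer.compute(predictions=preds, references=refs))
+ ```
+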
+ ### Results
+
+ #### 1. Overall Performance
+ | Metric | Score | Note |
+ | :--- | :--- | :--- |
+ | **BLEU** | **86.92** | High linguistic and semantic fidelity |
+ | **Word Accuracy** | **93.65%** | Robust word-level correction |
+ | **Exact Match** | **52.23%** | Entire sentence perfectly restored |
+ | **WER** | **0.0864** | ~8.6% error rate per word |
+ | **CER** | **0.0366** | ~3.7% error rate per character |
+
+ *Note: The Exact Match score reflects the inherent ambiguity in the Vietnamese language (e.g., "muon" could be "muốn", "mượn", or "muộn"), where multiple correct interpretations may exist without broader paragraph context.*
+
+ #### 2. Accuracy by Sentence Length
+ The model's performance varies based on the complexity and length of the input:
+
+ | Category | Length (words) | Accuracy | Sample Count |
+ | :--- | :--- | :--- | :--- |
+ | **Short** | < 10 | **61.40%** | 2,347 |
+ | **Medium** | 10 - 30 | **47.47%** | 2,408 |
+ | **Long** | > 30 | **21.47%** | 326 |
+
+ *Analysis: Exact-match accuracy is highest on short sentences and declines on longer sequences (>30 words), likely due to the increased probability of cumulative errors and the 256-token limit.*
+
+ ---
+
+ ## Environmental Impact
+
+ - **Hardware Type:** 2 x NVIDIA Tesla T4 GPUs.
+ - **Cloud Provider:** Kaggle.
+ - **Training Duration:** [Insert Hours, e.g., 12 hours].
+ - **Carbon Emitted:** Estimated based on the total GPU hours and the carbon intensity of the hosting region.
+
+ ### Framework Versions
+
+ - **PEFT:** 0.18.0
+ - **Transformers:** 4.57.3
+ - **PyTorch:** 2.9.0
+ - **Datasets:** 4.0.0
config.json ADDED
@@ -0,0 +1,36 @@
+ {
+   "activation_dropout": 0.0,
+   "activation_function": "gelu",
+   "architectures": [
+     "MBartForConditionalGeneration"
+   ],
+   "attention_dropout": 0.0,
+   "bos_token_id": 0,
+   "classifier_dropout": 0.0,
+   "d_model": 1024,
+   "decoder_attention_heads": 16,
+   "decoder_ffn_dim": 4096,
+   "decoder_layerdrop": 0.0,
+   "decoder_layers": 12,
+   "decoder_start_token_id": 2,
+   "dropout": 0.1,
+   "dtype": "float16",
+   "encoder_attention_heads": 16,
+   "encoder_ffn_dim": 4096,
+   "encoder_layerdrop": 0.0,
+   "encoder_layers": 12,
+   "eos_token_id": 2,
+   "forced_eos_token_id": 2,
+   "gradient_checkpointing": false,
+   "init_std": 0.02,
+   "is_encoder_decoder": true,
+   "max_position_embeddings": 1024,
+   "model_type": "mbart",
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "scale_embedding": false,
+   "tokenizer_class": "BartphoTokenizer",
+   "transformers_version": "4.57.3",
+   "use_cache": true,
+   "vocab_size": 40030
+ }
dict.txt ADDED
The diff for this file is too large to render. See raw diff
 
generation_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 0,
+   "decoder_start_token_id": 2,
+   "eos_token_id": 2,
+   "forced_eos_token_id": 2,
+   "pad_token_id": 1,
+   "transformers_version": "4.57.3"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7427f302807027e522fcf1a26e2fd18f52afdb5502d451623cbd17890473084b
+ size 791770036
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+ size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,56 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "40029": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "sp_model_kwargs": {},
+   "tokenizer_class": "BartphoTokenizer",
+   "unk_token": "<unk>"
+ }