Yam committed on
Commit
1dd72ef
1 Parent(s): 6389ae0

upload model

README.md CHANGED
@@ -1,3 +1,173 @@
- ---
- license: mit
- ---
+ ---
+ base_model: vinai/bartpho-syllable
+ library_name: peft
+ tags:
+ - base_model:adapter:vinai/bartpho-syllable
+ - lora
+ - transformers
+ - seq2seq
+ - vietnamese
+ - error-correction
+ - spell-checking
+ - text-generation
+ license: mit
+ language:
+ - vi
+ metrics:
+ - bleu
+ - wer
+ - cer
+ - accuracy
+ ---
+
+ # BartPho-Syllable - Vietnamese Error Correction (LoRA)
+
+ ## Model Details
+
+ ### Model Description
+
+ This model is a fine-tuned version of **[vinai/bartpho-syllable](https://huggingface.co/vinai/bartpho-syllable)** using **LoRA (Low-Rank Adaptation)**. It is specifically designed for **Vietnamese Error Correction (VEC)** tasks.
+
+ Unlike simple diacritic restoration models, this model aims to correct:
+ 1. **Missing Diacritics:** (e.g., "trang phuc" -> "trang phục")
+ 2. **Spelling Errors:** (e.g., "bài toan" -> "bài toán")
+ 3. **Teencode & Informal Variants:** Normalizing teencode, slang, and informal online writing into standard Vietnamese (e.g., "zui wa" -> "vui quá", "iu vk" -> "yêu vợ").
+ 4. **Basic Grammar/Contextual Errors:** Corrections based on syllable-level understanding.
+
+ The model was trained on a dataset of approximately **50,000 sentences across the training, validation, and test splits**, collected from **crawled Vietnamese social media comments** and **automatically labeled using a large language model**. Due to the nature of social media data, the dataset may contain noise or labeling imperfections; however, it is **not intended to include any offensive content or to target any individual or organization**.
+
+ - **Developed by:** Thanh-Dan Bui
+ - **Model type:** Seq2Seq (Encoder-Decoder) with LoRA Adapter
+ - **Language(s):** Vietnamese
+ - **License:** MIT
+ - **Finetuned from model:** `vinai/bartpho-syllable`
+
+ ## Uses
+
+ ### Direct Use
+
+ The model is designed for Vietnamese text error correction. It takes noisy Vietnamese text as input, including missing diacritics, spelling mistakes, and informal or teencode expressions, and produces grammatically correct and orthographically normalized Vietnamese text as output.
+
+ **Example:**
+ * **Input:** "t đang xu ly 1 bai toán la sưa lỗi cho tieng viet"
+ * **Output:** "tôi đang xử lý 1 bài toán là sửa lỗi cho tiếng Việt"
+
+ ### Out-of-Scope Use
+
+ * Translation from other languages to Vietnamese.
+ * Generating text from scratch (open-ended generation).
+ * Correcting highly specialized technical jargon not present in general Vietnamese corpora.
+
+ ## Bias, Risks, and Limitations
+
+ * **Context Length:** The model is optimized for sentence-level correction (max length ~256 tokens). Very long paragraphs should be split before processing (see the sketch after this list).
+ * **Ambiguity:** When a noisy or abbreviated form can correspond to multiple valid standard forms (for example, variants of "không" such as "k", "ko", "hong", or "hông"), the model relies on context to infer the most likely correction and may occasionally predict the wrong one.
+ * **Proper Nouns:** The model might attempt to "correct" foreign names or uncommon proper nouns if they resemble Vietnamese words.
+
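+ A minimal sketch for pre-splitting long inputs before correction; `correct_long_text` is a hypothetical helper, `pipe` is the pipeline from the quick-start section below, and the naive punctuation-based split is only a starting point for informal text:
+
+ ```python
+ import re
+
+ def correct_long_text(pipe, text, max_new_tokens=256):
+     # Naive split on sentence-final punctuation; informal comments may need a smarter splitter.
+     sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
+     corrected = [pipe(s, max_new_tokens=max_new_tokens)[0]["generated_text"] for s in sentences]
+     return " ".join(corrected)
+ ```
+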
+ ## How to Get Started with the Model
+
+ You can use this model with the `transformers` library.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
+
+ path = "yammdd/vietnamese-error-correction"
+
+ tokenizer = AutoTokenizer.from_pretrained(path)
+ model = AutoModelForSeq2SeqLM.from_pretrained(path)
+
+ # Wrap model and tokenizer in a text2text-generation pipeline for correction.
+ pipe = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
+
+ text = "hum ni a bùn wá bé iu ưi"
+ out = pipe(text, max_new_tokens=256)
+
+ print(out[0]["generated_text"])
+ # Output: hôm nay anh buồn quá bé yêu ơi
+ ```
+
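+ Since the card describes a LoRA adapter (`library_name: peft`), loading from an adapter-only checkpoint is also possible; a sketch, assuming the adapter files (`adapter_config.json` and adapter weights) are present at the repo id:
+
+ ```python
+ from peft import AutoPeftModelForSeq2SeqLM
+
+ # Loads the base model declared in the adapter config and applies the LoRA weights.
+ model = AutoPeftModelForSeq2SeqLM.from_pretrained("yammdd/vietnamese-error-correction")
+ model = model.merge_and_unload()  # optionally fold the adapter into the base weights
+ ```
+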
+ ## Training Details
+
+ ### Training Data
+ * **Source:** Aggregated Vietnamese text corpus.
+ * **Task:** Vietnamese text correction (diacritic restoration and error correction).
+ * **Size:** Approximately 50,000 sentence pairs (split into Train/Validation/Test sets).
+ * **Data Format:**
+   * **Input:** Text with removed diacritics or synthetically induced spelling errors.
+   * **Target:** Original, grammatically correct Vietnamese text.
+ * **Sequence Length:** Maximum input and output length of 256 tokens.
+
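+ For illustration, a minimal sketch of how the synthetic "removed diacritics" inputs can be derived from clean targets (the LLM-labeled social media pairs are produced differently; this covers only the synthetic portion of the format above):
+
+ ```python
+ import unicodedata
+
+ def strip_diacritics(text: str) -> str:
+     # Decompose characters, then drop combining marks ("tiếng Việt" -> "tieng Viet").
+     decomposed = unicodedata.normalize("NFD", text)
+     stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
+     # "đ"/"Đ" are standalone letters, not combining marks, so map them explicitly.
+     return stripped.replace("đ", "d").replace("Đ", "D")
+
+ target = "tôi đang xử lý một bài toán sửa lỗi"
+ source = strip_diacritics(target)  # "toi dang xu ly mot bai toan sua loi"
+ ```
+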
+ ### Training Procedure
+ * **Base Model:** `vinai/bartpho-syllable`
+ * **Technique:** Parameter-Efficient Fine-Tuning (PEFT) using **LoRA** (Low-Rank Adaptation).
+ * **LoRA Configuration:**
+   * **Target Modules:** `q_proj`, `v_proj`, `out_proj`, `fc1`, `fc2` (covering both attention and feed-forward layers).
+   * **Rank (r):** 32
+   * **Alpha:** 64
+   * **Dropout:** 0.1
+ * **Precision:** FP16 (Mixed Precision) for optimized memory usage and speed.
+
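+ The stated configuration maps directly onto peft's `LoraConfig`; a minimal sketch (training loop omitted):
+
+ ```python
+ from peft import LoraConfig, TaskType, get_peft_model
+ from transformers import AutoModelForSeq2SeqLM
+
+ base = AutoModelForSeq2SeqLM.from_pretrained("vinai/bartpho-syllable")
+
+ # LoRA settings from the card: r=32, alpha=64, dropout=0.1,
+ # applied to both attention and feed-forward projections.
+ lora_config = LoraConfig(
+     task_type=TaskType.SEQ_2_SEQ_LM,
+     r=32,
+     lora_alpha=64,
+     lora_dropout=0.1,
+     target_modules=["q_proj", "v_proj", "out_proj", "fc1", "fc2"],
+ )
+ model = get_peft_model(base, lora_config)
+ model.print_trainable_parameters()  # only the adapter weights are trainable
+ ```
+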
+ #### Training Hyperparameters
+ * **Optimizer:** AdamW with weight decay of 0.01.
+ * **Batch Size:** 16 per device (total effective batch size depends on GPU count, typically 32 on 2x T4).
+ * **Learning Rate:** 5e-4.
+ * **Training Epochs:** 5.
+ * **Evaluation Strategy:** Every 2,000 steps.
+ * **Loss Masking:** Padding positions in the labels are excluded from the loss via `DataCollatorForSeq2Seq` with `label_pad_token_id=-100`.
+
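+ A sketch of how these hyperparameters translate to `Seq2SeqTrainingArguments` and the collator (names such as `output_dir` are placeholders, not values from the card):
+
+ ```python
+ from transformers import (AutoTokenizer, DataCollatorForSeq2Seq,
+                           Seq2SeqTrainingArguments)
+
+ tokenizer = AutoTokenizer.from_pretrained("vinai/bartpho-syllable")
+
+ args = Seq2SeqTrainingArguments(
+     output_dir="bartpho-vec-lora",      # hypothetical output path
+     per_device_train_batch_size=16,
+     learning_rate=5e-4,
+     num_train_epochs=5,
+     weight_decay=0.01,
+     eval_strategy="steps",
+     eval_steps=2000,
+     fp16=True,
+     predict_with_generate=True,         # generate during eval for BLEU/WER/CER
+ )
+
+ # Label padding positions are set to -100 so the loss ignores them.
+ collator = DataCollatorForSeq2Seq(tokenizer, label_pad_token_id=-100)
+ ```
+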
+ #### Speeds, Sizes, Times
+ * **Hardware:** 2x NVIDIA T4 GPUs (Kaggle environment).
+ * **Checkpoint Size:** The adapter weights are lightweight (only several megabytes), significantly smaller than the full BARTpho base model.
+ * **Training Dynamics:** Managed via the Hugging Face `Seq2SeqTrainer` with `predict_with_generate` enabled for validation metrics.
+
+ ## Evaluation
+
+ ### Testing Data, Factors & Metrics
+
+ The model was evaluated on a held-out test set of **5,081 samples**, covering a diverse range of Vietnamese sentence structures and lengths.
+
+ #### Metrics
+ * **BLEU Score:** Measures the n-gram overlap between the predicted and target text.
+ * **Word Error Rate (WER):** Measures the ratio of errors (substitutions, deletions, insertions) at the word level.
+ * **Character Error Rate (CER):** Measures accuracy at the character level, which is critical for verifying diacritic placement.
+ * **Exact Match Accuracy:** The percentage of sentences where every single character matches the ground truth.
+ * **Word Accuracy:** The percentage of individual words correctly predicted (excluding length mismatches).
+
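+ The exact evaluation script is not part of this repo; a minimal sketch of how these metrics can be computed with the `evaluate` library (assuming `sacrebleu` and `jiwer` are installed), not the author's original code:
+
+ ```python
+ import evaluate  # pip install evaluate sacrebleu jiwer
+
+ bleu = evaluate.load("sacrebleu")
+ wer = evaluate.load("wer")
+ cer = evaluate.load("cer")
+
+ # Toy prediction/reference pair for illustration.
+ preds = ["tôi đang xử lý một bài toán"]
+ refs = ["tôi đang xử lý một bài toán"]
+
+ print(bleu.compute(predictions=preds, references=[[r] for r in refs])["score"])
+ print(wer.compute(predictions=preds, references=refs))
+ print(cer.compute(predictions=preds, references=refs))
+ ```
+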
+ ### Results
+
+ #### 1. Overall Performance
+ | Metric | Score | Note |
+ | :--- | :--- | :--- |
+ | **BLEU** | **86.92** | High linguistic and semantic fidelity |
+ | **Word Accuracy** | **93.65%** | Robust word-level correction |
+ | **Exact Match** | **52.23%** | Entire sentence perfectly restored |
+ | **WER** | **0.0864** | ~8.6% error rate per word |
+ | **CER** | **0.0366** | ~3.7% error rate per character |
+
+ *Note: The Exact Match score reflects the inherent ambiguity in the Vietnamese language (e.g., "muon" could be "muốn", "mượn", or "muộn"), where multiple correct interpretations may exist without broader paragraph context.*
+
+ #### 2. Accuracy by Sentence Length
+ The model's performance varies based on the complexity and length of the input:
+
+ | Category | Length (words) | Accuracy | Sample Count |
+ | :--- | :--- | :--- | :--- |
+ | **Short** | < 10 | **61.40%** | 2,347 |
+ | **Medium** | 10 - 30 | **47.47%** | 2,408 |
+ | **Long** | > 30 | **21.47%** | 326 |
+
+ *Analysis: Exact-match accuracy is highest on short sentences and declines on longer sequences (>30 words), likely due to the increased probability of cumulative errors and the 256-token limit.*
+
+ ---
+
+ ## Environmental Impact
+
+ - **Hardware Type:** 2 x NVIDIA Tesla T4 GPUs.
+ - **Cloud Provider:** Kaggle.
+ - **Training Duration:** [Insert Hours, e.g., 12 hours].
+ - **Carbon Emitted:** Estimated based on the total GPU hours and the carbon intensity of the hosting region.
+
+ ### Framework Versions
+
+ - **PEFT:** 0.18.0
+ - **Transformers:** 4.57.3
+ - **PyTorch:** 2.9.0
+ - **Datasets:** 4.0.0
config.json ADDED
@@ -0,0 +1,36 @@
+ {
+   "activation_dropout": 0.0,
+   "activation_function": "gelu",
+   "architectures": [
+     "MBartForConditionalGeneration"
+   ],
+   "attention_dropout": 0.0,
+   "bos_token_id": 0,
+   "classifier_dropout": 0.0,
+   "d_model": 1024,
+   "decoder_attention_heads": 16,
+   "decoder_ffn_dim": 4096,
+   "decoder_layerdrop": 0.0,
+   "decoder_layers": 12,
+   "decoder_start_token_id": 2,
+   "dropout": 0.1,
+   "dtype": "float16",
+   "encoder_attention_heads": 16,
+   "encoder_ffn_dim": 4096,
+   "encoder_layerdrop": 0.0,
+   "encoder_layers": 12,
+   "eos_token_id": 2,
+   "forced_eos_token_id": 2,
+   "gradient_checkpointing": false,
+   "init_std": 0.02,
+   "is_encoder_decoder": true,
+   "max_position_embeddings": 1024,
+   "model_type": "mbart",
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "scale_embedding": false,
+   "tokenizer_class": "BartphoTokenizer",
+   "transformers_version": "4.57.3",
+   "use_cache": true,
+   "vocab_size": 40030
+ }
dict.txt ADDED
The diff for this file is too large to render. See raw diff
 
generation_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 0,
+   "decoder_start_token_id": 2,
+   "eos_token_id": 2,
+   "forced_eos_token_id": 2,
+   "pad_token_id": 1,
+   "transformers_version": "4.57.3"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7427f302807027e522fcf1a26e2fd18f52afdb5502d451623cbd17890473084b
+ size 791770036
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+ size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,56 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "40029": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "sp_model_kwargs": {},
+   "tokenizer_class": "BartphoTokenizer",
+   "unk_token": "<unk>"
+ }