yammdd committed on
Commit 7cbedc1 · verified · 1 Parent(s): 18f418e

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -34,7 +34,7 @@ Unlike simple diacritic restoration models, this model aims to correct:
  3. **Teencode & Informal Variants:** Normalizing teencode, slang, and informal online writing into standard Vietnamese (e.g., "zui wa" -> "vui quá", "iu vk" -> "yêu vợ").
  3. **Basic Grammar/Contextual correction** based on syllable-level understanding.
 
- The model was trained on a dataset of approximately **50,000 sentences across the training, validation, and test splits**, which were **automatically labeled using a large language model from crawled Vietnamese social media comments**. Due to the nature of social media data, the dataset may contain noise or labeling imperfections; however, it is **not intended to include any offensive content or to target any individual or organization**.
+ The model was trained on a dataset of approximately **70,000 sentences across the training, validation, and test splits**, which were **automatically labeled using a large language model from crawled Vietnamese social media comments**. Due to the nature of social media data, the dataset may contain noise or labeling imperfections; however, it is **not intended to include any offensive content or to target any individual or organization**.
 
  - **Developed by:** Thanh-Dan Bui
  - **Model type:** Seq2Seq (Encoder-Decoder) with LoRA Adapter
@@ -90,7 +90,7 @@ print(out[0]["generated_text"])
  ### Training Data
  * **Source:** Aggregated Vietnamese text corpus.
  * **Task:** Vietnamese text correction (diacritic restoration and error correction).
- * **Size:** Approximately 50,000 sentence pairs (split into Train/Validation/Test sets).
+ * **Size:** Approximately 70,000 sentence pairs (split into Train/Validation/Test sets).
  * **Data Format:**
  * **Input:** Text with removed diacritics or synthetically induced spelling errors.
  * **Target:** Original, grammatically correct Vietnamese text.
@@ -123,7 +123,7 @@ print(out[0]["generated_text"])
 
  ### Testing Data, Factors & Metrics
 
- The model was evaluated on a held-out test set of **5,081 samples**, covering a diverse range of Vietnamese sentence structures and lengths.
+ The model was evaluated on a held-out test set of **7,056 samples**, covering a diverse range of Vietnamese sentence structures and lengths.
 
  #### Metrics
  * **BLEU Score:** Measures the n-gram overlap between the predicted and target text.