yammdd/vietnamese-error-correction

@@ -34,7 +34,7 @@ Unlike simple diacritic restoration models, this model aims to correct:
 3.  **Teencode & Informal Variants:** Normalizing teencode, slang, and informal online writing into standard Vietnamese (e.g., "zui wa" -> "vui quá", "iu vk" -> "yêu vợ").
 3.  **Basic Grammar/Contextual correction** based on syllable-level understanding.
-The model was trained on a dataset of approximately **50,000 sentences across the training, validation, and test splits**, which were **automatically labeled using a large language model from crawled Vietnamese social media comments**. Due to the nature of social media data, the dataset may contain noise or labeling imperfections; however, it is **not intended to include any offensive content or to target any individual or organization**.
 - **Developed by:** Thanh-Dan Bui
 - **Model type:** Seq2Seq (Encoder-Decoder) with LoRA Adapter
@@ -90,7 +90,7 @@ print(out[0]["generated_text"])
 ### Training Data
 *   **Source:** Aggregated Vietnamese text corpus.
 *   **Task:** Vietnamese text correction (diacritic restoration and error correction).
-*   **Size:** Approximately 50,000 sentence pairs (split into Train/Validation/Test sets).
 *   **Data Format:**
     *   **Input:** Text with removed diacritics or synthetically induced spelling errors.
     *   **Target:** Original, grammatically correct Vietnamese text.
@@ -123,7 +123,7 @@ print(out[0]["generated_text"])
 ### Testing Data, Factors & Metrics
-The model was evaluated on a held-out test set of **5,081 samples**, covering a diverse range of Vietnamese sentence structures and lengths.
 #### Metrics
 *   **BLEU Score:** Measures the n-gram overlap between the predicted and target text.

 3.  **Teencode & Informal Variants:** Normalizing teencode, slang, and informal online writing into standard Vietnamese (e.g., "zui wa" -> "vui quá", "iu vk" -> "yêu vợ").
 3.  **Basic Grammar/Contextual correction** based on syllable-level understanding.
+The model was trained on a dataset of approximately **70,000 sentences across the training, validation, and test splits**, which were **automatically labeled using a large language model from crawled Vietnamese social media comments**. Due to the nature of social media data, the dataset may contain noise or labeling imperfections; however, it is **not intended to include any offensive content or to target any individual or organization**.
 - **Developed by:** Thanh-Dan Bui
 - **Model type:** Seq2Seq (Encoder-Decoder) with LoRA Adapter
 ### Training Data
 *   **Source:** Aggregated Vietnamese text corpus.
 *   **Task:** Vietnamese text correction (diacritic restoration and error correction).
+*   **Size:** Approximately 70,000 sentence pairs (split into Train/Validation/Test sets).
 *   **Data Format:**
     *   **Input:** Text with removed diacritics or synthetically induced spelling errors.
     *   **Target:** Original, grammatically correct Vietnamese text.
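The input/target format above can be illustrated with a minimal sketch of the diacritic-removal case. This is an assumption about how such a pair could be built, not the author's actual preprocessing; `strip_diacritics` and `make_pair` are hypothetical helper names, and the spelling-error and teencode cases are not covered.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese tone and vowel marks from text."""
    # "đ"/"Đ" have no combining-mark decomposition, so map them by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category Mn), then recompose.
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def make_pair(sentence: str) -> dict:
    """Build one hypothetical training pair: degraded input, original target."""
    return {"input": strip_diacritics(sentence), "target": sentence}


print(make_pair("yêu vợ"))
```

The model is then expected to map the `input` string back to the `target` string.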
 ### Testing Data, Factors & Metrics
+The model was evaluated on a held-out test set of **7,056 samples**, covering a diverse range of Vietnamese sentence structures and lengths.
 #### Metrics
 *   **BLEU Score:** Measures the n-gram overlap between the predicted and target text.
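
To make "n-gram overlap" concrete, here is a minimal sketch of the clipped n-gram precision that BLEU is built on. It is not the evaluation code used for this model: full BLEU also combines several n-gram orders with a geometric mean and applies a brevity penalty, both omitted here. Whitespace tokenization and the function names are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens: list, n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(prediction: str, target: str, n: int) -> float:
    """Fraction of predicted n-grams that also occur in the target.

    Each predicted n-gram is credited at most as many times as it
    appears in the target (the "clipping" step of BLEU).
    """
    pred = ngrams(prediction.split(), n)
    ref = ngrams(target.split(), n)
    total = sum(pred.values())
    if total == 0:
        # Prediction is shorter than n tokens, so there is nothing to match.
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in pred.items())
    return overlap / total


print(clipped_precision("vui qua", "vui quá", 1))
```

A prediction that leaves one of two syllables without its diacritic matches only half of the target's unigrams and none of its bigrams, which is why missed diacritics lower the score quickly on short sentences.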