Update README.md
README.md
CHANGED
|
@@ -34,7 +34,7 @@ Unlike simple diacritic restoration models, this model aims to correct:
|
|
| 34 |
3. **Teencode & Informal Variants:** Normalizing teencode, slang, and informal online writing into standard Vietnamese (e.g., "zui wa" -> "vui quá", "iu vk" -> "yêu vợ").
|
| 35 |
3. **Basic Grammar/Contextual correction** based on syllable-level understanding.
|
| 36 |
|
| 37 |
-
The model was trained on a dataset of approximately **
|
| 38 |
|
| 39 |
- **Developed by:** Thanh-Dan Bui
|
| 40 |
- **Model type:** Seq2Seq (Encoder-Decoder) with LoRA Adapter
|
|
@@ -90,7 +90,7 @@ print(out[0]["generated_text"])
|
|
| 90 |
### Training Data
|
| 91 |
* **Source:** Aggregated Vietnamese text corpus.
|
| 92 |
* **Task:** Vietnamese text correction (diacritic restoration and error correction).
|
| 93 |
-
* **Size:** Approximately
|
| 94 |
* **Data Format:**
|
| 95 |
* **Input:** Text with removed diacritics or synthetically induced spelling errors.
|
| 96 |
* **Target:** Original, grammatically correct Vietnamese text.
|
|
@@ -123,7 +123,7 @@ print(out[0]["generated_text"])
|
|
| 123 |
|
| 124 |
### Testing Data, Factors & Metrics
|
| 125 |
|
| 126 |
-
The model was evaluated on a held-out test set of **
|
| 127 |
|
| 128 |
#### Metrics
|
| 129 |
* **BLEU Score:** Measures the n-gram overlap between the predicted and target text.
|
|
|
|
| 34 |
3. **Teencode & Informal Variants:** Normalizing teencode, slang, and informal online writing into standard Vietnamese (e.g., "zui wa" -> "vui quá", "iu vk" -> "yêu vợ").
|
| 35 |
3. **Basic Grammar/Contextual correction** based on syllable-level understanding.
|
| 36 |
|
| 37 |
+
The model was trained on a dataset of approximately **70,000 sentences in total across the training, validation, and test splits**. The sentences were **crawled from Vietnamese social media comments and labeled automatically using a large language model**. Because the source is social media, the dataset may contain noise or labeling errors; however, it is **not intended to include any offensive content or to target any individual or organization**.
|
| 38 |
|
| 39 |
- **Developed by:** Thanh-Dan Bui
|
| 40 |
- **Model type:** Seq2Seq (Encoder-Decoder) with LoRA Adapter
|
|
|
|
| 90 |
### Training Data
|
| 91 |
* **Source:** Aggregated Vietnamese text corpus.
|
| 92 |
* **Task:** Vietnamese text correction (diacritic restoration and error correction).
|
| 93 |
+
* **Size:** Approximately 70,000 sentence pairs (split into Train/Validation/Test sets).
|
| 94 |
* **Data Format:**
|
| 95 |
* **Input:** Text with removed diacritics or synthetically induced spelling errors.
|
| 96 |
* **Target:** Original, grammatically correct Vietnamese text.
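
The "removed diacritics" input described in the data format above could be produced roughly as follows. This is a minimal sketch, not the author's actual preprocessing: the function names `strip_diacritics` and `make_pair` are hypothetical, and the README does not say how the inputs were generated.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove Vietnamese tone and vowel marks, leaving plain base letters."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark (tone marks, circumflex, breve, horn).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # "đ"/"Đ" have no Unicode decomposition, so they are mapped by hand.
    return stripped.replace("đ", "d").replace("Đ", "D")


def make_pair(target: str) -> tuple[str, str]:
    """Build one hypothetical (input, target) training pair from a clean sentence."""
    return strip_diacritics(target), target


if __name__ == "__main__":
    print(make_pair("yêu vợ"))
```

Synthetic spelling errors and teencode variants would need additional, separate corruption steps that are not shown here.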
|
|
|
|
| 123 |
|
| 124 |
### Testing Data, Factors & Metrics
|
| 125 |
|
| 126 |
+
The model was evaluated on a held-out test set of **7,056 samples**, covering a diverse range of Vietnamese sentence structures and lengths.
|
| 127 |
|
| 128 |
#### Metrics
|
| 129 |
* **BLEU Score:** Measures the n-gram overlap between the predicted and target text.
|
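
To illustrate the n-gram overlap that BLEU is built on, here is a minimal sketch of clipped n-gram precision for a single prediction and target. It is only one component of BLEU: a full score combines several n-gram orders and applies a brevity penalty, and the README does not state which BLEU implementation was used. The helper names and whitespace tokenization are assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(prediction: str, target: str, n: int) -> float:
    """Fraction of predicted n-grams that also occur in the target.

    Each predicted n-gram is counted at most as often as it appears in the
    target (the "clipping" used by BLEU). Tokens are split on whitespace.
    """
    pred_counts = Counter(ngrams(prediction.split(), n))
    ref_counts = Counter(ngrams(target.split(), n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return overlap / total


if __name__ == "__main__":
    # One of two syllables is still missing its diacritic.
    print(clipped_precision("vui qua", "vui quá", 1))
```

Because Vietnamese is written syllable by syllable with spaces in between, whitespace tokenization here scores overlap at the syllable level.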