Update README.md
README.md
CHANGED
@@ -53,13 +53,13 @@ Interested parties may reach out via the Hugging Face discussion board or review
 
 </details>
 
+# PROGRESS REPORT: Phase 1 Foundation Model Alignment
+
 <details>
-<summary><b>
+<summary><b>Summary:</b> Phase 1 is underway, but achieving a high-fidelity "Teacher" model for Philippine languages using Llama 3.1 and machine-translated Alpaca data is currently bottlenecked. Llama 3.1's inherent English-centric bias combined with syntactically flawed, machine-translated training data creates a compounding error loop. This results in grammatical corruption, dialect mixing, and severe hallucinations rather than true Neural Machine Translation (NMT) parity. There is still a long way to go to build a reliable teacher model; we must pivot away from machine-translated shortcuts and invest in human-curated, native-first datasets before progressing to knowledge distillation.</summary>
 
 <br>
 
-# PROGRESS REPORT: Phase 1 Foundation Model Alignment
-
 **Organization:** Philippine Languages Translation and AI Training Community
 **Project Phase:** Phase 1 - Foundation Model Alignment and NMT Parity
 **Target Languages:** Tagalog, Cebuano, Ilocano, Hiligaynon, Bicolano, Waray, Kapampangan, and Pangasinan
@@ -114,12 +114,13 @@ Building high-performance NLP architectures for Philippine languages cannot rely
 
 </details>
 
+# SOLUTION DOCUMENT: Crowdsourced Authentic Dataset Generation Strategy
+
 <details>
-<summary><b>
+<summary><b>Summary:</b> In response to the hallucination loop caused by machine-translated training data, stakeholders have pivoted towards authentic, native-first dataset curation. To facilitate this, we have developed the PLTAT App—an all-in-one "Swiss Army knife" platform for crowdsourcing the translation, generation, evaluation, and correction of NLP datasets. Because building a high-fidelity teacher model is a long-term, iterative process, we are actively seeking institutional stakeholders (universities, government agencies) to sustain this effort. Technical resources, including the PLTAT Chat App and our Ollama Colab Server Notebook, are now live for community testing.</summary>
 
 <br>
 
-# SOLUTION DOCUMENT: Crowdsourced Authentic Dataset Generation Strategy
 
 **Organization:** Philippine Languages Translation and AI Training Community (PLTAT)
 **Project Phase:** Phase 1.5 - Authentic Data Remediation & HITL Integration