welyjesch committed on
Commit
4ec96ef
·
verified ·
1 Parent(s): cdabf17

Update README.md

Files changed (1)
  1. README.md +6 -5
README.md CHANGED
@@ -53,13 +53,13 @@ Interested parties may reach out via the Hugging Face discussion board or review
 
  </details>
 
  <details>
- <summary><b>TL;DR Summary:</b> Phase 1 is underway, but achieving a high-fidelity "Teacher" model for Philippine languages using Llama 3.1 and machine-translated Alpaca data is currently bottlenecked. Llama 3.1's inherent English-centric bias combined with syntactically flawed, machine-translated training data creates a compounding error loop. This results in grammatical corruption, dialect mixing, and severe hallucinations rather than true Neural Machine Translation (NMT) parity. There is still a long way to go to build a reliable teacher model; we must pivot away from machine-translated shortcuts and invest in human-curated, native-first datasets before progressing to knowledge distillation.</summary>
 
  <br>
 
- # PROGRESS REPORT: Phase 1 Foundation Model Alignment
-
  **Organization:** Philippine Languages Translation and AI Training Community
  **Project Phase:** Phase 1 - Foundation Model Alignment and NMT Parity
  **Target Languages:** Tagalog, Cebuano, Ilocano, Hiligaynon, Bicolano, Waray, Kapampangan, and Pangasinan
@@ -114,12 +114,13 @@ Building high-performance NLP architectures for Philippine languages cannot rely
 
  </details>
 
  <details>
- <summary><b>TL;DR Summary:</b> In response to the hallucination loop caused by machine-translated training data, stakeholders have pivoted towards authentic, native-first dataset curation. To facilitate this, we have developed the PLTAT App—an all-in-one "Swiss Army knife" platform for crowdsourcing the translation, generation, evaluation, and correction of NLP datasets. Because building a high-fidelity teacher model is a long-term, iterative process, we are actively seeking institutional stakeholders (universities, government agencies) to sustain this effort. Technical resources, including the PLTAT Chat App and our Ollama Colab Server Notebook, are now live for community testing.</summary>
 
  <br>
 
- # SOLUTION DOCUMENT: Crowdsourced Authentic Dataset Generation Strategy
 
  **Organization:** Philippine Languages Translation and AI Training Community (PLTAT)
  **Project Phase:** Phase 1.5 - Authentic Data Remediation & HITL Integration
 
 
  </details>
 
+ # PROGRESS REPORT: Phase 1 Foundation Model Alignment
+
  <details>
+ <summary><b>Summary:</b> Phase 1 is underway, but achieving a high-fidelity "Teacher" model for Philippine languages using Llama 3.1 and machine-translated Alpaca data is currently bottlenecked. Llama 3.1's inherent English-centric bias combined with syntactically flawed, machine-translated training data creates a compounding error loop. This results in grammatical corruption, dialect mixing, and severe hallucinations rather than true Neural Machine Translation (NMT) parity. There is still a long way to go to build a reliable teacher model; we must pivot away from machine-translated shortcuts and invest in human-curated, native-first datasets before progressing to knowledge distillation.</summary>
 
  <br>
 
  **Organization:** Philippine Languages Translation and AI Training Community
  **Project Phase:** Phase 1 - Foundation Model Alignment and NMT Parity
  **Target Languages:** Tagalog, Cebuano, Ilocano, Hiligaynon, Bicolano, Waray, Kapampangan, and Pangasinan
 
 
  </details>
 
+ # SOLUTION DOCUMENT: Crowdsourced Authentic Dataset Generation Strategy
+
  <details>
+ <summary><b>Summary:</b> In response to the hallucination loop caused by machine-translated training data, stakeholders have pivoted towards authentic, native-first dataset curation. To facilitate this, we have developed the PLTAT App—an all-in-one "Swiss Army knife" platform for crowdsourcing the translation, generation, evaluation, and correction of NLP datasets. Because building a high-fidelity teacher model is a long-term, iterative process, we are actively seeking institutional stakeholders (universities, government agencies) to sustain this effort. Technical resources, including the PLTAT Chat App and our Ollama Colab Server Notebook, are now live for community testing.</summary>
 
  <br>
 
 
  **Organization:** Philippine Languages Translation and AI Training Community (PLTAT)
  **Project Phase:** Phase 1.5 - Authentic Data Remediation & HITL Integration