Commit 609edf2 by welyjesch · verified · 1 Parent(s): 4ec96ef

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -53,7 +53,7 @@ Interested parties may reach out via the Hugging Face discussion board or review
 
  </details>
 
- # PROGRESS REPORT: Phase 1 Foundation Model Alignment
+ ## Progress Report for Phase 1
 
  <details>
  <summary><b>Summary:</b> Phase 1 is underway, but achieving a high-fidelity "Teacher" model for Philippine languages using Llama 3.1 and machine-translated Alpaca data is currently bottlenecked. Llama 3.1's inherent English-centric bias combined with syntactically flawed, machine-translated training data creates a compounding error loop. This results in grammatical corruption, dialect mixing, and severe hallucinations rather than true Neural Machine Translation (NMT) parity. There is still a long way to go to build a reliable teacher model; we must pivot away from machine-translated shortcuts and invest in human-curated, native-first datasets before progressing to knowledge distillation.</summary>
@@ -114,7 +114,7 @@ Building high-performance NLP architectures for Philippine languages cannot rely
 
  </details>
 
- # SOLUTION DOCUMENT: Crowdsourced Authentic Dataset Generation Strategy
+ ## Current Status: Crowdsourced Authentic Dataset Generation Strategy
 
  <details>
  <summary><b>Summary:</b> In response to the hallucination loop caused by machine-translated training data, stakeholders have pivoted towards authentic, native-first dataset curation. To facilitate this, we have developed the PLTAT App—an all-in-one "Swiss Army knife" platform for crowdsourcing the translation, generation, evaluation, and correction of NLP datasets. Because building a high-fidelity teacher model is a long-term, iterative process, we are actively seeking institutional stakeholders (universities, government agencies) to sustain this effort. Technical resources, including the PLTAT Chat App and our Ollama Colab Server Notebook, are now live for community testing.</summary>
@@ -124,7 +124,7 @@ Building high-performance NLP architectures for Philippine languages cannot rely
 
  **Organization:** Philippine Languages Translation and AI Training Community (PLTAT)
  **Project Phase:** Phase 1.5 - Authentic Data Remediation & HITL Integration
- **Date:** [Current Date]
+ **Date:** April 6, 2026
 
  ---
 