jmcinern committed
Commit 14b7e29 · verified · 1 Parent(s): 0e1593e

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -21,7 +21,7 @@ metrics:
 
 # Qomhrá: A Bilingual Irish & English LLM
 
-**Qomhrá**, **Q**wen (Base model) + c**omhrá** (Irish for "Conversation") is an 8-billion parameter bilingual Large Language Model (LLM) designed to support the low-resource language of Irish (*Gaeilge*). It is adapted from **Qwen3-8B** via a rigorous pipeline of Bilingual Continued Pre-Training (CPT) and Instruction Tuning.
+**Qomhrá**, **Q**wen (Base model) + c**omhrá** (Irish for "Conversation") is an 8-billion parameter bilingual Large Language Model (LLM) designed to support the low-resource language of Irish (*Gaeilge*). It is adapted from **Qwen3-8B** via a pipeline of Bilingual Continued Pre-Training (CPT) and Instruction Tuning.
 
 Developed by researchers at **Trinity College Dublin**, **University College Cork**, and **Queen's University Belfast**, Qomhrá aims to foster technological sovereignty for the Irish language community by providing an open-weight alternative to proprietary APIs.
 
@@ -39,7 +39,7 @@ Developed by researchers at **Trinity College Dublin**, **University College Cor
 The development of Qomhrá followed a two-stage pipeline:
 
 ### 1. Bilingual Continued Pre-Training (CPT)
-The model was adapted using a bilingual corpus of **3.265 billion characters**. Unlike previous approaches that suffered from catastrophic forgetting, we used a high mixture of English data (approx. 25%) to maintain reasoning capabilities.
+The model was adapted using a bilingual corpus of **3.265 billion characters**. Unlike previous approaches that suffered from catastrophic forgetting, we used a high mixture of English data (approx. 25%) to maintain English language capabilities.
 
 **Data Mixture:**
 * **Irish (~75%):**
@@ -57,7 +57,7 @@ The model was adapted using a bilingual corpus of **3.265 billion characters**.
 * **Optimizer:** AdamW ($lr=1e^{-4}$).
 
 ### 2. Instruction Tuning
-We curated a **30k sample** parallel English-Irish instruction dataset. This was created by translating the **Dolly V2** dataset using **Gemini-2.5-Pro**, which was selected after a rigorous human evaluation ranking it as the top performer for Irish text generation (outperforming GPT-5 and Claude-4-Sonnet).
+We curated a **30k sample** parallel English-Irish instruction dataset. This was created by translating the **Dolly V2** dataset using **Gemini-2.5-Pro**, which was selected after a human evaluation ranking it as the top performer for Irish text generation (outperforming GPT-5 and Claude-4-Sonnet).
 
 ## Evaluation Results
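For reference, the character budget implied by the data-mixture figures in the changed hunks can be sketched as follows. This is a minimal illustration only, assuming the ~75% Irish / ~25% English split applies uniformly to the stated 3.265-billion-character corpus; the README does not give per-source counts.

```python
# Rough character budget implied by the CPT data mixture described in the diff.
# Assumption: the ~75/25 Irish/English split applies to the full corpus.
TOTAL_CHARS = 3_265_000_000  # 3.265 billion characters (from the README)

def mixture_chars(total: int, irish_frac: float = 0.75) -> dict:
    """Split a character budget into Irish and English portions."""
    irish = round(total * irish_frac)
    return {"irish": irish, "english": total - irish}

split = mixture_chars(TOTAL_CHARS)
print(split)  # {'irish': 2448750000, 'english': 816250000}
```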