# Qomhrá: A Bilingual Irish & English LLM
**Qomhrá** (**Q**wen, the base model, + c**omhrá**, Irish for "conversation") is an 8-billion-parameter bilingual Large Language Model (LLM) designed to support Irish (*Gaeilge*), a low-resource language. It is adapted from **Qwen3-8B** via a pipeline of Bilingual Continued Pre-Training (CPT) and Instruction Tuning.
Developed by researchers at **Trinity College Dublin**, **University College Cork**, and **Queen's University Belfast**, Qomhrá aims to foster technological sovereignty for the Irish language community by providing an open-weight alternative to proprietary APIs.
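As an open-weight model, Qomhrá can be run locally. The sketch below shows a minimal inference path with Hugging Face `transformers`; note that the repository id `trinity-tcd/Qomhra-8B` is a placeholder assumption (it does not appear in this README), and the chat-template usage assumes the instruction-tuned checkpoint.

```python
# Hypothetical usage sketch: MODEL_ID is an assumed placeholder, not the
# published identifier -- substitute the real id from the model card.
MODEL_ID = "trinity-tcd/Qomhra-8B"  # hypothetical


def build_messages(instruction: str) -> list[dict]:
    """Wrap a single user instruction in the chat-message format
    expected by instruction-tuned causal LMs."""
    return [{"role": "user", "content": instruction}]


def generate_reply(instruction: str, max_new_tokens: int = 256) -> str:
    """Load the model lazily and generate one reply.

    An 8B model in bf16 needs roughly 16 GB of accelerator memory.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    # Render the chat messages into the model's prompt format.
    prompt = tokenizer.apply_chat_template(
        build_messages(instruction), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens.
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )


# Example call (requires the weights to be available):
# reply = generate_reply("Inis dom faoi stair na Gaeilge.")
```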
The development of Qomhrá followed a two-stage pipeline:
### 1. Bilingual Continued Pre-Training (CPT)
The model was adapted using a bilingual corpus of **3.265 billion characters**. Unlike previous approaches, which suffered from catastrophic forgetting, we kept a substantial share of English data (approx. 25%) in the mixture to maintain English-language capabilities.
**Data Mixture:**
* **Irish (~75%):**
* **Optimizer:** AdamW ($lr = 1 \times 10^{-4}$).
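Putting rough numbers on the mixture above (a back-of-the-envelope illustration, not the authors' tooling): with a 3.265-billion-character corpus split roughly 75/25 between Irish and English, the per-language character budgets work out as follows.

```python
# Character budgets implied by the CPT mixture described above.
TOTAL_CHARS = 3_265_000_000                   # 3.265 billion characters
MIXTURE = {"irish": 0.75, "english": 0.25}    # approximate weights

budget = {lang: int(TOTAL_CHARS * weight) for lang, weight in MIXTURE.items()}
# -> roughly 2.45B Irish characters and 0.82B English characters
```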
### 2. Instruction Tuning
We curated a **30k-sample** parallel English-Irish instruction dataset, created by translating the **Dolly V2** dataset with **Gemini-2.5-Pro**. Gemini-2.5-Pro was selected after a human evaluation ranked it the top performer for Irish text generation, ahead of GPT-5 and Claude-4-Sonnet.
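The translation step can be sketched as follows. This is an illustrative outline rather than the authors' pipeline: `translate_to_irish` is a stand-in for the Gemini-2.5-Pro API call, and the field names mirror Dolly V2's instruction/response schema.

```python
# Illustrative sketch of building one parallel English-Irish instruction
# pair from a Dolly V2-style record. `translate_to_irish` is a stand-in
# for the actual Gemini-2.5-Pro translation call.
from typing import Callable


def make_parallel_pair(
    record: dict, translate_to_irish: Callable[[str], str]
) -> dict:
    """Return the English record alongside its Irish translation."""
    english = {
        "instruction": record["instruction"],
        "response": record["response"],
    }
    irish = {field: translate_to_irish(text) for field, text in english.items()}
    return {"en": english, "ga": irish}


# Demo with a stub translator (a real pipeline would call the LLM API):
pair = make_parallel_pair(
    {"instruction": "Name three colours.", "response": "Red, green, blue."},
    lambda text: f"[GA] {text}",
)
```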
## Evaluation Results