# Qomhrá: A Bilingual Irish & English LLM
**Qomhrá** (**Q**wen, the base model, + c**omhrá**, Irish for "conversation") is an 8-billion-parameter bilingual Large Language Model (LLM) designed to support Irish (*Gaeilge*), a low-resource language. It is adapted from **Qwen3-8B** via a pipeline of Bilingual Continued Pre-Training (CPT) and Instruction Tuning.
Developed by researchers at **Trinity College Dublin**, **University College Cork**, and **Queen's University Belfast**, Qomhrá aims to foster technological sovereignty for the Irish language community by providing an open-weight alternative to proprietary APIs.
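As an open-weight model, Qomhrá can be run locally. The sketch below shows a minimal inference path with Hugging Face `transformers`; note that the repository id `trinity-tcd/Qomhra-8B` is a placeholder assumption (it does not appear in this README), and the chat-template usage assumes the instruction-tuned checkpoint.

```python
# Hypothetical usage sketch: MODEL_ID is an assumed placeholder, not the
# published identifier -- substitute the real id from the model card.
MODEL_ID = "trinity-tcd/Qomhra-8B"  # hypothetical


def build_messages(instruction: str) -> list[dict]:
    """Wrap a single user instruction in the chat-message format
    expected by instruction-tuned causal LMs."""
    return [{"role": "user", "content": instruction}]


def generate_reply(instruction: str, max_new_tokens: int = 256) -> str:
    """Load the model lazily and generate one reply.

    An 8B model in bf16 needs roughly 16 GB of accelerator memory.
    """
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    # Render the chat messages into the model's prompt format.
    prompt = tokenizer.apply_chat_template(
        build_messages(instruction), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens.
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )


# Example call (requires the weights to be available):
# reply = generate_reply("Inis dom faoi stair na Gaeilge.")
```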
The development of Qomhrá followed a two-stage pipeline:
### 1. Bilingual Continued Pre-Training (CPT)
The model was adapted using a bilingual corpus of **3.265 billion characters**. Unlike previous approaches, which suffered from catastrophic forgetting, we kept a substantial share of English data (approx. 25%) in the mixture to maintain English-language capabilities.
**Data Mixture:**
* **Irish (~75%):**
* **Optimizer:** AdamW ($lr = 1 \times 10^{-4}$).
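Putting rough numbers on the mixture above (a back-of-the-envelope illustration, not the authors' tooling): with a 3.265-billion-character corpus split roughly 75/25 between Irish and English, the per-language character budgets work out as follows.

```python
# Character budgets implied by the CPT mixture described above.
TOTAL_CHARS = 3_265_000_000                   # 3.265 billion characters
MIXTURE = {"irish": 0.75, "english": 0.25}    # approximate weights

budget = {lang: int(TOTAL_CHARS * weight) for lang, weight in MIXTURE.items()}
# -> roughly 2.45B Irish characters and 0.82B English characters
```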
### 2. Instruction Tuning
We curated a **30k-sample** parallel English-Irish instruction dataset, created by translating the **Dolly V2** dataset with **Gemini-2.5-Pro**. Gemini-2.5-Pro was selected after a human evaluation ranked it the top performer for Irish text generation, ahead of GPT-5 and Claude-4-Sonnet.
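The translation step can be sketched as follows. This is an illustrative outline rather than the authors' pipeline: `translate_to_irish` is a stand-in for the Gemini-2.5-Pro API call, and the field names mirror Dolly V2's instruction/response schema.

```python
# Illustrative sketch of building one parallel English-Irish instruction
# pair from a Dolly V2-style record. `translate_to_irish` is a stand-in
# for the actual Gemini-2.5-Pro translation call.
from typing import Callable


def make_parallel_pair(
    record: dict, translate_to_irish: Callable[[str], str]
) -> dict:
    """Return the English record alongside its Irish translation."""
    english = {
        "instruction": record["instruction"],
        "response": record["response"],
    }
    irish = {field: translate_to_irish(text) for field, text in english.items()}
    return {"en": english, "ga": irish}


# Demo with a stub translator (a real pipeline would call the LLM API):
pair = make_parallel_pair(
    {"instruction": "Name three colours.", "response": "Red, green, blue."},
    lambda text: f"[GA] {text}",
)
```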
## Evaluation Results