---
language:
- ga
- en
tags:
- irish
- low-resource
- bilingual
- text-generation
- instruction-following
license: apache-2.0
base_model: Qwen/Qwen3-8B
datasets:
- databricks/databricks-dolly-15k
- uonlp/CulturaX
- cis-lmu/Glot500
metrics:
- bleu
- accuracy
---
|
|
|
|
|
# Qomhrá: A Bilingual Irish & English LLM |
|
|
|
|
|
**Qomhrá** (**Q**wen, the base model, + c**omhrá**, Irish for "conversation") is an 8-billion-parameter bilingual Large Language Model (LLM) designed to support Irish (*Gaeilge*), a low-resource language. It is adapted from **Qwen3-8B** through a pipeline of bilingual Continued Pre-Training (CPT) followed by instruction tuning.
|
|
|
|
|
Developed by researchers at **Trinity College Dublin**, **University College Cork**, and **Queen's University Belfast**, Qomhrá aims to foster technological sovereignty for the Irish language community by providing an open-weight alternative to proprietary APIs. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
* **Model Name:** Qomhrá-8B-Instruct |
|
|
* **Developed by:** Joseph McInerney (TCD & QUB), Khanh-Tung Tran (UCC), Liam Lonergan (TCD), Ailbhe Ní Chasaide (TCD), Neasa Ní Chiaráin (TCD), Barry Devereux (QUB). |
|
|
* **Language(s):** Irish (Gaeilge) and English |
|
|
* **Base Model:** Qwen/Qwen3-8B |
|
|
* **License:** Apache 2.0 |
|
|
* **Paper:** TBC |
|
|
|
|
|
## Training Methodology |
|
|
|
|
|
The development of Qomhrá followed a two-stage pipeline: |
|
|
|
|
|
### 1. Bilingual Continued Pre-Training (CPT) |
|
|
The model was adapted on a bilingual corpus of **3.265 billion characters**. Unlike previous approaches, which suffered catastrophic forgetting of English, the mixture retains a substantial share of English data (approximately 25%) to preserve English-language capabilities.
|
|
|
|
|
**Data Mixture** (a quick arithmetic check of the split follows this list):
|
|
* **Irish (~75%):** |
|
|
* **UCCIX_CulturaX:** 1.2B characters |
|
|
* **National Corpus of Irish (CNG):** 549M characters |
|
|
* **UCCIX_Glot500:** 530M characters |
|
|
* **Other:** UCCIX (Wikipedia, ParaCrawl, ELRC) and The Bible. |
|
|
* **English (~25%):** |
|
|
* **Wikipedia:** 819M characters (2022 dump). |
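The ~75/25 split can be sanity-checked from the character counts above. A minimal sketch (counts transcribed from the list; the "Other" Irish bucket is derived as the remainder, since its size is not itemised in this card):

```python
# Character counts transcribed from the data-mixture list above.
TOTAL = 3_265_000_000  # full bilingual CPT corpus

irish_itemised = {
    "UCCIX_CulturaX": 1_200_000_000,
    "National Corpus of Irish (CNG)": 549_000_000,
    "UCCIX_Glot500": 530_000_000,
}
english = 819_000_000  # English Wikipedia (2022 dump)

irish_total = TOTAL - english  # everything that is not English
other_irish = irish_total - sum(irish_itemised.values())  # UCCIX Wikipedia/ParaCrawl/ELRC + Bible

print(f"Irish share:   {irish_total / TOTAL:.1%}")  # ~74.9%
print(f"English share: {english / TOTAL:.1%}")      # ~25.1%
print(f"Other Irish sources: ~{other_irish / 1e6:.0f}M characters")  # ~167M
```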
|
|
|
|
|
**Training Config** (a minimal sketch of a comparable setup follows this list):
|
|
* **Compute:** 2x Nvidia H100 (80GB). |
|
|
* **Context Window:** Packed to 2048 tokens. |
|
|
* **Precision:** BF16. |
|
|
* **Optimizer:** AdamW (learning rate $1 \times 10^{-4}$).
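The training script itself is not released with this card; the following is a minimal sketch of a comparable CPT setup with Hugging Face `transformers`. The data file `bilingual_corpus.jsonl`, the batch size, and the `group_texts` packing helper are illustrative assumptions; the 2048-token packing, BF16 precision, and learning rate mirror the config above:

```python
from itertools import chain

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BLOCK_SIZE = 2048  # context window used for packing (see config above)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype="auto")

# Placeholder file standing in for the mixed Irish/English CPT corpus.
raw = load_dataset("json", data_files="bilingual_corpus.jsonl", split="train")
tokenized = raw.map(lambda ex: tokenizer(ex["text"]), batched=True,
                    remove_columns=raw.column_names)

def group_texts(examples):
    """Concatenate documents and split into fixed 2048-token blocks (packing)."""
    concatenated = {k: list(chain(*examples[k])) for k in examples}
    total = (len(concatenated["input_ids"]) // BLOCK_SIZE) * BLOCK_SIZE
    return {k: [v[i:i + BLOCK_SIZE] for i in range(0, total, BLOCK_SIZE)]
            for k, v in concatenated.items()}

packed = tokenized.map(group_texts, batched=True)

args = TrainingArguments(
    output_dir="qomhra-cpt",
    bf16=True,                      # BF16 precision, as in the config above
    learning_rate=1e-4,             # Trainer defaults to AdamW
    per_device_train_batch_size=4,  # illustrative; tune for 2x H100 80GB
    num_train_epochs=1,
)
Trainer(model=model, args=args, train_dataset=packed,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)).train()
```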
|
|
|
|
|
### 2. Instruction Tuning |
|
|
We curated a **30k-sample** parallel English-Irish instruction dataset, created by translating the **Dolly V2** instruction dataset (databricks-dolly-15k) with **Gemini-2.5-Pro**. Gemini was selected after a human evaluation ranked it as the top performer for Irish text generation, ahead of GPT-5 and Claude-4-Sonnet.
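A sketch of how such a parallel set could be assembled, assuming the 30k figure comes from keeping each of the ~15k Dolly records in both English and Irish; `translate_to_irish` is a placeholder for the Gemini-2.5-Pro translation step described above, not a real API call:

```python
from datasets import load_dataset

def translate_to_irish(text: str) -> str:
    """Placeholder for the Gemini-2.5-Pro translation step."""
    raise NotImplementedError

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

parallel = []
for row in dolly:
    # Dolly records carry an instruction, optional context, and a response.
    en_prompt = (row["instruction"] + "\n\n" + row["context"]).strip()
    parallel.append({"lang": "en", "prompt": en_prompt, "response": row["response"]})
    parallel.append({
        "lang": "ga",
        "prompt": translate_to_irish(en_prompt),
        "response": translate_to_irish(row["response"]),
    })  # one English + one Irish sample per record -> ~2 x 15k = 30k
```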
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
### Benchmark Definitions |
|
|
* **Cloze-gle** tests familiarity with Irish grammatical gender: the model is presented with three sentences that vary only by pronoun and must identify the one with correct gender agreement.

* **SIB-gle** tests topic classification: the model must assign a topic label to a text, given options such as politics, science, or sport.

* **IQA-gle/eng** tests question answering in Irish and English: given a user question and supporting context, the model must select the most likely answer.

* **BLEU gle↔eng** measures bidirectional Irish-English translation accuracy on health-domain data (Lankford et al., 2022); a minimal scoring sketch follows this list.

* **NQ-eng** tests world knowledge, requiring an exact-match answer to general-knowledge questions in English.
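The BLEU scores below are standard corpus-level n-gram overlap. A minimal scoring sketch with the `sacrebleu` library (the actual evaluation harness and tokenization settings are assumptions, and the Irish sentence is illustrative):

```python
import sacrebleu

# Model translations and gold references (illustrative health-domain pair).
hypotheses = ["Tá an t-othar ag téarnamh go maith."]   # "The patient is recovering well."
references = [["Tá an t-othar ag téarnamh go maith."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # 0-100; the table below reports scores on a 0-1 scale
```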
|
|
|
|
|
### Performance |
|
|
|
|
|
Qomhrá-Instruct outperforms existing open-source baselines on Irish understanding and generation while maintaining strong English capabilities. |
|
|
|
|
|
| Benchmark | Qomhrá-Instruct | UCCIX | Llama-3.1-8B | |
|
|
| :--- | :--- | :--- | :--- | |
|
|
| **Cloze-gle** | **0.88** | 0.75 | 0.59 | |
|
|
| **SIB-gle** | **0.8186** | 0.7794 | 0.7696 | |
|
|
| **IQA-gle** | **0.6760** | 0.3889 | 0.4861 | |
|
|
| **IQA-eng** | **0.7924** | 0.3704 | 0.7747 | |
|
|
| **BLEU eng2gle** | 0.1167 | **0.3334** | 0.0880 | |
|
|
| **BLEU gle2eng** | 0.0770 | **0.4636** | 0.4229 | |
|
|
| **NQ-eng** | 0.1269 | 0.1668 | **0.2767** | |
|
|
|
|
|
*Note: As discussed in the paper, the Instruct model's lower scores on generation benchmarks (BLEU/NQ) relative to base models are driven by response-length distributions: the Instruct model learns to give concise answers, whereas base models generate longer sequences that artificially inflate overlap metrics. A toy illustration follows.*
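A toy illustration of the length effect, using `sacrebleu` sentence-level BLEU on hypothetical answers (not drawn from the actual benchmark data):

```python
import sacrebleu

reference = ["The President of Ireland is Michael D. Higgins."]

concise = "Michael D. Higgins."  # instruct-style answer
verbose = "The President of Ireland is Michael D. Higgins, who took office in 2011."

# The concise answer is heavily hit by BLEU's brevity penalty;
# the verbose one overlaps with more reference n-grams and scores higher.
print(sacrebleu.sentence_bleu(concise, reference).score)
print(sacrebleu.sentence_bleu(verbose, reference).score)
```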
|
|
|
|
|
## Usage |
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "jmcinern/Qomhra"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # match the BF16 training precision
    device_map="auto",
)

# Irish prompt: "You are a helpful and faithful assistant." /
# "Who is the President of Ireland?"
messages = [
    {"role": "system", "content": "Is cúntóir úsáideach agus dílis tú."},
    {"role": "user", "content": "Cé hé Uachtarán na hÉireann?"},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,  # pass the attention mask along with the input ids
    max_new_tokens=512,
)

# Strip the prompt tokens so only the newly generated answer is decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
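Since the base model is Qwen3-8B, the chat template may inherit Qwen3's thinking-mode toggle. If the fine-tuned tokenizer retains it (an assumption not confirmed by this card), the reasoning preamble can be disabled at templating time:

```python
# Only applies if Qomhrá inherits Qwen3's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # skip Qwen3's <think> reasoning block
)
```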