	---
	language:
	- ga
	- en
	tags:
	- irish
	- low-resource
	- bilingual
	- text-generation
	- instruction-following
	license: apache-2.0
	base_model: Qwen/Qwen3-8B
	datasets:
	- databricks/dolly-v2
	- uonlp/CulturaX
	- cis-lmu/Glot500
	metrics:
	- bleu
	- accuracy
	---

	# Qomhrá: A Bilingual Irish & English LLM

	Qomhrá (Qwen, the base model, plus comhrá, Irish for "conversation") is an 8-billion-parameter bilingual Large Language Model (LLM) designed to support Irish (Gaeilge), a low-resource language. It is adapted from Qwen3-8B through a pipeline of bilingual Continued Pre-Training (CPT) followed by instruction tuning.

	Developed by researchers at Trinity College Dublin, University College Cork, and Queen's University Belfast, Qomhrá aims to foster technological sovereignty for the Irish language community by providing an open-weight alternative to proprietary APIs.

	## Model Details

	* Model Name: Qomhrá-8B-Instruct
	* Developed by: Joseph McInerney (TCD & QUB), Khanh-Tung Tran (UCC), Liam Lonergan (TCD), Ailbhe Ní Chasaide (TCD), Neasa Ní Chiaráin (TCD), Barry Devereux (QUB).
	* Language(s): Irish (Gaeilge) and English
	* Base Model: Qwen/Qwen3-8B
	* License: Apache 2.0
	* Paper: TBC

	## Training Methodology

	The development of Qomhrá followed a two-stage pipeline:

	### 1. Bilingual Continued Pre-Training (CPT)
	The model was adapted using a bilingual corpus of 3.265 billion characters. To limit the catastrophic forgetting that affected previous approaches, roughly 25% of the corpus is English, which helps preserve English-language capabilities.

	Data Mixture:
	* Irish (~75%):
	    * UCCIX_CulturaX: 1.2B characters
	    * National Corpus of Irish (CNG): 549M characters
	    * UCCIX_Glot500: 530M characters
	    * Other: UCCIX (Wikipedia, ParaCrawl, ELRC) and The Bible.
	* English (~25%):
	    * Wikipedia: 819M characters (2022 dump).
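
As a quick sanity check on the mixture above, the reported character counts can be turned into shares. This is only an arithmetic sketch: it assumes the 3.265B total includes the English data, treats the rounded 1.2B figure as exact, and the "Other" value it prints is derived here rather than reported by the authors.

```python
# Character counts (in millions) as reported in the data mixture above.
TOTAL_M = 3265
ENGLISH_M = 819
IRISH_LISTED_M = {
    "UCCIX_CulturaX": 1200,  # reported as 1.2B, so this is a rounded figure
    "National Corpus of Irish (CNG)": 549,
    "UCCIX_Glot500": 530,
}

english_share = ENGLISH_M / TOTAL_M
irish_total_m = TOTAL_M - ENGLISH_M
# Derived, not reported: what remains for the "Other" Irish sources.
irish_other_m = irish_total_m - sum(IRISH_LISTED_M.values())

print(f"English share: {english_share:.1%}")
print(f"Irish share: {irish_total_m / TOTAL_M:.1%}")
print(f"Implied 'Other' Irish sources: ~{irish_other_m}M characters")
```

The English share comes out very close to the stated 25%, which is consistent with the total covering both languages.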

	Training Config:
	* Compute: 2x Nvidia H100 (80GB).
	* Context Window: Packed to 2048 tokens.
	* Precision: BF16.
	* Optimizer: AdamW (learning rate $1 \times 10^{-4}$).
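
The "packed to 2048 tokens" setting can be illustrated with a minimal sketch of sequence packing: tokenized documents are concatenated with a separator token and cut into fixed-length blocks. The function name `pack_sequences`, the `eos_id` placeholder, and the choice to drop the trailing partial block are assumptions for illustration, not the authors' training code.

```python
from typing import Iterable, List


def pack_sequences(
    tokenized_docs: Iterable[List[int]],
    block_size: int = 2048,
    eos_id: int = 0,  # placeholder; in practice use the tokenizer's real EOS id
) -> List[List[int]]:
    """Concatenate documents (EOS-separated) and cut into fixed-size blocks."""
    stream: List[int] = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(eos_id)  # mark the document boundary

    # Drop the trailing partial block so every block has exactly block_size tokens.
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


# Toy example with a tiny block size so the result is easy to inspect.
blocks = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4, eos_id=0)
print(blocks)
```

Packing avoids padding short documents, so each training batch is filled with real tokens.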

	### 2. Instruction Tuning
	We curated a parallel English-Irish instruction dataset of 30k samples by translating the Dolly V2 dataset with Gemini-2.5-Pro. Gemini-2.5-Pro was selected after a human evaluation ranked it as the top performer for Irish text generation, ahead of GPT-5 and Claude-4-Sonnet.
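
For illustration, a Dolly-style record can be mapped to chat messages before instruction tuning roughly as follows. The field names `instruction`, `context`, and `response` follow the public Dolly schema; the helper `to_chat_messages`, the way the context is appended to the prompt, and the sample record are assumptions, not the authors' preprocessing.

```python
from typing import Dict, List


def to_chat_messages(record: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn one Dolly-style record into a user/assistant message pair."""
    prompt = record["instruction"].strip()
    context = record.get("context", "").strip()
    if context:
        # Append the supporting context after the instruction, if present.
        prompt = f"{prompt}\n\n{context}"
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": record["response"].strip()},
    ]


# Hypothetical record, written only to show the shape of the output.
example = {
    "instruction": "Name the base model.",
    "context": "Qomhrá is adapted from Qwen3-8B.",
    "response": "Qwen3-8B",
}
for message in to_chat_messages(example):
    print(message)
```

In a parallel dataset, the same function would be applied to both the English record and its Irish translation.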

	## Evaluation Results

	### Benchmark Definitions
	* Cloze-gle tests the model's grasp of Irish grammatical gender. The model is presented with three sentences that differ in their pronoun and must pick the one with the correct gender agreement.
	* SIB-gle tests topic classification: the model must assign a topic label to a text, given options such as politics, science, or sport.
	* IQA-gle/eng tests the model's question-answering ability in both Irish and English. The model is presented with a user question and some supporting context, and it must select the most likely answer.
	* BLEU gle <-> eng measures the model's bidirectional Irish-English translation quality on health-domain data (Lankford et al., 2022).
	* NQ-eng tests the model's world knowledge, requiring an exact match on general-knowledge questions in English.
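
To make the exact-match criterion concrete, here is a minimal sketch using a common SQuAD-style normalization (lowercasing, removing punctuation and English articles). The exact normalization used for NQ-eng is not specified here, so treat `normalize_answer` and `exact_match` as hypothetical helpers.

```python
import re
import string
from typing import Iterable


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: Iterable[str]) -> bool:
    """Return True if the normalized prediction equals any normalized reference."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(ref) for ref in references)


print(exact_match("Dublin.", ["dublin"]))             # True
print(exact_match("The city of Dublin", ["Dublin"]))  # False
```

Under this strict criterion, a correct answer wrapped in extra words does not count as a match.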

	### Performance

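	Qomhrá-Instruct outperforms the open-source baselines on the Irish understanding benchmarks (Cloze-gle, SIB-gle, IQA-gle) and on IQA-eng, showing that English capabilities are maintained. It scores below at least one baseline on the generation benchmarks (BLEU, NQ-eng); the note after the table discusses why.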

	| Benchmark | Qomhrá-Instruct | UCCIX | Llama-3.1-8B |
	| :--- | :--- | :--- | :--- |
	| Cloze-gle | 0.88 | 0.75 | 0.59 |
	| SIB-gle | 0.8186 | 0.7794 | 0.7696 |
	| IQA-gle | 0.6760 | 0.3889 | 0.4861 |
	| IQA-eng | 0.7924 | 0.3704 | 0.7747 |
	| BLEU eng2gle | 0.1167 | 0.3334 | 0.0880 |
	| BLEU gle2eng | 0.0770 | 0.4636 | 0.4229 |
	| NQ-eng | 0.1269 | 0.1668 | 0.2767 |

	Note: As discussed in the paper, lower scores on generation benchmarks (BLEU/NQ) for the Instruct model compared to base models are driven by response length distributions; the Instruct model learns to provide concise answers, whereas base models generate longer sequences that artificially inflate overlap metrics.

	## Usage

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	model_id = "jmcinern/Qomhra"

	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    torch_dtype=torch.bfloat16,  # the model was trained in BF16
	    device_map="auto",
	)

	# Irish prompt.
	# System: "You are a helpful and faithful assistant."
	# User: "Who is the President of Ireland?"
	messages = [
	    {"role": "system", "content": "Is cúntóir úsáideach agus dílis tú."},
	    {"role": "user", "content": "Cé hé Uachtarán na hÉireann?"},
	]

	# Render the conversation with the model's chat template.
	text = tokenizer.apply_chat_template(
	    messages,
	    tokenize=False,
	    add_generation_prompt=True,
	)

	model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

	# Pass the full encoding so the attention mask is used along with the input IDs.
	generated_ids = model.generate(
	    **model_inputs,
	    max_new_tokens=512,
	)
	# Strip the prompt tokens so that only the newly generated reply is decoded.
	generated_ids = [
	    output_ids[len(input_ids):]
	    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
	]

	response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
	print(response)
	```