Qomhrá-AWQ: A Language-Aware Quantized Bilingual Irish & English LLM

Qomhrá-AWQ is the activation-aware weight-quantized (AWQ) version of Qomhrá. The following information about Qomhrá also applies to this model:

Qomhrá (Qwen, the base model, plus comhrá, Irish for "conversation") is an 8-billion-parameter bilingual Large Language Model (LLM) designed to support Irish (Gaeilge), a low-resource language. It is adapted from Qwen3-8B via a pipeline of Bilingual Continued Pre-Training (CPT) and Instruction Tuning.

Developed by researchers at Trinity College Dublin, University College Cork, and Queen's University Belfast, Qomhrá aims to foster technological sovereignty for the Irish language community by providing an open-weight alternative to proprietary APIs.

Model Details

  • Model Name: Qomhrá-8B-Instruct
  • Developed by: Joseph McInerney (TCD & QUB), Khanh-Tung Tran (UCC), Liam Lonergan (TCD), Ailbhe Ní Chasaide (TCD), Neasa Ní Chiaráin (TCD), Barry Devereux (QUB).
  • Language(s): Irish (Gaeilge) and English
  • Base Model: Qwen/Qwen3-8B
  • License: Apache 2.0
  • Paper: TBC

Training Methodology

The development of Qomhrá followed a two-stage pipeline:

1. Bilingual Continued Pre-Training (CPT)

The model was adapted using a bilingual corpus of 3.265 billion characters. Unlike previous approaches, which suffered from catastrophic forgetting of English, we included a substantial proportion of English data (approx. 25%) to maintain English-language capabilities.

Data Mixture:

  • Irish (~75%):
    • UCCIX_CulturaX: 1.2B characters
    • National Corpus of Irish (CNG): 549M characters
    • UCCIX_Glot500: 530M characters
    • Other: UCCIX (Wikipedia, ParaCrawl, ELRC) and The Bible.
  • English (~25%):
    • Wikipedia: 819M characters (2022 dump).
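
As a quick arithmetic check, the mixture percentages above can be recovered from the reported character counts (the Irish total is inferred as the corpus total minus English, since the "Other" Irish sources are not fully itemised):

```python
# Sanity check on the reported corpus mixture, using the figures from this card.
total_chars = 3.265e9            # total bilingual CPT corpus
english_chars = 0.819e9          # English Wikipedia, 2022 dump
irish_chars = total_chars - english_chars  # all Irish sources combined

english_share = english_chars / total_chars
irish_share = irish_chars / total_chars

# English works out to roughly 25% and Irish to roughly 75%, matching the stated mix.
print(f"English: {english_share:.1%}, Irish: {irish_share:.1%}")
```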

Training Config:

  • Compute: 2x Nvidia H100 (80GB).
  • Context Window: Packed to 2048 tokens.
  • Precision: BF16.
  • Optimizer: AdamW (lr = 1e-4).
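
"Packed to 2048 tokens" means tokenized documents are concatenated into one stream and split into fixed-length blocks, so every training example fills the full context window. A minimal sketch of this idea (illustrative only, not the actual training code; the EOS separator id is an assumption):

```python
# Greedy sequence packing for continued pre-training: concatenate tokenized
# documents (separated by an EOS token) and slice the stream into fixed-length
# blocks, dropping the incomplete remainder at the end.
def pack_sequences(token_lists, block_size=2048, eos_id=0):
    stream = []
    for toks in token_lists:
        stream.extend(toks)
        stream.append(eos_id)  # document boundary marker
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# Tiny example with block_size=4: three documents become three full blocks.
docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
blocks = pack_sequences(docs, block_size=4)
print(blocks)
```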

2. Instruction Tuning

We curated a 30k-sample parallel English-Irish instruction dataset, created by translating the Dolly V2 dataset with Gemini-2.5-Pro. Gemini-2.5-Pro was selected after a human evaluation ranked it as the top performer for Irish text generation, outperforming GPT-5 and Claude-4-Sonnet.
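A parallel record of this kind pairs each Dolly V2 sample with its Irish translation, and each record yields one English and one Irish chat-format training example. The field names below are illustrative assumptions, not the dataset's actual schema:

```python
# Sketch: turn one parallel (English + Irish-translated) Dolly-style record into
# two chat-format training conversations. Field names are hypothetical.
def to_chat_examples(record):
    return [
        [{"role": "user", "content": record["instruction_en"]},
         {"role": "assistant", "content": record["response_en"]}],
        [{"role": "user", "content": record["instruction_ga"]},
         {"role": "assistant", "content": record["response_ga"]}],
    ]

record = {
    "instruction_en": "Name the capital of Ireland.",
    "response_en": "Dublin.",
    "instruction_ga": "Ainmnigh príomhchathair na hÉireann.",  # Irish translation
    "response_ga": "Baile Átha Cliath.",                       # "Dublin."
}
examples = to_chat_examples(record)
print(len(examples))  # 2: one English and one Irish conversation
```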

Evaluation Results

Benchmark Definitions

  • Cloze-gle tests the model's familiarity with Irish grammatical gender: the model is presented with three sentences that vary only by pronoun and must select the one with correct gender agreement.
  • SIB-gle tests topic classification: the model must assign a topic label to a text, choosing from options such as politics, science, or sport.
  • IQA-gle/eng tests the model's question-answering ability in both Irish and English: given a user question and some supporting context, the model must select the most likely answer.
  • BLEU gle <-> eng measures the model's bi-directional Irish and English translation accuracy on health domain data (Lankford et al., 2022).
  • NQ-eng tests the model's world knowledge, requiring an exact match on general knowledge style questions in English.
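
Multiple-choice benchmarks like Cloze-gle, SIB-gle, and IQA are commonly scored by picking the candidate the model assigns the highest (length-normalised) log-likelihood. The sketch below shows that selection logic with a pluggable scoring function; a real harness would sum per-token log-probabilities under the LLM, and this is not necessarily the paper's evaluation code:

```python
# Generic multiple-choice scoring: return the index of the candidate with the
# highest length-normalised score under `log_likelihood`.
def pick_option(candidates, log_likelihood):
    scores = [log_likelihood(c) / max(len(c.split()), 1)  # normalise by word count
              for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)

# Dummy scorer for illustration only: prefers the sentence containing the
# feminine pronoun "sí" ("she"), mimicking a gender-agreement cloze choice.
dummy_ll = lambda s: 0.0 if "sí" in s.split() else -5.0
options = ["Tá sé anseo.", "Tá sí anseo.", "Tá siad anseo."]  # he / she / they "is here"
print(pick_option(options, dummy_ll))  # index of the preferred option
```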

Performance

Qomhrá-Instruct outperforms existing open-source baselines on Irish understanding and generation while maintaining strong English capabilities.

Benchmark      Qomhrá-Instruct   UCCIX    Llama-3.1-8B
Cloze-gle      0.88              0.75     0.59
SIB-gle        0.8186            0.7794   0.7696
IQA-gle        0.6760            0.3889   0.4861
IQA-eng        0.7924            0.3704   0.7747
BLEU eng2gle   0.1167            0.3334   0.0880
BLEU gle2eng   0.0770            0.4636   0.4229
NQ-eng         0.1269            0.1668   0.2767

Note: As discussed in the paper, lower scores on generation benchmarks (BLEU/NQ) for the Instruct model compared to base models are driven by response length distributions; the Instruct model learns to provide concise answers, whereas base models generate longer sequences that artificially inflate overlap metrics.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jmcinern/Qomhra-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Loading an AWQ checkpoint through transformers typically requires the
# autoawq package and a CUDA-capable GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto"
)

# Irish prompt
messages = [
    # "You are a helpful and faithful assistant."
    {"role": "system", "content": "Is cúntóir úsáideach agus dílis tú."},
    # "Who is the President of Ireland?"
    {"role": "user", "content": "Cé hé Uachtarán na hÉireann?"}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,  # passes input_ids and attention_mask together
    max_new_tokens=512
)
# Strip the prompt tokens so only the newly generated response is decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
