jmcinern committed on
Commit ecb2d91 · verified · 1 Parent(s): 3ba9d55

Update README.md

  1. README.md +117 -6
README.md CHANGED
@@ -1,12 +1,123 @@
  ---
- license: mit
  language:
  - ga
  - en
- base_model:
- - jmcinern/Qomhra
- pipeline_tag: question-answering
  ---

- # Model
- An activation-aware quantized (AWQ) version of Qomhra, focused on retaining Irish and English performance while reducing memory overhead for inference.
  ---
  language:
  - ga
  - en
+ tags:
+ - irish
+ - low-resource
+ - bilingual
+ - text-generation
+ - instruction-following
+ license: apache-2.0
+ base_model: jmcinern/Qomhra
+ datasets:
+ - databricks/dolly-v2
+ - uonlp/CulturaX
+ - cis-lmu/Glot500
  ---

+ # Qomhrá-AWQ: A Language-Aware Quantized Bilingual Irish & English LLM
+
+ **Qomhrá-AWQ** is the activation-aware quantized (AWQ) version of **Qomhrá**. The details below describe the original Qomhrá model.
+
+ Qomhrá, from **Q**wen (the base model) + c**omhrá** (Irish for "conversation"), is an 8-billion-parameter bilingual large language model (LLM) designed to support Irish (*Gaeilge*), a low-resource language. It is adapted from **Qwen3-8B** via a pipeline of bilingual continued pre-training (CPT) and instruction tuning.
+
+ Developed by researchers at **Trinity College Dublin**, **University College Cork**, and **Queen's University Belfast**, Qomhrá aims to foster technological sovereignty for the Irish language community by providing an open-weight alternative to proprietary APIs.
+
+ ## Model Details
+
+ * **Model Name:** Qomhrá-8B-Instruct
+ * **Developed by:** Joseph McInerney (TCD & QUB), Khanh-Tung Tran (UCC), Liam Lonergan (TCD), Ailbhe Ní Chasaide (TCD), Neasa Ní Chiaráin (TCD), Barry Devereux (QUB).
+ * **Language(s):** Irish (Gaeilge) and English
+ * **Base Model:** Qwen/Qwen3-8B
+ * **License:** Apache 2.0
+ * **Paper:** TBC
+
+ ## Training Methodology
+
+ The development of Qomhrá followed a two-stage pipeline:
+
+ ### 1. Bilingual Continued Pre-Training (CPT)
+ The model was adapted using a bilingual corpus of **3.265 billion characters**. Unlike previous approaches, which suffered from catastrophic forgetting, we included a substantial share of English data (approx. 25%) to maintain English-language capabilities.
+
+ **Data Mixture:**
+ * **Irish (~75%):**
+     * **UCCIX_CulturaX:** 1.2B characters
+     * **National Corpus of Irish (CNG):** 549M characters
+     * **UCCIX_Glot500:** 530M characters
+     * **Other:** UCCIX (Wikipedia, ParaCrawl, ELRC) and The Bible.
+ * **English (~25%):**
+     * **Wikipedia:** 819M characters (2022 dump).
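
The approximate 75/25 split can be checked against the character counts listed above. The sketch below is illustrative arithmetic only: the variable names are made up for this example, the 1.2B figure is already rounded in the source, and the size of the Irish "Other" sources is not stated, so it is inferred here as the remainder of the 3.265B total (an assumption).

```python
# Character counts in millions, copied from the data mixture list above.
irish_listed = {
    "UCCIX_CulturaX": 1200,  # listed as 1.2B, so already rounded
    "National Corpus of Irish (CNG)": 549,
    "UCCIX_Glot500": 530,
}
english = {"Wikipedia": 819}
total_chars = 3265  # 3.265 billion characters overall

english_total = sum(english.values())
irish_listed_total = sum(irish_listed.values())

# Assumption: everything not itemised belongs to the Irish "Other" sources.
irish_other = total_chars - english_total - irish_listed_total
irish_total = irish_listed_total + irish_other

english_share = english_total / total_chars
irish_share = irish_total / total_chars

print(f"English: {english_share:.1%}, Irish: {irish_share:.1%}")
print(f"Irish 'Other' (implied): {irish_other}M characters")
```

Under these assumptions the English share comes out at roughly a quarter of the corpus, which matches the stated mixture.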
+
+ **Training Config:**
+ * **Compute:** 2x Nvidia H100 (80GB).
+ * **Context Window:** Packed to 2048 tokens.
+ * **Precision:** BF16.
+ * **Optimizer:** AdamW ($lr = 1 \times 10^{-4}$).
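
The README only states that the context window was "packed to 2048 tokens"; the exact packing procedure is not described. A common approach is to concatenate tokenized documents, separated by an end-of-sequence token, and cut the stream into fixed-size blocks. The following is a minimal sketch of that approach, not the authors' implementation: `pack_sequences`, `block_size`, and `eos_id` are hypothetical names, and dropping the incomplete final block is an assumption.

```python
from typing import Iterable, List


def pack_sequences(
    token_seqs: Iterable[List[int]],
    block_size: int = 2048,
    eos_id: int = 0,
) -> List[List[int]]:
    """Concatenate tokenized documents and cut the stream into fixed-size blocks."""
    stream: List[int] = []
    for seq in token_seqs:
        stream.extend(seq)
        stream.append(eos_id)  # separator between documents (assumed convention)

    # Keep only whole blocks; the trailing remainder is dropped.
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


# Toy usage with a tiny block size so the result is easy to inspect.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = pack_sequences(docs, block_size=4, eos_id=0)
print(blocks)
```

Packing avoids spending compute on padding tokens, at the cost of documents occasionally being split across block boundaries.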
+
+ ### 2. Instruction Tuning
+ We curated a **30k-sample** parallel English-Irish instruction dataset by translating the **Dolly V2** dataset with **Gemini-2.5-Pro**, which was selected after a human evaluation ranked it as the top performer for Irish text generation (outperforming GPT-5 and Claude-4-Sonnet).
+
+ ## Evaluation Results
+
+ ### Benchmark Definitions
+ * **Cloze-gle** tests the model's familiarity with Irish grammatical gender: the model is presented with three sentences that differ in the pronoun and must identify the one with the correct gender agreement.
+ * **SIB-gle** tests topic classification: the model must assign a topic label to a text from options such as politics, science, or sport.
+ * **IQA-gle/eng** tests the model's question-answering ability in both Irish and English: given a user question and some supporting context, the model must select the most likely answer.
+ * **BLEU gle <-> eng** measures the model's bidirectional Irish-English translation accuracy on health-domain data (Lankford et al., 2022).
+ * **NQ-eng** tests the model's world knowledge, requiring an exact match on general-knowledge questions in English.
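
To make "exact match" concrete, here is a minimal sketch of how such a score is often computed. The normalization steps (lowercasing, stripping punctuation and English articles) follow a widely used convention and are an assumption; the README does not say which normalization the NQ-eng evaluation applies. The function names and the toy strings are hypothetical.

```python
import re
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: List[str]) -> bool:
    """True if the normalized prediction equals any normalized reference."""
    pred = normalize(prediction)
    return any(pred == normalize(ref) for ref in references)


def exact_match_score(predictions: List[str], references: List[List[str]]) -> float:
    """Fraction of predictions that exactly match one of their references."""
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return hits / len(predictions)


# Toy data, invented for illustration only.
predictions = ["Dublin.", "The River Shannon", "a long answer about Dublin"]
references = [["Dublin"], ["River Shannon", "Shannon"], ["Dublin"]]
score = exact_match_score(predictions, references)
print(score)
```

The third toy prediction contains the right answer but is not an exact match, which shows how strict this metric is about answer format and length.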
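+
+ ### Performance
+
+ Qomhrá-Instruct outperforms existing open-source baselines on the Irish understanding benchmarks (Cloze-gle, SIB-gle, IQA-gle) while maintaining strong English capabilities (IQA-eng); its lower BLEU and NQ-eng scores are addressed in the note below the table.
+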
+ | Benchmark | Qomhrá-Instruct | UCCIX | Llama-3.1-8B |
+ | :--- | :--- | :--- | :--- |
+ | **Cloze-gle** | **0.88** | 0.75 | 0.59 |
+ | **SIB-gle** | **0.8186** | 0.7794 | 0.7696 |
+ | **IQA-gle** | **0.6760** | 0.3889 | 0.4861 |
+ | **IQA-eng** | **0.7924** | 0.3704 | 0.7747 |
+ | **BLEU eng2gle** | 0.1167 | **0.3334** | 0.0880 |
+ | **BLEU gle2eng** | 0.0770 | **0.4636** | 0.4229 |
+ | **NQ-eng** | 0.1269 | 0.1668 | **0.2767** |
+
+ *Note: As discussed in the paper, lower scores on generation benchmarks (BLEU/NQ) for the Instruct model compared to base models are driven by response length distributions; the Instruct model learns to provide concise answers, whereas base models generate longer sequences that artificially inflate overlap metrics.*
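
As a small illustration of how output length alone can move an overlap metric, the sketch below computes the standard BLEU brevity penalty, which scales the score down when the candidate is shorter than the reference. This is not taken from the paper and is not a full BLEU implementation; it shows only one length-sensitive component of the metric, and the function name is chosen for this example.

```python
import math


def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """BLEU brevity penalty: 1.0 if the candidate is longer than the reference,
    otherwise exp(1 - reference_len / candidate_len)."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)


# A candidate half as long as the reference is penalised heavily,
# while a longer candidate receives no penalty from this term.
short_bp = brevity_penalty(candidate_len=5, reference_len=10)
long_bp = brevity_penalty(candidate_len=12, reference_len=10)
print(round(short_bp, 3), long_bp)
```

With identical n-gram precision, the shorter candidate's score would be multiplied by `short_bp`, so concise outputs can score lower for reasons unrelated to translation quality.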
+
+ ## Usage
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "jmcinern/Qomhra-AWQ"
+
+ # Load the tokenizer and the AWQ-quantized weights; device_map="auto"
+ # places the model on the available device(s).
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     device_map="auto"
+ )
+
+ # Irish prompt. System: "You are a helpful and faithful assistant."
+ # User: "Who is the President of Ireland?"
+ messages = [
+     {"role": "system", "content": "Is cúntóir úsáideach agus dílis tú."},
+     {"role": "user", "content": "Cé hé Uachtarán na hÉireann?"}
+ ]
+
+ # Render the conversation with the model's chat template.
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+
+ model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ # Pass the full encoding so the attention mask is forwarded as well.
+ generated_ids = model.generate(
+     **model_inputs,
+     max_new_tokens=512
+ )
+ # Drop the prompt tokens so only the newly generated tokens are decoded.
+ generated_ids = [
+     output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
+ ]
+
+ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+ print(response)