---
extra_gated_heading: "Access Request for Research-Only Model"
extra_gated_description: "Please provide your professional details and acknowledge the terms of use to request access."
extra_gated_button_content: "Submit Request"
extra_gated_prompt: "By requesting access, you acknowledge that this model is provided solely for research purposes, is offered 'as-is' without any guarantees, and cannot be utilized for for-profit tasks or commercial applications."
extra_gated_fields:
  Full Name: text
  Company / Institution: text
  Role: text
  Intended Use Case: text
  I would like to receive news about the models, publications and events of the research group in Hungarian:
    type: select
    options:
      - "Yes"
      - "No"
  I acknowledge that this model is for research-only, comes with no guarantee, and cannot be used for for-profit tasks: checkbox
language:
  - hu
  - de
  - en
base_model:
  - Qwen/Qwen3-4B
pipeline_tag: text-generation
license: cc-by-nc-sa-4.0
model-index:
  - name: Racka-4B
    results:
      - task:
          type: text-generation
        dataset:
          type: HULU
          name: HuCOLA
        metrics:
          - name: ACC
            type: accuracy
            value: 0.8624
            verified: false
          - name: MCC
            type: mcc
            value: 0.5657
            verified: false
          - name: F1
            type: f1
            value: 0.8563
            verified: false
      - task:
          type: text-generation
        dataset:
          type: HULU
          name: HuCOPA
        metrics:
          - name: ACC
            type: accuracy
            value: 0.7990
            verified: false
          - name: MCC
            type: mcc
            value: 0.5998
            verified: false
          - name: F1
            type: f1
            value: 0.7988
            verified: false
      - task:
          type: text-generation
        dataset:
          type: HULU
          name: HuSST
        metrics:
          - name: ACC
            type: accuracy
            value: 0.7603
            verified: false
          - name: MCC
            type: mcc
            value: 0.5137
            verified: false
          - name: F1
            type: f1
            value: 0.7511
            verified: false
      - task:
          type: text-generation
        dataset:
          type: HULU
          name: HuRTE
        metrics:
          - name: ACC
            type: accuracy
            value: 0.8790
            verified: false
          - name: MCC
            type: mcc
            value: 0.7553
            verified: false
          - name: F1
            type: f1
            value: 0.8790
            verified: false
      - task:
          type: text-generation
        dataset:
          type: HULU
          name: HuWNLI
        metrics:
          - name: ACC
            type: accuracy
            value: 0.5666
            verified: false
          - name: MCC
            type: mcc
            value: 0.1031
            verified: false
          - name: F1
            type: f1
            value: 0.4548
            verified: false
      - task:
          type: text-generation
        dataset:
          type: HULU
          name: HuCB
        metrics:
          - name: ACC
            type: accuracy
            value: 0.6388
            verified: false
          - name: MCC
            type: mcc
            value: 0.4741
            verified: false
          - name: F1
            type: f1
            value: 0.6373
            verified: false
      - task:
          type: text-generation
        dataset:
          type: OpenHuEval
          name: HuWildBench
        metrics:
          - name: WBScore
            type: score
            value: 57.17
            verified: false
      - task:
          type: text-generation
        dataset:
          type: OpenHuEval
          name: HuSimpleQA
        metrics:
          - name: Acc
            type: accuracy
            value: 10.05
            verified: false
      - task:
          type: text-generation
        dataset:
          type: OpenHuEval
          name: HuProverbRea (OE)
        metrics:
          - name: Acc
            type: accuracy
            value: 61.94
            verified: false
      - task:
          type: text-generation
        dataset:
          type: OpenHuEval
          name: HuProverbRea (2CQ)
        metrics:
          - name: Acc
            type: accuracy
            value: 77.53
            verified: false
      - task:
          type: text-generation
        dataset:
          type: LM-Eval-Harness-HU
          name: Arc_hu
        metrics:
          - name: Acc_norm
            type: accuracy
            value: 0.4101
            verified: false
      - task:
          type: text-generation
        dataset:
          type: LM-Eval-Harness-HU
          name: Hellaswag_hu
        metrics:
          - name: Acc_norm
            type: accuracy
            value: 0.4510
            verified: false
      - task:
          type: text-generation
        dataset:
          type: LM-Eval-Harness-HU
          name: MMLU_hu
        metrics:
          - name: Acc
            type: accuracy
            value: 0.5378
            verified: false
      - task:
          type: text-generation
        dataset:
          type: LM-Eval-Harness-HU
          name: TruthfulQA_hu_mc2
        metrics:
          - name: Acc
            type: accuracy
            value: 0.5493
            verified: false
      - task:
          type: text-generation
        dataset:
          type: LM-Eval-Harness-HU
          name: GSM8K_hu
        metrics:
          - name: Flexible-extract
            type: accuracy
            value: 0.5329
            verified: false
          - name: Strict-match
            type: accuracy
            value: 0.5299
            verified: false
---
|
|
|
|
|
# Racka-4B Model Card |
|
|
|
|
|
<div style="display:flex; align-items:center; gap:12px;"> <img src="https://cdn-uploads.huggingface.co/production/uploads/640edc40208821a59b710e84/KiAGUcITdOXG5gVhn55yS.png" alt="Racka icon" width="100" height="100" style="flex:0 0 auto;"> <h1 style="margin:0;">Racka</h1> </div> |
|
|
|
|
|
**Racka** (*Regionális Adatokon Célzottan Kialakított Alapmodell*, roughly: a foundation model purpose-built on regional data) is a continually pretrained large language model designed to bridge the resource gap between Hungarian and high-resource languages. It employs parameter-efficient continual pretraining via Low-Rank Adaptation (LoRA) on a **Qwen3-4B** (reasoning/instruct) backbone.
|
|
|
|
|
The model was trained on a mixture of **160B tokens** (44% Hungarian, 24% English, 21% German, 11% Code) on the Komondor HPC. To better match the training distribution, Racka uses an adapted tokenizer that substantially reduces tokenization fertility for Hungarian (i.e., it needs far fewer subword tokens per word) while remaining competitive in English and German.
|
|
|
|
|
## Model Details |
|
|
|
|
|
* **Developed by:** ELTE Faculty of Humanities (Dept. of Digital Humanities) & ELTE Faculty of Informatics (Dept. of Artificial Intelligence) |
|
|
* **Backbone Model:** [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) (Reasoning/Instruct version) |
|
|
* **Language(s):** Hungarian (primary), English, German, Code |
|
|
* **License:** cc-by-nc-sa-4.0 |
|
|
* **Architecture:** Transformer with LoRA adapters (Rank=64, Alpha=128) |
|
|
* **Training Context Length:** 4,096 tokens (with sequence packing) |
|
|
* **Context Length (Inference):** 32,768 tokens natively; 131,072 tokens with YaRN rope scaling (see the example under Usage)
|
|
|
|
|
## Usage |
|
|
|
|
|
### Hugging Face Transformers |
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name = "elte-nlp/Racka-4B"

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Build the prompt with the chat template
messages = [
    {"role": "system", "content": "You are a helpful Hungarian assistant."},
    {"role": "user", "content": "Magyarázd el a gépi tanulás lényegét óvodásoknak egy mondatban!"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # switches between thinking and non-thinking modes (default: True)
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.6,
    top_p=0.8,
    top_k=50,
    repetition_penalty=1.1,
)

# Conduct text completion
generated_ids = model.generate(
    input_ids=model_inputs["input_ids"],
    attention_mask=model_inputs["attention_mask"],
    max_new_tokens=32768,
    generation_config=generation_config
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Split the reasoning trace from the final answer
try:
    # rindex of token id 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    # No </think> token found: treat the whole output as the answer
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```
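The native context window is 32,768 tokens. For longer inputs (up to 131,072 tokens), YaRN rope scaling can be enabled at load time. A minimal sketch, assuming the standard Hugging Face pattern of forwarding a `rope_scaling` override through `from_pretrained` (mirroring the vLLM flags below):

```python
from transformers import AutoModelForCausalLM

# YaRN scaling: 4x the native 32,768-token window -> 131,072 tokens.
# Keyword arguments not consumed by from_pretrained are applied to the model config.
model = AutoModelForCausalLM.from_pretrained(
    "elte-nlp/Racka-4B",
    torch_dtype="auto",
    device_map="auto",
    rope_scaling={
        "rope_type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
    },
)
```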
|
|
|
|
|
### vLLM |
|
|
|
|
|
```bash
vllm serve elte-nlp/Racka-4B \
    --tokenizer elte-nlp/Racka-4B \
    --dtype float16 \
    --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
    --max-model-len 131072 \
    --reasoning-parser qwen3
```
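The server exposes an OpenAI-compatible API (on port 8000 by default). A minimal client sketch; the base URL and placeholder API key are vLLM defaults rather than anything specific to this model:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; the key is required by the client but unused locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="elte-nlp/Racka-4B",
    messages=[{"role": "user", "content": "Mi a gépi tanulás? Válaszolj röviden!"}],
    temperature=0.6,
    top_p=0.8,
)
print(response.choices[0].message.content)
```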
|
|
|
|
|
## Technical Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model was trained on a 160B token corpus designed to mitigate catastrophic forgetting via data replay: |
|
|
|
|
|
| Language | BPE Tokens | Ratio | Sources |
| :--- | :--- | :--- | :--- |
| **Hungarian** | ~70B | 44% | Common Crawl (heavily filtered), News, Wikipedia, Court Rulings, Subtitles, Academic Repositories. |
| **English** | ~38B | 24% | The Pile, FineWeb. |
| **German** | ~34B | 21% | Occiglot-FineWeb. |
| **Code** | ~18B | 11% | The Stack v2. |
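To illustrate how such a replay mixture behaves, the sketch below samples training documents according to the ratios above. It is a toy illustration, not the project's actual data pipeline, and the source names are placeholders:

```python
import random

# Replay mixture ratios from the table above.
MIXTURE = {"hungarian": 0.44, "english": 0.24, "german": 0.21, "code": 0.11}

def sample_source(rng: random.Random) -> str:
    """Pick the corpus for the next training document according to the mixture."""
    r, cumulative = rng.random(), 0.0
    for source, ratio in MIXTURE.items():
        cumulative += ratio
        if r < cumulative:
            return source
    return "hungarian"  # guard against floating-point rounding

rng = random.Random(42)
counts = {source: 0 for source in MIXTURE}
for _ in range(100_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly 44k / 24k / 21k / 11k
```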
|
|
|
|
|
### Tokenizer Adaptation |
|
|
|
|
|
The vocabulary was extended with **32,000** new Hungarian tokens initialized via VIPI (Vocabulary Initialization with Partial Inheritance). This reduced Hungarian subword fertility by **~47%**, which translates into a proportional reduction in processing time for Hungarian text.
|
|
|
|
|
| Language | Qwen3-4B Fertility | Racka-4B Fertility | Change |
| :--- | :--- | :--- | :--- |
| **Hungarian** | 3.13 | **1.66** | **-46.96%** |
| English | 1.57 | 1.94 | +23.44% |
| German | 2.05 | 2.31 | +12.62% |
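Fertility here is the average number of subword tokens per word. A minimal sketch for reproducing the comparison; whitespace splitting and the single example sentence are simplifications of the actual evaluation setup:

```python
from transformers import AutoTokenizer

def fertility(tokenizer, texts):
    """Average number of subword tokens per whitespace-separated word."""
    tokens = sum(len(tokenizer(t, add_special_tokens=False)["input_ids"]) for t in texts)
    words = sum(len(t.split()) for t in texts)
    return tokens / words

texts = ["A gépi tanulás a mesterséges intelligencia egyik részterülete."]
for name in ("Qwen/Qwen3-4B", "elte-nlp/Racka-4B"):
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {fertility(tok, texts):.2f}")
```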
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
* **Infrastructure:** Komondor HPC (64 × NVIDIA A100 40GB).
* **Training time:** 287 hours (≈2.1 GPU-years of total GPU time).
* **Strategy:** Distributed Data Parallel (DDP).
* **Parameters** (see the configuration sketch below):
  * LoRA Rank: 64, Alpha: 128, Dropout: 0.1.
  * Learning Rate: \\( 1\times10^{-4} \\) (LoRA), \\( 5\times10^{-5} \\) (non-LoRA).
  * Batch Size: 2 per GPU (effective batch size: 512, i.e., 2 × 64 GPUs × 4 gradient-accumulation steps).
  * Steps: 326,357.
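For reference, a minimal PEFT configuration matching the hyperparameters above. The `target_modules` list is an assumption (the attention and MLP projections typically adapted in Qwen-style models); the card does not specify which modules were adapted:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=64,            # LoRA rank
    lora_alpha=128,  # scaling factor
    lora_dropout=0.1,
    # Assumed targets: the projections usually adapted in Qwen-style blocks.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", torch_dtype="auto")
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```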
|
|
|
|
|
## Evaluation |
|
|
|
|
|
The following tables present the performance of **Racka-4B** compared to its base models (**Qwen3-4B** and **Qwen3-4B-Base**) and the SOTA 8B Hungarian model **PULI-LlumiX-Llama-3.1 8B**. |
|
|
|
|
|
### 1. HULU Benchmark (Fine-tuned) |
|
|
|
|
|
Performance on the Hungarian Language Understanding (HULU) benchmark suite. Results are averaged over multiple runs, and for each task the better of LoRA and full fine-tuning is reported.
|
|
|
|
|
| Dataset | Metric | Qwen3-4B | **Racka-4B** | Qwen3-4B-Base | PULI-LlumiX-Llama-3.1 8B |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **HuCOLA** | ACC | 0.8109 | 0.8624 | 0.8254 | **0.8989** |
| | MCC | 0.3482 | 0.5657 | 0.4044 | **0.6920** |
| | F1 | 0.7840 | 0.8563 | 0.8027 | **0.8969** |
| **HuCOPA** | ACC | 0.5589 | 0.7990 | 0.5845 | **0.9359** |
| | MCC | 0.1181 | 0.5998 | 0.1705 | **0.8720** |
| | F1 | 0.5584 | 0.7988 | 0.5837 | **0.9359** |
| **HuSST** | ACC | 0.7517 | 0.7603 | 0.7539 | **0.7804** |
| | MCC | 0.5022 | 0.5137 | 0.5082 | **0.5598** |
| | F1 | 0.7433 | 0.7511 | 0.7513 | **0.7698** |
| **HuRTE** | ACC | **0.9078** | 0.8790 | 0.8872 | 0.8979 |
| | MCC | **0.8142** | 0.7553 | 0.7719 | 0.7936 |
| | F1 | **0.9078** | 0.8790 | 0.8872 | 0.8977 |
| **HuWNLI** | ACC | 0.5033 | **0.5666** | 0.5366 | 0.3800 |
| | MCC | -0.0980 | **0.1031** | -0.0600 | -0.2815 |
| | F1 | 0.3862 | **0.4548** | 0.4069 | 0.3668 |
| **HuCB** | ACC | **0.7378** | 0.6388 | 0.6291 | 0.4854 |
| | MCC | **0.6078** | 0.4741 | 0.4733 | 0.2742 |
| | F1 | **0.7316** | 0.6373 | 0.6112 | 0.4594 |
| **Overall** | Avg ACC | 0.711 | **0.751** | 0.702 | 0.729 |
| | Avg MCC | 0.382 | **0.502** | 0.378 | 0.485 |
| | Avg F1 | 0.685 | **0.730** | 0.673 | 0.721 |
|
|
|
|
|
--- |
|
|
|
|
|
### 2. OpenHuEval |
|
|
|
|
|
Evaluation on Hungarian reading comprehension, generation, and reasoning tasks. Qwen and Racka models use a patched implementation of OpenHuEval for compatibility. |
|
|
|
|
|
| Metric | Qwen3-4B | **Racka-4B** | Qwen3-4B-Base | PULI-LlumiX 8B |
| :--- | :--- | :--- | :--- | :--- |
| **HuWildBench** (WBScore) | **63.03** | 57.17 | 52.59 | 17.77 |
| **HuSimpleQA** (Acc) | 7.30 | 10.05 | 5.90 | **20.03** |
| **HuProverbRea** (Acc OE) | 62.47 | 61.94 | 41.15 | **75.86** |
| **HuProverbRea** (Acc 2CQ) | 74.98 | **77.53** | 0.00 | 77.36 |
| **HuMatchingFIB** (B Acc) | 39.59 | 38.93 | **42.30** | 33.54 |
| **HuMatchingFIB** (Q Acc) | **5.94** | 4.68 | 5.58 | 3.96 |
| **HuStandardFIB** (B Acc) | 13.20 | 18.98 | 0.00 | **29.16** |
| **HuStandardFIB** (Q Acc) | 1.08 | **2.15** | 0.00 | **2.15** |
| **Overall** | 33.44 | **33.93** | 18.44 | 32.47 |
|
|
|
|
|
--- |
|
|
|
|
|
### 3. LM-Eval-Harness (Hungarian) |
|
|
|
|
|
Few-shot evaluation on standard benchmarks translated to Hungarian. The best-performing setup is reported (with chat template for Racka-4B, without it for the other models).
|
|
|
|
|
| Dataset (Metric) | Qwen3-4B | **Racka-4B** | Qwen3-4B-Base | PULI-LlumiX 8B |
| :--- | :--- | :--- | :--- | :--- |
| **Arc_hu** (Acc) | 0.3202 | 0.3450 | 0.3792 | **0.3861** |
| **Arc_hu** (Acc_norm) | 0.3844 | 0.4101 | 0.4169 | **0.4323** |
| **Hellaswag_hu** (Acc) | 0.3369 | 0.3656 | 0.3610 | **0.4241** |
| **Hellaswag_hu** (Acc_norm) | 0.4095 | 0.4510 | 0.4557 | **0.5606** |
| **MMLU_hu** (Acc) | 0.5427 | 0.5378 | **0.5965** | 0.5310 |
| **TruthfulQA_hu_mc1** (Acc) | 0.3177 | **0.3644** | 0.3281 | 0.3035 |
| **TruthfulQA_hu_mc2** (Acc) | 0.5102 | **0.5493** | 0.5045 | 0.4883 |
| **GSM8K_hu** (Strict-match) | 0.6330 | 0.5299 | **0.6398** | 0.4761 |
| **GSM8K_hu** (Flexible-extract) | 0.6285 | 0.5329 | **0.6421** | 0.4791 |
| **Overall** | 0.453 | 0.454 | **0.4805** | 0.4546 |
|
|
|
|
|
--- |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- The model retains instruction-following chat and English reasoning capabilities under the original Qwen settings; these abilities carry over from the backbone and were not directly targeted during training.
- The model has not undergone safety alignment and is not suitable for deployment to end users.
- This model may only be used for research purposes; commercial or for-profit use is not permitted.
|
|
|
|
|
## Team |
|
|
|
|
|
In alphabetical order: |
|
|
|
|
|
- Zsolt Csibi (ELTE-IK, AI Dept.) |
|
|
- Bence Gortka (ELTE-BTK, DH-Lab) |
|
|
- Natabara Gyöngyössy (ELTE-IK, AI Dept.) |
|
|
- Kornél Nagy (ELTE-BTK, DH-Lab) |
|
|
- Dávid Nemeskey (ELTE-BTK, DH-Lab) |
|
|
- Gábor Palkó (ELTE-BTK, DH-Lab) |
|
|
- Martin Sallai (ELTE-BTK, DH-Lab) |
|
|
- András Simonyi (ELTE-IK, AI Dept.) |
|
|
- András Szekeres (ELTE-BTK, DH-Lab) |
|
|
|
|
|
|
|
|
|
|
|
## Acknowledgements |
|
|
|
|
|
We acknowledge the Digital Government Development and Project Management Ltd. for awarding us access to the Komondor HPC facility based in Hungary. |
|
|
|
|
|
This research was supported by the EKÖP-24 University Excellence Scholarship Program of the Ministry for Culture and Innovation, funded by the National Research, Development and Innovation Fund. |
|
|
|
|
|
The authors acknowledge the support of the National Laboratory for Digital Heritage. Project no. 2022-2.1.1-NL-2022-00009 has been implemented with the support provided by the Ministry of Culture and Innovation of Hungary from the National Research, Development and Innovation Fund, financed under the 2022-2.1.1-NL funding scheme. |
|
|
|
|
|
We would like to thank Levente Szabados for the name idea and initial informal discussions. |
|
|
|
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@inproceedings{racka2026,
  title={Racka: Efficient Hungarian LLM Adaptation on Academic Infrastructure},
  author={Csibi, Zsolt and Gortka, Bence Gy\"orgy and Nagy, Korn\'el and Nemeskey, D\'avid M\'ark and Sallai, Martin and Simonyi, Andr\'as and Szekeres, Andr\'as M\'ark and Palk\'o, G\'abor},
  booktitle={Proceedings of the XXII. Hungarian Computational Linguistics Conference},
  year={2026}
}
|
|
``` |