Access Request for Research-Only Model

Please provide your professional details and acknowledge the terms of use to request access.

By requesting access, you acknowledge that this model is provided solely for research purposes, is offered 'as-is' without any guarantees, and may not be used for commercial or for-profit purposes.

Racka-4B Model Card

Racka

Racka (Regionális Adatokon Célzottan Kialakított Alapmodell) is a continually pretrained large language model designed to bridge the resource gap between Hungarian and high-resource languages. It employs parameter-efficient continual pretraining via Low-Rank Adaptation (LoRA) on a Qwen3-4B (reasoning/instruct) backbone.

The model was trained on a mixture of 160B tokens (44% Hungarian, 24% English, 21% German, 11% Code) on the Komondor HPC. To better match the training distribution, Racka uses an adapted tokenizer that achieves substantially improved tokenization fertility for Hungarian while maintaining competitive performance in English and German.

For additional details, please see our paper in the MSZNY2026 proceedings: https://rgai.inf.u-szeged.hu/sites/rgai.inf.u-szeged.hu/files/mszny2026.pdf

To learn more about the details of our project, please check our FAQ.

Model Details

  • Developed by: ELTE Faculty of Humanities (Dept. of Digital Humanities) & ELTE Faculty of Informatics (Dept. of Artificial Intelligence)
  • Backbone Model: Qwen/Qwen3-4B (Reasoning/Instruct version)
  • Language(s): Hungarian (primary), English, German, Code
  • License: cc-by-nc-sa-4.0
  • Architecture: Transformer with LoRA adapters (Rank=64, Alpha=128)
  • Training Context Length: 4,096 tokens (with sequence packing)
  • Context Length (Inference): 32,768 natively and 131,072 tokens with YaRN
  • Intended use case: Low-complexity edge tasks, optimized few-shot pipelines, task-specific Hungarian fine-tuning.
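
The LoRA settings above (Rank=64, Alpha=128) can be read through a short sketch. It assumes the standard LoRA scaling of alpha/rank, which the model card does not state explicitly, and the 2560 × 2560 layer size is a hypothetical value chosen only for illustration.

```python
def lora_scaling(rank: int, alpha: int) -> float:
    """Scaling factor applied to the low-rank update B @ A (standard LoRA, assumed here)."""
    return alpha / rank


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


RANK, ALPHA = 64, 128   # values from the model card
D_IN = D_OUT = 2560     # hypothetical layer size, for illustration only

scaling = lora_scaling(RANK, ALPHA)
adapter_params = lora_param_count(D_IN, D_OUT, RANK)
full_params = D_IN * D_OUT

print(f"scaling factor: {scaling}")
print(f"adapter parameters: {adapter_params:,} "
      f"({adapter_params / full_params:.1%} of the full matrix)")
```

Under these assumptions the adapter update is scaled by 2 and trains only a small fraction of the parameters of the matrix it adapts.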

Usage

For edge/mobile inference, check out our quantized models in the GGUF format: Racka-4B-GGUF

Hugging Face Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name = "elte-nlp/Racka-4B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful Hungarian assistant."},
    {"role": "user", "content": "Magyarázd el a gépi tanulás lényegét óvodásoknak egy mondatban!"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)


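# sampling settings
generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.6,
    top_p=0.8,
    top_k=50,
    repetition_penalty=1.1,
    # presence_penalty is an OpenAI/vLLM-style sampling option; to our knowledge
    # Transformers' generate() does not use it, so it likely has no effect here
    presence_penalty=1.1,
)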

# conduct text completion
generated_ids = model.generate(
    input_ids = model_inputs["input_ids"],
    attention_mask = model_inputs["attention_mask"],
    max_new_tokens=32768,
    generation_config=generation_config
)
# keep only the newly generated tokens (drop the prompt)
output_ids = generated_ids[0][len(model_inputs["input_ids"][0]):].tolist()

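# parsing thinking content
# look up the id of </think> instead of hard-coding it (151668 in the Qwen3 vocabulary)
think_end_id = tokenizer.convert_tokens_to_ids("</think>")
try:
    # rindex: position right after the last </think> token
    index = len(output_ids) - output_ids[::-1].index(think_end_id)
except ValueError:
    # no </think> found: treat the whole output as the final answer
    index = 0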

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)

vLLM

vllm serve elte-nlp/Racka-4B --tokenizer elte-nlp/Racka-4B --dtype float16 --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072 --reasoning-parser qwen3
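
The relationship between the YaRN factor and `--max-model-len` in the command above can be checked with a few lines of Python. This is only an arithmetic sketch; `extended_context` is a hypothetical helper, not part of vLLM.

```python
import json

# The same JSON string that is passed to --rope-scaling in the command above.
rope_scaling = json.loads(
    '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}'
)


def extended_context(cfg: dict) -> int:
    """Context length reachable with YaRN: native length times the scaling factor."""
    return int(cfg["original_max_position_embeddings"] * cfg["factor"])


max_model_len = extended_context(rope_scaling)
print(max_model_len)
```

The result matches the 131,072-token inference context listed under Model Details.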

Technical Details

Training Data

The model was trained on a 160B token corpus designed to mitigate catastrophic forgetting via data replay:

| Language | BPE Tokens | Ratio | Sources |
|---|---|---|---|
| Hungarian | ~70B | 44% | Common Crawl (heavily filtered), News, Wikipedia, Court Rulings, Subtitles, Academic Repositories |
| English | ~38B | 24% | The Pile, FineWeb |
| German | ~34B | 21% | Occiglot-FineWeb |
| Code | ~18B | 11% | The Stack v2 |
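
As a quick consistency check, the ratios in the table can be recomputed from the approximate token counts. The sketch below uses only the numbers listed above.

```python
# Approximate token counts from the table, in billions.
tokens_b = {"Hungarian": 70, "English": 38, "German": 34, "Code": 18}

total_b = sum(tokens_b.values())
shares = {lang: round(100 * n / total_b) for lang, n in tokens_b.items()}

print(f"total: {total_b}B tokens")
for lang, pct in shares.items():
    print(f"{lang}: {pct}%")
```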

Tokenizer Adaptation

The vocabulary was extended with 32,000 new Hungarian tokens, initialized via VIPI (Vocabulary Initialization with Partial Inheritance). This reduced Hungarian subword fertility (the average number of tokens per word) by ~47%, which translates into a roughly proportional reduction in processing time for Hungarian text.

| Language | Qwen3-4B Fertility | Racka-4B Fertility | Change |
|---|---|---|---|
| Hungarian | 3.13 | 1.66 | -46.96% |
| English | 1.57 | 1.94 | +23.44% |
| German | 2.05 | 2.31 | +12.62% |
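
Fertility can be measured as the average number of subword tokens per word. The sketch below is a minimal illustration, not the evaluation script used for the table: words are split on whitespace (an assumption), and `toy_tokenize` is a hypothetical stand-in for a real tokenizer.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    n_words = 0
    n_tokens = 0
    for text in texts:
        n_words += len(text.split())
        n_tokens += len(tokenize(text))
    return n_tokens / n_words


def relative_change(before: float, after: float) -> float:
    """Relative change in percent; negative values mean fewer tokens per word."""
    return 100 * (after - before) / before


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical tokenizer: cuts every word into chunks of at most 4 characters."""
    return [w[i:i + 4] for w in text.split() for i in range(0, len(w), 4)]


toy_fertility = fertility(["gépi tanulás"], toy_tokenize)
hungarian_change = relative_change(3.13, 1.66)  # Hungarian row of the table

print(f"toy fertility: {toy_fertility}")
print(f"Hungarian change: {hungarian_change:.2f}%")
```

With a Hugging Face tokenizer, `tokenizer.tokenize` can be passed as the `tokenize` argument. Recomputing the English and German rows from the rounded fertilities gives slightly different percentages than the table, which presumably used unrounded values.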

Training Configuration

  • Infrastructure: Komondor HPC (64 × NVIDIA A100 40GB).
  • Training time: 287 hours (total GPU time: 2.1 years)
  • Strategy: Distributed Data Parallel (DDP).
  • Parameters:
    • LoRA Rank: 64, Alpha: 128, Dropout: 0.1.
    • Learning Rate: $1\times10^{-4}$ (LoRA), $5\times10^{-5}$ (Non-LoRA).
    • Batch Size: 2 per GPU (Effective batch size: 512).
    • Steps: 326,357.
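
The figures above fit together arithmetically. The sketch below assumes that the gap between the per-GPU and the effective batch size is covered by gradient accumulation, which the model card does not state explicitly.

```python
N_GPUS = 64             # from the model card
PER_GPU_BATCH = 2       # from the model card
EFFECTIVE_BATCH = 512   # from the model card
WALL_CLOCK_HOURS = 287  # from the model card

# Sequences processed in one forward/backward pass across all GPUs.
sequences_per_pass = N_GPUS * PER_GPU_BATCH

# Implied number of accumulation steps (assumption: gradient accumulation is used).
accumulation_steps = EFFECTIVE_BATCH // sequences_per_pass

# Total GPU time, converted to years of 365 days.
gpu_hours = WALL_CLOCK_HOURS * N_GPUS
gpu_years = gpu_hours / (24 * 365)

print(f"sequences per pass: {sequences_per_pass}")
print(f"implied accumulation steps: {accumulation_steps}")
print(f"GPU time: {gpu_hours:,} hours (about {gpu_years:.1f} years)")
```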

Evaluation

The following tables present the performance of Racka-4B compared to its base models (Qwen3-4B and Qwen3-4B-Base) and the SOTA 8B Hungarian model PULI-LlumiX-Llama-3.1 8B.

1. HULU Benchmark (Fine-tuned)

Performance on the Hungarian Language Understanding (HULU) benchmark suite. Results represent the average of multiple runs, taking the best result between LoRA and full fine-tuning.

| Dataset | Metric | Qwen3-4B | Racka-4B | Qwen3-4B-Base | PULI-LlumiX-Llama-3.1 8B |
|---|---|---|---|---|---|
| HuCOLA | ACC | 0.8109 | 0.8624 | 0.8254 | 0.8989 |
| | MCC | 0.3482 | 0.5657 | 0.4044 | 0.6920 |
| | F1 | 0.7840 | 0.8563 | 0.8027 | 0.8969 |
| HuCOPA | ACC | 0.5589 | 0.7990 | 0.5845 | 0.9359 |
| | MCC | 0.1181 | 0.5998 | 0.1705 | 0.8720 |
| | F1 | 0.5584 | 0.7988 | 0.5837 | 0.9359 |
| HuSST | ACC | 0.7517 | 0.7603 | 0.7539 | 0.7804 |
| | MCC | 0.5022 | 0.5137 | 0.5082 | 0.5598 |
| | F1 | 0.7433 | 0.7511 | 0.7513 | 0.7698 |
| HuRTE | ACC | 0.9078 | 0.8790 | 0.8872 | 0.8979 |
| | MCC | 0.8142 | 0.7553 | 0.7719 | 0.7936 |
| | F1 | 0.9078 | 0.8790 | 0.8872 | 0.8977 |
| HuWNLI | ACC | 0.5033 | 0.5666 | 0.5366 | 0.3800 |
| | MCC | -0.0980 | 0.1031 | -0.0600 | -0.2815 |
| | F1 | 0.3862 | 0.4548 | 0.4069 | 0.3668 |
| HuCB | ACC | 0.7378 | 0.6388 | 0.6291 | 0.4854 |
| | MCC | 0.6078 | 0.4741 | 0.4733 | 0.2742 |
| | F1 | 0.7316 | 0.6373 | 0.6112 | 0.4594 |
| Overall | Avg ACC | 0.711 | 0.751 | 0.702 | 0.729 |
| | Avg MCC | 0.382 | 0.502 | 0.378 | 0.485 |
| | Avg F1 | 0.685 | 0.7295 | 0.673 | 0.721 |
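
The Overall rows appear to be unweighted means over the six datasets. This is an assumption, but the arithmetic below reproduces the reported Racka-4B averages from the per-dataset values in the table.

```python
from statistics import mean

# Racka-4B scores in table order: HuCOLA, HuCOPA, HuSST, HuRTE, HuWNLI, HuCB.
racka_scores = {
    "ACC": [0.8624, 0.7990, 0.7603, 0.8790, 0.5666, 0.6388],
    "MCC": [0.5657, 0.5998, 0.5137, 0.7553, 0.1031, 0.4741],
    "F1": [0.8563, 0.7988, 0.7511, 0.8790, 0.4548, 0.6373],
}

# Unweighted (macro) average per metric.
averages = {metric: mean(values) for metric, values in racka_scores.items()}

for metric, value in averages.items():
    print(f"Avg {metric}: {value:.4f}")
```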

2. OpenHuEval

Evaluation on Hungarian reading comprehension, generation, and reasoning tasks. Qwen and Racka models use a patched implementation of OpenHuEval for compatibility.

| Metric | Qwen3-4B | Racka-4B | Qwen3-4B-Base | PULI-LlumiX 8B |
|---|---|---|---|---|
| HuWildBench (WBScore) | 63.03 | 57.17 | 52.59 | 17.77 |
| HuSimpleQA (Acc) | 7.30 | 10.05 | 5.90 | 20.03 |
| HuProverbRea (Acc OE) | 62.47 | 61.94 | 41.15 | 75.86 |
| HuProverbRea (Acc 2CQ) | 74.98 | 77.53 | 0.00 | 77.36 |
| HuMatchingFIB (B Acc) | 39.59 | 38.93 | 42.30 | 33.54 |
| HuMatchingFIB (Q Acc) | 5.94 | 4.68 | 5.58 | 3.96 |
| HuStandardFIB (B Acc) | 13.20 | 18.98 | 0.00 | 29.16 |
| HuStandardFIB (Q Acc) | 1.08 | 2.15 | 0.00 | 2.15 |
| Overall | 33.44 | 33.93 | 18.44 | 32.47 |

3. LM-Eval-Harness (Hungarian)

Few-shot evaluation on standard benchmarks translated to Hungarian. Best results are kept (with chat template for Racka-4B and without for others).

| Dataset (Metric) | Qwen3-4B | Racka-4B | Qwen3-4B-Base | PULI-LlumiX 8B |
|---|---|---|---|---|
| Arc_hu (Acc) | 0.3202 | 0.3450 | 0.3792 | 0.3861 |
| Arc_hu (Acc_norm) | 0.3844 | 0.4101 | 0.4169 | 0.4323 |
| Hellaswag_hu (Acc) | 0.3369 | 0.3656 | 0.3610 | 0.4241 |
| Hellaswag_hu (Acc_norm) | 0.4095 | 0.4510 | 0.4557 | 0.5606 |
| MMLU_hu (Acc) | 0.5427 | 0.5378 | 0.5965 | 0.5310 |
| TruthfulQA_hu_mc1 (Acc) | 0.3177 | 0.3644 | 0.3281 | 0.3035 |
| TruthfulQA_hu_mc2 (Acc) | 0.5102 | 0.5493 | 0.5045 | 0.4883 |
| GSM8K_hu (Strict-match) | 0.6330 | 0.5299 | 0.6398 | 0.4761 |
| GSM8K_hu (Flexible extract) | 0.6285 | 0.5329 | 0.6421 | 0.4791 |
| Overall | 0.453 | 0.454 | 0.4805 | 0.4546 |

Limitations

  • The model is capable of both instruction-following chat and English reasoning with the original Qwen settings. These capabilities are preserved from the backbone; no training directly targeted them.
  • The model has not been aligned and is unsafe for use with end-users.
  • This model may be used for research purposes only; commercial or for-profit use is not permitted.

Team

In alphabetical order:

  • Zsolt Csibi (ELTE-IK, AI Dept.)
  • Bence Gortka (ELTE-BTK, DH-Lab)
  • Natabara Gyöngyössy (ELTE-IK, AI Dept.)
  • Kornél Nagy (ELTE-BTK, DH-Lab)
  • Dávid Nemeskey (ELTE-BTK, DH-Lab)
  • Gábor Palkó (ELTE-BTK, DH-Lab)
  • Martin Sallai (ELTE-BTK, DH-Lab)
  • András Simonyi (ELTE-IK, AI Dept.)
  • András Szekeres (ELTE-BTK, DH-Lab)

Acknowledgements

We acknowledge the Digital Government Development and Project Management Ltd. for awarding us access to the Komondor HPC facility based in Hungary.

This research was supported by the EKÖP-24 University Excellence Scholarship Program of the Ministry for Culture and Innovation, funded by the National Research, Development and Innovation Fund.

The authors acknowledge the support of the National Laboratory for Digital Heritage. Project no. 2022-2.1.1-NL-2022-00009 has been implemented with the support provided by the Ministry of Culture and Innovation of Hungary from the National Research, Development and Innovation Fund, financed under the 2022-2.1.1-NL funding scheme.

We would like to thank Levente Szabados for the name idea and initial informal discussions.

Citation

@inproceedings{csibi2026racka,
  title     = {Racka: Efficient Hungarian LLM Adaptation on Academic Infrastructure},
  author    = {Csibi, Zsolt and Gortka, Bence Gy{\"o}rgy and Gy{\"o}ngy{\"o}ssy, Natabara and Nagy, Korn{\'e}l and Nemeskey, D{\'a}vid M{\'a}rk and Palk{\'o}, G{\'a}bor and Sallai, Martin and Simonyi, Andr{\'a}s and Szekeres, Andr{\'a}s M{\'a}rk},
  booktitle = {XXII. Magyar Sz{\'a}m{\'\i}t{\'o}g{\'e}pes Nyelv{\'e}szeti Konferencia (MSZNY 2026)},
  year      = {2026},
  address   = {Szeged, Hungary},
  pages     = {17--38},
  url       = {https://rgai.inf.u-szeged.hu/sites/rgai.inf.u-szeged.hu/files/mszny2026.pdf}
}

FAQ

Q: Is this model intended for use as a general-purpose chat agent?

A: No. While Racka-4B is currently one of the highest-scoring small/edge models on Hungarian benchmarks, it remains a compact, edge-oriented model. A full language adaptation of larger architectures would demand computing infrastructure beyond the Komondor HPC (AI training at large scale is simply very expensive).

Our team is actively working on adapting larger models; in the meantime, we recommend utilizing models of at least 27B–30B parameters for complex or resource-heavy reasoning tasks.


Q: Why is this a gated-access model with a restricted license?

A: Due to Hungarian and EU regulations, as well as the licensing terms of our source data, certain components of our training corpus cannot be used to generate derivative works for commercial or for-profit activities.

The volume of globally available, high-quality Hungarian text data remains limited. Restricting our training data exclusively to open, unrestricted corpora would not have yielded a sufficient dataset for effective language adaptation. Consequently, we require users to fill in an access agreement confirming the model will be used strictly for non-profit purposes. We believe the academic and non-profit communities will still derive value from this work.


Q: What kind of financial funding did this project receive?

A: As noted in our acknowledgements, our team consists of AI researchers and enthusiasts who volunteered their time. We did not receive dedicated financial funding or grants to build this model. Instead, the project was made possible through institutional and operational support:

  • Worktime Allocations & Academic Grants: Several ELTE researchers and PhD students were permitted to dedicate official university hours and existing fellowship time to this initiative.
  • Compute & HPC Access: Our academic project proposal was accepted by the Komondor HPC team (this is a public opportunity available to all Hungarian academics), granting us vital GPU time. The facility's engineering team also provided essential, specialized HPC support.
  • Voluntary Contribution: Driven by a shared passion for open-source NLP, project members frequently dedicated their personal time to bring this model to completion.

Q: How would you improve on your model?

A: The current model has undergone continual pretraining for language adaptation, but it has not gone through any SFT or alignment step. As part of our next projects, we aim to translate, create, and synthesize higher-complexity datasets. This will allow us to train models that are more capable in everyday use cases.
