---
extra_gated_heading: Access Request for Research-Only Model
extra_gated_description: >-
  Please provide your professional details and acknowledge the terms of use to
  request access.
extra_gated_button_content: Submit Request
extra_gated_prompt: >-
  By requesting access, you acknowledge that this model is provided solely for
  research purposes, is offered 'as-is' without any guarantees, and cannot be
  utilized for for-profit tasks or commercial applications.
extra_gated_fields:
  Full Name: text
  Company / Institution: text
  Role: text
  Intended Use Case: text
  I would like to receive news about the models, publications and events of the research group in Hungarian:
    type: select
    options:
      - 'Yes'
      - 'No'
  I acknowledge that this model is for research-only, comes with no guarantee, and cannot be used for for-profit tasks: checkbox
language:
  - hu
  - de
  - en
base_model:
  - Qwen/Qwen3-4B
pipeline_tag: text-generation
license: cc-by-nc-sa-4.0
model-index:
  - name: Racka-4B
    results:
      - task:
          type: text-generation
        dataset:
          type: HULU
          name: HuCOLA
        metrics:
          - name: ACC
            type: accuracy
            value: 0.8624
            verified: false
          - name: MCC
            type: mcc
            value: 0.5657
            verified: false
          - name: F1
            type: f1
            value: 0.8563
            verified: false
      - task:
          type: text-generation
        dataset:
          type: HULU
          name: HuCOPA
        metrics:
          - name: ACC
            type: accuracy
            value: 0.799
            verified: false
          - name: MCC
            type: mcc
            value: 0.5998
            verified: false
          - name: F1
            type: f1
            value: 0.7988
            verified: false
      - task:
          type: text-generation
        dataset:
          type: HULU
          name: HuSST
        metrics:
          - name: ACC
            type: accuracy
            value: 0.7603
            verified: false
          - name: MCC
            type: mcc
            value: 0.5137
            verified: false
          - name: F1
            type: f1
            value: 0.7511
            verified: false
      - task:
          type: text-generation
        dataset:
          type: HULU
          name: HuRTE
        metrics:
          - name: ACC
            type: accuracy
            value: 0.879
            verified: false
          - name: MCC
            type: mcc
            value: 0.7553
            verified: false
          - name: F1
            type: f1
            value: 0.879
            verified: false
      - task:
          type: text-generation
        dataset:
          type: HULU
          name: HuWNLI
        metrics:
          - name: ACC
            type: accuracy
            value: 0.5666
            verified: false
          - name: MCC
            type: mcc
            value: 0.1031
            verified: false
          - name: F1
            type: f1
            value: 0.4548
            verified: false
      - task:
          type: text-generation
        dataset:
          type: HULU
          name: HuCB
        metrics:
          - name: ACC
            type: accuracy
            value: 0.6388
            verified: false
          - name: MCC
            type: mcc
            value: 0.4741
            verified: false
          - name: F1
            type: f1
            value: 0.6373
            verified: false
      - task:
          type: text-generation
        dataset:
          type: OpenHuEval
          name: HuWildBench
        metrics:
          - name: WBScore
            type: score
            value: 57.17
            verified: false
      - task:
          type: text-generation
        dataset:
          type: OpenHuEval
          name: HuSimpleQA
        metrics:
          - name: Acc
            type: accuracy
            value: 10.05
            verified: false
      - task:
          type: text-generation
        dataset:
          type: OpenHuEval
          name: HuProverbRea (OE)
        metrics:
          - name: Acc
            type: accuracy
            value: 61.94
            verified: false
      - task:
          type: text-generation
        dataset:
          type: OpenHuEval
          name: HuProverbRea (2CQ)
        metrics:
          - name: Acc
            type: accuracy
            value: 77.53
            verified: false
      - task:
          type: text-generation
        dataset:
          type: LM-Eval-Harness-HU
          name: Arc_hu
        metrics:
          - name: Acc_norm
            type: accuracy
            value: 0.4101
            verified: false
      - task:
          type: text-generation
        dataset:
          type: LM-Eval-Harness-HU
          name: Hellaswag_hu
        metrics:
          - name: Acc_norm
            type: accuracy
            value: 0.451
            verified: false
      - task:
          type: text-generation
        dataset:
          type: LM-Eval-Harness-HU
          name: MMLU_hu
        metrics:
          - name: Acc
            type: accuracy
            value: 0.5378
            verified: false
      - task:
          type: text-generation
        dataset:
          type: LM-Eval-Harness-HU
          name: TruthfulQA_hu_mc2
        metrics:
          - name: Acc
            type: accuracy
            value: 0.5493
            verified: false
      - task:
          type: text-generation
        dataset:
          type: LM-Eval-Harness-HU
          name: GSM8K_hu
        metrics:
          - name: Flexible-extract
            type: accuracy
            value: 0.5329
            verified: false
      - task:
          type: text-generation
        dataset:
          type: LM-Eval-Harness-HU
          name: GSM8K_hu
        metrics:
          - name: Strict-match
            type: accuracy
            value: 0.5299
            verified: false
---
# Racka-4B Model Card
Racka (Regionális Adatokon Célzottan Kialakított Alapmodell, roughly "a foundation model purpose-built on regional data") is a continually pretrained large language model designed to bridge the resource gap between Hungarian and high-resource languages. It employs parameter-efficient continual pretraining via Low-Rank Adaptation (LoRA) on a Qwen3-4B (reasoning/instruct) backbone.

The model was trained on a mixture of 160B tokens (44% Hungarian, 24% English, 21% German, 11% code) on the Komondor HPC. To better match the training distribution, Racka uses an adapted tokenizer with substantially improved tokenization fertility for Hungarian that remains competitive for English and German.
## Model Details
- Developed by: ELTE Faculty of Humanities (Dept. of Digital Humanities) & ELTE Faculty of Informatics (Dept. of Artificial Intelligence)
- Backbone Model: Qwen/Qwen3-4B (Reasoning/Instruct version)
- Language(s): Hungarian (primary), English, German, Code
- License: cc-by-nc-sa-4.0
- Architecture: Transformer with LoRA adapters (Rank=64, Alpha=128)
- Training Context Length: 4,096 tokens (with sequence packing)
- Context Length (Inference): 32,768 natively and 131,072 tokens with YaRN
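The effect of the LoRA adapters above (Rank=64, Alpha=128) can be illustrated with a minimal NumPy sketch; the matrix dimensions and initialization here are illustrative assumptions, not the model's actual shapes:

```python
import numpy as np

# Minimal sketch of a LoRA update with the card's hyperparameters:
# W_eff = W + (alpha / r) * B @ A, with r=64 and alpha=128 (scaling = 2.0).
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 256, 256, 64, 128  # toy dimensions, not the model's

W = rng.normal(size=(d_out, d_in))      # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                # zero-initialized, standard LoRA practice

W_eff = W + (alpha / r) * (B @ A)
print(np.allclose(W_eff, W))  # True: with B at zero, the adapter is a no-op at init
```

Only `A` and `B` (64 rows/columns each) are trained, which is what makes continual pretraining of a 4B-parameter backbone feasible on the hardware described below.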
## Usage

### Hugging Face Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name = "elte-nlp/Racka-4B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful Hungarian assistant."},
    {"role": "user", "content": "Magyarázd el a gépi tanulás lényegét óvodásoknak egy mondatban!"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.6,
    top_p=0.8,
    top_k=50,
    repetition_penalty=1.1,
    presence_penalty=1.1,
)

# conduct text completion
generated_ids = model.generate(
    input_ids=model_inputs["input_ids"],
    attention_mask=model_inputs["attention_mask"],
    max_new_tokens=32768,
    generation_config=generation_config
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parse the thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```
### vLLM

```bash
vllm serve elte-nlp/Racka-4B \
  --tokenizer elte-nlp/Racka-4B \
  --dtype float16 \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
  --max-model-len 131072 \
  --reasoning-parser qwen3
```
## Technical Details

### Training Data
The model was trained on a 160B token corpus designed to mitigate catastrophic forgetting via data replay:
| Language | BPE Tokens | Ratio | Sources |
|---|---|---|---|
| Hungarian | ~70B | 44% | Common Crawl (heavily filtered), News, Wikipedia, Court Rulings, Subtitles, Academic Repositories. |
| English | ~38B | 24% | The Pile, FineWeb. |
| German | ~34B | 21% | Occiglot-FineWeb. |
| Code | ~18B | 11% | The Stack v2. |
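As a quick sanity check, the mixture percentages follow directly from the approximate per-language token counts in the table:

```python
# Approximate per-language token counts (in billions) from the table above.
tokens = {"Hungarian": 70, "English": 38, "German": 34, "Code": 18}
total = sum(tokens.values())  # 160 (the 160B-token corpus)

# Rounded shares reproduce the Ratio column: 44 / 24 / 21 / 11.
ratios = {lang: round(100 * n / total) for lang, n in tokens.items()}
print(total, ratios)
```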
### Tokenizer Adaptation

The vocabulary was extended by 32,000 new Hungarian tokens initialized via VIPI (Vocabulary Initialization with Partial Inheritance). This reduced Hungarian subword fertility by ~47%, which translates into a roughly proportional reduction in processing time and context usage for Hungarian text.
| Language | Qwen3-4B Fertility | Racka-4B Fertility | Change |
|---|---|---|---|
| Hungarian | 3.13 | 1.66 | -46.96% |
| English | 1.57 | 1.94 | +23.44% |
| German | 2.05 | 2.31 | +12.62% |
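Fertility here is the average number of subword tokens produced per word. A minimal sketch of how such a figure can be computed, using a toy tokenizer in place of the real ones (the whitespace word split and the toy 3-character tokenizer are simplifying assumptions; the paper's exact protocol may differ):

```python
def fertility(tokenize, texts):
    """Average number of subword tokens per whitespace-separated word."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

# Toy stand-in for a real tokenizer (e.g. AutoTokenizer.tokenize):
# splits every word into chunks of at most 3 characters.
def toy_tokenize(text):
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]

print(fertility(toy_tokenize, ["alma fa"]))  # 1.5  (3 tokens over 2 words)
```

With a real tokenizer, lower fertility on Hungarian means fewer tokens per document, hence faster inference and more effective context.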
### Training Configuration
- Infrastructure: Komondor HPC (64 NVIDIA A100 40GB).
- Training time: 287 hours (total GPU time: 2.1 years)
- Strategy: Distributed Data Parallel (DDP).
- Parameters:
  - LoRA Rank: 64, Alpha: 128, Dropout: 0.1.
  - Learning Rate: (LoRA), (Non-LoRA).
  - Batch Size: 2 per GPU (Effective batch size: 512).
  - Steps: 326,357.
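As a sanity check on the batch-size figures above: with 2 samples per GPU on 64 GPUs, an effective batch size of 512 implies a gradient-accumulation factor of 4 (the accumulation value itself is inferred here, not stated in the card):

```python
per_gpu_batch = 2   # samples per GPU, from the card
n_gpus = 64         # Komondor A100 nodes used for DDP
grad_accum = 4      # inferred: 512 / (2 * 64); an assumption, not documented

effective_batch = per_gpu_batch * n_gpus * grad_accum
print(effective_batch)  # 512
```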
## Evaluation
The following tables present the performance of Racka-4B compared to its base models (Qwen3-4B and Qwen3-4B-Base) and the SOTA 8B Hungarian model PULI-LlumiX-Llama-3.1 8B.
### 1. HULU Benchmark (Fine-tuned)

Performance on the Hungarian Language Understanding (HULU) benchmark suite. Results are averaged over multiple runs, keeping the better of LoRA and full fine-tuning for each dataset.
| Dataset | Metric | Qwen3-4B | Racka-4B | Qwen3-4B-Base | PULI-LlumiX-Llama-3.1 8B |
|---|---|---|---|---|---|
| HuCOLA | ACC | 0.8109 | 0.8624 | 0.8254 | 0.8989 |
| | MCC | 0.3482 | 0.5657 | 0.4044 | 0.6920 |
| | F1 | 0.7840 | 0.8563 | 0.8027 | 0.8969 |
| HuCOPA | ACC | 0.5589 | 0.7990 | 0.5845 | 0.9359 |
| | MCC | 0.1181 | 0.5998 | 0.1705 | 0.8720 |
| | F1 | 0.5584 | 0.7988 | 0.5837 | 0.9359 |
| HuSST | ACC | 0.7517 | 0.7603 | 0.7539 | 0.7804 |
| | MCC | 0.5022 | 0.5137 | 0.5082 | 0.5598 |
| | F1 | 0.7433 | 0.7511 | 0.7513 | 0.7698 |
| HuRTE | ACC | 0.9078 | 0.8790 | 0.8872 | 0.8979 |
| | MCC | 0.8142 | 0.7553 | 0.7719 | 0.7936 |
| | F1 | 0.9078 | 0.8790 | 0.8872 | 0.8977 |
| HuWNLI | ACC | 0.5033 | 0.5666 | 0.5366 | 0.3800 |
| | MCC | -0.0980 | 0.1031 | -0.0600 | -0.2815 |
| | F1 | 0.3862 | 0.4548 | 0.4069 | 0.3668 |
| HuCB | ACC | 0.7378 | 0.6388 | 0.6291 | 0.4854 |
| | MCC | 0.6078 | 0.4741 | 0.4733 | 0.2742 |
| | F1 | 0.7316 | 0.6373 | 0.6112 | 0.4594 |
| Overall | Avg ACC | 0.711 | 0.751 | 0.702 | 0.729 |
| | Avg MCC | 0.382 | 0.502 | 0.378 | 0.485 |
| | Avg F1 | 0.685 | 0.7295 | 0.673 | 0.721 |
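For reference, the Overall rows are plain averages of the six per-dataset scores; Racka-4B's accuracy column, for example:

```python
# Racka-4B per-dataset accuracies from the HULU table above.
racka_acc = [0.8624, 0.7990, 0.7603, 0.8790, 0.5666, 0.6388]

avg = sum(racka_acc) / len(racka_acc)
print(round(avg, 3))  # 0.751, matching the Overall Avg ACC row
```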
### 2. OpenHuEval
Evaluation on Hungarian reading comprehension, generation, and reasoning tasks. Qwen and Racka models use a patched implementation of OpenHuEval for compatibility.
| Metric | Qwen3-4B | Racka-4B | Qwen3-4B-Base | PULI-LlumiX 8B |
|---|---|---|---|---|
| HuWildBench (WBScore) | 63.03 | 57.17 | 52.59 | 17.77 |
| HuSimpleQA (Acc) | 7.30 | 10.05 | 5.90 | 20.03 |
| HuProverbRea (Acc OE) | 62.47 | 61.94 | 41.15 | 75.86 |
| HuProverbRea (Acc 2CQ) | 74.98 | 77.53 | 0.00 | 77.36 |
| HuMatchingFIB (B Acc) | 39.59 | 38.93 | 42.30 | 33.54 |
| HuMatchingFIB (Q Acc) | 5.94 | 4.68 | 5.58 | 3.96 |
| HuStandardFIB (B Acc) | 13.20 | 18.98 | 0.00 | 29.16 |
| HuStandardFIB (Q Acc) | 1.08 | 2.15 | 0.00 | 2.15 |
| Overall | 33.44 | 33.93 | 18.44 | 32.47 |
### 3. LM-Eval-Harness (Hungarian)
Few-shot evaluation on standard benchmarks translated to Hungarian. Best results are kept (with chat template for Racka-4B and without for others).
| Dataset (Metric) | Qwen3-4B | Racka-4B | Qwen3-4B-Base | PULI-LlumiX 8B |
|---|---|---|---|---|
| Arc_hu (Acc) | 0.3202 | 0.3450 | 0.3792 | 0.3861 |
| Arc_hu (Acc_norm) | 0.3844 | 0.4101 | 0.4169 | 0.4323 |
| Hellaswag_hu (Acc) | 0.3369 | 0.3656 | 0.3610 | 0.4241 |
| Hellaswag_hu (Acc_norm) | 0.4095 | 0.4510 | 0.4557 | 0.5606 |
| MMLU_hu (Acc) | 0.5427 | 0.5378 | 0.5965 | 0.5310 |
| TruthfulQA_hu_mc1 (Acc) | 0.3177 | 0.3644 | 0.3281 | 0.3035 |
| TruthfulQA_hu_mc2 (Acc) | 0.5102 | 0.5493 | 0.5045 | 0.4883 |
| GSM8K_hu (Strict-match) | 0.6330 | 0.5299 | 0.6398 | 0.4761 |
| GSM8K_hu (Flexible extract) | 0.6285 | 0.5329 | 0.6421 | 0.4791 |
| Overall | 0.453 | 0.454 | 0.4805 | 0.4546 |
## Limitations

- The model remains capable of instruction-following chat and English reasoning under the original Qwen settings; this capability is carried over from the backbone, with no direct training targeting it.
- The model has not undergone safety alignment and should not be exposed to end users.
- The model may be used for research purposes only; commercial or for-profit use is not permitted.
## Team
In alphabetical order:
- Zsolt Csibi (ELTE-IK, AI Dept.)
- Bence Gortka (ELTE-BTK, DH-Lab)
- Natabara Gyöngyössy (ELTE-IK, AI Dept.)
- Kornél Nagy (ELTE-BTK, DH-Lab)
- Dávid Nemeskey (ELTE-BTK, DH-Lab)
- Gábor Palkó (ELTE-BTK, DH-Lab)
- Martin Sallai (ELTE-BTK, DH-Lab)
- András Simonyi (ELTE-IK, AI Dept.)
- András Szekeres (ELTE-BTK, DH-Lab)
## Acknowledgements
We acknowledge the Digital Government Development and Project Management Ltd. for awarding us access to the Komondor HPC facility based in Hungary.
This research was supported by the EKÖP-24 University Excellence Scholarship Program of the Ministry for Culture and Innovation, funded by the National Research, Development and Innovation Fund.
The authors acknowledge the support of the National Laboratory for Digital Heritage. Project no. 2022-2.1.1-NL-2022-00009 has been implemented with the support provided by the Ministry of Culture and Innovation of Hungary from the National Research, Development and Innovation Fund, financed under the 2022-2.1.1-NL funding scheme.
We would like to thank Levente Szabados for the name idea and initial informal discussions.
## Citation

```bibtex
@inproceedings{racka2026,
  title     = {Racka: Efficient Hungarian LLM Adaptation on Academic Infrastructure},
  author    = {Csibi, Zsolt and Gortka, Bence Gy\"orgy and Nagy, Korn\'el and Nemeskey, D\'avid M\'ark and Sallai, Martin and Simonyi, Andr\'as and Szekeres, Andr\'as M\'ark and Palk\'o, G\'abor},
  booktitle = {Proceedings of the XXII. Hungarian Computational Linguistics Conference},
  year      = {2026}
}
```