---
extra_gated_heading: "Access Request for Research-Only Model"
extra_gated_description: "Please provide your professional details and acknowledge the terms of use to request access."
extra_gated_button_content: "Submit Request"
extra_gated_prompt: "By requesting access, you acknowledge that this model is provided solely for research purposes, is offered 'as-is' without any guarantees, and cannot be utilized for for-profit tasks or commercial applications."
extra_gated_fields:
  Full Name: text
  Company / Institution: text
  Role: text
  Intended Use Case: text
  I would like to receive news about the models, publications and events of the research group in Hungarian:
    type: select
    options:
      - "Yes"
      - "No"
  I acknowledge that this model is for research-only, comes with no guarantee, and cannot be used for for-profit tasks: checkbox
language:
  - hu
  - de
  - en
base_model:
  - Qwen/Qwen3-4B
pipeline_tag: text-generation
license: cc-by-nc-sa-4.0
model-index:
  - name: Racka-4B
    results:
      - task:
          type: text-generation
        dataset:
          type: HULU
          name: HuCOLA
        metrics:
          - name: ACC
            type: accuracy
            value: 0.8624
            verified: false
          - name: MCC
            type: mcc
            value: 0.5657
            verified: false
          - name: F1
            type: f1
            value: 0.8563
            verified: false
      - task:
          type: text-generation
        dataset:
          type: HULU
          name: HuCOPA
        metrics:
          - name: ACC
            type: accuracy
            value: 0.7990
            verified: false
          - name: MCC
            type: mcc
            value: 0.5998
            verified: false
          - name: F1
            type: f1
            value: 0.7988
            verified: false
      - task:
          type: text-generation
        dataset:
          type: HULU
          name: HuSST
        metrics:
          - name: ACC
            type: accuracy
            value: 0.7603
            verified: false
          - name: MCC
            type: mcc
            value: 0.5137
            verified: false
          - name: F1
            type: f1
            value: 0.7511
            verified: false
      - task:
          type: text-generation
        dataset:
          type: HULU
          name: HuRTE
        metrics:
          - name: ACC
            type: accuracy
            value: 0.8790
            verified: false
          - name: MCC
            type: mcc
            value: 0.7553
            verified: false
          - name: F1
            type: f1
            value: 0.8790
            verified: false
      - task:
          type: text-generation
        dataset:
          type: HULU
          name: HuWNLI
        metrics:
          - name: ACC
            type: accuracy
            value: 0.5666
            verified: false
          - name: MCC
            type: mcc
            value: 0.1031
            verified: false
          - name: F1
            type: f1
            value: 0.4548
            verified: false
      - task:
          type: text-generation
        dataset:
          type: HULU
          name: HuCB
        metrics:
          - name: ACC
            type: accuracy
            value: 0.6388
            verified: false
          - name: MCC
            type: mcc
            value: 0.4741
            verified: false
          - name: F1
            type: f1
            value: 0.6373
            verified: false
      - task:
          type: text-generation
        dataset:
          type: OpenHuEval
          name: HuWildBench
        metrics:
          - name: WBScore
            type: score
            value: 57.17
            verified: false
      - task:
          type: text-generation
        dataset:
          type: OpenHuEval
          name: HuSimpleQA
        metrics:
          - name: Acc
            type: accuracy
            value: 10.05
            verified: false
      - task:
          type: text-generation
        dataset:
          type: OpenHuEval
          name: HuProverbRea (OE)
        metrics:
          - name: Acc
            type: accuracy
            value: 61.94
            verified: false
      - task:
          type: text-generation
        dataset:
          type: OpenHuEval
          name: HuProverbRea (2CQ)
        metrics:
          - name: Acc
            type: accuracy
            value: 77.53
            verified: false
      - task:
          type: text-generation
        dataset:
          type: LM-Eval-Harness-HU
          name: Arc_hu
        metrics:
          - name: Acc_norm
            type: accuracy
            value: 0.4101
            verified: false
      - task:
          type: text-generation
        dataset:
          type: LM-Eval-Harness-HU
          name: Hellaswag_hu
        metrics:
          - name: Acc_norm
            type: accuracy
            value: 0.4510
            verified: false
      - task:
          type: text-generation
        dataset:
          type: LM-Eval-Harness-HU
          name: MMLU_hu
        metrics:
          - name: Acc
            type: accuracy
            value: 0.5378
            verified: false
      - task:
          type: text-generation
        dataset:
          type: LM-Eval-Harness-HU
          name: TruthfulQA_hu_mc2
        metrics:
          - name: Acc
            type: accuracy
            value: 0.5493
            verified: false
      - task:
          type: text-generation
        dataset:
          type: LM-Eval-Harness-HU
          name: GSM8K_hu
        metrics:
          - name: Flexible-extract
            type: accuracy
            value: 0.5329
            verified: false
          - name: Strict-match
            type: accuracy
            value: 0.5299
            verified: false
---
|
|
|
|
|
# Racka-4B Model Card |
|
|
|
|
|
<div style="display:flex; align-items:center; gap:12px;"> <img src="https://cdn-uploads.huggingface.co/production/uploads/640edc40208821a59b710e84/KiAGUcITdOXG5gVhn55yS.png" alt="Racka icon" width="100" height="100" style="flex:0 0 auto;"> <h1 style="margin:0;">Racka</h1> </div> |
|
|
|
|
|
**Racka** (*Regionális Adatokon Célzottan Kialakított Alapmodell*, roughly: a foundation model purpose-built on regional data) is a continually pretrained large language model designed to bridge the resource gap between Hungarian and high-resource languages. It employs parameter-efficient continual pretraining via Low-Rank Adaptation (LoRA) on a **Qwen3-4B** (reasoning/instruct) backbone.
|
|
|
|
|
The model was trained on a mixture of **160B tokens** (44% Hungarian, 24% English, 21% German, 11% Code) on the Komondor HPC. To better match the training distribution, Racka uses an adapted tokenizer that substantially reduces tokenization fertility for Hungarian (i.e., it needs far fewer subword tokens per word) while remaining competitive in English and German.
|
|
|
|
|
## Model Details |
|
|
|
|
|
* **Developed by:** ELTE Faculty of Humanities (Dept. of Digital Humanities) & ELTE Faculty of Informatics (Dept. of Artificial Intelligence) |
|
|
* **Backbone Model:** [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) (Reasoning/Instruct version) |
|
|
* **Language(s):** Hungarian (primary), English, German, Code |
|
|
* **License:** cc-by-nc-sa-4.0 |
|
|
* **Architecture:** Transformer with LoRA adapters (Rank=64, Alpha=128) |
|
|
* **Training Context Length:** 4,096 tokens (with sequence packing) |
|
|
* **Context Length (Inference):** 32,768 tokens natively; 131,072 tokens with YaRN rope scaling (see the example under Usage)
|
|
|
|
|
## Usage |
|
|
|
|
|
### Hugging Face Transformers |
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name = "elte-nlp/Racka-4B"

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Build the prompt with the chat template
messages = [
    {"role": "system", "content": "You are a helpful Hungarian assistant."},
    {"role": "user", "content": "Magyarázd el a gépi tanulás lényegét óvodásoknak egy mondatban!"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # switches between thinking and non-thinking modes (default: True)
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generation_config = GenerationConfig(
    do_sample=True,
    temperature=0.6,
    top_p=0.8,
    top_k=50,
    repetition_penalty=1.1,
)

# Conduct text completion
generated_ids = model.generate(
    input_ids=model_inputs["input_ids"],
    attention_mask=model_inputs["attention_mask"],
    max_new_tokens=32768,
    generation_config=generation_config
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Split the reasoning trace from the final answer
try:
    # rindex of token id 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    # No </think> token found: treat the whole output as the answer
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```
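The native context window is 32,768 tokens. For longer inputs (up to 131,072 tokens), YaRN rope scaling can be enabled at load time. A minimal sketch, assuming the standard Hugging Face pattern of forwarding a `rope_scaling` override through `from_pretrained` (mirroring the vLLM flags below):

```python
from transformers import AutoModelForCausalLM

# YaRN scaling: 4x the native 32,768-token window -> 131,072 tokens.
# Keyword arguments not consumed by from_pretrained are applied to the model config.
model = AutoModelForCausalLM.from_pretrained(
    "elte-nlp/Racka-4B",
    torch_dtype="auto",
    device_map="auto",
    rope_scaling={
        "rope_type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
    },
)
```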
|
|
|
|
|
### vLLM |
|
|
|
|
|
```bash
vllm serve elte-nlp/Racka-4B \
    --tokenizer elte-nlp/Racka-4B \
    --dtype float16 \
    --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
    --max-model-len 131072 \
    --reasoning-parser qwen3
```
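The server exposes an OpenAI-compatible API (on port 8000 by default). A minimal client sketch; the base URL and placeholder API key are vLLM defaults rather than anything specific to this model:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; the key is required by the client but unused locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="elte-nlp/Racka-4B",
    messages=[{"role": "user", "content": "Mi a gépi tanulás? Válaszolj röviden!"}],
    temperature=0.6,
    top_p=0.8,
)
print(response.choices[0].message.content)
```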
|
|
|
|
|
## Technical Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model was trained on a 160B token corpus designed to mitigate catastrophic forgetting via data replay: |
|
|
|
|
|
| Language | BPE Tokens | Ratio | Sources |
| :--- | :--- | :--- | :--- |
| **Hungarian** | ~70B | 44% | Common Crawl (heavily filtered), News, Wikipedia, Court Rulings, Subtitles, Academic Repositories. |
| **English** | ~38B | 24% | The Pile, FineWeb. |
| **German** | ~34B | 21% | Occiglot-FineWeb. |
| **Code** | ~18B | 11% | The Stack v2. |
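To illustrate how such a replay mixture behaves, the sketch below samples training documents according to the ratios above. It is a toy illustration, not the project's actual data pipeline, and the source names are placeholders:

```python
import random

# Replay mixture ratios from the table above.
MIXTURE = {"hungarian": 0.44, "english": 0.24, "german": 0.21, "code": 0.11}

def sample_source(rng: random.Random) -> str:
    """Pick the corpus for the next training document according to the mixture."""
    r, cumulative = rng.random(), 0.0
    for source, ratio in MIXTURE.items():
        cumulative += ratio
        if r < cumulative:
            return source
    return "hungarian"  # guard against floating-point rounding

rng = random.Random(42)
counts = {source: 0 for source in MIXTURE}
for _ in range(100_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly 44k / 24k / 21k / 11k
```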
|
|
|
|
|
### Tokenizer Adaptation |
|
|
|
|
|
The vocabulary was extended with **32,000** new Hungarian tokens initialized via VIPI (Vocabulary Initialization with Partial Inheritance). This reduced Hungarian subword fertility by **~47%**, which translates into a proportional reduction in processing time for Hungarian text.
|
|
|
|
|
| Language | Qwen3-4B Fertility | Racka-4B Fertility | Change |
| :--- | :--- | :--- | :--- |
| **Hungarian** | 3.13 | **1.66** | **-46.96%** |
| English | 1.57 | 1.94 | +23.44% |
| German | 2.05 | 2.31 | +12.62% |
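Fertility here is the average number of subword tokens per word. A minimal sketch for reproducing the comparison; whitespace splitting and the single example sentence are simplifications of the actual evaluation setup:

```python
from transformers import AutoTokenizer

def fertility(tokenizer, texts):
    """Average number of subword tokens per whitespace-separated word."""
    tokens = sum(len(tokenizer(t, add_special_tokens=False)["input_ids"]) for t in texts)
    words = sum(len(t.split()) for t in texts)
    return tokens / words

texts = ["A gépi tanulás a mesterséges intelligencia egyik részterülete."]
for name in ("Qwen/Qwen3-4B", "elte-nlp/Racka-4B"):
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {fertility(tok, texts):.2f}")
```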
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
* **Infrastructure:** Komondor HPC (64 × NVIDIA A100 40GB).
* **Training time:** 287 hours (≈2.1 GPU-years of total GPU time).
* **Strategy:** Distributed Data Parallel (DDP).
* **Parameters** (see the configuration sketch below):
  * LoRA Rank: 64, Alpha: 128, Dropout: 0.1.
  * Learning Rate: \\( 1\times10^{-4} \\) (LoRA), \\( 5\times10^{-5} \\) (non-LoRA).
  * Batch Size: 2 per GPU (effective batch size: 512, i.e., 2 × 64 GPUs × 4 gradient-accumulation steps).
  * Steps: 326,357.
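For reference, a minimal PEFT configuration matching the hyperparameters above. The `target_modules` list is an assumption (the attention and MLP projections typically adapted in Qwen-style models); the card does not specify which modules were adapted:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_config = LoraConfig(
    r=64,            # LoRA rank
    lora_alpha=128,  # scaling factor
    lora_dropout=0.1,
    # Assumed targets: the projections usually adapted in Qwen-style blocks.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", torch_dtype="auto")
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```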
|
|
|
|
|
## Evaluation |
|
|
|
|
|
The following tables present the performance of **Racka-4B** compared to its base models (**Qwen3-4B** and **Qwen3-4B-Base**) and the SOTA 8B Hungarian model **PULI-LlumiX-Llama-3.1 8B**. |
|
|
|
|
|
### 1. HULU Benchmark (Fine-tuned) |
|
|
|
|
|
Performance on the Hungarian Language Understanding (HULU) benchmark suite. Results are averaged over multiple runs, and for each task the better of LoRA and full fine-tuning is reported.
|
|
|
|
|
| Dataset | Metric | Qwen3-4B | **Racka-4B** | Qwen3-4B-Base | PULI-LlumiX-Llama-3.1 8B |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **HuCOLA** | ACC | 0.8109 | 0.8624 | 0.8254 | **0.8989** |
| | MCC | 0.3482 | 0.5657 | 0.4044 | **0.6920** |
| | F1 | 0.7840 | 0.8563 | 0.8027 | **0.8969** |
| **HuCOPA** | ACC | 0.5589 | 0.7990 | 0.5845 | **0.9359** |
| | MCC | 0.1181 | 0.5998 | 0.1705 | **0.8720** |
| | F1 | 0.5584 | 0.7988 | 0.5837 | **0.9359** |
| **HuSST** | ACC | 0.7517 | 0.7603 | 0.7539 | **0.7804** |
| | MCC | 0.5022 | 0.5137 | 0.5082 | **0.5598** |
| | F1 | 0.7433 | 0.7511 | 0.7513 | **0.7698** |
| **HuRTE** | ACC | **0.9078** | 0.8790 | 0.8872 | 0.8979 |
| | MCC | **0.8142** | 0.7553 | 0.7719 | 0.7936 |
| | F1 | **0.9078** | 0.8790 | 0.8872 | 0.8977 |
| **HuWNLI** | ACC | 0.5033 | **0.5666** | 0.5366 | 0.3800 |
| | MCC | -0.0980 | **0.1031** | -0.0600 | -0.2815 |
| | F1 | 0.3862 | **0.4548** | 0.4069 | 0.3668 |
| **HuCB** | ACC | **0.7378** | 0.6388 | 0.6291 | 0.4854 |
| | MCC | **0.6078** | 0.4741 | 0.4733 | 0.2742 |
| | F1 | **0.7316** | 0.6373 | 0.6112 | 0.4594 |
| **Overall** | Avg ACC | 0.711 | **0.751** | 0.702 | 0.729 |
| | Avg MCC | 0.382 | **0.502** | 0.378 | 0.485 |
| | Avg F1 | 0.685 | **0.730** | 0.673 | 0.721 |
|
|
|
|
|
--- |
|
|
|
|
|
### 2. OpenHuEval |
|
|
|
|
|
Evaluation on Hungarian reading comprehension, generation, and reasoning tasks. Qwen and Racka models use a patched implementation of OpenHuEval for compatibility. |
|
|
|
|
|
| Metric | Qwen3-4B | **Racka-4B** | Qwen3-4B-Base | PULI-LlumiX 8B |
| :--- | :--- | :--- | :--- | :--- |
| **HuWildBench** (WBScore) | **63.03** | 57.17 | 52.59 | 17.77 |
| **HuSimpleQA** (Acc) | 7.30 | 10.05 | 5.90 | **20.03** |
| **HuProverbRea** (Acc OE) | 62.47 | 61.94 | 41.15 | **75.86** |
| **HuProverbRea** (Acc 2CQ) | 74.98 | **77.53** | 0.00 | 77.36 |
| **HuMatchingFIB** (B Acc) | 39.59 | 38.93 | **42.30** | 33.54 |
| **HuMatchingFIB** (Q Acc) | **5.94** | 4.68 | 5.58 | 3.96 |
| **HuStandardFIB** (B Acc) | 13.20 | 18.98 | 0.00 | **29.16** |
| **HuStandardFIB** (Q Acc) | 1.08 | **2.15** | 0.00 | **2.15** |
| **Overall** | 33.44 | **33.93** | 18.44 | 32.47 |
|
|
|
|
|
--- |
|
|
|
|
|
### 3. LM-Eval-Harness (Hungarian) |
|
|
|
|
|
Few-shot evaluation on standard benchmarks translated to Hungarian. The best-performing setup is reported (with chat template for Racka-4B, without it for the other models).
|
|
|
|
|
| Dataset (Metric) | Qwen3-4B | **Racka-4B** | Qwen3-4B-Base | PULI-LlumiX 8B |
| :--- | :--- | :--- | :--- | :--- |
| **Arc_hu** (Acc) | 0.3202 | 0.3450 | 0.3792 | **0.3861** |
| **Arc_hu** (Acc_norm) | 0.3844 | 0.4101 | 0.4169 | **0.4323** |
| **Hellaswag_hu** (Acc) | 0.3369 | 0.3656 | 0.3610 | **0.4241** |
| **Hellaswag_hu** (Acc_norm) | 0.4095 | 0.4510 | 0.4557 | **0.5606** |
| **MMLU_hu** (Acc) | 0.5427 | 0.5378 | **0.5965** | 0.5310 |
| **TruthfulQA_hu_mc1** (Acc) | 0.3177 | **0.3644** | 0.3281 | 0.3035 |
| **TruthfulQA_hu_mc2** (Acc) | 0.5102 | **0.5493** | 0.5045 | 0.4883 |
| **GSM8K_hu** (Strict-match) | 0.6330 | 0.5299 | **0.6398** | 0.4761 |
| **GSM8K_hu** (Flexible-extract) | 0.6285 | 0.5329 | **0.6421** | 0.4791 |
| **Overall** | 0.453 | 0.454 | **0.4805** | 0.4546 |
|
|
|
|
|
--- |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- The model retains instruction-following chat and English reasoning capabilities under the original Qwen settings; these abilities carry over from the backbone and were not directly targeted during training.
- The model has not undergone safety alignment and is not suitable for deployment to end users.
- This model may only be used for research purposes; commercial or for-profit use is not permitted.
|
|
|
|
|
## Team |
|
|
|
|
|
In alphabetical order: |
|
|
|
|
|
- Zsolt Csibi (ELTE-IK, AI Dept.) |
|
|
- Bence Gortka (ELTE-BTK, DH-Lab) |
|
|
- Natabara Gyöngyössy (ELTE-IK, AI Dept.) |
|
|
- Kornél Nagy (ELTE-BTK, DH-Lab) |
|
|
- Dávid Nemeskey (ELTE-BTK, DH-Lab) |
|
|
- Gábor Palkó (ELTE-BTK, DH-Lab) |
|
|
- Martin Sallai (ELTE-BTK, DH-Lab) |
|
|
- András Simonyi (ELTE-IK, AI Dept.) |
|
|
- András Szekeres (ELTE-BTK, DH-Lab) |
|
|
|
|
|
|
|
|
|
|
|
## Acknowledgements |
|
|
|
|
|
We acknowledge the Digital Government Development and Project Management Ltd. for awarding us access to the Komondor HPC facility based in Hungary. |
|
|
|
|
|
This research was supported by the EKÖP-24 University Excellence Scholarship Program of the Ministry for Culture and Innovation, funded by the National Research, Development and Innovation Fund. |
|
|
|
|
|
The authors acknowledge the support of the National Laboratory for Digital Heritage. Project no. 2022-2.1.1-NL-2022-00009 has been implemented with the support provided by the Ministry of Culture and Innovation of Hungary from the National Research, Development and Innovation Fund, financed under the 2022-2.1.1-NL funding scheme. |
|
|
|
|
|
We would like to thank Levente Szabados for the name idea and initial informal discussions. |
|
|
|
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@inproceedings{racka2026,
  title={Racka: Efficient Hungarian LLM Adaptation on Academic Infrastructure},
  author={Csibi, Zsolt and Gortka, Bence Gy\"orgy and Nagy, Korn\'el and Nemeskey, D\'avid M\'ark and Sallai, Martin and Simonyi, Andr\'as and Szekeres, Andr\'as M\'ark and Palk\'o, G\'abor},
  booktitle={Proceedings of the XXII. Hungarian Computational Linguistics Conference},
  year={2026}
}
|
|
``` |