PULI Trio Q 7B base (7.62 billion parameters)
- Trained with LLaMA-Factory (GitHub)
- The Qwen2.5 7B Instruct model was continually pretrained on a Hungarian dataset
Dataset for continued pretraining
Hungarian (8.7 billion words): documents (763K) that exceed 5000 words in length + Hungarian Wikipedia and news
English: Long Context QA (1 billion words), BookSum (42 million words)
Chinese (3 billion Chinese characters): Wudao
The training was completed using a Hungarian-only dataset:
- 626 million Hungarian words (1 epoch): Hungarian Wikipedia + News articles
Limitations
- max_seq_length = 32768
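The limit above applies to the whole sequence, so a long prompt has to leave room for the tokens that will be generated. A minimal sketch of trimming a tokenized prompt to fit, assuming the window covers prompt plus generated tokens and that keeping the most recent tokens is acceptable. The helper name `fit_to_context` is hypothetical and not part of the model's or any library's API:

```python
# Context window stated in the Limitations section of this card.
MAX_SEQ_LENGTH = 32768


def fit_to_context(token_ids, max_new_tokens, max_seq_length=MAX_SEQ_LENGTH):
    """Trim a tokenized prompt so prompt + generated tokens fit the window.

    Hypothetical helper: it keeps the *last* tokens of the prompt, which is
    an assumption; keeping the head may suit other tasks better.
    """
    if max_new_tokens < 0 or max_new_tokens >= max_seq_length:
        raise ValueError("max_new_tokens must be in [0, max_seq_length)")
    # Number of prompt tokens that still leaves room for generation.
    budget = max_seq_length - max_new_tokens
    if len(token_ids) > budget:
        return list(token_ids[-budget:])
    return list(token_ids)
```

With `max_new_tokens=30`, as in the usage example on this card, the prompt budget comes out to 32768 - 30 = 32738 tokens.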
Usage with pipeline
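from transformers import pipeline, Qwen2ForCausalLM, AutoTokenizer
# Load the model weights and tokenizer from the Hugging Face Hub
model = Qwen2ForCausalLM.from_pretrained("NYTK/PULI-Trio-Q")
tokenizer = AutoTokenizer.from_pretrained("NYTK/PULI-Trio-Q")
# Hungarian prompt: "I will tell a story about language technology."
prompt = "Elmesélek egy történetet a nyelvtechnológiáról."
# device=0 selects the first CUDA GPU
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer, device=0)
# generated_text contains the prompt followed by the continuation
print(generator(prompt, max_new_tokens=30)[0]["generated_text"])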
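By default the text-generation pipeline returns the prompt together with the continuation in `generated_text`. If only the newly generated part is needed, it can be separated afterwards. A small sketch, assuming the output begins with the prompt verbatim. The function name `continuation_only` is hypothetical:

```python
def continuation_only(prompt, generated_text):
    """Return the part of generated_text that follows the prompt.

    Hypothetical helper: if the output does not start with the prompt
    (for example because the text was normalized), it is returned unchanged.
    """
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text
```

With the names from the usage example, this would be called as `continuation_only(prompt, generator(prompt, max_new_tokens=30)[0]["generated_text"])`. Passing `return_full_text=False` to the pipeline call is an alternative that should give a similar result without post-processing.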
Citation
If you use this model, please cite the following paper:
@article{chatpuli,
title={ChatPULI: Enhancement to the first Hungarian conversational model},
author={Yang, Zijian Győző and Bánfi, Ágnes and Dodé, Réka and Ferenczi, Gergő and Földesi, Flóra and Hatvani, Péter and Héja, Enikő and Lengyel, Mariann and Madarász, Gábor and Osváth, Mátyás and Sárossy, Bence and Varga, Kristóf and Váradi, Tamás and Prószéky, Gábor and Ligeti-Nagy, Noémi},
journal={Annales Mathematicae et Informaticae},
doi = {https://doi.org/10.33039/ami.2025.10.010},
url = {https://ami.uni-eszterhazy.hu},
year={2025},
volume={61},
pages={261--274}
}