PULI Trio Q 7B base (7.62 billion parameters)

Dataset for continued pretraining

  • Hungarian (8.7 billion words): 763K documents longer than 5,000 words each, plus Hungarian Wikipedia and news articles

  • English: Long Context QA (1 billion words), BookSum (42 million words)

  • Chinese (3 billion Chinese characters): Wudao

  • Training was then completed on a Hungarian-only dataset:

    • 626 million Hungarian words (1 epoch): Hungarian Wikipedia + News articles

Limitations

  • max_seq_length = 32 768
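
Since the context window is limited to 32,768 tokens, the prompt and the generated continuation have to fit into it together. The following is a minimal sketch of one way to budget for that, assuming the limit counts prompt tokens plus newly generated tokens (the usual convention for causal language models). The helper name `fit_prompt` is hypothetical and not part of the model or of `transformers`.

```python
MAX_SEQ_LENGTH = 32_768  # context limit stated in this model card


def fit_prompt(token_ids, max_new_tokens, max_seq_length=MAX_SEQ_LENGTH):
    """Keep the most recent tokens so that prompt + generation fits the window.

    Hypothetical helper for illustration only.
    """
    if not 0 < max_new_tokens < max_seq_length:
        raise ValueError("max_new_tokens must be between 1 and max_seq_length - 1")
    budget = max_seq_length - max_new_tokens
    # Left-truncate: drop the oldest tokens and keep the end of the prompt.
    return list(token_ids[-budget:])
```

The token IDs would typically come from the tokenizer, for example `tokenizer(prompt)["input_ids"]`. Truncating from the left is only one possible choice; for some tasks it may be preferable to keep the beginning of the document instead.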

Usage with pipeline

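from transformers import pipeline, Qwen2ForCausalLM, AutoTokenizer

# Load the base model and its tokenizer from the Hugging Face Hub.
model = Qwen2ForCausalLM.from_pretrained("NYTK/PULI-Trio-Q")
tokenizer = AutoTokenizer.from_pretrained("NYTK/PULI-Trio-Q")
# Hungarian prompt: "I will tell a story about language technology."
prompt = "Elmesélek egy történetet a nyelvtechnológiáról."
# device=0 selects the first GPU; use device=-1 to run on CPU.
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer, device=0)

# The pipeline returns a list of dicts; print the text of the first result.
print(generator(prompt, max_new_tokens=30)[0]["generated_text"])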

Citation

If you use this model, please cite the following paper:

@article{chatpuli,
  title={ChatPULI: Enhancement to the first Hungarian conversational model},
  author={Yang, Zijian Győző and Bánfi, Ágnes and Dodé, Réka and Ferenczi, Gergő and Földesi, Flóra and Hatvani, Péter and Héja, Enikő and Lengyel, Mariann and Madarász, Gábor and Osváth, Mátyás and Sárossy, Bence and Varga, Kristóf and Váradi, Tamás and Prószéky, Gábor and Ligeti-Nagy, Noémi},
  journal={Annales Mathematicae et Informaticae},
  doi = {https://doi.org/10.33039/ami.2025.10.010},
  url = {https://ami.uni-eszterhazy.hu},
  year={2025},
  volume={61},
  pages={261--274}
}