# Dango🍡
Dango is a large language model (LLM) trained on an extensively filtered Japanese corpus.
It is intended primarily for research on second language acquisition, but it can also be used as a Japanese speaker simulator.
This repository releases the checkpoint trained on 100B tokens.
## Model Details
Dango is architecturally comparable to the llm-jp-3 family, which adopts a Llama 2-style decoder architecture.
- llm-jp-3-1.8b: https://huggingface.co/llm-jp/llm-jp-3-1.8b
Please visit our GitHub repository for the filtering and training code: https://github.com/mattashiho233/dango
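
As a quick-start sketch, a checkpoint like this should load with Hugging Face `transformers` the same way as any Llama-style causal LM. The repository id below is a placeholder (not confirmed by this card), and the prompt and generation settings are illustrative assumptions:

```python
# Minimal loading sketch, assuming a standard Llama 2-style causal LM
# checkpoint hosted on the Hugging Face Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mattashiho233/dango"  # placeholder id; replace with this model card's actual id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Prompt in Japanese, since the model was pretrained on a Japanese-only corpus.
prompt = "日本語の勉強を始めたばかりの人に、おすすめの学習方法は"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```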
## License
- Code and scripts: CC BY 4.0
- Pre-trained models: CC BY 4.0
- Instruction-tuned models (after L2 acquisition training): trained on the Ichikara instruction data, which prohibits commercial use of derivative models
## Citation
If you use Dango in your research, please cite:
```bibtex
@inproceedings{matta2026anlp,
  author    = {Shiho Matta and Yin Jou Huang and Fei Cheng and Takashi Kodama and Hirokazu Kiyomaru and Yugo Murawaki},
  title     = {Pretraining a Japanese-Only Large Language Model for Studying Second Language Acquisition},
  booktitle = {Proceedings of the Thirty-second Annual Meeting of the Association for Natural Language Processing},
  year      = {2026},
  pages     = {3225--3230},
  publisher = {Association for Natural Language Processing},
  address   = {Utsunomiya, Japan}
}
```