What datasets were used to train this model?

#2
by ptrdvn - opened

First of all, thanks for the great model!

I would just like to know - what datasets were used to train this model?
You mention some open source Huggingface datasets in your model card, so I would like to know which ones were used in training.

Thanks

National Centre for Artificial Intelligence and Robotics org

Hello, @ptrdvn
We are all excited about ATLaS 🤗. I don't think an organized list of the training or fine-tuning datasets exists (at least, I don't think one was published here on HF with the model), but some sources are mentioned in the model card.

Just FYI, a variety of sources were mentioned, but the ones of particular interest to me are the Hausa, Yoruba, and Igbo sets, which came from local sources like BBC Pidgin and Punch News. Although this data (e.g., BBC Pidgin and Punch News) might already appear in existing datasets, it underwent manual quality checks, filtering, and prompt-response alignment to ensure fluency and cultural relevance.

The English set comes from commonly available open-source SFT (supervised fine-tuning) datasets: publicly available collections of labeled examples, such as prompt-response pairs, used to fine-tune large language models for specific tasks. You can look up Alpaca, OpenAssistant, etc.
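
For illustration, here is a minimal sketch of what one prompt-response record in such an open-source SFT set looks like, using the 🤗 `datasets` library and the public `tatsu-lab/alpaca` dataset as an example (this repo name and its field names are assumptions for illustration, not a confirmation of what ATLaS was actually trained on):

```python
# Minimal sketch: inspect one prompt-response record from an open-source SFT dataset.
# Assumes the public tatsu-lab/alpaca dataset; not confirmed as an ATLaS training source.
from datasets import load_dataset

ds = load_dataset("tatsu-lab/alpaca", split="train")

example = ds[0]
print(example["instruction"])  # the prompt / task description
print(example["input"])        # optional extra context (often empty)
print(example["output"])       # the target response the model learns to produce
```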

Happy to see what you will build!

National Centre for Artificial Intelligence and Robotics org

Thank you all for showing interest. Details will be sent ASAP.
