What datasets were used to train this model?

#2
by ptrdvn - opened

First of all, thanks for the great model!

I would just like to know - what datasets were used to train this model?
You mention some open source Huggingface datasets in your model card, so I would like to know which ones were used in training.

Thanks

National Centre for Artificial Intelligence and Robotics org

Hello, @ptrdvn
We are all excited about ATLaS 🤗. I don't think an organized list of the training or fine-tuning datasets exists (at least, I don't think one was published here on HF with the model), but some sources are mentioned in the model card.

Just FYI, a variety of sources were mentioned, but the ones of particular interest to me are the Hausa, Yoruba, and Igbo sets, which came from local sources like BBC Pidgin and Punch News. Although this data (e.g., BBC Pidgin and Punch News) might already appear in existing datasets, it underwent manual quality checks, filtering, and prompt-response alignment to ensure fluency and cultural relevance.

The English set comes from commonly available open-source SFT (supervised fine-tuning) datasets: publicly available collections of labeled examples, such as prompt-response pairs, used to fine-tune large language models for specific tasks. You can look up Alpaca, OpenAssistant, etc.
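
For illustration, here is a minimal sketch of what one prompt-response record in such an open-source SFT set looks like, using the 🤗 `datasets` library and the public `tatsu-lab/alpaca` dataset as an example (this repo name and its field names are assumptions for illustration, not a confirmation of what ATLaS was actually trained on):

```python
# Minimal sketch: inspect one prompt-response record from an open-source SFT dataset.
# Assumes the public tatsu-lab/alpaca dataset; not confirmed as an ATLaS training source.
from datasets import load_dataset

ds = load_dataset("tatsu-lab/alpaca", split="train")

example = ds[0]
print(example["instruction"])  # the prompt / task description
print(example["input"])        # optional extra context (often empty)
print(example["output"])       # the target response the model learns to produce
```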

Happy to see what you will build!

National Centre for Artificial Intelligence and Robotics org

Thank you all for showing interest. Details will be sent ASAP.
