microsoft/phi-1_5

Plans to release the training dataset?

#44

by monology - opened Oct 6, 2023

At the time of writing, all community efforts to create synthetic datasets like the one used for Phi-1.5 fall short, either in the quality of the synthetic generations or in the sheer size of the synthetic corpus.
Releasing the data used to train Phi-1.5 would be greatly beneficial for further research into the impact of synthetic datasets on large language models.
Would love to hear a response from one of the authors of the Phi-1.5 technical report about whether the community can expect to see the dataset or a subset of it released under any license or usage conditions.

Microsoft org Oct 30, 2023

Hello @monology !

Unfortunately, we are not able to release the dataset at the moment. However, there are some amazing attempts to create public versions, such as https://huggingface.co/datasets/nampdn-ai/tiny-textbooks and https://huggingface.co/datasets/emrgnt-cmplxty/sciphi-textbooks-are-all-you-need.

gugarosa changed discussion status to closed Oct 30, 2023

RaphaelKalandadze

Any updates on that topic?

edited Feb 13, 2024

Any updates on this topic?
