GPT-4 generated datasets
A collection of GPT-4 generated datasets. It may be useful for those looking for the best-quality datasets to train competitive LLMs.
Viewer • Updated • 289k • 1.19k • 127
Note: Should benefit coding and reasoning, since it was crafted with a slightly different prompting method and generated by GPT-4.
ise-uiuc/Magicoder-Evol-Instruct-110K
Viewer • Updated • 111k • 30k • 179
Note: They did some cleaning; the original dataset is from https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1.
microsoft/orca-math-word-problems-200k
Viewer • Updated • 200k • 6.85k • 483
Note: They used this dataset to build a strong math model.
Leon-Leee/Wizardlm_Evol_Instruct_v2_196K_backuped
Viewer • Updated • 143k • 23 • 1
Note: A high-quality dataset built with Evol-Instruct and GPT-4.
CollectiveCognition/chats-data-2023-09-22
Viewer • Updated • 156 • 60 • 19
Note: Not sure if all of it was generated by GPT-4.
jondurbin/airoboros-3.2
Viewer • Updated • 58.7k • 772 • 47
Note: Not sure if all of it was generated by GPT-4.
camel-ai/biology
Viewer • Updated • 20k • 2.89k • 56
Note: Might be outdated.
camel-ai/physics
Viewer • Updated • 20k • 8.29k • 108
Note: Might be outdated.
Open-Orca/1million-gpt-4
Viewer • Updated • 995k • 57 • 46
Note: Having no description is really bad, and they have been quiet for a while.
teknium/OpenHermes-2.5
Viewer • Updated • 1M • 17.2k • 843
Note: A huge collection. Caution: some of the data was not generated by GPT-4.
openchat/openchat_sharegpt4_dataset
Updated • 382 • 173
Note: Their models are really strong.
teknium/GPTeacher-General-Instruct
Viewer • Updated • 89.3k • 130 • 45
Note: Not sure if it is outdated.
Tony-Yuan/TheElements
Viewer • Updated • 1.56k • 21
Note: Not sure if it is really good.
FreedomIntelligence/Evol-Instruct-Chinese-GPT4
Viewer • Updated • 70k • 150 • 47
shibing624/sharegpt_gpt4
Viewer • Updated • 103k • 1.13k • 138
MAsad789565/Coding_GPT4_Data
Viewer • Updated • 2.21k • 25 • 6
bz-arc13/evol_instruct_zh_gpt4
Viewer • Updated • 68.9k • 22
LNTANOooo/evol_instruct_zh_gpt4_v3
Viewer • Updated • 68.9k • 21 • 1
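
Several notes in this collection caution that not every record in a dataset was generated by GPT-4. If a dataset exposes a per-record provenance field, one way to act on that caution is to filter on it before training. The following is a minimal sketch only: the field name `source`, the tag strings, and the helper `keep_gpt4_records` are hypothetical and are not taken from any dataset listed here, so check each dataset's actual schema first.

```python
def keep_gpt4_records(records, field="source", allowed=("gpt-4", "gpt4")):
    """Keep records whose provenance field mentions a GPT-4 tag.

    `field` and `allowed` are illustrative assumptions; real datasets
    may use a different column name or may not record provenance at all.
    """
    kept = []
    for record in records:
        # Records without the field are treated as unknown and dropped.
        value = str(record.get(field, "")).lower()
        if any(tag in value for tag in allowed):
            kept.append(record)
    return kept


# Toy records standing in for rows of a mixed-provenance dataset.
sample = [
    {"text": "a", "source": "GPT-4"},
    {"text": "b", "source": "gpt-3.5-turbo"},
    {"text": "c"},
]
filtered = keep_gpt4_records(sample)
print([r["text"] for r in filtered])
```

Dropping records with unknown provenance is the conservative choice here; if a dataset has no such field, this kind of filtering is not possible and the curator's note is the only signal available.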