Synthetic Data Generation
Textbooks Are All You Need
Paper
• 2306.11644
• Published
• 154
Textbooks Are All You Need II: phi-1.5 technical report
Paper
• 2309.05463
• Published
• 89
TinyStories: How Small Can Language Models Be and Still Speak Coherent
English?
Paper
• 2305.07759
• Published
• 45
Scaling Synthetic Data Creation with 1,000,000,000 Personas
Paper
• 2406.20094
• Published
• 104
Instruction Pre-Training: Language Models are Supervised Multitask
Learners
Paper
• 2406.14491
• Published
• 96
Improving Text Embeddings with Large Language Models
Paper
• 2401.00368
• Published
• 82
Enhancing Chat Language Models by Scaling High-quality Instructional
Conversations
Paper
• 2305.14233
• Published
• 7
Magicoder: Source Code Is All You Need
Paper
• 2312.02120
• Published
• 82
Adapting Large Language Models via Reading Comprehension
Paper
• 2309.09530
• Published
• 82
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language
Models
Paper
• 2401.01335
• Published
• 68
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs
with Nothing
Paper
• 2406.08464
• Published
• 72
WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with
Refined Data Generation
Paper
• 2312.14187
• Published
• 49
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language
Modeling
Paper
• 2401.16380
• Published
• 51
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for
Language Models
Paper
• 2402.13064
• Published
• 50
AgentInstruct: Toward Generative Teaching with Agentic Flows
Paper
• 2407.03502
• Published
• 51
Toward General Instruction-Following Alignment for Retrieval-Augmented
Generation
Paper
• 2410.09584
• Published
• 48
Self-Alignment with Instruction Backtranslation
Paper
• 2308.06259
• Published
• 43
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset
Paper
• 2402.10176
• Published
• 38
DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM
Workflows
Paper
• 2402.10379
• Published
• 31
Best Practices and Lessons Learned on Synthetic Data for Language Models
Paper
• 2404.07503
• Published
• 31
Beyond Human Data: Scaling Self-Training for Problem-Solving with
Language Models
Paper
• 2312.06585
• Published
• 29
Becoming self-instruct: introducing early stopping criteria for minimal
instruct tuning
Paper
• 2307.03692
• Published
• 27
AlpaGasus: Training A Better Alpaca with Fewer Data
Paper
• 2307.08701
• Published
• 24
Simple synthetic data reduces sycophancy in large language models
Paper
• 2308.03958
• Published
• 23
CodecLM: Aligning Language Models with Tailored Synthetic Data
Paper
• 2404.05875
• Published
• 18
Source2Synth: Synthetic Data Generation and Curation Grounded in Real
Data Sources
Paper
• 2409.08239
• Published
• 21
WizardLM: Empowering Large Language Models to Follow Complex
Instructions
Paper
• 2304.12244
• Published
• 13
Learning to Generate Instruction Tuning Datasets for Zero-Shot Task
Adaptation
Paper
• 2402.18334
• Published
• 12
Synthesizing Text-to-SQL Data from Weak and Strong LLMs
Paper
• 2408.03256
• Published
• 10
Self-Instruct: Aligning Language Model with Self Generated Instructions
Paper
• 2212.10560
• Published
• 9
Ensemble-Instruct: Generating Instruction-Tuning Data with a
Heterogeneous Mixture of LMs
Paper
• 2310.13961
• Published
• 5
STaR: Bootstrapping Reasoning With Reasoning
Paper
• 2203.14465
• Published
• 9
M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in
Large Language Models
Paper
• 2406.16783
• Published
• 4
Synthetic Data Generation with Large Language Models for Text
Classification: Potential and Limitations
Paper
• 2310.07849
• Published
• 2
Explore-Instruct: Enhancing Domain-Specific Instruction Coverage through
Active Exploration
Paper
• 2310.09168
• Published
• 2
Increasing Diversity While Maintaining Accuracy: Text Data Generation
with Large Language Models and Human Interventions
Paper
• 2306.04140
• Published
• 2
SALMON: Self-Alignment with Principle-Following Reward Models
Paper
• 2310.05910
• Published
• 2
Better Synthetic Data by Retrieving and Transforming Existing Datasets
Paper
• 2404.14361
• Published
• 2
Impossible Distillation: from Low-Quality Model to High-Quality Dataset
& Model for Summarization and Paraphrasing
Paper
• 2305.16635
• Published
• 1
Arena Learning: Build Data Flywheel for LLMs Post-training via Simulated
Chatbot Arena
Paper
• 2407.10627
• Published
ZeroGen: Efficient Zero-shot Learning via Dataset Generation
Paper
• 2202.07922
• Published
• 1
West-of-N: Synthetic Preference Generation for Improved Reward Modeling
Paper
• 2401.12086
• Published
• 1
Automatic Instruction Evolving for Large Language Models
Paper
• 2406.00770
• Published
• 3
Generative AI for Synthetic Data Generation: Methods, Challenges and the
Future
Paper
• 2403.04190
• Published
• 1
On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A
Survey
Paper
• 2406.15126
• Published
• 1
Large Language Models for Data Annotation: A Survey
Paper
• 2402.13446
• Published
• 1
Large Language Model as Attributed Training Data Generator: A Tale of
Diversity and Bias
Paper
• 2306.15895
• Published
A Multi-Faceted Evaluation Framework for Assessing Synthetic Data
Generated by Large Language Models
Paper
• 2404.14445
• Published
TarGEN: Targeted Data Generation with Large Language Models
Paper
• 2310.17876
• Published
#InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of
Large Language Models
Paper
• 2308.07074
• Published
Self-Rewarding Language Models
Paper
• 2401.10020
• Published
• 152
Orca 2: Teaching Small Language Models How to Reason
Paper
• 2311.11045
• Published
• 77
Orca: Progressive Learning from Complex Explanation Traces of GPT-4
Paper
• 2306.02707
• Published
• 51
WizardCoder: Empowering Code Large Language Models with Evol-Instruct
Paper
• 2306.08568
• Published
• 33
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
Paper
• 2309.11998
• Published
• 27
Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large
Language Models by Extrapolating Errors from Small Models
Paper
• 2310.13671
• Published
• 19
Self-play with Execution Feedback: Improving Instruction-following
Capabilities of Large Language Models
Paper
• 2406.13542
• Published
• 17
Auto-Instruct: Automatic Instruction Generation and Ranking for
Black-Box Language Models
Paper
• 2310.13127
• Published
• 12
WizardMath: Empowering Mathematical Reasoning for Large Language Models
via Reinforced Evol-Instruct
Paper
• 2308.09583
• Published
• 7
GenQA: Generating Millions of Instructions from a Handful of Prompts
Paper
• 2406.10323
• Published
• 5
UltraFeedback: Boosting Language Models with High-quality Feedback
Paper
• 2310.01377
• Published
• 5
Model Dementia: Generated Data Makes Models Forget
Paper
• 2305.17493
• Published
• 6
Large Language Model as a User Simulator
Paper
• 2308.11534
• Published
• 2
Unnatural Instructions: Tuning Language Models with (Almost) No Human
Labor
Paper
• 2212.09689
• Published
• 1
Aligning Large Language Models through Synthetic Feedback
Paper
• 2305.13735
• Published
• 1
Principle-Driven Self-Alignment of Language Models from Scratch with
Minimal Human Supervision
Paper
• 2305.03047
• Published
• 1
Mixture of Soft Prompts for Controllable Data Generation
Paper
• 2303.01580
• Published
• 1
Refined Direct Preference Optimization with Synthetic Data for
Behavioral Alignment of LLMs
Paper
• 2402.08005
• Published
• 1
Harnessing the Power of David against Goliath: Exploring Instruction
Data Generation without Using Closed-Source Models
Paper
• 2308.12711
• Published
• 1
Generating Training Data with Language Models: Towards Zero-Shot
Language Understanding
Paper
• 2202.04538
• Published
Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data
Generation with Large Language Models
Paper
• 2311.00287
• Published
GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation
Paper
• 2104.08826
• Published
Synthetic Prompting: Generating Chain-of-Thought Demonstrations for
Large Language Models
Paper
• 2302.00618
• Published
MIND: Math Informed syNthetic Dialogues for Pretraining LLMs
Paper
• 2410.12881
• Published
• 1
LAB: Large-Scale Alignment for ChatBots
Paper
• 2403.01081
• Published
Large Language Models Can Self-Improve
Paper
• 2210.11610
• Published
Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation
Paper
• 2305.14327
• Published
Automatically Generating Numerous Context-Driven SFT Data for LLMs
across Diverse Granularity
Paper
• 2405.16579
• Published
Data Augmentation using Pre-trained Transformer Models
Paper
• 2003.02245
• Published
Unsupervised Neural Machine Translation with Generative Language Models
Only
Paper
• 2110.05448
• Published
Instruction Tuning with GPT-4
Paper
• 2304.03277
• Published
Content preserving text generation with attribute controls
Paper
• 1811.01135
• Published
Large Language Models Are Human-Level Prompt Engineers
Paper
• 2211.01910
• Published
• 1
XPersona: Evaluating Multilingual Personalized Chatbot
Paper
• 2003.07568
• Published
PersonaMath: Enhancing Math Reasoning through Persona-Driven Data
Augmentation
Paper
• 2410.01504
• Published
Do Not Worry if You Do Not Have Data: Building Pretrained Language
Models Using Translationese
Paper
• 2403.13638
• Published
• 1