arxiv:2602.10388

Less is Enough: Synthesizing Diverse Data in Feature Space of LLMs

Published on Feb 11

Authors:

Zhongzhi Li

Abstract

Feature Activation Coverage measures data diversity in an interpretable feature space and enables diversity-driven data synthesis that improves downstream performance across multiple language model architectures.

AI-generated summary

The diversity of post-training data is critical for effective downstream performance in large language models (LLMs). Many existing approaches to constructing post-training data quantify diversity using text-based metrics that capture linguistic variation, but such metrics provide only weak signals for the task-relevant features that determine downstream performance. In this work, we introduce Feature Activation Coverage (FAC), which measures data diversity in an interpretable feature space. Building upon this metric, we further propose a diversity-driven data synthesis framework, named FAC Synthesis, that first uses a sparse autoencoder to identify features missing from a seed dataset, and then generates synthetic samples that explicitly reflect these features. Experiments show that our approach consistently improves both data diversity and downstream performance on various tasks, including instruction following, toxicity detection, reward modeling, and behavior steering. Interestingly, we identify a shared, interpretable feature space across model families (i.e., LLaMA, Mistral, and Qwen), enabling cross-model knowledge transfer. Our work provides a solid and practical methodology for exploring data-centric optimization of LLMs.

Community

Zhongzhi1228

Paper author, about 16 hours ago (edited about 15 hours ago)

Less is Enough shows that better data matters more than more data.

Instead of generating massive amounts of synthetic samples, we look inside the model’s hidden features to find what is truly missing. We introduce Feature Activation Coverage (FAC) to measure which important internal features are underrepresented, then generate new samples that specifically activate those features.
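To make the idea concrete, here is a minimal sketch of a coverage-style metric. It is not the paper's implementation: it assumes FAC can be approximated as the fraction of task-relevant sparse-autoencoder features that fire above a threshold on at least one sample. The function names, the threshold, and the toy activation values are all hypothetical.

```python
from typing import Iterable, List, Sequence, Set


def active_features(activations: Sequence[Sequence[float]],
                    threshold: float = 0.0) -> Set[int]:
    """Indices of features that exceed `threshold` on at least one sample.

    `activations` is a samples x features matrix of (assumed) SAE activations.
    """
    return {j for row in activations for j, a in enumerate(row) if a > threshold}


def coverage(activations: Sequence[Sequence[float]],
             task_features: Iterable[int],
             threshold: float = 0.0) -> float:
    """Fraction of task-relevant features activated by the dataset (assumed FAC proxy)."""
    targets = set(task_features)
    if not targets:
        raise ValueError("task_features must not be empty")
    return len(active_features(activations, threshold) & targets) / len(targets)


def missing_features(activations: Sequence[Sequence[float]],
                     task_features: Iterable[int],
                     threshold: float = 0.0) -> List[int]:
    """Task-relevant features that no sample activates: candidates for synthesis."""
    return sorted(set(task_features) - active_features(activations, threshold))


# Toy seed set: 3 samples, 6 features (values are made up for illustration).
seed = [
    [0.9, 0.0, 0.0, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
]
task_features = [0, 1, 2, 3]

seed_coverage = coverage(seed, task_features)   # features 0 and 2 are covered
gaps = missing_features(seed, task_features)    # features 1 and 3 are not

# One synthetic sample written to activate feature 1 raises coverage.
synthetic = [[0.0, 0.7, 0.0, 0.0, 0.0, 0.0]]
new_coverage = coverage(seed + synthetic, task_features)
```

In this toy setup the single targeted sample closes one of the two gaps, whereas adding more samples that only activate features 0 or 2 would leave coverage unchanged.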

Result: FAC exhibits a strong correlation with downstream performance. Increasing FAC brings significantly larger gains than simply adding more samples. With only 2K synthetic samples, we match the AlpacaEval 2.0 performance of MAGPIE, which uses 300K samples, and outperform strong baselines across instruction following, toxicity detection, reward modeling, and behavior steering.

Interestingly, we further discover a shared, interpretable feature space across LLaMA, Mistral, and Qwen, which enables effective knowledge transfer between model families.

Paper: arXiv:2602.10388
Code: GitHub
Website: https://website-sigma-three-35.vercel.app/
Demo: https://huggingface.co/spaces/Zhongzhi1228/synthesis-demo (Work in Progress)