arxiv:2606.14971

FastMix: Fast Data Mixture Optimization via Gradient Descent

Published on Jun 12

· Submitted by

Haoru Tan on Jun 23

Tencent Hunyuan


Abstract

FASTMIX automates optimal data mixture discovery during training by formulating mixture selection as a bilevel optimization problem that jointly optimizes mixture coefficients and model parameters through iterative updates.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

While large and diverse datasets have driven recent advances in large models, identifying the optimal data mixture for pre-training and post-training remains a significant open problem. We address this challenge with FASTMIX, a novel framework that automates data mixture discovery while training only a single proxy model. Instead of relying on predefined heuristics or resource-intensive simulations, FASTMIX jointly optimizes mixture coefficients and model parameters, substantially improving efficiency and scalability over prior approaches. At the core of FASTMIX is a reformulation of mixture selection as a bilevel optimization problem. Under this reformulation, we show that optimizing mixture ratios is mathematically equivalent to assigning per-source loss weights under uniform source sampling. This embeds the mixture coefficients directly into the differentiable iterative optimization objective, enabling efficient, gradient-based optimization of both mixture and model. To solve the optimization problem, FASTMIX implements an approximate iterative optimization procedure, alternating between (i) updating model parameters on data sampled according to current mixture ratios (inner loop) and (ii) updating mixture ratios based on validation feedback (outer loop). Across pre- and post-training, FASTMIX outperforms baselines while drastically reducing search cost. Code: https://github.com/hrtan/fastmix


Community

thrshr

Paper submitter about 15 hours ago

Excited to share our team's (Tencent Hunyuan, HKU, CUHK) work from last year: "Fast Data Mixture Optimization via Gradient Descent", which has been accepted to ICLR 2026!

TL;DR: We reformulate the assignment of data sampling ratios as a bilevel optimization problem over per-source loss weights under uniform sampling, and prove that the two are equivalent in expectation. As a result, the mixture coefficients enter the PyTorch computation graph directly and can be solved for by gradient descent.

This work is our in-depth exploration of data mixture optimization, a problem that has long been difficult. Existing approaches either rely on predefined heuristics, which lack flexibility, or train a large number of small proxy models to simulate the process, which is highly resource-intensive and inefficient and also suffers from a potential scale gap.

Our proposed framework, FastMix, aims to break through these limitations.

The core idea of FastMix is to reformulate the assignment of data sampling ratios as a bilevel optimization problem expressed through loss proportions under uniform sampling. Optimizing the mixture ratios is equivalent, in expectation, to assigning a loss weight to each data source while sampling sources uniformly. The mixture coefficients are thereby integrated directly into a differentiable iterative optimization objective, enabling efficient gradient-based optimization of both the data mixture and the model.
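As a sketch of why the equivalence holds, the argument can be written in a few lines. The notation below ($K$ sources $D_i$, mixture weights $w$, per-source loss $L_i$, and the scaling factor $K w_i$) is assumed here for illustration and may differ from the paper's.

```latex
% Assumed notation: K data sources D_1, ..., D_K; mixture weights w on the
% probability simplex; per-example loss \ell; model parameters \theta.
L_i(\theta) = \mathbb{E}_{x \sim D_i}\big[\ell(\theta; x)\big]

% (a) Sample source i with probability w_i, unweighted loss:
\mathbb{E}_{i \sim w}\, \mathbb{E}_{x \sim D_i}\big[\ell(\theta; x)\big]
  = \sum_{i=1}^{K} w_i \, L_i(\theta)

% (b) Sample sources uniformly, scale the loss of source i by K w_i:
\mathbb{E}_{i \sim \mathrm{Unif}\{1,\dots,K\}}\, \mathbb{E}_{x \sim D_i}\big[K w_i \, \ell(\theta; x)\big]
  = \sum_{i=1}^{K} \frac{1}{K}\, K w_i \, L_i(\theta)
  = \sum_{i=1}^{K} w_i \, L_i(\theta)

% Bilevel problem over the mixture weights:
\min_{w \in \Delta^{K-1}} \; L_{\mathrm{val}}\big(\theta^{*}(w)\big)
\quad \text{s.t.} \quad
\theta^{*}(w) \in \arg\min_{\theta} \sum_{i=1}^{K} w_i \, L_i(\theta)
```

In form (b), $w$ is a multiplicative factor on the loss rather than a parameter of a sampler, so the training objective is differentiable with respect to $w$.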

To implement this, FastMix constructs an approximate iterative optimization process that alternates between two steps: updating the model parameters on data sampled according to the current mixture ratios (inner loop), and updating the mixture ratios based on validation feedback (outer loop) (Figure 2).
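The alternating scheme can be illustrated on a toy problem. The following is a minimal sketch, not the FastMix implementation: it assumes a scalar model parameter, quadratic per-source losses, a softmax parameterization of the mixture, and a one-step lookahead approximation of the outer gradient. All of these, along with the function names `softmax` and `run_toy_mixture_search`, are choices made for this example and are not taken from the paper or its repository.

```python
import math


def softmax(alpha):
    """Map unconstrained logits to mixture weights on the simplex."""
    top = max(alpha)
    exps = [math.exp(a - top) for a in alpha]
    total = sum(exps)
    return [e / total for e in exps]


def run_toy_mixture_search(centers, val_center, steps=2000,
                           lr_inner=0.5, lr_outer=0.2):
    """Alternate inner (model) and outer (mixture) gradient steps.

    Toy assumptions (hypothetical, for illustration only):
      - source i has loss L_i(theta) = 0.5 * (theta - centers[i]) ** 2
      - validation loss is 0.5 * (theta - val_center) ** 2
      - mixture weights are w = softmax(alpha)
    """
    alpha = [0.0] * len(centers)  # uniform mixture to start
    theta = 0.0

    for _ in range(steps):
        w = softmax(alpha)

        # Inner step: gradient descent on the w-weighted training loss.
        grads = [theta - c for c in centers]
        theta -= lr_inner * sum(wi * gi for wi, gi in zip(w, grads))

        # Outer step: differentiate the validation loss through one
        # hypothetical further inner step (one-step lookahead).
        grads = [theta - c for c in centers]
        theta_look = theta - lr_inner * sum(
            wi * gi for wi, gi in zip(w, grads))
        dval = theta_look - val_center  # dL_val / dtheta at the lookahead

        # dL_val / dw_i, since d(theta_look) / dw_i = -lr_inner * grads[i].
        h = [dval * (-lr_inner) * gi for gi in grads]

        # Chain rule through softmax: dL/dalpha_j = w_j * (h_j - sum_i w_i h_i).
        h_bar = sum(wi * hi for wi, hi in zip(w, h))
        alpha = [a - lr_outer * wi * (hi - h_bar)
                 for a, wi, hi in zip(alpha, w, h)]

    return softmax(alpha), theta


if __name__ == "__main__":
    weights, theta = run_toy_mixture_search([0.0, 1.0, 4.0], val_center=3.0)
    print("mixture weights:", [round(x, 3) for x in weights])
    print("model parameter:", round(theta, 3))
```

In this toy setup the uniform mixture would pull the parameter toward the mean of the source centers, while the outer step shifts weight toward the source whose loss is most aligned with the validation objective. A real model would replace the closed-form gradients with automatic differentiation.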

In our experiments, FastMix consistently matches or outperforms the strongest baselines across both pre-training and post-training stages, while drastically reducing search costs. During pre-training, it is up to 550× faster than the state-of-the-art method RegMix; during post-training, the time cost is reduced by 52× compared to RegMix.

Of course, there are still many exciting open questions worth exploring (see Section 4.3). We sincerely hope this paper can provide new perspectives and inspiration for researchers working in related fields.

The code is now open source! If you are interested, check out our repository 👉 https://github.com/hrtan/fastmix. We warmly welcome discussions and collaborations!


