arxiv:2603.12228

Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights

Published on Mar 12 · Submitted by Yulu Gan on Mar 13
Abstract

Pretraining creates a parameter distribution where task-specific experts become more densely populated in large models, enabling effective ensemble methods for post-training adaptation.

AI-generated summary

Pretraining produces a learned parameter vector that is typically treated as a starting point for further iterative adaptation. In this work, we instead view the outcome of pretraining as a distribution over parameter vectors, whose support already contains task-specific experts. We show that in small models such expert solutions occupy a negligible fraction of the volume of this distribution, making their discovery reliant on structured optimization methods such as gradient descent. In contrast, in large, well-pretrained models the density of task experts increases dramatically, so that diverse, task-improving specialists populate a substantial fraction of the neighborhood around the pretrained weights. Motivated by this perspective, we explore a simple, fully parallel post-training method that samples N parameter perturbations at random, selects the top K, and ensembles predictions via majority vote. Despite its simplicity, this approach is competitive with standard post-training methods such as proximal policy optimization (PPO), group relative policy optimization (GRPO), and evolution strategies (ES) for contemporary large-scale models.
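The sample-select-ensemble procedure described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the linear classifier, the Gaussian perturbation scale `sigma`, and the helper names (`sample_select_ensemble`, `majority_vote`) are all assumptions made for the demo; the paper applies the idea to large pretrained networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def majority_vote(preds):
    """Column-wise majority vote over stacked per-member label predictions."""
    preds = np.asarray(preds)
    return np.array([np.bincount(col).argmax() for col in preds.T])

def sample_select_ensemble(base_w, score_fn, predict_fn, n=64, k=8, sigma=0.1):
    """Sample n random perturbations of base_w (all independent, so fully
    parallelizable), keep the top k by score, and return the majority-vote
    prediction of the selected 'experts'."""
    candidates = [base_w + sigma * rng.standard_normal(base_w.shape)
                  for _ in range(n)]
    scores = np.array([score_fn(w) for w in candidates])
    top = np.argsort(scores)[-k:]  # indices of the k best perturbations
    return majority_vote([predict_fn(candidates[i]) for i in top])

# Toy demo (hypothetical setup): a linear classifier on separable 2-class data,
# with noisy weights standing in for a "pretrained" starting point.
X = rng.standard_normal((200, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = (X @ true_w > 0).astype(int)

base_w = true_w + 0.5 * rng.standard_normal(4)
predict = lambda w: (X @ w > 0).astype(int)
score = lambda w: (predict(w) == y).mean()  # accuracy on held-out data

ens_pred = sample_select_ensemble(base_w, score, predict)
print("base accuracy:", score(base_w))
print("ensemble accuracy:", (ens_pred == y).mean())
```

Note that every candidate is scored independently, so unlike gradient-based post-training the method needs no backward passes and trivially parallelizes across N workers; only the top-K selection and the vote require synchronization.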

