Papers
arxiv:2505.19527

Rolling Ball Optimizer: Learning by ironing out loss landscape wrinkles

Published on Oct 24, 2025
Authors:
,
,
,
,

Abstract

The Rolling Ball Optimizer introduces a novel approach to neural network optimization by incorporating global loss landscape information through a spherical rolling mechanism, improving convergence and generalization over traditional gradient-based methods.

Training large neural networks (NNs) requires optimizing high-dimensional data-dependent loss functions. The optimization landscape of these functions is often highly complex and textured, even fractal-like, with many spurious local minima, ill-conditioned valleys, degenerate points, and saddle points. Complicating things further is the fact that these landscape characteristics are a function of the data, meaning that noise in the training data can propagate forward and give rise to unrepresentative small-scale geometry. This poses a difficulty for gradient-based optimization methods, which rely on local geometry to compute updates and are, therefore, vulnerable to being derailed by noisy data. In practice,this translates to a strong dependence of the optimization dynamics on the noise in the data, i.e., poor generalization performance. To remediate this problem, we propose a new optimization procedure: Rolling Ball Optimizer (RBO), that breaks this spatial locality by incorporating information from a larger region of the loss landscape in its updates. We achieve this by simulating the motion of a rigid sphere of finite radius rolling on the loss landscape, a straightforward generalization of Gradient Descent (GD) that simplifies into it in the infinitesimal limit. The radius serves as a hyperparameter that determines the scale at which RBO sees the loss landscape, allowing control over the granularity of its interaction therewith. We are motivated by the intuition that the large-scale geometry of the loss landscape is less data-specific than its fine-grained structure, and that it is easier to optimize. We support this intuition by proving that our algorithm has a smoothing effect on the loss function. Evaluation against SGD, SAM, and Entropy-SGD, on MNIST and CIFAR-10/100 demonstrates promising results in terms of convergence speed, training accuracy, and generalization performance.

Community

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2505.19527
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2505.19527 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2505.19527 in a dataset README.md to link it from this page.

Spaces citing this paper 1

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.