Get trending papers in your email inbox once a day!
Get trending papers in your email inbox!
SubscribeEfficient MPC-Based Energy Management System for Secure and Cost-Effective Microgrid Operations
Model predictive control (MPC)-based energy management systems (EMS) are essential for ensuring optimal, secure, and stable operation in microgrids with high penetrations of distributed energy resources. However, due to the high computational cost for the decision-making, the conventional MPC-based EMS typically adopts a simplified integrated-bus power balance model. While this simplification is effective for small networks, large-scale systems require a more detailed branch flow model to account for the increased impact of grid power losses and security constraints. This work proposes an efficient and reliable MPC-based EMS that incorporates power-loss effects and grid-security constraints. %, while adaptively shaping the battery power profile in response to online renewable inputs, achieving reduced operational costs. It enhances system reliability, reduces operational costs, and shows strong potential for online implementation due to its reduced computational effort. Specifically, a second-order cone program (SOCP) branch flow relaxation is integrated into the constraint set, yielding a convex formulation that guarantees globally optimal solutions with high computational efficiency. Owing to the radial topology of the microgrid, this relaxation is practically tight, ensuring equivalence to the original problem. Building on this foundation, an online demand response (DR) module is designed to further reduce the operation cost through peak shaving. To the best of our knowledge, no prior MPC-EMS framework has simultaneously modeled losses and security constraints while coordinating flexible loads within a unified architecture. The developed framework enables secure operation with effective peak shaving and reduced total cost. The effectiveness of the proposed method is validated on 10-bus, 18-bus, and 33-bus systems.
Safe Learning-Based Control of Elastic Joint Robots via Control Barrier Functions
Ensuring safety is of paramount importance in physical human-robot interaction applications. This requires both adherence to safety constraints defined on the system state, as well as guaranteeing compliant behavior of the robot. If the underlying dynamical system is known exactly, the former can be addressed with the help of control barrier functions. The incorporation of elastic actuators in the robot's mechanical design can address the latter requirement. However, this elasticity can increase the complexity of the resulting system, leading to unmodeled dynamics, such that control barrier functions cannot directly ensure safety. In this paper, we mitigate this issue by learning the unknown dynamics using Gaussian process regression. By employing the model in a feedback linearizing control law, the safety conditions resulting from control barrier functions can be robustified to take into account model errors, while remaining feasible. In order to enforce them on-line, we formulate the derived safety conditions in the form of a second-order cone program. We demonstrate our proposed approach with simulations on a two-degree-of-freedom planar robot with elastic joints.
Dojo: A Differentiable Physics Engine for Robotics
We present Dojo, a differentiable physics engine for robotics that prioritizes stable simulation, accurate contact physics, and differentiability with respect to states, actions, and system parameters. Dojo models hard contact and friction with a nonlinear complementarity problem with second-order cone constraints. We introduce a custom primal-dual interior-point method to solve the second order cone program for stable forward simulation over a broad range of sample rates. We obtain smooth gradient approximations with this solver through the implicit function theorem, giving gradients that are useful for downstream trajectory optimization, policy optimization, and system identification applications. Specifically, we propose to use the central path parameter threshold in the interior point solver as a user-tunable design parameter. A high value gives a smooth approximation to contact dynamics with smooth gradients for optimization and learning, while a low value gives precise simulation rollouts with hard contact. We demonstrate Dojo's differentiability in trajectory optimization, policy learning, and system identification examples. We also benchmark Dojo against MuJoCo, PyBullet, Drake, and Brax on a variety of robot models, and study the stability and simulation quality over a range of sample frequencies and accuracy tolerances. Finally, we evaluate the sim-to-real gap in hardware experiments with a Ufactory xArm 6 robot. Dojo is an open source project implemented in Julia with Python bindings, with code available at https://github.com/dojo-sim/Dojo.jl.
Second-order regression models exhibit progressive sharpening to the edge of stability
Recent studies of gradient descent with large step sizes have shown that there is often a regime with an initial increase in the largest eigenvalue of the loss Hessian (progressive sharpening), followed by a stabilization of the eigenvalue near the maximum value which allows convergence (edge of stability). These phenomena are intrinsically non-linear and do not happen for models in the constant Neural Tangent Kernel (NTK) regime, for which the predictive function is approximately linear in the parameters. As such, we consider the next simplest class of predictive models, namely those that are quadratic in the parameters, which we call second-order regression models. For quadratic objectives in two dimensions, we prove that this second-order regression model exhibits progressive sharpening of the NTK eigenvalue towards a value that differs slightly from the edge of stability, which we explicitly compute. In higher dimensions, the model generically shows similar behavior, even without the specific structure of a neural network, suggesting that progressive sharpening and edge-of-stability behavior aren't unique features of neural networks, and could be a more general property of discrete learning algorithms in high-dimensional non-linear models.
On the Parameterization of Second-Order Optimization Effective Towards the Infinite Width
Second-order optimization has been developed to accelerate the training of deep neural networks and it is being applied to increasingly larger-scale models. In this study, towards training on further larger scales, we identify a specific parameterization for second-order optimization that promotes feature learning in a stable manner even if the network width increases significantly. Inspired by a maximal update parameterization, we consider a one-step update of the gradient and reveal the appropriate scales of hyperparameters including random initialization, learning rates, and damping terms. Our approach covers two major second-order optimization algorithms, K-FAC and Shampoo, and we demonstrate that our parameterization achieves higher generalization performance in feature learning. In particular, it enables us to transfer the hyperparameters across models with different widths.
Machine Learning Modeling for Multi-order Human Visual Motion Processing
Our research aims to develop machines that learn to perceive visual motion as do humans. While recent advances in computer vision (CV) have enabled DNN-based models to accurately estimate optical flow in naturalistic images, a significant disparity remains between CV models and the biological visual system in both architecture and behavior. This disparity includes humans' ability to perceive the motion of higher-order image features (second-order motion), which many CV models fail to capture because of their reliance on the intensity conservation law. Our model architecture mimics the cortical V1-MT motion processing pathway, utilizing a trainable motion energy sensor bank and a recurrent graph network. Supervised learning employing diverse naturalistic videos allows the model to replicate psychophysical and physiological findings about first-order (luminance-based) motion perception. For second-order motion, inspired by neuroscientific findings, the model includes an additional sensing pathway with nonlinear preprocessing before motion energy sensing, implemented using a simple multilayer 3D CNN block. When exploring how the brain acquired the ability to perceive second-order motion in natural environments, in which pure second-order signals are rare, we hypothesized that second-order mechanisms were critical when estimating robust object motion amidst optical fluctuations, such as highlights on glossy surfaces. We trained our dual-pathway model on novel motion datasets with varying material properties of moving objects. We found that training to estimate object motion from non-Lambertian materials naturally endowed the model with the capacity to perceive second-order motion, as can humans. The resulting model effectively aligns with biological systems while generalizing to both first- and second-order motion phenomena in natural scenes.
Thermodynamic Natural Gradient Descent
Second-order training methods have better convergence properties than gradient descent but are rarely used in practice for large-scale training due to their computational overhead. This can be viewed as a hardware limitation (imposed by digital computers). Here we show that natural gradient descent (NGD), a second-order method, can have a similar computational complexity per iteration to a first-order method, when employing appropriate hardware. We present a new hybrid digital-analog algorithm for training neural networks that is equivalent to NGD in a certain parameter regime but avoids prohibitively costly linear system solves. Our algorithm exploits the thermodynamic properties of an analog system at equilibrium, and hence requires an analog thermodynamic computer. The training occurs in a hybrid digital-analog loop, where the gradient and Fisher information matrix (or any other positive semi-definite curvature matrix) are calculated at given time intervals while the analog dynamics take place. We numerically demonstrate the superiority of this approach over state-of-the-art digital first- and second-order training methods on classification tasks and language model fine-tuning tasks.
How Well Do Sparse Imagenet Models Transfer?
Transfer learning is a classic paradigm by which models pretrained on large "upstream" datasets are adapted to yield good results on "downstream" specialized datasets. Generally, more accurate models on the "upstream" dataset tend to provide better transfer accuracy "downstream". In this work, we perform an in-depth investigation of this phenomenon in the context of convolutional neural networks (CNNs) trained on the ImageNet dataset, which have been pruned - that is, compressed by sparsifying their connections. We consider transfer using unstructured pruned models obtained by applying several state-of-the-art pruning methods, including magnitude-based, second-order, re-growth, lottery-ticket, and regularization approaches, in the context of twelve standard transfer tasks. In a nutshell, our study shows that sparse models can match or even outperform the transfer performance of dense models, even at high sparsities, and, while doing so, can lead to significant inference and even training speedups. At the same time, we observe and analyze significant differences in the behaviour of different pruning methods.
Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning
Reinforcement Learning, particularly through policy gradient methods, has played a central role in enabling reasoning capabilities of Large Language Models. However, the optimization stability of policy gradients in this setting remains understudied. As a result, existing implementations often resort to conservative hyperparameter choices to ensure stability, which requires more training samples and increases computational costs. Hence, developing models for reliably tracking the underlying optimization dynamics and leveraging them into training enables more sample-efficient regimes and further unleashes scalable post-training. We address this gap by formalizing the stochastic optimization problem of policy gradients with explicit consideration of second-order geometry. We propose a tractable computational framework that tracks and leverages curvature information during policy updates. We further employ this framework to design interventions in the optimization process through data selection. The resultant algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out. Theoretically, we establish monotonic improvement guarantees under realistic assumptions. On standard math reasoning benchmarks, we empirically show that CAPO ensures stable updates under aggressive learning regimes where baselines catastrophically fail. With minimal intervention (rejecting fewer than 8% of tokens), CAPO achieves up to 30x improvement in sample efficiency over standard GRPO for LLM reasoning.
Holographic Superconductors from Einstein-Maxwell-Dilaton Gravity
We construct holographic superconductors from Einstein-Maxwell-dilaton gravity in 3+1 dimensions with two adjustable couplings alpha and the charge q carried by the scalar field. For the values of alpha and q we consider, there is always a critical temperature at which a second order phase transition occurs between a hairy black hole and the AdS RN black hole in the canonical ensemble, which can be identified with the superconducting phase transition of the dual field theory. We calculate the electric conductivity of the dual superconductor and find that for the values of alpha and q where alpha/q is small the dual superconductor has similar properties to the minimal model, while for the values of alpha and q where alpha/q is large enough, the electric conductivity of the dual superconductor exhibits novel properties at low frequencies where it shows a "Drude Peak" in the real part of the conductivity.
AB-Cache: Training-Free Acceleration of Diffusion Models via Adams-Bashforth Cached Feature Reuse
Diffusion models have demonstrated remarkable success in generative tasks, yet their iterative denoising process results in slow inference, limiting their practicality. While existing acceleration methods exploit the well-known U-shaped similarity pattern between adjacent steps through caching mechanisms, they lack theoretical foundation and rely on simplistic computation reuse, often leading to performance degradation. In this work, we provide a theoretical understanding by analyzing the denoising process through the second-order Adams-Bashforth method, revealing a linear relationship between the outputs of consecutive steps. This analysis explains why the outputs of adjacent steps exhibit a U-shaped pattern. Furthermore, extending Adams-Bashforth method to higher order, we propose a novel caching-based acceleration approach for diffusion models, instead of directly reusing cached results, with a truncation error bound of only \(O(h^k)\) where h is the step size. Extensive validation across diverse image and video diffusion models (including HunyuanVideo and FLUX.1-dev) with various schedulers demonstrates our method's effectiveness in achieving nearly 3times speedup while maintaining original performance levels, offering a practical real-time solution without compromising generation quality.
APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models
Large Language Models (LLMs) have greatly advanced the natural language processing paradigm. However, the high computational load and huge model sizes pose a grand challenge for deployment on edge devices. To this end, we propose APTQ (Attention-aware Post-Training Mixed-Precision Quantization) for LLMs, which considers not only the second-order information of each layer's weights, but also, for the first time, the nonlinear effect of attention outputs on the entire model. We leverage the Hessian trace as a sensitivity metric for mixed-precision quantization, ensuring an informed precision reduction that retains model performance. Experiments show APTQ surpasses previous quantization methods, achieving an average of 4 bit width a 5.22 perplexity nearly equivalent to full precision in the C4 dataset. In addition, APTQ attains state-of-the-art zero-shot accuracy of 68.24\% and 70.48\% at an average bitwidth of 3.8 in LLaMa-7B and LLaMa-13B, respectively, demonstrating its effectiveness to produce high-quality quantized LLMs.
SEGNO: Generalizing Equivariant Graph Neural Networks with Physical Inductive Biases
Graph Neural Networks (GNNs) with equivariant properties have emerged as powerful tools for modeling complex dynamics of multi-object physical systems. However, their generalization ability is limited by the inadequate consideration of physical inductive biases: (1) Existing studies overlook the continuity of transitions among system states, opting to employ several discrete transformation layers to learn the direct mapping between two adjacent states; (2) Most models only account for first-order velocity information, despite the fact that many physical systems are governed by second-order motion laws. To incorporate these inductive biases, we propose the Second-order Equivariant Graph Neural Ordinary Differential Equation (SEGNO). Specifically, we show how the second-order continuity can be incorporated into GNNs while maintaining the equivariant property. Furthermore, we offer theoretical insights into SEGNO, highlighting that it can learn a unique trajectory between adjacent states, which is crucial for model generalization. Additionally, we prove that the discrepancy between this learned trajectory of SEGNO and the true trajectory is bounded. Extensive experiments on complex dynamical systems including molecular dynamics and motion capture demonstrate that our model yields a significant improvement over the state-of-the-art baselines.
The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against Llm Jailbreaks and Prompt Injections
How should we evaluate the robustness of language model defenses? Current defenses against jailbreaks and prompt injections (which aim to prevent an attacker from eliciting harmful knowledge or remotely triggering malicious actions, respectively) are typically evaluated either against a static set of harmful attack strings, or against computationally weak optimization methods that were not designed with the defense in mind. We argue that this evaluation process is flawed. Instead, we should evaluate defenses against adaptive attackers who explicitly modify their attack strategy to counter a defense's design while spending considerable resources to optimize their objective. By systematically tuning and scaling general optimization techniques-gradient descent, reinforcement learning, random search, and human-guided exploration-we bypass 12 recent defenses (based on a diverse set of techniques) with attack success rate above 90% for most; importantly, the majority of defenses originally reported near-zero attack success rates. We believe that future defense work must consider stronger attacks, such as the ones we describe, in order to make reliable and convincing claims of robustness.
Shadow Cones: A Generalized Framework for Partial Order Embeddings
Hyperbolic space has proven to be well-suited for capturing hierarchical relations in data, such as trees and directed acyclic graphs. Prior work introduced the concept of entailment cones, which uses partial orders defined by nested cones in the Poincar\'e ball to model hierarchies. Here, we introduce the ``shadow cones" framework, a physics-inspired entailment cone construction. Specifically, we model partial orders as subset relations between shadows formed by a light source and opaque objects in hyperbolic space. The shadow cones framework generalizes entailment cones to a broad class of formulations and hyperbolic space models beyond the Poincar\'e ball. This results in clear advantages over existing constructions: for example, shadow cones possess better optimization properties over constructions limited to the Poincar\'e ball. Our experiments on datasets of various sizes and hierarchical structures show that shadow cones consistently and significantly outperform existing entailment cone constructions. These results indicate that shadow cones are an effective way to model partial orders in hyperbolic space, offering physically intuitive and novel insights about the nature of such structures.
Curvature-Aware Training for Coordinate Networks
Coordinate networks are widely used in computer vision due to their ability to represent signals as compressed, continuous entities. However, training these networks with first-order optimizers can be slow, hindering their use in real-time applications. Recent works have opted for shallow voxel-based representations to achieve faster training, but this sacrifices memory efficiency. This work proposes a solution that leverages second-order optimization methods to significantly reduce training times for coordinate networks while maintaining their compressibility. Experiments demonstrate the effectiveness of this approach on various signal modalities, such as audio, images, videos, shape reconstruction, and neural radiance fields.
Projections onto Spectral Matrix Cones
Semidefinite programming is a fundamental problem class in convex optimization, but despite recent advances in solvers, solving large-scale semidefinite programs remains challenging. Generally the matrix functions involved are spectral or unitarily invariant, i.e., they depend only on the eigenvalues or singular values of the matrix. This paper investigates how spectral matrix cones -- cones defined from epigraphs and perspectives of spectral or unitarily invariant functions -- can be used to enhance first-order conic solvers for semidefinite programs. Our main result shows that projecting a matrix can be reduced to projecting its eigenvalues or singular values, which we demonstrate can be done at a negligible cost compared to the eigenvalue or singular value decomposition itself. We have integrated support for spectral matrix cone projections into the Splitting Conic Solver (SCS). Numerical experiments show that SCS with this enhancement can achieve speedups of up to an order of magnitude for solving semidefinite programs arising in experimental design, robust principal component analysis, and graph partitioning.
Escaping saddle points in zeroth-order optimization: the power of two-point estimators
Two-point zeroth order methods are important in many applications of zeroth-order optimization, such as robotics, wind farms, power systems, online optimization, and adversarial robustness to black-box attacks in deep neural networks, where the problem may be high-dimensional and/or time-varying. Most problems in these applications are nonconvex and contain saddle points. While existing works have shown that zeroth-order methods utilizing Omega(d) function valuations per iteration (with d denoting the problem dimension) can escape saddle points efficiently, it remains an open question if zeroth-order methods based on two-point estimators can escape saddle points. In this paper, we show that by adding an appropriate isotropic perturbation at each iteration, a zeroth-order algorithm based on 2m (for any 1 leq m leq d) function evaluations per iteration can not only find epsilon-second order stationary points polynomially fast, but do so using only Oleft(d{mepsilon^{2}psi}right) function evaluations, where psi geq Omegaleft(epsilonright) is a parameter capturing the extent to which the function of interest exhibits the strict saddle property.
Second-order difference subspace
Subspace representation is a fundamental technique in various fields of machine learning. Analyzing a geometrical relationship among multiple subspaces is essential for understanding subspace series' temporal and/or spatial dynamics. This paper proposes the second-order difference subspace, a higher-order extension of the first-order difference subspace between two subspaces that can analyze the geometrical difference between them. As a preliminary for that, we extend the definition of the first-order difference subspace to the more general setting that two subspaces with different dimensions have an intersection. We then define the second-order difference subspace by combining the concept of first-order difference subspace and principal component subspace (Karcher mean) between two subspaces, motivated by the second-order central difference method. We can understand that the first/second-order difference subspaces correspond to the velocity and acceleration of subspace dynamics from the viewpoint of a geodesic on a Grassmann manifold. We demonstrate the validity and naturalness of our second-order difference subspace by showing numerical results on two applications: temporal shape analysis of a 3D object and time series analysis of a biometric signal.
ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning
We introduce ADAHESSIAN, a second order stochastic optimization algorithm which dynamically incorporates the curvature of the loss function via ADAptive estimates of the HESSIAN. Second order algorithms are among the most powerful optimization algorithms with superior convergence properties as compared to first order methods such as SGD and Adam. The main disadvantage of traditional second order methods is their heavier per iteration computation and poor accuracy as compared to first order methods. To address these, we incorporate several novel approaches in ADAHESSIAN, including: (i) a fast Hutchinson based method to approximate the curvature matrix with low computational overhead; (ii) a root-mean-square exponential moving average to smooth out variations of the Hessian diagonal across different iterations; and (iii) a block diagonal averaging to reduce the variance of Hessian diagonal elements. We show that ADAHESSIAN achieves new state-of-the-art results by a large margin as compared to other adaptive optimization methods, including variants of Adam. In particular, we perform extensive tests on CV, NLP, and recommendation system tasks and find that ADAHESSIAN: (i) achieves 1.80%/1.45% higher accuracy on ResNets20/32 on Cifar10, and 5.55% higher accuracy on ImageNet as compared to Adam; (ii) outperforms AdamW for transformers by 0.13/0.33 BLEU score on IWSLT14/WMT14 and 2.7/1.0 PPL on PTB/Wikitext-103; (iii) outperforms AdamW for SqueezeBert by 0.41 points on GLUE; and (iv) achieves 0.032% better score than Adagrad for DLRM on the Criteo Ad Kaggle dataset. Importantly, we show that the cost per iteration of ADAHESSIAN is comparable to first order methods, and that it exhibits robustness towards its hyperparameters.
FOSI: Hybrid First and Second Order Optimization
Popular machine learning approaches forgo second-order information due to the difficulty of computing curvature in high dimensions. We present FOSI, a novel meta-algorithm that improves the performance of any base first-order optimizer by efficiently incorporating second-order information during the optimization process. In each iteration, FOSI implicitly splits the function into two quadratic functions defined on orthogonal subspaces, then uses a second-order method to minimize the first, and the base optimizer to minimize the other. We formally analyze FOSI's convergence and the conditions under which it improves a base optimizer. Our empirical evaluation demonstrates that FOSI improves the convergence rate and optimization time of first-order methods such as Heavy-Ball and Adam, and outperforms second-order methods (K-FAC and L-BFGS).
Scalable Second Order Optimization for Deep Learning
Optimization in machine learning, both theoretical and applied, is presently dominated by first-order gradient methods such as stochastic gradient descent. Second-order optimization methods, that involve second derivatives and/or second order statistics of the data, are far less prevalent despite strong theoretical properties, due to their prohibitive computation, memory and communication costs. In an attempt to bridge this gap between theoretical and practical optimization, we present a scalable implementation of a second-order preconditioned method (concretely, a variant of full-matrix Adagrad), that along with several critical algorithmic and numerical improvements, provides significant convergence and wall-clock time improvements compared to conventional first-order methods on state-of-the-art deep models. Our novel design effectively utilizes the prevalent heterogeneous hardware architecture for training deep models, consisting of a multicore CPU coupled with multiple accelerator units. We demonstrate superior performance compared to state-of-the-art on very large learning tasks such as machine translation with Transformers, language modeling with BERT, click-through rate prediction on Criteo, and image classification on ImageNet with ResNet-50.
Second-Order Kernel Online Convex Optimization with Adaptive Sketching
Kernel online convex optimization (KOCO) is a framework combining the expressiveness of non-parametric kernel models with the regret guarantees of online learning. First-order KOCO methods such as functional gradient descent require only O(t) time and space per iteration, and, when the only information on the losses is their convexity, achieve a minimax optimal O(T) regret. Nonetheless, many common losses in kernel problems, such as squared loss, logistic loss, and squared hinge loss posses stronger curvature that can be exploited. In this case, second-order KOCO methods achieve O(log(Det(K))) regret, which we show scales as O(d_{eff}log T), where d_{eff} is the effective dimension of the problem and is usually much smaller than O(T). The main drawback of second-order methods is their much higher O(t^2) space and time complexity. In this paper, we introduce kernel online Newton step (KONS), a new second-order KOCO method that also achieves O(d_{eff}log T) regret. To address the computational complexity of second-order methods, we introduce a new matrix sketching algorithm for the kernel matrix K_t, and show that for a chosen parameter γleq 1 our Sketched-KONS reduces the space and time complexity by a factor of γ^2 to O(t^2γ^2) space and time per iteration, while incurring only 1/γ times more regret.
Interpreting the Second-Order Effects of Neurons in CLIP
We interpret the function of individual neurons in CLIP by automatically describing them using text. Analyzing the direct effects (i.e. the flow from a neuron through the residual stream to the output) or the indirect effects (overall contribution) fails to capture the neurons' function in CLIP. Therefore, we present the "second-order lens", analyzing the effect flowing from a neuron through the later attention heads, directly to the output. We find that these effects are highly selective: for each neuron, the effect is significant for <2% of the images. Moreover, each effect can be approximated by a single direction in the text-image space of CLIP. We describe neurons by decomposing these directions into sparse sets of text representations. The sets reveal polysemantic behavior - each neuron corresponds to multiple, often unrelated, concepts (e.g. ships and cars). Exploiting this neuron polysemy, we mass-produce "semantic" adversarial examples by generating images with concepts spuriously correlated to the incorrect class. Additionally, we use the second-order effects for zero-shot segmentation and attribute discovery in images. Our results indicate that a scalable understanding of neurons can be used for model deception and for introducing new model capabilities.
4-bit Shampoo for Memory-Efficient Network Training
Second-order optimizers, maintaining a matrix termed a preconditioner, are superior to first-order optimizers in both theory and practice. The states forming the preconditioner and its inverse root restrict the maximum size of models trained by second-order optimizers. To address this, compressing 32-bit optimizer states to lower bitwidths has shown promise in reducing memory usage. However, current approaches only pertain to first-order optimizers. In this paper, we propose the first 4-bit second-order optimizers, exemplified by 4-bit Shampoo, maintaining performance similar to that of 32-bit ones. We show that quantizing the eigenvector matrix of the preconditioner in 4-bit Shampoo is remarkably better than quantizing the preconditioner itself both theoretically and experimentally. By rectifying the orthogonality of the quantized eigenvector matrix, we enhance the approximation of the preconditioner's eigenvector matrix, which also benefits the computation of its inverse 4-th root. Besides, we find that linear square quantization slightly outperforms dynamic tree quantization when quantizing second-order optimizer states. Evaluation on various networks for image classification demonstrates that our 4-bit Shampoo achieves comparable test accuracy to its 32-bit counterpart while being more memory-efficient. The source code will be made available.
The fractional chromatic number of double cones over graphs
Assume n, m are positive integers and G is a graph. Let P_{n,m} be the graph obtained from the path with vertices {-m, -(m-1), ldots, 0, ldots, n} by adding a loop at vertex 0. The double cone Delta_{n,m}(G) over a graph G is obtained from the direct product G times P_{n,m} by identifying V(G) times {n} into a single vertex (star, n), identifying V(G) times {-m} into a single vertex (star, -m), and adding an edge connecting (star, -m) and (star, n). This paper determines the fractional chromatic number of Delta_{n,m}(G). In particular, if n < m or n=m is even, then chi_f(Delta_{n,m}(G)) = chi_f(Delta_n(G)), where Delta_n(G) is the nth cone over G. If n=m is odd, then chi_f(Delta_{n,m}(G)) > chi_f(Delta_n(G)). The chromatic number of Delta_{n,m}(G) is also discussed.
MKOR: Momentum-Enabled Kronecker-Factor-Based Optimizer Using Rank-1 Updates
This work proposes a Momentum-Enabled Kronecker-Factor-Based Optimizer Using Rank-1 updates, called MKOR, that improves the training time and convergence properties of deep neural networks (DNNs). Second-order techniques, while enjoying higher convergence rates vs first-order counterparts, have cubic complexity with respect to either the model size and/or the training batch size. Hence they exhibit poor scalability and performance in transformer models, e.g. large language models (LLMs), because the batch sizes in these models scale by the attention mechanism sequence length, leading to large model size and batch sizes. MKOR's complexity is quadratic with respect to the model size, alleviating the computation bottlenecks in second-order methods. Because of their high computation complexity, state-of-the-art implementations of second-order methods can only afford to update the second order information infrequently, and thus do not fully exploit the promise of better convergence from these updates. By reducing the communication complexity of the second-order updates as well as achieving a linear communication complexity, MKOR increases the frequency of second order updates. We also propose a hybrid version of MKOR (called MKOR-H) that mid-training falls backs to a first order optimizer if the second order updates no longer accelerate convergence. Our experiments show that MKOR outperforms state -of-the-art first order methods, e.g. the LAMB optimizer, and best implementations of second-order methods, i.e. KAISA/KFAC, up to 2.57x and 1.85x respectively on BERT-Large-Uncased on 64 GPUs.
Optimal Input Gain: All You Need to Supercharge a Feed-Forward Neural Network
Linear transformation of the inputs alters the training performance of feed-forward networks that are otherwise equivalent. However, most linear transforms are viewed as a pre-processing operation separate from the actual training. Starting from equivalent networks, it is shown that pre-processing inputs using linear transformation are equivalent to multiplying the negative gradient matrix with an autocorrelation matrix per training iteration. Second order method is proposed to find the autocorrelation matrix that maximizes learning in a given iteration. When the autocorrelation matrix is diagonal, the method optimizes input gains. This optimal input gain (OIG) approach is used to improve two first-order two-stage training algorithms, namely back-propagation (BP) and hidden weight optimization (HWO), which alternately update the input weights and solve linear equations for output weights. Results show that the proposed OIG approach greatly enhances the performance of the first-order algorithms, often allowing them to rival the popular Levenberg-Marquardt approach with far less computation. It is shown that HWO is equivalent to BP with Whitening transformation applied to the inputs. HWO effectively combines Whitening transformation with learning. Thus, OIG improved HWO could be a significant building block to more complex deep learning architectures.
AdaFisher: Adaptive Second Order Optimization via Fisher Information
First-order optimization methods are currently the mainstream in training deep neural networks (DNNs). Optimizers like Adam incorporate limited curvature information by employing the diagonal matrix preconditioning of the stochastic gradient during the training. Despite their widespread, second-order optimization algorithms exhibit superior convergence properties compared to their first-order counterparts e.g. Adam and SGD. However, their practicality in training DNNs are still limited due to increased per-iteration computations and suboptimal accuracy compared to the first order methods. We present AdaFisher--an adaptive second-order optimizer that leverages a block-diagonal approximation to the Fisher information matrix for adaptive gradient preconditioning. AdaFisher aims to bridge the gap between enhanced convergence capabilities and computational efficiency in second-order optimization framework for training DNNs. Despite the slow pace of second-order optimizers, we showcase that AdaFisher can be reliably adopted for image classification, language modelling and stand out for its stability and robustness in hyperparameter tuning. We demonstrate that AdaFisher outperforms the SOTA optimizers in terms of both accuracy and convergence speed. Code available from https://github.com/AtlasAnalyticsLab/AdaFisher{https://github.com/AtlasAnalyticsLab/AdaFisher}
A Fast and Provable Algorithm for Sparse Phase Retrieval
We study the sparse phase retrieval problem, which seeks to recover a sparse signal from a limited set of magnitude-only measurements. In contrast to prevalent sparse phase retrieval algorithms that primarily use first-order methods, we propose an innovative second-order algorithm that employs a Newton-type method with hard thresholding. This algorithm overcomes the linear convergence limitations of first-order methods while preserving their hallmark per-iteration computational efficiency. We provide theoretical guarantees that our algorithm converges to the s-sparse ground truth signal x^{natural} in R^n (up to a global sign) at a quadratic convergence rate after at most O(log (Vertx^{natural} Vert /x_{min}^{natural})) iterations, using Omega(s^2log n) Gaussian random samples. Numerical experiments show that our algorithm achieves a significantly faster convergence rate than state-of-the-art methods.
Old Optimizer, New Norm: An Anthology
Deep learning optimizers are often motivated through a mix of convex and approximate second-order theory. We select three such methods -- Adam, Shampoo and Prodigy -- and argue that each method can instead be understood as a squarely first-order method without convexity assumptions. In fact, after switching off exponential moving averages, each method is equivalent to steepest descent under a particular norm. By generalizing this observation, we chart a new design space for training algorithms. Different operator norms should be assigned to different tensors based on the role that the tensor plays within the network. For example, while linear and embedding layers may have the same weight space of R^{mtimes n}, these layers play different roles and should be assigned different norms. We hope that this idea of carefully metrizing the neural architecture might lead to more stable, scalable and indeed faster training.
Bridging Discrete and Backpropagation: Straight-Through and Beyond
Backpropagation, the cornerstone of deep learning, is limited to computing gradients for continuous variables. This limitation poses challenges for problems involving discrete latent variables. To address this issue, we propose a novel approach to approximate the gradient of parameters involved in generating discrete latent variables. First, we examine the widely used Straight-Through (ST) heuristic and demonstrate that it works as a first-order approximation of the gradient. Guided by our findings, we propose ReinMax, which achieves second-order accuracy by integrating Heun's method, a second-order numerical method for solving ODEs. ReinMax does not require Hessian or other second-order derivatives, thus having negligible computation overheads. Extensive experimental results on various tasks demonstrate the superiority of ReinMax over the state of the art. Implementations are released at https://github.com/microsoft/ReinMax.
Gradient-Normalized Smoothness for Optimization with Approximate Hessians
In this work, we develop new optimization algorithms that use approximate second-order information combined with the gradient regularization technique to achieve fast global convergence rates for both convex and non-convex objectives. The key innovation of our analysis is a novel notion called Gradient-Normalized Smoothness, which characterizes the maximum radius of a ball around the current point that yields a good relative approximation of the gradient field. Our theory establishes a natural intrinsic connection between Hessian approximation and the linearization of the gradient. Importantly, Gradient-Normalized Smoothness does not depend on the specific problem class of the objective functions, while effectively translating local information about the gradient field and Hessian approximation into the global behavior of the method. This new concept equips approximate second-order algorithms with universal global convergence guarantees, recovering state-of-the-art rates for functions with H\"older-continuous Hessians and third derivatives, quasi-self-concordant functions, as well as smooth classes in first-order optimization. These rates are achieved automatically and extend to broader classes, such as generalized self-concordant functions. We demonstrate direct applications of our results for global linear rates in logistic regression and softmax problems with approximate Hessians, as well as in non-convex optimization using Fisher and Gauss-Newton approximations.
On Second-Order Scoring Rules for Epistemic Uncertainty Quantification
It is well known that accurate probabilistic predictors can be trained through empirical risk minimisation with proper scoring rules as loss functions. While such learners capture so-called aleatoric uncertainty of predictions, various machine learning methods have recently been developed with the goal to let the learner also represent its epistemic uncertainty, i.e., the uncertainty caused by a lack of knowledge and data. An emerging branch of the literature proposes the use of a second-order learner that provides predictions in terms of distributions on probability distributions. However, recent work has revealed serious theoretical shortcomings for second-order predictors based on loss minimisation. In this paper, we generalise these findings and prove a more fundamental result: There seems to be no loss function that provides an incentive for a second-order learner to faithfully represent its epistemic uncertainty in the same manner as proper scoring rules do for standard (first-order) learners. As a main mathematical tool to prove this result, we introduce the generalised notion of second-order scoring rules.
First Integrals of Geodesic Flows on Cones
In this paper we study the behavior of geodesics on cones over arbitrary C^3-smooth closed Riemannian manifolds. We show that the geodesic flow on such cones admits first integrals whose values uniquely determine almost all geodesics except for cone generatrices. This investigation is inspired by our results on billiards inside cones over manifolds where similar results hold true.
Dual Lagrangian Learning for Conic Optimization
This paper presents Dual Lagrangian Learning (DLL), a principled learning methodology for dual conic optimization proxies. DLL leverages conic duality and the representation power of ML models to provide high-duality, dual-feasible solutions, and therefore valid Lagrangian dual bounds, for linear and nonlinear conic optimization problems. The paper introduces a systematic dual completion procedure, differentiable conic projection layers, and a self-supervised learning framework based on Lagrangian duality. It also provides closed-form dual completion formulae for broad classes of conic problems, which eliminate the need for costly implicit layers. The effectiveness of DLL is demonstrated on linear and nonlinear conic optimization problems. The proposed methodology significantly outperforms a state-of-the-art learning-based method, and achieves 1000x speedups over commercial interior-point solvers with optimality gaps under 0.5\% on average.
Second-Order Uncertainty Quantification: A Distance-Based Approach
In the past couple of years, various approaches to representing and quantifying different types of predictive uncertainty in machine learning, notably in the setting of classification, have been proposed on the basis of second-order probability distributions, i.e., predictions in the form of distributions on probability distributions. A completely conclusive solution has not yet been found, however, as shown by recent criticisms of commonly used uncertainty measures associated with second-order distributions, identifying undesirable theoretical properties of these measures. In light of these criticisms, we propose a set of formal criteria that meaningful uncertainty measures for predictive uncertainty based on second-order distributions should obey. Moreover, we provide a general framework for developing uncertainty measures to account for these criteria, and offer an instantiation based on the Wasserstein distance, for which we prove that all criteria are satisfied.
Second-order optimization with lazy Hessians
We analyze Newton's method with lazy Hessian updates for solving general possibly non-convex optimization problems. We propose to reuse a previously seen Hessian for several iterations while computing new gradients at each step of the method. This significantly reduces the overall arithmetical complexity of second-order optimization schemes. By using the cubic regularization technique, we establish fast global convergence of our method to a second-order stationary point, while the Hessian does not need to be updated each iteration. For convex problems, we justify global and local superlinear rates for lazy Newton steps with quadratic regularization, which is easier to compute. The optimal frequency for updating the Hessian is once every d iterations, where d is the dimension of the problem. This provably improves the total arithmetical complexity of second-order algorithms by a factor d.
Is Fast Adaptation All You Need?
Gradient-based meta-learning has proven to be highly effective at learning model initializations, representations, and update rules that allow fast adaptation from a few samples. The core idea behind these approaches is to use fast adaptation and generalization -- two second-order metrics -- as training signals on a meta-training dataset. However, little attention has been given to other possible second-order metrics. In this paper, we investigate a different training signal -- robustness to catastrophic interference -- and demonstrate that representations learned by directing minimizing interference are more conducive to incremental learning than those learned by just maximizing fast adaptation.
Sensitivity Analysis On Loss Landscape
Gradients can be employed for sensitivity analysis. Here, we leverage the advantages of the Loss Landscape to comprehend which independent variables impact the dependent variable. We seek to grasp the loss landscape by utilizing first, second, and third derivatives through automatic differentiation. we know that Spearman's rank correlation coefficient can detect the monotonic relationship between two variables. However, I have found that second-order gradients, with certain configurations and parameters, provide information that can be visualized similarly to Spearman results, In this approach, we incorporate a loss function with an activation function, resulting in a non-linear pattern. Each exploration of the loss landscape through retraining yields new valuable information. Furthermore, the first and third derivatives are also beneficial, as they indicate the extent to which independent variables influence the dependent variable.
Optimal Stochastic Non-smooth Non-convex Optimization through Online-to-Non-convex Conversion
We present new algorithms for optimizing non-smooth, non-convex stochastic objectives based on a novel analysis technique. This improves the current best-known complexity for finding a (delta,epsilon)-stationary point from O(epsilon^{-4}delta^{-1}) stochastic gradient queries to O(epsilon^{-3}delta^{-1}), which we also show to be optimal. Our primary technique is a reduction from non-smooth non-convex optimization to online learning, after which our results follow from standard regret bounds in online learning. For deterministic and second-order smooth objectives, applying more advanced optimistic online learning techniques enables a new complexity of O(epsilon^{-1.5}delta^{-0.5}). Our techniques also recover all optimal or best-known results for finding epsilon stationary points of smooth or second-order smooth objectives in both stochastic and deterministic settings.
On cusp holonomies in strictly convex projective geometry
We give a complete characterization of the holonomies of strictly convex cusps and of round cusps in convex projective geometry. We build families of generalized cusps of non-maximal rank associated to each strictly convex or round cusp. We also extend Ballas-Cooper-Leitner's definition of generalized cusp to allow for virtually solvable fundamental group, and we produce the first such example with non-virtually nilpotent fundamental group. Along with a companion paper, this allows to build strictly convex cusps and generalized cusps whose fundamental group is any finitely generated virtually nilpotent group. This also has interesting consequences for the theory of relatively Anosov representations.
Bolstering Stochastic Gradient Descent with Model Building
Stochastic gradient descent method and its variants constitute the core optimization algorithms that achieve good convergence rates for solving machine learning problems. These rates are obtained especially when these algorithms are fine-tuned for the application at hand. Although this tuning process can require large computational costs, recent work has shown that these costs can be reduced by line search methods that iteratively adjust the stepsize. We propose an alternative approach to stochastic line search by using a new algorithm based on forward step model building. This model building step incorporates second-order information that allows adjusting not only the stepsize but also the search direction. Noting that deep learning model parameters come in groups (layers of tensors), our method builds its model and calculates a new step for each parameter group. This novel diagonalization approach makes the selected step lengths adaptive. We provide convergence rate analysis, and experimentally show that the proposed algorithm achieves faster convergence and better generalization in well-known test problems. More precisely, SMB requires less tuning, and shows comparable performance to other adaptive methods.
Barycentric Subspace Analysis on Manifolds
This paper investigates the generalization of Principal Component Analysis (PCA) to Riemannian manifolds. We first propose a new and general type of family of subspaces in manifolds that we call barycentric subspaces. They are implicitly defined as the locus of points which are weighted means of k+1 reference points. As this definition relies on points and not on tangent vectors, it can also be extended to geodesic spaces which are not Riemannian. For instance, in stratified spaces, it naturally allows principal subspaces that span several strata, which is impossible in previous generalizations of PCA. We show that barycentric subspaces locally define a submanifold of dimension k which generalizes geodesic subspaces.Second, we rephrase PCA in Euclidean spaces as an optimization on flags of linear subspaces (a hierarchy of properly embedded linear subspaces of increasing dimension). We show that the Euclidean PCA minimizes the Accumulated Unexplained Variances by all the subspaces of the flag (AUV). Barycentric subspaces are naturally nested, allowing the construction of hierarchically nested subspaces. Optimizing the AUV criterion to optimally approximate data points with flags of affine spans in Riemannian manifolds lead to a particularly appealing generalization of PCA on manifolds called Barycentric Subspaces Analysis (BSA).
Polychrony as Chinampas
In this paper, we study the flow of signals through linear paths with the nonlinear condition that a node emits a signal when it receives external stimuli or when two incoming signals from other nodes arrive coincidentally with a combined amplitude above a fixed threshold. Sets of such nodes form a polychrony group and can sometimes lead to cascades. In the context of this work, cascades are polychrony groups in which the number of nodes activated as a consequence of other nodes is greater than the number of externally activated nodes. The difference between these two numbers is the so-called profit. Given the initial conditions, we predict the conditions for a vertex to activate at a prescribed time and provide an algorithm to efficiently reconstruct a cascade. We develop a dictionary between polychrony groups and graph theory. We call the graph corresponding to a cascade a chinampa. This link leads to a topological classification of chinampas. We enumerate the chinampas of profits zero and one and the description of a family of chinampas isomorphic to a family of partially ordered sets, which implies that the enumeration problem of this family is equivalent to computing the Stanley-order polynomials of those partially ordered sets.
Faces of highest weight modules and the universal Weyl polyhedron
Let V be a highest weight module over a Kac-Moody algebra g, and let conv V denote the convex hull of its weights. We determine the combinatorial isomorphism type of conv V, i.e. we completely classify the faces and their inclusions. In the special case where g is semisimple, this brings closure to a question studied by Cellini-Marietti [IMRN 2015] for the adjoint representation, and by Khare [J. Algebra 2016; Trans. Amer. Math. Soc. 2017] for most modules. The determination of faces of finite-dimensional modules up to the Weyl group action and some of their inclusions also appears in previous work of Satake [Ann. of Math. 1960], Borel-Tits [IHES Publ. Math. 1965], Vinberg [Izv. Akad. Nauk 1990], and Casselman [Austral. Math. Soc. 1997]. For any subset of the simple roots, we introduce a remarkable convex cone which we call the universal Weyl polyhedron, which controls the convex hulls of all modules parabolically induced from the corresponding Levi factor. Namely, the combinatorial isomorphism type of the cone stores the classification of faces for all such highest weight modules, as well as how faces degenerate as the highest weight gets increasingly singular. To our knowledge, this cone is new in finite and infinite type. We further answer a question of Michel Brion, by showing that the localization of conv V along a face is always the convex hull of the weights of a parabolically induced module. Finally, as we determine the inclusion relations between faces representation-theoretically from the set of weights, without recourse to convexity, we answer a similar question for highest weight modules over symmetrizable quantum groups.
Learning Preconditioner for Conjugate Gradient PDE Solvers
Efficient numerical solvers for partial differential equations empower science and engineering. One of the commonly employed numerical solvers is the preconditioned conjugate gradient (PCG) algorithm which can solve large systems to a given precision level. One challenge in PCG solvers is the selection of preconditioners, as different problem-dependent systems can benefit from different preconditioners. We present a new method to introduce inductive bias in preconditioning conjugate gradient algorithm. Given a system matrix and a set of solution vectors arise from an underlying distribution, we train a graph neural network to obtain an approximate decomposition to the system matrix to be used as a preconditioner in the context of PCG solvers. We conduct extensive experiments to demonstrate the efficacy and generalizability of our proposed approach in solving various 2D and 3D linear second-order PDEs.
Regularity of shadows and the geometry of the singular set associated to a Monge-Ampere equation
Illuminating the surface of a convex body with parallel beams of light in a given direction generates a shadow region. We prove sharp regularity results for the boundary of this shadow in every direction of illumination. Moreover, techniques are developed for investigating the regularity of the region generated by orthogonally projecting a convex set onto another. As an application we study the geometry and Hausdorff dimension of the singular set corresponding to a Monge-Ampere equation.
Advancing the lower bounds: An accelerated, stochastic, second-order method with optimal adaptation to inexactness
We present a new accelerated stochastic second-order method that is robust to both gradient and Hessian inexactness, which occurs typically in machine learning. We establish theoretical lower bounds and prove that our algorithm achieves optimal convergence in both gradient and Hessian inexactness in this key setting. We further introduce a tensor generalization for stochastic higher-order derivatives. When the oracles are non-stochastic, the proposed tensor algorithm matches the global convergence of Nesterov Accelerated Tensor method. Both algorithms allow for approximate solutions of their auxiliary subproblems with verifiable conditions on the accuracy of the solution.
Order Theory in the Context of Machine Learning
The paper ``Tropical Geometry of Deep Neural Networks'' by L. Zhang et al. introduces an equivalence between integer-valued neural networks (IVNN) with ReLU_{t} and tropical rational functions, which come with a map to polytopes. Here, IVNN refers to a network with integer weights but real biases, and ReLU_{t} is defined as ReLU_{t}(x)=max(x,t) for tinRcup{-infty}. For every poset with n points, there exists a corresponding order polytope, i.e., a convex polytope in the unit cube [0,1]^n whose coordinates obey the inequalities of the poset. We study neural networks whose associated polytope is an order polytope. We then explain how posets with four points induce neural networks that can be interpreted as 2times 2 convolutional filters. These poset filters can be added to any neural network, not only IVNN. Similarly to maxout, poset pooling filters update the weights of the neural network during backpropagation with more precision than average pooling, max pooling, or mixed pooling, without the need to train extra parameters. We report experiments that support our statements. We also define the structure of algebra over the operad of posets on poset neural networks and tropical polynomials. This formalism allows us to study the composition of poset neural network arquitectures and the effect on their corresponding Newton polytopes, via the introduction of the generalization of two operations on polytopes: the Minkowski sum and the convex envelope.
Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion
Sampling from the posterior distribution poses a major computational challenge in solving inverse problems using latent diffusion models. Common methods rely on Tweedie's first-order moments, which are known to induce a quality-limiting bias. Existing second-order approximations are impractical due to prohibitive computational costs, making standard reverse diffusion processes intractable for posterior sampling. This paper introduces Second-order Tweedie sampler from Surrogate Loss (STSL), a novel sampler that offers efficiency comparable to first-order Tweedie with a tractable reverse process using second-order approximation. Our theoretical results reveal that the second-order approximation is lower bounded by our surrogate loss that only requires O(1) compute using the trace of the Hessian, and by the lower bound we derive a new drift term to make the reverse process tractable. Our method surpasses SoTA solvers PSLD and P2L, achieving 4X and 8X reduction in neural function evaluations, respectively, while notably enhancing sampling quality on FFHQ, ImageNet, and COCO benchmarks. In addition, we show STSL extends to text-guided image editing and addresses residual distortions present from corrupted images in leading text-guided image editing methods. To our best knowledge, this is the first work to offer an efficient second-order approximation in solving inverse problems using latent diffusion and editing real-world images with corruptions.
Approximating the Convex Hull via Metric Space Magnitude
Magnitude of a finite metric space and the related notion of magnitude functions on metric spaces is an active area of research in algebraic topology. Magnitude originally arose in the context of biology, where it represents the number of effective species in an environment; when applied to a one-parameter family of metric spaces tX with scale parameter t, the magnitude captures much of the underlying geometry of the space. Prior work has mostly focussed on properties of magnitude in a global sense; in this paper we restrict the sets to finite subsets of Euclidean space and investigate its individual components. We give an explicit formula for the corrected inclusion-exclusion principle, and define a quantity associated with each point, called the moment which gives an intrinsic ordering to the points. We exploit this in order to form an algorithm which approximates the convex hull.
A New Perspective on Shampoo's Preconditioner
Shampoo, a second-order optimization algorithm which uses a Kronecker product preconditioner, has recently garnered increasing attention from the machine learning community. The preconditioner used by Shampoo can be viewed either as an approximation of the Gauss--Newton component of the Hessian or the covariance matrix of the gradients maintained by Adagrad. We provide an explicit and novel connection between the optimal Kronecker product approximation of these matrices and the approximation made by Shampoo. Our connection highlights a subtle but common misconception about Shampoo's approximation. In particular, the square of the approximation used by the Shampoo optimizer is equivalent to a single step of the power iteration algorithm for computing the aforementioned optimal Kronecker product approximation. Across a variety of datasets and architectures we empirically demonstrate that this is close to the optimal Kronecker product approximation. Additionally, for the Hessian approximation viewpoint, we empirically study the impact of various practical tricks to make Shampoo more computationally efficient (such as using the batch gradient and the empirical Fisher) on the quality of Hessian approximation.
Exact Gauss-Newton Optimization for Training Deep Neural Networks
We present EGN, a stochastic second-order optimization algorithm that combines the generalized Gauss-Newton (GN) Hessian approximation with low-rank linear algebra to compute the descent direction. Leveraging the Duncan-Guttman matrix identity, the parameter update is obtained by factorizing a matrix which has the size of the mini-batch. This is particularly advantageous for large-scale machine learning problems where the dimension of the neural network parameter vector is several orders of magnitude larger than the batch size. Additionally, we show how improvements such as line search, adaptive regularization, and momentum can be seamlessly added to EGN to further accelerate the algorithm. Moreover, under mild assumptions, we prove that our algorithm converges to an epsilon-stationary point at a linear rate. Finally, our numerical experiments demonstrate that EGN consistently exceeds, or at most matches the generalization performance of well-tuned SGD, Adam, and SGN optimizers across various supervised and reinforcement learning tasks.
Ellipse R-CNN: Learning to Infer Elliptical Object from Clustering and Occlusion
Images of heavily occluded objects in cluttered scenes, such as fruit clusters in trees, are hard to segment. To further retrieve the 3D size and 6D pose of each individual object in such cases, bounding boxes are not reliable from multiple views since only a little portion of the object's geometry is captured. We introduce the first CNN-based ellipse detector, called Ellipse R-CNN, to represent and infer occluded objects as ellipses. We first propose a robust and compact ellipse regression based on the Mask R-CNN architecture for elliptical object detection. Our method can infer the parameters of multiple elliptical objects even they are occluded by other neighboring objects. For better occlusion handling, we exploit refined feature regions for the regression stage, and integrate the U-Net structure for learning different occlusion patterns to compute the final detection score. The correctness of ellipse regression is validated through experiments performed on synthetic data of clustered ellipses. We further quantitatively and qualitatively demonstrate that our approach outperforms the state-of-the-art model (i.e., Mask R-CNN followed by ellipse fitting) and its three variants on both synthetic and real datasets of occluded and clustered elliptical objects.
Bootstrability in Line-Defect CFT with Improved Truncation Methods
We study the conformal bootstrap of 1D CFTs on the straight Maldacena-Wilson line in 4D {cal N}=4 super-Yang-Mills theory. We introduce an improved truncation scheme with an 'OPE tail' approximation and use it to reproduce the 'bootstrability' results of Cavagli\`a et al. for the OPE-coefficients squared of the first three unprotected operators. For example, for the first OPE-coefficient squared at 't Hooft coupling (4pi)^2, linear-functional methods with two sum rules from integrated correlators give the rigorous result 0.294014873 pm 4.88 cdot 10^{-8}, whereas our methods give with machine-precision computations 0.294014228 pm 6.77 cdot 10^{-7}. For our numerical searches, we benchmark the Reinforcement Learning Soft Actor-Critic algorithm against an Interior Point Method algorithm (IPOPT) and comment on the merits of each algorithm.
Spacetime Neural Network for High Dimensional Quantum Dynamics
We develop a spacetime neural network method with second order optimization for solving quantum dynamics from the high dimensional Schr\"{o}dinger equation. In contrast to the standard iterative first order optimization and the time-dependent variational principle, our approach utilizes the implicit mid-point method and generates the solution for all spatial and temporal values simultaneously after optimization. We demonstrate the method in the Schr\"{o}dinger equation with a self-normalized autoregressive spacetime neural network construction. Future explorations for solving different high dimensional differential equations are discussed.
Training Neural Networks in Single vs Double Precision
The commitment to single-precision floating-point arithmetic is widespread in the deep learning community. To evaluate whether this commitment is justified, the influence of computing precision (single and double precision) on the optimization performance of the Conjugate Gradient (CG) method (a second-order optimization algorithm) and RMSprop (a first-order algorithm) has been investigated. Tests of neural networks with one to five fully connected hidden layers and moderate or strong nonlinearity with up to 4 million network parameters have been optimized for Mean Square Error (MSE). The training tasks have been set up so that their MSE minimum was known to be zero. Computing experiments have disclosed that single-precision can keep up (with superlinear convergence) with double-precision as long as line search finds an improvement. First-order methods such as RMSprop do not benefit from double precision. However, for moderately nonlinear tasks, CG is clearly superior. For strongly nonlinear tasks, both algorithm classes find only solutions fairly poor in terms of mean square error as related to the output variance. CG with double floating-point precision is superior whenever the solutions have the potential to be useful for the application goal.
Nearest Neighbour Based Estimates of Gradients: Sharp Nonasymptotic Bounds and Applications
Motivated by a wide variety of applications, ranging from stochastic optimization to dimension reduction through variable selection, the problem of estimating gradients accurately is of crucial importance in statistics and learning theory. We consider here the classic regression setup, where a real valued square integrable r.v. Y is to be predicted upon observing a (possibly high dimensional) random vector X by means of a predictive function f(X) as accurately as possible in the mean-squared sense and study a nearest-neighbour-based pointwise estimate of the gradient of the optimal predictive function, the regression function m(x)=E[Ymid X=x]. Under classic smoothness conditions combined with the assumption that the tails of Y-m(X) are sub-Gaussian, we prove nonasymptotic bounds improving upon those obtained for alternative estimation methods. Beyond the novel theoretical results established, several illustrative numerical experiments have been carried out. The latter provide strong empirical evidence that the estimation method proposed works very well for various statistical problems involving gradient estimation, namely dimensionality reduction, stochastic gradient descent optimization and quantifying disentanglement.
More on the Weak Gravity Conjecture via Convexity of Charged Operators
The Weak Gravity Conjecture has recently been re-formulated in terms of a particle with non-negative self-binding energy. Because of the dual conformal field theory (CFT) formulation in the anti-de Sitter space the conformal dimension Delta (Q) of the lowest-dimension operator with charge Q under some global U(1) symmetry must be a convex function of Q. This property has been conjectured to hold for any (unitary) conformal field theory and generalized to larger global symmetry groups. Here we refine and further test the convex charge conjecture via semiclassical computations for fixed charge sectors of different theories in different dimensions. We analyze the convexity properties of the leading and next-to-leading order terms stemming from the semiclassical computation, de facto, extending previous tests beyond the leading perturbative contributions and to arbitrary charges. In particular, the leading contribution is sufficient to test convexity in the semiclassical computations. We also consider intriguing cases in which the models feature a transition from real to complex conformal dimensions either as a function of the charge or number of matter fields. As a relevant example of the first kind, we investigate the O(N) model in 4+epsilon dimensions. As an example of the second type we consider the U(N)times U(M) model in 4-epsilon dimensions. Both models display a rich dynamics where, by changing the number of matter fields and/or charge, one can achieve dramatically different physical regimes. We discover that whenever a complex conformal dimension appears, the real part satisfies the convexity property.
An elementary and unified proof of Grothendieck's inequality
We present an elementary, self-contained proof of Grothendieck's inequality that unifies the real and complex cases and yields both the Krivine and Haagerup bounds, the current best-known explicit bounds for the real and complex Grothendieck constants respectively. This article is intended to be pedagogical, combining and streamlining known ideas of Lindenstrauss--Pe{\l}czy\'nski, Krivine, and Haagerup into a proof that need only univariate calculus, basic complex variables, and a modicum of linear algebra as prerequisites.
Towards Gradient Free and Projection Free Stochastic Optimization
This paper focuses on the problem of constrained stochastic optimization. A zeroth order Frank-Wolfe algorithm is proposed, which in addition to the projection-free nature of the vanilla Frank-Wolfe algorithm makes it gradient free. Under convexity and smoothness assumption, we show that the proposed algorithm converges to the optimal objective function at a rate Oleft(1/T^{1/3}right), where T denotes the iteration count. In particular, the primal sub-optimality gap is shown to have a dimension dependence of Oleft(d^{1/3}right), which is the best known dimension dependence among all zeroth order optimization algorithms with one directional derivative per iteration. For non-convex functions, we obtain the Frank-Wolfe gap to be Oleft(d^{1/3}T^{-1/4}right). Experiments on black-box optimization setups demonstrate the efficacy of the proposed algorithm.
The Numerical Stability of Hyperbolic Representation Learning
Given the exponential growth of the volume of the ball w.r.t. its radius, the hyperbolic space is capable of embedding trees with arbitrarily small distortion and hence has received wide attention for representing hierarchical datasets. However, this exponential growth property comes at a price of numerical instability such that training hyperbolic learning models will sometimes lead to catastrophic NaN problems, encountering unrepresentable values in floating point arithmetic. In this work, we carefully analyze the limitation of two popular models for the hyperbolic space, namely, the Poincar\'e ball and the Lorentz model. We first show that, under the 64 bit arithmetic system, the Poincar\'e ball has a relatively larger capacity than the Lorentz model for correctly representing points. Then, we theoretically validate the superiority of the Lorentz model over the Poincar\'e ball from the perspective of optimization. Given the numerical limitations of both models, we identify one Euclidean parametrization of the hyperbolic space which can alleviate these limitations. We further extend this Euclidean parametrization to hyperbolic hyperplanes and exhibits its ability in improving the performance of hyperbolic SVM.
Scaling Spherical CNNs
Spherical CNNs generalize CNNs to functions on the sphere, by using spherical convolutions as the main linear operation. The most accurate and efficient way to compute spherical convolutions is in the spectral domain (via the convolution theorem), which is still costlier than the usual planar convolutions. For this reason, applications of spherical CNNs have so far been limited to small problems that can be approached with low model capacity. In this work, we show how spherical CNNs can be scaled for much larger problems. To achieve this, we make critical improvements including novel variants of common model components, an implementation of core operations to exploit hardware accelerator characteristics, and application-specific input representations that exploit the properties of our model. Experiments show our larger spherical CNNs reach state-of-the-art on several targets of the QM9 molecular benchmark, which was previously dominated by equivariant graph neural networks, and achieve competitive performance on multiple weather forecasting tasks. Our code is available at https://github.com/google-research/spherical-cnn.
Approximately Optimal Core Shapes for Tensor Decompositions
This work studies the combinatorial optimization problem of finding an optimal core tensor shape, also called multilinear rank, for a size-constrained Tucker decomposition. We give an algorithm with provable approximation guarantees for its reconstruction error via connections to higher-order singular values. Specifically, we introduce a novel Tucker packing problem, which we prove is NP-hard, and give a polynomial-time approximation scheme based on a reduction to the 2-dimensional knapsack problem with a matroid constraint. We also generalize our techniques to tree tensor network decompositions. We implement our algorithm using an integer programming solver, and show that its solution quality is competitive with (and sometimes better than) the greedy algorithm that uses the true Tucker decomposition loss at each step, while also running up to 1000x faster.
Stochastic Hessian Fitting on Lie Group
This paper studies the fitting of Hessian or its inverse with stochastic Hessian-vector products. A Hessian fitting criterion, which can be used to derive most of the commonly used methods, e.g., BFGS, Gaussian-Newton, AdaGrad, etc., is used for the analysis. Our studies reveal different convergence rates for different Hessian fitting methods, e.g., sublinear rates for gradient descent in the Euclidean space and a commonly used closed-form solution, linear rates for gradient descent on the manifold of symmetric positive definite (SPL) matrices and certain Lie groups. The Hessian fitting problem is further shown to be strongly convex under mild conditions on a specific yet general enough Lie group. To confirm our analysis, these methods are tested under different settings like noisy Hessian-vector products, time varying Hessians, and low precision arithmetic. These findings are useful for stochastic second order optimizations that rely on fast, robust and accurate Hessian estimations.
How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation
In the classical transformer attention scheme, we are given three n times d size matrices Q, K, V (the query, key, and value tokens), and the goal is to compute a new n times d size matrix D^{-1} exp(QK^top) V where D = diag( exp(QK^top) {bf 1}_n ). In this work, we study a generalization of attention which captures triple-wise correlations. This generalization is able to solve problems about detecting triple-wise connections that were shown to be impossible for transformers. The potential downside of this generalization is that it appears as though computations are even more difficult, since the straightforward algorithm requires cubic time in n. However, we show that in the bounded-entry setting (which arises in practice, and which is well-studied in both theory and practice), there is actually a near-linear time algorithm. More precisely, we show that bounded entries are both necessary and sufficient for quickly performing generalized computations: bullet On the positive side, if all entries of the input matrices are bounded above by o(sqrt[3]{log n}) then we show how to approximate the ``tensor-type'' attention matrix in n^{1+o(1)} time. bullet On the negative side, we show that if the entries of the input matrices may be as large as Omega(sqrt[3]{log n}), then there is no algorithm that runs faster than n^{3-o(1)} (assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory). We also show that our construction, algorithms, and lower bounds naturally generalize to higher-order tensors and correlations. Interestingly, the higher the order of the tensors, the lower the bound on the entries needs to be for an efficient algorithm. Our results thus yield a natural tradeoff between the boundedness of the entries, and order of the tensor one may use for more expressive, efficient attention computation.
Positive Geometries and Canonical Forms
Recent years have seen a surprising connection between the physics of scattering amplitudes and a class of mathematical objects--the positive Grassmannian, positive loop Grassmannians, tree and loop Amplituhedra--which have been loosely referred to as "positive geometries". The connection between the geometry and physics is provided by a unique differential form canonically determined by the property of having logarithmic singularities (only) on all the boundaries of the space, with residues on each boundary given by the canonical form on that boundary. In this paper we initiate an exploration of "positive geometries" and "canonical forms" as objects of study in their own right in a more general mathematical setting. We give a precise definition of positive geometries and canonical forms, introduce general methods for finding forms for more complicated positive geometries from simpler ones, and present numerous examples of positive geometries in projective spaces, Grassmannians, and toric, cluster and flag varieties. We also illustrate a number of strategies for computing canonical forms which yield interesting representations for the forms associated with wide classes of positive geometries, ranging from the simplest Amplituhedra to new expressions for the volume of arbitrary convex polytopes.
Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures
The enduring legacy of Euclidean geometry underpins classical machine learning, which, for decades, has been primarily developed for data lying in Euclidean space. Yet, modern machine learning increasingly encounters richly structured data that is inherently nonEuclidean. This data can exhibit intricate geometric, topological and algebraic structure: from the geometry of the curvature of space-time, to topologically complex interactions between neurons in the brain, to the algebraic transformations describing symmetries of physical systems. Extracting knowledge from such non-Euclidean data necessitates a broader mathematical perspective. Echoing the 19th-century revolutions that gave rise to non-Euclidean geometry, an emerging line of research is redefining modern machine learning with non-Euclidean structures. Its goal: generalizing classical methods to unconventional data types with geometry, topology, and algebra. In this review, we provide an accessible gateway to this fast-growing field and propose a graphical taxonomy that integrates recent advances into an intuitive unified framework. We subsequently extract insights into current challenges and highlight exciting opportunities for future development in this field.
Two-timescale Extragradient for Finding Local Minimax Points
Minimax problems are notoriously challenging to optimize. However, we demonstrate that the two-timescale extragradient can be a viable solution. By utilizing dynamical systems theory, we show that it converges to points that satisfy the second-order necessary condition of local minimax points, under a mild condition. This work surpasses all previous results as we eliminate a crucial assumption that the Hessian, with respect to the maximization variable, is nondegenerate.
Stochastic Taylor Derivative Estimator: Efficient amortization for arbitrary differential operators
Optimizing neural networks with loss that contain high-dimensional and high-order differential operators is expensive to evaluate with back-propagation due to O(d^{k}) scaling of the derivative tensor size and the O(2^{k-1}L) scaling in the computation graph, where d is the dimension of the domain, L is the number of ops in the forward computation graph, and k is the derivative order. In previous works, the polynomial scaling in d was addressed by amortizing the computation over the optimization process via randomization. Separately, the exponential scaling in k for univariate functions (d=1) was addressed with high-order auto-differentiation (AD). In this work, we show how to efficiently perform arbitrary contraction of the derivative tensor of arbitrary order for multivariate functions, by properly constructing the input tangents to univariate high-order AD, which can be used to efficiently randomize any differential operator. When applied to Physics-Informed Neural Networks (PINNs), our method provides >1000times speed-up and >30times memory reduction over randomization with first-order AD, and we can now solve 1-million-dimensional PDEs in 8 minutes on a single NVIDIA A100 GPU. This work opens the possibility of using high-order differential operators in large-scale problems.
Energy-conserving equivariant GNN for elasticity of lattice architected metamaterials
Lattices are architected metamaterials whose properties strongly depend on their geometrical design. The analogy between lattices and graphs enables the use of graph neural networks (GNNs) as a faster surrogate model compared to traditional methods such as finite element modelling. In this work, we generate a big dataset of structure-property relationships for strut-based lattices. The dataset is made available to the community which can fuel the development of methods anchored in physical principles for the fitting of fourth-order tensors. In addition, we present a higher-order GNN model trained on this dataset. The key features of the model are (i) SE(3) equivariance, and (ii) consistency with the thermodynamic law of conservation of energy. We compare the model to non-equivariant models based on a number of error metrics and demonstrate its benefits in terms of predictive performance and reduced training requirements. Finally, we demonstrate an example application of the model to an architected material design task. The methods which we developed are applicable to fourth-order tensors beyond elasticity such as piezo-optical tensor etc.
Space-time tradeoffs of lenses and optics via higher category theory
Optics and lenses are abstract categorical gadgets that model systems with bidirectional data flow. In this paper we observe that the denotational definition of optics - identifying two optics as equivalent by observing their behaviour from the outside - is not suitable for operational, software oriented approaches where optics are not merely observed, but built with their internal setups in mind. We identify operational differences between denotationally isomorphic categories of cartesian optics and lenses: their different composition rule and corresponding space-time tradeoffs, positioning them at two opposite ends of a spectrum. With these motivations we lift the existing categorical constructions and their relationships to the 2-categorical level, showing that the relevant operational concerns become visible. We define the 2-category 2-Optic(C) whose 2-cells explicitly track optics' internal configuration. We show that the 1-category Optic(C) arises by locally quotienting out the connected components of this 2-category. We show that the embedding of lenses into cartesian optics gets weakened from a functor to an oplax functor whose oplaxator now detects the different composition rule. We determine the difficulties in showing this functor forms a part of an adjunction in any of the standard 2-categories. We establish a conjecture that the well-known isomorphism between cartesian lenses and optics arises out of the lax 2-adjunction between their double-categorical counterparts. In addition to presenting new research, this paper is also meant to be an accessible introduction to the topic.
A Massively Parallel Dynamic Programming for Approximate Rectangle Escape Problem
Sublinear time complexity is required by the massively parallel computation (MPC) model. Breaking dynamic programs into a set of sparse dynamic programs that can be divided, solved, and merged in sublinear time. The rectangle escape problem (REP) is defined as follows: For n axis-aligned rectangles inside an axis-aligned bounding box B, extend each rectangle in only one of the four directions: up, down, left, or right until it reaches B and the density k is minimized, where k is the maximum number of extensions of rectangles to the boundary that pass through a point inside bounding box B. REP is NP-hard for k>1. If the rectangles are points of a grid (or unit squares of a grid), the problem is called the square escape problem (SEP) and it is still NP-hard. We give a 2-approximation algorithm for SEP with kgeq2 with time complexity O(n^{3/2}k^2). This improves the time complexity of existing algorithms which are at least quadratic. Also, the approximation ratio of our algorithm for kgeq 3 is 3/2 which is tight. We also give a 8-approximation algorithm for REP with time complexity O(nlog n+nk) and give a MPC version of this algorithm for k=O(1) which is the first parallel algorithm for this problem.
Understanding Gradient Orthogonalization for Deep Learning via Non-Euclidean Trust-Region Optimization
Optimization with matrix gradient orthogonalization has recently demonstrated impressive results in the training of deep neural networks (Jordan et al., 2024; Liu et al., 2025). In this paper, we provide a theoretical analysis of this approach. In particular, we show that the orthogonalized gradient method can be seen as a first-order trust-region optimization method, where the trust-region is defined in terms of the matrix spectral norm. Motivated by this observation, we develop the stochastic non-Euclidean trust-region gradient method with momentum, which recovers the Muon optimizer (Jordan et al., 2024) as a special case, along with normalized SGD and signSGD with momentum (Cutkosky and Mehta, 2020; Sun et al., 2023). In addition, we prove state-of-the-art convergence results for the proposed algorithm in a range of scenarios, which involve arbitrary non-Euclidean norms, constrained and composite problems, and non-convex, star-convex, first- and second-order smooth functions. Finally, our theoretical findings provide an explanation for several practical observations, including the practical superiority of Muon compared to the Orthogonal-SGDM algorithm of Tuddenham et al. (2022) and the importance of weight decay in the training of large-scale language models.
Fast Differentiable Matrix Square Root
Computing the matrix square root or its inverse in a differentiable manner is important in a variety of computer vision tasks. Previous methods either adopt the Singular Value Decomposition (SVD) to explicitly factorize the matrix or use the Newton-Schulz iteration (NS iteration) to derive the approximate solution. However, both methods are not computationally efficient enough in either the forward pass or in the backward pass. In this paper, we propose two more efficient variants to compute the differentiable matrix square root. For the forward propagation, one method is to use Matrix Taylor Polynomial (MTP), and the other method is to use Matrix Pad\'e Approximants (MPA). The backward gradient is computed by iteratively solving the continuous-time Lyapunov equation using the matrix sign function. Both methods yield considerable speed-up compared with the SVD or the Newton-Schulz iteration. Experimental results on the de-correlated batch normalization and second-order vision transformer demonstrate that our methods can also achieve competitive and even slightly better performances. The code is available at https://github.com/KingJamesSong/FastDifferentiableMatSqrt{https://github.com/KingJamesSong/FastDifferentiableMatSqrt}.
M-FAC: Efficient Matrix-Free Approximations of Second-Order Information
Efficiently approximating local curvature information of the loss function is a key tool for optimization and compression of deep neural networks. Yet, most existing methods to approximate second-order information have high computational or storage costs, which can limit their practicality. In this work, we investigate matrix-free, linear-time approaches for estimating Inverse-Hessian Vector Products (IHVPs) for the case when the Hessian can be approximated as a sum of rank-one matrices, as in the classic approximation of the Hessian by the empirical Fisher matrix. We propose two new algorithms as part of a framework called M-FAC: the first algorithm is tailored towards network compression and can compute the IHVP for dimension d, if the Hessian is given as a sum of m rank-one matrices, using O(dm^2) precomputation, O(dm) cost for computing the IHVP, and query cost O(m) for any single element of the inverse Hessian. The second algorithm targets an optimization setting, where we wish to compute the product between the inverse Hessian, estimated over a sliding window of optimization steps, and a given gradient direction, as required for preconditioned SGD. We give an algorithm with cost O(dm + m^2) for computing the IHVP and O(dm + m^3) for adding or removing any gradient from the sliding window. These two algorithms yield state-of-the-art results for network pruning and optimization with lower computational overhead relative to existing second-order methods. Implementations are available at [9] and [17].
Tuning Pre-trained Model via Moment Probing
Recently, efficient fine-tuning of large-scale pre-trained models has attracted increasing research interests, where linear probing (LP) as a fundamental module is involved in exploiting the final representations for task-dependent classification. However, most of the existing methods focus on how to effectively introduce a few of learnable parameters, and little work pays attention to the commonly used LP module. In this paper, we propose a novel Moment Probing (MP) method to further explore the potential of LP. Distinguished from LP which builds a linear classification head based on the mean of final features (e.g., word tokens for ViT) or classification tokens, our MP performs a linear classifier on feature distribution, which provides the stronger representation ability by exploiting richer statistical information inherent in features. Specifically, we represent feature distribution by its characteristic function, which is efficiently approximated by using first- and second-order moments of features. Furthermore, we propose a multi-head convolutional cross-covariance (MHC^3) to compute second-order moments in an efficient and effective manner. By considering that MP could affect feature learning, we introduce a partially shared module to learn two recalibrating parameters (PSRP) for backbones based on MP, namely MP_{+}. Extensive experiments on ten benchmarks using various models show that our MP significantly outperforms LP and is competitive with counterparts at less training cost, while our MP_{+} achieves state-of-the-art performance.
Compressing Latent Space via Least Volume
This paper introduces Least Volume-a simple yet effective regularization inspired by geometric intuition-that can reduce the necessary number of latent dimensions needed by an autoencoder without requiring any prior knowledge of the intrinsic dimensionality of the dataset. We show that the Lipschitz continuity of the decoder is the key to making it work, provide a proof that PCA is just a linear special case of it, and reveal that it has a similar PCA-like importance ordering effect when applied to nonlinear models. We demonstrate the intuition behind the regularization on some pedagogical toy problems, and its effectiveness on several benchmark problems, including MNIST, CIFAR-10 and CelebA.
ConvMesh: Reimagining Mesh Quality Through Convex Optimization
Mesh generation has become a critical topic in recent years, forming the foundation of all 3D objects used across various applications, such as virtual reality, gaming, and 3D printing. With advancements in computational resources and machine learning, neural networks have emerged as powerful tools for generating high-quality 3D object representations, enabling accurate scene and object reconstructions. Despite these advancements, many methods produce meshes that lack realism or exhibit geometric and textural flaws, necessitating additional processing to improve their quality. This research introduces a convex optimization programming called disciplined convex programming to enhance existing meshes by refining their texture and geometry with a conic solver. By focusing on a sparse set of point clouds from both the original and target meshes, this method demonstrates significant improvements in mesh quality with minimal data requirements. To evaluate the approach, the classical dolphin mesh dataset from Facebook AI was used as a case study, with optimization performed using the CVXPY library. The results reveal promising potential for streamlined and effective mesh refinement.
Doubly Adaptive Scaled Algorithm for Machine Learning Using Second-Order Information
We present a novel adaptive optimization algorithm for large-scale machine learning problems. Equipped with a low-cost estimate of local curvature and Lipschitz smoothness, our method dynamically adapts the search direction and step-size. The search direction contains gradient information preconditioned by a well-scaled diagonal preconditioning matrix that captures the local curvature information. Our methodology does not require the tedious task of learning rate tuning, as the learning rate is updated automatically without adding an extra hyperparameter. We provide convergence guarantees on a comprehensive collection of optimization problems, including convex, strongly convex, and nonconvex problems, in both deterministic and stochastic regimes. We also conduct an extensive empirical evaluation on standard machine learning problems, justifying our algorithm's versatility and demonstrating its strong performance compared to other start-of-the-art first-order and second-order methods.
Unveiling the Unseen: Identifiable Clusters in Trained Depthwise Convolutional Kernels
Recent advances in depthwise-separable convolutional neural networks (DS-CNNs) have led to novel architectures, that surpass the performance of classical CNNs, by a considerable scalability and accuracy margin. This paper reveals another striking property of DS-CNN architectures: discernible and explainable patterns emerge in their trained depthwise convolutional kernels in all layers. Through an extensive analysis of millions of trained filters, with different sizes and from various models, we employed unsupervised clustering with autoencoders, to categorize these filters. Astonishingly, the patterns converged into a few main clusters, each resembling the difference of Gaussian (DoG) functions, and their first and second-order derivatives. Notably, we were able to classify over 95\% and 90\% of the filters from state-of-the-art ConvNextV2 and ConvNeXt models, respectively. This finding is not merely a technological curiosity; it echoes the foundational models neuroscientists have long proposed for the vision systems of mammals. Our results thus deepen our understanding of the emergent properties of trained DS-CNNs and provide a bridge between artificial and biological visual processing systems. More broadly, they pave the way for more interpretable and biologically-inspired neural network designs in the future.
Toward a Better Understanding of Fourier Neural Operators: Analysis and Improvement from a Spectral Perspective
In solving partial differential equations (PDEs), Fourier Neural Operators (FNOs) have exhibited notable effectiveness compared to Convolutional Neural Networks (CNNs). This paper presents clear empirical evidence through spectral analysis to elucidate the superiority of FNO over CNNs: FNO is significantly more capable of learning low-frequencies. This empirical evidence also unveils FNO's distinct low-frequency bias, which limits FNO's effectiveness in learning high-frequency information from PDE data. To tackle this challenge, we introduce SpecBoost, an ensemble learning framework that employs multiple FNOs to better capture high-frequency information. Specifically, a secondary FNO is utilized to learn the overlooked high-frequency information from the prediction residual of the initial FNO. Experiments demonstrate that SpecBoost noticeably enhances FNO's prediction accuracy on diverse PDE applications, achieving an up to 71% improvement.
On Computational Limits and Provably Efficient Criteria of Visual Autoregressive Models: A Fine-Grained Complexity Analysis
Recently, Visual Autoregressive (VAR) Models introduced a groundbreaking advancement in the field of image generation, offering a scalable approach through a coarse-to-fine "next-scale prediction" paradigm. However, the state-of-the-art algorithm of VAR models in [Tian, Jiang, Yuan, Peng and Wang, NeurIPS 2024] takes O(n^4) time, which is computationally inefficient. In this work, we analyze the computational limits and efficiency criteria of VAR Models through a fine-grained complexity lens. Our key contribution is identifying the conditions under which VAR computations can achieve sub-quadratic time complexity. Specifically, we establish a critical threshold for the norm of input matrices used in VAR attention mechanisms. Above this threshold, assuming the Strong Exponential Time Hypothesis (SETH) from fine-grained complexity theory, a sub-quartic time algorithm for VAR models is impossible. To substantiate our theoretical findings, we present efficient constructions leveraging low-rank approximations that align with the derived criteria. This work initiates the study of the computational efficiency of the VAR model from a theoretical perspective. Our technique will shed light on advancing scalable and efficient image generation in VAR frameworks.
