Upload 2 files
Browse files
- emo-v38-paper(ENG).txt (+70 -31)
- emo-v38-paper(JPN).txt (+39 -12)

emo-v38-paper(ENG).txt
CHANGED
@@ -1,6 +1,6 @@
- Paper: Improving Time-Series SNR Estimation and Regret Bound in the Autonomous Optimization Algorithm emoPulse
- — Establishing “Emotion-Driven” Learning Rate Control through Dynamic Inspection of Loss
Abstract
@@ -10,16 +10,23 @@ Abstract
This method autonomously generates an optimal learning rate based on the signal-to-noise ratio by capturing the “undulations” of the loss landscape from a three-stage exponential moving average (Multi-EMA) and utilizing sentiment scalars and a confidence indicator (Trust).
-
This achieves robust convergence independent of hyperparameter settings, providing a democratic foundation for research environments in developing countries with limited computational resources and for multilingual learning aimed at preserving diverse cultural heritage.
Finally, I append my thoughts and predictions regarding Grokking.
1. Introduction
- This paper presents a unified theory for the optimizers EmoSens / EmoAiry / EmoCats / EmoTion.
This method centers on the emoPulse mechanism, which autonomously generates learning rates by layering the exponential moving average (EMA) of loss values and extracting “Trust” from the time-series statistics of the loss function.
@@ -35,29 +42,21 @@ Abstract
This approach achieves the replacement of higher-order moment calculations with scalar control and optimization for low-precision and quantized environments through encoded updates.
- Its most significant feature lies in integrating local solutions from multiple emo-based optimizers with distinct characteristics as “multiple positioning.” This enables reaching the flat minimum—previously requiring lengthy iterative learning—through short-term learning and synthesis.
This approach achieved the following three outcomes:
Dramatic improvement in computational efficiency: Complex calculations of higher-order moments were replaced with scalar control via temporal accumulation of loss, reducing computational load through a temporal-accumulation approximation.
- Optimization for low precision and quantization: Matrix decomposition in EmoAiry, complete elimination of second moments in EmoCats, and
Autonomous Convergence: By introspecting the S/N ratio of the loss landscape, it eliminates the need for manual schedulers and minimizes the user's trial cost.
- ※ Higher-order
Mathematically, this represents an advanced fusion of D-adaptation theory and time-series signal processing, forming the foundation for realizing “democratic AI learning” that preserves research environments and diverse cultures in developing countries.
- ※
- This method effectively approximates higher-order moments from the third (skewness) to the seventh (confidence amplification) order by accumulating the loss over time. This is not a static terrain analysis, but rather an attempt to extract the “system's confidence level” as a physical quantity within the dynamic process of learning.
- The Multi-EMA structure in this method functions as a dynamic temporal approximation of higher-order moments in statistics.
- 3rd to 5th-Order Approximation: The differences between Short, Medium, and Long EMAs extract the temporal evolution of higher-order information such as skewness, kurtosis, and fluctuations in the loss distribution.
- 6th-order approximation: The integrated emotion scalar sigma_t and confidence metric trust_t become sixth-order meta-statistics that indicate “learning phase stability” beyond mere gradient variance.
- 7th-order approximation (dNR): In deriving dNR, squaring the ratio of these 6th-order information components (d_base/noise_base)² exponentially amplifies subtle differences in confidence, yielding an extremely sensitive control signal equivalent to a 7th-order moment.
2. Theoretical Framework: Emotional Circulation
@@ -77,10 +76,25 @@ Abstract
※ Note on the Time-Series Formation of Higher-Order Moments:
- The higher-order moment approximation in this method is not calculated from single-step gradient information but is formed through temporal accumulation.
This means it observes not the static curvature of the terrain but the “dynamic rate of change in the terrain as learning progresses.”
2.2 Definition of the trust level metric trust_t
Define the core metric trust_t that determines the “quality” of updates as follows.
@@ -96,7 +110,7 @@ Abstract
3. emoPulse: Learning Rate Generation via Autonomous Pulsation
- In v3.7, the conventional emoDrive (acceleration mechanism) has been integrated into emoPulse. This represents an evolution based on an approximation of dynamic distance estimation (D-adaptation) using the time-series signal-to-noise ratio (S/N ratio).
3.1 Dynamic Estimation of Noise and Distance
@@ -183,7 +197,7 @@ Abstract
d_base = abs(N_t - d_t) + ε_t
- N_t is guaranteed to be positive definite by max(noise_est,
By adding a safety factor (+0.1) to these temporal statistical differences, it is mathematically guaranteed that **“even when history is unstable in an extremely low-precision environment, the minimum step size (lower limit of the numerator) is always ensured.”**
3. Conclusions on Boundedness and Constraints on emoPulse:
@@ -205,19 +219,45 @@ Abstract
Due to the nature of exponential moving averages, the effect of this initial value persists as “history” for approximately 100 steps. During this period, the system maintains a high acceleration pressure while providing convergence power only to “truly reliable signals” that have passed the strict screening by the emotional mechanism.
- 5.
This chapter describes sign-based normalization for applying the theoretical framework of emoPulse to low-precision environments.
- To eliminate reliance on precise floating-point calculations and support ultra-low precision environments (ultra-quantization), the following update rules are adopted (EmoAiry, EmoCats,
delta_w_t = -emoPulse_t * sign( m_t / ( sqrt(v_t) + ε ) )
This enables EmoAiry to resolve the imbalance in accuracy between one-dimensional vectors and two-dimensional moments, achieving a “unification of will” that extracts only the consensus on direction.
※ EmoCats supports encoding based on Lion with WD separation.
EmoSens Generation v3.7 and later has completed the “emotional cycle” that begins with observing the loss function.
@@ -267,14 +307,13 @@ Supplementary Material (1): Analysis of emoPulse Dynamics in v3.7 and later
Sigma_t [Minus] |---(-)---0.5---(+)---0---(+)---0.5---(-)---| [Plus]
                |--Hist(-)-|-Hist(Non)|--Hist(+)-|--Hist(-)-| Regret
4. Conclusions on Numerical Stability
This design, which pits the difference between the “time axis (history)” and the “instant axis (present)” against each other, is not merely a matter of decay. The system autonomously “constantly recalculates the ratio of ‘Doubt’ (Noise) to ‘Certainty’ (Distance)”, enabling dynamic control akin to “heartbeats responding to terrain complexity”—something impossible with manual schedulers.
※ EmoTion is an original model implemented in v3.8.
※ dNR_hist has different coefficients in v3.7 and v3.8; v3.8 is more aggressive, designed to produce larger fluctuations than v3.7.
@@ -285,7 +324,7 @@ I hope this intuition will be refined into a rigorous mathematical proof by the
Autonomous Flat-Minima Generation via multiple Positioning of Heterogeneous Optimizers
- -Proposal of a New Learning Method: Prediction of “Evolutionary Flat Minimum Formation” via Local Synthesis Using
1. Purpose: To resolve the high cost associated with achieving flat minimization.
@@ -299,7 +338,7 @@ Autonomous Flat-Minima Generation via multiple Positioning of Heterogeneous Opti
2. Proposal: Don't “search” for flat minima; create them yourself.
-
Integrating these divergent learning outcomes constitutes a synthesis of local solutions, and we anticipate that this synthesis may broaden and flatten the local solutions. In other words, it may bring local solutions closer to flat minima or transform them into flat minima themselves.
Acquiring these local solutions as full-layer LoRA and integrating them using synthesis methods such as TALL-Mask-Merge,
@@ -307,8 +346,8 @@ Autonomous Flat-Minima Generation via multiple Positioning of Heterogeneous Opti
∨∨∨ → \___/ Composite image of local solutions
(multiple local solutions) (Post-synthesis flattening)
- ・The “commonly low areas” of local solutions in
- ・The sharp edges on multiple (sharp
・As a result, a shape close to a flat valley bottom (flat minimum) is reconstructed.
This treats the local solution as multiple positioning (multiple-axis positioning),
@@ -340,7 +379,7 @@ Autonomous Flat-Minima Generation via multiple Positioning of Heterogeneous Opti
The multiple models were integrated by combining their respective learning results into the original model, and this new multiple-model system was then merged back into the original model using TM-merge.
- Original Model (org) ≪= TM Integration ≪= Model S (Sens), Model A (Airy), Model C (Cats), Model
Instead of directly integrating with LoRA alone, we integrated the LoRA into the base model and then reduced these multiple models back to the base model using TM-merge.
For FFT, we predict that simply merging the FFT-trained models back into the original model via TM-merge will yield equivalent results.
+ Paper: Improving Time-Series SNR Estimation and Regret Bound in the Autonomous Optimization Algorithm emoPulse and Exploring Second-Moment-Free Updates via “Geometric Orthogonality of Weights and Gradients”
+ — Establishing “Emotion-Driven” Learning Rate Control through Dynamic Inspection of Loss Landscapes and Proposing Next-Generation Optimization through Interaction with Loss Landscapes —
Abstract
This method autonomously generates an optimal learning rate based on the signal-to-noise ratio by capturing the “undulations” of the loss landscape from a three-stage exponential moving average (Multi-EMA) and utilizing sentiment scalars and a confidence indicator (Trust).
+ Next, we propose the W-Ref Geometry update rule, which focuses on the geometric relationship between weights and gradients.
+ This achieves a “second-moment-free” update that retains no second moment and responds immediately to terrain changes by dynamically controlling inertia based on the orthogonality between weights and gradients.
+ This simultaneously reduces VRAM usage, providing a democratic foundation for multilingual learning in research environments with limited computational resources and for multicultural coexistence.
+ Furthermore, by synthesizing the learning results of optimizers (Sens / Airy / Cats / Tion) belonging to this family and possessing distinct update characteristics, we present a method that integrates local solutions in a “multiple positioning” manner to artificially create flat minima.
This achieves robust convergence independent of hyperparameter settings, providing a democratic foundation for research environments in developing countries with limited computational resources and for multilingual learning aimed at preserving diverse cultural heritage.
Finally, I append my thoughts and predictions regarding Grokking.
+ ※ Version 3.7 excludes EmoTion (EmoTion is newly developed in version 3.8). The only difference between versions 3.7 and 3.8 lies in the dNR_hist of the emoPulse mechanism described later; all other aspects are identical.
1. Introduction
+ This paper presents a unified theory for the optimizers EmoSens / EmoAiry / EmoCats / EmoTion (v3.7 and later).
This method centers on the emoPulse mechanism, which autonomously generates learning rates by layering the exponential moving average (EMA) of loss values and extracting “Trust” from the time-series statistics of the loss function.
This approach achieves the replacement of higher-order moment calculations with scalar control and optimization for low-precision and quantized environments through encoded updates.
+ Its most significant feature lies in integrating local solutions from multiple emo-based optimizers with distinct characteristics as “multiple positioning.” This enables reaching the flat minimum—previously requiring lengthy iterative learning—through short-term learning and synthesis.
This approach achieved the following three outcomes:
Dramatic improvement in computational efficiency: Complex calculations of higher-order moments were replaced with scalar control via temporal accumulation of loss, reducing computational load through a temporal-accumulation approximation.
+ Optimization for low precision and quantization: Matrix decomposition in EmoAiry, complete elimination of second moments in EmoCats, and the original (proprietary) EmoTion's “geometric orthogonal update” with complete second-moment elimination enabled large-scale learning in low-resource environments through update encoding.
Autonomous Convergence: By introspecting the S/N ratio of the loss landscape, it eliminates the need for manual schedulers and minimizes the user's trial cost.
+ ※ Higher-order moment approximation: aggregation into higher-order statistics along the time axis (time-series higher-order statistics).
Mathematically, this represents an advanced fusion of D-adaptation theory and time-series signal processing, forming the foundation for realizing “democratic AI learning” that preserves research environments and diverse cultures in developing countries.
+ ※ EmoTion achieves a lightweight structure that does not require second-order moments by not only replacing higher-order moment calculations with scalar control, but also by using the geometric information inherent in the weights themselves as a guideline for updates (detailed in Chapter 6).
2. Theoretical Framework: Emotional Circulation
※ Note on the Time-Series Formation of Higher-Order Moments:
+ The higher-order moment approximation in this method is not calculated from single-step gradient information but is formed through temporal accumulation.
This means it observes not the static curvature of the terrain but the “dynamic rate of change in the terrain as learning progresses.”
+ ※ Hierarchical Structure of the Higher-Order Moment Approximation:
+ This method effectively approximates higher-order moments from the third (skewness) to the seventh (confidence amplification) order by accumulating loss over time.
+ This is not a static terrain analysis, but rather an attempt to extract the “system's confidence” as a physical quantity within the dynamic process of learning.
+ The Multi-EMA structure in this method functions as a dynamic temporal approximation of higher-order moments in statistics.
+ Third- to fifth-order approximation: The differences between the Short, Medium, and Long EMAs extract the temporal evolution of higher-order information such as skewness, kurtosis, and fluctuations in the loss distribution.
+ Sixth-order approximation: The integrated emotion scalar sigma_t and confidence metric trust_t become sixth-order meta-statistics that indicate “learning phase stability” beyond mere gradient variance.
+ Seventh-order approximation (dNR): In deriving dNR, squaring the ratio of these sixth-order information components, (d_base/noise_base)^2, exponentially amplifies subtle differences in confidence, yielding an extremely sensitive control signal equivalent to a seventh-order moment.
2.2 Definition of the trust level metric trust_t
Define the core metric trust_t that determines the “quality” of updates as follows.
3. emoPulse: Learning Rate Generation via Autonomous Pulsation
+ In v3.7 and later, the conventional emoDrive (acceleration mechanism) has been integrated into emoPulse. This represents an evolution based on an approximation of dynamic distance estimation (D-adaptation) using the time-series signal-to-noise ratio (S/N ratio).
3.1 Dynamic Estimation of Noise and Distance
d_base = abs(N_t - d_t) + ε_t
+ N_t is guaranteed to be positive definite by max(noise_est, ν_r), and d_t is updated by the cumulative sum of abs(trust_t), regardless of improvement or deterioration.
By adding a safety factor (+0.1) to these temporal statistical differences, it is mathematically guaranteed that **“even when history is unstable in an extremely low-precision environment, the minimum step size (lower limit of the numerator) is always ensured.”**
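The lower bound described above can be checked with a minimal numeric sketch; noise_est, the floor nu_r, and the trust history below are illustrative assumptions, not values from the reference implementation.

```python
# Minimal sketch of the d_base floor described above. noise_est, nu_r,
# and the trust history are illustrative assumptions.
def d_base(noise_est, trust_history, nu_r=0.1, eps=0.1):
    # N_t: noise estimate floored by nu_r, so it stays strictly positive.
    n_t = max(noise_est, nu_r)
    # d_t: cumulative sum of |trust_t|, growing whether the loss
    # improves or deteriorates.
    d_t = sum(abs(t) for t in trust_history)
    # eps plays the role of the +0.1-style safety factor: the numerator
    # never collapses to zero even when the history is unstable.
    return abs(n_t - d_t) + eps
```

Even with an empty history and zero noise estimate, the result is bounded below by eps, which is the guarantee the text asserts.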
3. Conclusions on Boundedness and Constraints on emoPulse:
Due to the nature of exponential moving averages, the effect of this initial value persists as “history” for approximately 100 steps. During this period, the system maintains a high acceleration pressure while providing convergence power only to “truly reliable signals” that have passed the strict screening by the emotional mechanism.
+ 5. Polarized Normalization: Adaptation to Low-Precision Environments
This chapter describes sign-based normalization for applying the theoretical framework of emoPulse to low-precision environments.
+ To eliminate reliance on precise floating-point calculations and support ultra-low precision environments (ultra-quantization), the following update rule is adopted (EmoAiry, EmoCats, EmoTion):
delta_w_t = -emoPulse_t * sign( m_t / ( sqrt(v_t) + ε ) )
This enables EmoAiry to resolve the imbalance in accuracy between one-dimensional vectors and two-dimensional moments, achieving a “unification of will” that extracts only the consensus on direction.
※ EmoCats supports encoding based on Lion with WD separation.
+ ※ EmoTion encodes a proprietary update method called “Geometric Orthogonal Update.”
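The sign-based rule above can be sketched directly; emoPulse_t, m_t, and v_t are taken as given inputs here, which is an assumption for illustration.

```python
import numpy as np

# Minimal sketch of delta_w_t = -emoPulse_t * sign(m_t / (sqrt(v_t) + eps)).
# sign() collapses per-parameter magnitude so that only the direction
# consensus survives, which is what makes the update robust to
# low-precision storage of m_t and v_t.
def sign_update(w, emo_pulse_t, m_t, v_t, eps=1e-8):
    step = np.sign(m_t / (np.sqrt(v_t) + eps))
    return w - emo_pulse_t * step
```

Every parameter moves by exactly ±emoPulse_t (or not at all), so the learning rate generated by emoPulse fully determines the step magnitude.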
+ 6. Respect for Existing Methods and EmoTion's Position
+ The EmoTion update algorithm stems from deep respect for AdamW, a pinnacle of modern deep learning. The concept of the “adaptive learning rate” demonstrated by AdamW established the conditions for effective optimization and significantly lowered the barriers to its adoption.
+ EmoTion inherits this spirit while taking a different approach: using geometry (W-Ref Geometry) and emotion (emoPulse) instead of statistics.
+ A New Form of Precision:
+ While AdamW meticulously carves a path from past statistics, EmoTion navigates terrain more flexibly through dialogue with the current weights (geometric interaction) and the pulse of the loss. This approach aims for natural convergence that suppresses overfitting while maintaining accuracy on par with AdamW. (Orthogonality as Freshness)
+ Resource-Friendly Design (Reduced VRAM):
+ Computational resources are finite, and not everyone has access to high-performance, abundant resources. By entrusting the precise mechanism of second-order moments—which AdamW has carefully preserved—to “scalar control,” EmoTion was able to reduce the VRAM load by approximately half. We believe this forms the foundation for a “democratic learning environment” where more people can conduct AI training.
+ Geometric Inertia Control Using W-Ref Geometry:
+ The core of this method lies in its geometric update rule based on the orthogonality between the weight vector W and the gradient vector G.
+ Whereas conventional statistical methods rely on the accumulated gradient history (a shadow), W-Ref Geometry uses the current weight W as the “substance” and derives the freshness of the gradient G from the following cosine similarity ρ (rho):
+ rho = |W * G| / ( ||W|| * ||G|| + eps )
+ The smaller ρ (rho) is (the closer it is to orthogonal), the more the current gradient is judged to contain “unknown information” not present in the existing weight structure. This allows the current gradient to be strongly incorporated, overcoming inertia. This geometric “information selection” simultaneously achieves high-precision directional changes without statistical delay and a regularization effect by suppressing redundant updates. (Dynamic Inertia Calibration)
+ Reason the method holds based solely on the first moment:
+ The absence of second-order moments (variance estimation) is not merely for weight reduction. W-Ref Geometry updates based on the “freshness of direction” rather than the “magnitude” of gradients, rendering much of the role traditionally fulfilled by second-order moments unnecessary. (Departure from Second-Order Moments)
+ Direction selection via W-Ref Geometry determines that gradients G containing unknown information are those most orthogonal to weight W, thereby reducing inertia and steering toward new directions. Conversely, gradients parallel to W are deemed redundant, prioritizing inertia. This selection based on “direction purity” is more direct than variance estimation, robust against noise, and suppresses overfitting.
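A minimal sketch of the rho-based selection described above. The specific mapping from rho to an inertia coefficient is an assumption for illustration; the text specifies only rho itself.

```python
import numpy as np

# rho = |W . G| / (||W|| * ||G|| + eps): 0 = orthogonal (fresh gradient),
# 1 = parallel (redundant gradient). Matches the formula in the text.
def wref_rho(w, g, eps=1e-8):
    return abs(np.dot(w, g)) / (np.linalg.norm(w) * np.linalg.norm(g) + eps)

# Assumed mapping for illustration: near-orthogonal gradients weaken
# inertia (small beta, momentum follows the new direction), parallel
# gradients keep inertia (beta near beta_max).
def inertia_step(m_prev, w, g, beta_max=0.9):
    beta = beta_max * wref_rho(w, g)
    return beta * m_prev + (1.0 - beta) * g
```

With this mapping, a gradient orthogonal to W replaces the momentum almost entirely, while a gradient parallel to W is mostly absorbed into the existing inertia, mirroring the "direction purity" selection in the text.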
|
| 258 |
+
|
| 259 |
+
|
| 260 |
+
7. Conclusion
|
| 261 |
|
| 262 |
EmoSens Generation v3.7 and later has completed the “emotional cycle” that begins with observing the loss function.
|
| 263 |
|
|
|
Sigma_t [Minus] |---(-)---0.5---(+)---0---(+)---0.5---(-)---| [Plus]
                |--Hist(-)-|-Hist(Non)|--Hist(+)-|--Hist(-)-| Regret
+ μ_g and μ_d:
+ v3.7: [Acceleration: LR growth max 1.05x] / [Deceleration: LR decay 0.98x]
+ v3.8: [Acceleration: LR growth max 1.50x] / [Deceleration: LR decay 0.80x]
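Read as multiplicative clamps on the step-to-step learning-rate change, the coefficients above can be sketched as follows. The decision of when to accelerate (driven by sigma_t / trust_t) is assumed here; only the multiplicative bounds come from the text.

```python
# Per-version acceleration/deceleration clamps listed above.
# The gating signal (accelerating or not) is an assumed input.
CLAMPS = {
    "v3.7": {"grow": 1.05, "decay": 0.98},
    "v3.8": {"grow": 1.50, "decay": 0.80},
}

def step_lr(lr, accelerating, version="v3.8"):
    c = CLAMPS[version]
    return lr * (c["grow"] if accelerating else c["decay"])
```

The v3.8 pair (1.50x / 0.80x) produces far larger swings per step than the v3.7 pair, which is consistent with the note below that v3.8's dNR_hist is the more aggressive of the two.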
4. Conclusions on Numerical Stability
This design, which pits the difference between the “time axis (history)” and the “instant axis (present)” against each other, is not merely a matter of decay. The system autonomously “constantly recalculates the ratio of ‘Doubt’ (Noise) to ‘Certainty’ (Distance)”, enabling dynamic control akin to “heartbeats responding to terrain complexity”—something impossible with manual schedulers.
※ EmoTion is an original model implemented in v3.8.
※ dNR_hist has different coefficients in v3.7 and v3.8; v3.8 is more aggressive, designed to produce larger fluctuations than v3.7.
Autonomous Flat-Minima Generation via multiple Positioning of Heterogeneous Optimizers
+ - Proposal of a New Learning Method: Prediction of “Evolutionary Flat Minimum Formation” via Local Synthesis Using Emo Systems -
1. Purpose: To resolve the high cost associated with achieving flat minimization.
2. Proposal: Don't “search” for flat minima; create them yourself.
+ Emo-style models (EmoSens, EmoAiry, EmoCats, EmoTion) share a common learning structure despite differing update mechanisms. When trained under identical conditions, they yield learning results whose differences represent “local solutions from different directions.”
Integrating these divergent learning outcomes constitutes a synthesis of local solutions, and we anticipate that this synthesis may broaden and flatten the local solutions. In other words, it may bring local solutions closer to flat minima or transform them into flat minima themselves.
Acquiring these local solutions as full-layer LoRA and integrating them using synthesis methods such as TALL-Mask-Merge,
∨∨∨ → \___/ Composite image of local solutions
(multiple local solutions) (Post-synthesis flattening)
+ ・The “commonly low areas” of local solutions in multiple directions are emphasized.
+ ・The sharp edges of the multiple local solutions (sharp minima) cancel each other out.
・As a result, a shape close to a flat valley bottom (flat minimum) is reconstructed.
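A minimal sketch of the synthesis intuition above, with plain averaging of per-optimizer weight deltas standing in for TALL-Mask-Merge / TM-merge (which are external merge methods, not implemented here).

```python
import numpy as np

# Averaging the weight deltas of several local solutions: the shared
# ("commonly low") component survives, while opposing sharp components
# cancel, which is the flattening effect sketched in the diagram above.
def merge_local_solutions(base, deltas):
    return base + np.mean(deltas, axis=0)

base = np.zeros(3)
# Three hypothetical "local solutions": a shared direction plus
# optimizer-specific spikes of opposite sign.
deltas = [
    np.array([1.0, 0.9, 0.0]),
    np.array([1.0, -0.9, 0.0]),
    np.array([1.0, 0.0, 0.0]),
]
merged = merge_local_solutions(base, deltas)
```

In this toy case the shared first component is preserved at full strength while the ±0.9 spikes cancel exactly, illustrating why the composite can be flatter than any single local solution.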
This treats the local solution as multiple positioning (multiple-axis positioning),
The multiple models were integrated by combining their respective learning results into the original model, and this new multiple-model system was then merged back into the original model using TM-merge.
+ Original Model (org) ≪= TM Integration ≪= Model S (Sens), Model A (Airy), Model C (Cats), Model T (Tion)
Instead of directly integrating with LoRA alone, we integrated the LoRA into the base model and then reduced these multiple models back to the base model using TM-merge.
For FFT, we predict that simply merging the FFT-trained models back into the original model via TM-merge will yield equivalent results.
emo-v38-paper(JPN).txt
CHANGED
@@ -1,11 +1,12 @@
- Paper: Improving Time-Series SNR Estimation and Regret Bound in the Autonomous Optimization Algorithm emoPulse
- — Establishing “Emotion-Driven” Learning Rate Control through Dynamic Inspection of Loss Landscapes —
Abstract
In deep-learning optimization, tuning the learning rate and securing generalization performance are central challenges. Existing methods depend on precise gradient estimation and are vulnerable to noise in extremely low-precision environments. This paper proposes the autonomous algorithm emoPulse (v3.7 and later), which centers on multi-angle time-series analysis of the loss function. The method captures the “undulations” of the loss landscape from a three-stage exponential moving average (Multi-EMA) and, via an emotion scalar and a confidence indicator (Trust), autonomously generates an optimal learning rate based on the S/N ratio.
-
Finally, I append my thoughts and predictions regarding Grokking.
※ Version 3.7 excludes EmoTion (EmoTion is newly developed in v3.8). The only difference between v3.7 and v3.8 lies in the dNR_hist of the emoPulse mechanism described later; all other aspects are identical.
@@ -22,14 +23,16 @@
Dramatic improvement in computational efficiency: Complex calculations of higher-order moments were replaced with scalar control via temporal accumulation of loss, reducing computational load through a temporal-accumulation approximation.
- Optimization for low precision and quantization: Matrix decomposition in EmoAiry, and in EmoCats the 2
Autonomous Convergence: By introspecting the S/N ratio of the loss landscape, it eliminates the need for manual schedulers and minimizes the user's trial cost.
- ※
Mathematically, this represents an advanced fusion of D-adaptation theory and time-series signal processing, forming the foundation for realizing “democratic AI learning” that preserves research environments and diverse cultures in developing countries.
2. Theoretical Framework: Emotional Circulation
@@ -70,7 +73,7 @@
3. emoPulse: Learning Rate Generation via Autonomous Pulsation
- In v3.7
3.1 Dynamic Estimation of Noise and Distance
@@ -166,7 +169,7 @@
Stabilization via initial-value settings:
※ In environments where the dataset is very small or the initial noise is large, it is recommended to reset the initial values of d_t and N_t until the Multi-EMA stabilizes its “history” (e.g., d-est: 0.2, Noise-est: 0.2). This suppresses divergence caused by initial stochastic noise. In particular, initializing N_0 equal to d_0 starts the system in an essentially “cautious mode.” This functions as an organic warm-up phase that avoids overly aggressive updates during the critical early steps and prioritizes observation of the terrain.
Maintaining “update pressure” while preserving safety via initial-value settings:
- ※ In this method, d_base, which forms the numerator of emoPulse
5. Polarized Normalization: Adaptation to Low-Precision Environments
@@ -178,12 +181,36 @@
delta_w_t = -emoPulse_t * sign( m_t / ( sqrt(v_t) + ε ) )
This enables EmoAiry to resolve the imbalance in accuracy between one-dimensional vectors and two-dimensional moments, achieving a “unification of will” that extracts only the consensus on direction.
- ※ EmoCats
-
- EmoSens Generation v3.7
Observation (Multi-EMA): Capture the undulations of the terrain.
Judgment (Trust): Switch between conviction and hesitation at the ±0.5 boundary.

@@ -203,7 +230,7 @@
- Supplementary Material (1): v3.7
1. Purpose

@@ -301,7 +328,7 @@
The models were integrated by combining their respective learning results into the original model, and this new multiple-model system was then merged back into the original model using TM-merge.
- Original Model (org) ≪= TM Integration ≪= Model S (Sens), Model A (Airy), Model C (Cats), Model
Instead of directly integrating with LoRA alone, we integrated the LoRA into the base model and then reduced these new models back to the base model using TM-merge.
For FFT, we predict that simply merging the FFT-trained model back into the original model via TM-merge will yield an equivalent effect.
+ Paper: Improving Time-Series SNR Estimation and Regret Bound in the Autonomous Optimization Algorithm emoPulse, and Exploring Second-Moment-Free Updates via the “Geometric Orthogonality of Weights and Gradients”
+ — Establishing “Emotion-Driven” Learning Rate Control through Dynamic Inspection of Loss Landscapes, and Proposing Next-Generation Optimization through Dialogue with the Loss Landscape —
Abstract
In deep-learning optimization, tuning the learning rate and securing generalization performance are central challenges. Existing methods depend on precise gradient estimation and are vulnerable to noise in extremely low-precision environments. This paper proposes the autonomous algorithm emoPulse (v3.7 and later), which centers on multi-angle time-series analysis of the loss function. The method captures the “undulations” of the loss landscape from a three-stage exponential moving average (Multi-EMA) and, via an emotion scalar and a confidence indicator (Trust), autonomously generates an optimal learning rate based on the S/N ratio.
+ Next, we propose W-Ref Geometry, an update rule that focuses on the geometric relationship between weights and gradients. By dynamically controlling inertia based on the orthogonality between weights and gradients, it achieves a “second-moment-free” update that retains no second moment and responds immediately to changes in the terrain. This simultaneously reduces VRAM usage, providing a democratic foundation for research environments with limited computational resources and for multilingual learning toward multicultural coexistence.
+ Furthermore, by synthesizing the learning results of the four optimizers in this family with distinct update characteristics (Sens / Airy / Cats / Tion), we present a method that integrates local solutions in a “multiple positioning” manner to artificially create flat minima. This achieves robust convergence independent of hyperparameter settings, providing a democratic foundation for research environments in developing countries with limited computational resources and for multilingual learning aimed at preserving diverse cultural heritage.
Finally, I append my thoughts and predictions regarding Grokking.
※ Version 3.7 excludes EmoTion (EmoTion is newly developed in v3.8). The only difference between v3.7 and v3.8 lies in the dNR_hist of the emoPulse mechanism described later; all other aspects are identical.
Dramatic improvement in computational efficiency: Complex calculations of higher-order moments were replaced with scalar control via temporal accumulation of loss, reducing computational load through a temporal-accumulation approximation.
+ Optimization for low precision and quantization: Matrix decomposition in EmoAiry, complete elimination of second moments in EmoCats, and the original (proprietary) EmoTion's “geometric orthogonal update” with complete second-moment elimination enabled large-scale learning in low-resource environments through update encoding.
Autonomous Convergence: By introspecting the S/N ratio of the loss landscape, it eliminates the need for manual schedulers and minimizes the user's trial cost.
+ ※ Higher-order moment approximation: aggregation into time-series higher-order statistics.
Mathematically, this represents an advanced fusion of D-adaptation theory and time-series signal processing, forming the foundation for realizing “democratic AI learning” that preserves research environments and diverse cultures in developing countries.
+ ※ EmoTion achieves a lightweight structure that does not require second-order moments by not only replacing higher-order moment calculations with scalar control, but also by using the geometric information inherent in the weights themselves as a guideline for updates (detailed in Chapter 6).
2. Theoretical Framework: Emotional Circulation
3. emoPulse: Learning Rate Generation via Autonomous Pulsation
+ In v3.7 and later, the conventional emoDrive (acceleration mechanism) has been integrated into emoPulse. This is an evolution based on an approximation of dynamic distance estimation (D-adaptation) using the time-series signal-to-noise (S/N) ratio.
3.1 Dynamic Estimation of Noise and Distance
Stabilization via initial-value settings:
※ In environments where the dataset is very small or the initial noise is large, it is recommended to reset the initial values of d_t and N_t until the Multi-EMA stabilizes its “history” (e.g., d-est: 0.2, Noise-est: 0.2). This suppresses divergence caused by initial stochastic noise. In particular, initializing N_0 equal to d_0 starts the system in an essentially “cautious mode.” This functions as an organic warm-up phase that avoids overly aggressive updates during the critical early steps and prioritizes observation of the terrain.
Maintaining “update pressure” while preserving safety via initial-value settings:
+ ※ In this method, d_base, which forms the numerator of emoPulse, determines the system's “latent update power.” Setting the initial values N0 = 1.0 and d0 = 0.02 means deliberately securing high acceleration potential from the very start of training. Due to the characteristics of the exponential moving average, the influence of these initial values remains as “history” for roughly 100 steps. During this period, the system holds high acceleration pressure in the background while providing convergence power only to the “truly reliable signals” that have cleared the strict screening of the emotional mechanism.
5. Polarized Normalization: Adaptation to Low-Precision Environments
delta_w_t = -emoPulse_t * sign( m_t / ( sqrt(v_t) + ε ) )
This enables EmoAiry to resolve the imbalance in accuracy between one-dimensional vectors and two-dimensional moments, achieving a “unification of will” that extracts only the consensus on direction.
+ ※ EmoCats supports this with a Lion-based encoding with WD separation.
+ ※ EmoTion encodes its proprietary update rule, the “Geometric Orthogonal Update.”
+ 6. Respect for Existing Methods and EmoTion's Position
+ The EmoTion update algorithm stems from deep respect for AdamW, a pinnacle of modern deep learning. The concept of the “adaptive learning rate” demonstrated by AdamW established the conditions for effective optimization and significantly lowered the barriers to its adoption.
+ EmoTion inherits this spirit while taking a different approach: “geometry (W-Ref Geometry) and emotion (emoPulse) instead of statistics.”
+ A New Form of Precision:
+ While AdamW meticulously carves a path from past statistics, EmoTion walks the terrain more flexibly through “dialogue with the current weights” and the “pulse of the loss.” In this way, it aims for natural convergence that suppresses overfitting while maintaining accuracy on par with AdamW.
+ Resource-Friendly Design (Reduced VRAM):
+ Computational resources are finite, and not everyone has access to high-performance, abundant resources. By entrusting the precise mechanism of second-order moments, which AdamW has carefully preserved, to “scalar control,” EmoTion was able to cut the VRAM load roughly in half. We believe this forms the foundation of a “democratic learning environment” in which more people can conduct AI training.
+ Geometric Inertia Control Using W-Ref Geometry:
+ The core of this method lies in its geometric update rule based on the orthogonality between the weight vector W and the gradient vector G. Whereas conventional statistical methods rely on the accumulation of past gradients (a shadow), W-Ref Geometry takes the current weight W as the “substance” and derives the freshness of the gradient G from the following cosine similarity ρ (rho):
+ rho = |W * G| / ( ||W|| * ||G|| + eps )
+ The smaller ρ (rho) is (the closer to orthogonal), the more the current gradient is judged to carry “unknown information” not contained in the existing weight structure, so inertia is set aside and the current gradient is strongly incorporated. This geometric “information selection” simultaneously achieves high-precision directional changes without statistical delay and a regularization effect by suppressing redundant updates.
+ Reason the method holds based solely on the first moment:
+ EmoTion's lack of a second moment (variance estimation) is not mere weight reduction. Because W-Ref Geometry updates based on the “freshness of direction” rather than the “magnitude” of gradients, much of the role carried by second-order moments becomes unnecessary. Direction selection via W-Ref Geometry judges that the more nearly orthogonal a gradient G is to the weight W, the more unknown information it contains, weakening inertia and steering toward the new direction. Conversely, gradients parallel to W are deemed redundant, and inertia is prioritized. This selection based on “direction purity” is more direct than variance estimation, robust against noise, and suppresses overfitting.
+ 7. Conclusion
+ EmoSens Generation v3.7 and later has completed the “emotional cycle” that begins with observing the loss function.
Observation (Multi-EMA): Capture the undulations of the terrain.
Judgment (Trust): Switch between conviction and hesitation at the ±0.5 boundary.
+ Supplementary Material (1): Analysis of emoPulse Dynamics in v3.7 and later
1. Purpose
The models were integrated by combining their respective learning results into the original model, and this new multiple-model system was then merged back into the original model using TM-merge.
+ Original Model (org) ≪= TM Integration ≪= Model S (Sens), Model A (Airy), Model C (Cats), Model T (Tion)
Instead of directly integrating with LoRA alone, we integrated the LoRA into the base model and then reduced these new models back to the base model using TM-merge.
For FFT, we predict that simply merging the FFT-trained model back into the original model via TM-merge will yield an equivalent effect.