muooon committed
Commit f95e85a · verified · 1 Parent(s): aafb7f8

Upload 2 files

Files changed (2):
  1. emo-v38-paper(ENG).txt +70 -31
  2. emo-v38-paper(JPN).txt +39 -12
emo-v38-paper(ENG).txt CHANGED
@@ -1,6 +1,6 @@
1
- Paper: Improving Time-Series SNR Estimation and Regret Bound in the Autonomous Optimization Algorithm emoPulse
2
 
3
- — Establishing “Emotion-Driven” Learning Rate Control through Dynamic Inspection of Loss Topography
4
 
5
  Abstract
6
 
@@ -10,16 +10,23 @@ Abstract
10
 
11
  This method autonomously generates an optimal learning rate based on the signal-to-noise ratio by capturing the “undulations” of the loss landscape from a three-stage exponential moving average (Multi-EMA) and utilizing sentiment scalars and a confidence indicator (Trust).
12
 
13
- Furthermore, by synthesizing the learning results of three optimizers (Sens/Airy/Cats/Tion) belonging to this family and possessing distinct update characteristics, we present a method that integrates local solutions in a “multiple positioning” manner to artificially create flat minima.
14
 
15
  This achieves robust convergence independent of hyperparameter settings, providing a democratic foundation for research environments in developing countries with limited computational resources and for multilingual learning aimed at preserving diverse cultural heritage.
16
 
17
  Finally, I append my thoughts and predictions regarding Grokking.
 
18
 
19
 
20
  1. Introduction
21
 
22
- This paper presents a unified theory for the optimizers EmoSens / EmoAiry / EmoCats / EmoTion.
23
 
24
  This method centers on the emoPulse mechanism, which autonomously generates learning rates by layering the exponential moving average (EMA) of loss values and extracting “Trust” from the time-series statistics of the loss function.
25
 
@@ -35,29 +42,21 @@ Abstract
35
 
36
  This approach achieves the replacement of higher-order moment calculations with scalar control and optimization for low-precision and quantized environments through encoded updates.
37
 
38
- Its most significant feature lies in integrating local solutions from multiple emo-based optimizers with distinct characteristics as “multiple positioning.” This enables reaching the flat minimum—previously requiring lengthy iterative learning—through short-term learning and synthesis.
39
 
40
  This approach achieved the following three outcomes:
41
 
42
  Dramatic improvement in computational efficiency: Complex calculations of higher-order moments were replaced with scalar control via temporal accumulation of loss, reducing computational load through temporal accumulation approximation.
43
 
44
- Optimization for low precision and quantization: Matrix decomposition in EmoAiry, complete elimination of second moments in EmoCats, and encoding by both methods enabled large-scale learning in low-resource environments.
45
 
46
  Autonomous Convergence: By introspecting the S/N ratio of the loss landscape, it eliminates the need for manual schedulers and minimizes the user's trial cost.
47
 
48
- ※ Higher-order moments: Aggregation into higher-order statistics on the time axis
49
 
50
  Mathematically, this represents an advanced fusion of D-adaptation theory and time-series signal processing, forming the foundation for realizing “democratic AI learning” that preserves research environments and diverse cultures in developing countries.
51
 
52
- Hierarchical Structure of Higher-Order Moment Approximation:
53
-
54
- This method effectively approximates higher-order moments from the third (skewness) to the seventh (confidence amplification) order by accumulating the loss over time. This is not a static terrain analysis, but rather an attempt to extract the “system's confidence level” as a physical quantity within the dynamic process of learning.
55
-
56
- The Multi-EMA structure in this method functions as a dynamic temporal approximation of higher-order moments in statistics.
57
-
58
- 3rd to 5th-Order Approximation: The differences between Short, Medium, and Long EMAs extract the temporal evolution of higher-order information such as skewness, kurtosis, and fluctuations in the loss distribution.
59
- 6th-order approximation: The integrated emotion scalar sigma_t and confidence metric trust_t become sixth-order meta-statistics that indicate ‘learning phase stability’ beyond mere gradient variance.
60
- 7th-order approximation (dNR): In deriving dNR, squaring the ratio of these 6th-order information components (d_base/noise_base)² exponentially amplifies subtle differences in confidence, yielding an extremely sensitive control signal equivalent to a 7th-order moment.
61
 
62
 
63
  2. Theoretical Framework: Emotional Circulation
@@ -77,10 +76,25 @@ Abstract
77
 
78
  ※ Note on the Time-Series Formation of Higher-Order Moments:
79
 
80
- The higher-order moment approximation in this method is not calculated from single-step gradient information but is formed through temporal accumulation.
81
 
82
  This means it observes not the static curvature of the terrain but the “dynamic rate of change in the terrain as learning progresses.”
83
 
84
  2.2 Definition of the trust level metric trust_t
85
 
86
  Define the core metric trust_t that determines the “quality” of updates as follows.
@@ -96,7 +110,7 @@ Abstract
96
 
97
  3. emoPulse: Learning Rate Generation via Autonomous Pulsation
98
 
99
- In v3.7, the conventional emoDrive (acceleration mechanism) has been integrated into emoPulse. This represents an evolution based on an approximation of dynamic distance estimation (D-adaptation) using the time-series signal-to-noise ratio (S/N ratio).
100
 
101
  3.1 Dynamic Estimation of Noise and Distance
102
 
@@ -183,7 +197,7 @@ Abstract
183
 
184
  d_base = abs(N_t - d_t) + ε_t
185
 
186
- N_t is guaranteed to be positive definite by max(noise_est, ν_r), and d_t is updated by the cumulative sum of abs(trust_t), regardless of improvement or deterioration.
187
  By adding a safety factor (+0.1) to these temporal statistical differences, it is mathematically guaranteed that **“even when history is unstable in an extremely low-precision environment, the minimum step size (lower limit of the numerator) is always ensured.”**
188
 
189
  3. Conclusions on Boundedness and Constraints on emoPulse:
@@ -205,19 +219,45 @@ Abstract
205
  Due to the nature of exponential moving averages, the effect of this initial value persists as “history” for approximately 100 steps. During this period, the system maintains a high acceleration pressure while providing convergence power only to “truly reliable signals” that have passed the strict screening by the emotional mechanism.
206
 
207
 
208
- 5. Coding Normalization: Adaptation to Low-Precision Environments
209
 
210
  This chapter describes sign-based normalization for applying the theoretical framework of emoPulse to low-precision environments.
211
 
212
- To eliminate reliance on precise floating-point calculations and support ultra-low precision environments (ultra-quantization), the following update rules are adopted (EmoAiry, EmoCats, etc.)
213
 
214
  delta_w_t = -emoPulse_t * sign( m_t / ( sqrt(v_t) + ε ) )
215
 
216
  This enables EmoAiry to resolve the imbalance in accuracy between one-dimensional vectors and two-dimensional moments, achieving a “unification of will” that extracts only the consensus on direction.
217
  ※ EmoCats supports encoding based on Lion with WD separation.
218
 
 
 
219
 
220
- 6. Conclusion
221
 
222
  EmoSens Generation v3.7 and later has completed the “emotional cycle” that begins with observing the loss function.
223
 
@@ -267,14 +307,13 @@ Supplementary Material (1): Analysis of emoPulse Dynamics in v3.7 and later
267
  Sigma_t [Minus] |---(-)---0.5---(+)---0---(+)---0.5---(-)---| [Plus]
268
  |--Hist(-)-|-Hist(Non)|--Hist(+)-|--Hist(-)-| Regret
269
 
270
- μ_g and μ_d:
271
- v3.7:[Acceleration:LR Growth Max 1.05x] / [Deceleration:LR Decay 0.98x]
272
- v3.8:[Acceleration:LR Growth Max 1.50x] / [Deceleration:LR Decay 0.80x]
273
 
274
  4. Conclusions on Numerical Stability
275
 
276
  This design, which pits the difference between the “time axis (history)” and the “instant axis (present)” against each other, is not merely a matter of decay. The system autonomously “constantly recalculates the ratio of ‘Doubt’ (Noise) to ‘Certainty’ (Distance)”, enabling dynamic control akin to “heartbeats responding to terrain complexity”—something impossible with manual schedulers.
277
-
278
  ※ EmoTion is an original model implemented in v3.8.
279
  ※ dNR_hist has different coefficients in v3.7 and v3.8; v3.8 is more aggressive, designed to produce larger fluctuations than v3.7.
280
 
@@ -285,7 +324,7 @@ I hope this intuition will be refined into a rigorous mathematical proof by the
285
 
286
  Autonomous Flat-Minima Generation via multiple Positioning of Heterogeneous Optimizers
287
 
288
- -Proposal of a New Learning Method: Prediction of “Evolutionary Flat Minimum Formation” via Local Synthesis Using Three Types of Emo Systems-
289
 
290
 
291
  1. Purpose: To resolve the high cost associated with achieving flat minimization.
@@ -299,7 +338,7 @@ Autonomous Flat-Minima Generation via multiple Positioning of Heterogeneous Opti
299
 
300
  2. Proposal: Don't “search” for flat minima; create them yourself.
301
 
302
- The three emo variants (EmoSens, EmoAiry, EmoCats, EmoTion) share a common learning structure despite differing update mechanisms. When trained under identical conditions, they yield learning outcomes with divergent local solutions from different directions.
303
  Integrating these divergent learning outcomes constitutes a synthesis of local solutions, and we anticipate that this synthesis may broaden and flatten the local solutions. In other words, it may bring local solutions closer to flat minima or transform them into flat minima themselves.
304
 
305
  Acquiring these local solutions as full-layer LoRA and integrating them using synthesis methods such as TALL-Mask-Merge,
@@ -307,8 +346,8 @@ Autonomous Flat-Minima Generation via multiple Positioning of Heterogeneous Opti
307
  ∨∨∨ → \___/ Composite image of local solutions
308
  (multiple local solutions) (Post-synthesis flattening)
309
 
310
- ・The “commonly low areas” of local solutions in three directions are emphasized.
311
- ・The sharp edges on multiple (sharp minim) cancel each other out
312
  ・As a result, a shape close to a flat valley bottom (flat minimum) is reconstructed.
313
 
314
  This treats the local solution as multiple positioning (multiple-axis positioning),
@@ -340,7 +379,7 @@ Autonomous Flat-Minima Generation via multiple Positioning of Heterogeneous Opti
340
 
341
  The multiple models were integrated by combining their respective learning results into the original model, and this new multiple-model system was then merged back into the original model using TM-merge.
342
 
343
- Original Model (org) ≪= TM Integration ≪= Model S (Sens), Model A (Airy), Model C (Cats), Model D (Tion)
344
 
345
  Instead of directly integrating with LoRA alone, we integrated it into the base model and then reduced these multiple models back to the base model using TM-merge.
346
  For FFT (full fine-tuning), we predict that simply merging the post-FFT models back into the original model via TM-merge will yield an equivalent effect.
 
1
+ Paper: Improving Time-Series SNR Estimation and Regret Bound in the Autonomous Optimization Algorithm emoPulse and Exploring Second-Moment-Free Updates via “Geometric Orthogonality of Weights and Gradients”
2
 
3
+ — Establishing “Emotion-Driven” Learning Rate Control through Dynamic Inspection of Loss Landscapes and Proposing Next-Generation Optimization through Interaction with Loss Landscapes
4
 
5
  Abstract
6
 
 
10
 
11
  This method autonomously generates an optimal learning rate based on the signal-to-noise ratio by capturing the “undulations” of the loss landscape from a three-stage exponential moving average (Multi-EMA) and utilizing sentiment scalars and a confidence indicator (Trust).
12
 
13
+ Next, we propose the W-Ref Geometry update rule, which focuses on the geometric relationship between weights and gradients.
14
+
15
+ By dynamically controlling inertia based on the orthogonality between weights and gradients, this realizes a “second-moment-free” update that retains no second moment and responds immediately to changes in the terrain.
16
+
17
+ This simultaneously reduces VRAM usage, providing a democratic foundation for multilingual learning in research environments with limited computational resources and for multicultural coexistence.
18
+
19
+ Furthermore, by synthesizing the learning results of optimizers (Sens / Airy / Cats / Tion) belonging to this family and possessing distinct update characteristics, we present a method that integrates local solutions in a “multiple positioning” manner to artificially create flat minima.
20
 
21
  This achieves robust convergence independent of hyperparameter settings, providing a democratic foundation for research environments in developing countries with limited computational resources and for multilingual learning aimed at preserving diverse cultural heritage.
22
 
23
  Finally, I append my thoughts and predictions regarding Grokking.
24
+ ※ Version 3.7 excludes EmoTion (EmoTion is newly developed in version 3.8). The only difference between versions 3.7 and 3.8 lies in the dNR_hist of the emoPulse mechanism described later; all other aspects are identical.
25
 
26
 
27
  1. Introduction
28
 
29
+ This paper presents a unified theory for the optimizers EmoSens / EmoAiry / EmoCats / EmoTion (v3.7 and later).
30
 
31
  This method centers on the emoPulse mechanism, which autonomously generates learning rates by layering the exponential moving average (EMA) of loss values and extracting “Trust” from the time-series statistics of the loss function.
32
 
 
42
 
43
  This approach achieves the replacement of higher-order moment calculations with scalar control and optimization for low-precision and quantized environments through encoded updates.
44
 
45
+ Its most significant feature lies in integrating local solutions from multiple emo-based optimizers with distinct characteristics as “multiple positioning.” This enables reaching the flat minimum—previously requiring lengthy iterative learning—through short-term learning and synthesis.
46
 
47
  This approach achieved the following three outcomes:
48
 
49
  Dramatic improvement in computational efficiency: Complex calculations of higher-order moments were replaced with scalar control via temporal accumulation of loss, reducing computational load through temporal accumulation approximation.
50
 
51
+ Optimization for low precision and quantization: matrix decomposition in EmoAiry, complete elimination of the second moment in EmoCats, and EmoTion's original “geometric orthogonal update” (likewise second-moment-free), combined with encoded updates, enabled large-scale learning in low-resource environments.
52
 
53
  Autonomous Convergence: By introspecting the S/N ratio of the loss landscape, it eliminates the need for manual schedulers and minimizes the user's trial cost.
54
 
55
+ ※ Higher-order moment approximation: Aggregation to higher-order statistics in the time series
56
 
57
  Mathematically, this represents an advanced fusion of D-adaptation theory and time-series signal processing, forming the foundation for realizing “democratic AI learning” that preserves research environments and diverse cultures in developing countries.
58
 
59
+ EmoTion achieves a lightweight structure that does not require second-order moments by not only replacing higher-order moment calculations with scalar control, but also by using the geometric information inherent in the weights themselves as a guideline for updates (detailed in Chapter 6).
60
 
61
 
62
  2. Theoretical Framework: Emotional Circulation
 
76
 
77
  ※ Note on the Time-Series Formation of Higher-Order Moments:
78
 
79
+ The higher-order moment approximation in this method is not calculated from single-step gradient information but is formed through temporal accumulation.
80
 
81
  This means it observes not the static curvature of the terrain but the “dynamic rate of change in the terrain as learning progresses.”
82
 
83
+ ※ Hierarchical Structure of Higher-Order Moment Approximation:
84
+
85
+ This method effectively approximates higher-order moments from the third (skewness) to the seventh (confidence amplification) order by accumulating loss over time.
86
+
87
+ This is not a static terrain analysis, but rather an attempt to extract the “system's confidence” as a physical quantity within the dynamic process of learning.
88
+
89
+ The Multi-EMA structure in this method functions as a dynamic temporal approximation of higher-order moments in statistics.
90
+
91
+ Third- to fifth-order approximation: The differences between the Short, Medium, and Long EMAs extract the temporal evolution of higher-order information such as skewness, kurtosis, and fluctuations in the loss distribution.
92
+
93
+ Sixth-order approximation: The integrated emotion scalar sigma_t and confidence metric trust_t become sixth-order meta-statistics that indicate “learning phase stability” beyond mere gradient variance.
94
+
95
+ Seventh-order approximation (dNR): In deriving dNR, squaring the ratio of these sixth-order information components, (d_base/noise_base)^2, exponentially amplifies subtle differences in confidence, yielding an extremely sensitive control signal equivalent to a seventh-order moment.
96
+
97
+
98
  2.2 Definition of the trust level metric trust_t
99
 
100
  Define the core metric trust_t that determines the “quality” of updates as follows.
 
110
 
111
  3. emoPulse: Learning Rate Generation via Autonomous Pulsation
112
 
113
+ In v3.7 and later, the conventional emoDrive (acceleration mechanism) has been integrated into emoPulse. This represents an evolution based on an approximation of dynamic distance estimation (D-adaptation) using the time-series signal-to-noise ratio (S/N ratio).
114
 
115
  3.1 Dynamic Estimation of Noise and Distance
116
 
 
197
 
198
  d_base = abs(N_t - d_t) + ε_t
199
 
200
+ N_t is guaranteed to remain strictly positive by max(noise_est, ν_r), and d_t is updated by the cumulative sum of abs(trust_t), regardless of improvement or deterioration.
201
  By adding a safety factor (+0.1) to these temporal statistical differences, it is mathematically guaranteed that **“even when history is unstable in an extremely low-precision environment, the minimum step size (lower limit of the numerator) is always ensured.”**
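To make the numerator floor concrete, here is a minimal sketch under stated assumptions: only d_base = |N_t - d_t| + ε and the squaring of the ratio follow the text; the exact form of noise_base and the value of ν_r are illustrative guesses.

```python
# Hedged sketch of d_base and the squared-ratio signal dNR.
# The form of noise_base and the value of nu_r are assumptions;
# d_base = |N_t - d_t| + eps and the squaring follow the text.
def dnr(noise_est, d_t, nu_r=0.1, eps=0.1):
    n_t = max(noise_est, nu_r)         # N_t: kept strictly positive
    d_base = abs(n_t - d_t) + eps      # safety factor guarantees a floor
    noise_base = n_t + eps             # assumed denominator form
    return (d_base / noise_base) ** 2  # squaring amplifies confidence gaps
```

Even with zero history (noise_est = d_t = 0), the floor keeps the signal positive, which is the boundedness property claimed above.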
202
 
203
  3. Conclusions on Boundedness and Constraints on emoPulse:
 
219
  Due to the nature of exponential moving averages, the effect of this initial value persists as “history” for approximately 100 steps. During this period, the system maintains a high acceleration pressure while providing convergence power only to “truly reliable signals” that have passed the strict screening by the emotional mechanism.
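The “roughly 100 steps of history” claim can be checked directly: with decay β, an EMA's initial value keeps weight β^t after t steps. β = 0.99 below is an assumed long-EMA decay, not a value given in the text.

```python
# Residual weight of an EMA's initial value after t steps: beta ** t.
# beta = 0.99 is an assumed long-EMA decay; at t = 100 roughly a third
# of the initial value still shapes the "history" described above.
def initial_weight(beta, steps):
    return beta ** steps

residual = initial_weight(0.99, 100)  # ~0.37 of the initial value remains
```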
220
 
221
 
222
+ 5. Sign-Based Normalization: Adaptation to Low-Precision Environments
223
 
224
  This chapter describes sign-based normalization for applying the theoretical framework of emoPulse to low-precision environments.
225
 
226
+ To eliminate reliance on precise floating-point calculation and to support ultra-low-precision (ultra-quantized) environments, the following update rule is adopted (EmoAiry, EmoCats, EmoTion):
227
 
228
  delta_w_t = -emoPulse_t * sign( m_t / ( sqrt(v_t) + ε ) )
229
 
230
  This enables EmoAiry to resolve the imbalance in accuracy between one-dimensional vectors and two-dimensional moments, achieving a “unification of will” that extracts only the consensus on direction.
231
  ※ EmoCats supports encoding based on Lion with WD separation.
232
+ ※ EmoTion encodes a proprietary update method called “Geometric Orthogonal Update.”
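A minimal scalar sketch of the quoted rule: because sqrt(v_t) + ε is positive, the sign function collapses to sign(m_t), which is exactly what makes the update robust under aggressive quantization. Scalars stand in for tensors here; this is not the shipped optimizer code.

```python
# Minimal scalar sketch of:
#   delta_w_t = -emoPulse_t * sign( m_t / ( sqrt(v_t) + eps ) )
# The strictly positive denominator means only the sign of m_t survives.
def sign_update(emo_pulse, m_t, v_t, eps=1e-8):
    normalized = m_t / ((v_t ** 0.5) + eps)
    sign = (normalized > 0) - (normalized < 0)  # -1, 0, or +1
    return -emo_pulse * sign
```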
233
+
234
+
235
+ 6. Respect for Existing Methods and EmoTion's Position
236
+
237
+ The EmoTion update algorithm stems from deep respect for AdamW, a pinnacle of modern deep learning. The concept of “adaptive learning rate” demonstrated by AdamW established the conditions for effective optimization and significantly lowered the barriers to its adoption.
238
+
239
+ EmoTion inherits this spirit while taking a different approach: using geometry (W-Ref Geometry) and emotion (emoPulse) instead of statistics.
240
 
241
+ A New Form of Precision:
242
+ While AdamW meticulously carves a path from past statistics, EmoTion navigates terrain more flexibly through dialogue with current weights (Geometric interaction with current weights) and the pulse of loss. This approach aims for natural convergence that suppresses overfitting while maintaining accuracy on par with AdamW. (Orthogonality as Freshness)
243
 
244
+ Resource-Friendly Design (Reduced VRAM):
245
+ Computational resources are finite, and not everyone has access to high-performance, abundant resources. By entrusting the precise mechanism of second-order moments—which AdamW has carefully preserved—to “scalar control,” EmoTion was able to reduce VRAM load by approximately half. We believe this forms the foundation for a “democratic learning environment” where more people can conduct AI training.
246
+
247
+ Geometric Inertia Control Using W-Ref Geometry:
248
+ The core of this method lies in its geometric update rule based on the orthogonality between the weight vector W and the gradient vector G.
249
+ Whereas conventional statistical methods rely on the accumulated gradient history (shadow), W-Ref Geometry uses the current weight W as the “substance” and derives the freshness of gradient G from the following cosine similarity ρ(rho).
250
+
251
+ rho = |W * G| / ( ||W|| * ||G|| + eps )
252
+
253
+ The smaller ρ (rho) is (the closer it is to orthogonal), the more the current gradient is judged to contain “unknown information” not present in the existing weight structure. This allows the current gradient to be strongly incorporated, overcoming inertia. This geometric “information selection” simultaneously achieves high-precision directional changes without statistical delay and a regularization effect by suppressing redundant updates. (Dynamic Inertia Calibration)
254
+
255
+ Why the method works with only the first moment:
256
+ EmoTion's omission of the second moment (variance estimation) is not merely a lightweighting measure. W-Ref Geometry updates based on the “freshness of direction” rather than the “magnitude” of the gradient, rendering unnecessary much of the role traditionally filled by the second moment. (Departure from Second-Order Moments)
257
+ Direction selection via W-Ref Geometry determines that gradients G containing unknown information are those most orthogonal to weight W, thereby reducing inertia and steering toward new directions. Conversely, gradients parallel to W are deemed redundant, prioritizing inertia. This selection based on “direction purity” is more direct than variance estimation, robust against noise, and suppresses overfitting.
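The ρ test above can be sketched with plain lists. Only the definition of ρ is taken from the text; how ρ scales the momentum coefficient in `inertia` is an illustrative assumption.

```python
# rho = |W . G| / ( ||W|| * ||G|| + eps ): absolute cosine similarity
# between the weight vector W and the gradient vector G.
def rho(w, g, eps=1e-8):
    dot = sum(wi * gi for wi, gi in zip(w, g))
    norm_w = sum(wi * wi for wi in w) ** 0.5
    norm_g = sum(gi * gi for gi in g) ** 0.5
    return abs(dot) / (norm_w * norm_g + eps)

# Assumed coupling: inertia shrinks as the gradient nears orthogonality,
# so "unknown information" (small rho) overrides momentum.
def inertia(w, g, beta_max=0.9):
    return beta_max * rho(w, g)
```

An orthogonal gradient yields ρ ≈ 0 (low inertia, adopt the new direction); a parallel one yields ρ ≈ 1 (redundant, keep momentum), matching the selection rule described above.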
258
+
259
+
260
+ 7. Conclusion
261
 
262
  EmoSens Generation v3.7 and later has completed the “emotional cycle” that begins with observing the loss function.
263
 
 
307
  Sigma_t [Minus] |---(-)---0.5---(+)---0---(+)---0.5---(-)---| [Plus]
308
  |--Hist(-)-|-Hist(Non)|--Hist(+)-|--Hist(-)-| Regret
309
 
310
+ μ_g and μ_d:
311
+ v3.7:[Acceleration:LR Growth Max 1.05x] / [Deceleration:LR Decay 0.98x]
312
+ v3.8:[Acceleration:LR Growth Max 1.50x] / [Deceleration:LR Decay 0.80x]
313
 
314
  4. Conclusions on Numerical Stability
315
 
316
  This design, which pits the difference between the “time axis (history)” and the “instant axis (present)” against each other, is not merely a matter of decay. The system autonomously “constantly recalculates the ratio of ‘Doubt’ (Noise) to ‘Certainty’ (Distance)”, enabling dynamic control akin to “heartbeats responding to terrain complexity”—something impossible with manual schedulers.
 
317
  ※ EmoTion is an original model implemented in v3.8.
318
  ※ dNR_hist has different coefficients in v3.7 and v3.8; v3.8 is more aggressive, designed to produce larger fluctuations than v3.7.
319
 
 
324
 
325
  Autonomous Flat-Minima Generation via multiple Positioning of Heterogeneous Optimizers
326
 
327
+ -Proposal of a New Learning Method: Prediction of “Evolutionary Flat-Minimum Formation” via Local Synthesis of Emo Systems-
328
 
329
 
330
  1. Purpose: To resolve the high cost associated with achieving flat minimization.
 
338
 
339
  2. Proposal: Don't “search” for flat minima; create them yourself.
340
 
341
+ Emo-style models (EmoSens, EmoAiry, EmoCats, EmoTion) share a common learning structure despite differing update mechanisms. When trained under identical conditions, they yield learning results with differences representing “local solutions from different directions.”
342
  Integrating these divergent learning outcomes constitutes a synthesis of local solutions, and we anticipate that this synthesis may broaden and flatten the local solutions. In other words, it may bring local solutions closer to flat minima or transform them into flat minima themselves.
343
 
344
  Acquiring these local solutions as full-layer LoRA and integrating them using synthesis methods such as TALL-Mask-Merge,
 
346
  ∨∨∨ → \___/ Composite image of local solutions
347
  (multiple local solutions) (Post-synthesis flattening)
348
 
349
+ ・The “commonly low areas” of local solutions in multiple directions are emphasized.
350
+ ・The sharp edges of the multiple local solutions (sharp minima) cancel each other out
351
  ・As a result, a shape close to a flat valley bottom (flat minimum) is reconstructed.
352
 
353
  This treats the local solution as multiple positioning (multiple-axis positioning),
 
379
 
380
  The multiple models were integrated by combining their respective learning results into the original model, and this new multiple-model system was then merged back into the original model using TM-merge.
381
 
382
+ Original Model (org) ≪= TM Integration ≪= Model S (Sens), Model A (Airy), Model C (Cats), Model T (Tion)
383
 
384
  Instead of directly integrating with LoRA alone, we integrated it into the base model and then reduced these multiple models back to the base model using TM-merge.
385
  For FFT (full fine-tuning), we predict that simply merging the post-FFT models back into the original model via TM-merge will yield an equivalent effect.
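As a sketch of the merge step, here is a toy agreement-mask merge over flat weight lists. It is inspired by, but not identical to, TALL-Mask-Merge; the sign-quorum rule and all names are assumptions for illustration.

```python
# Toy consensus merge (assumption-laden sketch, not TALL-Mask-Merge itself):
# each model contributes a delta from the base weights; a coordinate is
# merged only when enough deltas agree in sign, otherwise it is left alone.
def merge(base, models, quorum=2):
    merged = []
    for i, b in enumerate(base):
        deltas = [m[i] - b for m in models]
        pos = sum(d > 0 for d in deltas)
        neg = sum(d < 0 for d in deltas)
        if max(pos, neg) >= quorum:           # consensus: keep the average delta
            merged.append(b + sum(deltas) / len(deltas))
        else:                                 # conflict: sharp edges cancel out
            merged.append(b)
    return merged

out = merge([0.0, 0.0], [[0.2, 0.3], [0.1, -0.3], [0.3, 0.0]])
```

The first coordinate (all deltas positive) is merged; the second (conflicting signs) falls back to the base weight, mirroring the “sharp edges cancel” intuition.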
emo-v38-paper(JPN).txt CHANGED
@@ -1,11 +1,12 @@
1
- Paper: Improving Time-Series SNR Estimation and Regret Bound in the Autonomous Optimization Algorithm emoPulse
2

3
- 〜 Establishing “Emotion-Driven” Learning-Rate Control through Dynamic Introspection of the Loss Landscape 〜
4

5

6
  Abstract
7
  In deep-learning optimization, tuning the learning rate and securing generalization performance are central challenges. Existing methods depend on precise gradient estimation and are fragile against noise in extremely low-precision environments. This paper proposes emoPulse (v3.7 and later), an autonomous algorithm centered on multi-angle time-series analysis of the loss function. The method captures the “undulations” of the loss landscape from a three-stage exponential moving average (Multi-EMA) and, via an emotion scalar and a confidence metric (Trust), autonomously generates an optimal learning rate based on the S/N ratio.
8
- Furthermore, by synthesizing the learning results of three optimizers in this family with distinct update characteristics ( Sens / Airy / Cats / Tion ), we present a method that integrates local solutions by “multiple positioning” to artificially create flat minima. This achieves robust convergence independent of hyperparameter settings, and provides a democratic foundation for research environments in developing countries with limited computational resources and for multilingual learning aimed at passing on diverse cultural heritage.

9
  Finally, I append thoughts and predictions regarding Grokking.
10
  ※ v3.7 excludes EmoTion (EmoTion was newly developed in v3.8). The only difference between v3.7 and v3.8 is in dNR_hist of the emoPulse mechanism described later; everything else is identical.
11
 
@@ -22,14 +23,16 @@
22
 
23
  Dramatic improvement in computational efficiency: complex higher-order moment calculations were replaced with scalar control via temporal accumulation of the loss, reducing computational load through a temporal-accumulation approximation.
24

25
- Optimization for low precision and quantization: matrix decomposition in EmoAiry, complete elimination of the second moment in EmoCats, and the original EmoTion, which incorporates both techniques, enabled large-scale learning in low-resource environments through sign encoding.
26

27
  Autonomous convergence: by introspecting the S/N ratio of the loss landscape, manual schedulers become unnecessary and the user's trial cost is minimized.
28

29
- Higher-order moments: aggregation into higher-order statistics on the time axis (Time-series Higher-order Statistics)
30

31
  Mathematically, this is an advanced fusion of D-adaptation theory and time-series signal processing, forming a foundation for “democratic AI learning” that preserves research environments in developing countries and diverse cultures.
32
 
 
 
33
 
34
  2. Theoretical Framework: Emotional Circulation
35
 
@@ -70,7 +73,7 @@
70
 
71
  3. emoPulse: Learning-Rate Generation via Autonomous Pulsation
72

73
- In v3.7, the conventional emoDrive (acceleration mechanism) was integrated into emoPulse. This is an evolution based on approximating dynamic distance estimation (D-adaptation) with the time-series signal-to-noise (S/N) ratio.
74

75
  3.1 Dynamic Estimation of Noise and Distance
76
 
@@ -166,7 +169,7 @@
166
  Stabilization via initial-value settings:
167
  ※ In environments where the dataset is very small or initial noise is large, re-setting the initial values of d_t and N_t until the Multi-EMA stabilizes its “history” is recommended (e.g., d-est: 0.2, Noise-est: 0.2). This suppresses divergence caused by early stochastic noise. In particular, initializing N_0 equal to d_0 starts the system in an essentially “cautious mode.” This functions as an organic warm-up phase that avoids overly aggressive updates and prioritizes observing the terrain during the critical early steps.
168
  Balancing safety with sustained “update pressure” via initial values:
169
- ※ In this method, d_base, which forms the numerator of emoPulse, determines the system's “latent update power.” Setting the initial values N0 = 1.0 and d0 = 0.02 means deliberately securing high acceleration potential from the start of training. Due to the characteristics of the exponential moving average, the influence of these initial values remains as “history” for roughly 100 steps. During this period, the system keeps high acceleration pressure in reserve while providing convergence power only to “truly reliable signals” that have passed the strict screening of the emotion mechanism.
170
 
171
 
172
  5. Sign-Based Normalization: Adaptation to Low-Precision Environments
@@ -178,12 +181,36 @@
178
  delta_w_t = -emoPulse_t * sign( m_t / ( sqrt(v_t) + ε ) )
179
 
180
  This allows EmoAiry to resolve the precision imbalance between one-dimensional vectors and two-dimensional moments, realizing a “unification of will” that extracts only the consensus on direction.
181
- ※ EmoCats handles this with Lion-based sign encoding with weight-decay (WD) separation
182
 
183
 
184
- 6. Conclusion
185

186
- The EmoSens generation (v3.7 and later) completed the “emotional cycle” that begins with observing the loss function.
187

188
  Observation (Multi-EMA): captures the undulations of the terrain.
189
  Judgment (Trust): switches between conviction and hesitation at the ±0.5 boundary.
@@ -203,7 +230,7 @@
203
 
204
 
205
 
206
- 補足資料(1):v3.7 以降 における emoPulse のダイナミクスの解析
207
 
208
  1. Purpose
209
 
@@ -301,7 +328,7 @@
301
 
302
  emo系による統合は、元モデルにそれぞれの学習結果を統合し、この新しい多種モデルを TM-merge にて元モデルへ統合した。
303
 
304
- 元モデル(org) ≪= TM統合 ≪= モデルS(Sens)、モデルA(Airy)、モデルC(Cats)、モデルD(Tion)
305
 
306
  LoRAだけで直接統合せず元モデルへ統合し、これら新モデルを元モデルへ TM-merge で還元した。
307
  FFTではFFT後のモデルを元モデルへ TM-merge するだけで同等の効果を持つものと予測する。
 
1
+ Paper: Improving Time-Series SNR Estimation and Regret Bound in the Autonomous Optimization Algorithm emoPulse, and Exploring Second-Moment-Free Updates via the “Geometric Orthogonality of Weights and Gradients”
2

3
+ 〜 Establishing “Emotion-Driven” Learning-Rate Control through Dynamic Introspection of the Loss Landscape, and Proposing Next-Generation Optimization through Dialogue with the Loss Landscape 〜
4

5

6
  Abstract
7
  In deep-learning optimization, tuning the learning rate and securing generalization performance are central challenges. Existing methods depend on precise gradient estimation and are fragile against noise in extremely low-precision environments. This paper proposes emoPulse (v3.7 and later), an autonomous algorithm centered on multi-angle time-series analysis of the loss function. The method captures the “undulations” of the loss landscape from a three-stage exponential moving average (Multi-EMA) and, via an emotion scalar and a confidence metric (Trust), autonomously generates an optimal learning rate based on the S/N ratio.
8
+ Next, we propose W-Ref Geometry, an update rule focused on the geometric relationship between weights and gradients. By dynamically controlling inertia based on the orthogonality of weights and gradients, it realizes a “second-moment-free” update that retains no second moment and responds immediately to changes in the terrain. This also reduces VRAM usage, providing a democratic foundation for research environments with limited computational resources and for multilingual learning toward multicultural coexistence.
9
+ Furthermore, by synthesizing the learning results of the four optimizers in this family with distinct update characteristics ( Sens / Airy / Cats / Tion ), we present a method that integrates local solutions by “multiple positioning” to artificially create flat minima. This achieves robust convergence independent of hyperparameter settings, and provides a democratic foundation for research environments in developing countries with limited computational resources and for multilingual learning aimed at passing on diverse cultural heritage.
10
  Finally, I append thoughts and predictions regarding Grokking.
11
  ※ v3.7 excludes EmoTion (EmoTion was newly developed in v3.8). The only difference between v3.7 and v3.8 is in dNR_hist of the emoPulse mechanism described later; everything else is identical.
12
 
 
23
 
24
  Dramatic improvement in computational efficiency: complex higher-order moment calculations were replaced with scalar control via temporal accumulation of the loss, reducing computational load through a temporal-accumulation approximation.
25

26
+ Optimization for low precision and quantization: matrix decomposition in EmoAiry, complete elimination of the second moment in EmoCats, and the original EmoTion's “geometric orthogonal update” with complete second-moment elimination, together with encoded updates, enabled large-scale learning in low-resource environments.
27

28
  Autonomous convergence: by introspecting the S/N ratio of the loss landscape, manual schedulers become unnecessary and the user's trial cost is minimized.
29

30
+ Higher-order moment approximation: aggregation into higher-order statistics on the time axis (Time-series Higher-order Statistics)
31

32
  Mathematically, this is an advanced fusion of D-adaptation theory and time-series signal processing, forming a foundation for “democratic AI learning” that preserves research environments in developing countries and diverse cultures.
33

34
+ ※ EmoTion achieves a lightweight structure requiring no second moment by not only replacing higher-order moment calculation with scalar control, but also by using the geometric information of the weights themselves as a guideline for updates (detailed in Chapter 6)
35
+
36
 
37
  2. Theoretical Framework: Emotional Circulation
38
 
 
73
 
74
  3. emoPulse: Learning-Rate Generation via Autonomous Pulsation
75

76
+ In v3.7 and later, the conventional emoDrive (acceleration mechanism) was integrated into emoPulse. This is an evolution based on approximating dynamic distance estimation (D-adaptation) with the time-series signal-to-noise (S/N) ratio.
77

78
  3.1 Dynamic Estimation of Noise and Distance
79
 
 
169
  Stabilization via initial-value settings:
170
  ※ In environments where the dataset is very small or initial noise is large, re-setting the initial values of d_t and N_t until the Multi-EMA stabilizes its “history” is recommended (e.g., d-est: 0.2, Noise-est: 0.2). This suppresses divergence caused by early stochastic noise. In particular, initializing N_0 equal to d_0 starts the system in an essentially “cautious mode.” This functions as an organic warm-up phase that avoids overly aggressive updates and prioritizes observing the terrain during the critical early steps.
171
  Balancing safety with sustained “update pressure” via initial values:
172
+ ※ In this method, d_base, which forms the numerator of emoPulse, determines the system's “latent update power.” Setting the initial values N0 = 1.0 and d0 = 0.02 means deliberately securing high acceleration potential from the start of training. Due to the characteristics of the exponential moving average, the influence of these initial values remains as “history” for roughly 100 steps. During this period, the system keeps high acceleration pressure in reserve while providing convergence power only to “truly reliable signals” that have passed the strict screening of the emotion mechanism.
173
 
174
 
175
  5. Sign-Based Normalization: Adaptation to Low-Precision Environments
 
181
  delta_w_t = -emoPulse_t * sign( m_t / ( sqrt(v_t) + ε ) )
182
 
183
  This allows EmoAiry to resolve the precision imbalance between one-dimensional vectors and two-dimensional moments, realizing a “unification of will” that extracts only the consensus on direction.
184
+ ※ EmoCats handles this with Lion-based sign encoding with weight-decay (WD) separation
185
+ ※ EmoTion encodes its original update rule, the “geometric orthogonal update”
186
+
187
+
188
+ 6. Respect for Existing Methods and EmoTion's Position
189
+
190
+ The EmoTion update algorithm starts from deep respect for AdamW, a monument of modern deep learning. The concept of an “adaptive learning rate” demonstrated by AdamW established the conditions under which optimization can be carried out at all, greatly lowering the barrier to adoption.
190
+
191
+ EmoTion inherits that spirit while taking a different approach: geometry (W-Ref Geometry) and emotion (emoPulse) in place of statistics.
192
+
193
+ A new form of precision:
194
+ While AdamW carves its path meticulously from past statistics, EmoTion walks the terrain more supplely through dialogue with the current weights and the pulse of the loss. It thereby aims for a natural convergence that suppresses overfitting while maintaining accuracy on par with AdamW.
195
+
196
+ Resource friendliness (VRAM reduction):
197
+ Computational resources are finite, and not everyone has access to high-performance, abundant resources. By entrusting the precise mechanism of the second moment, which AdamW has carefully maintained, to scalar control, EmoTion cut the VRAM load roughly in half. We believe this becomes the foundation of a “democratic learning environment” in which more people can carry out AI training.
198
+
199
+ Geometric inertia control via W-Ref Geometry:
200
+ The core of this method is a geometric update rule based on the orthogonality of the weight vector W and the gradient vector G. Whereas conventional statistical methods depend on the accumulation (the “shadow”) of past gradients, W-Ref Geometry takes the current weight W as the “substance” and derives the freshness of the gradient G from the following cosine similarity ρ (rho).
201
+
202
+ rho = |W * G| / ( ||W|| * ||G|| + eps )
203
+
204
+ The smaller ρ (rho) is (the closer to orthogonal), the more the current gradient is judged to carry “unknown information” absent from the existing weight structure, so inertia is set aside and the current gradient is incorporated strongly. This geometric selection of information simultaneously achieves high-precision changes of direction without statistical delay and a regularization effect from suppressing redundant updates.
205
+
206
+ Why the method works with only the first moment:
207
+ EmoTion's lack of a second moment (variance estimation) is not mere lightweighting. Because W-Ref Geometry updates on the basis of the freshness of direction rather than the magnitude of the gradient, much of the role the second moment traditionally plays becomes unnecessary. In this direction selection, the closer the gradient G is to orthogonal to the weight W, the more it is judged to contain unknown information, weakening inertia and steering toward the new direction; gradients parallel to W are deemed redundant, and inertia is prioritized. This selection based on purity of direction is more direct than variance estimation, robust to noise, and suppresses overfitting.
209
 
210
 
211
+ 7. Conclusion
211

212
+ The EmoSens generation (v3.7 and later) completed the “emotional cycle” that begins with observing the loss function.
213

214
  Observation (Multi-EMA): captures the undulations of the terrain.
215
  Judgment (Trust): switches between conviction and hesitation at the ±0.5 boundary.
 
230
 
231
 
232
 
233
+ Supplementary Material (1): Analysis of emoPulse Dynamics in v3.7 and Later
234
 
235
  1. Purpose
236
 
 
328
 
329
  Integration with the emo family merged each model's learning results into the original model, and this new multi-model set was then merged back into the original model via TM-merge.
330

331
+ Original model (org) ≪= TM integration ≪= Model S (Sens), Model A (Airy), Model C (Cats), Model T (Tion)
332

333
  Rather than integrating directly with LoRA alone, we integrated into the base model and then reduced these new models back to the base model via TM-merge.
334
  For FFT, we predict that simply TM-merging the post-FFT model back into the original model has an equivalent effect.