Felldude committed on
Commit 9e4b863 · verified · 1 Parent(s): ea1e135

Update README.md

Files changed (1)
  1. README.md +87 -56
README.md CHANGED
@@ -12,84 +12,115 @@ tags:
  - ernie
  pipeline_tag: text-generation
  ---
- Model Card
- Overview

  This repository documents two separate large language model training methodologies and precision strategies:

- Mistral LLM Training
- Fully trained in native FP32 precision.
- Optimization performed using standard AdamW.
- No Adam8bit, quantized optimizer states, or reduced-precision optimizer approximations were used during training.
- Intended to preserve numerical stability and high-fidelity gradient accumulation throughout all training phases.
-
- DIT Ernie Model
- Uses a Monte Carlo estimation approach to approximate FP32 behavior.
-
- Training Details
- Mistral LLM
- Precision
- Full FP32 training
- FP32 activations
- FP32 optimizer states
- FP32 gradients
- Optimizer
- AdamW
- Weight decay enabled
- No 8-bit optimizer compression
- No low-rank optimizer approximation
- Notes

  The Mistral configuration prioritizes:

- numerical consistency
- deterministic convergence behavior
- stable long-context optimization
- reduced quantization-induced gradient noise

- This setup is computationally expensive but provides high-fidelity optimization dynamics during pretraining and finetuning.

- DIT Ernie
- Precision Strategy

  The DIT Ernie architecture utilizes:

- Monte Carlo estimation techniques
- probabilistic FP32 approximation
- stochastic numerical reconstruction

  Rather than maintaining strict FP32 execution across the entire training stack, the model estimates FP32-equivalent statistical behavior through sampling-based computation.

- Goals
- reduce memory bandwidth requirements
- improve throughput efficiency
- retain approximate FP32 convergence characteristics
- balance numerical quality with hardware scalability
- Notes

  This methodology may introduce:

- stochastic variance between runs
- approximation noise
- non-deterministic optimization characteristics

  However, it can significantly reduce training cost relative to native FP32 execution.

- Intended Use

  This repository is intended for:

- research documentation
- training methodology comparison
- optimizer precision analysis
- numerical stability benchmarking
- transformer architecture experimentation
- Limitations
  Results can vary depending on:
- sampling strategy
- hardware backend
- distributed training topology
- random seed initialization
- License

- Apache License 2.0
  - ernie
  pipeline_tag: text-generation
  ---
+
+ # **Model Card**
+
+ # **Overview**

  This repository documents two separate large language model training methodologies and precision strategies:

+ ---
+
+ # **Mistral LLM Training**
+
+ - **Fully trained in native FP32 precision**
+ - Optimization performed using standard **AdamW**
+ - **No Adam8bit**, quantized optimizer states, or reduced-precision optimizer approximations were used during training
+ - Intended to preserve **numerical stability** and **high-fidelity gradient accumulation** throughout all training phases
+
+ ---
+
+ # **DIT Ernie Model**
+
+ - Uses a **Monte Carlo estimation** approach to approximate **FP32 behavior**
+
+ ---
+
+ # **Training Details**
+
+ # **Mistral LLM**
+
+ ## **Precision**
+
+ - **Full FP32 training**
+ - **FP32 activations**
+ - **FP32 optimizer states**
+ - **FP32 gradients**
+
+ ## **Optimizer**
+
+ - **AdamW**
+ - Weight decay enabled
+ - **No 8-bit optimizer compression**
+ - **No low-rank optimizer approximation**
+
+ ## **Notes**

  The Mistral configuration prioritizes:

+ - **numerical consistency**
+ - **deterministic convergence behavior**
+ - **stable long-context optimization**
+ - **reduced quantization-induced gradient noise**
+
+ This setup is computationally expensive but provides **high-fidelity optimization dynamics** during pretraining and finetuning.
+
+ ---

+ # **DIT Ernie**

+ ## **Precision Strategy**

  The DIT Ernie architecture utilizes:

+ - **Monte Carlo estimation techniques**
+ - **probabilistic FP32 approximation**
+ - **stochastic numerical reconstruction**

  Rather than maintaining strict FP32 execution across the entire training stack, the model estimates FP32-equivalent statistical behavior through sampling-based computation.

+ ## **Goals**
+
+ - reduce memory bandwidth requirements
+ - improve throughput efficiency
+ - retain approximate FP32 convergence characteristics
+ - balance numerical quality with hardware scalability
+
+ ## **Notes**

  This methodology may introduce:

+ - **stochastic variance between runs**
+ - **approximation noise**
+ - **non-deterministic optimization characteristics**

  However, it can significantly reduce training cost relative to native FP32 execution.

+ ---
+
+ # **Intended Use**

  This repository is intended for:

+ - research documentation
+ - training methodology comparison
+ - optimizer precision analysis
+ - numerical stability benchmarking
+ - transformer architecture experimentation
+
+ ---
+
+ # **Limitations**
+
  Results can vary depending on:

+ - sampling strategy
+ - hardware backend
+ - distributed training topology
+ - random seed initialization
+
+ ---
+
+ # **License**
+
+ **Apache License 2.0**
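
Note on the DIT Ernie precision strategy: the README describes estimating FP32-equivalent behavior "through sampling-based computation" but does not name a concrete technique. One common method that fits that description is stochastic rounding, where a value is rounded up or down at random with probability proportional to its distance from each neighbor, so the rounded result equals the FP32 value in expectation. The sketch below is only an illustration under that assumption and is not taken from the repository. The bfloat16 target format, the helper names (`round_nearest_bf16`, `stochastic_round_bf16`, `accumulate`), and the toy update size are all hypothetical.

```python
import math
import random
import struct


def _f32_to_bits(x: float) -> int:
    """Round x to FP32 and return its 32-bit pattern (x must fit in FP32 range)."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def _bits_to_f32(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an FP32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def round_nearest_bf16(x: float) -> float:
    """Deterministic round-to-nearest-even from FP32 to bfloat16 precision."""
    if not math.isfinite(x):
        return x
    bits = _f32_to_bits(x)
    bias = 0x7FFF + ((bits >> 16) & 1)  # ties go to the even neighbor
    return _bits_to_f32((bits + bias) & 0xFFFF0000)


def stochastic_round_bf16(x: float, rng: random.Random) -> float:
    """Stochastic rounding from FP32 to bfloat16 precision (assumed technique).

    Adding uniform 16-bit noise before truncating the low 16 bits rounds the
    magnitude up with probability equal to the discarded fraction, so the
    result is unbiased: its expectation is the FP32 input.
    """
    if not math.isfinite(x):
        return x
    bits = _f32_to_bits(x)
    noise = rng.getrandbits(16)
    return _bits_to_f32((bits + noise) & 0xFFFF0000)


def accumulate(round_fn, steps: int = 1000, update: float = 1e-3) -> float:
    """Apply a small update repeatedly to a weight stored at bfloat16 precision."""
    w = 1.0
    for _ in range(steps):
        w = round_fn(w + update)
    return w


if __name__ == "__main__":
    rng = random.Random(0)
    print("round-to-nearest:", accumulate(round_nearest_bf16))
    print("stochastic:      ", accumulate(lambda v: stochastic_round_bf16(v, rng)))
```

With round-to-nearest, each `1e-3` update is smaller than half the bfloat16 spacing near 1.0, so it is discarded and the weight never moves. With stochastic rounding the updates survive on average, which is the sense in which a sampling-based scheme can approximate FP32 accumulation, at the cost of the run-to-run variance the README lists under its DIT Ernie notes.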