Felldude committed on
Commit 589247e · verified · 1 Parent(s): c86b8f3

Update README.md

  1. README.md +95 -0
README.md CHANGED
@@ -1,3 +1,98 @@
  ---
  license: apache-2.0
+ language:
+ - en
+ tags:
+ - mistral
+ - fp32
+ - adamw
+ - transformer
+ - monte-carlo
+ - dit
+ - ernie
+ pipeline_tag: text-generation
  ---
+ Model Card
+ Overview
+
+ This repository documents two separate large language model training methodologies and precision strategies:
+
+ Mistral LLM Training
+ Fully trained in native FP32 precision.
+ Optimization performed using standard AdamW.
+ No Adam8bit, quantized optimizer states, or reduced-precision optimizer approximations were used during training.
+ Intended to preserve numerical stability and high-fidelity gradient accumulation throughout all training phases.
+ DIT Ernie Model
+ Uses a Monte Carlo estimation approach to approximate FP32 behavior.
+ The model does not operate as a strict full FP32 pipeline.
+ Instead, stochastic estimation techniques are applied to emulate FP32 statistical characteristics while reducing computational overhead.
+ This approach trades exact deterministic FP32 arithmetic for probabilistic approximation efficiency.
+ Training Details
+ Mistral LLM
+ Precision
+ Full FP32 training
+ FP32 activations
+ FP32 optimizer states
+ FP32 gradients
+ Optimizer
+ AdamW
+ Weight decay enabled
+ No 8-bit optimizer compression
+ No low-rank optimizer approximation
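The optimizer settings listed above correspond to the standard AdamW update with decoupled weight decay. A minimal pure-Python sketch of one update step follows. The function name `adamw_step` and every hyperparameter value are illustrative assumptions, since the repository does not state the values it used. Python floats are FP64, so this shows the update rule itself rather than FP32 storage.

```python
import math


def adamw_step(param, grad, m, v, step,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update to a single scalar parameter.

    All default hyperparameters are placeholders, not values from the repository.
    `step` is the 1-based update count used for bias correction.
    """
    # Decoupled weight decay: shrink the parameter directly,
    # independent of the gradient-based update.
    param = param * (1.0 - lr * weight_decay)

    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad

    # Bias correction compensates for the zero-initialised moment estimates.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)

    # Adaptive gradient step.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# One step from w = 1.0 with gradient 2.0, starting from zeroed moment estimates.
w, m, v = adamw_step(param=1.0, grad=2.0, m=0.0, v=0.0, step=1)
print(w, m, v)
```

Because the decay multiplies the parameter directly instead of being added to the gradient, it is not rescaled by the adaptive denominator. That is the difference between AdamW and Adam with L2 regularization.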
+ Notes
+
+ The Mistral configuration prioritizes:
+
+ numerical consistency
+ deterministic convergence behavior
+ stable long-context optimization
+ reduced quantization-induced gradient noise
+
+ This setup is computationally expensive but provides high-fidelity optimization dynamics during pretraining and finetuning.
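One way to see why accumulation precision matters is to sum many small contributions in a narrower format. The sketch below emulates FP16 and FP32 rounding with Python's `struct` module. The helper names and the gradient values are hypothetical and are not taken from this repository's training runs.

```python
import struct


def round_to(fmt: str, x: float) -> float:
    """Round a Python float (FP64) to the nearest value in an IEEE format.

    fmt is a struct format character: "e" for FP16, "f" for FP32.
    """
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def accumulate(values, fmt: str) -> float:
    """Sum values while rounding the running total to `fmt` after every add."""
    total = 0.0
    for value in values:
        total = round_to(fmt, total + round_to(fmt, value))
    return total


# Hypothetical tiny gradient contributions whose exact sum is 1.0.
grads = [1e-4] * 10_000

fp16_sum = accumulate(grads, "e")
fp32_sum = accumulate(grads, "f")

print("FP16 accumulator:", fp16_sum)
print("FP32 accumulator:", fp32_sum)
```

In this toy setup the FP16 accumulator stops growing once each addend falls below half the spacing between neighbouring representable values, while the FP32 accumulator stays close to the exact sum of 1.0.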
+
+ DIT Ernie
+ Precision Strategy
+
+ The DIT Ernie architecture utilizes:
+
+ Monte Carlo estimation techniques
+ probabilistic FP32 approximation
+ stochastic numerical reconstruction
+
+ Rather than maintaining strict FP32 execution across the entire training stack, the model estimates FP32-equivalent statistical behavior through sampling-based computation.
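The text above does not say which sampling scheme is used. As one illustrative possibility only, stochastic rounding is a simple Monte Carlo technique in which each low-precision value is an unbiased random estimate of the high-precision one, so averages over many samples approach the exact value. The sketch below assumes a hypothetical uniform grid with spacing `step`. The function name and constants are not from this repository.

```python
import math
import random


def stochastic_round(x: float, step: float, rng: random.Random) -> float:
    """Round x onto a grid of spacing `step`.

    The probability of rounding up equals the fractional distance of x above
    the lower grid point, which makes the expected result equal to x.
    """
    lower = math.floor(x / step) * step
    p_up = (x - lower) / step
    return lower + step if rng.random() < p_up else lower


rng = random.Random(0)      # fixed seed so the sketch is repeatable
x, step = 0.3, 0.25         # hypothetical value and coarse grid spacing

samples = [stochastic_round(x, step, rng) for _ in range(20_000)]
estimate = sum(samples) / len(samples)

print(sorted(set(samples)))  # only the two neighbouring grid points appear
print(estimate)              # close to x on average, not exactly x
```

Each individual sample carries rounding noise, but the noise averages out. This is the trade described above: exact deterministic arithmetic is exchanged for a cheaper representation whose error is random rather than systematic.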
+
+ Goals
+ reduce memory bandwidth requirements
+ improve throughput efficiency
+ retain approximate FP32 convergence characteristics
+ balance numerical quality with hardware scalability
+ Notes
+
+ This methodology may introduce:
+
+ stochastic variance between runs
+ approximation noise
+ non-deterministic optimization characteristics
+
+ However, it can significantly reduce training cost relative to native FP32 execution.
+
+ Intended Use
+
+ This repository is intended for:
+
+ research documentation
+ training methodology comparison
+ optimizer precision analysis
+ numerical stability benchmarking
+ transformer architecture experimentation
+ Limitations
+ Full FP32 training incurs substantial VRAM and compute costs.
+ Monte Carlo FP32 approximation may not exactly reproduce deterministic FP32 outputs.
+ Results can vary depending on:
+ sampling strategy
+ hardware backend
+ distributed training topology
+ random seed initialization
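The dependence on random seed listed above can be measured by repeating a run under several seeds and reporting the spread. The sketch below uses a hypothetical stand-in, `noisy_run`, which is just a Monte Carlo mean of uniform noise. It is not this repository's training code, and the seed values and sample counts are arbitrary.

```python
import random
import statistics


def noisy_run(seed: int, n_samples: int = 1_000) -> float:
    """Hypothetical stand-in for one stochastic run.

    Returns the mean of n_samples uniform draws from a generator seeded
    with `seed`, so the result depends only on the seed.
    """
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(n_samples)) / n_samples


# Same seed twice: the result should be reproduced exactly.
repeat_a = noisy_run(seed=1234)
repeat_b = noisy_run(seed=1234)

# Different seeds: results scatter around the same underlying value.
across_seeds = [noisy_run(seed=s) for s in range(20)]
spread = statistics.stdev(across_seeds)

print("same seed reproduces:", repeat_a == repeat_b)
print("std dev across seeds:", spread)
```

Fixing the seed removes only this one source of variation. The other factors in the list, such as hardware backend and distributed topology, can still change results between otherwise identical runs.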
+ License
+
+ Apache License 2.0