---
license: apache-2.0
language:
- en
tags:
- mistral
- fp32
- adamw
- transformer
- monte-carlo
- dit
- ernie
pipeline_tag: text-to-image
---

# **Model Card**

# **Overview**

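This repository documents two separate training methodologies, each with its own precision strategy: a Mistral LLM trained in native FP32, and a DIT Ernie model that approximates FP32 behavior through Monte Carlo estimation.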

---

# **Mistral LLM Training**

- **Fully trained in native FP32 precision**
- Optimization performed using standard **AdamW**
- **No Adam8bit**, quantized optimizer states, or reduced-precision optimizer approximations were used during training
- Intended to preserve **numerical stability** and **high-fidelity gradient accumulation** throughout all training phases

---

# **DIT Ernie Model**

- Uses a **Monte Carlo estimation** approach to approximate **FP32 behavior**

---

# **Training Details**

# **Mistral LLM**

## **Precision**

- **Full FP32 training**
- **FP32 activations**
- **FP32 optimizer states**
- **FP32 gradients**
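
A small sketch of why full-precision accumulation matters: repeatedly adding a small update to a weight near 1.0 works in FP32, but the update is lost entirely in a lower-precision format. The comparison format (an emulated bfloat16 using truncation) and the helper names `to_fp32`, `to_bf16`, and `accumulate` are illustrative assumptions; the card does not name any reduced-precision baseline.

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x: float) -> float:
    """Emulate bfloat16 by keeping only the top 16 bits of the FP32 encoding.

    Assumption: truncation is used for simplicity; real hardware usually
    rounds to nearest, which gives the same result in this example.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def accumulate(cast, start: float, update: float, steps: int) -> float:
    """Add `update` to a weight `steps` times, casting after every addition."""
    w = cast(start)
    for _ in range(steps):
        w = cast(w + cast(update))
    return w

fp32_result = accumulate(to_fp32, 1.0, 1e-4, 1000)
bf16_result = accumulate(to_bf16, 1.0, 1e-4, 1000)

print("FP32:", fp32_result)   # close to 1.1
print("BF16:", bf16_result)   # stuck at 1.0: each update is below the format's resolution
```

Near 1.0 the emulated bfloat16 grid has a spacing of 2^-7, so an update of 1e-4 never moves the weight, while FP32 tracks the expected sum closely.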

## **Optimizer**

- **AdamW**
- Weight decay enabled
- **No 8-bit optimizer compression**
- **No low-rank optimizer approximation**
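
For reference, a minimal sketch of a single AdamW update on one scalar parameter, with decoupled weight decay applied directly to the weight. The function name `adamw_step` and all hyperparameter values are assumptions chosen for illustration; the card does not state the learning rate, betas, or decay coefficient actually used. Python floats are FP64, so this shows the update rule rather than FP32 storage.

```python
import math

def adamw_step(w, g, m, v, t,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW step for a scalar weight `w` with gradient `g`.

    `m` and `v` are the first- and second-moment estimates, `t` is the
    1-based step count. All default values are illustrative assumptions.
    """
    # Decoupled weight decay: shrink the weight directly, not via the gradient.
    w = w * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    # Bias correction for the zero-initialized moments.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Adaptive update.
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = adamw_step(w=1.0, g=0.5, m=0.0, v=0.0, t=1)
print(w, m, v)
```

The moment estimates `m` and `v` are the optimizer states that an 8-bit optimizer would quantize; keeping them in FP32 stores two full-precision values per parameter.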

## **Notes**

The Mistral configuration prioritizes:

- **numerical consistency**
- **deterministic convergence behavior**
- **stable long-context optimization**
- **reduced quantization-induced gradient noise**

This setup is computationally expensive but provides **high-fidelity optimization dynamics** during pretraining and finetuning.

---

# **DIT Ernie**

## **Precision Strategy**

The DIT Ernie architecture utilizes:

- **Monte Carlo estimation techniques**
- **probabilistic FP32 approximation**
- **stochastic numerical reconstruction**

Rather than maintaining strict FP32 execution across the entire training stack, the model estimates FP32-equivalent statistical behavior through sampling-based computation.
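
The card does not specify which sampling scheme is used. One well-known technique that fits the description is stochastic rounding, where a value is rounded up or down at random so that its expected value equals the full-precision value. The sketch below is an assumption for illustration only, not the repository's method; `stochastic_round` and the grid spacing `step` are hypothetical names.

```python
import math
import random

def stochastic_round(x: float, step: float, rng: random.Random) -> float:
    """Round `x` to a multiple of `step`.

    Rounds up with probability equal to the fractional remainder, so the
    result is an unbiased Monte Carlo estimate of `x`.
    """
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower
    round_up = 1 if rng.random() < frac else 0
    return (lower + round_up) * step

rng = random.Random(0)
x, step = 0.3, 0.25          # coarse grid standing in for a low-precision format
samples = [stochastic_round(x, step, rng) for _ in range(20_000)]
mean = sum(samples) / len(samples)

print("nearest-grid value:", round(x / step) * step)  # always 0.25, biased low
print("stochastic mean:   ", mean)                    # close to 0.3 on average
```

Each individual sample is coarse (either 0.25 or 0.5), but the average over many samples recovers the high-precision value, which is the trade between per-step noise and unbiasedness described in this section.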

## **Goals**

- reduce memory bandwidth requirements
- improve throughput efficiency
- retain approximate FP32 convergence characteristics
- balance numerical quality with hardware scalability

## **Notes**

This methodology may introduce:

- **stochastic variance between runs**
- **approximation noise**
- **non-deterministic optimization characteristics**

However, it can significantly reduce training cost relative to native FP32 execution.

---

# **Intended Use**

This repository is intended for:

- research documentation
- training methodology comparison
- optimizer precision analysis
- numerical stability benchmarking
- transformer architecture experimentation

---

# **Limitations**

Results can vary depending on:

- sampling strategy
- hardware backend
- distributed training topology
- random seed initialization
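
To gauge how much results depend on the random seed, one option is to repeat a run under several seeds and report the spread. The sketch below uses a toy stand-in for training (noisy gradient descent on a quadratic); `noisy_run` and its parameters are hypothetical and unrelated to the actual training code.

```python
import random
import statistics

def noisy_run(seed: int, steps: int = 200, lr: float = 0.1, noise: float = 0.05) -> float:
    """Toy stand-in for a training run.

    Gradient descent on (w - 3)^2 with zero-mean Gaussian noise added to
    every gradient, mimicking a stochastic numerical approximation.
    """
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0) + rng.gauss(0.0, noise)
        w -= lr * grad
    return w

finals = [noisy_run(seed) for seed in range(10)]
spread = statistics.stdev(finals)
same_seed_repeatable = noisy_run(0) == noisy_run(0)

print("final values:", finals)
print("std across seeds:", spread)
print("same seed reproduces:", same_seed_repeatable)
```

Fixing the seed makes this toy run repeatable on one machine; on real hardware, backend and distributed topology can still change results even with a fixed seed.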

---

# **License**

**Apache License 2.0**