Felldude/ERNIE-Image

@@ -12,84 +12,115 @@ tags:
   - ernie
 pipeline_tag: text-generation
 ---
-Model Card
-Overview
 This repository documents two separate large language model training methodologies and precision strategies:
-Mistral LLM Training
-Fully trained in native FP32 precision.
-Optimization performed using standard AdamW.
-No Adam8bit, quantized optimizer states, or reduced-precision optimizer approximations were used during training.
-Intended to preserve numerical stability and high-fidelity gradient accumulation throughout all training phases.
-DIT Ernie Model
-Uses a Monte Carlo estimation approach to approximate FP32 behavior.
-Training Details
-Mistral LLM
-Precision
-Full FP32 training
-FP32 activations
-FP32 optimizer states
-FP32 gradients
-Optimizer
-AdamW
-Weight decay enabled
-No 8-bit optimizer compression
-No low-rank optimizer approximation
-Notes
 The Mistral configuration prioritizes:
-numerical consistency
-deterministic convergence behavior
-stable long-context optimization
-reduced quantization-induced gradient noise
-This setup is computationally expensive but provides high-fidelity optimization dynamics during pretraining and finetuning.
-DIT Ernie
-Precision Strategy
 The DIT Ernie architecture utilizes:
-Monte Carlo estimation techniques
-probabilistic FP32 approximation
-stochastic numerical reconstruction
 Rather than maintaining strict FP32 execution across the entire training stack, the model estimates FP32-equivalent statistical behavior through sampling-based computation.
-Goals
-reduce memory bandwidth requirements
-improve throughput efficiency
-retain approximate FP32 convergence characteristics
-balance numerical quality with hardware scalability
-Notes
 This methodology may introduce:
-stochastic variance between runs
-approximation noise
-non-deterministic optimization characteristics
 However, it can significantly reduce training cost relative to native FP32 execution.
-Intended Use
 This repository is intended for:
-research documentation
-training methodology comparison
-optimizer precision analysis
-numerical stability benchmarking
-transformer architecture experimentation
-Limitations
 Results can vary depending on:
-sampling strategy
-hardware backend
-distributed training topology
-random seed initialization
-License
-Apache License 2.0

   - ernie
 pipeline_tag: text-generation
 ---
+# **Model Card**
+# **Overview**
 This repository documents two separate large language model training methodologies and precision strategies:
+---
+# **Mistral LLM Training**
+- **Fully trained in native FP32 precision**
+- Optimization performed using standard **AdamW**
+- **No Adam8bit**, quantized optimizer states, or reduced-precision optimizer approximations were used during training
+- Intended to preserve **numerical stability** and **high-fidelity gradient accumulation** throughout all training phases
+---
+# **DIT Ernie Model**
+- Uses a **Monte Carlo estimation** approach to approximate **FP32 behavior**
+---
+# **Training Details**
+# **Mistral LLM**
+## **Precision**
+- **Full FP32 training**
+- **FP32 activations**
+- **FP32 optimizer states**
+- **FP32 gradients**
+## **Optimizer**
+- **AdamW**
+- Weight decay enabled
+- **No 8-bit optimizer compression**
+- **No low-rank optimizer approximation**
+## **Notes**
 The Mistral configuration prioritizes:
+- **numerical consistency**
+- **deterministic convergence behavior**
+- **stable long-context optimization**
+- **reduced quantization-induced gradient noise**
+This setup is computationally expensive but provides **high-fidelity optimization dynamics** during pretraining and finetuning.
+---
+# **DIT Ernie**
+## **Precision Strategy**
 The DIT Ernie architecture utilizes:
+- **Monte Carlo estimation techniques**
+- **probabilistic FP32 approximation**
+- **stochastic numerical reconstruction**
 Rather than maintaining strict FP32 execution across the entire training stack, the model estimates FP32-equivalent statistical behavior through sampling-based computation.
+## **Goals**
+- reduce memory bandwidth requirements
+- improve throughput efficiency
+- retain approximate FP32 convergence characteristics
+- balance numerical quality with hardware scalability
+## **Notes**
 This methodology may introduce:
+- **stochastic variance between runs**
+- **approximation noise**
+- **non-deterministic optimization characteristics**
 However, it can significantly reduce training cost relative to native FP32 execution.
+---
+# **Intended Use**
 This repository is intended for:
+- research documentation
+- training methodology comparison
+- optimizer precision analysis
+- numerical stability benchmarking
+- transformer architecture experimentation
+---
+# **Limitations**
 Results can vary depending on:
+- sampling strategy
+- hardware backend
+- distributed training topology
+- random seed initialization
+---
+# **License**
+**Apache License 2.0**
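
The card does not say which sampling scheme the DIT Ernie precision strategy uses. As one illustration of how a sampling-based method can approximate higher-precision behavior in expectation, the sketch below uses stochastic rounding to a bfloat16-like format. The choice of stochastic rounding, the helper names, the step count, and the update size are all assumptions made for illustration, not the repository's method.

```python
import random
import struct
from functools import partial


def _f32_bits(x: float) -> int:
    """Return the IEEE-754 float32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def _bits_f32(b: int) -> float:
    """Inverse of _f32_bits: reinterpret a 32-bit pattern as a float."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def round_nearest_bf16(x: float) -> float:
    """Round to a bfloat16-like value (top 16 bits of float32), ties to even.

    Illustrative only: NaN and infinity are not handled.
    """
    b = _f32_bits(x)
    b += 0x7FFF + ((b >> 16) & 1)
    return _bits_f32(b & 0xFFFF0000)


def round_stochastic_bf16(x: float, rng: random.Random) -> float:
    """Round up with probability equal to the discarded fraction.

    Adding 16 random bits before truncating makes the expected result
    equal to the float32 value of x.
    """
    b = _f32_bits(x) + rng.getrandbits(16)
    return _bits_f32(b & 0xFFFF0000)


def accumulate(round_fn, steps: int = 1000, update: float = 1e-4,
               start: float = 1.0) -> float:
    """Apply `steps` small additive updates, rounding after each one."""
    w = start
    for _ in range(steps):
        w = round_fn(w + update)
    return w


# High-precision reference for 1000 updates of 1e-4 starting from 1.0.
reference = 1.0 + 1000 * 1e-4

# Round-to-nearest: each update is below half the bfloat16 spacing
# near 1.0 (2**-7), so the weight never moves.
nearest = accumulate(round_nearest_bf16)

# Stochastic rounding: one run per seed; individual runs differ.
runs = [
    accumulate(partial(round_stochastic_bf16, rng=random.Random(seed)))
    for seed in range(20)
]
mean_sr = sum(runs) / len(runs)

print("reference:", reference)
print("round-to-nearest:", nearest)
print("stochastic, per-seed min/max:", min(runs), max(runs))
print("stochastic, mean over seeds:", mean_sr)
```

In this toy setting, round-to-nearest leaves the weight at 1.0 because every update is smaller than half the local spacing of the low-precision format. The stochastic runs differ from seed to seed, which corresponds to the run-to-run variance and seed dependence noted in the card, while their average tracks the high-precision reference.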