---
license: apache-2.0
language:
- en
tags:
- mistral
- fp32
- adamw
- transformer
- monte-carlo
- dit
- ernie
pipeline_tag: text-to-image
---
# **Model Card**
# **Overview**
This repository documents two separate training methodologies and their precision strategies: native FP32 training of a Mistral LLM, and a Monte Carlo approximation of FP32 behavior for the DIT Ernie model.

---
# **Mistral LLM Training**
- **Fully trained in native FP32 precision**
- Optimization performed using standard **AdamW**
- **No Adam8bit**, quantized optimizer states, or reduced-precision optimizer approximations were used during training
- Intended to preserve **numerical stability** and **high-fidelity gradient accumulation** throughout all training phases
---
# **DIT Ernie Model**
- Uses a **Monte Carlo estimation** approach to approximate **FP32 behavior**
---
# **Training Details**
# **Mistral LLM**
## **Precision**
- **Full FP32 training**
- **FP32 activations**
- **FP32 optimizer states**
- **FP32 gradients**
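
To give a sense of what keeping weights, gradients, and both AdamW moment buffers in FP32 costs, here is a rough back-of-the-envelope sketch. The function name and the parameter count are hypothetical and chosen only for illustration; activation memory is left out because it depends on batch size and sequence length, which this card does not state.

```python
BYTES_FP32 = 4  # one IEEE-754 single-precision value


def fp32_training_bytes(num_params: int) -> dict:
    """Per-component memory for FP32 weights, gradients, and AdamW moments."""
    per_tensor = num_params * BYTES_FP32
    footprint = {
        "weights": per_tensor,
        "gradients": per_tensor,
        "adamw_exp_avg": per_tensor,      # first-moment estimate
        "adamw_exp_avg_sq": per_tensor,   # second-moment estimate
    }
    footprint["total"] = sum(footprint.values())
    return footprint


# Hypothetical parameter count, not taken from this repository.
example = fp32_training_bytes(1_000_000_000)
print(f"{example['total'] / 1e9:.0f} GB before activations")
```

Under these assumptions the total works out to 16 bytes per parameter.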
## **Optimizer**
- **AdamW**
- Weight decay enabled
- **No 8-bit optimizer compression**
- **No low-rank optimizer approximation**
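
For reference, a minimal sketch of a single AdamW update with decoupled weight decay on one scalar parameter. The hyperparameter values are illustrative defaults, not settings reported for this model, and plain Python floats are FP64, so this shows the update rule rather than FP32 arithmetic itself.

```python
import math


def adamw_step(param, grad, exp_avg, exp_avg_sq, step,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; `step` counts from 1."""
    # Decoupled weight decay: shrink the parameter directly instead of
    # adding a penalty term to the gradient.
    param = param * (1.0 - lr * weight_decay)

    # Exponential moving averages of the gradient and its square.
    exp_avg = beta1 * exp_avg + (1.0 - beta1) * grad
    exp_avg_sq = beta2 * exp_avg_sq + (1.0 - beta2) * grad * grad

    # Bias correction for the zero-initialized moment estimates.
    m_hat = exp_avg / (1.0 - beta1 ** step)
    v_hat = exp_avg_sq / (1.0 - beta2 ** step)

    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, exp_avg, exp_avg_sq


# Illustrative values only.
p, m, v = adamw_step(param=1.0, grad=0.5, exp_avg=0.0, exp_avg_sq=0.0, step=1)
print(p, m, v)
```

Both moment values (`exp_avg`, `exp_avg_sq`) are the optimizer states that the list above says are kept uncompressed.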
## **Notes**
The Mistral configuration prioritizes:
- **numerical consistency**
- **deterministic convergence behavior**
- **stable long-context optimization**
- **reduced quantization-induced gradient noise**
This setup is computationally expensive in both memory and throughput, but provides **high-fidelity optimization dynamics** during pretraining and fine-tuning.

---
# **DIT Ernie**
## **Precision Strategy**
The DIT Ernie training setup uses:
- **Monte Carlo estimation techniques**
- **probabilistic FP32 approximation**
- **stochastic numerical reconstruction**
Rather than maintaining strict FP32 execution across the entire training stack, the model estimates FP32-equivalent statistical behavior through sampling-based computation.
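
This card does not say which estimator is used. One well-known sampling-based technique of this kind is stochastic rounding, sketched below as an assumption rather than as the method used here. The bfloat16 target format and the helper names are also assumptions. The idea is that a value is rounded up or down at random, with probabilities chosen so that the expected result equals the original FP32 value.

```python
import random
import struct


def float_to_bits(x: float) -> int:
    """Bit pattern of x after rounding to IEEE-754 single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(bits: int) -> float:
    """Single-precision value for a 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def stochastic_round_bf16(x: float, rng: random.Random) -> float:
    """Randomly round a finite FP32 value to a bfloat16-representable one.

    The low 16 bits of the FP32 pattern measure how far x lies past the
    neighbouring bfloat16 value of smaller magnitude. Adding a uniform
    16-bit integer and truncating rounds away from zero with exactly that
    probability, so the expected result equals x. Values at the very top
    of the finite range can round to infinity.
    """
    bits = float_to_bits(x)
    noisy = (bits + rng.getrandbits(16)) & 0xFFFFFFFF
    return bits_to_float(noisy & 0xFFFF0000)


rng = random.Random(0)
x = 1.001
samples = [stochastic_round_bf16(x, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
print("values seen:", sorted(set(samples)))
print("sample mean:", mean)
```

Each individual sample is one of the two bfloat16 neighbours of `x`, while the sample mean approaches `x` as more samples are drawn.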
## **Goals**
- reduce memory bandwidth requirements
- improve throughput efficiency
- retain approximate FP32 convergence characteristics
- balance numerical quality with hardware scalability
## **Notes**
This methodology may introduce:
- **stochastic variance between runs**
- **approximation noise**
- **non-deterministic optimization characteristics**
However, it can significantly reduce training cost relative to native FP32 execution.

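
The trade-off described above can be illustrated with a toy accumulation, assuming a sampling-based rounding scheme on a uniform grid. The grid spacing, update size, and function names are hypothetical. Round-to-nearest drops every small update, while stochastic rounding tracks the high-precision sum on average but gives a different result for each seed.

```python
import math
import random

STEP = 2.0 ** -7  # hypothetical grid spacing of the low-precision format


def round_nearest(x: float) -> float:
    """Deterministic rounding to the nearest grid point."""
    return round(x / STEP) * STEP


def round_stochastic(x: float, rng: random.Random) -> float:
    """Round up with probability equal to the fractional grid position."""
    scaled = x / STEP
    lower = math.floor(scaled)
    return (lower + (rng.random() < scaled - lower)) * STEP


def accumulate(rounder, updates=1000, delta=1e-3, start=1.0):
    """Apply many small updates, rounding the running value each time."""
    value = start
    for _ in range(updates):
        value = rounder(value + delta)
    return value


nearest = accumulate(round_nearest)
runs = [
    accumulate(lambda x, r=random.Random(seed): round_stochastic(x, r))
    for seed in range(8)
]
print("high-precision sum:", 1.0 + 1000 * 1e-3)
print("nearest rounding:  ", nearest)
print("stochastic runs:   ", runs)
```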
---
# **Intended Use**
This repository is intended for:
- research documentation
- training methodology comparison
- optimizer precision analysis
- numerical stability benchmarking
- transformer architecture experimentation
---
# **Limitations**
Results can vary depending on:
- sampling strategy
- hardware backend
- distributed training topology
- random seed initialization
---
# **License**
**Apache License 2.0**