---
license: apache-2.0
language:
- en
tags:
- mistral
- fp32
- adamw
- transformer
- monte-carlo
- dit
- ernie
pipeline_tag: text-to-image
---

# **Model Card**

# **Overview**

This repository documents the training methodologies and precision strategies of two separate models: a Mistral LLM and a DIT Ernie model.

---

# **Mistral LLM Training**

- **Fully trained in native FP32 precision**
- Optimization performed using standard **AdamW**
- **No Adam8bit**, quantized optimizer states, or reduced-precision optimizer approximations were used during training
- Intended to preserve **numerical stability** and **high-fidelity gradient accumulation** throughout all training phases

---

# **DIT Ernie Model**

- Uses a **Monte Carlo estimation** approach to approximate **FP32 behavior**

---

# **Training Details**

# **Mistral LLM**

## **Precision**

- **Full FP32 training**
- **FP32 activations**
- **FP32 optimizer states**
- **FP32 gradients**
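
As a rough illustration of why accumulation precision matters, the sketch below adds the same small update many times while rounding the running total to either half or single precision. It uses only the standard-library `struct` module to emulate the rounding. The helper names `round_to` and `accumulate`, the FP16 comparison, and the update size `1e-4` are assumptions chosen for illustration and are not taken from this repository's training run.

```python
import struct


def round_to(fmt: str, x: float) -> float:
    """Round a Python float (FP64) to the nearest value representable in `fmt`.

    "e" is IEEE half precision (FP16), "f" is IEEE single precision (FP32).
    """
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def accumulate(fmt: str, start: float, update: float, steps: int) -> float:
    """Add `update` to `start` `steps` times, rounding after every addition."""
    total = round_to(fmt, start)
    for _ in range(steps):
        total = round_to(fmt, total + update)
    return total


# Hypothetical numbers: 1000 small updates applied to a value near 1.0.
fp16_total = accumulate("e", 1.0, 1e-4, 1000)
fp32_total = accumulate("f", 1.0, 1e-4, 1000)

print("FP16 accumulator:", fp16_total)  # stays at 1.0, every update is rounded away
print("FP32 accumulator:", fp32_total)  # close to 1.1
```

Near 1.0 the spacing between FP16 values is about `0.00098`, so an update of `1e-4` is lost on every step. FP32 has a much finer grid and keeps almost all of it.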

## **Optimizer**

- **AdamW**
- Weight decay enabled
- **No 8-bit optimizer compression**
- **No low-rank optimizer approximation**
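
For readers unfamiliar with the update rule, a minimal sketch of one standard AdamW step on a single scalar parameter is shown below. The function name `adamw_step`, the hyperparameter defaults, and the toy objective are assumptions for illustration; the actual hyperparameters used for the Mistral run are not stated in this card. Python floats are FP64, so the sketch shows the update rule only, not the FP32 storage.

```python
import math


def adamw_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update and return the new (param, m, v).

    `step` is the 1-based update count used for bias correction.
    All default hyperparameters are illustrative assumptions.
    """
    # Decoupled weight decay: shrink the parameter directly,
    # instead of adding an L2 term to the gradient.
    param = param * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialised moments.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Toy usage: minimise f(w) = (w - 3)^2 starting from w = 0.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2.0 * (w - 3.0)
    w, m, v = adamw_step(w, grad, m, v, t, lr=0.01)

print("w after 2000 steps:", w)
```

Both moment buffers `m` and `v` are kept at full precision here, which is the part that 8-bit optimizers compress.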

## **Notes**

The Mistral configuration prioritizes:

- **numerical consistency**
- **deterministic convergence behavior**
- **stable long-context optimization**
- **reduced quantization-induced gradient noise**

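This setup is expensive: FP32 weights, gradients, and the two AdamW moment buffers alone take 16 bytes per parameter, before activations. In exchange, it provides **high-fidelity optimization dynamics** during pretraining and finetuning.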

---

# **DIT Ernie**

## **Precision Strategy**

The DIT Ernie architecture utilizes:

- **Monte Carlo estimation techniques**
- **probabilistic FP32 approximation**
- **stochastic numerical reconstruction**

Rather than maintaining strict FP32 execution across the entire training stack, training estimates FP32-equivalent statistical behavior through sampling-based computation.
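
This card does not specify which sampling scheme is used. One widely used sampling-based technique is stochastic rounding, sketched below as an assumption rather than as the method of this model. A value is rounded up or down at random so that its expected value equals the original, and averaging many samples recovers the high-precision value. The names `stochastic_round` and `mc_estimate`, the grid step `0.25`, and the sample count are hypothetical.

```python
import random


def stochastic_round(x: float, step: float, rng: random.Random) -> float:
    """Round `x` to a multiple of `step`.

    Rounds up with probability equal to the fractional distance from the
    lower grid point, so the result is unbiased in expectation.
    """
    lower = (x // step) * step
    frac = (x - lower) / step  # in [0, 1)
    return lower + step if rng.random() < frac else lower


def mc_estimate(x: float, step: float, samples: int, rng: random.Random) -> float:
    """Monte Carlo estimate of `x` from repeated low-precision samples."""
    return sum(stochastic_round(x, step, rng) for _ in range(samples)) / samples


rng = random.Random(0)        # fixed seed so the sketch is repeatable
x, step = 0.3, 0.25           # coarse grid standing in for a low-precision format

nearest = round(x / step) * step          # deterministic round-to-nearest
estimate = mc_estimate(x, step, 20000, rng)

print("round-to-nearest:", nearest)       # 0.25, biased by 0.05
print("Monte Carlo mean:", estimate)      # close to 0.3
```

Each individual sample is less accurate than round-to-nearest, but the error averages out instead of accumulating in one direction.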

## **Goals**

- reduce memory bandwidth requirements
- improve throughput efficiency
- retain approximate FP32 convergence characteristics
- balance numerical quality with hardware scalability

## **Notes**

This methodology may introduce:

- **stochastic variance between runs**
- **approximation noise**
- **non-deterministic optimization characteristics**

However, it can significantly reduce training cost relative to native FP32 execution.
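
The run-to-run variance listed above can be quantified by repeating a run under several seeds and reporting the spread. The sketch below does this for a toy accumulation that uses stochastic rounding on a coarse grid. The function `noisy_run`, the grid spacing `2 ** -10`, the update size, and the seed range are all assumptions for illustration and do not describe the actual DIT Ernie training loop.

```python
import random
import statistics


def noisy_run(seed: int, steps: int = 1000, update: float = 1e-4,
              grid: float = 2.0 ** -10) -> float:
    """Accumulate `update` for `steps` steps, stochastically rounding to `grid`."""
    rng = random.Random(seed)
    total = 1.0
    for _ in range(steps):
        exact = total + update
        lower = (exact // grid) * grid
        frac = (exact - lower) / grid
        # Round up with probability `frac`, otherwise round down.
        total = lower + grid if rng.random() < frac else lower
    return total


# Hypothetical experiment: eight seeds, same configuration.
results = [noisy_run(seed) for seed in range(8)]

print("mean :", statistics.mean(results))    # near the exact answer of 1.1
print("stdev:", statistics.stdev(results))   # spread caused by sampling alone
```

Fixing the seed makes a single run repeatable, while the standard deviation across seeds gives a rough noise floor for comparing configurations.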

---

# **Intended Use**

This repository is intended for:

- research documentation
- training methodology comparison
- optimizer precision analysis
- numerical stability benchmarking
- transformer architecture experimentation

---

# **Limitations**

Results can vary depending on:

- sampling strategy
- hardware backend
- distributed training topology
- random seed initialization

---

# **License**

**Apache License 2.0**