---
license: apache-2.0
language:
- en
tags:
- mistral
- fp32
- adamw
- transformer
- monte-carlo
- dit
- ernie
pipeline_tag: text-to-image
---

# **Model Card**

# **Overview**

This repository documents two separate model training methodologies and their precision strategies:

---

# **Mistral LLM Training**

- **Fully trained in native FP32 precision**
- Optimization performed using standard **AdamW**
- **No Adam8bit**, quantized optimizer states, or reduced-precision optimizer approximations were used during training
- Intended to preserve **numerical stability** and **high-fidelity gradient accumulation** throughout all training phases

---

# **DIT Ernie Model**

- Uses a **Monte Carlo estimation** approach to approximate **FP32 behavior**

---

# **Training Details**

# **Mistral LLM**

## **Precision**

- **Full FP32 training**
- **FP32 activations**
- **FP32 optimizer states**
- **FP32 gradients**

## **Optimizer**

- **AdamW**
- Weight decay enabled
- **No 8-bit optimizer compression**
- **No low-rank optimizer approximation**

## **Notes**

The Mistral configuration prioritizes:

- **numerical consistency**
- **deterministic convergence behavior**
- **stable long-context optimization**
- **reduced quantization-induced gradient noise**

This setup is computationally expensive but provides **high-fidelity optimization dynamics** during pretraining and finetuning.

---

# **DIT Ernie**

## **Precision Strategy**

The DIT Ernie architecture utilizes:

- **Monte Carlo estimation techniques**
- **probabilistic FP32 approximation**
- **stochastic numerical reconstruction**

Rather than maintaining strict FP32 execution across the entire training stack, the model estimates FP32-equivalent statistical behavior through sampling-based computation.

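
The card does not specify which sampling scheme is used, so the following is only a minimal sketch of one common way to obtain FP32-like behavior in expectation: stochastic rounding onto a coarse grid, averaged over many samples. The function names, the grid `step`, and the example value are hypothetical and chosen for illustration; they are not taken from the DIT Ernie training code.

```python
import math
import random


def stochastic_round(x: float, step: float, rng: random.Random) -> float:
    """Round x to a multiple of `step`, picking the upper or lower neighbor at random.

    The probability of rounding up equals the fractional distance from the
    lower grid point, so the expected value of the result equals x.
    """
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower
    return (lower + (1 if rng.random() < frac else 0)) * step


def monte_carlo_estimate(x: float, step: float, n_samples: int, seed: int = 0) -> float:
    """Average many stochastically rounded samples to estimate the original value."""
    rng = random.Random(seed)
    total = sum(stochastic_round(x, step, rng) for _ in range(n_samples))
    return total / n_samples


# Hypothetical values: a full-precision quantity and a coarse low-precision grid.
x = 0.30
step = 0.25

# Deterministic round-to-nearest always lands on the same grid point (biased).
nearest = round(x / step) * step

# The Monte Carlo average of stochastic roundings approaches x as samples grow.
estimate = monte_carlo_estimate(x, step, n_samples=20000)

print("round-to-nearest:", nearest)
print("Monte Carlo estimate:", estimate)
```

Changing `seed` changes the individual samples and therefore the estimate slightly, which illustrates why a sampling-based precision strategy can show run-to-run variance.
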
## **Goals**

- reduce memory bandwidth requirements
- improve throughput efficiency
- retain approximate FP32 convergence characteristics
- balance numerical quality with hardware scalability

## **Notes**

This methodology may introduce:

- **stochastic variance between runs**
- **approximation noise**
- **non-deterministic optimization characteristics**

However, it can significantly reduce training cost relative to native FP32 execution.

---

# **Intended Use**

This repository is intended for:

- research documentation
- training methodology comparison
- optimizer precision analysis
- numerical stability benchmarking
- transformer architecture experimentation

---

# **Limitations**

Results can vary depending on:

- sampling strategy
- hardware backend
- distributed training topology
- random seed initialization

---

# **License**

**Apache License 2.0**