ERNIE-Image / README.md
metadata
license: apache-2.0
language:
  - en
tags:
  - mistral
  - fp32
  - adamw
  - transformer
  - monte-carlo
  - dit
  - ernie
pipeline_tag: text-to-image

Model Card

Overview

This repository documents the training methodology and precision strategy of two separate models, a Mistral LLM and a DIT Ernie text-to-image model:


Mistral LLM Training

  • Fully trained in native FP32 precision
  • Optimization performed using standard AdamW
  • No Adam8bit, quantized optimizer states, or reduced-precision optimizer approximations were used during training
  • Intended to preserve numerical stability and high-fidelity gradient accumulation throughout all training phases

DIT Ernie Model

  • Uses a Monte Carlo estimation approach to approximate FP32 behavior

Training Details

Mistral LLM

Precision

  • Full FP32 training
  • FP32 activations
  • FP32 optimizer states
  • FP32 gradients

Optimizer

  • AdamW
  • Weight decay enabled
  • No 8-bit optimizer compression
  • No low-rank optimizer approximation
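As an illustration of what an uncompressed AdamW update with FP32 state looks like, here is a minimal sketch for a single scalar parameter. It is not the training code used for this model: the helper names `to_f32` and `adamw_step` and every hyperparameter value (`lr`, `beta1`, `beta2`, `eps`, `weight_decay`) are assumptions chosen for the example, and FP32 storage is emulated by rounding through `struct` while the intermediate arithmetic runs in Python's native FP64.

```python
import math
import struct


def to_f32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def adamw_step(param, grad, m, v, step,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar; param, m and v are stored as FP32.

    Hyperparameter defaults are illustrative assumptions, not values
    taken from this model card.
    """
    # Decoupled weight decay: shrink the parameter directly instead of
    # adding a penalty term to the gradient.
    param = to_f32(param * (1.0 - lr * weight_decay))
    # Exponential moving averages of the gradient and the squared gradient.
    m = to_f32(beta1 * m + (1.0 - beta1) * grad)
    v = to_f32(beta2 * v + (1.0 - beta2) * grad * grad)
    # Bias correction for the zero-initialized moment estimates.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    param = to_f32(param - lr * m_hat / (math.sqrt(v_hat) + eps))
    return param, m, v


# Usage: minimize f(w) = w**2 (gradient 2*w) starting from w = 1.0.
w, m, v = to_f32(1.0), 0.0, 0.0
for step in range(1, 101):
    w, m, v = adamw_step(w, 2.0 * w, m, v, step)
print(w)
```

An 8-bit optimizer would store `m` and `v` in a quantized form between steps; in this sketch both stay as full FP32 values, which is the property the list above describes.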

Notes

The Mistral configuration prioritizes:

  • numerical consistency
  • deterministic convergence behavior
  • stable long-context optimization
  • reduced quantization-induced gradient noise

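This setup is computationally expensive (FP32 weights, gradients, and the two AdamW moment buffers alone take 16 bytes per parameter, before activations) but provides high-fidelity optimization dynamics during pretraining and fine-tuning.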


DIT Ernie

Precision Strategy

The DIT Ernie training setup uses:

  • Monte Carlo estimation techniques
  • probabilistic FP32 approximation
  • stochastic numerical reconstruction

Rather than maintaining strict FP32 execution across the entire training stack, the model estimates FP32-equivalent statistical behavior through sampling-based computation.
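The model card does not specify the sampling scheme. One well-known technique that fits the description is stochastic rounding, shown below purely as an assumed stand-in: an FP32 value is rounded to a lower-precision format (bfloat16 here, also an assumption) up or down at random, with probabilities chosen so that the expected result equals the original FP32 value. All function names are hypothetical.

```python
import random
import struct


def f32_to_bits(x: float) -> int:
    """Return the 32-bit pattern of x after rounding it to FP32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(bits: int) -> float:
    """Interpret a 32-bit pattern as an FP32 value."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def truncate_bf16(x: float) -> float:
    """Deterministic baseline: drop the low 16 bits (round toward zero)."""
    return bits_to_f32(f32_to_bits(x) & 0xFFFF0000)


def stochastic_round_bf16(x: float, rng: random.Random) -> float:
    """Round a finite FP32 value to bfloat16 precision at random.

    Adding a uniform 16-bit integer before truncating carries into the
    kept bits with probability (dropped bits) / 2**16, so the average
    over many draws matches the FP32 input.
    """
    bits = f32_to_bits(x) + rng.getrandbits(16)
    return bits_to_f32(bits & 0xFFFF0000)


# Usage: compare the average of many stochastic roundings with truncation.
rng = random.Random(0)
x = 1.001
samples = [stochastic_round_bf16(x, rng) for _ in range(20000)]
print("truncated:       ", truncate_bf16(x))
print("stochastic mean: ", sum(samples) / len(samples))
```

Each individual sample carries only bfloat16 precision, but the sample mean converges toward the FP32 value, whereas truncation is biased toward zero on every call. Handling of NaN, infinity, and overflow is omitted from this sketch.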

Goals

  • reduce memory bandwidth requirements
  • improve throughput efficiency
  • retain approximate FP32 convergence characteristics
  • balance numerical quality with hardware scalability

Notes

This methodology may introduce:

  • stochastic variance between runs
  • approximation noise
  • non-deterministic optimization characteristics

However, it can significantly reduce training cost relative to native FP32 execution.


Intended Use

This repository is intended for:

  • research documentation
  • training methodology comparison
  • optimizer precision analysis
  • numerical stability benchmarking
  • transformer architecture experimentation

Limitations

Results can vary depending on:

  • sampling strategy
  • hardware backend
  • distributed training topology
  • random seed initialization
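To make the dependence on the random seed concrete, the following toy sketch estimates a mean by random sampling. It is not taken from the training code: the function `mc_mean` and the data are hypothetical. It shows only that a sampling-based estimate repeats exactly under a fixed seed and varies when the seed changes.

```python
import random


def mc_mean(values, n_samples, seed):
    """Estimate the mean of `values` from n_samples random draws."""
    rng = random.Random(seed)  # private generator, so the seed fully fixes the draws
    picks = [rng.choice(values) for _ in range(n_samples)]
    return sum(picks) / n_samples


# Usage: same seed reproduces the estimate, different seeds scatter around the true mean.
data = [float(i) for i in range(1000)]
print("exact mean:", sum(data) / len(data))
for seed in (1, 1, 2, 3):
    print("seed", seed, "->", mc_mean(data, 100, seed))
```

Recording the seed alongside the sampling strategy and hardware configuration is what makes such runs comparable.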

License

Apache License 2.0