ERNIE-Image / README.md

Felldude



metadata

license: apache-2.0
language:
  - en
tags:
  - mistral
  - fp32
  - adamw
  - transformer
  - monte-carlo
  - dit
  - ernie
pipeline_tag: text-to-image

Model Card

Overview

This repository documents two separate model training methodologies and their precision strategies:

Mistral LLM Training

Fully trained in native FP32 precision
Optimization performed using standard AdamW
No Adam8bit, quantized optimizer states, or reduced-precision optimizer approximations were used during training
Intended to preserve numerical stability and high-fidelity gradient accumulation throughout all training phases
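
To illustrate why full-precision accumulation matters, the sketch below repeatedly adds a small update to a weight, once in FP32 and once in a simulated bfloat16 format. The helper names (`to_fp32`, `to_bf16`, `accumulate`), the update size, and the choice of bfloat16 as the comparison format are illustrative assumptions and are not taken from this repository's training code.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Simulate bfloat16: keep the top 16 bits of the FP32 bit pattern,
    rounding to nearest with ties to even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate(update: float, steps: int, cast) -> float:
    """Add `update` to a weight of 1.0 `steps` times, rounding with `cast`
    after every addition (hypothetical stand-in for a parameter update)."""
    total = cast(1.0)
    for _ in range(steps):
        total = cast(total + cast(update))
    return total


fp32_total = accumulate(1e-3, 1000, to_fp32)
bf16_total = accumulate(1e-3, 1000, to_bf16)
print("FP32 accumulation:", fp32_total)
print("bfloat16 accumulation:", bf16_total)
```

In this sketch the FP32 total ends up close to 2.0, while the simulated bfloat16 total never leaves 1.0, because each update is smaller than half the spacing between representable bfloat16 values near 1.0 and is rounded away.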

DIT Ernie Model

Uses a Monte Carlo estimation approach to approximate FP32 behavior

Training Details

Mistral LLM

Precision

Full FP32 training
FP32 activations
FP32 optimizer states
FP32 gradients

Optimizer

AdamW
Weight decay enabled
No 8-bit optimizer compression
No low-rank optimizer approximation
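
For reference, a minimal scalar sketch of one standard AdamW update with decoupled weight decay is shown below. The function name `adamw_step` and all hyperparameter values are assumptions chosen for illustration; the repository does not state the values used. Python floats are FP64, so this shows only the update rule, not FP32 storage.

```python
import math


def adamw_step(param, grad, m, v, step,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter.

    `m` and `v` are the first and second moment estimates; `step` counts from 1.
    """
    # Decoupled weight decay: shrink the parameter directly, not via the gradient.
    param = param * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialized moments.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Usage: a few steps on the toy loss f(w) = (w - 3)^2 (hypothetical example).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 201):
    g = 2.0 * (w - 3.0)
    w, m, v = adamw_step(w, g, m, v, t, lr=0.05)
print("w after 200 steps:", w)
```

Both moment estimates are kept per parameter, which is why uncompressed optimizer states add two extra values for every weight.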

Notes

The Mistral configuration prioritizes:

numerical consistency
reproducible convergence behavior
stable long-context optimization
reduced quantization-induced gradient noise

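This setup is computationally expensive: FP32 weights, gradients, and the two AdamW moment estimates together take about 16 bytes per parameter before activations are counted. In exchange, it provides high-fidelity optimization dynamics during pretraining and finetuning.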

DIT Ernie

Precision Strategy

The DIT Ernie architecture utilizes:

Monte Carlo estimation techniques
probabilistic FP32 approximation
stochastic numerical reconstruction

Rather than maintaining strict FP32 execution across the entire training stack, the model estimates FP32-equivalent statistical behavior through sampling-based computation.
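
The repository does not specify which sampling scheme is used. One common technique with this flavor is stochastic rounding, sketched below purely as an assumption: an FP32 value is rounded up or down to a neighboring lower-precision value at random, with probabilities chosen so that the expected result equals the FP32 value. The function names and the bfloat16 target format are hypothetical.

```python
import random
import struct


def fp32_to_bits(x: float) -> int:
    """Return the 32-bit pattern of x rounded to FP32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_fp32(bits: int) -> float:
    """Interpret a 32-bit pattern as an FP32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def stochastic_round_bf16(x: float, rng: random.Random) -> float:
    """Round x to a bfloat16-representable value, choosing the neighbor
    further from zero with probability equal to the dropped fraction.
    Assumes x is finite and far from the overflow threshold."""
    bits = fp32_to_bits(x)
    dropped = bits & 0xFFFF        # low 16 bits that bfloat16 cannot store
    result = bits & 0xFFFF0000     # neighbor closer to zero
    if rng.randrange(1 << 16) < dropped:
        result = (result + 0x10000) & 0xFFFFFFFF  # neighbor further from zero
    return bits_to_fp32(result)


def monte_carlo_mean(x: float, n: int, seed: int = 0) -> float:
    """Average n stochastic roundings of x; approaches the FP32 value as n grows."""
    rng = random.Random(seed)
    return sum(stochastic_round_bf16(x, rng) for _ in range(n)) / n


print("single sample:", stochastic_round_bf16(1.001, random.Random(1)))
print("mean of 20000 samples:", monte_carlo_mean(1.001, 20000))
```

Each individual sample is one of the two bfloat16 neighbors of the input, so any single value is coarse; only the average over many samples tracks the FP32 value.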

Goals

reduce memory bandwidth requirements
improve throughput efficiency
retain approximate FP32 convergence characteristics
balance numerical quality with hardware scalability

Notes

This methodology may introduce:

stochastic variance between runs
approximation noise
non-deterministic optimization characteristics

However, it can significantly reduce training cost relative to native FP32 execution.
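
The run-to-run variance mentioned above can be illustrated with a generic sampling estimator. The sketch below is an assumption-laden toy, not this repository's method: it estimates a target value by randomly choosing between two neighboring representable values, then measures how much the estimate varies across seeds for different sample counts. All names and constants are hypothetical.

```python
import random
import statistics


def estimate(target: float, lo: float, hi: float, n: int, seed: int) -> float:
    """Monte Carlo estimate of `target` from n random choices between lo and hi,
    where hi is chosen with probability (target - lo) / (hi - lo)."""
    rng = random.Random(seed)
    p_up = (target - lo) / (hi - lo)
    return sum(hi if rng.random() < p_up else lo for _ in range(n)) / n


def spread_across_seeds(n: int, seeds=range(50)) -> float:
    """Standard deviation of the estimate over several seeds (simulated runs)."""
    return statistics.pstdev(
        estimate(1.001, 1.0, 1.0078125, n, seed) for seed in seeds
    )


for n in (100, 10_000):
    print(f"samples per estimate = {n}: spread across seeds = {spread_across_seeds(n):.2e}")
```

With a fixed seed the estimate is repeatable; across seeds the spread shrinks roughly with the square root of the sample count, which is the usual trade-off between sampling cost and noise.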

Intended Use

This repository is intended for:

research documentation
training methodology comparison
optimizer precision analysis
numerical stability benchmarking
transformer architecture experimentation

Limitations

Results can vary depending on:

sampling strategy
hardware backend
distributed training topology
random seed initialization

License

Apache License 2.0