Felldude/ERNIE-Image


	---
	license: apache-2.0
	language:
	- en
	tags:
	- mistral
	- fp32
	- adamw
	- transformer
	- monte-carlo
	- dit
	- ernie
	pipeline_tag: text-to-image
	---

	# Model Card

	# Overview

	This repository documents two separate training methodologies and precision strategies, one for a Mistral large language model and one for the DIT Ernie text-to-image model:

	---

	# Mistral LLM Training

	- Fully trained in native FP32 precision
	- Optimization performed using standard AdamW
	- No Adam8bit, quantized optimizer states, or reduced-precision optimizer approximations were used during training
	- Intended to preserve numerical stability and high-fidelity gradient accumulation throughout all training phases

	---

	# DIT Ernie Model

	- Uses a Monte Carlo estimation approach to approximate FP32 behavior

	---

	# Training Details

	# Mistral LLM

	## Precision

	- Full FP32 training
	- FP32 activations
	- FP32 optimizer states
	- FP32 gradients
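
To illustrate why the precision of accumulated values matters, the sketch below emulates repeated small updates to a single weight in FP16 and in FP32, using only the Python standard library. It is a toy illustration and not part of the training code; the helper names `round_to` and `accumulate` and the chosen values (start 1.0, update 1e-4, 1000 steps) are assumptions made for the example.

```python
import struct


def round_to(fmt: str, x: float) -> float:
    """Round a Python float (FP64) to the format given by a struct code.

    "<e" is IEEE 754 half precision (FP16), "<f" is single precision (FP32).
    """
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def accumulate(fmt: str, start: float, update: float, steps: int) -> float:
    """Add `update` to `start` `steps` times, rounding after every addition."""
    w = round_to(fmt, start)
    u = round_to(fmt, update)
    for _ in range(steps):
        # The sum is computed in FP64, then rounded to the target format,
        # which emulates an addition carried out in that format.
        w = round_to(fmt, w + u)
    return w


fp16_result = accumulate("<e", 1.0, 1e-4, 1000)  # stays at 1.0
fp32_result = accumulate("<f", 1.0, 1e-4, 1000)  # close to 1.1
```

In FP16 the spacing between representable values near 1.0 is about 9.8e-4, so each 1e-4 update rounds back to 1.0 and the weight never moves. In FP32 the same updates accumulate as expected.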

	## Optimizer

	- AdamW
	- Weight decay enabled
	- No 8-bit optimizer compression
	- No low-rank optimizer approximation
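
For reference, a minimal scalar sketch of one AdamW step with decoupled weight decay is shown below. It is not the training code used for this model. The hyperparameter defaults (`lr`, `beta1`, `beta2`, `eps`, `weight_decay`) are illustrative values and are not taken from this card, and the `AdamWState` and `adamw_step` names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class AdamWState:
    m: float = 0.0   # first-moment (mean of gradients) estimate
    v: float = 0.0   # second-moment (mean of squared gradients) estimate
    step: int = 0    # number of updates applied so far


def adamw_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update to a single scalar weight and return it."""
    state.step += 1
    # Decoupled weight decay: shrink the weight directly instead of
    # adding weight_decay * w to the gradient.
    w = w * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and its square.
    state.m = beta1 * state.m + (1.0 - beta1) * grad
    state.v = beta2 * state.v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialized moments.
    m_hat = state.m / (1.0 - beta1 ** state.step)
    v_hat = state.v / (1.0 - beta2 ** state.step)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)
```

The two moment estimates `m` and `v` are the optimizer states that 8-bit optimizers store in compressed form; here they stay at full precision.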

	## Notes

	The Mistral configuration prioritizes:

	- numerical consistency
	- deterministic convergence behavior
	- stable long-context optimization
	- reduced quantization-induced gradient noise

	This setup is computationally expensive but provides high-fidelity optimization dynamics during pretraining and fine-tuning.
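
Part of that cost can be made concrete with simple arithmetic. With FP32 weights, FP32 gradients, and the two FP32 AdamW moment estimates, the training state takes 16 bytes per parameter before any activations are counted. The 7-billion parameter count below is a hypothetical figure chosen for illustration; this card does not state the model size.

```python
BYTES_FP32 = 4  # one FP32 value occupies 4 bytes


def fp32_adamw_state_bytes(num_params: int) -> int:
    """Bytes for weights, gradients, and AdamW moments, all in FP32."""
    weights = num_params * BYTES_FP32
    grads = num_params * BYTES_FP32
    adam_moments = 2 * num_params * BYTES_FP32  # first and second moment
    return weights + grads + adam_moments


hypothetical_params = 7_000_000_000  # assumption, not from this card
total_gb = fp32_adamw_state_bytes(hypothetical_params) / 1e9  # decimal GB
```

For the hypothetical 7-billion parameter case this comes to 112 GB of state, excluding activations and any framework overhead.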

	---

	# DIT Ernie

	## Precision Strategy

	The DIT Ernie architecture utilizes:

	- Monte Carlo estimation techniques
	- probabilistic FP32 approximation
	- stochastic numerical reconstruction

	Rather than running the entire training stack in strict FP32, training approximates FP32-equivalent statistical behavior through sampling-based computation.
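
This card does not name the estimator. One well-known technique that fits the description is stochastic rounding, where a value is rounded up or down to a lower-precision format at random so that the result matches the FP32 value on average. The sketch below applies it to bfloat16 and is offered only as an assumed example of the general idea, not as the method used for this model. All function names are hypothetical, and the code handles only finite values far from overflow.

```python
import random
import struct


def f32_bits(x: float) -> int:
    """Return the 32-bit pattern of x after rounding it to FP32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(bits: int) -> float:
    """Interpret a 32-bit pattern as an FP32 value."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def truncate_to_bf16(x: float) -> float:
    """Deterministic baseline: drop the low 16 bits (round toward zero)."""
    return bits_f32(f32_bits(x) & 0xFFFF0000)


def stochastic_round_to_bf16(x: float, rng: random.Random) -> float:
    """Round x to a bfloat16-representable value, unbiased in expectation.

    Adding a uniform 16-bit integer to the bit pattern carries into the
    kept bits with probability equal to the discarded fraction.
    """
    bits = f32_bits(x) + rng.getrandbits(16)
    return bits_f32(bits & 0xFFFF0000)


def mean_of_samples(x: float, n: int, seed: int) -> float:
    """Average n independent stochastic roundings of x."""
    rng = random.Random(seed)
    return sum(stochastic_round_to_bf16(x, rng) for _ in range(n)) / n
```

For `x = 1.0 + 2**-10`, truncation always returns 1.0, while the stochastic version returns 1.0078125 one time in eight and 1.0 otherwise, so its average converges to `x`.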

	## Goals

	- reduce memory bandwidth requirements
	- improve throughput efficiency
	- retain approximate FP32 convergence characteristics
	- balance numerical quality with hardware scalability

	## Notes

	This methodology may introduce:

	- stochastic variance between runs
	- approximation noise
	- non-deterministic optimization characteristics

	However, it can significantly reduce training cost relative to native FP32 execution.
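
The run-to-run variance can be illustrated with a toy Monte Carlo estimator that is unrelated to the actual training code. The sketch estimates the mean of uniform random numbers and measures how much the estimate differs across seeds. The function names and sample counts are assumptions made for the example.

```python
import random
import statistics


def mc_mean(n_samples: int, seed: int) -> float:
    """Monte Carlo estimate of the mean of Uniform(0, 1) for one seed."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(n_samples)) / n_samples


def spread_across_seeds(n_samples: int, seeds) -> float:
    """Standard deviation of the estimate over several seeds ("runs")."""
    return statistics.pstdev([mc_mean(n_samples, s) for s in seeds])


seeds = range(50)
spread_small = spread_across_seeds(100, seeds)     # few samples per run
spread_large = spread_across_seeds(10_000, seeds)  # many samples per run
```

Fixing the seed makes a single run repeatable, and increasing the number of samples narrows the spread between runs, roughly in proportion to one over the square root of the sample count.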

	---

	# Intended Use

	This repository is intended for:

	- research documentation
	- training methodology comparison
	- optimizer precision analysis
	- numerical stability benchmarking
	- transformer architecture experimentation

	---

	# Limitations

	Results can vary depending on:

	- sampling strategy
	- hardware backend
	- distributed training topology
	- random seed initialization

	---

	# License

	Apache License 2.0