	# Training and fine-tuning EDEN

	This guide covers retraining EDEN from scratch, fine-tuning it on your own data,
	and converting a checkpoint for publishing.

	## Install

	```bash
	pip install -r requirements.txt
	```

	## Where files live

	All training artifacts are written under a workspace folder named
	`eden_system`, created in the directory where you run the commands. To put the
	workspace somewhere else, set the `EDEN_HOME` environment variable:

	```bash
	export EDEN_HOME=/path/to/workspace
	```

	The layout is:

	```
	eden_system/
	  data/               prepared pairs, tokenizer, training config
	  checkpoints/        default checkpoint folder
	  training_sessions/  numbered training runs, each with its own checkpoints
	  run/                live metrics, logs, and run state
	  exports/            exported artifacts
	```
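
To check a workspace before training, the layout above can be inspected with a few lines of Python. This is a minimal sketch, not EDEN's own code: the helper names `workspace_root` and `missing_subfolders` are hypothetical, and it assumes `EDEN_HOME` points at the workspace folder itself rather than at its parent.

```python
import os
from pathlib import Path

# Subfolder names taken from the documented layout.
SUBFOLDERS = ("data", "checkpoints", "training_sessions", "run", "exports")


def workspace_root(env=None):
    """Return EDEN_HOME if it is set, otherwise ./eden_system.

    Assumption: this mirrors the documented behaviour; EDEN's own
    resolution logic may differ in details.
    """
    env = os.environ if env is None else env
    override = env.get("EDEN_HOME")
    return Path(override) if override else Path.cwd() / "eden_system"


def missing_subfolders(root):
    """List the documented subfolders that do not exist yet under root."""
    return [name for name in SUBFOLDERS if not (Path(root) / name).is_dir()]


root = workspace_root()
print(f"workspace: {root}")
print(f"not created yet: {missing_subfolders(root)}")
```

Folders that are listed as missing are not necessarily a problem; they may simply not have been written yet by the step that produces them.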

	## Prepare the dataset

	```bash
	python -m eden.cli prepare
	```

	This downloads and combines the source corpora, generates synthetic noise pairs,
	trains the byte-level BPE tokenizer, and writes everything into
	`eden_system/data`.
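
To give a feel for what a synthetic noise pair looks like, here is a minimal sketch that corrupts clean sentences with random character drops, swaps, and duplications. It is an illustration only: the functions `add_noise` and `make_pairs` and the `rate` parameter are hypothetical, and the noise model that `prepare` actually uses is not described in this guide.

```python
import random


def add_noise(text, rng, rate=0.1):
    """Corrupt text with character drops, swaps, and duplications.

    Each position is corrupted with probability `rate`, split roughly
    evenly between the three kinds of error.
    """
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        roll = rng.random()
        if roll < rate / 3:
            # Drop this character.
            i += 1
        elif roll < 2 * rate / 3 and i + 1 < len(chars):
            # Swap this character with the next one.
            out.extend([chars[i + 1], chars[i]])
            i += 2
        elif roll < rate:
            # Duplicate this character (also used when a swap is impossible).
            out.extend([chars[i], chars[i]])
            i += 1
        else:
            # Keep the character unchanged.
            out.append(chars[i])
            i += 1
    return "".join(out)


def make_pairs(clean_sentences, seed=0, rate=0.1):
    """Build input/target records: noisy text in, clean text out."""
    rng = random.Random(seed)
    return [{"input": add_noise(s, rng, rate), "target": s} for s in clean_sentences]


for pair in make_pairs(["The weather is lovely today."], seed=3, rate=0.2):
    print(pair)
```

Seeding the generator keeps the output reproducible, which makes it easier to compare runs on the same data.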

	## Train from scratch

	```bash
	python -m eden.cli train
	```

	Recipes control model size and memory use:

	```bash
	python -m eden.cli train --recipe survivor # smallest, always runs
	python -m eden.cli train --recipe m5-smart # balanced default
	python -m eden.cli train --recipe m5-large # largest, matches this release
	```

	Start with `m5-smart`. Move to `m5-large` only after a smaller recipe trains
	without being stopped by the memory cutoff (see Memory safety).

	To resume from a saved checkpoint:

	```bash
	python -m eden.cli train --resume eden_system/checkpoints/latest.pt
	```

	## Fine-tune on your own examples

	Create a JSONL file of input and target pairs:

	```jsonl
	{"input": "bad rough text here", "target": "Polished text here."}
	{"input": "another messy sentance", "target": "Another polished sentence."}
	```
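
A malformed line is easier to catch before a fine-tuning run than during one. The sketch below checks that every line is a JSON object with non-empty `input` and `target` strings. The function `validate_pairs` is a hypothetical helper, not part of the EDEN CLI, and the rules it enforces are assumptions drawn from the example above.

```python
import json


def validate_pairs(path):
    """Return (line_number, problem) tuples for lines that are not valid pairs."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                # This checker skips blank lines; the loader may be stricter.
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                problems.append((number, f"invalid JSON: {error.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((number, "not a JSON object"))
                continue
            for key in ("input", "target"):
                value = record.get(key)
                if not isinstance(value, str) or not value.strip():
                    problems.append((number, f"missing or empty '{key}'"))
    return problems
```

Calling `validate_pairs("my_pairs.jsonl")` returns an empty list when every line passes.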

	CSV and TSV files with `input` and `target` columns also work. Then run:

	```bash
	python -m eden.cli finetune --data my_pairs.jsonl --mix-base
	```

	`--mix-base` blends in the base dataset so the model learns your style without
	forgetting general spelling and grammar ability. Use a low learning rate for
	fine-tuning, for example `--lr 0.00008`.
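
The idea behind blending can be sketched in a few lines. This is an illustration of the general technique (replaying base examples alongside new ones), not EDEN's implementation: the function `mix_with_base`, the `base_fraction` parameter, and the sampling strategy are all assumptions, and the ratio that `--mix-base` actually uses is not documented here.

```python
import random


def mix_with_base(custom_pairs, base_pairs, base_fraction=0.5, seed=0):
    """Return a shuffled list in which about base_fraction of items are base pairs.

    All custom pairs are kept; base pairs are sampled without replacement,
    capped at the number available.
    """
    if not 0 <= base_fraction < 1:
        raise ValueError("base_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_base / (n_base + n_custom) = base_fraction for n_base.
    wanted = round(len(custom_pairs) * base_fraction / (1 - base_fraction))
    n_base = min(wanted, len(base_pairs))
    mixed = list(custom_pairs) + rng.sample(list(base_pairs), n_base)
    rng.shuffle(mixed)
    return mixed


custom = [{"input": f"custom {i}", "target": f"Custom {i}."} for i in range(4)]
base = [{"input": f"base {i}", "target": f"Base {i}."} for i in range(10)]
print(len(mix_with_base(custom, base, base_fraction=0.5)))
```

A higher base fraction protects general ability at the cost of slower adaptation to your style, so it is a value worth tuning rather than fixing.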

	## Evaluate

	```bash
	python -m eden.cli eval --checkpoint eden_system/checkpoints/best.pt
	```

	## Convert a checkpoint for Hugging Face

	Once you have a checkpoint you like, convert it into safetensors plus the
	configuration and tokenizer files:

	```bash
	python scripts/convert_checkpoint_to_hf.py \
	    --checkpoint eden_system/checkpoints/best.pt \
	    --tokenizer eden_system/data/tokenizer.json \
	    --out .
	```

	Then upload:

	```bash
	python scripts/push_to_hub.py --repo-id Rybib/EDEN
	```

	## Memory safety

	EDEN keeps PyTorch's MPS (Metal) backend within a bounded memory budget. If
	memory use crosses the cutoff, training stops and writes a checkpoint you can
	resume from. A saved checkpoint is much better than a frozen machine. The cutoff
	is configurable through the training config and the recipe.
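
The stop-and-save behaviour can be pictured as a check that runs between training steps. The sketch below is a simplified illustration, not EDEN's code: `check_memory`, `MemoryBudgetExceeded`, and the callback arguments are hypothetical names. With PyTorch on the MPS backend, `read_bytes` could be `torch.mps.current_allocated_memory`, though whether EDEN measures memory that way is an assumption.

```python
class MemoryBudgetExceeded(RuntimeError):
    """Raised after a checkpoint has been saved because memory passed the cutoff."""


def check_memory(read_bytes, budget_bytes, save_checkpoint):
    """Save a checkpoint and stop if current memory use exceeds the budget.

    read_bytes:      callable returning current memory use in bytes
    budget_bytes:    the cutoff, in bytes
    save_checkpoint: callable that writes a resumable checkpoint
    """
    used = read_bytes()
    if used > budget_bytes:
        # Save first, then stop, so the run can be resumed later.
        save_checkpoint()
        raise MemoryBudgetExceeded(
            f"memory use {used} bytes exceeded budget {budget_bytes} bytes"
        )
    return used
```

Saving before raising is the important ordering: if the process stopped first, the progress since the last checkpoint would be lost.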

	## The web dashboard

	```bash
	python -m eden.cli ui
	# open http://127.0.0.1:7860
	```

	The dashboard can start, pause, resume, and monitor training, and can run a
	finished checkpoint in the browser. It launches training as a separate process
	using `python -m eden.cli`, so make sure the `eden` package is importable from
	the folder you launch the dashboard from.
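
A quick way to confirm that the package can be found is to ask Python's import machinery directly. This is a small sketch using only the standard library; the helper name `is_importable` is hypothetical.

```python
import importlib.util


def is_importable(package):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package) is not None


# Run this from the folder where you plan to launch the dashboard.
print("eden importable:", is_importable("eden"))
```

If this prints `False`, run the dashboard from the repository root or install the package into the active environment.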