# Training and fine-tuning EDEN
This guide covers retraining EDEN from scratch, fine-tuning it on your own data,
and converting a checkpoint for publishing.
## Install
```bash
pip install -r requirements.txt
```
## Where files live
All training artifacts are written under a workspace folder named
`eden_system`, created next to where you run the commands. You can move the
workspace by setting the `EDEN_HOME` environment variable:
```bash
export EDEN_HOME=/path/to/workspace
```
The layout is:
```
eden_system/
  data/               prepared pairs, tokenizer, training config
  checkpoints/        default checkpoint folder
  training_sessions/  numbered training runs, each with its own checkpoints
  run/                live metrics, logs, and run state
  exports/            exported artifacts
```
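To check where a given shell session will read and write artifacts, the lookup described above can be sketched in a few lines of Python. This is an illustration, not EDEN's own code: the helper names `workspace_root` and `missing_subfolders` are hypothetical, and the sketch assumes that `EDEN_HOME` names the workspace folder itself and that the default is `eden_system` in the current directory.

```python
import os
from pathlib import Path

# Subfolder names copied from the layout listing above.
SUBFOLDERS = ("data", "checkpoints", "training_sessions", "run", "exports")


def workspace_root(env=None, cwd=None):
    """Return the workspace path: EDEN_HOME if set, else <cwd>/eden_system.

    Assumption: EDEN_HOME points at the workspace folder itself.
    """
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else Path(cwd)
    override = env.get("EDEN_HOME")
    return Path(override).expanduser() if override else cwd / "eden_system"


def missing_subfolders(root):
    """List the expected subfolders that do not exist yet under root."""
    return [name for name in SUBFOLDERS if not (Path(root) / name).is_dir()]


if __name__ == "__main__":
    root = workspace_root()
    print(f"workspace: {root}")
    print(f"missing subfolders: {missing_subfolders(root)}")
```

Running it before and after `prepare` shows which folders that step has created.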
## Prepare the dataset
```bash
python -m eden.cli prepare
```
This downloads and combines the source corpora, generates synthetic noise pairs,
trains the byte-level BPE tokenizer, and writes everything into
`eden_system/data`.
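To give a sense of what a synthetic noise pair looks like, here is a minimal sketch that corrupts clean sentences with random character drops, duplications, and adjacent swaps. The noise model, the `add_noise` and `make_pairs` names, and the `rate` parameter are all assumptions made for illustration; the noise that `prepare` actually generates may differ.

```python
import random


def add_noise(text, rng, rate=0.1):
    """Corrupt text with random character drops, duplications, and swaps.

    Hypothetical noise model: each position is dropped, doubled, or swapped
    with its right-hand neighbour with total probability `rate`.
    """
    out = []
    i = 0
    while i < len(text):
        roll = rng.random()
        if roll < rate / 3:
            i += 1  # drop this character
        elif roll < 2 * rate / 3:
            out.extend([text[i], text[i]])  # duplicate this character
            i += 1
        elif roll < rate and i + 1 < len(text):
            out.extend([text[i + 1], text[i]])  # swap with the next character
            i += 2
        else:
            out.append(text[i])  # keep unchanged
            i += 1
    return "".join(out)


def make_pairs(clean_sentences, seed=0, rate=0.1):
    """Build input/target records: noisy text in, clean text out."""
    rng = random.Random(seed)  # seeded so the pairs are reproducible
    return [{"input": add_noise(s, rng, rate), "target": s} for s in clean_sentences]


if __name__ == "__main__":
    for pair in make_pairs(["The quick brown fox jumps over the lazy dog."], seed=1):
        print(pair)
```

The records use the same `input` and `target` keys as the fine-tuning file format, so the clean sentence is always the target.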
## Train from scratch
```bash
python -m eden.cli train
```
Recipes control model size and memory use:
```bash
python -m eden.cli train --recipe survivor # smallest, always runs
python -m eden.cli train --recipe m5-smart # balanced default
python -m eden.cli train --recipe m5-large # largest, matches this release
```
Start with `m5-smart`. Move to `m5-large` only after a smaller recipe trains to
completion without being stopped by the memory cutoff.
To resume:
```bash
python -m eden.cli train --resume eden_system/checkpoints/latest.pt
```
## Fine-tune on your own examples
Create a JSONL file of input and target pairs:
```jsonl
{"input": "bad rough text here", "target": "Polished text here."}
{"input": "another messy sentance", "target": "Another polished sentence."}
```
CSV and TSV files with `input` and `target` columns also work. Then run:
```bash
python -m eden.cli finetune --data my_pairs.jsonl --mix-base
```
`--mix-base` blends in the base dataset so the model learns your style without
forgetting general spelling and grammar ability. Use a low learning rate for
fine-tuning, for example `--lr 0.00008`.
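Before starting a fine-tuning run, it can help to check the pair file for malformed lines. The sketch below validates JSONL lines against the `input` and `target` format shown above. The `validate_pairs` helper is hypothetical and the rules it applies (both fields present, both non-empty strings) are assumptions; EDEN's own loader may be stricter or more lenient.

```python
import json


def validate_pairs(lines):
    """Return (pairs, errors) for an iterable of JSONL lines.

    Assumed rules: each non-blank line is a JSON object with non-empty
    string values for "input" and "target".
    """
    pairs, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {lineno}: expected a JSON object")
            continue
        bad = [
            key for key in ("input", "target")
            if not isinstance(record.get(key), str) or not record[key].strip()
        ]
        if bad:
            errors.append(f"line {lineno}: missing or empty {', '.join(bad)}")
            continue
        pairs.append({"input": record["input"], "target": record["target"]})
    return pairs, errors


if __name__ == "__main__":
    with open("my_pairs.jsonl", encoding="utf-8") as handle:
        pairs, errors = validate_pairs(handle)
    print(f"{len(pairs)} valid pairs, {len(errors)} problems")
    for message in errors:
        print(message)
```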
## Evaluate
```bash
python -m eden.cli eval --checkpoint eden_system/checkpoints/best.pt
```
## Convert a checkpoint for Hugging Face
Once you have a checkpoint you like, convert it into safetensors plus the
configuration and tokenizer files:
```bash
python scripts/convert_checkpoint_to_hf.py \
  --checkpoint eden_system/checkpoints/best.pt \
  --tokenizer eden_system/data/tokenizer.json \
  --out .
```
Then upload:
```bash
python scripts/push_to_hub.py --repo-id Rybib/EDEN
```
## Memory safety
EDEN keeps PyTorch MPS memory use inside a bounded budget. If usage crosses the
cutoff, training stops and writes a resumable checkpoint, because a saved
checkpoint is much better than a frozen machine. The cutoff is configurable
through the training config and the recipe.
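The stop-and-checkpoint behaviour can be pictured as a guard inside the training loop. The sketch below is a generic illustration, not EDEN's implementation: `train_with_memory_guard` and its parameters are hypothetical names, and the memory reading is passed in as a callable so that the example runs without PyTorch. On Apple silicon, a reading such as `torch.mps.current_allocated_memory()` could plausibly be used for that callable.

```python
def train_with_memory_guard(steps, step_fn, read_memory_bytes, save_checkpoint, budget_bytes):
    """Run up to `steps` training steps, stopping early if memory exceeds the budget.

    step_fn(step)         performs one training step (hypothetical callback)
    read_memory_bytes()   returns current memory use in bytes
    save_checkpoint(n)    writes a resumable checkpoint after n completed steps
    Returns (completed_steps, stopped_early).
    """
    for step in range(steps):
        step_fn(step)
        if read_memory_bytes() > budget_bytes:
            # Over budget: persist progress first, then stop cleanly.
            save_checkpoint(step + 1)
            return step + 1, True
    return steps, False


if __name__ == "__main__":
    fake_readings = iter([100, 200, 900])  # simulated memory use per step
    saved = []
    result = train_with_memory_guard(
        steps=10,
        step_fn=lambda step: None,
        read_memory_bytes=lambda: next(fake_readings),
        save_checkpoint=saved.append,
        budget_bytes=500,
    )
    print(result, saved)
```

Checking after each step keeps the guard simple. Memory can still spike inside a single step, which is one reason to set the budget below the physical limit.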
## The web dashboard
```bash
python -m eden.cli ui
# open http://127.0.0.1:7860
```
The dashboard can start, pause, resume, and monitor training, and run a finished
checkpoint in the browser. It launches training as a separate process using
`python -m eden.cli`, so make sure the `eden` package is importable from the
folder where you start the dashboard.
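A quick way to confirm the importability requirement is to test it from the folder where you plan to start the dashboard. The sketch below also shows one way a child training process could be launched with the same interpreter. The helper names are hypothetical and this is not the dashboard's actual launch code; only the `python -m eden.cli train` command comes from this guide.

```python
import importlib.util
import subprocess
import sys


def is_importable(package):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package) is not None


def training_command(extra_args=()):
    """Build the argument list for a training subprocess.

    sys.executable keeps the child on the same interpreter and virtual
    environment as the parent process.
    """
    return [sys.executable, "-m", "eden.cli", "train", *extra_args]


def launch_training(extra_args=()):
    """Start training as a separate process, failing early if eden is missing."""
    if not is_importable("eden"):
        raise RuntimeError("the 'eden' package is not importable from this folder")
    return subprocess.Popen(training_command(extra_args))


if __name__ == "__main__":
    print("eden importable:", is_importable("eden"))
    print("command:", " ".join(training_command(["--recipe", "m5-smart"])))
```

If the check prints `False`, start the dashboard from the repository root or install the package into the active environment.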