EDEN architecture

EDEN is a standard encoder-decoder Transformer trained from scratch for text enhancement. This document describes how the model is built.

Overview

The model reads a rough source sentence and generates a polished target sentence. It uses a shared byte-level BPE vocabulary for both the input and the output, and the input embedding matrix is tied to the output projection.

rough text
   |
   v
[byte-level BPE tokenizer]
   |
   v
[embedding + sinusoidal positional encoding]
   |
   v
[Transformer encoder, 8 layers]  ->  memory
   |
   v
[Transformer decoder, 8 layers]  (attends to memory, causal self-attention)
   |
   v
[tied linear language-model head]
   |
   v
polished text

Configuration

Field	Value	Meaning
`vocab_size`	24000	Byte-level BPE vocabulary size
`d_model`	640	Hidden size
`n_heads`	10	Attention heads per block
`n_layers`	8	Number of encoder layers and of decoder layers (8 each)
`dim_feedforward`	2560	Feed-forward inner size
`dropout`	0.1	Dropout probability
`max_len`	512	Maximum sequence length in tokens (positions)

Key design choices

Tied embeddings. The language-model head shares its weight matrix with the input embedding. This avoids a second 24000 x 640 matrix (about 15.4M parameters) and tends to improve quality when the embedding table accounts for a large share of the model's parameters.
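Pre-norm blocks. The encoder and decoder use norm_first=True, so layer normalization is applied before each attention and feed-forward sub-layer rather than after it. This makes deep Transformers more stable to train.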
GELU activations in the feed-forward blocks.
Sinusoidal positional encoding stored as a buffer. In the Transformers integration this buffer is persistent, so it is part of the state dict and is saved and restored correctly through safetensors and meta-device loading.
Padding-aware attention. Padding tokens are masked in both the encoder and the decoder, and the decoder uses a causal mask for self-attention.
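Two of the choices above, the sinusoidal encoding and the attention masks, can be illustrated in plain Python. This is a sketch under stated assumptions: it uses the sinusoidal formulation from the original Transformer paper, which may differ in detail from EDEN's implementation, and it follows the PyTorch boolean-mask convention where True marks a position that is ignored. The function names are hypothetical.

```python
import math

PAD_ID = 1  # id of [PAD] in the special-token table


def sinusoidal_table(max_len: int, d_model: int) -> list[list[float]]:
    """Fixed positional encoding: sine on even dims, cosine on odd dims."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)
    return table


def causal_mask(size: int) -> list[list[bool]]:
    """True where a query (row) must not see a later key (column)."""
    return [[col > row for col in range(size)] for row in range(size)]


def padding_mask(token_ids: list[int], pad_id: int = PAD_ID) -> list[bool]:
    """True at padding positions, which attention should ignore."""
    return [tok == pad_id for tok in token_ids]


table = sinusoidal_table(max_len=4, d_model=8)
print(table[0][:2])                   # [0.0, 1.0]
print(causal_mask(3)[0])              # [False, True, True]
print(padding_mask([2, 17, 3, 1, 1])) # [False, False, False, True, True]
```

Because the table depends only on `max_len` and `d_model`, it can be computed once and stored as a buffer instead of being learned.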

Special tokens

Token	Id	Role
`[UNK]`	0	Unknown token
`[PAD]`	1	Padding
`[BOS]`	2	Beginning of sequence and decoder start
`[EOS]`	3	End of sequence
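Since `[BOS]` doubles as the decoder start token, a target sequence is usually turned into a decoder input and a label sequence shifted by one position. The sketch below shows that common teacher-forcing layout. It is an assumption about how a training engine might prepare batches, not a description of EDEN's actual data pipeline, and `make_decoder_pair` is a hypothetical helper.

```python
# Ids taken from the special-token table.
UNK_ID, PAD_ID, BOS_ID, EOS_ID = 0, 1, 2, 3


def make_decoder_pair(target_ids: list[int], length: int) -> tuple[list[int], list[int]]:
    """Return (decoder_input, labels), both padded to `length`.

    decoder_input starts with [BOS]; labels are the same sequence shifted
    left by one position and end with [EOS].
    """
    full = [BOS_ID] + list(target_ids) + [EOS_ID]
    decoder_input, labels = full[:-1], full[1:]
    pad = length - len(decoder_input)
    if pad < 0:
        raise ValueError("target is longer than the requested length")
    return decoder_input + [PAD_ID] * pad, labels + [PAD_ID] * pad


decoder_input, labels = make_decoder_pair([10, 11, 12], length=6)
print(decoder_input)  # [2, 10, 11, 12, 1, 1]
print(labels)         # [10, 11, 12, 3, 1, 1]
```

Positions holding `[PAD]` in the labels would normally be excluded from the loss.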

Generation

For inference the model supports three strategies:

Beam search (default), with a length penalty and a repetition penalty. Of the three strategies, this tends to give the most conservative and reliable edits.
Greedy decoding.
Sampling with temperature, top-k, and top-p filtering.
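The sampling strategy can be sketched for a single decoding step as follows. This is a generic illustration of temperature scaling followed by top-k and top-p (nucleus) filtering, not EDEN's generation code. The function names and the convention that `top_k=0` disables top-k filtering are assumptions. Greedy decoding corresponds to `top_k=1`.

```python
import math
import random


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-scaled softmax; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs: list[float], top_k: int = 0, top_p: float = 1.0) -> dict[int, float]:
    """Keep the top_k most likely tokens, then the smallest prefix whose
    renormalized cumulative probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    mass = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token_id, p in ranked:
        kept.append((token_id, p))
        cumulative += p / mass
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token_id: p / total for token_id, p in kept}


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Draw one token id from the filtered distribution."""
    dist = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


print(sorted(filter_top_k_top_p([0.5, 0.3, 0.15, 0.05], top_p=0.7)))  # [0, 1]
print(sample_token([0.0, 10.0, 0.0], top_k=1))                        # 1
```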

Long inputs are split into sentence-aware chunks that each fit inside the 512-token window. The chunks are rewritten independently and then joined back together.
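A simplified version of this chunking can be sketched as below. The regex sentence splitter, the greedy packing, and the names `chunk_text` and `rewrite_long` are assumptions for illustration; the real implementation may split sentences and budget tokens differently (for example, reserving room for special tokens). `count_tokens` and `rewrite` are caller-supplied stand-ins for the tokenizer and the model.

```python
import re
from typing import Callable


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(text: str, count_tokens: Callable[[str], int], max_tokens: int = 512) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens.

    A single sentence longer than max_tokens is emitted as its own
    oversized chunk; this sketch does not split inside a sentence.
    """
    chunks, current, current_len = [], [], 0
    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def rewrite_long(text: str, rewrite: Callable[[str], str],
                 count_tokens: Callable[[str], int], max_tokens: int = 512) -> str:
    """Rewrite each chunk independently and join the results."""
    return " ".join(rewrite(c) for c in chunk_text(text, count_tokens, max_tokens))


# Whitespace word count stands in for the BPE tokenizer here.
word_count = lambda s: len(s.split())
print(chunk_text("One two. Three four. Five six.", word_count, max_tokens=4))
# ['One two. Three four.', 'Five six.']
```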

Two code paths, one architecture

The same layer structure is defined in two places:

eden/model.py is the reference model used by the training engine.
modeling_eden.py is the Hugging Face Transformers wrapper.

Because the module names and parameter shapes match, a checkpoint trained with the engine loads into the Transformers model without any key remapping. The script scripts/convert_checkpoint_to_hf.py performs the conversion and writes the weights in safetensors format.
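The "no key remapping" property can be checked by comparing parameter names and shapes from both code paths. The helper below is a hypothetical sketch, not part of the conversion script, and the parameter names in the example are invented. With PyTorch, the inputs could be built as `{k: tuple(v.shape) for k, v in model.state_dict().items()}`.

```python
Shapes = dict[str, tuple[int, ...]]


def compare_state_dicts(engine: Shapes, hf: Shapes) -> dict[str, list[str]]:
    """Report keys that would prevent a direct, remap-free load."""
    shared = set(engine) & set(hf)
    return {
        # expected by the Transformers model but absent from the checkpoint
        "missing": sorted(set(hf) - set(engine)),
        # present in the checkpoint but unknown to the Transformers model
        "unexpected": sorted(set(engine) - set(hf)),
        # same name, different shape
        "mismatched": sorted(k for k in shared if engine[k] != hf[k]),
    }


# Invented parameter names, for illustration only.
engine_shapes = {"embed.weight": (24000, 640), "encoder.norm.weight": (640,)}
hf_shapes = {"embed.weight": (24000, 640), "encoder.norm.weight": (640,)}

report = compare_state_dicts(engine_shapes, hf_shapes)
print(report)  # {'missing': [], 'unexpected': [], 'mismatched': []}
```

An empty report in all three categories is what a remap-free load requires.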