EDEN architecture

EDEN is a standard encoder-decoder Transformer trained from scratch for text enhancement. This document describes how the model is built.

Overview

The model reads a rough source sentence and generates a polished target sentence. It uses a shared byte-level BPE vocabulary for both the input and the output, and the input embedding matrix is tied to the output projection.

rough text
   |
   v
[byte-level BPE tokenizer]
   |
   v
[embedding + sinusoidal positional encoding]
   |
   v
[Transformer encoder, 8 layers]  ->  memory
   |
   v
[Transformer decoder, 8 layers]  (attends to memory, causal self-attention)
   |
   v
[tied linear language-model head]
   |
   v
polished text

Configuration

| Field | Value | Meaning |
|---|---|---|
| vocab_size | 24000 | Byte-level BPE vocabulary size |
| d_model | 640 | Hidden size |
| n_heads | 10 | Attention heads per block |
| n_layers | 8 | Number of encoder layers and number of decoder layers (8 each) |
| dim_feedforward | 2560 | Feed-forward inner size |
| dropout | 0.1 | Dropout probability |
| max_len | 512 | Maximum number of positions |
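
As a quick sanity check on these values, the sketch below mirrors the table in a small dataclass. The class name `EdenConfig` and its helper properties are hypothetical and chosen for illustration; the actual configuration class in the repository may look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdenConfig:
    """Hypothetical container mirroring the configuration table."""

    vocab_size: int = 24000
    d_model: int = 640
    n_heads: int = 10
    n_layers: int = 8  # used for both the encoder and the decoder
    dim_feedforward: int = 2560
    dropout: float = 0.1
    max_len: int = 512

    def __post_init__(self):
        # Multi-head attention splits d_model evenly across the heads.
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")

    @property
    def head_dim(self) -> int:
        # Width of each attention head.
        return self.d_model // self.n_heads

    @property
    def embedding_params(self) -> int:
        # One vocab_size x d_model matrix, counted once because the
        # language-model head reuses it (tied embeddings).
        return self.vocab_size * self.d_model


cfg = EdenConfig()
print(cfg.head_dim)          # 64
print(cfg.embedding_params)  # 15360000
```

With the documented values, each of the 10 heads is 64 wide and the shared embedding matrix holds 24000 × 640 weights.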

Key design choices

  • Tied embeddings. The language-model head shares its weight matrix with the input embedding. This removes one vocab_size × d_model matrix (24000 × 640, about 15.4M weights) from the parameter count and tends to improve quality on vocabulary-heavy tasks.
  • Pre-norm blocks. The encoder and decoder use norm_first=True, which makes deep Transformers more stable to train.
  • GELU activations in the feed-forward blocks.
  • Sinusoidal positional encoding stored as a buffer. In the Transformers integration this buffer is persistent so it is saved and restored correctly through safetensors and meta-device loading.
  • Padding-aware attention. Padding tokens are masked in both the encoder and the decoder, and the decoder uses a causal mask for self-attention.
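
To make the positional encoding concrete, here is a minimal pure-Python sketch of a sinusoidal table. It assumes the standard formulation with base 10000 (sine on even dimensions, cosine on odd ones); the source does not state the exact formula EDEN uses, and the function name is hypothetical. In the real model this table would be stored as a tensor buffer rather than a nested list.

```python
import math


def sinusoidal_encoding(max_len: int, d_model: int) -> list[list[float]]:
    """Return a max_len x d_model table of fixed positional encodings.

    Assumes the standard formulation: sin on even dimensions, cos on odd
    dimensions, with wavelengths growing geometrically (base 10000).
    """
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)
    return table


# Table sized for the documented max_len and d_model.
pe = sinusoidal_encoding(512, 640)
print(len(pe), len(pe[0]))
```

Because the table depends only on `max_len` and `d_model`, it can be computed once and added to the token embeddings at every forward pass.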

Special tokens

| Token | Id | Role |
|---|---|---|
| [UNK] | 0 | Unknown token |
| [PAD] | 1 | Padding |
| [BOS] | 2 | Beginning of sequence and decoder start |
| [EOS] | 3 | End of sequence |
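
The token ids above drive the masks described under the key design choices. The sketch below shows one way they could be built, using plain Python lists in place of tensors. The helper names are hypothetical, the `True` means "blocked" convention is an assumption (it matches PyTorch's boolean masks), and the right-shift in `decoder_inputs` is a common teacher-forcing convention rather than something the source confirms.

```python
PAD_ID, BOS_ID, EOS_ID = 1, 2, 3  # ids from the special-token table


def padding_mask(token_ids):
    """True marks a padding position that attention should ignore."""
    return [tok == PAD_ID for tok in token_ids]


def causal_mask(size):
    """True marks a future position (column > row) that is blocked."""
    return [[col > row for col in range(size)] for row in range(size)]


def decoder_inputs(target_ids):
    """Shift right: prepend [BOS] and drop the last token.

    Assumed teacher-forcing convention, not confirmed by the source.
    """
    return [BOS_ID] + target_ids[:-1]


print(padding_mask([5, 6, EOS_ID, PAD_ID, PAD_ID]))
print(causal_mask(3))
print(decoder_inputs([7, 8, EOS_ID]))
```

The padding mask applies to both encoder and decoder inputs, while the causal mask applies only to decoder self-attention.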

Generation

For inference the model supports three strategies:

  • Beam search (default), with a length penalty and a repetition penalty. This gives the most conservative, reliable edits.
  • Greedy decoding.
  • Sampling with temperature, top-k, and top-p filtering.
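
The sampling strategy can be sketched in a few lines. This is an illustrative pure-Python version of temperature scaling followed by top-k and top-p (nucleus) filtering, not EDEN's actual generation code; the function names, argument defaults, and the order in which the filters are applied are all assumptions.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    """Zero out tokens outside the top-k / nucleus set, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:  # smallest prefix reaching mass top_p
            break
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered


def sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Draw one token id from the filtered distribution."""
    probs = filter_top_k_top_p(softmax(logits, temperature), top_k, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


print(filter_top_k_top_p([0.5, 0.3, 0.15, 0.05], top_k=2))
```

Lower temperatures and smaller `top_k` or `top_p` values concentrate probability on the most likely tokens, which moves sampling closer to greedy decoding.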

Long inputs are split into sentence-aware chunks that each fit inside the 512-token window. The chunks are rewritten independently and then joined back together.
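
A minimal sketch of sentence-aware chunking follows. It is not the project's implementation: the sentence splitter is a naive regex, the function names are hypothetical, and the default `count_tokens` counts whitespace-separated words as a stand-in for the real byte-level BPE tokenizer. A real version would likely also reserve room for special tokens.

```python
import re


def split_sentences(text):
    """Naive rule: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(text, max_tokens=512, count_tokens=lambda s: len(s.split())):
    """Greedily pack whole sentences into chunks of at most max_tokens.

    A single sentence longer than max_tokens becomes its own chunk and
    would still need truncation or further splitting downstream.
    """
    chunks, current, used = [], [], 0
    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        if current and used + n > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five. Six seven eight nine."
print(chunk_text(text, max_tokens=5))
# ['One two three. Four five.', 'Six seven eight nine.']
```

Each chunk would then be passed through the model on its own and the outputs concatenated in order.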

Two code paths, one architecture

The same layer structure is defined in two places:

  • eden/model.py is the reference model used by the training engine.
  • modeling_eden.py is the Hugging Face Transformers wrapper.

Because the module names and shapes match, a checkpoint trained with the engine loads into the Transformers model without any key remapping. The conversion script in scripts/convert_checkpoint_to_hf.py performs this step and writes the safetensors weights.
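
The "no key remapping" property can be checked mechanically by comparing parameter names and shapes from the two models. The sketch below works on plain `{name: shape}` mappings; the function name and the parameter names in the example are made up for illustration and are not taken from the EDEN code.

```python
def compare_state_dicts(engine_shapes, hf_shapes):
    """Compare two {parameter_name: shape} mappings.

    Returns names found only in the engine checkpoint, names found only in
    the Transformers model, and shared names whose shapes differ.
    """
    engine_keys, hf_keys = set(engine_shapes), set(hf_shapes)
    only_in_engine = sorted(engine_keys - hf_keys)
    only_in_hf = sorted(hf_keys - engine_keys)
    shape_mismatch = sorted(
        name
        for name in engine_keys & hf_keys
        if tuple(engine_shapes[name]) != tuple(hf_shapes[name])
    )
    return only_in_engine, only_in_hf, shape_mismatch


# Hypothetical parameter names, for illustration only.
engine = {"embed.weight": (24000, 640), "encoder.norm.weight": (640,)}
hf = {"embed.weight": (24000, 640), "encoder.norm.weight": (640,)}
print(compare_state_dicts(engine, hf))  # ([], [], [])
```

With PyTorch modules, such mappings could be built with `{k: tuple(v.shape) for k, v in model.state_dict().items()}`. Three empty lists mean the checkpoint should load without remapping.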