# EDEN architecture
EDEN is a standard encoder-decoder Transformer trained from scratch for text
enhancement. This document describes how the model is built.
## Overview
The model reads a rough source sentence and generates a polished target
sentence. It uses a shared byte-level BPE vocabulary for both the input and the
output, and the input embedding matrix is tied to the output projection.
```
rough text
|
v
[byte-level BPE tokenizer]
|
v
[embedding + sinusoidal positional encoding]
|
v
[Transformer encoder, 8 layers] -> memory
|
v
[Transformer decoder, 8 layers] (attends to memory, causal self-attention)
|
v
[tied linear language-model head]
|
v
polished text
```
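The positional-encoding stage in the diagram can be illustrated with a small standard-library sketch. It assumes the standard sinusoidal formulation (sine on even dimensions, cosine on odd ones, base 10000); the function name `sinusoidal_positions` is hypothetical and is not taken from the EDEN code, which would normally store the table as a tensor buffer.

```python
import math
from typing import List


def sinusoidal_positions(max_len: int, d_model: int) -> List[List[float]]:
    """Return a max_len x d_model table: sine on even columns, cosine on odd ones."""
    table = []
    for pos in range(max_len):
        row = [0.0] * d_model
        for i in range(0, d_model, 2):
            # i is the even column index, so the exponent is i / d_model.
            angle = pos / (10000 ** (i / d_model))
            row[i] = math.sin(angle)
            if i + 1 < d_model:
                row[i + 1] = math.cos(angle)
        table.append(row)
    return table


# Sizes taken from the configuration table: 512 positions, hidden size 640.
table = sinusoidal_positions(512, 640)
```

Each row is added to the token embedding at the same position, so the table needs one row per position up to `max_len`.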
## Configuration
| Field | Value | Meaning |
| --- | --- | --- |
| `vocab_size` | 24000 | Byte-level BPE vocabulary size |
| `d_model` | 640 | Hidden size |
| `n_heads` | 10 | Attention heads per block |
| `n_layers` | 8 | Number of encoder layers and number of decoder layers (8 each) |
| `dim_feedforward` | 2560 | Feed-forward inner size |
| `dropout` | 0.1 | Dropout probability |
| `max_len` | 512 | Maximum sequence length in tokens (number of positions) |
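As a sanity check on these values, the sketch below holds them in a small dataclass and derives two quantities: the per-head size and a rough parameter count. The names `EdenConfigSketch` and `estimate_parameters` are hypothetical. The count assumes PyTorch-style layers with biases, two layer norms per encoder layer, three per decoder layer, no final norms, and a bias-free tied head, so it is an estimate and not a figure reported by the EDEN code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdenConfigSketch:
    """Illustrative container for the table above; the real config class may differ."""
    vocab_size: int = 24000
    d_model: int = 640
    n_heads: int = 10
    n_layers: int = 8
    dim_feedforward: int = 2560
    dropout: float = 0.1
    max_len: int = 512

    @property
    def head_dim(self) -> int:
        # Multi-head attention splits d_model evenly across the heads.
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        return self.d_model // self.n_heads


def estimate_parameters(cfg: EdenConfigSketch) -> int:
    """Rough trainable-parameter count under the assumptions stated above."""
    d, ff = cfg.d_model, cfg.dim_feedforward
    attention = 4 * d * d + 4 * d        # Q, K, V, and output projections with biases
    feed_forward = 2 * d * ff + ff + d   # two linear layers with biases
    layer_norm = 2 * d                   # scale and shift vectors
    encoder_layer = attention + feed_forward + 2 * layer_norm
    decoder_layer = 2 * attention + feed_forward + 3 * layer_norm
    embedding = cfg.vocab_size * d       # counted once because the head is tied
    return embedding + cfg.n_layers * (encoder_layer + decoder_layer)


cfg = EdenConfigSketch()
print(cfg.head_dim, estimate_parameters(cfg))
```

With these values each head works on 64 dimensions, and the estimate comes to roughly 107 million parameters.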
## Key design choices
* **Tied embeddings.** The language-model head shares its weight matrix with the
input embedding. This saves one 24000 x 640 matrix (about 15.4M parameters) and
tends to improve quality when the input and output share a vocabulary, as they
do here.
* **Pre-norm blocks.** The encoder and decoder use `norm_first=True`, which
applies layer normalization before each attention and feed-forward sublayer and
generally makes deep Transformers more stable to train.
* **GELU activations** in the feed-forward blocks.
* **Sinusoidal positional encoding** stored as a buffer. In the Transformers
integration this buffer is persistent so it is saved and restored correctly
through safetensors and meta-device loading.
* **Padding-aware attention.** Padding tokens are masked in both the encoder and
the decoder, and the decoder uses a causal mask for self-attention.
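The masking rules in the last bullet can be sketched with plain Python lists. The sketch assumes the boolean convention used by PyTorch attention masks, where `True` marks a position that must not be attended to. The function names are hypothetical, and the real model would build these as tensors.

```python
from typing import List

PAD_ID = 1  # id of [PAD] in the special-token table


def padding_mask(token_ids: List[int]) -> List[bool]:
    """True marks a padding position that attention should ignore."""
    return [tok == PAD_ID for tok in token_ids]


def causal_mask(length: int) -> List[List[bool]]:
    """True at [i][j] means position i may not attend to a later position j."""
    return [[j > i for j in range(length)] for i in range(length)]


def decoder_self_attention_mask(token_ids: List[int]) -> List[List[bool]]:
    """Combine both rules: block future positions and padded keys."""
    pad = padding_mask(token_ids)
    future = causal_mask(len(token_ids))
    return [[future[i][j] or pad[j] for j in range(len(token_ids))]
            for i in range(len(token_ids))]


print(padding_mask([2, 10, 3, 1, 1]))
print(causal_mask(3))
```

The encoder only needs the padding mask, while decoder self-attention needs both; cross-attention would reuse the encoder's padding mask for the memory.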
## Special tokens
| Token | Id | Role |
| --- | --- | --- |
| `[UNK]` | 0 | Unknown token |
| `[PAD]` | 1 | Padding |
| `[BOS]` | 2 | Beginning of sequence and decoder start |
| `[EOS]` | 3 | End of sequence |
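To show how these ids might be used during training, here is a minimal sketch that builds a decoder input and its labels from a target sequence. It assumes ordinary teacher forcing: the decoder input starts with `[BOS]`, the labels end with `[EOS]`, and both are padded with `[PAD]`. The helper `build_decoder_pair` is hypothetical, and the actual training engine may prepare batches differently.

```python
from typing import List, Tuple

UNK_ID, PAD_ID, BOS_ID, EOS_ID = 0, 1, 2, 3  # ids from the table above


def build_decoder_pair(target_ids: List[int], length: int) -> Tuple[List[int], List[int]]:
    """Return (decoder_input, labels), both padded to the same length."""
    decoder_input = [BOS_ID] + list(target_ids)   # shifted right by one position
    labels = list(target_ids) + [EOS_ID]          # what the decoder should predict
    if len(decoder_input) > length:
        raise ValueError("target does not fit in the requested length")
    pad = [PAD_ID] * (length - len(decoder_input))
    return decoder_input + pad, labels + pad


decoder_input, labels = build_decoder_pair([10, 11, 12], length=6)
print(decoder_input)  # [2, 10, 11, 12, 1, 1]
print(labels)         # [10, 11, 12, 3, 1, 1]
```

At inference time the decoder would start from `[BOS]` alone and stop once it emits `[EOS]`.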
## Generation
For inference the model supports three strategies:
* **Beam search** (default), with a length penalty and a repetition penalty.
This gives the most conservative, reliable edits.
* **Greedy** decoding.
* **Sampling** with temperature, top-k, and top-p filtering.
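The sampling strategy can be illustrated on a single step's logits. This sketch applies temperature scaling, then top-k, then top-p (nucleus) filtering, and renormalizes what remains. The function names and defaults are assumptions made for illustration and do not reflect EDEN's actual generation code.

```python
import math
import random
from typing import List


def filtered_distribution(logits: List[float], temperature: float = 1.0,
                          top_k: int = 0, top_p: float = 1.0) -> List[float]:
    """Return sampling probabilities after temperature, top-k, and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Rank token ids from most to least likely.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    # Softmax over the surviving tokens (subtract the max for numerical stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i, w in zip(order, weights):
        p = w / total
        kept.append((i, p))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    result = [0.0] * len(logits)
    for i, p in kept:
        result[i] = p / norm
    return result


def sample(logits: List[float], rng: random.Random, **kwargs) -> int:
    """Draw one token id from the filtered distribution."""
    probs = filtered_distribution(logits, **kwargs)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


print(sample([2.0, 1.0, 0.0, -1.0], random.Random(0), top_k=2, top_p=0.9))
```

Setting `top_k=1` reduces sampling to greedy decoding, which is one way to see how the strategies relate.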
Long inputs are split into sentence-aware chunks that each fit inside the
512-token window, rewritten independently, and joined back together.
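The chunking step might look like the following sketch. It assumes a simple punctuation-based sentence splitter and uses a whitespace word count as a stand-in for the real tokenizer. `split_into_chunks`, `rewrite_long_text`, and the `count_tokens` callback are hypothetical names that are not taken from the EDEN code.

```python
import re
from typing import Callable, List


def split_into_chunks(text: str, max_tokens: int = 512,
                      count_tokens: Callable[[str], int] = lambda s: len(s.split())
                      ) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens tokens."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            # A single oversized sentence is kept whole; this sketch does not split it.
            current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


def rewrite_long_text(text: str, rewrite: Callable[[str], str],
                      max_tokens: int = 512) -> str:
    """Rewrite each chunk independently and join the results with spaces."""
    return " ".join(rewrite(chunk) for chunk in split_into_chunks(text, max_tokens))


text = "One two three. Four five six. Seven eight nine."
print(split_into_chunks(text, max_tokens=6))
```

In practice the budget would be measured with the BPE tokenizer and would likely reserve room for special tokens such as `[BOS]` and `[EOS]`.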
## Two code paths, one architecture
The same layer structure is defined in two places:
* `eden/model.py` is the reference model used by the training engine.
* `modeling_eden.py` is the Hugging Face Transformers wrapper.
Because the module names and shapes match, a checkpoint trained with the engine
loads into the Transformers model without any key remapping. The conversion
script in `scripts/convert_checkpoint_to_hf.py` performs this step and writes the
safetensors weights.
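One way to verify the "no key remapping" property is to compare parameter names and shapes from both code paths. The sketch below works on plain name-to-shape mappings so it runs without any model. The function `compare_state_dicts` and the example key names are hypothetical. With PyTorch, such a mapping could be built as `{k: tuple(v.shape) for k, v in model.state_dict().items()}`.

```python
from typing import Dict, List, Tuple

Shapes = Dict[str, Tuple[int, ...]]


def compare_state_dicts(reference: Shapes, wrapper: Shapes) -> Dict[str, List[str]]:
    """Report keys that are missing, unexpected, or differently shaped."""
    ref_keys, wrap_keys = set(reference), set(wrapper)
    return {
        "missing": sorted(ref_keys - wrap_keys),       # in reference only
        "unexpected": sorted(wrap_keys - ref_keys),    # in wrapper only
        "mismatched": sorted(k for k in ref_keys & wrap_keys
                             if tuple(reference[k]) != tuple(wrapper[k])),
    }


# Hypothetical key names, used only to exercise the check.
reference = {"embed.weight": (24000, 640), "encoder.layers.0.norm1.weight": (640,)}
wrapper = {"embed.weight": (24000, 640), "encoder.layers.0.norm1.weight": (640,)}
report = compare_state_dicts(reference, wrapper)
print(report)
```

An empty report for all three lists is what a remapping-free conversion would be expected to produce.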