# EDEN architecture

EDEN is a standard encoder-decoder Transformer trained from scratch for text
enhancement. This document describes how the model is built.

## Overview

The model reads a rough source sentence and generates a polished target
sentence. It uses a shared byte-level BPE vocabulary for both the input and the
output, and the input embedding matrix is tied to the output projection.

```
rough text
   |
   v
[byte-level BPE tokenizer]
   |
   v
[embedding + sinusoidal positional encoding]
   |
   v
[Transformer encoder, 8 layers]  ->  memory
   |
   v
[Transformer decoder, 8 layers]  (attends to memory, causal self-attention)
   |
   v
[tied linear language-model head]
   |
   v
polished text
```

## Configuration

| Field | Value | Meaning |
| --- | --- | --- |
| `vocab_size` | 24000 | Byte-level BPE vocabulary size |
| `d_model` | 640 | Hidden size |
| `n_heads` | 10 | Attention heads per block |
| `n_layers` | 8 | Number of encoder layers and of decoder layers (8 each) |
| `dim_feedforward` | 2560 | Feed-forward inner size |
| `dropout` | 0.1 | Dropout probability |
| `max_len` | 512 | Maximum sequence length in token positions |
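
As a rough illustration, the table above could be captured in a small dataclass. `EdenConfig`, `head_dim`, and `embedding_params` are hypothetical names chosen for this sketch, not necessarily what the repository uses. Only the field names and values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdenConfig:
    """Hypothetical container mirroring the configuration table."""

    vocab_size: int = 24000
    d_model: int = 640
    n_heads: int = 10
    n_layers: int = 8
    dim_feedforward: int = 2560
    dropout: float = 0.1
    max_len: int = 512

    def __post_init__(self) -> None:
        # Multi-head attention splits d_model evenly across the heads.
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")

    @property
    def head_dim(self) -> int:
        # Width of a single attention head.
        return self.d_model // self.n_heads

    @property
    def embedding_params(self) -> int:
        # Size of one vocab_size x d_model matrix.
        return self.vocab_size * self.d_model


config = EdenConfig()
print(config.head_dim)          # 64
print(config.embedding_params)  # 15360000
```

With these values each head is 64 wide and the feed-forward inner size is four times `d_model`.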

## Key design choices

* **Tied embeddings.** The language-model head shares its weight matrix with the
  input embedding. This saves one `vocab_size` × `d_model` matrix (24000 × 640,
  about 15.4M parameters) and tends to improve quality on vocabulary-heavy tasks.
* **Pre-norm blocks.** The encoder and decoder use `norm_first=True`, so layer
  normalization runs before each attention and feed-forward sub-layer. This makes
  deep Transformers more stable to train.
* **GELU activations** in the feed-forward blocks.
* **Sinusoidal positional encoding** stored as a buffer. In the Transformers
  integration, this buffer is persistent, so it is saved and restored correctly
  through safetensors and meta-device loading.
* **Padding-aware attention.** Padding tokens are masked in both the encoder and
  the decoder, and the decoder uses a causal mask for self-attention.
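
The positional encoding and the two masks from the list above can be sketched in plain Python. This is a minimal illustration, not the code in `eden/model.py`: the function names are hypothetical, the encoding is assumed to follow the usual sine/cosine formula with base 10000, and `True` is assumed to mark a blocked position (the convention PyTorch uses for boolean masks).

```python
from __future__ import annotations

import math


def sinusoidal_encoding(max_len: int, d_model: int) -> list[list[float]]:
    """Return a max_len x d_model table: sine on even columns, cosine on odd."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(d_model):
            # Columns 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def causal_mask(size: int) -> list[list[bool]]:
    """True marks a blocked pair: query i may not attend to key j > i."""
    return [[j > i for j in range(size)] for i in range(size)]


def padding_mask(token_ids: list[int], pad_id: int = 1) -> list[bool]:
    """True marks a padding token that attention should ignore."""
    return [tok == pad_id for tok in token_ids]


pe = sinusoidal_encoding(4, 8)
print(pe[0][:2])                      # [0.0, 1.0]
print(causal_mask(3)[0])              # [False, True, True]
print(padding_mask([2, 57, 3, 1, 1])) # [False, False, False, True, True]
```

Because the table depends only on `max_len` and `d_model`, it can be computed once and stored as a buffer rather than learned.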

## Special tokens

| Token | Id | Role |
| --- | --- | --- |
| `[UNK]` | 0 | Unknown token |
| `[PAD]` | 1 | Padding |
| `[BOS]` | 2 | Beginning of sequence and decoder start |
| `[EOS]` | 3 | End of sequence |
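
To show how these ids are typically used, here is a small sketch of building a decoder input and padding a batch. The ids come from the table above. The helper names and the teacher-forcing layout (decoder input starts with `[BOS]`, labels end with `[EOS]`) are assumptions about a conventional setup, not a description of the training engine.

```python
UNK_ID, PAD_ID, BOS_ID, EOS_ID = 0, 1, 2, 3


def make_decoder_pair(target_ids):
    """Assumed teacher-forcing layout: input is [BOS] + target, labels are target + [EOS]."""
    decoder_input = [BOS_ID] + list(target_ids)
    labels = list(target_ids) + [EOS_ID]
    return decoder_input, labels


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_id] * (width - len(seq)) for seq in sequences]


decoder_input, labels = make_decoder_pair([10, 11])
print(decoder_input)  # [2, 10, 11]
print(labels)         # [10, 11, 3]
print(pad_batch([[2, 10], [2, 10, 11, 12]]))  # [[2, 10, 1, 1], [2, 10, 11, 12]]
```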

## Generation

For inference, the model supports three decoding strategies:

* **Beam search** (default), with a length penalty and a repetition penalty.
  This gives the most conservative, reliable edits.
* **Greedy** decoding.
* **Sampling** with temperature, top-k, and top-p filtering.

Long inputs are split into sentence-aware chunks that each fit inside the
512-token window, rewritten independently, and joined back together.
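
The chunking step can be sketched as a greedy packer over sentences. This is an illustration under stated assumptions, not the project's implementation: `count_tokens` stands in for the real byte-level BPE tokenizer by counting whitespace-separated words, the sentence splitter is a simple regular expression, and `rewrite` is any hypothetical callable that polishes one chunk.

```python
from __future__ import annotations

import re
from typing import Callable


def count_tokens(text: str) -> int:
    # Stand-in for the real tokenizer: counts whitespace-separated words.
    return len(text.split())


def chunk_sentences(text: str, max_tokens: int = 512) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk when adding this sentence would overflow the budget.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        # A single sentence longer than max_tokens is kept as its own chunk.
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def rewrite_long_text(text: str, rewrite: Callable[[str], str], max_tokens: int = 512) -> str:
    """Rewrite each chunk independently and join the results."""
    return " ".join(rewrite(chunk) for chunk in chunk_sentences(text, max_tokens))


text = "One two three. Four five. Six seven eight nine."
print(chunk_sentences(text, max_tokens=5))
# ['One two three. Four five.', 'Six seven eight nine.']
```

In practice the budget would likely be set a little below 512 to leave room for special tokens.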

## Two code paths, one architecture

The same layer structure is defined in two places:

* `eden/model.py` is the reference model used by the training engine.
* `modeling_eden.py` is the Hugging Face Transformers wrapper.

Because the module names and shapes match, a checkpoint trained with the engine
loads into the Transformers model without any key remapping. The conversion
script in `scripts/convert_checkpoint_to_hf.py` performs this step and writes the
safetensors weights.
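
One way to check the "no key remapping" property is to compare parameter names and shapes from both models. The helper below is a hypothetical sketch that works on plain `{name: shape}` mappings. The parameter names in the example are invented for illustration and are not taken from the repository.

```python
def compare_state_dicts(engine_shapes, hf_shapes):
    """Compare two {parameter_name: shape} mappings and report differences."""
    engine_keys, hf_keys = set(engine_shapes), set(hf_shapes)
    only_in_engine = sorted(engine_keys - hf_keys)
    only_in_hf = sorted(hf_keys - engine_keys)
    shape_mismatch = sorted(
        name
        for name in engine_keys & hf_keys
        if tuple(engine_shapes[name]) != tuple(hf_shapes[name])
    )
    return only_in_engine, only_in_hf, shape_mismatch


# Hypothetical parameter names, used only to demonstrate the check.
engine = {"embedding.weight": (24000, 640), "lm_head.weight": (24000, 640)}
hf = {"embedding.weight": (24000, 640), "lm_head.weight": (24000, 640)}
print(compare_state_dicts(engine, hf))  # ([], [], [])
```

Three empty lists mean the checkpoint should load directly. With real models, the shape mappings could be built from each model's state dict.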