	# EDEN architecture

	EDEN is a standard encoder-decoder Transformer trained from scratch for text
	enhancement. This document describes how the model is built.

	## Overview

	The model reads a rough source sentence and generates a polished target
	sentence. It uses a shared byte-level BPE vocabulary for both the input and the
	output, and the input embedding matrix is tied to the output projection.

	```
	rough text
	    |
	    v
	[byte-level BPE tokenizer]
	    |
	    v
	[embedding + sinusoidal positional encoding]
	    |
	    v
	[Transformer encoder, 8 layers] -> memory
	    |
	    v
	[Transformer decoder, 8 layers] (attends to memory, causal self-attention)
	    |
	    v
	[tied linear language-model head]
	    |
	    v
	polished text
	```

	## Configuration

	| Field | Value | Meaning |
	| --- | --- | --- |
	| `vocab_size` | 24000 | Byte-level BPE vocabulary size |
	| `d_model` | 640 | Hidden size |
	| `n_heads` | 10 | Attention heads per block |
	| `n_layers` | 8 | Number of encoder layers and of decoder layers (8 each) |
	| `dim_feedforward` | 2560 | Feed-forward inner size |
	| `dropout` | 0.1 | Dropout probability |
	| `max_len` | 512 | Maximum sequence length in positions (tokens) |
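
As a quick sanity check on these values, the table can be written down as a small config object. This is a minimal sketch: the class name `EdenConfig` and its helper properties are hypothetical and are not taken from the repository; only the field names and values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdenConfig:
    """Hypothetical container for the values in the configuration table."""

    vocab_size: int = 24000
    d_model: int = 640
    n_heads: int = 10
    n_layers: int = 8  # used for both the encoder and the decoder
    dim_feedforward: int = 2560
    dropout: float = 0.1
    max_len: int = 512

    def __post_init__(self) -> None:
        # Multi-head attention splits d_model evenly across the heads.
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")

    @property
    def head_dim(self) -> int:
        """Width of a single attention head."""
        return self.d_model // self.n_heads

    @property
    def embedding_params(self) -> int:
        """Number of weights in one vocab_size x d_model matrix."""
        return self.vocab_size * self.d_model


cfg = EdenConfig()
print(cfg.head_dim)          # 640 // 10 = 64
print(cfg.embedding_params)  # 24000 * 640 = 15360000
```

With these values each attention head is 64 dimensions wide and the feed-forward inner size is four times the hidden size.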

	## Key design choices

	* Tied embeddings. The language-model head shares its weight matrix with the
	input embedding. With the configuration above, this removes a separate
	24000 × 640 output matrix (about 15.4M parameters) and tends to improve
	quality on vocabulary-heavy tasks.
	* Pre-norm blocks. The encoder and decoder use `norm_first=True`, which makes
	deep Transformers more stable to train.
	* GELU activations in the feed-forward blocks.
	* Sinusoidal positional encoding stored as a buffer. In the Transformers
	integration, this buffer is persistent, so it is saved and restored correctly
	through safetensors and meta-device loading.
	* Padding-aware attention. Padding tokens are masked in both the encoder and
	the decoder, and the decoder uses a causal mask for self-attention.
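
The choices above can be sketched with stock PyTorch modules. This is an illustration only, not the code in `eden/model.py` or `modeling_eden.py`: the class name `Seq2SeqSketch`, the attribute names, and the use of `nn.Transformer` are assumptions, so the parameter names will not necessarily match a real EDEN checkpoint.

```python
import math

import torch
from torch import nn


def sinusoidal_table(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed sin/cos position table of shape (max_len, d_model)."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )
    table = torch.zeros(max_len, d_model)
    table[:, 0::2] = torch.sin(position * div_term)
    table[:, 1::2] = torch.cos(position * div_term)
    return table


class Seq2SeqSketch(nn.Module):
    """Hypothetical wiring of the design choices; not the EDEN source."""

    def __init__(self, vocab_size=24000, d_model=640, n_heads=10, n_layers=8,
                 dim_feedforward=2560, dropout=0.1, max_len=512, pad_id=1):
        super().__init__()
        self.pad_id = pad_id
        self.embed = nn.Embedding(vocab_size, d_model)
        # Persistent buffer: not trained, but included in state_dict().
        self.register_buffer("pe", sinusoidal_table(max_len, d_model),
                             persistent=True)
        self.dropout = nn.Dropout(dropout)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            dim_feedforward=dim_feedforward, dropout=dropout,
            activation="gelu",   # GELU feed-forward blocks
            norm_first=True,     # pre-norm blocks
            batch_first=True,
        )
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight  # tied embeddings

    def _embed(self, ids: torch.Tensor) -> torch.Tensor:
        return self.dropout(self.embed(ids) + self.pe[: ids.size(1)])

    def forward(self, src_ids: torch.Tensor, tgt_ids: torch.Tensor) -> torch.Tensor:
        src_pad = src_ids.eq(self.pad_id)  # True where padding
        tgt_pad = tgt_ids.eq(self.pad_id)
        t = tgt_ids.size(1)
        # True above the diagonal: a position may not attend to later ones.
        causal = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=tgt_ids.device), diagonal=1
        )
        hidden = self.transformer(
            self._embed(src_ids), self._embed(tgt_ids),
            tgt_mask=causal,
            src_key_padding_mask=src_pad,
            tgt_key_padding_mask=tgt_pad,
            memory_key_padding_mask=src_pad,
        )
        return self.lm_head(hidden)  # (batch, tgt_len, vocab_size)
```

Because the head and the embedding are the same `Parameter` object, the tied matrix is counted once by `model.parameters()`, and the `pe` buffer appears in `state_dict()` alongside the weights.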

	## Special tokens

	| Token | Id | Role |
	| --- | --- | --- |
	| `[UNK]` | 0 | Unknown token |
	| `[PAD]` | 1 | Padding |
	| `[BOS]` | 2 | Beginning of sequence and decoder start |
	| `[EOS]` | 3 | End of sequence |
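
To show how these ids are typically used, the sketch below frames a target sequence for teacher forcing and pads a batch. The helper names are hypothetical, and the exact framing used by the EDEN training engine is an assumption; only the ids come from the table.

```python
UNK_ID, PAD_ID, BOS_ID, EOS_ID = 0, 1, 2, 3  # ids from the table above


def frame_target(token_ids):
    """Return (decoder_input, labels) for teacher forcing.

    Assumption: the decoder input starts with [BOS] and the labels are the
    same tokens shifted left by one position, ending with [EOS].
    """
    decoder_input = [BOS_ID] + list(token_ids)
    labels = list(token_ids) + [EOS_ID]
    return decoder_input, labels


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad sequences to a common length.

    Returns the padded rows and a mask that is True at padded positions,
    which is the convention for PyTorch key-padding masks.
    """
    width = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_id] * (width - len(seq)) for seq in sequences]
    mask = [[i >= len(seq) for i in range(width)] for seq in sequences]
    return padded, mask


decoder_input, labels = frame_target([10, 11, 12])
print(decoder_input)  # [2, 10, 11, 12]
print(labels)         # [10, 11, 12, 3]
```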

	## Generation

	For inference, the model supports three decoding strategies:

	* Beam search (default), with a length penalty and a repetition penalty.
	This gives the most conservative, reliable edits.
	* Greedy decoding.
	* Sampling with temperature, top-k, and top-p filtering.
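
The sampling strategy can be illustrated with a generic filter over next-token logits. This is a sketch of the standard technique, not EDEN's generation code: the function name is hypothetical, and applying top-k before top-p with renormalization in between is an assumption that follows common practice.

```python
import math


def filtered_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Apply temperature, then top-k, then top-p; return probabilities.

    Tokens removed by a filter get probability 0.0; the rest are
    renormalized so the result sums to 1.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]  # keep only the k most likely tokens
    mass_k = sum(probs[i] for i in order)

    # Keep the smallest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass_k
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    result = [0.0] * len(probs)
    for i in kept:
        result[i] = probs[i] / mass
    return result
```

A next token could then be drawn with `random.choices(range(len(result)), weights=result)`. Lower temperatures and smaller `top_k` or `top_p` values concentrate the distribution on the most likely tokens.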

	Long inputs are split into sentence-aware chunks that each fit inside the
	512-token window. The chunks are rewritten independently and then joined back
	together.
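
A minimal sketch of sentence-aware chunking is shown below. It is not the repository's implementation: the function names, the regex-based sentence splitter, and the caller-supplied `count_tokens` function are all assumptions. In practice `count_tokens` would wrap the BPE tokenizer, and the budget would leave headroom for special tokens.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(text, count_tokens, max_tokens=512):
    """Greedily pack whole sentences into chunks of at most max_tokens.

    A single sentence longer than max_tokens is kept as its own chunk and
    would need further splitting by the caller.
    """
    chunks, current, current_len = [], [], 0
    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        # Start a new chunk if this sentence would overflow the current one.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


# Whitespace word count stands in for a real tokenizer here.
text = "One two three. Four five. Six seven eight nine. Ten."
print(chunk_text(text, lambda s: len(s.split()), max_tokens=5))
# ['One two three. Four five.', 'Six seven eight nine. Ten.']
```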

	## Two code paths, one architecture

	The same layer structure is defined in two places:

	* `eden/model.py` is the reference model used by the training engine.
	* `modeling_eden.py` is the Hugging Face Transformers wrapper.

	Because the module names and shapes match, a checkpoint trained with the engine
	loads into the Transformers model without any key remapping. The conversion
	script in `scripts/convert_checkpoint_to_hf.py` performs this load and writes
	the weights in safetensors format.
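
One way to check the "no key remapping" property is to compare parameter names and shapes from the two code paths. The helper below is a hypothetical sketch and is not part of the conversion script; the parameter names in the example are made up for illustration.

```python
def compare_checkpoints(reference, wrapper):
    """Compare two {parameter_name: shape} mappings.

    Returns names missing from the wrapper, names the reference does not
    have, and names present in both whose shapes differ.
    """
    ref_names, wrap_names = set(reference), set(wrapper)
    return {
        "missing": sorted(ref_names - wrap_names),
        "unexpected": sorted(wrap_names - ref_names),
        "shape_mismatch": sorted(
            name for name in ref_names & wrap_names
            if tuple(reference[name]) != tuple(wrapper[name])
        ),
    }


# Illustrative names only; real EDEN parameter names may differ.
engine_shapes = {"embed.weight": (24000, 640), "pe": (512, 640)}
hf_shapes = {"embed.weight": (24000, 640), "pe": (512, 640)}
print(compare_checkpoints(engine_shapes, hf_shapes))
# {'missing': [], 'unexpected': [], 'shape_mismatch': []}
```

With PyTorch models, each mapping can be built as `{k: tuple(v.shape) for k, v in model.state_dict().items()}`. Three empty lists mean the checkpoint can be loaded directly.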