# Scaffold GPT-2 Portuguese
A 272M-parameter GPT-2 trained from scratch on Portuguese news articles, using scaffold (countdown) tokens for exact length control.
## Resources
| Resource | Link |
|---|---|
| Dataset | viniciusxpb/scaffold-tokens-dataset |
| Training code | github.com/viniciusxpb/scaffold-tokens |
| Author | Vinícius França |
## What are Scaffold Tokens?

Each word in the training data is preceded by a countdown token `<ff_N>` that tells the model how many words remain:

```
<ff_6> O <ff_5> presidente <ff_4> anunciou <ff_3> novas <ff_2> medidas <ff_1> econômicas <ff_0> .
```
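A minimal sketch of how this scaffolding could be applied to plain text (the `scaffold` helper is illustrative; the repo's actual preprocessing may differ, e.g. in how punctuation is tokenized):

```python
def scaffold(text: str) -> str:
    """Prefix each whitespace-separated token with a <ff_N> countdown of tokens remaining."""
    tokens = text.split()
    n = len(tokens)
    return " ".join(f"<ff_{n - 1 - i}> {tok}" for i, tok in enumerate(tokens))

print(scaffold("O presidente anunciou novas medidas econômicas ."))
# → <ff_6> O <ff_5> presidente <ff_4> anunciou <ff_3> novas <ff_2> medidas <ff_1> econômicas <ff_0> .
```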
The model achieves 100% length accuracy (53/53 exact match) across all tested ranges (50 to 999 words).
## Model Details
| Field | Value |
|---|---|
| Architecture | GPT-2 272M (12 layers, 768 dim, 6 heads) |
| Parameters | 272M (75M transformer + 118M value embeddings + 79M embeddings/head) |
| Vocab | 51,264 (50,257 BPE + 1,000 FF + 7 padding) |
| Tokenizer | tiktoken GPT-2 |
| Precision | BF16 |
| Training | 1,750 steps, 55 min on RTX 3060 |
| Final val loss | 1.381 |
| Dataset | scaffold-tokens-dataset (~208M tokens) |
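The vocabulary split in the table can be sanity-checked in a couple of lines (the multiple-of-64 rationale for the 7 padding tokens is an inference, not stated above):

```python
GPT2_BPE = 50_257   # standard GPT-2 BPE vocabulary
N_FF = 1_000        # <ff_0> … <ff_999> countdown tokens
PADDING = 7         # likely rounds the vocab up to a multiple of 64

VOCAB = GPT2_BPE + N_FF + PADDING
print(VOCAB, VOCAB % 64)  # → 51264 0
```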
## Quick Start

```bash
# Clone the repo
git clone https://github.com/viniciusxpb/scaffold-tokens
cd scaffold-tokens

# Setup, download model, and generate
make setup
make download-model
make generate
```
## Usage

```python
import torch
import tiktoken

# Load the checkpoint
ckpt = torch.load("model.pt", map_location="cuda", weights_only=False)

# The model uses a custom architecture (not HuggingFace Transformers).
# See the full inference code at the GitHub repo.
```
For full inference with forced countdown generation, see the training repository:
github.com/viniciusxpb/scaffold-tokens
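A rough sketch of what forced countdown generation could look like: the `<ff_k>` tokens are injected deterministically at every other position, so the model only chooses the words in between. `sample_word` is a hypothetical stand-in for the model's sampling step; the repo's actual decoding loop may differ:

```python
def generate_with_countdown(sample_word, n_words: int) -> str:
    """Interleave forced <ff_k> countdown tokens with model-sampled words."""
    out = []
    for k in range(n_words - 1, -1, -1):
        out.append(f"<ff_{k}>")       # forced: the countdown token is never sampled
        out.append(sample_word(out))  # free: the model picks the next word
    return " ".join(out)

# With a dummy sampler, a 3-word request yields exactly 3 word slots:
print(generate_with_countdown(lambda ctx: "palavra", 3))
# → <ff_2> palavra <ff_1> palavra <ff_0> palavra
```

Because the scaffold positions are forced, the output length is guaranteed by construction; the model's job reduces to writing fluent text that fits the countdown.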
## Training Data

Trained on scaffold-tokens-dataset: Portuguese news articles from Folha de S.Paulo (public domain), pre-tokenized with `<ff_N>` countdown tokens.
## Results
| Metric | Value |
|---|---|
| Length control accuracy | 100% (53/53 exact match) |
| Final word loss | 2.08 |
| Final FF loss | 0.045 |
| Peak VRAM | 5.8 GB / 12 GB |
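The exact-match length metric above can be computed by stripping the scaffold tokens and counting what remains (a sketch with illustrative function names, not the repo's evaluation code):

```python
def word_count(generated: str) -> int:
    """Count output words, ignoring <ff_N> scaffold tokens."""
    return sum(1 for t in generated.split() if not t.startswith("<ff_"))

def length_exact_match(generated: str, target: int) -> bool:
    """True iff the generation contains exactly `target` non-scaffold words."""
    return word_count(generated) == target

print(length_exact_match("<ff_1> olá <ff_0> mundo", 2))  # → True
```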
## Checkpoint Format

The file `model.pt` is a PyTorch checkpoint containing:

```python
{
    "model": OrderedDict,  # state_dict (weights only, no optimizer)
    "step": 1750,
    "val_loss": 1.381,
}
```
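For a quick sanity check after downloading, the checkpoint's metadata can be read without instantiating the model. The `torch.load` call in the comment is the real entry point; the literal dict below is just a stand-in with the same shape:

```python
from collections import OrderedDict

# In practice: ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
ckpt = {"model": OrderedDict(), "step": 1750, "val_loss": 1.381}  # stand-in

print(sorted(ckpt))  # → ['model', 'step', 'val_loss']
print(ckpt["step"], ckpt["val_loss"])  # → 1750 1.381
```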
## Citation

```bibtex
@misc{scaffold-tokens-2025,
  title={Scaffold Tokens: Teaching LLMs to Plan with Countdown Tokens},
  author={Vinícius França},
  year={2025},
  url={https://github.com/viniciusxpb/scaffold-tokens}
}
```