# Scaffold GPT-2 Portuguese

A 272M-parameter GPT-2 trained from scratch on Portuguese news articles, using scaffold (countdown) tokens for exact length control.


## What are Scaffold Tokens?

Each word in the training data is preceded by a countdown token `<ff_N>` that tells the model how many words remain:

```
<ff_6> O <ff_5> presidente <ff_4> anunciou <ff_3> novas <ff_2> medidas <ff_1> econômicas <ff_0> .
```

The model achieves 100% length accuracy (53/53 exact matches) across all tested lengths, from 50 to 999 words.
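The scaffolding scheme above can be sketched as a small preprocessing function. This is a hypothetical illustration of the interleaving, not the repo's actual tokenization code:

```python
def scaffold(text: str) -> str:
    """Interleave <ff_N> countdown tokens with the words of `text`.

    The word at position i (of n words total) is preceded by
    <ff_{n-1-i}>, so the last word is preceded by <ff_0>.
    """
    words = text.split()
    n = len(words)
    parts = []
    for i, word in enumerate(words):
        parts.append(f"<ff_{n - 1 - i}>")  # words remaining after this one
        parts.append(word)
    return " ".join(parts)

print(scaffold("O presidente anunciou novas medidas econômicas ."))
```

Applied to the seven-token example above, this reproduces the `<ff_6> … <ff_0>` sequence exactly.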

## Model Details

| Field | Value |
|-------|-------|
| Architecture | GPT-2 272M (12 layers, 768 dim, 6 heads) |
| Parameters | 272M (75M transformer + 118M value embeddings + 79M embeddings/head) |
| Vocab | 51,264 (50,257 BPE + 1,000 FF + 7 padding) |
| Tokenizer | tiktoken GPT-2 |
| Precision | BF16 |
| Training | 1,750 steps, 55 min on an RTX 3060 |
| Final val loss | 1.381 |
| Dataset | scaffold-tokens-dataset (~208M tokens) |

## Quick Start

```bash
# Clone the repo
git clone https://github.com/viniciusxpb/scaffold-tokens
cd scaffold-tokens

# Setup, download model, and generate
make setup
make download-model
make generate
```

## Usage

```python
import torch
import tiktoken

# Load the checkpoint (weights_only=False because the file stores
# Python scalars such as step and val_loss alongside the weights)
ckpt = torch.load("model.pt", map_location="cuda", weights_only=False)
state_dict = ckpt["model"]

# The tokenizer is the standard tiktoken GPT-2 BPE
enc = tiktoken.get_encoding("gpt2")

# The model uses a custom architecture (not HuggingFace Transformers).
# See the full inference code at the GitHub repo.
```

For full inference with forced countdown generation, see the training repository: https://github.com/viniciusxpb/scaffold-tokens
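The "forced countdown" idea can be sketched as a decoding loop that *forces* each `<ff_N>` token (rather than sampling it) and only lets the model choose the words in between. `sample_word` below is a stand-in for the real model call; none of these names come from the repo:

```python
def generate_with_countdown(sample_word, n_words: int) -> str:
    """Generate exactly n_words words by forcing countdown tokens.

    At each step the <ff_N> token is appended deterministically,
    then `sample_word(context)` picks the next word from the model.
    """
    parts = []
    for remaining in range(n_words - 1, -1, -1):
        parts.append(f"<ff_{remaining}>")   # forced, never sampled
        parts.append(sample_word(parts))    # model chooses the word
    return " ".join(parts)

# Toy stand-in for the model: always emits the same word.
out = generate_with_countdown(lambda ctx: "palavra", 3)
print(out)  # → <ff_2> palavra <ff_1> palavra <ff_0> palavra
```

Because the countdown tokens are injected rather than predicted, the output length is exact by construction; the model only has to learn to produce coherent words under the given countdown.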

## Training Data

Trained on scaffold-tokens-dataset: Portuguese news articles from Folha de S.Paulo (public domain), pre-tokenized with `<ff_N>` countdown tokens.

## Results

| Metric | Value |
|--------|-------|
| Length control accuracy | 100% (53/53 exact match) |
| Final word loss | 2.08 |
| Final FF loss | 0.045 |
| Peak VRAM | 5.8 GB / 12 GB |

## Checkpoint Format

The file `model.pt` is a PyTorch checkpoint containing:

```python
{
    "model": OrderedDict,   # state_dict (weights only, no optimizer)
    "step": 1750,
    "val_loss": 1.381,
}
```
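As a sanity check on the format, a checkpoint of the same shape can be built in memory. The tensor names and shapes below are illustrative, not the real state_dict; note that with the model's 51,264-entry vocab and 768-dim embeddings, one embedding matrix plus an untied head alone come to about 78.7M parameters, consistent with the ~79M "embeddings/head" figure in the table above:

```python
import torch
from collections import OrderedDict

# Toy checkpoint mirroring the model.pt layout (illustrative values):
ckpt = {
    "model": OrderedDict(
        wte=torch.zeros(51264, 768),   # token embeddings (hypothetical name)
        head=torch.zeros(768, 51264),  # output projection (hypothetical name)
    ),
    "step": 1750,
    "val_loss": 1.381,
}

# Parameter count computed from the state_dict alone:
n_params = sum(t.numel() for t in ckpt["model"].values())
print(f"step {ckpt['step']}: {n_params / 1e6:.1f}M params")
# → step 1750: 78.7M params
```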

## Citation

```bibtex
@misc{scaffold-tokens-2025,
  title={Scaffold Tokens: Teaching LLMs to Plan with Countdown Tokens},
  author={Vinícius França},
  year={2025},
  url={https://github.com/viniciusxpb/scaffold-tokens}
}
```