Adaptive-RETRO-GPT-1B is a RETRO-inspired retrieval-pretrained decoder-only language model. Unlike a standard RAG system that only adds retrieved text at inference time, this model is trained with retrieved chunks available during next-token language modeling.
2, retrieval sequence length 512
0.001
0.1, random-retrieval probability 0.15,11,17
HuggingFaceFW/fineweb-edu / sample-10BT
wikimedia/wikipedia / 20231101.en
2048
1,172,146,179
20000
kyLELEng/adaptive-retro-gpt-1b-corpus
kyLELEng/adaptive-retro-gpt-1b-datastore
{
"step": 20000,
"retrieval_on": {
"loss": 1.7580267190933228,
"lm_loss": 1.7580267190933228,
"ppl": 5.800979131574639,
"gate_mean": 1.749867806211114e-06
},
"retrieval_off": {
"loss": 1.7650717496871948,
"lm_loss": 1.7650717496871948,
"ppl": 5.841991504112031,
"gate_mean": 0.0
},
"random_retrieval": {
"loss": 1.7536429166793823,
"lm_loss": 1.7536429166793823,
"ppl": 5.775604444698179,
"gate_mean": 1.7668644431978464e-06
},
"delta_lm_loss_off_minus_on": 0.00704503059387207,
"delta_lm_loss_random_minus_on": -0.00438380241394043
}
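The summary fields above can be cross-checked by hand. The following is a minimal sketch, assuming that `ppl` is the exponential of `lm_loss` and that each delta is a plain difference of LM losses. The helper name `summarize` is hypothetical and is not part of the released code.

```python
import math

# Values copied from the evaluation summary at step 20000.
metrics = {
    "retrieval_on": {"lm_loss": 1.7580267190933228, "ppl": 5.800979131574639},
    "retrieval_off": {"lm_loss": 1.7650717496871948, "ppl": 5.841991504112031},
    "random_retrieval": {"lm_loss": 1.7536429166793823, "ppl": 5.775604444698179},
}


def summarize(m):
    """Recompute perplexities and loss deltas from the per-mode LM losses.

    Hypothetical helper: assumes ppl = exp(lm_loss) and that deltas are
    simple differences relative to the retrieval-on loss.
    """
    on = m["retrieval_on"]["lm_loss"]
    return {
        "ppl_from_loss": {mode: math.exp(v["lm_loss"]) for mode, v in m.items()},
        "delta_off_minus_on": m["retrieval_off"]["lm_loss"] - on,
        "delta_random_minus_on": m["random_retrieval"]["lm_loss"] - on,
    }


summary = summarize(metrics)
for mode, ppl in summary["ppl_from_loss"].items():
    print(f"{mode}: reported ppl={metrics[mode]['ppl']:.6f}, exp(lm_loss)={ppl:.6f}")
print("off - on:", summary["delta_off_minus_on"])
print("random - on:", summary["delta_random_minus_on"])
```

A positive `delta_off_minus_on` means that disabling retrieval raised the loss. A negative `delta_random_minus_on` means that random neighbors scored slightly lower than true neighbors in this run.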
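The evaluation compares three modes: retrieval-on, retrieval-off, and random-retrieval. This is the main ablation for whether the trained model uses retrieved context productively and whether it is robust to noisy retrieval. At step 20000 the differences are small: turning retrieval off raises the LM loss by about 0.007, random retrieval scores about 0.004 lower than true retrieval, and the mean gate value is on the order of 1e-6.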
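The three modes can be pictured as different ways of choosing the neighbor chunks passed to the model. The sketch below is illustrative only: the function `select_neighbors`, its arguments, and the toy datastore are assumptions, and the actual evaluation code for this run is not shown here.

```python
import random


def select_neighbors(mode, true_neighbors, datastore, rng):
    """Pick the retrieved chunks for one evaluation mode (hypothetical helper).

    mode            -- "retrieval_on", "retrieval_off", or "random_retrieval"
    true_neighbors  -- chunks returned by the real retriever for this input
    datastore       -- all candidate chunks (a toy list in this sketch)
    rng             -- a random.Random instance, so sampling is reproducible
    """
    if mode == "retrieval_on":
        # Use the retriever's output unchanged.
        return list(true_neighbors)
    if mode == "retrieval_off":
        # No retrieved context; the model sees only its own input.
        return []
    if mode == "random_retrieval":
        # Same number of chunks, drawn uniformly from the datastore.
        return rng.sample(datastore, k=len(true_neighbors))
    raise ValueError(f"unknown mode: {mode}")


# Toy usage with placeholder chunk strings.
datastore = ["chunk-a", "chunk-b", "chunk-c", "chunk-d", "chunk-e"]
true_neighbors = ["chunk-b", "chunk-d"]
rng = random.Random(0)

for mode in ("retrieval_on", "retrieval_off", "random_retrieval"):
    print(mode, select_neighbors(mode, true_neighbors, datastore, rng))
```

Comparing the loss under `retrieval_on` against `retrieval_off` measures how much retrieved context helps. Comparing it against `random_retrieval` measures how much the relevance of that context matters.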
This is an experimental RETRO-style pretraining run for comparing retrieval-pretrained GPT models against dense GPT baselines at similar training budgets. It is not instruction-tuned and should not be used as a factual assistant without further evaluation.