---
language:
- en
license: cc-by-sa-4.0
library_name: transformers
tags:
- causal-lm
- sequential-pretraining
- helium
- kyutai
datasets:
- kyutai/KairosQA
metrics:
- accuracy
---
# Helium 6B: Sequential vs. Shuffled Pretraining
<p align="center">
<img src="https://huggingface.co/kyutai/Sequential_Helium_6B/resolve/main/kairos_seq_model.png" width="400" alt="Kairos Sequential Model Logo">
</p>
This repository hosts the **Helium 6B** models, which were trained to compare **sequential pretraining** on temporally ordered data against standard **shuffled pretraining**. The goal is to understand how the order of the training data affects a model's ability to retain facts and avoid chronological confusion.
The architecture is derived from [Helium 2B](https://huggingface.co/kyutai/helium-1-2b).
## Model Details
- **Developed by:** Kyutai
- **Model type:** Large Language Model (Decoder-only)
- **Language(s):** Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Irish, Croatian, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish.
- **License:** CC-BY-SA-4.0
- **Base Model:** Helium 2B Architecture (scaled)
---
## Uses
### Direct Use
The sequential variant is engineered to improve **factuality on recent knowledge**. To support this research, we developed:
* **[KairosQA](https://huggingface.co/datasets/kyutai/KairosQA):** A benchmark of 7,000+ temporally grounded questions.
* **[Kairos Evaluation Code](https://github.com/kyutai-labs/kairos):** Tools to analyze how models associate facts with specific time periods.
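
As an illustration of the kind of analysis such a benchmark enables, the sketch below groups graded answers by the year a question refers to and reports accuracy per year. It is not the Kairos evaluation code: the record fields `year` and `correct` are hypothetical and do not reflect the actual KairosQA schema.

```python
from collections import defaultdict


def accuracy_by_year(records):
    """Compute accuracy per year from graded answers.

    Each record is a dict with two hypothetical fields:
    "year" (int) and "correct" (bool). These names are assumptions
    made for this sketch, not the KairosQA column names.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        totals[record["year"]] += 1
        hits[record["year"]] += int(record["correct"])
    # Sort by year so the result reads chronologically.
    return {year: hits[year] / totals[year] for year in sorted(totals)}


graded = [
    {"year": 2020, "correct": True},
    {"year": 2020, "correct": False},
    {"year": 2024, "correct": True},
]
per_year = accuracy_by_year(graded)
```

Comparing such per-year curves between the sequential and shuffled models is one way to see where recent knowledge is better retained.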
### Out-of-Scope Use
* **Instruction Following:** These are base models and have not undergone SFT or RLHF. They will not respond well to direct prompts or "chat" style interactions without further tuning.
* **Multilingual:** The model should not be used in languages other than those it was trained on.
* **Malicious Intent:** Any illegal or harmful activity is strictly prohibited.
---
## Bias, Risks, and Limitations
Helium 6B is a base model and has not been aligned with human preferences.
* **Content:** It may generate biased, incorrect, or harmful content.
* **Recommendation:** Do not use for downstream applications without rigorous alignment (SFT/RLHF) and risk mitigation.
---
## How to Get Started
### Loading the Base Model
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kyutai/Sequential_Helium_6B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the main (final) checkpoint in bfloat16 and place it on the available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```
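
Because these are base models, prompts work best when written as text to be continued rather than as instructions. The following is a minimal sketch using the standard `generate` API of `transformers`; the prompt template, the helper names `build_prompt` and `complete`, and the `max_new_tokens` value are illustrative assumptions, not a recommended setup.

```python
def build_prompt(question: str) -> str:
    """Format a question as text to be continued (hypothetical template)."""
    return f"Question: {question.strip()}\nAnswer:"


def complete(prompt: str,
             model_id: str = "kyutai/Sequential_Helium_6B",
             max_new_tokens: int = 20) -> str:
    """Greedy-decode a short continuation of `prompt`."""
    # Heavy imports are deferred so build_prompt stays usable without torch.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding for reproducible output
        )
    # Keep only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `complete(build_prompt("Which country hosted the 2024 Summer Olympics?"))` downloads the weights on first use and returns the model's continuation of the prompt.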
### Loading Temporal Checkpoints
To access a specific stage of training (e.g., the 2024 sequential checkpoint):
```python
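import torch
from transformers import AutoModelForCausalLM

model_id = "kyutai/Sequential_Helium_6B"
# `subfolder` selects an intermediate checkpoint stored inside the same repository.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    subfolder="sequential_2024",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)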
```
The available checkpoints are listed below:
| Subfolder | Training tokens | Cut-off date | Min. date | Shuffled? |
|--------------|:------:|:------:|:------:|:------:|
| | | | | |
| Main ("") | 2.5T | 2025 | 2018 | no |
| sequential_2024<sup>*</sup> | 2.2T | 2024 | 2018 | no |
| sequential_2023<sup>*</sup> | 1.9T | 2023 | 2018 | no |
| sequential_2022<sup>*</sup> | 1.6T | 2022 | 2018 | no |
| sequential_2021<sup>*</sup> | 1.2T | 2021 | 2018 | no |
| sequential_2020<sup>*</sup> | 0.9T | 2020 | 2018 | no |
| shuffle_eq_2020 | 0.9T | 2024 | 2020 | yes |
| shuffle_eq_2024 | 2.2T | 2024 | 2020 | yes |
| shuffle_eq_2025 | 2.5T | 2024| 2020 | yes |
<sup>*</sup> **Note on Non-Cooldown Variants:** For these specific checkpoints, we can also provide "non-cooldown" counterparts. These are extracted directly from the training process at the equivalent token count without applying a learning rate decay (cooldown phase).
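
To iterate over the sequential checkpoints programmatically, the subfolder names in the table can be derived from the cut-off year. This is a small sketch based only on the table above; the helper names `subfolder_for_cutoff` and `load_checkpoint` are hypothetical, and the non-cooldown variants are not covered because their location is not listed here.

```python
SEQUENTIAL_YEARS = (2020, 2021, 2022, 2023, 2024)
FINAL_YEAR = 2025  # the main checkpoint, stored at the repository root


def subfolder_for_cutoff(year: int) -> str:
    """Map a cut-off year to the `subfolder` value of a sequential checkpoint."""
    if year == FINAL_YEAR:
        return ""  # empty subfolder means the repository root
    if year in SEQUENTIAL_YEARS:
        return f"sequential_{year}"
    raise ValueError(f"No sequential checkpoint with cut-off year {year}")


def load_checkpoint(year: int, model_id: str = "kyutai/Sequential_Helium_6B"):
    """Load the sequential checkpoint for the given cut-off year."""
    # Deferred imports keep subfolder_for_cutoff usable without torch installed.
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        subfolder=subfolder_for_cutoff(year),
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
```

For example, `load_checkpoint(2022)` loads the weights under `sequential_2022`, while `load_checkpoint(2025)` loads the main model.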
## Training Details
### Training Data
Helium 6B checkpoints were trained on data from Common Crawl, which was preprocessed with the [dactory](https://github.com/kyutai-labs/dactory) library.
## Evaluation
#### Testing Data
Our models are primarily designed to support research on LLM temporality and base model dynamics, so their general performance may be lower than that of state-of-the-art models. We nonetheless evaluated them on the OLMES benchmark suite, which covers MMLU, ARC (Easy and Challenge), OpenBookQA, CommonSenseQA, PIQA, SIQA, HellaSwag, WinoGrande, and BoolQ.
#### English Results after 2.5T training tokens
| Benchmark | Sequential-Helium 6B | Shuffled-Helium 6B |
|--------------|:------:|:------:|
| | | |
| MMLU | 59.2 | 56.9 |
| ARC E | 87.7 | 86.6 |
| ARC C | 74.6 | 72.3 |
| OBQA | 74.0 | 72.8 |
| CSQA | 73.6 | 74.2 |
| PIQA | 79.9 | 80.3 |
| SIQA | 66.9 | 67.6 |
| HS | 78.9 | 81.2 |
| WG | 73.2 | 73.3 |
| BoolQ | 84.0 | 83.7 |
| | | |
| OLMES | 77.0 | 77.0 |
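
To see at a glance where the two training orders differ, the per-benchmark gap can be computed from the table. The snippet below is plain arithmetic on the reported scores (sequential minus shuffled) and makes no assumption about how the OLMES aggregate is computed.

```python
# (sequential, shuffled) scores copied from the table above.
scores = {
    "MMLU": (59.2, 56.9),
    "ARC E": (87.7, 86.6),
    "ARC C": (74.6, 72.3),
    "OBQA": (74.0, 72.8),
    "CSQA": (73.6, 74.2),
    "PIQA": (79.9, 80.3),
    "SIQA": (66.9, 67.6),
    "HS": (78.9, 81.2),
    "WG": (73.2, 73.3),
    "BoolQ": (84.0, 83.7),
}

# Positive delta: the sequential model scores higher on that benchmark.
deltas = {name: round(seq - shuf, 1) for name, (seq, shuf) in scores.items()}
sequential_ahead = [name for name, delta in deltas.items() if delta > 0]

for name, delta in deltas.items():
    print(f"{name:6s} {delta:+.1f}")
```

On these numbers the sequential model is ahead on MMLU, ARC, OpenBookQA, and BoolQ, while the shuffled model is ahead on the remaining five benchmarks.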
### Temporal improvements
In the paper [Understanding Data Temporality Impact on Large Language Models Pre-training](https://arxiv.org/abs/2605.22769), we show that the sequentially trained Helium 6B benefits from more up-to-date knowledge, as measured on our [KairosQA](https://huggingface.co/datasets/kyutai/KairosQA) dataset.
### Licensing
Helium 6B models are licensed under the CC-BY-SA 4.0 license.
## Citations
If you use one of these models, please cite:
```bibtex
@misc{pilchen2026understandingdatatemporalityimpact,
  title={Understanding Data Temporality Impact on Large Language Models Pre-training},
  author={Hippolyte Pilchen and Romain Fabre and Franck Signe Talla and Patrick Perez and Edouard Grave},
  year={2026},
  eprint={2605.22769},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.22769},
}
```