---
language:
- en
license: cc-by-sa-4.0
library_name: transformers
tags:
- causal-lm
- sequential-pretraining
- helium
- kyutai
datasets:
- kyutai/KairosQA
metrics:
- accuracy
---

# Helium 6B: Sequential vs. Shuffled Pretraining

<p align="center">
  <img src="https://huggingface.co/kyutai/Sequential_Helium_6B/resolve/main/kairos_seq_model.png" width="400" alt="Kairos Sequential Model Logo">
</p>

This repository houses the **Helium 6B** models, specifically designed to compare **sequential pretraining** on temporally ordered data against standard **shuffled pretraining**. This research aims to understand how the order of data affects a model's ability to retain facts and minimize chronological confusion.

The architecture is derived from [Helium 2B](https://huggingface.co/kyutai/helium-1-2b).

## Model Details

- **Developed by:** Kyutai
- **Model type:** Large Language Model (Decoder-only)
- **Language(s):** Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Irish, Croatian, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish.
- **License:** CC-BY-SA-4.0
- **Base Model:** Helium 2B Architecture (scaled)

---

## Uses

### Direct Use
The sequential variant is engineered to improve **factuality on recent knowledge**. To support this research, we developed:
* **[KairosQA](https://huggingface.co/datasets/kyutai/KairosQA):** A benchmark of 7,000+ temporally grounded questions.
* **[Kairos Evaluation Code](https://github.com/kyutai-labs/kairos):** Tools to analyze how models associate facts with specific time periods.

### Out-of-Scope Use
* **Instruction Following:** These are base models and have not undergone SFT or RLHF. They will not respond well to direct prompts or "chat" style interactions without further tuning.
* **Multilingual:** The model should not be used for languages other than those it was trained on.
* **Malicious Intent:** Any illegal or harmful activity is strictly prohibited.

---

## Bias, Risks, and Limitations

Helium 6B is a base model and has not been aligned with human preferences. 
* **Content:** It may generate biased, incorrect, or harmful content.
* **Recommendation:** Do not use for downstream applications without rigorous alignment (SFT/RLHF) and risk mitigation.

---

## How to Get Started

### Loading the Base Model
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kyutai/Sequential_Helium_6B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the weights in bfloat16; `device_map="auto"` requires the `accelerate` package.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```
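Because these are base models, prompts work best when written as text to be continued rather than as chat instructions. The sketch below illustrates one way to do this. The `Question:`/`Answer:` framing and the helper names `build_completion_prompt` and `complete` are illustrative assumptions, not a format prescribed by the authors, and the generation settings are arbitrary defaults.

```python
def build_completion_prompt(question: str) -> str:
    """Frame a question as text for the base model to continue.

    The "Question:/Answer:" layout is a hypothetical choice for illustration.
    """
    return f"Question: {question.strip()}\nAnswer:"


def complete(
    prompt: str,
    model_id: str = "kyutai/Sequential_Helium_6B",
    max_new_tokens: int = 32,
) -> str:
    """Greedily continue `prompt` and return only the newly generated text."""
    # Heavy imports are deferred so the prompt helper can be used on its own.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
    # Drop the prompt tokens so only the continuation is decoded.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `complete(build_completion_prompt("Who is the current president of France?"))` would download the weights and return the model's continuation. Loading the model inside the function keeps the sketch short; in practice the model and tokenizer would be loaded once and reused.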

### Loading Temporal Checkpoints
To access a specific stage of training (e.g., the 2024 sequential checkpoint):
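```python
import torch
from transformers import AutoModelForCausalLM

model_id = "kyutai/Sequential_Helium_6B"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    subfolder="sequential_2024",  # checkpoint directory inside the repository
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```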

The available checkpoints are listed below:

| Subfolder | Tokens | Cut-off date | Min. date | Shuffled? |
|--------------|:------:|:------:|:------:|:------:|
| Main ("") | 2.5T | 2025 | 2018 | no |
| sequential_2024<sup>*</sup> |  2.2T | 2024 | 2018 | no |
| sequential_2023<sup>*</sup> |  1.9T | 2023 | 2018 | no |
| sequential_2022<sup>*</sup> |  1.6T | 2022 | 2018 | no |
| sequential_2021<sup>*</sup> |  1.2T | 2021 | 2018 | no |
| sequential_2020<sup>*</sup> |  0.9T | 2020 | 2018 | no |
| shuffle_eq_2020 |  0.9T | 2024 | 2020 | yes |
| shuffle_eq_2024 |  2.2T | 2024 | 2020 | yes |
| shuffle_eq_2025 |  2.5T | 2024| 2020 | yes |

<sup>*</sup> **Note on Non-Cooldown Variants:** For these checkpoints, we can also provide "non-cooldown" counterparts. These are extracted directly from the training run at the same token count, without applying the learning-rate decay (cooldown) phase.

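The subfolder names in the table can be wrapped in a small lookup so that a typo fails early instead of triggering a failed download. This is a minimal sketch: the helper name `checkpoint_subfolder` is hypothetical, and the reading of the `shuffle_eq_YYYY` suffix as "same token count as the sequential checkpoint for that year" is an assumption based on the matching token counts in the table.

```python
# Subfolder names copied from the checkpoint table.
# The empty string denotes the main checkpoint at the repository root.
SEQUENTIAL_SUBFOLDERS = {
    2020: "sequential_2020",
    2021: "sequential_2021",
    2022: "sequential_2022",
    2023: "sequential_2023",
    2024: "sequential_2024",
    2025: "",
}
SHUFFLED_SUBFOLDERS = {
    2020: "shuffle_eq_2020",
    2024: "shuffle_eq_2024",
    2025: "shuffle_eq_2025",
}


def checkpoint_subfolder(year: int, shuffled: bool = False) -> str:
    """Return the subfolder for a given year label, or raise if it is not listed."""
    table = SHUFFLED_SUBFOLDERS if shuffled else SEQUENTIAL_SUBFOLDERS
    if year not in table:
        kind = "shuffled" if shuffled else "sequential"
        raise ValueError(
            f"No {kind} checkpoint for {year}; available: {sorted(table)}"
        )
    return table[year]
```

The returned string can be passed as the `subfolder` argument of `from_pretrained`, for example `subfolder=checkpoint_subfolder(2022)`.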
## Training Details

### Training Data

Helium 6B checkpoints were trained on data from Common Crawl, which was preprocessed with the [dactory](https://github.com/kyutai-labs/dactory) library.


## Evaluation


### Testing Data

Our models are primarily designed to facilitate research on LLM temporality and base model dynamics, which may result in lower general performance than state-of-the-art models. We nonetheless evaluated them using the OLMES benchmark. This evaluation covers MMLU, ARC (Easy & Challenge), OpenBookQA, CommonsenseQA, PIQA, SIQA, HellaSwag, WinoGrande, and BoolQ.


### English Results after 2.5T training tokens

| Benchmark | Sequential-Helium 6B | Shuffled-Helium 6B |
|--------------|:------:|:------:|
| MMLU | 59.2 | 56.9 |
| ARC Easy | 87.7 | 86.6 |
| ARC Challenge | 74.6 | 72.3 |
| OpenBookQA | 74.0 | 72.8 |
| CommonsenseQA | 73.6 | 74.2 |
| PIQA | 79.9 | 80.3 |
| SIQA | 66.9 | 67.6 |
| HellaSwag | 78.9 | 81.2 |
| WinoGrande | 73.2 | 73.3 |
| BoolQ | 84.0 | 83.7 |
| | | |
| OLMES | 77.0 | 77.0 |
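To see at a glance where the two training orders differ, the per-benchmark gaps can be computed directly from the table. The snippet below is only a reading aid: the scores are copied from the table above, and the helper name `score_deltas` is hypothetical.

```python
# Scores copied from the results table: (sequential, shuffled).
SCORES = {
    "MMLU": (59.2, 56.9),
    "ARC Easy": (87.7, 86.6),
    "ARC Challenge": (74.6, 72.3),
    "OpenBookQA": (74.0, 72.8),
    "CommonsenseQA": (73.6, 74.2),
    "PIQA": (79.9, 80.3),
    "SIQA": (66.9, 67.6),
    "HellaSwag": (78.9, 81.2),
    "WinoGrande": (73.2, 73.3),
    "BoolQ": (84.0, 83.7),
}


def score_deltas(scores):
    """Return sequential minus shuffled per benchmark, rounded to one decimal."""
    return {name: round(seq - shuf, 1) for name, (seq, shuf) in scores.items()}


deltas = score_deltas(SCORES)
# Print benchmarks from the largest sequential advantage to the largest deficit.
for name, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    if delta > 0:
        leader = "sequential ahead"
    elif delta < 0:
        leader = "shuffled ahead"
    else:
        leader = "tie"
    print(f"{name:<14}{delta:+.1f}  ({leader})")
```

On these numbers the sequential model leads on five benchmarks and the shuffled model on the other five, which is consistent with the identical OLMES scores reported in the last row.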


### Temporal improvements

In the paper [Understanding Data Temporality Impact on Large Language Models Pre-training](https://arxiv.org/abs/2605.22769), we show that our sequentially trained Helium 6B benefits from more up-to-date knowledge, as measured on our [KairosQA](https://huggingface.co/datasets/kyutai/KairosQA) dataset.


## Licensing

Helium 6B models are licensed under the CC-BY-SA 4.0 license.

## Citations

If you use one of these models, please cite:

```bibtex
@misc{pilchen2026understandingdatatemporalityimpact,
      title={Understanding Data Temporality Impact on Large Language Models Pre-training}, 
      author={Hippolyte Pilchen and Romain Fabre and Franck Signe Talla and Patrick Perez and Edouard Grave},
      year={2026},
      eprint={2605.22769},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.22769}, 
}
```