---
language:
- en
license: cc-by-sa-4.0
library_name: transformers
tags:
- causal-lm
- sequential-pretraining
- helium
- kyutai
datasets:
- kyutai/KairosQA
metrics:
- accuracy
---

# Helium 6B: Sequential vs. Shuffled Pretraining


This repository hosts the **Helium 6B** models, which were trained to compare **sequential pretraining** on temporally ordered data with standard **shuffled pretraining**. The goal of this research is to understand how the order of the training data affects a model's ability to retain facts and avoid chronological confusion. The architecture is derived from [Helium 2B](https://huggingface.co/kyutai/helium-1-2b).

## Model Details

- **Developed by:** Kyutai
- **Model type:** Large Language Model (decoder-only)
- **Language(s):** Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Irish, Croatian, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovenian, Swedish.
- **License:** CC-BY-SA-4.0
- **Base Model:** Helium 2B architecture (scaled)

---

## Uses

### Direct Use

The sequential variant is designed to improve **factuality on recent knowledge**. To support this research, we developed:

* **[KairosQA](https://huggingface.co/datasets/kyutai/KairosQA):** A benchmark of 7,000+ temporally grounded questions.
* **[Kairos Evaluation Code](https://github.com/kyutai-labs/kairos):** Tools to analyze how models associate facts with specific time periods.

### Out-of-Scope Use

* **Instruction Following:** These are base models and have not undergone SFT or RLHF. They will not respond well to direct instructions or chat-style interactions without further tuning.
* **Multilingual:** The models should not be used in languages other than those they were trained on.
* **Malicious Intent:** Any illegal or harmful activity is strictly prohibited.

---

## Bias, Risks, and Limitations

Helium 6B is a base model and has not been aligned with human preferences.

* **Content:** It may generate biased, incorrect, or harmful content.
* **Recommendation:** Do not use it in downstream applications without rigorous alignment (SFT/RLHF) and risk mitigation.

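
Because these are base models that continue text rather than follow instructions (see Out-of-Scope Use), one common workaround is to phrase a query as a few-shot completion. The sketch below is only an illustration: the `Question:`/`Answer:` layout, the helper names `build_completion_prompt` and `first_line`, and the demonstration pairs are all assumptions, not a format prescribed by this model card or by the Kairos evaluation code.

```python
from typing import Sequence, Tuple


def build_completion_prompt(
    examples: Sequence[Tuple[str, str]],
    question: str,
) -> str:
    """Format solved Q/A pairs plus one open question as text to be continued.

    The "Question:/Answer:" layout is an illustrative assumption.
    """
    blocks = [f"Question: {q.strip()}\nAnswer: {a.strip()}" for q, a in examples]
    # Leave the last answer empty so a base model continues the pattern.
    blocks.append(f"Question: {question.strip()}\nAnswer:")
    return "\n\n".join(blocks)


def first_line(completion: str) -> str:
    """Return the first non-empty line of a generated continuation."""
    for line in completion.splitlines():
        if line.strip():
            return line.strip()
    return ""


# Hypothetical demonstration pairs; replace them with examples for your task.
demos = [
    ("What is the capital of France?", "Paris"),
    ("How many days are there in a week?", "Seven"),
]
prompt = build_completion_prompt(demos, "What is the chemical symbol for gold?")
print(prompt)
```

The resulting string can be tokenized and passed to `model.generate` like any other text. A base model will often keep writing further question/answer pairs after its answer, which is why `first_line` trims the continuation to a single line.
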

---

## How to Get Started

### Loading the Base Model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kyutai/Sequential_Helium_6B"

# The tokenizer and the final (main) checkpoint live at the repository root.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # load the weights in bfloat16
    device_map="auto",           # place layers on the available devices
)
```

### Loading Temporal Checkpoints

To access a specific stage of training (e.g., the 2024 sequential checkpoint), pass the corresponding subfolder:

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "kyutai/Sequential_Helium_6B"

# `subfolder` selects one of the checkpoints listed in the table below.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    subfolder="sequential_2024",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```

The available checkpoints are listed below:

| Subfolder | Tokens | Cutoff date | Earliest date | Shuffled? |
|--------------|:------:|:------:|:------:|:------:|
| | | | | |
| Main ("") | 2.5T | 2025 | 2018 | no |
| sequential_2024\* | 2.2T | 2024 | 2018 | no |
| sequential_2023\* | 1.9T | 2023 | 2018 | no |
| sequential_2022\* | 1.6T | 2022 | 2018 | no |
| sequential_2021\* | 1.2T | 2021 | 2018 | no |
| sequential_2020\* | 0.9T | 2020 | 2018 | no |
| shuffle_eq_2020 | 0.9T | 2024 | 2020 | yes |
| shuffle_eq_2024 | 2.2T | 2024 | 2020 | yes |
| shuffle_eq_2025 | 2.5T | 2024 | 2020 | yes |

\* **Note on Non-Cooldown Variants:** For the checkpoints marked with an asterisk, we can also provide "non-cooldown" counterparts. These are extracted directly from the training run at the equivalent token count, without applying a learning rate decay (cooldown phase).

## Training Details

### Training Data

Helium 6B checkpoints were trained on data from Common Crawl, which was preprocessed with the [dactory](https://github.com/kyutai-labs/dactory) library.

## Evaluation

#### Testing Data

Our models are primarily designed to facilitate research on LLM temporality and base model dynamics, which may result in lower general performance than state-of-the-art models. We nonetheless evaluated them using the OLMES benchmark.
This evaluation covers MMLU, ARC (Easy & Challenge), OpenBookQA, CommonsenseQA, PIQA, SIQA, HellaSwag, WinoGrande, and BoolQ.

#### English Results after 2.5T training tokens

| Benchmark | Sequential-Helium 6B | Shuffled-Helium 6B |
|--------------|:------:|:------:|
| | | |
| MMLU | 59.2 | 56.9 |
| ARC-Easy | 87.7 | 86.6 |
| ARC-Challenge | 74.6 | 72.3 |
| OpenBookQA | 74.0 | 72.8 |
| CommonsenseQA | 73.6 | 74.2 |
| PIQA | 79.9 | 80.3 |
| SIQA | 66.9 | 67.6 |
| HellaSwag | 78.9 | 81.2 |
| WinoGrande | 73.2 | 73.3 |
| BoolQ | 84.0 | 83.7 |
| | | |
| OLMES | 77.0 | 77.0 |

### Temporal improvements

In the paper [Understanding Data Temporality Impact on Large Language Models Pre-training](https://arxiv.org/abs/2605.22769), we show that our sequentially trained Helium 6B benefits from more up-to-date knowledge, as measured on our [KairosQA](https://huggingface.co/datasets/kyutai/KairosQA) dataset.

### Licensing

Helium 6B models are licensed under the CC-BY-SA 4.0 license.

## Citations

If you use one of these models, please cite:

```bibtex
@misc{pilchen2026understandingdatatemporalityimpact,
  title={Understanding Data Temporality Impact on Large Language Models Pre-training},
  author={Hippolyte Pilchen and Romain Fabre and Franck Signe Talla and Patrick Perez and Edouard Grave},
  year={2026},
  eprint={2605.22769},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.22769},
}
```