---
license: apache-2.0
language:
- en
- es
- fr
- de
- it
- pt
- ru
- ar
- hi
- ko
- zh
library_name: transformers
base_model:
- arcee-ai/Trinity-Large-TrueBase
---
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable html -->
<!-- markdownlint-disable no-duplicate-header -->
<div align="center">
<picture>
<img
src="https://cdn-uploads.huggingface.co/production/uploads/6435718aaaef013d1aec3b8b/i-v1KyAMOW_mgVGeic9WJ.png"
alt="Arcee Trinity Large"
style="max-width: 100%; height: auto;"
>
</picture>
</div>
<hr>
# Trinity-Large-Base
## Introduction
Trinity-Large-Base is a pretrained foundation model from Arcee AI's Trinity Large training run. It is a 398B-parameter sparse Mixture-of-Experts (MoE) model with approximately 13B active parameters per token. The checkpoint was captured after 17 trillion tokens of pretraining, including mid-training learning-rate anneals and context extension, but prior to any instruction tuning or reinforcement learning.
This checkpoint represents the completed pretraining phase and serves as a foundation for research and downstream fine-tuning.
More details on the training of Trinity Large are available in the [technical report](https://github.com/arcee-ai/trinity-large-tech-report/).
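## Quickstart
The snippet below is a minimal loading sketch, assuming the checkpoint loads through `AutoModelForCausalLM` with `trust_remote_code=True` for the custom `AfmoeForCausalLM` architecture; as a base model, Trinity-Large-Base should be prompted for completion rather than chat, and running it at full size requires multi-GPU hardware.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "arcee-ai/Trinity-Large-Base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # use the checkpoint's native dtype
    device_map="auto",       # shard across available GPUs (requires accelerate)
    trust_remote_code=True,  # AfmoeForCausalLM may not ship with every transformers release
)

# Base models continue text; there is no chat template.
inputs = tokenizer("The mixture-of-experts architecture works by", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```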
## Model Variants
The Trinity Large family consists of three checkpoints from the same training run:
- **Trinity-Large-Base** (this release): Full 17T-token pretrained foundation model with mid-training anneals
- **[Trinity-Large-TrueBase](https://huggingface.co/arcee-ai/Trinity-Large-TrueBase)**: 10T-token pre-anneal checkpoint with no instruction data
- **[Trinity-Large-Preview](https://huggingface.co/arcee-ai/Trinity-Large-Preview)**: Lightly post-trained, chat-ready model undergoing active RL
## Architecture
Trinity-Large-Base uses a sparse MoE configuration designed to maximize efficiency while maintaining large-scale capacity.
| Hyperparameter | Value |
|:---|:---:|
| Total parameters | ~398B |
| Active parameters per token | ~13B |
| Experts | 256 |
| Active experts | 4 |
| Routing strategy | Top-4 of 256 (1.56% of experts active) |
| Dense layers | 6 |
| Pretraining context length | 8,192 |
| Context length after extension | 512k |
| Architecture | Sparse MoE (AfmoeForCausalLM) |
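To make the 4-of-256 routing concrete, here is a toy top-k router sketch in PyTorch; the layer sizes and softmax-then-top-k ordering are illustrative assumptions, not Arcee's actual Afmoe implementation.
```python
import torch

# Toy 4-of-256 token router (illustrative; not the actual Afmoe routing code).
num_experts, top_k, hidden = 256, 4, 1024
router = torch.nn.Linear(hidden, num_experts, bias=False)

x = torch.randn(8, hidden)                           # a batch of 8 token embeddings
scores = router(x).softmax(dim=-1)                   # [8, 256] routing probabilities
weights, expert_ids = torch.topk(scores, top_k, -1)  # keep the 4 best experts per token
weights = weights / weights.sum(-1, keepdim=True)    # renormalize over the winners

# Each token activates only 4 of 256 expert FFNs, which is how ~398B total
# parameters translate into only ~13B active parameters per token.
print(expert_ids[0], weights[0])
```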
## Benchmark Results
| Benchmark | N-shot | Metric | Score | Stderr |
|------------------------|--------|-------------------------------|--------|---------|
| mbpp_plus | 3 | pass_at_1,none | 0.8862 | ±0.0164 |
| minerva_math500 | 4 | math_verify,none | 0.6520 | ±0.0213 |
| hellaswag_5shot | 5 | acc_norm,none | 0.9011 | ±0.0030 |
| winogrande_5shot | 5 | acc,none | 0.8082 | ±0.0111 |
| mmlu_5shot | 5 | acc,none | 0.8258 | ±0.0031 |
| mmlu_generative_5shot | 5 | exact_match,get_response | 0.8260 | ±0.0031 |
| mmlu_pro | 5 | exact_match,custom-extract | 0.6602 | ±0.0042 |
| triviaqa_5shot | 5 | exact_match,remove_whitespace | 0.8330 | ±0.0028 |
| arc_challenge_0shot | 0 | acc_norm,none | 0.6544 | ±0.0139 |
| bbh_fewshot | 3 | exact_match,remove_whitespace | 0.6570 | ±0.0051 |
| gpqa_diamond_5shot | 5 | acc_norm,none | 0.4394 | ±0.0354 |
| gsm8k_cot | 8 | exact_match,flexible-extract | 0.9136 | ±0.0077 |
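The metric names above follow the conventions of EleutherAI's lm-evaluation-harness. A reproduction sketch for one row is shown below; the exact task configurations and harness version behind the official numbers are assumptions.
```python
# Hypothetical reproduction of the gsm8k_cot row via lm-evaluation-harness;
# Arcee's exact task configs and harness version may differ.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=arcee-ai/Trinity-Large-Base,trust_remote_code=True",
    tasks=["gsm8k_cot"],
    num_fewshot=8,  # matches the N-shot column above
)
print(results["results"]["gsm8k_cot"])
```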
## Training Configuration
### Pretraining
- Training tokens: 17 trillion
- Checkpoint type: Post-anneal (foundation)
- Instruction data: None
- RLHF or post-training: None
This checkpoint represents the final pretrained state after completion of the pretraining phase, including mid-training learning-rate anneals, but before instruction tuning or reinforcement learning.
### Optimizers
Optimizer learning rates during WSD stable phase:
- Adam learning rate: 2e-4
- Muon learning rate: 8e-4
Muon was used to support larger critical batch sizes in a highly sparse MoE regime.
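For reference, a WSD (warmup-stable-decay) schedule has the general shape sketched below; only the two stable-phase peak learning rates come from this card, while the warmup and decay fractions are placeholder assumptions.
```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.01, decay_frac: float = 0.1) -> float:
    """Warmup-Stable-Decay learning rate (shape illustrative; fractions assumed)."""
    warm_end = max(int(warmup_frac * total_steps), 1)
    decay_start = int((1.0 - decay_frac) * total_steps)
    if step < warm_end:
        return peak_lr * step / warm_end  # linear warmup
    if step < decay_start:
        return peak_lr                    # stable phase
    # final anneal toward zero
    return peak_lr * (total_steps - step) / max(total_steps - decay_start, 1)

ADAM_PEAK, MUON_PEAK = 2e-4, 8e-4  # stable-phase rates reported above
```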
### Infrastructure
- Hardware: 2,048 NVIDIA B300 GPUs
- Parallelism: HSDP + Expert Parallelism (see the sketch after this list)
- Compute partner: [Prime Intellect](https://www.primeintellect.ai/)
- Data partner: [Datology](https://www.datologyai.com/)
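As a rough illustration of the HSDP layout, the sketch below builds a two-dimensional device mesh with PyTorch FSDP's hybrid sharding; the node counts, wrapping granularity, and the expert-parallel wiring are assumptions, since the actual training stack is not described in this card.
```python
# Hypothetical HSDP wrapper using PyTorch's hybrid-sharded FSDP (torch >= 2.2);
# the real training stack and expert-parallel integration are not public here.
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

def wrap_hsdp(model: torch.nn.Module, nodes: int, gpus_per_node: int = 8):
    # Replicate gradients across nodes, shard parameters within each node.
    mesh = init_device_mesh(
        "cuda", (nodes, gpus_per_node), mesh_dim_names=("replicate", "shard")
    )
    return FSDP(model, device_mesh=mesh,
                sharding_strategy=ShardingStrategy.HYBRID_SHARD)
```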
<div align="center">
<picture>
<img src="https://cdn-uploads.huggingface.co/production/uploads/6435718aaaef013d1aec3b8b/sSVjGNHfrJKmQ6w8I18ek.png" style="background-color:ghostwhite;padding:5px;" width="17%" alt="Powered by Datology">
</picture>
</div>
<div align="center">
<picture>
<img src="https://cdn-avatars.huggingface.co/v1/production/uploads/61e020e4a343274bb132e138/H2mcdPRWtl4iKLd-OYYBc.jpeg" style="background-color:ghostwhite;padding:5px;" width="17%" alt="Powered by Datology">
</picture>
</div>
## Intended Use
- Studying emergent behavior from large-scale pretraining
- Sparse MoE routing and load-balancing research
- Interpretability, probing, and ablation studies
- Domain-specific fine-tuning from a pretrained foundation
- Academic and industrial foundation model research
## Comparison with TrueBase
Trinity-Large-Base includes an additional 7 trillion training tokens compared to Trinity-Large-TrueBase, along with mid-training learning-rate anneals. These anneals stabilize training dynamics and typically improve downstream fine-tuning performance compared to the pre-anneal checkpoint. Researchers studying raw pretraining dynamics may prefer TrueBase, while those seeking a foundation for fine-tuning may prefer this checkpoint.
## Known Limitations
- Not aligned for safety, helpfulness, or conversational tone
- Requires substantial compute and expertise to fine-tune
- May exhibit raw or unstable behaviors typical of unaligned models
- No dedicated long-context tuning beyond the pretraining-phase extension from the 8K window to 512k
## License
Trinity-Large-Base is released under the Apache License, Version 2.0.