---
license: apache-2.0
datasets:
- mlfoundations/dclm-baseline-1.0-parquet
language:
- en
pipeline_tag: text-generation
---
# Covenant-72B
## Model Overview
**Covenant-72B** is the largest permissionless, collaboratively trained language
model to date, trained entirely from scratch at the 72-billion-parameter scale
on 1.1 trillion tokens of English text.

For more details, see the [technical report](https://arxiv.org/abs/2603.08163).
This is a base model. See [Covenant-72B-Chat](https://huggingface.co/1Covenant/Covenant-72B-Chat) for the instruction-tuned variant.
**Covenant-72B** was trained with 20+ globally distributed participants
coordinated via decentralized infrastructure on the Bittensor blockchain.
Unlike prior collaborative training efforts that use whitelisted compute,
Covenant-72B is the first to achieve this scale with fully permissionless
participation. Training used the SparseLoCo communication-efficient optimizer
to reduce bandwidth requirements across distributed nodes.
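Communication-efficient optimizers of this family typically run many local optimizer steps per node and then exchange only a heavily compressed "pseudo-gradient" (the parameter delta since the last sync). One common ingredient is top-k sparsification with error feedback, so that dropped entries are carried over rather than lost. The sketch below illustrates that idea only; it is not the actual SparseLoCo implementation, and the function name and `k_frac` value are illustrative:

```python
import torch

def compress_pseudograd(pseudo_grad, error, k_frac=0.05):
    """Top-k sparsification with error feedback (illustrative sketch).

    Only the largest-magnitude ``k_frac`` fraction of entries is
    communicated; the dropped residual accumulates in ``error`` and is
    re-added on the next round, so no signal is lost over time.
    """
    corrected = pseudo_grad + error
    flat = corrected.flatten()
    k = max(1, int(flat.numel() * k_frac))
    _, idx = torch.topk(flat.abs(), k)
    sparse = torch.zeros_like(flat)
    sparse[idx] = flat[idx]
    sparse = sparse.view_as(corrected)
    # return (communicated part, residual carried to the next round)
    return sparse, corrected - sparse

torch.manual_seed(0)
pseudo_grad = torch.randn(1000)  # e.g. params_before - params_after
error = torch.zeros(1000)
sparse, error = compress_pseudograd(pseudo_grad, error)
# only 50 of 1000 entries are transmitted; the rest wait in `error`
```

With 5% density, each sync round moves roughly 20× less data than a dense all-reduce, which is what makes participation over commodity internet links feasible.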
## Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "1Covenant/Covenant-72B",
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shard the model across available GPUs
)
tokenizer = AutoTokenizer.from_pretrained("1Covenant/Covenant-72B")

input_text = "The theory of general relativity"
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
## Model Details
- **Compute Participants**: 20+ independent contributors on Bittensor
- **Minimum Compute per Participant**: 8×B200 or equivalent
- **Model License**: Apache 2.0
## Technical Specifications
| Parameter | Value |
| ------------------------- | ------------------------------ |
| Parameter Size | 72B |
| Architecture | LLaMA-style (LlamaForCausalLM) |
| Number of Layers | 80 |
| Number of Attention Heads | 64 (8 KV heads) |
| Hidden Size | 8192 |
| Intermediate Size | 28672 |
| Head Dimension | 128 |
| Vocabulary Size | 262,144 |
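As a sanity check, the specification above roughly reproduces the 72B parameter count. The back-of-the-envelope sketch below assumes a standard LLaMA layout with untied input embeddings and output head, grouped-query attention, and a gate/up/down MLP; layer norms are ignored:

```python
# Values from the specification table above.
hidden, layers, inter = 8192, 80, 28672
heads, kv_heads, head_dim, vocab = 64, 8, 128, 262144

attn = hidden * heads * head_dim            # q_proj
attn += 2 * hidden * kv_heads * head_dim    # k_proj + v_proj (GQA)
attn += heads * head_dim * hidden           # o_proj
mlp = 3 * hidden * inter                    # gate, up, down projections
embed = 2 * vocab * hidden                  # input embedding + lm_head (untied)

total = layers * (attn + mlp) + embed
print(f"{total / 1e9:.1f}B")                # ≈ 72.7B
```

Note that the attention blocks are a small fraction of each layer: with 8 KV heads, the MLP accounts for over 80% of the per-layer parameters.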
**Training Details**:
- **Dataset**: [DCLM-baseline](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0-parquet)
- **Tokens**: 1.1 Trillion
- **Optimizer**: SparseLoCo (communication-efficient optimizer)
## Performance on Benchmarks
_All results are 0-shot `acc_norm` (%) unless noted._
| Model | Size | Tokens | ARC-C | ARC-E | PIQA | OBQA | HellaSwag | WinoGrande\* | MMLU\* |
| :----------------- | ---: | -----: | ----: | ----: | ----: | ----: | --------: | -----------: | -----: |
| **Covenant-72B** | 72B | 1.1T | 56.83 | 80.93 | 81.56 | 44.00 | 80.61 | 75.85 | 67.11 |
| INTELLECT-1 | 10B | 1T | 44.80 | 71.76 | 77.37 | 43.80 | 70.26 | 63.30 | 32.69 |
| Psyche Consilience | 40B | 1.2T | 31.14 | 55.77 | 76.12 | 35.20 | 63.67 | 56.99 | 24.23 |
| LLM360 K2 ckpt_108 | 65B | 420B | 45.73 | 70.54 | 80.90 | 43.20 | 78.23 | 71.90 | 50.01 |
| LLM360 K2 | 65B | 1.4T | 53.75 | 75.97 | 82.54 | 48.00 | 82.86 | 76.40 | 65.51 |
| LLaMA-2-7B | 7B | 2T | 45.05 | 73.82 | 78.73 | 44.20 | 76.18 | 69.38 | 41.73 |
| LLaMA-2-70B | 70B | 2T | 57.42 | 79.55 | 82.59 | 49.40 | 84.34 | 80.43 | 65.63 |
_\*WinoGrande and MMLU use `acc` rather than `acc_norm`._