---
license: apache-2.0
datasets:
  - mlfoundations/dclm-baseline-1.0-parquet
language:
  - en
pipeline_tag: text-generation
---

# Covenant-72B

## Model Overview

**Covenant-72B** is the largest collaboratively trained language model with
fully permissionless participation: a 72-billion-parameter model trained
entirely from scratch on 1.1 trillion tokens of English text.

![Covenant-72B](assets/covenant-72b.webp)

For more details, see the [technical report](https://arxiv.org/abs/2603.08163).
This is a base model. See [Covenant-72B-Chat](https://huggingface.co/1Covenant/Covenant-72B-Chat) for the instruction-tuned variant.

**Covenant-72B** was trained with 20+ globally distributed participants
coordinated via decentralized infrastructure on the Bittensor blockchain.
Unlike prior collaborative training efforts that use whitelisted compute,
Covenant-72B is the first to achieve this scale with fully permissionless
participation. Training used the SparseLoCo communication-efficient optimizer
to reduce bandwidth requirements across distributed nodes.
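
SparseLoCo's exact algorithm is described in the technical report; as a loose illustration of the top-k-with-error-feedback idea that underlies communication-efficient distributed optimizers in general (this is a sketch of the generic mechanism, not the actual SparseLoCo implementation), each node transmits only the largest-magnitude gradient entries and carries the rest forward locally:

```python
import torch

def topk_with_error_feedback(grad: torch.Tensor,
                             residual: torch.Tensor,
                             k_frac: float = 0.01):
    """Illustrative top-k gradient sparsification with error feedback.

    Only the k largest-magnitude entries are communicated; everything else
    stays on the node as a residual that is folded into the next round, so
    no gradient signal is permanently discarded.
    """
    corrected = grad + residual                # fold in last round's leftovers
    flat = corrected.flatten()
    k = max(1, int(k_frac * flat.numel()))
    _, idx = flat.abs().topk(k)                # indices of largest entries
    values = flat[idx]                         # sparse payload to transmit
    new_residual = flat.clone()
    new_residual[idx] = 0.0                    # transmitted mass leaves residual
    return idx, values, new_residual.view_as(grad)
```

With `k_frac=0.01`, each round communicates roughly 1% of the gradient entries, which is the kind of bandwidth reduction that makes training over heterogeneous internet links feasible.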

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "1Covenant/Covenant-72B",
    torch_dtype=torch.bfloat16,  # ~144 GB of weights in bf16
    device_map="auto",           # shard across available GPUs
)
tokenizer = AutoTokenizer.from_pretrained("1Covenant/Covenant-72B")

input_text = "The theory of general relativity"
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

## Model Details

- **Compute Participants**: 20+ independent contributors on Bittensor
- **Minimum Compute per Participant**: 8×B200 or equivalent
- **Model License**: Apache 2.0

## Technical Specifications

| Parameter                 | Value                          |
| ------------------------- | ------------------------------ |
| Parameter Size            | 72B                            |
| Architecture              | LLaMA-style (LlamaForCausalLM) |
| Number of Layers          | 80                             |
| Number of Attention Heads | 64 (8 KV heads)                |
| Hidden Size               | 8192                           |
| Intermediate Size         | 28672                          |
| Head Dimension            | 128                            |
| Vocabulary Size           | 262,144                        |
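
The table's figures can be sanity-checked against the 72B headline. A back-of-the-envelope count (a sketch assuming untied input/output embeddings and standard bias-free LLaMA projections, not values read from the actual config):

```python
# Rough parameter count for a LLaMA-style model from the table above.
# Assumes untied embeddings and no projection biases; illustrative only.
hidden, inter, layers = 8192, 28672, 80
heads, kv_heads, head_dim = 64, 8, 128
vocab = 262_144

attn = hidden * (heads * head_dim)            # q_proj
attn += 2 * hidden * (kv_heads * head_dim)    # k_proj + v_proj (GQA: 8 KV heads)
attn += (heads * head_dim) * hidden           # o_proj
mlp = 3 * hidden * inter                      # gate, up, down projections
per_layer = attn + mlp

total = layers * per_layer + 2 * vocab * hidden  # + embed_tokens + lm_head
print(f"{total / 1e9:.1f}B")  # ≈ 72.7B
```

Note how the grouped-query attention (8 KV heads against 64 query heads) keeps the K/V projections to an eighth of the Q/O projection size.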

**Training Details**:

- **Dataset**: [DCLM-baseline](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0-parquet)
- **Tokens**: 1.1 Trillion
- **Optimizer**: SparseLoCo (communication-efficient optimizer)

## Performance on Benchmarks

_All results are 0-shot acc_norm (%) unless noted._

| Model              | Size | Tokens | ARC-C | ARC-E |  PIQA |  OBQA | HellaSwag | WinoGrande\* | MMLU\* |
| :----------------- | ---: | -----: | ----: | ----: | ----: | ----: | --------: | -----------: | -----: |
| **Covenant-72B**   |  72B |   1.1T | 56.83 | 80.93 | 81.56 | 44.00 |     80.61 |        75.85 |  67.11 |
| INTELLECT-1        |  10B |     1T | 44.80 | 71.76 | 77.37 | 43.80 |     70.26 |        63.30 |  32.69 |
| Psyche Consilience |  40B |   1.2T | 31.14 | 55.77 | 76.12 | 35.20 |     63.67 |        56.99 |  24.23 |
| LLM360 K2 ckpt_108 |  65B |   420B | 45.73 | 70.54 | 80.90 | 43.20 |     78.23 |        71.90 |  50.01 |
| LLM360 K2          |  65B |   1.4T | 53.75 | 75.97 | 82.54 | 48.00 |     82.86 |        76.40 |  65.51 |
| LLaMA-2-7B         |   7B |     2T | 45.05 | 73.82 | 78.73 | 44.20 |     76.18 |        69.38 |  41.73 |
| LLaMA-2-70B        |  70B |     2T | 57.42 | 79.55 | 82.59 | 49.40 |     84.34 |        80.43 |  65.63 |

_\*WinoGrande and MMLU are reported as acc rather than acc_norm._