---
license: apache-2.0
datasets:
- mlfoundations/dclm-baseline-1.0-parquet
language:
- en
pipeline_tag: text-generation
---

# Covenant-72B
## Model Overview

**Covenant-72B** is the largest permissionless, collaboratively trained language model to date, trained entirely from scratch at the 72-billion-parameter scale on 1.1 trillion tokens of English text.



For more details, see the [technical report](https://arxiv.org/abs/2603.08163). This is a base model; see [Covenant-72B-Chat](https://huggingface.co/1Covenant/Covenant-72B-Chat) for the instruction-tuned variant.

**Covenant-72B** was trained by 20+ globally distributed participants coordinated via decentralized infrastructure on the Bittensor blockchain. Unlike prior collaborative training efforts that rely on whitelisted compute, Covenant-72B is the first to reach this scale with fully permissionless participation. Training used the communication-efficient SparseLoCo optimizer to reduce bandwidth requirements across distributed nodes.
## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "1Covenant/Covenant-72B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("1Covenant/Covenant-72B")

input_text = "The theory of general relativity"
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
## Model Details

- **Compute Participants**: 20+ independent contributors on Bittensor
- **Minimum Compute per Participant**: 8×B200 or equivalent
- **Model License**: Apache 2.0

## Technical Specifications

| Parameter                 | Value                          |
| ------------------------- | ------------------------------ |
| Parameter Size            | 72B                            |
| Architecture              | LLaMA-style (LlamaForCausalLM) |
| Number of Layers          | 80                             |
| Number of Attention Heads | 64 (8 KV heads)                |
| Hidden Size               | 8192                           |
| Intermediate Size         | 28672                          |
| Head Dimension            | 128                            |
| Vocabulary Size           | 262,144                        |
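As a sanity check, the architecture numbers in the table can be used to reconstruct the approximate parameter count of a LLaMA-style model. The sketch below is illustrative arithmetic only (it assumes an untied LM head and ignores small terms such as RMSNorm weights), not the official config:

```python
# Reconstruct an approximate parameter count from the spec table above
# (hypothetical sanity check; assumes untied embeddings, ignores norms/biases).

hidden = 8192
layers = 80
intermediate = 28672
vocab = 262144
n_kv_heads = 8
head_dim = 128

# Attention: Q and O are hidden x hidden; K and V are hidden x (kv_heads * head_dim).
attn = 2 * hidden * hidden + 2 * hidden * n_kv_heads * head_dim
# SwiGLU MLP: gate, up, and down projections.
mlp = 3 * hidden * intermediate
per_layer = attn + mlp

# Input embedding plus an untied LM head (tied embeddings would drop ~2.1B).
embeddings = 2 * vocab * hidden

total = layers * per_layer + embeddings
print(f"{total / 1e9:.1f}B parameters")  # prints "72.7B parameters"
```

The result lands at roughly 72.7B, consistent with the advertised 72B scale.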
**Training Details**:

- **Dataset**: [DCLM-baseline](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0-parquet)
- **Tokens**: 1.1 trillion
- **Optimizer**: SparseLoCo (communication-efficient)
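The technical report describes SparseLoCo in full; as rough intuition, it combines infrequent synchronization via local updates with top-k sparsification of the weight deltas ("pseudo-gradients") exchanged between workers. The toy simulation below is a hypothetical pure-Python sketch under strong simplifying assumptions (identical workers, a toy quadratic loss, no error feedback or quantization) and all names in it are made up for illustration — it shows only the communication pattern, not the actual optimizer:

```python
import random

# Toy SparseLoCo-style loop (illustrative sketch, not the real implementation):
# workers take local SGD steps between syncs, then only the top-k entries of
# each worker's delta are averaged across workers and applied to the model.

DIM, WORKERS, TOP_K = 32, 4, 8       # model size, participants, entries kept
ROUNDS, INNER_STEPS = 50, 10         # outer syncs, local steps per sync
INNER_LR, OUTER_LR = 0.05, 0.7

random.seed(0)
target = [random.uniform(-1, 1) for _ in range(DIM)]  # optimum of the toy loss
shared = [0.0] * DIM                                  # globally synced weights

def local_sgd(weights):
    """Inner loop: SGD on the quadratic loss 0.5 * ||w - target||^2."""
    w = list(weights)
    for _ in range(INNER_STEPS):
        w = [wi - INNER_LR * (wi - ti) for wi, ti in zip(w, target)]
    return w

for _ in range(ROUNDS):
    sparse_deltas = []
    for _worker in range(WORKERS):
        local = local_sgd(shared)
        delta = [li - si for li, si in zip(local, shared)]
        # Top-k sparsification: only the largest-magnitude entries are sent,
        # cutting per-sync communication from DIM to TOP_K values per worker.
        kept = sorted(range(DIM), key=lambda i: abs(delta[i]), reverse=True)[:TOP_K]
        sparse_deltas.append({i: delta[i] for i in kept})
    # Outer step: average the sparse deltas and apply with the outer LR.
    for i in range(DIM):
        avg = sum(d.get(i, 0.0) for d in sparse_deltas) / WORKERS
        shared[i] += OUTER_LR * avg

max_err = max(abs(si - ti) for si, ti in zip(shared, target))
print(f"max |weight - target| after {ROUNDS} rounds: {max_err:.4f}")
```

Even though each sync transmits only a quarter of the coordinates, the shared weights still converge toward the optimum, because the largest remaining errors are always the ones selected for communication.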
## Performance on Benchmarks

_All results are 0-shot acc_norm (%) unless noted._

| Model              | Size | Tokens | ARC-C | ARC-E |  PIQA |  OBQA | HellaSwag | WinoGrande\* | MMLU\* |
| :----------------- | ---: | -----: | ----: | ----: | ----: | ----: | --------: | -----------: | -----: |
| **Covenant-72B**   |  72B |   1.1T | 56.83 | 80.93 | 81.56 | 44.00 |     80.61 |        75.85 |  67.11 |
| INTELLECT-1        |  10B |     1T | 44.80 | 71.76 | 77.37 | 43.80 |     70.26 |        63.30 |  32.69 |
| Psyche Consilience |  40B |   1.2T | 31.14 | 55.77 | 76.12 | 35.20 |     63.67 |        56.99 |  24.23 |
| LLM360 K2 ckpt_108 |  65B |   420B | 45.73 | 70.54 | 80.90 | 43.20 |     78.23 |        71.90 |  50.01 |
| LLM360 K2          |  65B |   1.4T | 53.75 | 75.97 | 82.54 | 48.00 |     82.86 |        76.40 |  65.51 |
| LLaMA-2-7B         |   7B |     2T | 45.05 | 73.82 | 78.73 | 44.20 |     76.18 |        69.38 |  41.73 |
| LLaMA-2-70B        |  70B |     2T | 57.42 | 79.55 | 82.59 | 49.40 |     84.34 |        80.43 |  65.63 |

_\*WinoGrande and MMLU use acc rather than acc_norm._
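Benchmarks of this kind are commonly reproduced with EleutherAI's lm-evaluation-harness. A hypothetical invocation is sketched below; the exact task names, harness version, and flags behind the reported numbers are assumptions, not the authors' recipe:

```shell
pip install lm-eval

# Hypothetical 0-shot evaluation run; adjust tasks/flags to match your setup.
lm_eval --model hf \
  --model_args pretrained=1Covenant/Covenant-72B,dtype=bfloat16,parallelize=True \
  --tasks arc_challenge,arc_easy,piqa,openbookqa,hellaswag,winogrande,mmlu \
  --num_fewshot 0 \
  --batch_size auto
```

Note that a 72B model in bfloat16 needs roughly 145 GB of accelerator memory for inference, so multi-GPU parallelization is effectively required.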