---
license: apache-2.0
datasets:
- mlfoundations/dclm-baseline-1.0-parquet
language:
- en
pipeline_tag: text-generation
---

# Covenant-72B

## Model Overview

**Covenant-72B** is the largest permissionless, collaboratively trained language model, trained entirely from scratch at the 72-billion-parameter scale on 1.1 trillion tokens of English text.



For more details, see the [technical report](https://arxiv.org/abs/2603.08163). This is a base model. See [Covenant-72B-Chat](https://huggingface.co/1Covenant/Covenant-72B-Chat) for the instruction-tuned variant.

**Covenant-72B** was trained by 20+ globally distributed participants coordinated via decentralized infrastructure on the Bittensor blockchain. Unlike prior collaborative training efforts that rely on whitelisted compute, Covenant-72B is the first model at this scale trained with fully permissionless participation. Training used the communication-efficient SparseLoCo optimizer to reduce bandwidth requirements across distributed nodes.

## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "1Covenant/Covenant-72B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("1Covenant/Covenant-72B")

input_text = "The theory of general relativity"
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

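Note that `device_map="auto"` shards the model across available accelerators, which is typically required at this scale: in bfloat16 (2 bytes per parameter), the weights alone occupy roughly 144 GB before activations and KV cache. A quick back-of-envelope check:

```python
# Rough memory estimate for serving the model in bfloat16.
# 72e9 parameters is taken from the model name; the exact count may differ.
params = 72e9
bytes_per_param = 2  # bfloat16 = 16 bits = 2 bytes

weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: ~{weights_gb:.0f} GB")  # ~144 GB
```

This is why a single 80 GB GPU is not sufficient for full-precision inference; plan for multiple GPUs or a quantized variant.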
## Model Details

- **Compute Participants**: 20+ independent contributors on Bittensor
- **Minimum Compute per Participant**: 8×B200 or equivalent
- **Model License**: Apache 2.0

## Technical Specifications

| Parameter                 | Value                          |
| ------------------------- | ------------------------------ |
| Parameter Size            | 72B                            |
| Architecture              | LLaMA-style (LlamaForCausalLM) |
| Number of Layers          | 80                             |
| Number of Attention Heads | 64 (8 KV heads)                |
| Hidden Size               | 8192                           |
| Intermediate Size         | 28672                          |
| Head Dimension            | 128                            |
| Vocabulary Size           | 262,144                        |

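The table can be sanity-checked: for a LLaMA-style decoder with grouped-query attention, the listed dimensions account for roughly 72B parameters. The tally below is a sketch that assumes untied input/output embeddings and ignores norm weights; the exact breakdown is not stated in this card.

```python
# Rough parameter tally from the spec table (LLaMA-style decoder, GQA).
# Assumes untied input/output embeddings; norms and biases are negligible.
vocab, hidden, layers = 262_144, 8192, 80
heads, kv_heads, head_dim = 64, 8, 128
intermediate = 28_672

attn = hidden * heads * head_dim           # Q projection
attn += 2 * hidden * kv_heads * head_dim   # K and V (only 8 KV heads)
attn += heads * head_dim * hidden          # output projection
mlp = 3 * hidden * intermediate            # gate, up, down projections
embeddings = 2 * vocab * hidden            # input + output embeddings

total = layers * (attn + mlp) + embeddings
print(f"~{total / 1e9:.1f}B parameters")   # lands close to the advertised 72B
```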
**Training Details**:

- **Dataset**: [DCLM-baseline](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0-parquet)
- **Tokens**: 1.1 trillion
- **Optimizer**: SparseLoCo (communication-efficient)
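
SparseLoCo-style optimizers keep bandwidth low by synchronizing infrequently and sparsifying what is sent: each node communicates only the largest-magnitude entries of its local update, carrying the dropped remainder forward as error feedback. The exact algorithm is in the technical report; the sketch below only illustrates the top-k-with-error-feedback idea in plain Python, with all names hypothetical:

```python
# Illustrative sketch of top-k sparsification with error feedback, the kind
# of compression SparseLoCo-style training applies to local updates before
# they are exchanged. Hypothetical names; not the actual implementation.

def topk_compress(values, k):
    """Keep the k largest-magnitude entries; zero out the rest."""
    ranked = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    keep = set(ranked[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]

def communicate(update, error, k):
    """Add carried-over error, compress, and remember what was dropped."""
    corrected = [u + e for u, e in zip(update, error)]
    sent = topk_compress(corrected, k)
    new_error = [c - s for c, s in zip(corrected, sent)]  # error feedback
    return sent, new_error

update = [0.9, -0.1, 0.05, -1.2, 0.3]
error = [0.0] * len(update)
sent, error = communicate(update, error, k=2)
print(sent)  # only the two largest-magnitude entries survive
```

The dropped entries are not lost: they accumulate in `error` and are added back before the next round's compression, so small but persistent components eventually get communicated.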

## Performance on Benchmarks

_All results are 0-shot acc_norm (%) unless noted._

| Model              | Size | Tokens | ARC-C | ARC-E |  PIQA |  OBQA | HellaSwag | WinoGrande\* | MMLU\* |
| :----------------- | ---: | -----: | ----: | ----: | ----: | ----: | --------: | -----------: | -----: |
| **Covenant-72B**   |  72B |   1.1T | 56.83 | 80.93 | 81.56 | 44.00 |     80.61 |        75.85 |  67.11 |
| INTELLECT-1        |  10B |     1T | 44.80 | 71.76 | 77.37 | 43.80 |     70.26 |        63.30 |  32.69 |
| Psyche Consilience |  40B |   1.2T | 31.14 | 55.77 | 76.12 | 35.20 |     63.67 |        56.99 |  24.23 |
| LLM360 K2 ckpt_108 |  65B |   420B | 45.73 | 70.54 | 80.90 | 43.20 |     78.23 |        71.90 |  50.01 |
| LLM360 K2          |  65B |   1.4T | 53.75 | 75.97 | 82.54 | 48.00 |     82.86 |        76.40 |  65.51 |
| LLaMA-2-7B         |   7B |     2T | 45.05 | 73.82 | 78.73 | 44.20 |     76.18 |        69.38 |  41.73 |
| LLaMA-2-70B        |  70B |     2T | 57.42 | 79.55 | 82.59 | 49.40 |     84.34 |        80.43 |  65.63 |

_\*WinoGrande and MMLU report acc rather than acc_norm._