---
{
"language": ["en"],
"license": "apache-2.0",
"tags": [
"text-generation",
"causal-lm",
"continual-pretraining",
"lora",
"axolotl",
"deepspeed",
"transformers",
"commandr",
"cohere",
"eu-hpc"
],
"datasets": [
"arxiv",
"gov",
"news",
"wikipedia"
],
"metrics": [
"loss"
],
"library_name": "transformers",
"framework": "pytorch",
"base_model": "CohereLabs/c4ai-command-r-v01",
"model_name": "commandr-35b-cpt",
"pipeline_tag": "text-generation",
"task_categories": ["text-generation"],
"model_type": "AutoModelForCausalLM",
"inference": {
"parameters": {
"max_new_tokens": 512,
"temperature": 0.7,
"top_p": 0.9
}
},
"trained_on": [
"Leonardo EuroHPC"
],
"description": "Continual pretraining (CPT) of Cohere Command-R 35B using Axolotl and DeepSpeed ZeRO-1. The model was trained on scientific, governmental, news, and Wikipedia data with LoRA adapters to improve factual grounding and reasoning."
}
---
# Command-R 35B — CPT (Continual Pretraining with LoRA)
**Model type:** Causal Language Model
**Base model:** [CohereLabs/c4ai-command-r-v01](https://huggingface.co/CohereLabs/c4ai-command-r-v01)
**License:** Apache 2.0
**Framework:** [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl)
---
## Overview
`commandr-35b-cpt` is a **continual-pretrained** version of Cohere's Command-R 35B model, trained with LoRA adapters for efficient energy-domain adaptation.
The goal of CPT is to extend the model’s general reasoning, factual grounding, and domain knowledge across science, governance, and energy-domain text.
Training was performed on the **Leonardo EuroHPC** system using Axolotl with DeepSpeed ZeRO-1 optimization.
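To get a feel for why LoRA makes CPT at this scale tractable, the trainable-parameter count can be estimated from the adapter rank and the adapted layers. The layer dimensions below are illustrative assumptions, not values read from the actual Command-R config; the formula itself is the standard LoRA count, where each adapted `d_in → d_out` linear layer gains `r * (d_in + d_out)` parameters:

```python
# Back-of-the-envelope count of trainable LoRA parameters.
# hidden/intermediate/num_layers are ASSUMED illustrative dimensions,
# not read from the real Command-R configuration.

r = 16                  # LoRA rank used in this run
hidden = 8192           # assumed hidden size
intermediate = 22528    # assumed MLP intermediate size
num_layers = 40         # assumed number of transformer layers

# Target modules from the card: q/k/v/o projections plus the MLP triplet.
per_layer = (
    4 * r * (hidden + hidden)          # q_proj, k_proj, v_proj, o_proj
    + 2 * r * (hidden + intermediate)  # gate_proj, up_proj
    + r * (intermediate + hidden)      # down_proj
)
total = per_layer * num_layers
print(f"~{total / 1e6:.0f}M trainable LoRA parameters")
```

Under these assumptions the adapters hold on the order of 100M trainable parameters, a small fraction of the 35B frozen base weights.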
---
## Training Setup
**Objective:** Language modeling (unsupervised continual pretraining)
**Adapter type:** LoRA
**Precision:** bfloat16
**Hardware:** 8 nodes × 2 NVIDIA A100 64 GB GPUs (16 GPUs total)
**Framework:** DeepSpeed ZeRO-1, Axolotl, PyTorch 2.5.1+cu121
**Runtime:** ~24 hours
**Checkpoints:** Saved every 1/5 of an epoch
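A DeepSpeed configuration consistent with the setup above might look like the following sketch. This is a hypothetical fragment mirroring the card's values (ZeRO stage 1, bfloat16, micro batch 1, gradient accumulation 4), not the exact file used in training; `"auto"` lets the launcher (here, Axolotl) fill in a value at startup:

```python
import json

# Hypothetical DeepSpeed ZeRO-1 config matching the training setup above.
ds_config = {
    "zero_optimization": {"stage": 1},       # shard optimizer state only
    "bf16": {"enabled": True},               # bfloat16 precision
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
    "gradient_clipping": "auto",             # resolved by the launcher
}
print(json.dumps(ds_config, indent=2))
```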
---
## Dataset
Public energy-domain text sources:
- `arxiv.jsonl` — scientific and technical papers
- `gov.jsonl` — public governmental documents
- `news.jsonl` — news articles
- `wiki.jsonl` — Wikipedia text
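Shards like these can be streamed line by line for pretraining. A minimal stdlib-only sketch, assuming each JSONL record carries its document under a `"text"` key (the actual field name in the shards may differ):

```python
import io
import json

def iter_texts(fp):
    """Yield the document text from each non-empty JSONL line."""
    for line in fp:
        if line.strip():
            yield json.loads(line)["text"]

# Stand-in for a shard such as arxiv.jsonl (schema is an assumption).
sample = io.StringIO(
    '{"text": "LoRA adapts a frozen base model."}\n'
    '{"text": "ZeRO-1 shards optimizer state across ranks."}\n'
)
docs = list(iter_texts(sample))
print(len(docs), "documents")  # → 2 documents
```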
---
## Hyperparameters
| Parameter | Value |
|------------|-------|
| Sequence length | 2048 |
| Micro batch size | 1 |
| Gradient accumulation | 4 |
| Epochs | 1 |
| Max steps | 10000 |
| Learning rate | 0.0002 |
| LR scheduler | cosine |
| Optimizer | AdamW (8-bit) |
| Warmup steps | 10 |
| Weight decay | 0.0 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Gradient checkpointing | ✅ |
| Flash attention | ✅ |
| Auto resume | ✅ |
| Loss watchdog threshold | 5.0 |
| Loss watchdog patience | 3 |
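The table above implies an effective batch size and a rough token budget. Assuming all 16 GPUs (8 nodes × 2) act as data-parallel ranks, which this card does not state explicitly:

```python
# Effective batch and token budget implied by the hyperparameter table,
# ASSUMING all 16 GPUs are data-parallel ranks.
micro_batch = 1
grad_accum = 4
gpus = 16
seq_len = 2048
max_steps = 10_000

global_batch = micro_batch * grad_accum * gpus  # sequences per optimizer step
tokens_per_step = global_batch * seq_len
total_tokens = tokens_per_step * max_steps
print(global_batch, tokens_per_step, total_tokens)  # → 64 131072 1310720000
```

Under these assumptions the run sees about 1.3B tokens, in line with a single-epoch, 10,000-step continual-pretraining budget.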
---
## Tokenizer
**Tokenizer type:** `AutoTokenizer`
**Special token:** `<|end_of_text|>` used as `pad_token`
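The pad-token setup can be sketched as the following configuration snippet (access to the Hugging Face Hub is assumed in order to fetch the base tokenizer):

```python
from transformers import AutoTokenizer

# Load the base model's tokenizer and reuse the end-of-text marker as the
# padding token, matching the card above.
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/c4ai-command-r-v01")
tokenizer.pad_token = "<|end_of_text|>"
```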