---
{
"language": ["en"],
"license": "apache-2.0",
"tags": [
"text-generation",
"causal-lm",
"continual-pretraining",
"lora",
"axolotl",
"deepspeed",
"transformers",
"commandr",
"cohere",
"eu-hpc"
],
"datasets": [
"arxiv",
"gov",
"news",
"wikipedia"
],
"metrics": [
"loss"
],
"library_name": "transformers",
"framework": "pytorch",
"base_model": "CohereLabs/c4ai-command-r-v01",
"model_name": "commandr-35b-cpt",
"pipeline_tag": "text-generation",
"task_categories": ["text-generation"],
"model_type": "AutoModelForCausalLM",
"inference": {
"parameters": {
"max_new_tokens": 512,
"temperature": 0.7,
"top_p": 0.9
}
},
"trained_on": [
"Leonardo EuroHPC"
],
"description": "Continual pretraining (CPT) of Cohere Command-R 35B using Axolotl and DeepSpeed ZeRO-1. The model was trained on scientific, governmental, news, and Wikipedia data with LoRA adapters to improve factual grounding and reasoning."
}
---
# Command-R 35B — CPT (Continual Pretraining with LoRA)
**Model type:** Causal Language Model
**Base model:** [CohereLabs/c4ai-command-r-v01](https://huggingface.co/CohereLabs/c4ai-command-r-v01)
**License:** Apache 2.0
**Framework:** [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl)
---
## Overview
`commandr-35b-cpt` is a **continual-pretrained** version of Cohere's Command-R 35B model, trained with LoRA adapters for efficient energy-domain adaptation.
The goal of CPT is to extend the model’s general reasoning, factual grounding, and domain knowledge across science, governance, and energy-domain text.
Training was performed on the **Leonardo EuroHPC** system using Axolotl with DeepSpeed ZeRO-1 optimization.
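To get a feel for why LoRA makes CPT at this scale tractable, the trainable-parameter count can be estimated from the adapter rank and the adapted layers. The layer dimensions below are illustrative assumptions, not values read from the actual Command-R config; the formula itself is the standard LoRA count, where each adapted `d_in → d_out` linear layer gains `r * (d_in + d_out)` parameters:

```python
# Back-of-the-envelope count of trainable LoRA parameters.
# hidden/intermediate/num_layers are ASSUMED illustrative dimensions,
# not read from the real Command-R configuration.

r = 16                  # LoRA rank used in this run
hidden = 8192           # assumed hidden size
intermediate = 22528    # assumed MLP intermediate size
num_layers = 40         # assumed number of transformer layers

# Target modules from the card: q/k/v/o projections plus the MLP triplet.
per_layer = (
    4 * r * (hidden + hidden)          # q_proj, k_proj, v_proj, o_proj
    + 2 * r * (hidden + intermediate)  # gate_proj, up_proj
    + r * (intermediate + hidden)      # down_proj
)
total = per_layer * num_layers
print(f"~{total / 1e6:.0f}M trainable LoRA parameters")
```

Under these assumptions the adapters hold on the order of 100M trainable parameters, a small fraction of the 35B frozen base weights.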
---
## Training Setup
**Objective:** Language modeling (unsupervised continual pretraining)
**Adapter type:** LoRA
**Precision:** bfloat16
**Hardware:** 8 nodes × 2 NVIDIA A100 64 GB GPUs (16 GPUs total)
**Framework:** DeepSpeed ZeRO-1, Axolotl, PyTorch 2.5.1+cu121
**Runtime:** ~24 hours
**Checkpoints:** Saved every 1/5 of an epoch
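A DeepSpeed configuration consistent with the setup above might look like the following sketch. This is a hypothetical fragment mirroring the card's values (ZeRO stage 1, bfloat16, micro batch 1, gradient accumulation 4), not the exact file used in training; `"auto"` lets the launcher (here, Axolotl) fill in a value at startup:

```python
import json

# Hypothetical DeepSpeed ZeRO-1 config matching the training setup above.
ds_config = {
    "zero_optimization": {"stage": 1},       # shard optimizer state only
    "bf16": {"enabled": True},               # bfloat16 precision
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
    "gradient_clipping": "auto",             # resolved by the launcher
}
print(json.dumps(ds_config, indent=2))
```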
---
## Dataset
Public energy-domain text sources:
- `arxiv.jsonl` — scientific and technical papers
- `gov.jsonl` — public governmental documents
- `news.jsonl` — news articles
- `wiki.jsonl` — Wikipedia text
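Shards like these can be streamed line by line for pretraining. A minimal stdlib-only sketch, assuming each JSONL record carries its document under a `"text"` key (the actual field name in the shards may differ):

```python
import io
import json

def iter_texts(fp):
    """Yield the document text from each non-empty JSONL line."""
    for line in fp:
        if line.strip():
            yield json.loads(line)["text"]

# Stand-in for a shard such as arxiv.jsonl (schema is an assumption).
sample = io.StringIO(
    '{"text": "LoRA adapts a frozen base model."}\n'
    '{"text": "ZeRO-1 shards optimizer state across ranks."}\n'
)
docs = list(iter_texts(sample))
print(len(docs), "documents")  # → 2 documents
```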
---
## Hyperparameters
| Parameter | Value |
|------------|-------|
| Sequence length | 2048 |
| Micro batch size | 1 |
| Gradient accumulation | 4 |
| Epochs | 1 |
| Max steps | 10000 |
| Learning rate | 0.0002 |
| LR scheduler | cosine |
| Optimizer | AdamW (8-bit) |
| Warmup steps | 10 |
| Weight decay | 0.0 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Gradient checkpointing | ✅ |
| Flash attention | ✅ |
| Auto resume | ✅ |
| Loss watchdog threshold | 5.0 |
| Loss watchdog patience | 3 |
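The table above implies an effective batch size and a rough token budget. Assuming all 16 GPUs (8 nodes × 2) act as data-parallel ranks, which this card does not state explicitly:

```python
# Effective batch and token budget implied by the hyperparameter table,
# ASSUMING all 16 GPUs are data-parallel ranks.
micro_batch = 1
grad_accum = 4
gpus = 16
seq_len = 2048
max_steps = 10_000

global_batch = micro_batch * grad_accum * gpus  # sequences per optimizer step
tokens_per_step = global_batch * seq_len
total_tokens = tokens_per_step * max_steps
print(global_batch, tokens_per_step, total_tokens)  # → 64 131072 1310720000
```

Under these assumptions the run sees about 1.3B tokens, in line with a single-epoch, 10,000-step continual-pretraining budget.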
---
## Tokenizer
**Tokenizer type:** `AutoTokenizer`
**Special token:** `<|end_of_text|>` used as `pad_token`
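The pad-token setup can be sketched as the following configuration snippet (access to the Hugging Face Hub is assumed in order to fetch the base tokenizer):

```python
from transformers import AutoTokenizer

# Load the base model's tokenizer and reuse the end-of-text marker as the
# padding token, matching the card above.
tokenizer = AutoTokenizer.from_pretrained("CohereLabs/c4ai-command-r-v01")
tokenizer.pad_token = "<|end_of_text|>"
```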