---
language:
  - en
license: apache-2.0
tags:
  - text-generation
  - causal-lm
  - two-stage-training
  - continual-pretraining
  - supervised-fine-tuning
  - synthetic-qa
  - lora
  - axolotl
  - deepspeed
  - transformers
  - commandr
  - cohere
  - eu-hpc
datasets:
  - arxiv
  - gov
  - news
  - wikipedia
  - axolotl_deduplicated_synthetic_qa
metrics:
  - loss
library_name: transformers
framework: pytorch
base_model: ubitech-edg/commandr-35b-cpt
model_name: commandr-35b-cpt-sft
pipeline_tag: text-generation
task_categories:
  - text-generation
  - instruction-following
model_type: AutoModelForCausalLM
inference:
  parameters:
    max_new_tokens: 512
    temperature: 0.7
    top_p: 0.9
trained_on:
  - Leonardo EuroHPC
description: >-
  Two-stage training (CPT + SFT) of Cohere Command-R 35B using Axolotl and
  DeepSpeed. The model first undergoes domain-adaptive continual pretraining
  and then instruction fine-tuning on synthetic QA data.
---

# Command-R 35B — CPT + SFT

**Model type:** Causal Language Model
**Base model:** [commandr-35b-cpt](https://huggingface.co/ubitech-edg/commandr-35b-cpt)
**License:** Apache 2.0
**Framework:** [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl)

---

## Overview

`commandr-35b-cpt-sft` combines **continual pretraining (CPT)** and **supervised fine-tuning (SFT)** in a two-stage process: the model first learns additional domain-adapted representations (CPT), then undergoes supervised instruction tuning (SFT) on synthetic QA data. This combination improves factual grounding, fluency, and instruction adherence. Training was performed on the **Leonardo EuroHPC** system.
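A minimal inference sketch. The repo id `ubitech-edg/commandr-35b-cpt-sft` is an assumption inferred from the base model's namespace, and the snippet assumes `transformers`, `accelerate`, and enough GPU memory for a 35B model; generation parameters mirror the card's inference settings.

```python
# Sketch only: repo id and hardware requirements are assumptions,
# not confirmed by this card.
def generate(prompt, model_id="ubitech-edg/commandr-35b-cpt-sft"):
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # The card pins <|end_of_text|> as the pad token.
    tokenizer.pad_token = "<|end_of_text|>"
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs, max_new_tokens=512, temperature=0.7, top_p=0.9, do_sample=True
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```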
---

## Training Setup

**Stage 1 (CPT):** domain-adaptive continual pretraining
**Stage 2 (SFT):** instruction fine-tuning on synthetic QA data
**Adapter type:** LoRA
**Precision:** bfloat16
**Hardware:** 8 nodes × 2 NVIDIA A100 64 GB GPUs
**Framework:** DeepSpeed ZeRO-1, Axolotl, PyTorch 2.5.1+cu121

---

## Datasets

**CPT stage:**

- `arxiv.jsonl`
- `gov.jsonl`
- `news.jsonl`
- `wiki.jsonl`

**SFT stage:**

- `axolotl_deduplicated_synthetic_qa.jsonl`

---

## Hyperparameters

| Parameter | Value |
|------------|-------|
| Sequence length | 2048 |
| Micro batch size | 1 |
| Gradient accumulation | 2 |
| Epochs | 1 |
| Learning rate | 0.00008 |
| LR scheduler | cosine |
| Optimizer | AdamW (8-bit) |
| Warmup steps | 20 |
| Weight decay | 0.0 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Gradient checkpointing | ✅ |
| Flash attention | ✅ |
| Auto resume | ✅ |
| Loss watchdog threshold | 8.0 |
| Loss watchdog patience | 20 |

---

## Tokenizer

**Tokenizer type:** `AutoTokenizer`
**Special token:** `<|end_of_text|>` used as `pad_token`
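The settings above translate into an Axolotl configuration along these lines. This is a hedged sketch: key names follow common Axolotl conventions, and the optimizer identifier and overall layout are assumptions rather than the project's actual config file.

```yaml
# Sketch of an Axolotl SFT config matching the hyperparameter table.
# Dataset paths are omitted; optimizer name is an assumption.
base_model: ubitech-edg/commandr-35b-cpt
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

sequence_len: 2048
micro_batch_size: 1
gradient_accumulation_steps: 2
num_epochs: 1
learning_rate: 0.00008
lr_scheduler: cosine
optimizer: adamw_bnb_8bit
warmup_steps: 20
weight_decay: 0.0

bf16: true
gradient_checkpointing: true
flash_attention: true
auto_resume_from_checkpoints: true
loss_watchdog_threshold: 8.0
loss_watchdog_patience: 20

special_tokens:
  pad_token: "<|end_of_text|>"
```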
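The batch-size settings above imply a modest effective global batch, assuming standard data parallelism across all GPUs listed in the hardware setup:

```python
# Effective global batch size implied by the hyperparameter table,
# assuming one data-parallel rank per GPU (an assumption, not stated
# explicitly in the card).
micro_batch_size = 1
gradient_accumulation = 2
gpus = 8 * 2  # 8 nodes x 2 A100s per node

global_batch = micro_batch_size * gradient_accumulation * gpus
print(global_batch)  # 32 sequences per optimizer step
```

At a sequence length of 2048, that is roughly 65k tokens per optimizer step.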