---
{
  "language": ["en"],
  "license": "apache-2.0",
  "tags": [
    "text-generation",
    "causal-lm",
    "two-stage-training",
    "continual-pretraining",
    "supervised-fine-tuning",
    "synthetic-qa",
    "lora",
    "axolotl",
    "deepspeed",
    "transformers",
    "commandr",
    "cohere",
    "eu-hpc"
  ],
  "datasets": [
    "arxiv",
    "gov",
    "news",
    "wikipedia",
    "axolotl_deduplicated_synthetic_qa"
  ],
  "metrics": ["loss"],
  "library_name": "transformers",
  "framework": "pytorch",
  "base_model": "ubitech-edg/commandr-35b-cpt",
  "model_name": "commandr-35b-cpt-sft",
  "pipeline_tag": "text-generation",
  "task_categories": ["text-generation", "instruction-following"],
  "model_type": "AutoModelForCausalLM",
  "inference": {
    "parameters": {
      "max_new_tokens": 512,
      "temperature": 0.7,
      "top_p": 0.9
    }
  },
  "trained_on": ["Leonardo EuroHPC"],
  "description": "Two-stage training (CPT + SFT) of Cohere Command-R 35B using Axolotl and DeepSpeed. The model first undergoes domain-adaptive continual pretraining and then instruction fine-tuning on synthetic QA data."
}
---

# Command-R 35B — CPT + SFT

**Model type:** Causal Language Model

**Base model:** [commandr-35b-cpt](https://huggingface.co/ubitech-edg/commandr-35b-cpt)

**License:** Apache 2.0

**Framework:** [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl)

---

## Overview

`commandr-35b-cpt-sft` combines **continual pretraining (CPT)** and **supervised fine-tuning (SFT)** in a two-stage process: the model first learns additional general-domain representations (CPT), then undergoes supervised instruction tuning (SFT) on synthetic QA data. This combination improves factual grounding, fluency, and instruction adherence.

Training was performed on the **Leonardo EuroHPC** system.

---

## Training Setup

**Stage 1 (CPT):** Domain-adaptive continual pretraining

**Stage 2 (SFT):** Instruction fine-tuning on synthetic QA data

**Adapter type:** LoRA

**Precision:** bfloat16

**Hardware:** 8 nodes × 2 NVIDIA A100 64 GB GPUs (16 GPUs total)

**Framework:** DeepSpeed ZeRO-1, Axolotl, PyTorch 2.5.1+cu121
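
With this data-parallel layout, the effective global batch size follows from micro batch × gradient accumulation × GPU count. A quick sanity check, taking the micro batch size and gradient accumulation values from the hyperparameter table on this card and inferring the GPU count from the 8-node × 2-GPU layout:

```python
# Effective global batch size for this run, derived from values on this card.
micro_batch_size = 1   # per-GPU micro batch (see Hyperparameters)
grad_accum_steps = 2   # gradient accumulation steps (see Hyperparameters)
num_gpus = 8 * 2       # 8 nodes x 2 A100 GPUs

effective_batch = micro_batch_size * grad_accum_steps * num_gpus
print(effective_batch)  # 32
```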

---

## Datasets

**CPT Stage:**

- `arxiv.jsonl`
- `gov.jsonl`
- `news.jsonl`
- `wiki.jsonl`

**SFT Stage:**

- `axolotl_deduplicated_synthetic_qa.jsonl`
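
All of these files are line-delimited JSON. The exact record schema of the SFT file is not documented on this card, so the field names in the sketch below (`question`, `answer`) are purely illustrative assumptions:

```python
import json

# Illustrative only: the real schema of axolotl_deduplicated_synthetic_qa.jsonl
# is not documented here, so "question"/"answer" are assumed field names.
sample_lines = [
    '{"question": "What is continual pretraining?",'
    ' "answer": "Further pretraining of an existing model on new data."}',
]

# JSONL = one JSON object per line; parse each line independently.
records = [json.loads(line) for line in sample_lines]
print(sorted(records[0]))  # ['answer', 'question']
```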

---

## Hyperparameters

| Parameter | Value |
|-----------|-------|
| Sequence length | 2048 |
| Micro batch size | 1 |
| Gradient accumulation | 2 |
| Epochs | 1 |
| Learning rate | 8e-5 |
| LR scheduler | cosine |
| Optimizer | AdamW (8-bit) |
| Warmup steps | 20 |
| Weight decay | 0.0 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Gradient checkpointing | ✅ |
| Flash attention | ✅ |
| Auto resume | ✅ |
| Loss watchdog threshold | 8.0 |
| Loss watchdog patience | 20 |
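
These settings map onto an Axolotl YAML configuration roughly as follows. This is a sketch reconstructed from the table above, not the exact file used for the run; key names follow Axolotl's config schema, and the `deepspeed_configs/zero1.json` path is an assumption:

```yaml
# Sketch of an Axolotl config matching the hyperparameters on this card.
base_model: ubitech-edg/commandr-35b-cpt
sequence_len: 2048
micro_batch_size: 1
gradient_accumulation_steps: 2
num_epochs: 1
learning_rate: 8e-5
lr_scheduler: cosine
optimizer: adamw_bnb_8bit
warmup_steps: 20
weight_decay: 0.0

adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

bf16: true
gradient_checkpointing: true
flash_attention: true
auto_resume_from_checkpoints: true
loss_watchdog_threshold: 8.0
loss_watchdog_patience: 20
deepspeed: deepspeed_configs/zero1.json  # assumed path to a ZeRO-1 config
```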

---

## Tokenizer

**Tokenizer type:** `AutoTokenizer`

**Special token:** `<|end_of_text|>` used as `pad_token`
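
A minimal inference sketch with `transformers` follows. The repository id is an assumption (the card's `model_name` under the base model's `ubitech-edg` namespace), and the generation parameters mirror the card's inference settings:

```python
# Repo id assumed from the card metadata; adjust if the model lives elsewhere.
MODEL_ID = "ubitech-edg/commandr-35b-cpt-sft"

# Generation parameters taken from the card's inference settings.
GEN_KWARGS = {"max_new_tokens": 512, "temperature": 0.7, "top_p": 0.9}


def generate(prompt: str) -> str:
    """Load the model and generate a completion for `prompt`."""
    # Imported lazily so the sketch can be read without transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, do_sample=True, **GEN_KWARGS)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

Note that loading the full 35B checkpoint in bfloat16 requires roughly 70 GB of accelerator memory, so `device_map="auto"` across several GPUs is the practical setup.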