---
{
  "language": ["en"],
  "license": "apache-2.0",
  "tags": [
    "text-generation",
    "causal-lm",
    "continual-pretraining",
    "lora",
    "axolotl",
    "deepspeed",
    "transformers",
    "commandr",
    "cohere",
    "eu-hpc"
  ],
  "datasets": ["arxiv", "gov", "news", "wikipedia"],
  "metrics": ["loss"],
  "library_name": "transformers",
  "framework": "pytorch",
  "base_model": "CohereLabs/c4ai-command-r-v01",
  "model_name": "commandr-35b-cpt",
  "pipeline_tag": "text-generation",
  "task_categories": ["text-generation"],
  "model_type": "AutoModelForCausalLM",
  "inference": {
    "parameters": {
      "max_new_tokens": 512,
      "temperature": 0.7,
      "top_p": 0.9
    }
  },
  "trained_on": ["Leonardo EuroHPC"],
  "description": "Continual pretraining (CPT) of Cohere Command-R 35B using Axolotl and DeepSpeed ZeRO-1. The model was trained on scientific, governmental, news, and Wikipedia data with LoRA adapters to improve factual grounding and reasoning."
}
---
|
|
|
|
|
# Command-R 35B — CPT (Continual Pretraining with LoRA) |
|
|
|
|
|
**Model type:** Causal Language Model |
|
|
**Base model:** [CohereLabs/c4ai-command-r-v01](https://huggingface.co/CohereLabs/c4ai-command-r-v01) |
|
|
**License:** Apache 2.0 |
|
|
**Framework:** [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) |
|
|
|
|
|
--- |
|
|
|
|
|
## Overview |
|
|
|
|
|
`commandr-35b-cpt` is a **continual-pretrained** version of Cohere's Command-R 35B model, trained with LoRA adapters for efficient energy-domain adaptation.
|
|
The goal of CPT is to extend the model’s general reasoning, factual grounding, and domain knowledge across science, governance, and energy-domain text. |
|
|
|
|
|
Training was performed on the **Leonardo EuroHPC** system using Axolotl with DeepSpeed ZeRO-1 optimization. |
|
|
|
|
|
--- |
|
|
|
|
|
## Training Setup |
|
|
|
|
|
**Objective:** Language modeling (unsupervised continual pretraining) |
|
|
**Adapter type:** LoRA |
|
|
**Precision:** bfloat16 |
|
|
**Hardware:** 8 nodes × 2 NVIDIA A100 64 GB GPUs (16 GPUs total)
|
|
**Framework:** DeepSpeed ZeRO-1, Axolotl, PyTorch 2.5.1+cu121 |
|
|
**Runtime:** ~24 hours |
|
|
**Checkpoints:** Saved every 1/5 of an epoch |
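As a quick sanity check on the setup above, the effective global batch size follows from the micro batch size, the gradient-accumulation steps (both listed under Hyperparameters), and the 16 data-parallel GPUs. This is a back-of-the-envelope sketch assuming standard DeepSpeed data-parallel semantics, not part of the training code:

```python
# Effective global batch size under data parallelism:
# micro_batch * grad_accum * data_parallel_world_size.
micro_batch = 1          # per-GPU micro batch size
grad_accum = 4           # gradient-accumulation steps
world_size = 8 * 2       # 8 nodes x 2 A100 GPUs per node

effective_batch = micro_batch * grad_accum * world_size
tokens_per_step = effective_batch * 2048  # sequence length 2048

print(effective_batch)   # 64
print(tokens_per_step)   # 131072
```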
|
|
|
|
|
--- |
|
|
|
|
|
## Dataset |
|
|
|
|
|
Public energy-domain text sources:
|
|
|
|
|
- `arxiv.jsonl` — scientific and technical papers |
|
|
- `gov.jsonl` — public governmental documents |
|
|
- `news.jsonl` — news articles |
|
|
- `wiki.jsonl` — Wikipedia text |
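Each corpus file is newline-delimited JSON. Assuming the common convention of one record per line with a `text` field (an assumption; the exact schema is not documented here), a file can be streamed like this:

```python
import json


def iter_texts(path: str):
    """Yield the raw text of each record in a JSONL corpus file.

    Assumes one JSON object per line with a "text" field; adjust the
    key if the actual schema differs.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            yield json.loads(line)["text"]


# Example with an in-memory sample instead of a real corpus file:
record = json.loads('{"text": "Grid operators balance supply and demand."}')
print(record["text"])  # Grid operators balance supply and demand.
```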
|
|
|
|
|
--- |
|
|
|
|
|
## Hyperparameters |
|
|
|
|
|
| Parameter | Value |
|------------|-------|
| Sequence length | 2048 |
| Micro batch size | 1 |
| Gradient accumulation | 4 |
| Epochs | 1 |
| Max steps | 10000 |
| Learning rate | 0.0002 |
| LR scheduler | cosine |
| Optimizer | AdamW (8-bit) |
| Warmup steps | 10 |
| Weight decay | 0.0 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Gradient checkpointing | ✅ |
| Flash attention | ✅ |
| Auto resume | ✅ |
| Loss watchdog threshold | 5.0 |
| Loss watchdog patience | 3 |
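The hyperparameters above correspond to an Axolotl configuration along the following lines. This is a reconstruction, not the exact training config; field names follow Axolotl's documented schema, and the DeepSpeed config path is a placeholder:

```yaml
base_model: CohereLabs/c4ai-command-r-v01
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

sequence_len: 2048
micro_batch_size: 1
gradient_accumulation_steps: 4
num_epochs: 1
max_steps: 10000
learning_rate: 0.0002
lr_scheduler: cosine
optimizer: adamw_bnb_8bit
warmup_steps: 10
weight_decay: 0.0

bf16: true
gradient_checkpointing: true
flash_attention: true
auto_resume_from_checkpoints: true
loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3

special_tokens:
  pad_token: "<|end_of_text|>"

deepspeed: deepspeed_configs/zero1.json  # placeholder path
```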
|
|
|
|
|
|
|
|
## Tokenizer |
|
|
|
|
|
**Tokenizer type:** `AutoTokenizer` |
|
|
**Special token:** `<|end_of_text|>` as `pad_token` |
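A minimal inference sketch using `transformers` and `peft`, wiring up the pad token above and the generation parameters from the metadata. The adapter repo id `your-org/commandr-35b-cpt` is a placeholder; substitute the actual repository:

```python
# Generation parameters from the model card metadata.
GEN_KWARGS = {
    "max_new_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.9,
    "do_sample": True,
}
PAD_TOKEN = "<|end_of_text|>"


def generate(prompt: str, adapter_repo: str = "your-org/commandr-35b-cpt") -> str:
    """Load the base model, attach the LoRA adapter, and generate a completion.

    Requires a GPU and the model weights; adapter_repo is a placeholder id.
    """
    # Heavy imports kept inside the function so the constants above can be
    # inspected without transformers/peft installed.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base_id = "CohereLabs/c4ai-command-r-v01"
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    tokenizer.pad_token = PAD_TOKEN  # matches the training-time pad token

    model = AutoModelForCausalLM.from_pretrained(
        base_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    model = PeftModel.from_pretrained(model, adapter_repo)  # attach LoRA adapter

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, **GEN_KWARGS)
    # Decode only the newly generated tokens.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)


# Usage (requires GPU + downloaded weights):
# print(generate("Summarize the role of grid-scale storage in renewable integration."))
```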