---
{
  "language": ["en"],
  "license": "apache-2.0",
  "tags": [
    "text-generation",
    "causal-lm",
    "continual-pretraining",
    "lora",
    "axolotl",
    "deepspeed",
    "transformers",
    "commandr",
    "cohere",
    "eu-hpc"
  ],
  "datasets": ["arxiv", "gov", "news", "wikipedia"],
  "metrics": ["loss"],
  "library_name": "transformers",
  "framework": "pytorch",
  "base_model": "CohereLabs/c4ai-command-r-v01",
  "model_name": "commandr-35b-cpt",
  "pipeline_tag": "text-generation",
  "task_categories": ["text-generation"],
  "model_type": "AutoModelForCausalLM",
  "inference": {
    "parameters": {
      "max_new_tokens": 512,
      "temperature": 0.7,
      "top_p": 0.9
    }
  },
  "trained_on": ["Leonardo EuroHPC"],
  "description": "Continual pretraining (CPT) of Cohere Command-R 35B using Axolotl and DeepSpeed ZeRO-1. The model was trained on scientific, governmental, news, and Wikipedia data with LoRA adapters to improve factual grounding and reasoning."
}
---
|
|
|
|
|
# Command-R 35B — CPT (Continual Pretraining with LoRA) |
|
|
|
|
|
**Model type:** Causal Language Model |
|
|
**Base model:** [CohereLabs/c4ai-command-r-v01](https://huggingface.co/CohereLabs/c4ai-command-r-v01) |
|
|
**License:** Apache 2.0 |
|
|
**Framework:** [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) |
|
|
|
|
|
--- |
|
|
|
|
|
## Overview |
|
|
|
|
|
`commandr-35b-cpt` is a **continual-pretrained** version of Cohere's Command-R 35B model, trained with LoRA adapters for efficient energy-domain adaptation.
|
|
The goal of CPT is to extend the model’s general reasoning, factual grounding, and domain knowledge across science, governance, and energy-domain text. |
|
|
|
|
|
Training was performed on the **Leonardo EuroHPC** system using Axolotl with DeepSpeed ZeRO-1 optimization. |
|
|
|
|
|
--- |
|
|
|
|
|
## Training Setup |
|
|
|
|
|
**Objective:** Language modeling (unsupervised continual pretraining) |
|
|
**Adapter type:** LoRA |
|
|
**Precision:** bfloat16 |
|
|
**Hardware:** 8 nodes × 2 NVIDIA A100 64 GB GPUs (16 GPUs total)
|
|
**Framework:** DeepSpeed ZeRO-1, Axolotl, PyTorch 2.5.1+cu121 |
|
|
**Runtime:** ~24 hours |
|
|
**Checkpoints:** Saved every 1/5 of an epoch |
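As a quick sanity check on the setup above, the effective global batch size follows from the micro batch size, the gradient-accumulation steps (both listed under Hyperparameters), and the 16 data-parallel GPUs. This is a back-of-the-envelope sketch assuming standard DeepSpeed data-parallel semantics, not part of the training code:

```python
# Effective global batch size under data parallelism:
# micro_batch * grad_accum * data_parallel_world_size.
micro_batch = 1          # per-GPU micro batch size
grad_accum = 4           # gradient-accumulation steps
world_size = 8 * 2       # 8 nodes x 2 A100 GPUs per node

effective_batch = micro_batch * grad_accum * world_size
tokens_per_step = effective_batch * 2048  # sequence length 2048

print(effective_batch)   # 64
print(tokens_per_step)   # 131072
```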
|
|
|
|
|
--- |
|
|
|
|
|
## Dataset |
|
|
|
|
|
Public energy-domain text sources:
|
|
|
|
|
- `arxiv.jsonl` — scientific and technical papers |
|
|
- `gov.jsonl` — public governmental documents |
|
|
- `news.jsonl` — news articles |
|
|
- `wiki.jsonl` — Wikipedia text |
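Each corpus file is newline-delimited JSON. Assuming the common convention of one record per line with a `text` field (an assumption; the exact schema is not documented here), a file can be streamed like this:

```python
import json


def iter_texts(path: str):
    """Yield the raw text of each record in a JSONL corpus file.

    Assumes one JSON object per line with a "text" field; adjust the
    key if the actual schema differs.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            yield json.loads(line)["text"]


# Example with an in-memory sample instead of a real corpus file:
record = json.loads('{"text": "Grid operators balance supply and demand."}')
print(record["text"])  # Grid operators balance supply and demand.
```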
|
|
|
|
|
--- |
|
|
|
|
|
## Hyperparameters |
|
|
|
|
|
| Parameter | Value |
|------------|-------|
| Sequence length | 2048 |
| Micro batch size | 1 |
| Gradient accumulation | 4 |
| Epochs | 1 |
| Max steps | 10000 |
| Learning rate | 0.0002 |
| LR scheduler | cosine |
| Optimizer | AdamW (8-bit) |
| Warmup steps | 10 |
| Weight decay | 0.0 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Gradient checkpointing | ✅ |
| Flash attention | ✅ |
| Auto resume | ✅ |
| Loss watchdog threshold | 5.0 |
| Loss watchdog patience | 3 |
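The hyperparameters above correspond to an Axolotl configuration along the following lines. This is a reconstruction, not the exact training config; field names follow Axolotl's documented schema, and the DeepSpeed config path is a placeholder:

```yaml
base_model: CohereLabs/c4ai-command-r-v01
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

sequence_len: 2048
micro_batch_size: 1
gradient_accumulation_steps: 4
num_epochs: 1
max_steps: 10000
learning_rate: 0.0002
lr_scheduler: cosine
optimizer: adamw_bnb_8bit
warmup_steps: 10
weight_decay: 0.0

bf16: true
gradient_checkpointing: true
flash_attention: true
auto_resume_from_checkpoints: true
loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3

special_tokens:
  pad_token: "<|end_of_text|>"

deepspeed: deepspeed_configs/zero1.json  # placeholder path
```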
|
|
|
|
|
|
|
|
## Tokenizer |
|
|
|
|
|
**Tokenizer type:** `AutoTokenizer` |
|
|
**Special token:** `<|end_of_text|>` as `pad_token` |
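A minimal inference sketch using `transformers` and `peft`, wiring up the pad token above and the generation parameters from the metadata. The adapter repo id `your-org/commandr-35b-cpt` is a placeholder; substitute the actual repository:

```python
# Generation parameters from the model card metadata.
GEN_KWARGS = {
    "max_new_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.9,
    "do_sample": True,
}
PAD_TOKEN = "<|end_of_text|>"


def generate(prompt: str, adapter_repo: str = "your-org/commandr-35b-cpt") -> str:
    """Load the base model, attach the LoRA adapter, and generate a completion.

    Requires a GPU and the model weights; adapter_repo is a placeholder id.
    """
    # Heavy imports kept inside the function so the constants above can be
    # inspected without transformers/peft installed.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base_id = "CohereLabs/c4ai-command-r-v01"
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    tokenizer.pad_token = PAD_TOKEN  # matches the training-time pad token

    model = AutoModelForCausalLM.from_pretrained(
        base_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    model = PeftModel.from_pretrained(model, adapter_repo)  # attach LoRA adapter

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, **GEN_KWARGS)
    # Decode only the newly generated tokens.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)


# Usage (requires GPU + downloaded weights):
# print(generate("Summarize the role of grid-scale storage in renewable integration."))
```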