---
{
  "language": ["en"],
  "license": "apache-2.0",
  "tags": [
    "text-generation",
    "causal-lm",
    "two-stage-training",
    "continual-pretraining",
    "supervised-fine-tuning",
    "synthetic-qa",
    "lora",
    "axolotl",
    "deepspeed",
    "transformers",
    "commandr",
    "cohere",
    "eu-hpc"
  ],
  "datasets": [
    "arxiv",
    "gov",
    "news",
    "wikipedia",
    "axolotl_deduplicated_synthetic_qa"
  ],
  "metrics": ["loss"],
  "library_name": "transformers",
  "framework": "pytorch",
  "base_model": "ubitech-edg/commandr-35b-cpt",
  "model_name": "commandr-35b-cpt-sft",
  "pipeline_tag": "text-generation",
  "task_categories": ["text-generation", "instruction-following"],
  "model_type": "AutoModelForCausalLM",
  "inference": {
    "parameters": {
      "max_new_tokens": 512,
      "temperature": 0.7,
      "top_p": 0.9
    }
  },
  "trained_on": ["Leonardo EuroHPC"],
  "description": "Two-stage training (CPT + SFT) of Cohere Command-R 35B using Axolotl and DeepSpeed. The model first undergoes domain-adaptive continual pretraining and then instruction fine-tuning on synthetic QA data."
}
---

# Command-R 35B — CPT + SFT

**Model type:** Causal Language Model

**Base model:** [commandr-35b-cpt](https://huggingface.co/ubitech-edg/commandr-35b-cpt)

**License:** Apache 2.0

**Framework:** [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl)

---

## Overview

`commandr-35b-cpt-sft` combines **continual pretraining (CPT)** and **supervised fine-tuning (SFT)** in a two-stage process: the model first learns additional general-domain representations (CPT), then undergoes supervised instruction tuning (SFT) on synthetic QA data. This combination improves factual grounding, fluency, and instruction adherence.

Training was performed on the **Leonardo EuroHPC** system.

---

## Training Setup

**Stage 1 (CPT):** Domain-adaptive continual pretraining

**Stage 2 (SFT):** Instruction fine-tuning on synthetic QA data

**Adapter type:** LoRA

**Precision:** bfloat16

**Hardware:** 8 nodes × 2 NVIDIA A100 64 GB GPUs (16 GPUs total)

**Framework:** DeepSpeed ZeRO-1, Axolotl, PyTorch 2.5.1+cu121
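
With this data-parallel layout, the effective global batch size follows from micro batch × gradient accumulation × GPU count. A quick sanity check, taking the micro batch size and gradient accumulation values from the hyperparameter table on this card and inferring the GPU count from the 8-node × 2-GPU layout:

```python
# Effective global batch size for this run, derived from values on this card.
micro_batch_size = 1   # per-GPU micro batch (see Hyperparameters)
grad_accum_steps = 2   # gradient accumulation steps (see Hyperparameters)
num_gpus = 8 * 2       # 8 nodes x 2 A100 GPUs

effective_batch = micro_batch_size * grad_accum_steps * num_gpus
print(effective_batch)  # 32
```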

---

## Datasets

**CPT Stage:**

- `arxiv.jsonl`
- `gov.jsonl`
- `news.jsonl`
- `wiki.jsonl`

**SFT Stage:**

- `axolotl_deduplicated_synthetic_qa.jsonl`
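
All of these files are line-delimited JSON. The exact record schema of the SFT file is not documented on this card, so the field names in the sketch below (`question`, `answer`) are purely illustrative assumptions:

```python
import json

# Illustrative only: the real schema of axolotl_deduplicated_synthetic_qa.jsonl
# is not documented here, so "question"/"answer" are assumed field names.
sample_lines = [
    '{"question": "What is continual pretraining?",'
    ' "answer": "Further pretraining of an existing model on new data."}',
]

# JSONL = one JSON object per line; parse each line independently.
records = [json.loads(line) for line in sample_lines]
print(sorted(records[0]))  # ['answer', 'question']
```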

---

## Hyperparameters

| Parameter | Value |
|-----------|-------|
| Sequence length | 2048 |
| Micro batch size | 1 |
| Gradient accumulation | 2 |
| Epochs | 1 |
| Learning rate | 8e-5 |
| LR scheduler | cosine |
| Optimizer | AdamW (8-bit) |
| Warmup steps | 20 |
| Weight decay | 0.0 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Gradient checkpointing | ✅ |
| Flash attention | ✅ |
| Auto resume | ✅ |
| Loss watchdog threshold | 8.0 |
| Loss watchdog patience | 20 |
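
These settings map onto an Axolotl YAML configuration roughly as follows. This is a sketch reconstructed from the table above, not the exact file used for the run; key names follow Axolotl's config schema, and the `deepspeed_configs/zero1.json` path is an assumption:

```yaml
# Sketch of an Axolotl config matching the hyperparameters on this card.
base_model: ubitech-edg/commandr-35b-cpt
sequence_len: 2048
micro_batch_size: 1
gradient_accumulation_steps: 2
num_epochs: 1
learning_rate: 8e-5
lr_scheduler: cosine
optimizer: adamw_bnb_8bit
warmup_steps: 20
weight_decay: 0.0

adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

bf16: true
gradient_checkpointing: true
flash_attention: true
auto_resume_from_checkpoints: true
loss_watchdog_threshold: 8.0
loss_watchdog_patience: 20
deepspeed: deepspeed_configs/zero1.json  # assumed path to a ZeRO-1 config
```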

---

## Tokenizer

**Tokenizer type:** `AutoTokenizer`

**Special token:** `<|end_of_text|>` used as `pad_token`
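
A minimal inference sketch with `transformers` follows. The repository id is an assumption (the card's `model_name` under the base model's `ubitech-edg` namespace), and the generation parameters mirror the card's inference settings:

```python
# Repo id assumed from the card metadata; adjust if the model lives elsewhere.
MODEL_ID = "ubitech-edg/commandr-35b-cpt-sft"

# Generation parameters taken from the card's inference settings.
GEN_KWARGS = {"max_new_tokens": 512, "temperature": 0.7, "top_p": 0.9}


def generate(prompt: str) -> str:
    """Load the model and generate a completion for `prompt`."""
    # Imported lazily so the sketch can be read without transformers installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, do_sample=True, **GEN_KWARGS)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

Note that loading the full 35B checkpoint in bfloat16 requires roughly 70 GB of accelerator memory, so `device_map="auto"` across several GPUs is the practical setup.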