---
language:
  - en
license: apache-2.0
tags:
  - text-generation
  - causal-lm
  - two-stage-training
  - continual-pretraining
  - supervised-fine-tuning
  - synthetic-qa
  - lora
  - axolotl
  - deepspeed
  - transformers
  - commandr
  - cohere
  - eu-hpc
datasets:
  - arxiv
  - gov
  - news
  - wikipedia
  - axolotl_deduplicated_synthetic_qa
metrics:
  - loss
library_name: transformers
framework: pytorch
base_model: ubitech-edg/commandr-35b-cpt
model_name: commandr-35b-cpt-sft
pipeline_tag: text-generation
task_categories:
  - text-generation
  - instruction-following
model_type: AutoModelForCausalLM
inference:
  parameters:
    max_new_tokens: 512
    temperature: 0.7
    top_p: 0.9
trained_on:
  - Leonardo EuroHPC
description: >-
  Two-stage training (CPT + SFT) of Cohere Command-R 35B using Axolotl and
  DeepSpeed. The model first undergoes domain-adaptive continual pretraining
  and then instruction fine-tuning on synthetic QA data.
---

# Command-R 35B — CPT + SFT

**Model type:** Causal Language Model
**Base model:** [commandr-35b-cpt](https://huggingface.co/ubitech-edg/commandr-35b-cpt)
**License:** Apache 2.0
**Framework:** [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl)

---

## Overview

`commandr-35b-cpt-sft` combines **continual pretraining (CPT)** and **supervised fine-tuning (SFT)** in a two-stage process: the model first learns additional domain-adapted representations (CPT), then undergoes supervised instruction tuning (SFT) on synthetic QA data. This combination improves factual grounding, fluency, and instruction adherence. Training was performed on the **Leonardo EuroHPC** system.
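A minimal inference sketch. The repo id `ubitech-edg/commandr-35b-cpt-sft` is an assumption inferred from the base model's namespace, and the snippet assumes `transformers`, `accelerate`, and enough GPU memory for a 35B model; generation parameters mirror the card's inference settings.

```python
# Sketch only: repo id and hardware requirements are assumptions,
# not confirmed by this card.
def generate(prompt, model_id="ubitech-edg/commandr-35b-cpt-sft"):
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # The card pins <|end_of_text|> as the pad token.
    tokenizer.pad_token = "<|end_of_text|>"
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs, max_new_tokens=512, temperature=0.7, top_p=0.9, do_sample=True
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```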
---

## Training Setup

**Stage 1 (CPT):** domain-adaptive continual pretraining
**Stage 2 (SFT):** instruction fine-tuning on synthetic QA data
**Adapter type:** LoRA
**Precision:** bfloat16
**Hardware:** 8 nodes × 2 NVIDIA A100 64 GB GPUs
**Framework:** DeepSpeed ZeRO-1, Axolotl, PyTorch 2.5.1+cu121

---

## Datasets

**CPT stage:**

- `arxiv.jsonl`
- `gov.jsonl`
- `news.jsonl`
- `wiki.jsonl`

**SFT stage:**

- `axolotl_deduplicated_synthetic_qa.jsonl`

---

## Hyperparameters

| Parameter | Value |
|------------|-------|
| Sequence length | 2048 |
| Micro batch size | 1 |
| Gradient accumulation | 2 |
| Epochs | 1 |
| Learning rate | 0.00008 |
| LR scheduler | cosine |
| Optimizer | AdamW (8-bit) |
| Warmup steps | 20 |
| Weight decay | 0.0 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Gradient checkpointing | ✅ |
| Flash attention | ✅ |
| Auto resume | ✅ |
| Loss watchdog threshold | 8.0 |
| Loss watchdog patience | 20 |

---

## Tokenizer

**Tokenizer type:** `AutoTokenizer`
**Special token:** `<|end_of_text|>` used as `pad_token`
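The settings above translate into an Axolotl configuration along these lines. This is a hedged sketch: key names follow common Axolotl conventions, and the optimizer identifier and overall layout are assumptions rather than the project's actual config file.

```yaml
# Sketch of an Axolotl SFT config matching the hyperparameter table.
# Dataset paths are omitted; optimizer name is an assumption.
base_model: ubitech-edg/commandr-35b-cpt
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

sequence_len: 2048
micro_batch_size: 1
gradient_accumulation_steps: 2
num_epochs: 1
learning_rate: 0.00008
lr_scheduler: cosine
optimizer: adamw_bnb_8bit
warmup_steps: 20
weight_decay: 0.0

bf16: true
gradient_checkpointing: true
flash_attention: true
auto_resume_from_checkpoints: true
loss_watchdog_threshold: 8.0
loss_watchdog_patience: 20

special_tokens:
  pad_token: "<|end_of_text|>"
```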
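The batch-size settings above imply a modest effective global batch, assuming standard data parallelism across all GPUs listed in the hardware setup:

```python
# Effective global batch size implied by the hyperparameter table,
# assuming one data-parallel rank per GPU (an assumption, not stated
# explicitly in the card).
micro_batch_size = 1
gradient_accumulation = 2
gpus = 8 * 2  # 8 nodes x 2 A100s per node

global_batch = micro_batch_size * gradient_accumulation * gpus
print(global_batch)  # 32 sequences per optimizer step
```

At a sequence length of 2048, that is roughly 65k tokens per optimizer step.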