---
language:
  - en
license: apache-2.0
tags:
  - text-generation
  - causal-lm
  - continual-pretraining
  - lora
  - axolotl
  - deepspeed
  - transformers
  - mistral
  - nemo
  - eu-hpc
datasets:
  - arxiv
  - gov
  - news
  - wikipedia
metrics:
  - loss
library_name: transformers
framework: pytorch
base_model: mistralai/Mistral-Nemo-Instruct-2407
model_name: mistral-12b-cpt
pipeline_tag: text-generation
task_categories:
  - text-generation
model_type: AutoModelForCausalLM
inference:
  parameters:
    max_new_tokens: 512
    temperature: 0.7
    top_p: 0.9
trained_on:
  - Leonardo EuroHPC
description: >-
  Continual pretraining (CPT) of Mistral 12B Nemo Instruct using Axolotl and
  DeepSpeed ZeRO-1. Trained on scientific, government, news, and Wikipedia text
  with LoRA adapters.
---

# Mistral 12B — CPT (Continual Pretraining with LoRA)

- **Model type:** Causal Language Model
- **Base model:** `mistralai/Mistral-Nemo-Instruct-2407`
- **License:** Apache 2.0
- **Framework:** Axolotl


## Overview

**mistral-12b-cpt** is a continually pretrained variant of the Mistral 12B Nemo Instruct model.
This CPT phase extends the model's factual and energy-domain understanding using scientific, governmental, news, and encyclopedic text.

Training was executed on the Leonardo EuroHPC system using Axolotl with DeepSpeed ZeRO-1 for efficient large-scale distributed fine-tuning.

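A minimal generation sketch with `transformers` is shown below. The repo id `kosmylo1992/mistral-12b-cpt` and direct loading with `AutoModelForCausalLM` (i.e. merged rather than adapter-only weights) are assumptions, not confirmed by this card; the sampling values mirror the inference settings in the metadata.

```python
# Minimal sketch, assuming the repo id below and merged (non-adapter) weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kosmylo1992/mistral-12b-cpt"  # assumed repo id, adjust as needed

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # matches the bfloat16 training precision
    device_map="auto",
)

prompt = "Summarize the main drivers of electricity demand in Europe."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sampling parameters taken from the card's inference settings.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```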

## Training Setup

- **Objective:** Unsupervised continual pretraining (language modeling)
- **Adapter type:** LoRA
- **Precision:** bfloat16
- **Hardware:** 8 nodes × 2 NVIDIA A100 64 GB GPUs (16 GPUs total)
- **Framework:** Axolotl + DeepSpeed + PyTorch 2.5.1 + CUDA 12.1
- **Runtime:** 24 h
- **Checkpoints:** 5 per epoch

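For reference, a ZeRO stage-1 setup like the one above can be expressed as a DeepSpeed config dict. The sketch below is illustrative only (not the actual DeepSpeed config used for this run); the `"auto"` values are resolved by the Hugging Face / Axolotl DeepSpeed integration.

```python
# Illustrative ZeRO-1 configuration (not the exact config used for this run).
deepspeed_zero1 = {
    "zero_optimization": {"stage": 1},       # ZeRO stage 1: shard optimizer states only
    "bf16": {"enabled": True},               # matches the bfloat16 training precision
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
}

# With the Hugging Face Trainer, such a dict can be passed directly:
# TrainingArguments(..., deepspeed=deepspeed_zero1)
```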

## Dataset

| Dataset | Description |
|---|---|
| `arxiv.jsonl` | Scientific and technical papers |
| `gov.jsonl` | Government and policy documents |
| `news.jsonl` | News articles |
| `wiki.jsonl` | Wikipedia text |
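
A sketch of how these JSONL corpora could be loaded for the language-modeling objective, assuming each file carries a plain `text` field (the field name and local paths are assumptions):

```python
# Sketch: load and interleave the four JSONL corpora (paths and "text" field are assumed).
from datasets import load_dataset, concatenate_datasets

files = ["arxiv.jsonl", "gov.jsonl", "news.jsonl", "wiki.jsonl"]
parts = [load_dataset("json", data_files=f, split="train") for f in files]
corpus = concatenate_datasets(parts).shuffle(seed=42)

print(corpus)                    # one Dataset containing all documents
print(corpus[0]["text"][:200])   # preview of the first document
```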

## Hyperparameters

| Parameter | Value |
|---|---|
| Sequence length | 2048 |
| Micro batch size | 2 |
| Gradient accumulation | 2 |
| Epochs | 10 |
| Max steps | 10000 |
| Learning rate | 0.0002 |
| LR scheduler | cosine |
| Optimizer | AdamW (8-bit) |
| Warmup steps | 10 |
| Weight decay | 0.0 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj |
| Gradient checkpointing | Enabled |
| Flash attention | Enabled |
| Loss watchdog (threshold / patience) | 5.0 / 3 |
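
Expressed with PEFT, an equivalent of the LoRA settings above would look roughly like the sketch below (shown for illustration only, not the actual Axolotl config). The effective global batch size follows from micro batch 2 × gradient accumulation 2 × 16 GPUs = 64 sequences per optimizer step.

```python
# Illustrative PEFT equivalent of the LoRA settings above (not the actual Axolotl config).
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                      # LoRA rank
    lora_alpha=32,             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Effective global batch size:
# micro_batch (2) x grad_accum (2) x GPUs (16) = 64 sequences of 2048 tokens per step.
```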

## Tokenizer

- **Tokenizer type:** AutoTokenizer
- **Pad token:** `<|end_of_text|>`
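
A short sketch of loading the tokenizer and applying the pad token listed above (the repo id is an assumption):

```python
# Sketch: load the tokenizer and ensure the pad token named above is set (repo id is assumed).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("kosmylo1992/mistral-12b-cpt")
if tokenizer.pad_token is None:
    tokenizer.pad_token = "<|end_of_text|>"

batch = tokenizer(
    ["Example document one.", "A second, slightly longer example document."],
    padding=True,
    truncation=True,
    max_length=2048,           # matches the training sequence length
    return_tensors="pt",
)
print(batch["input_ids"].shape)
```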