Instructions to use AiForgeMaster/gemma4-31b-cpt with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use AiForgeMaster/gemma4-31b-cpt with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="AiForgeMaster/gemma4-31b-cpt")

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("AiForgeMaster/gemma4-31b-cpt")
model = AutoModelForImageTextToText.from_pretrained("AiForgeMaster/gemma4-31b-cpt")

Local Apps

vLLM

How to use AiForgeMaster/gemma4-31b-cpt with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "AiForgeMaster/gemma4-31b-cpt"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AiForgeMaster/gemma4-31b-cpt",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
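
The same request can also be issued from Python. The following is a minimal sketch using only the standard library, assuming the server started above is listening on `localhost:8000`. The helper names `build_completion_request` and `complete` are illustrative and are not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "AiForgeMaster/gemma4-31b-cpt"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request to a running server and return the first completion's text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-style completions responses carry the text under choices[0].text
    return body["choices"][0]["text"]
```

With a server running, `complete("Once upon a time,")` should return the generated continuation. Passing `base_url="http://localhost:30000"` points the same helper at a server listening on port 30000.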

Use Docker

docker model run hf.co/AiForgeMaster/gemma4-31b-cpt

SGLang

How to use AiForgeMaster/gemma4-31b-cpt with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "AiForgeMaster/gemma4-31b-cpt" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AiForgeMaster/gemma4-31b-cpt",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "AiForgeMaster/gemma4-31b-cpt" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AiForgeMaster/gemma4-31b-cpt",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use AiForgeMaster/gemma4-31b-cpt with Docker Model Runner:
```
docker model run hf.co/AiForgeMaster/gemma4-31b-cpt
```

gemma4-31b-cpt / README.md

AiForgeMaster

Upload final Gemma-4-31B CPT model weights

7dc5c21 verified about 1 month ago


	---
	library_name: transformers
	license: apache-2.0
	base_model: google/gemma-4-31B
	tags:
	- generated_from_trainer
	datasets:
	- AiForgeMaster/gemma4-31b-cpt-data
	model-index:
	- name: workspace/data/axolotl_output/gemma4-31b-cpt
	  results: []
	---

	<!-- This model card has been generated automatically according to the information the Trainer had access to. You
	should probably proofread and complete it, then remove this comment. -->

	[<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
	<details><summary>See axolotl config</summary>

	axolotl version: `0.16.0.dev0`
	```yaml
	# ══════════════════════════════════════════════════════════════════════════════
	# Axolotl — Full Fine-Tuning Continued Pre-Training
	# Model: Gemma 4 31B Dense (google/gemma-4-31B) — all parameters trainable
	# GPUs: 8× A100 80GB SXM (NVLink) — DeepSpeed ZeRO-3
	# Data: 321,196 chunks | 75% domain (Vedic/SPH) + 25% FineWeb-Edu
	# Tokens: ~1.013B | Sequence: 4096 | 1 epoch
	# Cost: 6× $1.49/hr = $8.94/hr → est. 55-75 hrs → $490-670
	#
	# Launch:
	# PYTORCH_ALLOC_CONF=expandable_segments:True accelerate launch --num_processes 8 -m axolotl.cli.train axolotl_cpt.yml > train.log 2>&1
	#
	# References (verified):
	# - MEDITRON-70B (EPFL, arXiv:2311.16079): FFT CPT, LR=1.5e-4, 48B tok
	# - Me-LLaMA-70B (UF, PMC/11142305): FFT CPT, LR=8e-6, 129B tok
	# - Biderman et al. (TMLR 2024, arXiv:2405.09673): FFT > LoRA for CPT
	# ══════════════════════════════════════════════════════════════════════════════

	# ── Model (FFT — no adapter, no quantization) ─────────────────────────────
	# Use the BASE (pre-trained) model, NOT instruction-tuned (-it).
	base_model: google/gemma-4-31B
	model_type: AutoModelForCausalLM
	tokenizer_type: AutoTokenizer
	trust_remote_code: true
	hf_use_auth_token: true

	# ── DeepSpeed ZeRO-3 ──────────────────────────────────────────────────────
	# Shards weights, gradients, and optimizer states across 6 GPUs.
	# Per-GPU: ~62GB sharded model state + activations → fits 80GB with grad ckpt.
	# Config ships with Axolotl — no custom JSON needed.
	deepspeed: /workspace/axolotl/deepspeed_configs/zero3_bf16.json

	# ── Dataset ────────────────────────────────────────────────────────────────
	# Loaded from HuggingFace Hub — pre-shuffled: 75% domain (Vedic/SPH) + 25% FineWeb-Edu.
	# Pre-chunked to ≤4096 Gemma tokens with 256-tok intra-doc overlap.
	# type: completion → each {"text": "..."} line is one sample, loss on all tokens.
	# (Do NOT use type: pretrain — that re-concatenates, breaking our chunking.)
	datasets:
	  - path: AiForgeMaster/gemma4-31b-cpt-data
	    type: completion
	    split: train

	dataset_prepared_path: /workspace/axolotl/axolotl_cache/cpt

	# ── Sequence & Packing ─────────────────────────────────────────────────────
	sequence_len: 4096
	sample_packing: true # packs shorter chunks together — eliminates padding waste
	pad_to_sequence_len: true
	gemma4_hybrid_attn_impl: true # FA2 on sliding (head_dim=256) layers, SDPA on global (head_dim=512) layers; sets flash_attention internally

	# ── Training Hyperparameters ───────────────────────────────────────────────
	num_epochs: 1 # one epoch — ~1,932 steps over 1.013B tokens

	# Effective batch = micro_batch × grad_accum × 8 GPUs = 1 × 16 × 8 = 128 samples
	# → ~524K tokens/step
	micro_batch_size: 1 # mbs=2 OOM'd on transient all-gather (8 GB failed alloc) — stay at mbs=1
	gradient_accumulation_steps: 16

	chunked_cross_entropy: true # avoids materializing full (B,S,V) logits tensor

	plugins:
	- axolotl.integrations.liger.LigerPlugin
	liger_glu_activation: true # fused GEGLU MLP for Gemma 4 (Triton)
	liger_rms_norm: false # keep existing fused_attn.py RMSNorm patch
	liger_rope: false # Gemma 4 incompatible (separate q/k)
	liger_cross_entropy: false # chunked_cross_entropy handles this
	liger_fused_linear_cross_entropy: false # Gemma 4 incompatible

	optimizer: adamw_bnb_8bit # 8-bit Adam — ~6 bytes/param opt state instead of 12; saves ~23 GB/GPU
	lr_scheduler: cosine
	learning_rate: 5e-5 # conservative for FFT CPT on 31B with 1B tokens
	# verified range: 5e-6 (Biderman) to 1.5e-4 (MEDITRON)
	# 5e-5 balances learning vs forgetting for our data scale
	weight_decay: 0.1 # standard for FFT with AdamW (MEDITRON used 0.1)
	max_grad_norm: 1.0
	warmup_ratio: 0.01 # ~25-33 warmup steps before cosine decay

	# ── Precision & Memory ─────────────────────────────────────────────────────
	bf16: true
	tf32: true

	gradient_checkpointing: true
	gradient_checkpointing_kwargs:
	  use_reentrant: true # DeepSpeed compatibility (per Axolotl docs)

	# ── Output & Checkpointing ─────────────────────────────────────────────────
	# /workspace/data is on a 16 TB volume — full DS checkpoints (~310 GB each) fit fine.
	output_dir: /workspace/data/axolotl_output/gemma4-31b-cpt
	logging_steps: 10

	save_only_model: false # save optimizer + scheduler + RNG for exact resume
	saves_per_epoch: 4 # every ~25% of epoch (~every 483 steps / ~14 hrs)
	save_total_limit: 2 # keep latest 2 (briefly 3 during write) — ~900 GB peak on disk
	# Resume: accelerate launch ... axolotl_cpt.yml --resume_from_checkpoint <path>
	# Infra switch (different GPU count): run zero_to_fp32.py on old checkpoint,
	# then start fresh — optimizer state resets, loss wobbles briefly then recovers.

	val_set_size: 0 # no eval split — CPT trains on all data
	load_best_model_at_end: false

	# ── Weights & Biases ──────────────────────────────────────────────────────
	wandb_project: virtual_agama
	wandb_run_id: gemma4-31b-fft-stage1

	# ── Benchmark First! ──────────────────────────────────────────────────────
	# Before committing full budget, run a quick throughput test:
	# 1. Set max_steps: 50
	# 2. Launch training, note tokens/sec from logs
	# 3. Calculate: 1,013,000,000 / tok_per_sec / 3600 * 8.94 = total cost
	# 4. If over $500, options:
	#    a) Train on domain only (760M tok) — skip GK mix, add at SFT stage
	#    b) Stretch budget $50-100 — worth it for FFT quality over QLoRA

	```
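
The batch-size and optimizer-memory figures quoted in the config comments can be reproduced with a few lines of arithmetic. The sketch below is only a sanity check, not part of the training code. The constant and function names are illustrative, and `PARAMS = 31e9` is an assumption taken from the "31B" in the model name rather than an exact parameter count.

```python
# Values copied from the config; PARAMS is an assumed round number.
MICRO_BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 16
NUM_GPUS = 8
SEQUENCE_LEN = 4096
TOTAL_TOKENS = 1_013_000_000
PARAMS = 31e9


def effective_batch(micro_batch, grad_accum, num_gpus):
    """Samples consumed per optimizer step across all GPUs."""
    return micro_batch * grad_accum * num_gpus


def tokens_per_step(batch, seq_len):
    """Upper bound on tokens per step: assumes every packed sequence is full."""
    return batch * seq_len


def optimizer_savings_gb_per_gpu(params, bytes_before, bytes_after, num_gpus):
    """Optimizer-state memory saved per GPU when the state is sharded evenly."""
    return params * (bytes_before - bytes_after) / num_gpus / 1e9


batch = effective_batch(MICRO_BATCH_SIZE, GRAD_ACCUM_STEPS, NUM_GPUS)
step_tokens = tokens_per_step(batch, SEQUENCE_LEN)
estimated_steps = round(TOTAL_TOKENS / step_tokens)
saved_gb = optimizer_savings_gb_per_gpu(PARAMS, 12, 6, NUM_GPUS)

print(f"effective batch: {batch} samples")
print(f"tokens per step: {step_tokens:,}")
print(f"estimated steps: {estimated_steps:,}")
print(f"8-bit Adam saving: {saved_gb:.2f} GB per GPU")
```

Under these assumptions the results line up with the comments: 128 samples, about 524K tokens per step, roughly 1,932 steps, and a saving of about 23 GB per GPU. The step count is an estimate from the token budget, so the step count the trainer itself reports may differ.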

	</details><br>
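
The cost formula in step 3 of the "Benchmark First!" comment can be written as a small helper. This is a sketch under stated assumptions: the function name is illustrative, and the throughput of 5,000 tokens/sec is a hypothetical placeholder, not a measured value from this run. The token count and hourly rate are the ones quoted in the config comments.

```python
def estimate_cost(total_tokens, tokens_per_sec, hourly_rate):
    """Return (hours, cost) for one pass over total_tokens at a fixed throughput."""
    hours = total_tokens / tokens_per_sec / 3600
    return hours, hours * hourly_rate


# tokens_per_sec is a made-up example; substitute the value read from the logs.
hours, cost = estimate_cost(
    total_tokens=1_013_000_000,
    tokens_per_sec=5_000,
    hourly_rate=8.94,
)
print(f"{hours:.1f} hours, ${cost:.0f}")
```

Replacing `tokens_per_sec` with the figure observed during the 50-step benchmark gives the projected total to compare against the budget.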

	# workspace/data/axolotl_output/gemma4-31b-cpt

	This model is a fine-tuned version of [google/gemma-4-31B](https://huggingface.co/google/gemma-4-31B) on the AiForgeMaster/gemma4-31b-cpt-data dataset.

	## Model description

	More information needed

	## Intended uses & limitations

	More information needed

	## Training and evaluation data

	More information needed

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 5e-05
	- train_batch_size: 1
	- eval_batch_size: 1
	- seed: 42
	- distributed_type: multi-GPU
	- num_devices: 8
	- gradient_accumulation_steps: 16
	- total_train_batch_size: 128
	- total_eval_batch_size: 8
	- optimizer: Use OptimizerNames.ADAMW_BNB with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
	- lr_scheduler_type: cosine
	- lr_scheduler_warmup_steps: 16
	- training_steps: 1624
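
The reported values are consistent with one another, which a short sketch can confirm. It assumes the warmup step count is obtained by truncating `warmup_ratio × training_steps` (this rounding rule is an assumption, chosen because it reproduces the reported 16) and that the schedule is a linear warmup followed by a single half-cosine decay to zero. `lr_at` is an illustrative helper, not a library function.

```python
import math

LEARNING_RATE = 5e-5
WARMUP_RATIO = 0.01      # from the axolotl config
TRAINING_STEPS = 1624
TRAIN_BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 16
NUM_DEVICES = 8

# Assumed rounding rule: truncate toward zero
warmup_steps = int(WARMUP_RATIO * TRAINING_STEPS)
total_train_batch_size = TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS * NUM_DEVICES


def lr_at(step, peak=LEARNING_RATE, warmup=warmup_steps, total=TRAINING_STEPS):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    if step < warmup:
        return peak * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return peak * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


print(f"warmup steps: {warmup_steps}")
print(f"total train batch size: {total_train_batch_size}")
for step in (0, 8, 16, 820, 1624):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

Under these assumptions the learning rate rises from zero to the 5e-05 peak at step 16, is about half the peak midway through the decay, and reaches zero at step 1624.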

	### Training results



	### Framework versions

	- Transformers 5.5.4
	- Pytorch 2.10.0+cu128
	- Datasets 4.8.4
	- Tokenizers 0.22.2