Instructions to use AiForgeMaster/gemma4-31b-cpt with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use AiForgeMaster/gemma4-31b-cpt with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="AiForgeMaster/gemma4-31b-cpt")

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("AiForgeMaster/gemma4-31b-cpt")
model = AutoModelForImageTextToText.from_pretrained("AiForgeMaster/gemma4-31b-cpt")
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use AiForgeMaster/gemma4-31b-cpt with vLLM:
Install from pip and serve model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "AiForgeMaster/gemma4-31b-cpt"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "AiForgeMaster/gemma4-31b-cpt",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker
```shell
docker model run hf.co/AiForgeMaster/gemma4-31b-cpt
```
- SGLang
How to use AiForgeMaster/gemma4-31b-cpt with SGLang:
Install from pip and serve model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "AiForgeMaster/gemma4-31b-cpt" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "AiForgeMaster/gemma4-31b-cpt",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "AiForgeMaster/gemma4-31b-cpt" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "AiForgeMaster/gemma4-31b-cpt",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- Docker Model Runner
How to use AiForgeMaster/gemma4-31b-cpt with Docker Model Runner:
docker model run hf.co/AiForgeMaster/gemma4-31b-cpt
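The curl calls shown for vLLM and SGLang can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_completion_request` and `complete` are hypothetical, and reading the result from `choices[0]["text"]` assumes the server returns the usual OpenAI-style completions response.

```python
import json
import urllib.request

MODEL_ID = "AiForgeMaster/gemma4-31b-cpt"


def build_completion_request(base_url, prompt, max_tokens=512, temperature=0.5):
    """Build a POST request for the /v1/completions endpoint (hypothetical helper)."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, prompt, timeout=120):
    """Send the request and return the text of the first completion.

    Assumes an OpenAI-style response body with a "choices" list.
    """
    request = build_completion_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With a server running, `complete("http://localhost:8000", "Once upon a time,")` would target the vLLM port used above, and `http://localhost:30000` the SGLang port.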
| library_name: transformers | |
| license: apache-2.0 | |
| base_model: google/gemma-4-31B | |
| tags: | |
| - generated_from_trainer | |
| datasets: | |
| - AiForgeMaster/gemma4-31b-cpt-data | |
| model-index: | |
| - name: workspace/data/axolotl_output/gemma4-31b-cpt | |
| results: [] | |
| <!-- This model card has been generated automatically according to the information the Trainer had access to. You | |
| should probably proofread and complete it, then remove this comment. --> | |
| [<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl) | |
| <details><summary>See axolotl config</summary> | |
| axolotl version: `0.16.0.dev0` | |
| ```yaml | |
| # ────────────────────────────────────────────────────────────────────────────── | |
| # Axolotl: Full Fine-Tuning Continued Pre-Training | |
| # Model: Gemma 4 31B Dense (google/gemma-4-31B), all parameters trainable | |
| # GPUs: 8× A100 80GB SXM (NVLink), DeepSpeed ZeRO-3 | |
| # Data: 321,196 chunks | 75% domain (Vedic/SPH) + 25% FineWeb-Edu | |
| # Tokens: ~1.013B | Sequence: 4096 | 1 epoch | |
| # Cost: 6× $1.49/hr = $8.94/hr → est. 55-75 hrs → $490-670 | |
| # | |
| # Launch: | |
| # PYTORCH_ALLOC_CONF=expandable_segments:True accelerate launch --num_processes 8 -m axolotl.cli.train axolotl_cpt.yml > train.log 2>&1 | |
| # | |
| # References (verified): | |
| # - MEDITRON-70B (EPFL, arXiv:2311.16079): FFT CPT, LR=1.5e-4, 48B tok | |
| # - Me-LLaMA-70B (UF, PMC/11142305): FFT CPT, LR=8e-6, 129B tok | |
| # - Biderman et al. (TMLR 2024, arXiv:2405.09673): FFT > LoRA for CPT | |
| # ────────────────────────────────────────────────────────────────────────────── | |
| # ── Model (FFT: no adapter, no quantization) ───────────────────────────── | |
| # Use the BASE (pre-trained) model, NOT instruction-tuned (-it). | |
| base_model: google/gemma-4-31B | |
| model_type: AutoModelForCausalLM | |
| tokenizer_type: AutoTokenizer | |
| trust_remote_code: true | |
| hf_use_auth_token: true | |
| # ── DeepSpeed ZeRO-3 ────────────────────────────────────────────────────── | |
| # Shards weights, gradients, and optimizer states across 8 GPUs. | |
| # Per-GPU: ~62GB sharded model state + activations → fits 80GB with grad ckpt. | |
| # Config ships with Axolotl; no custom JSON needed. | |
| deepspeed: /workspace/axolotl/deepspeed_configs/zero3_bf16.json | |
| # ── Dataset ──────────────────────────────────────────────────────────────── | |
| # Loaded from HuggingFace Hub, pre-shuffled: 75% domain (Vedic/SPH) + 25% FineWeb-Edu. | |
| # Pre-chunked to ≤4096 Gemma tokens with 256-tok intra-doc overlap. | |
| # type: completion → each {"text": "..."} line is one sample, loss on all tokens. | |
| # (Do NOT use type: pretrain; that re-concatenates, breaking our chunking.) | |
| datasets: | |
| - path: AiForgeMaster/gemma4-31b-cpt-data | |
| type: completion | |
| split: train | |
| dataset_prepared_path: /workspace/axolotl/axolotl_cache/cpt | |
| # ── Sequence & Packing ───────────────────────────────────────────────────── | |
| sequence_len: 4096 | |
| sample_packing: true # packs shorter chunks together, eliminating padding waste | |
| pad_to_sequence_len: true | |
| gemma4_hybrid_attn_impl: true # FA2 on sliding (head_dim=256) layers, SDPA on global (head_dim=512) layers; sets flash_attention internally | |
| # ── Training Hyperparameters ─────────────────────────────────────────────── | |
| num_epochs: 1 # one epoch → ~1,932 steps over 1.013B tokens | |
| # Effective batch = micro_batch × grad_accum × 8 GPUs = 1 × 16 × 8 = 128 samples | |
| # → ~524K tokens/step | |
| micro_batch_size: 1 # mbs=2 OOM'd on transient all-gather (8 GB failed alloc) → stay at mbs=1 | |
| gradient_accumulation_steps: 16 | |
| chunked_cross_entropy: true # avoids materializing full (B,S,V) logits tensor | |
| plugins: | |
| - axolotl.integrations.liger.LigerPlugin | |
| liger_glu_activation: true # fused GEGLU MLP for Gemma 4 (Triton) | |
| liger_rms_norm: false # keep existing fused_attn.py RMSNorm patch | |
| liger_rope: false # Gemma 4 incompatible (separate q/k) | |
| liger_cross_entropy: false # chunked_cross_entropy handles this | |
| liger_fused_linear_cross_entropy: false # Gemma 4 incompatible | |
| optimizer: adamw_bnb_8bit # 8-bit Adam: ~6 bytes/param opt state instead of 12; saves ~23 GB/GPU | |
| lr_scheduler: cosine | |
| learning_rate: 5e-5 # conservative for FFT CPT on 31B with 1B tokens | |
| # verified range: 8e-6 (Me-LLaMA) to 1.5e-4 (MEDITRON) | |
| # 5e-5 balances learning vs forgetting for our data scale | |
| weight_decay: 0.1 # standard for FFT with AdamW (MEDITRON used 0.1) | |
| max_grad_norm: 1.0 | |
| warmup_ratio: 0.01 # ~19 warmup steps (1% of ~1,932) before cosine decay | |
| # ── Precision & Memory ───────────────────────────────────────────────────── | |
| bf16: true | |
| tf32: true | |
| gradient_checkpointing: true | |
| gradient_checkpointing_kwargs: | |
| use_reentrant: true # DeepSpeed compatibility (per Axolotl docs) | |
| # ── Output & Checkpointing ───────────────────────────────────────────────── | |
| # /workspace/data is on a 16 TB volume; full DS checkpoints (~310 GB each) fit fine. | |
| output_dir: /workspace/data/axolotl_output/gemma4-31b-cpt | |
| logging_steps: 10 | |
| save_only_model: false # save optimizer + scheduler + RNG for exact resume | |
| saves_per_epoch: 4 # every ~25% of epoch (~every 483 steps / ~14 hrs) | |
| save_total_limit: 2 # keep latest 2 (briefly 3 during write) → ~900 GB peak on disk | |
| # Resume: accelerate launch ... axolotl_cpt.yml --resume_from_checkpoint <path> | |
| # Infra switch (different GPU count): run zero_to_fp32.py on old checkpoint, | |
| # then start fresh; optimizer state resets, loss wobbles briefly then recovers. | |
| val_set_size: 0 # no eval split: CPT trains on all data | |
| load_best_model_at_end: false | |
| # ── Weights & Biases ────────────────────────────────────────────────────── | |
| wandb_project: virtual_agama | |
| wandb_run_id: gemma4-31b-fft-stage1 | |
| # ── Benchmark First! ────────────────────────────────────────────────────── | |
| # Before committing full budget, run a quick throughput test: | |
| # 1. Set max_steps: 50 | |
| # 2. Launch training, note tokens/sec from logs | |
| # 3. Calculate: 1,013,000,000 / tok_per_sec / 3600 * 8.94 = total cost | |
| # 4. If over $500, options: | |
| # a) Train on domain only (760M tok): skip GK mix, add at SFT stage | |
| # b) Stretch budget $50-100: worth it for FFT quality over QLoRA | |
| ``` | |
| </details><br> | |
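The "Benchmark First" comments in the config describe a cost check based on measured throughput. As a rough sketch of that arithmetic, the snippet below uses the token count (1,013,000,000) and hourly rate ($8.94) quoted in the config; the function names are hypothetical, and the throughput value is something you would read from your own 50-step test run.

```python
TOTAL_TOKENS = 1_013_000_000  # "~1.013B" tokens, from the config header
HOURLY_RATE_USD = 8.94        # hourly rate quoted in the config comments


def estimate_run(tok_per_sec, total_tokens=TOTAL_TOKENS, hourly_rate=HOURLY_RATE_USD):
    """Return (hours, cost_usd) for one pass over the data at a given throughput."""
    hours = total_tokens / tok_per_sec / 3600
    return hours, hours * hourly_rate


def required_throughput(budget_usd, total_tokens=TOTAL_TOKENS, hourly_rate=HOURLY_RATE_USD):
    """Return the minimum tokens/sec needed to finish within a budget."""
    budget_hours = budget_usd / hourly_rate
    return total_tokens / (budget_hours * 3600)


if __name__ == "__main__":
    measured = 5000.0  # placeholder value; substitute the tokens/sec from your logs
    hours, cost = estimate_run(measured)
    print(f"{hours:.1f} h, ${cost:.0f} at {measured:.0f} tok/s")
    print(f"need >= {required_throughput(500.0):.0f} tok/s to stay under $500")
```

The second helper inverts the same formula, which makes the $500 threshold from step 4 directly comparable with the benchmark number.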
| # workspace/data/axolotl_output/gemma4-31b-cpt | |
| This model is a fine-tuned version of [google/gemma-4-31B](https://huggingface.co/google/gemma-4-31B) on the AiForgeMaster/gemma4-31b-cpt-data dataset. | |
| ## Model description | |
| More information needed | |
| ## Intended uses & limitations | |
| More information needed | |
| ## Training and evaluation data | |
| More information needed | |
| ## Training procedure | |
| ### Training hyperparameters | |
| The following hyperparameters were used during training: | |
| - learning_rate: 5e-05 | |
| - train_batch_size: 1 | |
| - eval_batch_size: 1 | |
| - seed: 42 | |
| - distributed_type: multi-GPU | |
| - num_devices: 8 | |
| - gradient_accumulation_steps: 16 | |
| - total_train_batch_size: 128 | |
| - total_eval_batch_size: 8 | |
| - optimizer: Use OptimizerNames.ADAMW_BNB with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments | |
| - lr_scheduler_type: cosine | |
| - lr_scheduler_warmup_steps: 16 | |
| - training_steps: 1624 | |
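The derived values in this list follow from the base settings. The sketch below recomputes them as a sanity check; `sequence_len` and `warmup_ratio` are taken from the Axolotl config above, and truncating the warmup product toward zero is an assumption that happens to reproduce the reported 16 steps, not a statement of how the trainer rounds.

```python
# Values reported in the hyperparameter list
train_batch_size = 1
num_devices = 8
gradient_accumulation_steps = 16
training_steps = 1624

# Values taken from the Axolotl config
sequence_len = 4096
warmup_ratio = 0.01


def total_train_batch_size(per_device, grad_accum, devices):
    """Samples contributing to one optimizer step across all devices."""
    return per_device * grad_accum * devices


def tokens_per_step(total_batch, seq_len):
    """Upper bound on tokens per optimizer step, assuming every packed sample is full."""
    return total_batch * seq_len


def warmup_steps(steps, ratio):
    """Warmup length, truncated toward zero (rounding rule is an assumption)."""
    return int(steps * ratio)


total_batch = total_train_batch_size(train_batch_size, gradient_accumulation_steps, num_devices)
step_tokens = tokens_per_step(total_batch, sequence_len)
warmup = warmup_steps(training_steps, warmup_ratio)

print(total_batch, step_tokens, warmup)
```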
| ### Training results | |
| ### Framework versions | |
| - Transformers 5.5.4 | |
| - Pytorch 2.10.0+cu128 | |
| - Datasets 4.8.4 | |
| - Tokenizers 0.22.2 | |
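To compare a local environment against the versions listed above, something like the following sketch may help. It relies on `importlib.metadata` from the standard library; the helper names are hypothetical, the PyPI distribution name for PyTorch is assumed to be `torch`, and local build tags such as `+cu128` are ignored when comparing.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions reported in the model card, keyed by PyPI distribution name
REPORTED_VERSIONS = {
    "transformers": "5.5.4",
    "torch": "2.10.0+cu128",
    "datasets": "4.8.4",
    "tokenizers": "0.22.2",
}


def base_version(version_string):
    """Drop a local build tag such as '+cu128'."""
    return version_string.split("+", 1)[0]


def compare_versions(reported, lookup=version):
    """Map each package to (reported, installed_or_None, matches)."""
    report = {}
    for package, expected in reported.items():
        try:
            installed = lookup(package)
        except PackageNotFoundError:
            installed = None
        matches = installed is not None and base_version(installed) == base_version(expected)
        report[package] = (expected, installed, matches)
    return report


if __name__ == "__main__":
    for package, (expected, installed, ok) in compare_versions(REPORTED_VERSIONS).items():
        status = "ok" if ok else "differs"
        print(f"{package}: reported {expected}, installed {installed} ({status})")
```

A mismatch does not necessarily mean the model will fail to load; it only flags where the environment differs from the one used for training.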