Mero Mero Gemma4 31B
A finetune of Gemma 4 31B designed for creative tasks.
Another difficult-to-work-with but extremely good model from Google. Compared to the original, this finetune has slightly better swipe diversity and a less flowery, less verbose writing style. Reasoning, however, tends to average out a bit longer than the original's. Intelligence appears to be on par with the original.
Supports both thinking and non-thinking modes.
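To try the model quickly, here is a minimal generation sketch with Hugging Face transformers. It assumes the repo loads through the standard AutoModelForCausalLM path and that the tokenizer ships the Gemma chat template; the prompt and sampling settings are illustrative, not recommendations.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zerofata/G4-MeroMero-31B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Write the opening scene of a heist story."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.9)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```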
Creation Process: SFT > Merge
SFT on approximately 49 million tokens.
Despite totaling 49 million tokens, this dataset is fairly modest in size: the trainable portion is somewhere in the rough ballpark of 10-15 million tokens. All of the datasets were trained on the last turn only, to faithfully mirror the Gemma 4 chat template (see the masking sketch below).
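For illustration, a minimal sketch of what last-turn-only training means in practice: every token before the final assistant turn is masked out of the loss. Function and variable names here are hypothetical, not from the actual training code.

```python
import torch

def mask_all_but_last_turn(input_ids: torch.Tensor, last_turn_start: int) -> torch.Tensor:
    """Build labels that only supervise the final assistant turn.

    input_ids:       token ids for the full rendered conversation
    last_turn_start: index where the final assistant turn begins
                     (e.g. found by tokenizing the prompt portion
                     separately and taking its length)
    """
    labels = input_ids.clone()
    # -100 is the ignore_index for PyTorch's cross-entropy loss, so
    # every token before the last turn contributes no gradient.
    labels[:last_turn_start] = -100
    return labels
```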
The approach was very similar to the 26B A4B MeroMero. I trained the model aggressively for 2 epochs on my data and, after testing various checkpoints, settled on the one at 1 epoch, which had the desired style and the fewest signs of overfitting.
I then merged this checkpoint back into the original instruct model, which cleaned up any remaining overfitting while still retaining the changes from the finetune.
Trained using Axolotl.
Mergekit Config
```yaml
models:
  - model: google/gemma-4-31B-it
  - model: ApocalypseParty/G4-31B-SFT-v3-1-1ep
merge_method: slerp
parameters:
  t: 0.5
base_model: google/gemma-4-31B-it
dtype: bfloat16
```
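For intuition about what the slerp method does, here is a minimal per-tensor sketch of spherical linear interpolation. This is an illustration of the general technique, not mergekit's exact implementation, which handles per-tensor edge cases and parameter schedules.

```python
import torch

def slerp_tensor(w_a: torch.Tensor, w_b: torch.Tensor, t: float = 0.5) -> torch.Tensor:
    """Spherically interpolate between two weight tensors."""
    a, b = w_a.flatten().float(), w_b.flatten().float()
    # Angle between the two flattened weight vectors.
    cos_theta = torch.clamp(torch.dot(a, b) / (a.norm() * b.norm()), -1.0, 1.0)
    theta = torch.acos(cos_theta)
    if theta.abs() < 1e-4:
        # Nearly parallel: fall back to plain linear interpolation.
        return (1 - t) * w_a + t * w_b
    sin_theta = torch.sin(theta)
    out = (torch.sin((1 - t) * theta) * a + torch.sin(t * theta) * b) / sin_theta
    return out.view_as(w_a).to(w_a.dtype)
```

With `t: 0.5` the merge sits halfway between the instruct model and the SFT checkpoint. A config like the one above is typically run with mergekit's `mergekit-yaml` CLI.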
Axolotl Config
```yaml
base_model: google/gemma-4-31B-it

plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
  - axolotl.integrations.liger.LigerPlugin
liger_layer_norm: true
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_rms_norm_gated: true
strict: false
cut_cross_entropy: true

datasets:
  - path: zerofata/pretok
val_set_size: 0.02
output_dir: ./G4-31B-SFT-v3-1

sequence_len: 10756
pad_to_sequence_len: true
sample_packing: true

load_in_4bit: false
adapter: lora
lora_r: 64
lora_alpha: 64
peft_use_rslora: true
lora_dropout: 0.0
freeze_mm_modules: true
lora_target_modules: 'model.language_model.layers.[\d]+.(_checkpoint_wrapped_module.)?(mlp|self_attn).(up|down|gate|q|k|v|o)_proj'

wandb_project: G4-31B-SFT
wandb_name: G4-31B-SFT-v3-1

gradient_accumulation_steps: 1
micro_batch_size: 4
num_epochs: 2
optimizer: adamw_torch_fused
lr_scheduler: constant_with_warmup
learning_rate: 1e-5
max_grad_norm: 1.0
bf16: auto
tf32: true
logging_steps: 1

# FA2 not supported
sdp_attention: true
# flex_attention: true
# torch_compile: true
flash_attention: false

warmup_ratio: 0.1
evals_per_epoch: 4
saves_per_epoch: 2
weight_decay: 0.05
special_tokens:

fsdp_config:
  fsdp_version: 2
  offload_params: false
  cpu_ram_efficient_loading: false
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Gemma4TextDecoderLayer
  state_dict_type: FULL_STATE_DICT
  sharding_strategy: FULL_SHARD
  reshard_after_forward: true
  activation_checkpointing: true
```
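One detail worth flagging in this config: with `peft_use_rslora: true`, PEFT scales the LoRA update by `lora_alpha / sqrt(lora_r)` rather than the classic `lora_alpha / lora_r`, so alpha 64 at rank 64 gives an effective scale of 8 rather than 1. A quick check:

```python
import math

lora_r, lora_alpha = 64, 64
classic_scale = lora_alpha / lora_r            # 1.0 with standard LoRA scaling
rslora_scale = lora_alpha / math.sqrt(lora_r)  # 8.0 with rank-stabilized LoRA
print(classic_scale, rslora_scale)
```

A config like this is launched through Axolotl's CLI, e.g. `axolotl train config.yaml` in recent versions.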