How to use allura-forge/g12bsftep2 with Transformers:
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("allura-forge/g12bsftep2", dtype="auto")
axolotl version: 0.13.0.dev0
## model
base_model: ./model
## qlora COPE!!!
load_in_8bit: false
load_in_4bit: false #false
strict: false
# === Data Configuration ===
datasets:
  - path: allura-forge/expr-rp-sft-mix
    type: chat_template
    split: train
    field_messages: conversations
    message_field_role: from
    message_field_content: value
chat_template: jinja
chat_template_jinja: "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = 'user' %}{% endif %}{{ '<start_of_turn>' + role + '\n' + message['content'] | trim + '<end_of_turn>\n' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model\n'}}{% endif %}"
shuffle_merged_datasets: true
dataset_prepared_path: dataset_prepareds
val_set_size: 0.0
output_dir: ./output
max_grad_norm: 0.1
## Liger + CCE
plugins:
  - axolotl.integrations.liger.LigerPlugin
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
liger_rope: true
liger_rms_norm: true
liger_layer_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: false
cut_cross_entropy: true
## CTX settings
sequence_len: 16384
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true
## WandB
wandb_project: g12b-slopification
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:
## hoe params
gradient_accumulation_steps: 2 # ???
micro_batch_size: 2
num_epochs: 2
lr_scheduler: rex
learning_rate: 2e-6
optimizer: adamw_torch_8bit # Options: "paged_ademamix_8bit", "adamw_bnb_8bit", "paged_adamw_8bit"
train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false
gradient_checkpointing: offload
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true
s2_attention:
special_tokens:
  eos_token: "<end_of_turn>"
warmup_steps: 25
saves_per_epoch: 4
debug:
weight_decay: 0.0
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_activation_checkpointing: true
  fsdp_limit_all_gathers: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Gemma3DecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_reshard_after_forward: true
fsdp_version: 2
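To make the `chat_template_jinja` string in the config above easier to read, here is a minimal plain-Python sketch of what it renders. The function name `render_gemma_chat` is hypothetical, and the default `bos_token` value of `<bos>` is an assumption about the tokenizer; the template itself only refers to `bos_token`. The sketch does not use Jinja, so treat it as an illustration of the output format rather than what axolotl executes.

```python
def render_gemma_chat(messages, bos_token="<bos>", add_generation_prompt=False):
    """Plain-Python equivalent of the chat_template_jinja string (hypothetical helper).

    bos_token="<bos>" is an assumption; pass the tokenizer's real BOS token.
    """
    parts = [bos_token]
    for message in messages:
        # Only 'assistant' maps to 'model'; every other role
        # (including 'system') is rendered as a 'user' turn.
        role = "model" if message["role"] == "assistant" else "user"
        # In the template, `| trim` binds tighter than `+`,
        # so only the message content is stripped.
        content = message["content"].strip()
        parts.append("<start_of_turn>" + role + "\n" + content + "<end_of_turn>\n")
    if add_generation_prompt:
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


example = [
    {"role": "user", "content": "  Hello there  "},
    {"role": "assistant", "content": "Hi!"},
]
rendered = render_gemma_chat(example, add_generation_prompt=True)
print(rendered)
```

Note that the template has no separate system turn: a message with role `system` is emitted as a `user` turn, which matters if prompts at inference time are expected to match the training format.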
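This model was fine-tuned from a local base checkpoint (`base_model: ./model` in the config above, wrapped as `Gemma3DecoderLayer` under FSDP) on the allura-forge/expr-rp-sft-mix dataset; the auto-generated card text describes it as "trained from scratch" because the base model is only given as a local path.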
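The batch settings in the config (`micro_batch_size: 2`, `gradient_accumulation_steps: 2`, `sequence_len: 16384` with `sample_packing` and `pad_to_sequence_len` enabled) determine the global batch only once the device count is known, and the card does not state it. The sketch below shows the arithmetic; `num_devices=8` is a hypothetical value chosen for illustration, not the actual training setup, and `effective_batch` is a made-up helper name.

```python
def effective_batch(micro_batch_size, gradient_accumulation_steps,
                    num_devices, sequence_len):
    """Return (packed sequences per optimizer step, upper bound on tokens per step)."""
    # Each device processes micro_batch_size rows per forward pass and
    # accumulates gradients over gradient_accumulation_steps passes.
    sequences_per_step = micro_batch_size * gradient_accumulation_steps * num_devices
    # With packing and padding to sequence_len, every row holds exactly
    # sequence_len token positions (some of which may be padding).
    max_tokens_per_step = sequences_per_step * sequence_len
    return sequences_per_step, max_tokens_per_step


# micro_batch_size, gradient_accumulation_steps and sequence_len come from
# the config; num_devices=8 is an assumption for illustration only.
seqs, tokens = effective_batch(
    micro_batch_size=2,
    gradient_accumulation_steps=2,
    num_devices=8,
    sequence_len=16384,
)
print(seqs, tokens)
```

Under this assumed device count, each optimizer step would see 32 packed rows; substitute the real number of GPUs to get the actual figure.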
The following hyperparameters were used during training: