---
library_name: transformers
license: apache-2.0
base_model: allenai/Olmo-3-1025-7B
tags:
- axolotl
- generated_from_trainer
datasets:
- dataset-tfs-mk-IMP-SOS-processed-olmo3-think.jsonl
model-index:
- name: O37BB
  results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

[<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
<details><summary>See axolotl config</summary>

axolotl version: `0.13.0.dev0`
```yaml
# --- Base Model & Tokenizer Configuration ---
base_model: allenai/Olmo-3-1025-7B
trust_remote_code: true
hub_model_id: Auditt/O37BB # Push the model to the Hugging Face Hub
chat_template_jinja: /workspace/data/model-output/chat_template.jinja # Uses the template defined in tokenizer_config.json

# --- Dataset Configuration ---
# Assuming a standard conversation format (ShareGPT/ChatML style)
datasets:
  - path: dataset-tfs-mk-IMP-SOS-processed-olmo3-think.jsonl
    type: chat_template
    field_messages: messages       # The top-level key containing the list
    message_field_role: role       # The key inside the list for 'user'/'assistant'
    message_field_content: content # The key inside the list for the actual text

    # Role mapping:
    # the keys (left) are what Axolotl expects,
    # the values (right) are what exist in your raw JSONL file.
    roles:
      user: ["user"]
      assistant: ["assistant"]
      system: ["system"]

    # Supervision: ensure loss is calculated ONLY on the "assistant" turns.
    roles_to_train: ["assistant"]

val_set_size: 0.1 # 10% validation, 90% training
dataset_prepared_path: last_run_prepared

# --- Training Strategy ---
sequence_len: 60000       # Max sequence length
sample_packing: true      # Efficiently packs samples to fill sequence_len
pad_to_sequence_len: true

# Supervision settings
train_on_inputs: false    # False = mask user prompts (supervise assistant turns only)
group_by_length: false    # Usually false when sample_packing is true

# --- Hyperparameters & Training Loop ---
num_epochs: 2
micro_batch_size: 1            # Keep small due to the 60k context
gradient_accumulation_steps: 4 # Adjust based on desired global batch size
learning_rate: 0.00001
optimizer: adamw_torch

# --- Distributed Training & Memory ---
context_parallel_size: 2     # Splits the 60k sequence across 2 GPUs
gradient_checkpointing: true # Essential for 60k context
flash_attention: true        # Essential for speed/memory at this length

# --- Logging & Evaluation ---
logging_steps: 1   # Log training loss every step
evals_per_epoch: 1 # Run eval roughly once per epoch
#eval_strategy: epoch
#save_strategy: epoch          # Save a checkpoint at the end of each epoch
#wandb_project: olmo3-finetune # Optional: Weights & Biases logging
#wandb_entity: your-entity     # Optional
output_dir: /workspace/data/model-output-base

# --- Precision ---
bf16: true # bfloat16 is recommended for OLMo
fp16: false
tf32: true

tokens: # Add these to the tokenizer
- "π²"
- "πΎ"
- "γ"
- "π"
- "β"
- "π "
- "π"
- "πΈ"
- "β§"
- "β₯"
- "π"
- "π"
- "β"
- "π"
- "β"
- "π£"
- "π"
- "π"
- "π"
- "Ο"
- "π"
- "γ"
- "π"
- "π»"
- "π"
- "π³"
- "β "
- "π·"
- "β€"
- "π"
- "π±"
- "π"
- "β¦"
- "π"
- "β"
- "π"
- "π°"
- "Ξ΅"
```

</details><br>

# O37BB

This model is a fine-tuned version of [allenai/Olmo-3-1025-7B](https://huggingface.co/allenai/Olmo-3-1025-7B) on the dataset-tfs-mk-IMP-SOS-processed-olmo3-think.jsonl dataset.
It achieves the following results on the evaluation set:
- Loss: 0.0019
- Memory/max active (GiB): 85.95
- Memory/max allocated (GiB): 82.72
- Memory/device reserved (GiB): 93.36

## Model description

O37BB is a fine-tune of [allenai/Olmo-3-1025-7B](https://huggingface.co/allenai/Olmo-3-1025-7B) produced with Axolotl on a chat-formatted JSONL dataset, trained for 2 epochs on sequences of up to 60k tokens with loss computed only on assistant turns. A set of additional tokens (listed in the config above) was added to the tokenizer during fine-tuning.

## Intended uses & limitations

More information needed
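
Pending a fuller description, the sketch below shows one plausible way to load the model for chat-style inference. It assumes the checkpoint is accessible on the Hub under the `hub_model_id` from the config (`Auditt/O37BB`) and that your `transformers` version supports Olmo 3; because extra tokens were added during fine-tuning, the tokenizer should be loaded from this repository rather than from the base model.

```python
# Minimal inference sketch (illustrative; the model id and settings are taken
# from the Axolotl config above, not from an official usage example).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Auditt/O37BB"  # hub_model_id in the training config

tokenizer = AutoTokenizer.from_pretrained(model_id)  # includes the added tokens
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # training ran in bf16
    device_map="auto",           # requires `accelerate`
    trust_remote_code=True,      # mirrors the training config; may be unnecessary on recent transformers
)

messages = [{"role": "user", "content": "Hello, what can you do?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```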
## Training and evaluation data

More information needed
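
What the Axolotl config above does reveal: training used `dataset-tfs-mk-IMP-SOS-processed-olmo3-think.jsonl` with a 90%/10% train/validation split (`val_set_size: 0.1`), and the `chat_template` mapping implies that each JSONL line carries a `messages` list of `role`/`content` pairs. The snippet below is a minimal sketch of that assumed record shape; the actual dataset contents are not published.

```python
# Illustrative only: the record shape implied by the dataset mapping in the
# Axolotl config (field_messages=messages, message_field_role=role,
# message_field_content=content). Field values here are placeholders.
import json

record = {
    "messages": [
        {"role": "system", "content": "..."},
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."},
    ]
}

# One JSON object per line of the .jsonl training file
print(json.dumps(record, ensure_ascii=False))
```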
## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 4
- total_train_batch_size: 8
- total_eval_batch_size: 2
- optimizer: AdamW (torch) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10
- training_steps: 348
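
The reported total train batch size follows from micro_batch_size × gradient_accumulation_steps × num_devices = 1 × 4 × 2 = 8; note that with `context_parallel_size: 2` the two GPUs jointly process each long sequence, so the number of distinct sequences per optimizer step may be smaller than this figure suggests.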
### Training results

| Training Loss | Epoch  | Step | Validation Loss | Active (GiB) | Allocated (GiB) | Reserved (GiB) |
|:-------------:|:------:|:----:|:---------------:|:------------:|:---------------:|:--------------:|
| No log        | 0      | 0    | 1.0680          | 58.72        | 55.5            | 65.44          |
| 0.0647        | 0.9943 | 174  | 0.0021          | 85.95        | 82.72           | 106.04         |
| 0.0296        | 1.9943 | 348  | 0.0019          | 85.95        | 82.72           | 93.36          |

### Framework versions

- Transformers 4.57.0
- Pytorch 2.7.1+cu126
- Datasets 4.0.0
- Tokenizers 0.22.1