---
library_name: transformers
license: apache-2.0
base_model: allenai/Olmo-3-1025-7B
tags:
- axolotl
- generated_from_trainer
datasets:
- dataset-tfs-mk-IMP-SOS-processed-olmo3-think.jsonl
model-index:
- name: O37BB
  results: []
---

[Built with Axolotl](https://github.com/axolotl-ai-cloud/axolotl)
<details><summary>See axolotl config</summary>

axolotl version: `0.13.0.dev0`

```yaml
# --- Base Model & Tokenizer Configuration ---
base_model: allenai/Olmo-3-1025-7B
trust_remote_code: true
hub_model_id: Auditt/O37BB                  # Push the model to the Hugging Face Hub
chat_template_jinja: /workspace/data/model-output/chat_template.jinja  # Uses the template defined in tokenizer_config.json

# --- Dataset Configuration ---
# Assumes a standard conversation format (ShareGPT/ChatML style)
datasets:
  - path: dataset-tfs-mk-IMP-SOS-processed-olmo3-think.jsonl
    type: chat_template
    field_messages: messages                # The top-level key containing the message list
    message_field_role: role                # The key inside the list for 'user'/'assistant'
    message_field_content: content          # The key inside the list for the actual text

    # Map your roles:
    # the keys (left) are what Axolotl expects,
    # the values (right) are what exist in the raw JSONL file.
    roles:
      user: ["user"]
      assistant: ["assistant"]
      system: ["system"]

    # Supervision: compute loss ONLY on the "assistant" turns.
    roles_to_train: ["assistant"]

val_set_size: 0.1                           # 10% validation, 90% training
dataset_prepared_path: last_run_prepared

# --- Training Strategy ---
sequence_len: 60000                         # Max sequence length
sample_packing: true                        # Efficiently packs samples to fill sequence_len
pad_to_sequence_len: true

# Supervision settings
train_on_inputs: false                      # false = mask user prompts (supervise assistant only)
group_by_length: false                      # Usually false when sample_packing is true

# --- Hyperparameters & Training Loop ---
num_epochs: 2
micro_batch_size: 1                         # Keep small due to 60k context
gradient_accumulation_steps: 4              # Adjust based on desired global batch size
learning_rate: 0.00001
optimizer: adamw_torch

# --- Distributed Training & Memory ---
context_parallel_size: 2                    # Splits the 60k sequence across 2 GPUs
gradient_checkpointing: true                # Essential for 60k context
flash_attention: true                       # Essential for speed/memory at this length

# --- Logging & Evaluation ---
logging_steps: 1                            # Log training loss every step
evals_per_epoch: 1                          # Run eval roughly once per epoch
#eval_strategy: epoch
#save_strategy: epoch                       # Save checkpoint at end of epoch

#wandb_project: olmo3-finetune              # Optional: Weights & Biases logging
#wandb_entity: your-entity                  # Optional

output_dir: /workspace/data/model-output-base

# --- Precision ---
bf16: true                                  # Bfloat16 is recommended for OLMo
fp16: false
tf32: true

tokens:                                     # Add these to the tokenizer
  - "𐔲"
  - "𐔾"
  - "〉"
  - "𝜎"
  - "⋁"
  - "𐕠"
  - "𐕜"
  - "𐔸"
  - "∧"
  - "≥"
  - "𐕟"
  - "𐕖"
  - "⟂"
  - "𐕏"
  - "⋀"
  - "𐕣"
  - "𐕃"
  - "𐕙"
  - "𐕕"
  - "χ"
  - "𐕊"
  - "〈"
  - "𐕐"
  - "𐔻"
  - "𐕀"
  - "𐔳"
  - "≠"
  - "𐔷"
  - "≤"
  - "𐕞"
  - "𐔱"
  - "𐕂"
  - "↦"
  - "𐕎"
  - "→"
  - "𐕛"
  - "𐔰"
  - "ε"
```

</details>
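With `type: chat_template`, this config expects each JSONL line to carry a top-level `messages` list whose items have `role` and `content` keys, with loss computed only on assistant turns. As a minimal sketch (the record content below is invented for illustration and is not taken from the actual dataset), one line of `dataset-tfs-mk-IMP-SOS-processed-olmo3-think.jsonl` could be produced like this:

```python
import json

# Hypothetical record matching field_messages / message_field_role / message_field_content.
record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},        # optional system turn
        {"role": "user", "content": "Prove that x ≤ y ∧ y ≤ x → x = y."},     # masked (train_on_inputs: false)
        {"role": "assistant", "content": "<think>…</think> The claim follows from antisymmetry of ≤."},  # loss computed here
    ]
}

# Each line of the JSONL file is one such object serialized on its own line.
with open("example.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```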

# O37BB

This model is a fine-tuned version of [allenai/Olmo-3-1025-7B](https://huggingface.co/allenai/Olmo-3-1025-7B) on the dataset-tfs-mk-IMP-SOS-processed-olmo3-think.jsonl dataset.
It achieves the following results on the evaluation set:
- Loss: 0.0019
- Memory/max active (GiB): 85.95
- Memory/max allocated (GiB): 82.72
- Memory/device reserved (GiB): 93.36

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 4
- total_train_batch_size: 8
- total_eval_batch_size: 2
- optimizer: AdamW (torch) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10
- training_steps: 348

### Training results

| Training Loss | Epoch  | Step | Validation Loss | Max Active (GiB) | Max Allocated (GiB) | Device Reserved (GiB) |
|:-------------:|:------:|:----:|:---------------:|:----------------:|:-------------------:|:---------------------:|
| No log        | 0      | 0    | 1.0680          | 58.72            | 55.5                | 65.44                 |
| 0.0647        | 0.9943 | 174  | 0.0021          | 85.95            | 82.72               | 106.04                |
| 0.0296        | 1.9943 | 348  | 0.0019          | 85.95            | 82.72               | 93.36                 |

### Framework versions

- Transformers 4.57.0
- Pytorch 2.7.1+cu126
- Datasets 4.0.0
- Tokenizers 0.22.1
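A minimal, hypothetical inference sketch for loading the published checkpoint (`Auditt/O37BB`, the `hub_model_id` from the config) with `transformers`; the prompt and generation settings below are illustrative only and are not part of the original training setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Auditt/O37BB"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # training used bf16
    device_map="auto",
    trust_remote_code=True,
)

# Build a prompt with the tokenizer's chat template (assumed to ship with the checkpoint).
messages = [{"role": "user", "content": "State the antisymmetry axiom for ≤."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```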