justinj92/Delphermes-0.6B-R1

@@ -1,174 +1,51 @@
 ---
-library_name: peft
-license: apache-2.0
 base_model: Qwen/Qwen3-0.6B
 tags:
-- axolotl
-- generated_from_trainer
-datasets:
-- open-r1/Mixture-of-Thoughts
-- NousResearch/Hermes-3-Dataset
-model-index:
-- name: Delphermes-0.6B-R1-LORA
-  results: []
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-[<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
-<details><summary>See axolotl config</summary>
-axolotl version: `0.11.0`
-```yaml
-# ==== MODEL ====
-base_model: Qwen/Qwen3-0.6B
-hub_model_id: justinj92/Delphermes-0.6B-R1-LORA
-strict: false
-chat_template: qwen3
-# ==== DATASETS (unchanged) ====
-datasets:
-  - path: open-r1/Mixture-of-Thoughts
-    name: all
-    split: train
-    type: chat_template
-    field_messages: messages
-  - path: NousResearch/Hermes-3-Dataset
-    split: train
-    type: chat_template
-    field_messages: conversations
-    message_property_mappings:
-      role: from
-      content: value
-val_set_size: 0.05
-output_dir: ./outputs/Delphermes-0.6B-R1-LORA
-dataset_prepared_path: last_run_prepared
-# ==== LENGTH / PACKING ====
-sequence_len: 8192
-sample_packing: true
-eval_sample_packing: true
-pad_to_sequence_len: true
-remove_unused_columns: true
-# ==== LoRA ====
-adapter: lora
-lora_r: 16
-lora_alpha: 64
-lora_dropout: 0.1
-lora_target_modules:
-  - q_proj
-  - k_proj
-  - v_proj
-  - o_proj
-  - gate_proj
-  - up_proj
-  - down_proj
-# ==== OPTIMIZER & SCHEDULE ====
-optimizer: adamw_torch_fused
-learning_rate: 0.0002             # Aggressive scenario (4× tokens, sqrt scale). (Baseline: 0.0002; Moderate: 0.00028)
-lr_scheduler: cosine
-weight_decay: 0.0
-max_grad_norm: 1.0
-warmup_steps: 10                  # Keep numeric; absolute steps per epoch shrink -> relative warmup % decreases; can raise to 30 if large batch.
-num_epochs: 3
-# ==== BATCHING (Aggressive) ====
-micro_batch_size: 4              # Change to 2 / 4 / 12 / 16 per scenario table
-gradient_accumulation_steps: 2    # Keep 1; raise only if chasing larger effective batch without OOM headroom.
-# ==== PRECISION / PERF ====
-bf16: true
-tf32: true
-flash_attention: true
-gradient_checkpointing: true
-gradient_checkpointing_kwargs:
-  use_reentrant: false
-# Optionally enable if micro_batch_size > 12:
-# activation_checkpointing: true   # (Axolotl flag if supported) or toggle in DS JSON.
-# ==== LOGGING ====
-wandb_project: updesh-ft
-logging_steps: 1
-evals_per_epoch: 2
-saves_per_epoch: 1
-save_first_step: true
-eval_max_new_tokens: 500
-# ==== DEEPSPEED ====
-deepspeed: deepspeed_configs/zero2_b200.json
-# ==== DISTRIBUTED CONTROL ====
-fsdp: []
-fsdp_config: {}
-# ==== QUANTIZATION (disabled) ====
-load_in_4bit: false
-load_in_8bit: true
-special_tokens:
-```
-</details><br>
 # Delphermes-0.6B-R1-LORA
-This model is a fine-tuned version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) on the open-r1/Mixture-of-Thoughts and the NousResearch/Hermes-3-Dataset datasets.
-It achieves the following results on the evaluation set:
-- Loss: 0.8526
-## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
-## Training procedure
-### Training hyperparameters
-The following hyperparameters were used during training:
-- learning_rate: 0.0002
-- train_batch_size: 4
-- eval_batch_size: 4
-- seed: 42
-- distributed_type: multi-GPU
-- num_devices: 8
-- gradient_accumulation_steps: 2
-- total_train_batch_size: 64
-- total_eval_batch_size: 32
-- optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
-- lr_scheduler_type: cosine
-- lr_scheduler_warmup_steps: 10
-- training_steps: 6411
-### Training results
-| Training Loss | Epoch  | Step | Validation Loss |
-|:-------------:|:------:|:----:|:---------------:|
-| No log        | 0      | 0    | 1.0617          |
-| 0.8758        | 0.5001 | 1069 | 0.8699          |
-| 0.8335        | 1.0    | 2138 | 0.8615          |
-| 0.8603        | 1.5001 | 3207 | 0.8571          |
-| 0.8178        | 2.0    | 4276 | 0.8541          |
-| 0.8527        | 2.5001 | 5345 | 0.8526          |
-### Framework versions
-- PEFT 0.15.2
-- Transformers 4.53.1
-- Pytorch 2.7.0+cu128
-- Datasets 3.6.0
-- Tokenizers 0.21.2

 ---
+language:
+- ml
+- en
 base_model: Qwen/Qwen3-0.6B
+library_name: transformers
+pipeline_tag: text-generation
 tags:
+- malayalam
+- text-generation
+- lora
+- merged
+license: apache-2.0
 ---
 # Delphermes-0.6B-R1-LORA
+This is a merged LoRA model based on Qwen/Qwen3-0.6B, fine-tuned for Malayalam language tasks.
+## Model Details
+- **Base Model**: Qwen/Qwen3-0.6B
+- **Language**: Malayalam (ml), English (en)
+- **Type**: Merged LoRA model
+- **Library**: transformers
+## Usage
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+import torch
+model_name = "justinj92/Delphermes-0.6B-R1-LORA"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    torch_dtype=torch.float16,
+    device_map="auto"
+)
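+# Example usage: the prompt is a Malayalam greeting
+text = "നമസ്കാരം"
+# device_map="auto" may place the model on a GPU, so move the inputs to the same device
+inputs = tokenizer(text, return_tensors="pt").to(model.device)
+# max_length counts prompt tokens plus generated tokens
+outputs = model.generate(**inputs, max_length=100)
+response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+print(response)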
+```
+## Training Details
+This model was created by merging a LoRA adapter trained for Malayalam language understanding and generation.
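
Merging a LoRA adapter folds its low-rank update into the base weights, so the result loads as an ordinary transformers checkpoint. The sketch below shows that arithmetic in plain Python, assuming the standard LoRA formulation `W' = W + (alpha / r) * B @ A` and the `lora_r: 16` and `lora_alpha: 64` values from the removed axolotl config (a scale of 4.0). The toy matrices and the helper names `matmul` and `merge_lora` are illustrative and not taken from the author's pipeline.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, r, alpha):
    """Return W + (alpha / r) * (B @ A), the standard LoRA merge (assumed here)."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # shape (out_features, in_features), same as W
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy 2x2 weight with a rank-1 update; the config above used rank 16.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]       # shape (rank, in_features)
B = [[0.5], [0.25]]    # shape (out_features, rank)

# r=1 and alpha=4 give the same 4.0 scale as the config's 64 / 16.
merged = merge_lora(W, A, B, r=1, alpha=4)
print(merged)  # [[3.0, 4.0], [1.0, 3.0]]
```

In PEFT, `PeftModel.merge_and_unload()` applies this fold to every targeted module; the card does not say which tool was used for this model.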