# Model Card: LoRA-Finetuned Qwen2.5-3B-Instruct
## Model Overview
This model is built on **[Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)** and finetuned with **LoRA** (Low-Rank Adaptation) and **RLHF**-style reward optimization (GRPO), using **vLLM** for fast generation. It is trained to respond in a fixed structure (a `<reasoning> ... </reasoning>` section followed by a `<final_argument> ... </final_argument>` section) and to maximize the number of well-formed argument-objection pairs.
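
For illustration, a completion that satisfies this structure might look like the following (a hypothetical example, not actual model output):

```
<reasoning>
<argument>Universal basic income reduces poverty by guaranteeing a floor of income.</argument>
<objection>It may be fiscally unsustainable at scale without new revenue sources.</objection>
</reasoning>
<final_argument>
On balance, a phased pilot is the most defensible policy.
</final_argument>
```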
## Key Features
- **Base Model**: Qwen/Qwen2.5-3B-Instruct
- **Quantization & Optimization**:
- 4-bit quantization (`load_in_4bit = True`) for reduced memory footprint.
- LoRA rank can be set (`lora_rank = 16` in the example) for efficient finetuning.
- Partial GPU memory utilization (`gpu_memory_utilization = 0.5`) can be adjusted.
- **LoRA Finetuning**:
- Applied to `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, and `down_proj`.
- LoRA alpha set to `lora_rank`.
- Uses gradient checkpointing (`use_gradient_checkpointing = "unsloth"`) to manage memory.
- **Reward Functions**:
1. **Easy Format Reward**: Checks for `<reasoning>` and `<final_argument>` tags in the generated content.
2. **Hard Format Reward**: Ensures correct alternation of `<argument>` and `<objection>` tags within `<reasoning>`.
3. **Number of Objections Reward**: Uses a logarithmic scale reward to encourage more argument-objection pairs.
- **Trainer Configuration (GRPO)**:
- **Learning Rate**: `5e-6`
- **Scheduler**: Cosine (`lr_scheduler_type = "cosine"`)
- **Batch Size & Accumulation**: `per_device_train_batch_size = 1` and `gradient_accumulation_steps = 1`
- **Precision**: Automatic selection of `bf16` if available, otherwise `fp16`
- **Train Steps**: `max_steps = 2500` (with a single epoch in the example)
- **Warmup Ratio**: `0.1`
- **Optimizer**: `adamw_8bit`
- **Maximum Generation Length**: up to `2000` completion tokens
- **vLLM** support via `use_vllm = True` for efficient inference
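
The three reward functions above can be sketched roughly as follows. This is a minimal illustration, not the actual training script: the tag names come from this card, while the strict-alternation check and the `log1p` scaling are assumptions about how the rewards are computed.

```python
import math
import re

def easy_format_reward(completion: str) -> float:
    """Reward 1.0 if a <reasoning> block is followed by a <final_argument> block."""
    pattern = r"<reasoning>.*?</reasoning>\s*<final_argument>.*?</final_argument>"
    return 1.0 if re.search(pattern, completion, re.DOTALL) else 0.0

def hard_format_reward(completion: str) -> float:
    """Reward 1.0 if <argument>/<objection> tags strictly alternate inside <reasoning>,
    starting with an <argument> (alternation rule is an assumption)."""
    body = re.search(r"<reasoning>(.*?)</reasoning>", completion, re.DOTALL)
    if not body:
        return 0.0
    tags = re.findall(r"<(argument|objection)>", body.group(1))
    if not tags or tags[0] != "argument":
        return 0.0
    alternates = all(a != b for a, b in zip(tags, tags[1:]))
    return 1.0 if alternates else 0.0

def objection_count_reward(completion: str) -> float:
    """Logarithmic-scale reward: more argument-objection pairs score higher,
    with diminishing returns (log1p scaling is an assumption)."""
    n = len(re.findall(r"<objection>", completion))
    return math.log1p(n)
```

In a GRPO setup these would be passed to the trainer as a list of reward functions, each scoring a batch of sampled completions.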
## Dataset
- The training script loads a **custom JSON file** (`questions.json`) containing a list of question prompts.
- Each sample is mapped to a system prompt enforcing the `<reasoning>` / `<final_argument>` format, and a user prompt containing the question.
- The model is trained to provide structured arguments and objections based on these questions.
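
The mapping step can be sketched as below. The `question` key and the exact system-prompt wording are assumptions (the schema of `questions.json` is not given in this card); the prompt text is a paraphrase of the format described above.

```python
import json

# Hypothetical system prompt enforcing the <reasoning>/<final_argument> format.
SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
<argument>...</argument>
<objection>...</objection>
</reasoning>
<final_argument>
...
</final_argument>"""

def to_chat_sample(item: dict) -> dict:
    """Map one questions.json entry to the chat-style prompt the trainer expects."""
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": item["question"]},
        ]
    }

def load_dataset_rows(path: str) -> list[dict]:
    """Load the custom JSON file and map every entry."""
    with open(path) as f:
        return [to_chat_sample(item) for item in json.load(f)]
```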
---
base_model: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit
tags:
- text-generation-inference
---