Update README.md

README.md
# Model Card: LoRA-Finetuned Qwen2.5-3B-Instruct

## Model Overview

This model is built on top of **[Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)** and finetuned with **LoRA** (Low-Rank Adaptation) plus **RLHF**-style reward optimization, leveraging **vLLM** for fast generation. It is trained to respond in a fixed structure (a `<reasoning> ... </reasoning>` section followed by a `<final_argument> ... </final_argument>` section) and to maximize the number of well-formed argument-objection pairs.
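
For illustration, a well-formed response under this scheme might look like the following (the argument content here is invented for the example):

```
<reasoning>
<argument>A guaranteed income floor reduces poverty directly.</argument>
<objection>Funding it may require politically difficult tax increases.</objection>
<argument>It simplifies welfare administration by replacing overlapping programs.</argument>
<objection>Replacing targeted programs could leave some high-need groups worse off.</objection>
</reasoning>
<final_argument>On balance, a guaranteed income floor is worth piloting, provided targeted support is preserved for high-need groups.</final_argument>
```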

## Key Features

- **Base Model**: Qwen/Qwen2.5-3B-Instruct
- **Quantization & Optimization** (see the setup sketch after this list):
  - 4-bit quantization (`load_in_4bit = True`) for a reduced memory footprint.
  - Configurable LoRA rank (`lora_rank = 16` in the example) for efficient finetuning.
  - Adjustable GPU memory utilization (`gpu_memory_utilization = 0.5`).
- **LoRA Finetuning**:
  - Adapters applied to `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, and `down_proj`.
  - LoRA alpha set equal to `lora_rank`.
  - Gradient checkpointing (`use_gradient_checkpointing = "unsloth"`) to manage memory.
- **Reward Functions** (sketched after this list):
  1. **Easy Format Reward**: checks for `<reasoning>` and `<final_argument>` tags in the generated content.
  2. **Hard Format Reward**: ensures correct alternation of `<argument>` and `<objection>` tags within `<reasoning>`.
  3. **Number of Objections Reward**: applies a logarithmically scaled reward to encourage more argument-objection pairs.
- **Trainer Configuration (GRPO)** (see the configuration sketch after this list):
  - **Learning Rate**: `5e-6`
  - **Scheduler**: cosine (`lr_scheduler_type = "cosine"`)
  - **Batch Size & Accumulation**: `per_device_train_batch_size = 1`, `gradient_accumulation_steps = 1`
  - **Precision**: `bf16` if available, otherwise `fp16`
  - **Train Steps**: `max_steps = 2500` (a single epoch in the example)
  - **Warmup Ratio**: `0.1`
  - **Optimizer**: `adamw_8bit`
  - **Maximum Generation Length**: up to `2000` tokens per completion
  - **vLLM**: support via `use_vllm = True` for efficient inference during training
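
The quantization and LoRA settings above correspond to an Unsloth-style setup. A minimal sketch, assuming Unsloth's `FastLanguageModel` API; the `max_seq_length` value is illustrative and not taken from the original script:

```python
from unsloth import FastLanguageModel

lora_rank = 16  # LoRA rank used in the example

# Load the 4-bit quantized base model with vLLM-backed fast inference.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=2048,          # illustrative; must cover prompt + completion
    load_in_4bit=True,            # 4-bit quantization for a smaller footprint
    fast_inference=True,          # enable vLLM generation
    gpu_memory_utilization=0.5,   # leave GPU headroom for training state
)

# Attach LoRA adapters to the attention and MLP projection layers.
model = FastLanguageModel.get_peft_model(
    model,
    r=lora_rank,
    lora_alpha=lora_rank,         # alpha set equal to the rank
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",  # memory-efficient checkpointing
)
```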
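The three reward functions might look like the following sketch. The function names, the TRL-style signature (each completion given as a list of chat messages), and the exact reward magnitudes are assumptions, not the original implementation:

```python
import math
import re

def easy_format_reward(completions, **kwargs):
    """Reward responses that contain both required top-level sections."""
    pattern = r"<reasoning>.*?</reasoning>\s*<final_argument>.*?</final_argument>"
    responses = [c[0]["content"] for c in completions]
    return [1.0 if re.search(pattern, r, re.DOTALL) else 0.0 for r in responses]

def hard_format_reward(completions, **kwargs):
    """Reward strict alternation of <argument>/<objection> inside <reasoning>."""
    scores = []
    for c in completions:
        text = c[0]["content"]
        m = re.search(r"<reasoning>(.*?)</reasoning>", text, re.DOTALL)
        if m is None:
            scores.append(0.0)
            continue
        tags = re.findall(r"<(argument|objection)>", m.group(1))
        # Alternation: even-indexed tags are arguments, odd-indexed are
        # objections, and the sequence starts with an argument.
        ok = len(tags) > 0 and all(
            t == ("argument" if i % 2 == 0 else "objection")
            for i, t in enumerate(tags)
        )
        scores.append(1.0 if ok else 0.0)
    return scores

def num_objections_reward(completions, **kwargs):
    """Logarithmically scaled reward for more argument-objection pairs."""
    scores = []
    for c in completions:
        text = c[0]["content"]
        pairs = len(re.findall(r"<objection>.*?</objection>", text, re.DOTALL))
        scores.append(math.log1p(pairs))  # log(1 + n): diminishing returns
    return scores
```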
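The trainer configuration, sketched against TRL's `GRPOConfig` / `GRPOTrainer`; `output_dir` and the wiring of `model`, `tokenizer`, and `dataset` (from the other sketches) are illustrative:

```python
import torch
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="adamw_8bit",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    bf16=torch.cuda.is_bf16_supported(),       # bf16 if available...
    fp16=not torch.cuda.is_bf16_supported(),   # ...otherwise fp16
    max_completion_length=2000,  # up to 2000 generated tokens
    max_steps=2500,
    use_vllm=True,               # generate rollouts with vLLM
    output_dir="outputs",        # illustrative
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[easy_format_reward, hard_format_reward, num_objections_reward],
    args=training_args,
    train_dataset=dataset,       # built in the dataset sketch below
)
trainer.train()
```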

## Dataset

- The training script loads a **custom JSON file** (`questions.json`) with a simple question-prompt structure.
- Each sample is mapped to a system prompt enforcing the `<reasoning>` / `<final_argument>` format and a user prompt containing the question (see the sketch below).
- The model is trained to provide structured arguments and objections based on these questions.
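
A sketch of the dataset preparation, assuming `questions.json` is a list of objects with a `question` field; the system-prompt wording here is illustrative, not the exact text used in training:

```python
from datasets import load_dataset

# Illustrative system prompt; the actual wording in the training script may differ.
SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
<argument>...</argument>
<objection>...</objection>
</reasoning>
<final_argument>...</final_argument>"""

# questions.json is assumed to be a list of objects with a "question" field.
dataset = load_dataset("json", data_files="questions.json", split="train")
dataset = dataset.map(lambda x: {
    "prompt": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": x["question"]},
    ],
})
```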

base_model: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit
tags:
- text-generation-inference
|