Update README.md

README.md
# Model Card: LoRA-Finetuned Qwen2.5-3B-Instruct

## Model Overview

This model is built on top of **[Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)** and finetuned with **LoRA** (Low-Rank Adaptation) plus **RLHF**-style reward optimization, leveraging **vLLM** for fast generation. It is trained to respond in a fixed structure (a `<reasoning> ... </reasoning>` section followed by a `<final_argument> ... </final_argument>` section) and to maximize the number of well-formed argument-objection pairs.
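
For illustration, a well-formed response under this scheme might look like the following (the argument content here is invented for the example):

```
<reasoning>
<argument>A guaranteed income floor reduces poverty directly.</argument>
<objection>Funding it may require politically difficult tax increases.</objection>
<argument>It simplifies welfare administration by replacing overlapping programs.</argument>
<objection>Replacing targeted programs could leave some high-need groups worse off.</objection>
</reasoning>
<final_argument>On balance, a guaranteed income floor is worth piloting, provided targeted support is preserved for high-need groups.</final_argument>
```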

## Key Features

- **Base Model**: Qwen/Qwen2.5-3B-Instruct
- **Quantization & Optimization** (see the setup sketch after this list):
  - 4-bit quantization (`load_in_4bit = True`) for a reduced memory footprint.
  - Configurable LoRA rank (`lora_rank = 16` in the example) for efficient finetuning.
  - Adjustable GPU memory utilization (`gpu_memory_utilization = 0.5`).
- **LoRA Finetuning**:
  - Adapters applied to `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, and `down_proj`.
  - LoRA alpha set equal to `lora_rank`.
  - Gradient checkpointing (`use_gradient_checkpointing = "unsloth"`) to manage memory.
- **Reward Functions** (sketched after this list):
  1. **Easy Format Reward**: checks for `<reasoning>` and `<final_argument>` tags in the generated content.
  2. **Hard Format Reward**: ensures correct alternation of `<argument>` and `<objection>` tags within `<reasoning>`.
  3. **Number of Objections Reward**: applies a logarithmically scaled reward to encourage more argument-objection pairs.
- **Trainer Configuration (GRPO)** (see the configuration sketch after this list):
  - **Learning Rate**: `5e-6`
  - **Scheduler**: cosine (`lr_scheduler_type = "cosine"`)
  - **Batch Size & Accumulation**: `per_device_train_batch_size = 1`, `gradient_accumulation_steps = 1`
  - **Precision**: `bf16` if available, otherwise `fp16`
  - **Train Steps**: `max_steps = 2500` (a single epoch in the example)
  - **Warmup Ratio**: `0.1`
  - **Optimizer**: `adamw_8bit`
  - **Maximum Generation Length**: up to `2000` tokens per completion
  - **vLLM**: support via `use_vllm = True` for efficient inference during training
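
The quantization and LoRA settings above correspond to an Unsloth-style setup. A minimal sketch, assuming Unsloth's `FastLanguageModel` API; the `max_seq_length` value is illustrative and not taken from the original script:

```python
from unsloth import FastLanguageModel

lora_rank = 16  # LoRA rank used in the example

# Load the 4-bit quantized base model with vLLM-backed fast inference.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=2048,          # illustrative; must cover prompt + completion
    load_in_4bit=True,            # 4-bit quantization for a smaller footprint
    fast_inference=True,          # enable vLLM generation
    gpu_memory_utilization=0.5,   # leave GPU headroom for training state
)

# Attach LoRA adapters to the attention and MLP projection layers.
model = FastLanguageModel.get_peft_model(
    model,
    r=lora_rank,
    lora_alpha=lora_rank,         # alpha set equal to the rank
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",  # memory-efficient checkpointing
)
```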
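The three reward functions might look like the following sketch. The function names, the TRL-style signature (each completion given as a list of chat messages), and the exact reward magnitudes are assumptions, not the original implementation:

```python
import math
import re

def easy_format_reward(completions, **kwargs):
    """Reward responses that contain both required top-level sections."""
    pattern = r"<reasoning>.*?</reasoning>\s*<final_argument>.*?</final_argument>"
    responses = [c[0]["content"] for c in completions]
    return [1.0 if re.search(pattern, r, re.DOTALL) else 0.0 for r in responses]

def hard_format_reward(completions, **kwargs):
    """Reward strict alternation of <argument>/<objection> inside <reasoning>."""
    scores = []
    for c in completions:
        text = c[0]["content"]
        m = re.search(r"<reasoning>(.*?)</reasoning>", text, re.DOTALL)
        if m is None:
            scores.append(0.0)
            continue
        tags = re.findall(r"<(argument|objection)>", m.group(1))
        # Alternation: even-indexed tags are arguments, odd-indexed are
        # objections, and the sequence starts with an argument.
        ok = len(tags) > 0 and all(
            t == ("argument" if i % 2 == 0 else "objection")
            for i, t in enumerate(tags)
        )
        scores.append(1.0 if ok else 0.0)
    return scores

def num_objections_reward(completions, **kwargs):
    """Logarithmically scaled reward for more argument-objection pairs."""
    scores = []
    for c in completions:
        text = c[0]["content"]
        pairs = len(re.findall(r"<objection>.*?</objection>", text, re.DOTALL))
        scores.append(math.log1p(pairs))  # log(1 + n): diminishing returns
    return scores
```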
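The trainer configuration, sketched against TRL's `GRPOConfig` / `GRPOTrainer`; `output_dir` and the wiring of `model`, `tokenizer`, and `dataset` (from the other sketches) are illustrative:

```python
import torch
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="adamw_8bit",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    bf16=torch.cuda.is_bf16_supported(),       # bf16 if available...
    fp16=not torch.cuda.is_bf16_supported(),   # ...otherwise fp16
    max_completion_length=2000,  # up to 2000 generated tokens
    max_steps=2500,
    use_vllm=True,               # generate rollouts with vLLM
    output_dir="outputs",        # illustrative
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[easy_format_reward, hard_format_reward, num_objections_reward],
    args=training_args,
    train_dataset=dataset,       # built in the dataset sketch below
)
trainer.train()
```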

## Dataset

- The training script loads a **custom JSON file** (`questions.json`) with a simple question-prompt structure.
- Each sample is mapped to a system prompt enforcing the `<reasoning>` / `<final_argument>` format and a user prompt containing the question (see the sketch below).
- The model is trained to provide structured arguments and objections based on these questions.
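
A sketch of the dataset preparation, assuming `questions.json` is a list of objects with a `question` field; the system-prompt wording here is illustrative, not the exact text used in training:

```python
from datasets import load_dataset

# Illustrative system prompt; the actual wording in the training script may differ.
SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
<argument>...</argument>
<objection>...</objection>
</reasoning>
<final_argument>...</final_argument>"""

# questions.json is assumed to be a list of objects with a "question" field.
dataset = load_dataset("json", data_files="questions.json", split="train")
dataset = dataset.map(lambda x: {
    "prompt": [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": x["question"]},
    ],
})
```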

base_model: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit
tags:
- text-generation-inference
|