# Model Card: LoRA-Finetuned Qwen2.5-3B-Instruct
## Model Overview
This model is built on **[Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)** and finetuned with **LoRA** (Low-Rank Adaptation) and **RLHF**-style reward optimization (GRPO), using **vLLM** for fast generation. It is trained to respond in a fixed structure (a `<reasoning> ... </reasoning>` section followed by a `<final_argument> ... </final_argument>` section) and to maximize the number of well-formed argument-objection pairs.
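
For illustration, a completion that satisfies this structure might look like the following (a hypothetical example, not actual model output):

```
<reasoning>
<argument>Universal basic income reduces poverty by guaranteeing a floor of income.</argument>
<objection>It may be fiscally unsustainable at scale without new revenue sources.</objection>
</reasoning>
<final_argument>
On balance, a phased pilot is the most defensible policy.
</final_argument>
```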
## Key Features
- **Base Model**: Qwen/Qwen2.5-3B-Instruct
- **Quantization & Optimization**:
- 4-bit quantization (`load_in_4bit = True`) for reduced memory footprint.
- LoRA rank can be set (`lora_rank = 16` in the example) for efficient finetuning.
- Partial GPU memory utilization (`gpu_memory_utilization = 0.5`) can be adjusted.
- **LoRA Finetuning**:
- Applied to `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, and `down_proj`.
- LoRA alpha set to `lora_rank`.
- Uses gradient checkpointing (`use_gradient_checkpointing = "unsloth"`) to manage memory.
- **Reward Functions**:
1. **Easy Format Reward**: Checks for `<reasoning>` and `<final_argument>` tags in the generated content.
2. **Hard Format Reward**: Ensures correct alternation of `<argument>` and `<objection>` tags within `<reasoning>`.
3. **Number of Objections Reward**: Uses a logarithmic scale reward to encourage more argument-objection pairs.
- **Trainer Configuration (GRPO)**:
- **Learning Rate**: `5e-6`
- **Scheduler**: Cosine (`lr_scheduler_type = "cosine"`)
- **Batch Size & Accumulation**: `per_device_train_batch_size = 1` and `gradient_accumulation_steps = 1`
- **Precision**: Automatic selection of `bf16` if available, otherwise `fp16`
- **Train Steps**: `max_steps = 2500` (with a single epoch in the example)
- **Warmup Ratio**: `0.1`
- **Optimizer**: `adamw_8bit`
- **Maximum Generation Length**: up to `2000` completion tokens
- **vLLM** support via `use_vllm = True` for efficient inference
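
The three reward functions above can be sketched roughly as follows. This is a minimal illustration, not the actual training script: the tag names come from this card, while the strict-alternation check and the `log1p` scaling are assumptions about how the rewards are computed.

```python
import math
import re

def easy_format_reward(completion: str) -> float:
    """Reward 1.0 if a <reasoning> block is followed by a <final_argument> block."""
    pattern = r"<reasoning>.*?</reasoning>\s*<final_argument>.*?</final_argument>"
    return 1.0 if re.search(pattern, completion, re.DOTALL) else 0.0

def hard_format_reward(completion: str) -> float:
    """Reward 1.0 if <argument>/<objection> tags strictly alternate inside <reasoning>,
    starting with an <argument> (alternation rule is an assumption)."""
    body = re.search(r"<reasoning>(.*?)</reasoning>", completion, re.DOTALL)
    if not body:
        return 0.0
    tags = re.findall(r"<(argument|objection)>", body.group(1))
    if not tags or tags[0] != "argument":
        return 0.0
    alternates = all(a != b for a, b in zip(tags, tags[1:]))
    return 1.0 if alternates else 0.0

def objection_count_reward(completion: str) -> float:
    """Logarithmic-scale reward: more argument-objection pairs score higher,
    with diminishing returns (log1p scaling is an assumption)."""
    n = len(re.findall(r"<objection>", completion))
    return math.log1p(n)
```

In a GRPO setup these would be passed to the trainer as a list of reward functions, each scoring a batch of sampled completions.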
## Dataset
- The training script loads a **custom JSON file** (`questions.json`) containing a list of question prompts.
- Each sample is mapped to a system prompt enforcing the `<reasoning>` / `<final_argument>` format, and a user prompt containing the question.
- The model is trained to provide structured arguments and objections based on these questions.
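
The mapping step can be sketched as below. The `question` key and the exact system-prompt wording are assumptions (the schema of `questions.json` is not given in this card); the prompt text is a paraphrase of the format described above.

```python
import json

# Hypothetical system prompt enforcing the <reasoning>/<final_argument> format.
SYSTEM_PROMPT = """Respond in the following format:
<reasoning>
<argument>...</argument>
<objection>...</objection>
</reasoning>
<final_argument>
...
</final_argument>"""

def to_chat_sample(item: dict) -> dict:
    """Map one questions.json entry to the chat-style prompt the trainer expects."""
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": item["question"]},
        ]
    }

def load_dataset_rows(path: str) -> list[dict]:
    """Load the custom JSON file and map every entry."""
    with open(path) as f:
        return [to_chat_sample(item) for item in json.load(f)]
```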
---
base_model: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit
tags:
- text-generation-inference
---