---
base_model: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit
tags:
- text-generation-inference
---

# Model Card: LoRA-Finetuned Qwen2.5-3B-Instruct

## Model Overview
This model is built on **[Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)** and finetuned with **LoRA** (Low-Rank Adaptation) using **GRPO**, an RLHF-style reward-optimization method, with **vLLM** providing fast generation during training. It is trained to respond in a fixed structure (a `<reasoning> ... </reasoning>` section followed by a `<final_argument> ... </final_argument>` section) and to maximize the number of well-formed argument-objection pairs.

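Given the reward functions described under Key Features below, a well-formed completion looks roughly like this (the content is illustrative; only the tag layout is prescribed):

```
<reasoning>
<argument>First argument for the position.</argument>
<objection>An objection to that argument.</objection>
<argument>A rebuttal or second argument.</argument>
<objection>An objection to the rebuttal.</objection>
</reasoning>
<final_argument>The strongest overall argument, stated concisely.</final_argument>
```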
## Key Features
- **Base Model**: Qwen/Qwen2.5-3B-Instruct
- **Quantization & Optimization**:
  - 4-bit quantization (`load_in_4bit = True`) for a reduced memory footprint.
  - Configurable LoRA rank (`lora_rank = 16` in the example) for efficient finetuning.
  - Adjustable GPU memory utilization (`gpu_memory_utilization = 0.5`).
- **LoRA Finetuning**:
  - Applied to `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, and `down_proj`.
  - LoRA alpha set equal to `lora_rank`.
  - Gradient checkpointing (`use_gradient_checkpointing = "unsloth"`) to manage memory.
- **Reward Functions**:
  1. **Easy Format Reward**: checks that the generated content contains `<reasoning>` and `<final_argument>` tags.
  2. **Hard Format Reward**: ensures correct alternation of `<argument>` and `<objection>` tags within `<reasoning>`.
  3. **Number of Objections Reward**: a logarithmically scaled reward that encourages more argument-objection pairs.
- **Trainer Configuration (GRPO)**:
  - **Learning Rate**: `5e-6`
  - **Scheduler**: cosine (`lr_scheduler_type = "cosine"`)
  - **Batch Size & Accumulation**: `per_device_train_batch_size = 1`, `gradient_accumulation_steps = 1`
  - **Precision**: `bf16` if supported, otherwise `fp16`
  - **Train Steps**: `max_steps = 2500` (a single epoch in the example)
  - **Warmup Ratio**: `0.1`
  - **Optimizer**: `adamw_8bit`
  - **Maximum Generation Length**: up to `2000` completion tokens
  - **vLLM**: enabled via `use_vllm = True` for efficient generation during training
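
The quantization and LoRA settings above might be wired up with Unsloth roughly as follows. This is a sketch, not the original training script; `max_seq_length` and `fast_inference` are assumptions based on Unsloth's usual GRPO setup:

```python
from unsloth import FastLanguageModel

lora_rank = 16

# Load the 4-bit base model; fast_inference enables vLLM-backed generation.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit",
    max_seq_length=2048,          # assumed; not stated in this card
    load_in_4bit=True,
    fast_inference=True,
    max_lora_rank=lora_rank,
    gpu_memory_utilization=0.5,
)

# Attach LoRA adapters to the attention and MLP projections listed above.
model = FastLanguageModel.get_peft_model(
    model,
    r=lora_rank,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=lora_rank,
    use_gradient_checkpointing="unsloth",
)
```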
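
The three reward functions might look like the following minimal sketch. It is simplified to score a single completion string (`GRPOTrainer` actually passes batched lists), and the exact reward scales used in training are not stated in this card:

```python
import math
import re

def easy_format_reward(completion: str) -> float:
    # Reward the presence of both <reasoning> and <final_argument> sections.
    has_reasoning = "<reasoning>" in completion and "</reasoning>" in completion
    has_final = "<final_argument>" in completion and "</final_argument>" in completion
    return 0.5 * has_reasoning + 0.5 * has_final

def hard_format_reward(completion: str) -> float:
    # Reward strict <argument>/<objection> alternation inside <reasoning>.
    m = re.search(r"<reasoning>(.*?)</reasoning>", completion, re.DOTALL)
    if m is None:
        return 0.0
    pair = r"(?:\s*<argument>.*?</argument>\s*<objection>.*?</objection>)+\s*"
    return 1.0 if re.fullmatch(pair, m.group(1), re.DOTALL) else 0.0

def objection_count_reward(completion: str) -> float:
    # Logarithmic scaling: more pairs help, with diminishing returns.
    n = len(re.findall(r"<objection>.*?</objection>", completion, re.DOTALL))
    return math.log1p(n)
```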
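
Assuming the trainer is TRL's `GRPOTrainer`, the configuration above can be sketched as a `GRPOConfig`; `output_dir` is an assumed value:

```python
from trl import GRPOConfig
from unsloth import is_bfloat16_supported

training_args = GRPOConfig(
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="adamw_8bit",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    bf16=is_bfloat16_supported(),       # fall back to fp16 otherwise
    fp16=not is_bfloat16_supported(),
    max_steps=2500,
    max_completion_length=2000,         # the 2000-token completion limit
    use_vllm=True,
    output_dir="outputs",               # assumed; not stated in this card
)
```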

## Dataset
- The training script loads a **custom JSON file** (`questions.json`) with a simple question-prompt structure.
- Each sample is mapped to a system prompt enforcing the `<reasoning>` / `<final_argument>` format and a user prompt containing the question.
- The model is trained to produce structured arguments and objections in response to these questions.
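
The dataset preparation could be sketched like this. The `question` key and the exact system-prompt wording are assumptions, since neither is given in this card:

```python
import json

# Hypothetical system prompt enforcing the output structure; the exact
# wording used in training is not included in this card.
SYSTEM_PROMPT = (
    "Respond in the following format:\n"
    "<reasoning>\n"
    "<argument>...</argument>\n"
    "<objection>...</objection>\n"
    "</reasoning>\n"
    "<final_argument>...</final_argument>"
)

def to_chat_example(sample: dict) -> dict:
    # Map one questions.json entry to the chat-style prompt GRPO expects.
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": sample["question"]},
        ]
    }

def load_questions(path: str) -> list[dict]:
    # Load the custom JSON file and convert every sample.
    with open(path) as f:
        return [to_chat_example(s) for s in json.load(f)]
```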