ergotts committed on
Commit 54a6861 · verified · 1 Parent(s): 169a776

Update README.md

Files changed (1)
  1. README.md +1 -39
README.md CHANGED
@@ -1,42 +1,4 @@
-
-# Model Card: LoRA-Finetuned Qwen2.5-3B-Instruct
-
-## Model Overview
-This model is built on top of **[Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)** and finetuned with **LoRA** (Low-Rank Adaptation) and **RLHF**-style reward optimization, using **vLLM** for fast inference. It is trained to respond in a specific structure (`<reasoning> ... </reasoning>` and `<final_argument> ... </final_argument>` sections) and to maximize the number of well-formed argument-objection pairs.
-
-## Key Features
-- **Base Model**: Qwen/Qwen2.5-3B-Instruct
-- **Quantization & Optimization**:
-  - 4-bit quantization (`load_in_4bit = True`) for a reduced memory footprint.
-  - Configurable LoRA rank (`lora_rank = 16` in the example) for efficient finetuning.
-  - Adjustable partial GPU memory utilization (`gpu_memory_utilization = 0.5`).
-- **LoRA Finetuning**:
-  - Applied to `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, and `down_proj`.
-  - LoRA alpha set to `lora_rank`.
-  - Uses gradient checkpointing (`use_gradient_checkpointing = "unsloth"`) to manage memory.
-- **Reward Functions**:
-  1. **Easy Format Reward**: Checks for `<reasoning>` and `<final_argument>` tags in the generated content.
-  2. **Hard Format Reward**: Ensures correct alternation of `<argument>` and `<objection>` tags within `<reasoning>`.
-  3. **Number of Objections Reward**: Uses a logarithmically scaled reward to encourage more argument-objection pairs.
-- **Trainer Configuration (GRPO)**:
-  - **Learning Rate**: `5e-6`
-  - **Scheduler**: Cosine (`lr_scheduler_type = "cosine"`)
-  - **Batch Size & Accumulation**: `per_device_train_batch_size = 1` and `gradient_accumulation_steps = 1`
-  - **Precision**: `bf16` if available, otherwise `fp16`
-  - **Train Steps**: `max_steps = 2500` (with a single epoch in the example)
-  - **Warmup Ratio**: `0.1`
-  - **Optimizer**: `adamw_8bit`
-  - **Maximum Generation Length**: Up to `2000` completion tokens
-  - **vLLM** support via `use_vllm = True` for efficient inference
-
-## Dataset
-- The training script loads a **custom JSON file** (`questions.json`) with a simple question-prompt structure.
-- Each sample is mapped to a system prompt enforcing the `<reasoning>` / `<final_argument>` format and a user prompt containing the question.
-- The model is trained to produce structured arguments and objections for these questions.
-
-
-
-
+---
 base_model: unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit
 tags:
 - text-generation-inference
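The three reward functions described in the removed model card can be sketched roughly as follows. The function names, scoring values, and regex details are assumptions for illustration; the card does not show the actual training script.

```python
import math
import re

def easy_format_reward(completion: str) -> float:
    """Reward presence of both <reasoning> and <final_argument> sections."""
    has_reasoning = bool(re.search(r"<reasoning>.*?</reasoning>", completion, re.DOTALL))
    has_final = bool(re.search(r"<final_argument>.*?</final_argument>", completion, re.DOTALL))
    return 0.5 * has_reasoning + 0.5 * has_final

def hard_format_reward(completion: str) -> float:
    """Reward strict alternation of <argument>/<objection> pairs inside <reasoning>."""
    m = re.search(r"<reasoning>(.*?)</reasoning>", completion, re.DOTALL)
    if not m:
        return 0.0
    # Tag names in order of appearance, for both opening and closing tags.
    tags = re.findall(r"</?(argument|objection)>", m.group(1))
    # Expect repeating groups of four: <argument>, </argument>, <objection>, </objection>.
    expected = ["argument", "argument", "objection", "objection"]
    if not tags or len(tags) % 4 != 0:
        return 0.0
    return 1.0 if all(t == expected[i % 4] for i, t in enumerate(tags)) else 0.0

def objection_count_reward(completion: str) -> float:
    """Logarithmically scaled reward in the number of argument-objection pairs."""
    pairs = len(re.findall(r"<argument>.*?</argument>\s*<objection>.*?</objection>",
                           completion, re.DOTALL))
    return math.log1p(pairs)
```

In a GRPO setup these would typically be passed as a list of reward callables, each scoring a batch of sampled completions.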
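The model and trainer settings quoted in the removed card could be assembled roughly as below with unsloth and trl. This is a sketch under the assumption that those libraries are used as the card implies; only the numeric values come from the card, and the snippet is not the author's actual script.

```python
# Sketch only: assumes unsloth and trl are installed; argument names follow
# those libraries' public APIs, not the author's script.
import torch
from unsloth import FastLanguageModel
from trl import GRPOConfig

lora_rank = 16  # value quoted in the card

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit",
    load_in_4bit=True,            # 4-bit quantization for a small memory footprint
    fast_inference=True,          # vLLM-backed generation
    gpu_memory_utilization=0.5,   # partial GPU memory, adjustable
)

model = FastLanguageModel.get_peft_model(
    model,
    r=lora_rank,
    lora_alpha=lora_rank,         # alpha set equal to the rank, per the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

training_args = GRPOConfig(
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    bf16=torch.cuda.is_bf16_supported(),
    fp16=not torch.cuda.is_bf16_supported(),
    max_steps=2500,
    warmup_ratio=0.1,
    optim="adamw_8bit",
    max_completion_length=2000,
    use_vllm=True,
)
```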
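The dataset mapping the removed card describes (a `questions.json` file turned into system/user chat prompts) can be sketched like this. The JSON field name `question` and the system-prompt wording are assumptions; only the file name and the overall mapping come from the card.

```python
import json

# Wording is an assumption; the actual system prompt from the training
# script is not shown in the card.
SYSTEM_PROMPT = (
    "Respond in the following format:\n"
    "<reasoning>\n"
    "<argument>...</argument>\n"
    "<objection>...</objection>\n"
    "</reasoning>\n"
    "<final_argument>\n...\n</final_argument>"
)

def load_dataset(path: str = "questions.json") -> list[dict]:
    """Map each question to a system/user chat prompt, as the card describes.

    Assumes each JSON record carries a "question" field; the real schema
    may differ.
    """
    with open(path) as f:
        records = json.load(f)
    return [
        {
            "prompt": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": rec["question"]},
            ]
        }
        for rec in records
    ]
```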