---
base_model: unsloth/Qwen3-4B-Instruct-2507
datasets:
- u-10bei/dpo-dataset-qwen-cot
- u-10bei/structured_data_with_cot_dataset_512_v5
- daichira/structured-hard-sft-4k
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- dapo
- rlvr
- dora
- unsloth
- qwen
- structured-output
- verifiable-rewards
---

# Qwen3-4B-DAPO-DoRA-StructEval-v1

This model implements **DAPO (Direct Alignment from Preference Optimization)**, an RLVR (Reinforcement Learning from Verifiable Rewards) approach, combined with **DoRA (Weight-Decomposed Low-Rank Adaptation)**.

## 🎯 Key Innovation: DAPO + DoRA

### What is DAPO?
DAPO extends DPO by incorporating **verifiable rewards** during training:
- Traditional DPO: learns from preference pairs (chosen vs. rejected)
- **DAPO**: adds automated verification of structured outputs (JSON/XML/YAML validity)
- **Verification weight**: 30% of the loss signal comes from format validation (sketched below)
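
How the two terms combine is easiest to see in code. A minimal sketch (hypothetical function and argument names, not this model's actual training code), assuming a standard DPO log-ratio loss and a per-sample verification reward in {0.0, 0.3, 1.0}:

```python
import torch.nn.functional as F

def dapo_loss(policy_chosen_logps, policy_rejected_logps,
              ref_chosen_logps, ref_rejected_logps,
              verification_reward, beta=0.15, alpha=0.3):
    # Standard DPO: margin of policy-vs-reference log-ratios,
    # scaled by beta (preference strength)
    margin = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    dpo = -F.logsigmoid(margin).mean()
    # Verification penalty: 0 when outputs parse, up to 1 when they do not
    penalty = (1.0 - verification_reward).mean()
    # alpha = 0.3 puts 30% of the signal on format validation
    return (1 - alpha) * dpo + alpha * penalty
```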

### Why DoRA for DAPO?
DoRA's weight decomposition (magnitude + direction, sketched below) is ideal for DAPO because:
- Stable learning with stronger reward signals
- Better convergence with the verification-augmented loss
- A lower rank (r=32) achieves higher quality than standard LoRA
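
The decomposition itself takes only a few lines to reproduce (a toy example for intuition, not the adapter implementation):

```python
import torch

W = torch.randn(256, 256)                     # toy pretrained weight
magnitude = W.norm(p=2, dim=0, keepdim=True)  # per-column magnitude m
direction = W / magnitude                     # unit-norm direction V / ||V||
assert torch.allclose(W, magnitude * direction, atol=1e-5)
# DoRA trains the magnitude directly and applies the low-rank update to the
# direction, so reward-driven updates can rescale without rotating weights.
```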

## 📊 Training Pipeline

### Stage 1: SFT + DoRA
- **Data**: 70% v5 (high-quality) + 30% Hard-Mix (complex reasoning); see the mixing sketch below
- **Method**: DoRA (r=32, alpha=64)
- **Focus**: Learn structured output generation with CoT masking
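
One way to realize that 70/30 mix with 🤗 `datasets` (a sketch assuming both datasets expose a `train` split; the exact mixing code is not published):

```python
from datasets import load_dataset, interleave_datasets

v5 = load_dataset("u-10bei/structured_data_with_cot_dataset_512_v5", split="train")
hard = load_dataset("daichira/structured-hard-sft-4k", split="train")

# Sample ~70% high-quality v5 examples and ~30% hard reasoning examples
sft_mix = interleave_datasets([v5, hard], probabilities=[0.7, 0.3], seed=42)
```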

### Stage 2: DAPO + DoRA (This Model)
- **Data**: DPO preference dataset with CoT reasoning
- **Method**: DAPO with a 30% verification reward
- **Focus**: Align outputs to preferred structures + validate syntax

## 🔧 Training Configuration

**DAPO Settings:**
- Learning rate: 2e-05 (optimized for DoRA stability)
- Beta: 0.15 (preference strength)
- Verification weight: 0.3 (30% validation reward)
- Max sequence length: 1536

**DoRA Settings:**
- Rank: 32 (optimal for DoRA)
- Alpha: 64 (r × 2 ratio)
- Dropout: 0 (DoRA recommendation)
- Target modules: all attention + MLP layers

**Optimization:**
- Epochs: 1
- Batch size: 2 × 4 accumulation = 8 effective
- Weight decay: 0.005 (light, for DoRA)
- Warmup ratio: 0.15 (DoRA stability)
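
These adapter settings map onto a PEFT configuration roughly as follows (a sketch assuming `peft>=0.9`, where `use_dora=True` is available; module names follow Qwen3's attention/MLP projections):

```python
from peft import LoraConfig

dora_config = LoraConfig(
    r=32,                  # rank
    lora_alpha=64,         # alpha = r × 2
    lora_dropout=0.0,      # DoRA recommendation
    use_dora=True,         # weight-decomposed adaptation
    target_modules=[       # all attention + MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```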

## 🚀 Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Shion1124/dapo-dora-qwen-struct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = "Convert this to JSON: Name: Alice, Age: 30, City: Tokyo"
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,   # keep input_ids and attention_mask together
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## 📈 Expected Performance

Compared to base DPO:
- **Format accuracy**: +5-10% (from verification rewards)
- **Reasoning quality**: +3-7% (from DoRA stability)
- **Overall score**: 0.85-0.92 on StructEval-T

## 📚 Training Data

1. **SFT Stage**:
   - u-10bei/structured_data_with_cot_dataset_512_v5
   - daichira/structured-hard-sft-4k

2. **DAPO Stage**:
   - u-10bei/dpo-dataset-qwen-cot (preference pairs)

## ⚖️ License

- **Model**: Apache 2.0
- **Datasets**: MIT License (see the original dataset cards)
- Users must comply with the base model and dataset terms

## 🔬 Technical Details

**Verifiable Rewards** (a working sketch follows the list):
- JSON validation: `json.loads()` success = 1.0 reward
- XML validation: `ElementTree.fromstring()` success = 1.0 reward
- YAML validation: `yaml.safe_load()` success = 1.0 reward
- Partial credit: 0.3 for an attempted format with errors
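
A verifier with exactly this shape can be written from the rules above (a sketch; the actual reward code for this model is not published):

```python
import json
import xml.etree.ElementTree as ET
import yaml  # pip install pyyaml

PARSERS = {"json": json.loads, "xml": ET.fromstring, "yaml": yaml.safe_load}

def verification_reward(text: str, fmt: str) -> float:
    """1.0 if text parses in the requested format, 0.3 for a failed
    attempt, 0.0 otherwise."""
    parser = PARSERS.get(fmt)
    if parser is None:
        return 0.0
    try:
        parser(text)
        return 1.0
    except Exception:
        # "Attempted format with errors": any non-empty output that was
        # supposed to be structured earns partial credit in this sketch.
        return 0.3 if text.strip() else 0.0
```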

**Loss Function**:
```
DAPO_loss = (1 - α) × DPO_loss + α × Verification_penalty
where α = 0.3 (verification weight)
```

---

Built with ❤️ using Unsloth + DAPO + DoRA