Freakz3z committed on
Commit
826fc97
·
verified ·
1 Parent(s): 75a7e15

Update README.md

  1. README.md +55 -5
README.md CHANGED
@@ -1,6 +1,56 @@
  ---
- license: mit
- base_model:
- - Qwen/Qwen3-4B-Instruct-2507
- pipeline_tag: translation
- ---
  ---
+ base_model: Qwen/Qwen3-4B-Instruct-2507
+ library_name: transformers
+ model_name: qwen-json
+ tags:
+ - unsloth
+ - trl
+ - grpo
+ - reinforcement-learning
+ - json
+ - recipe
+ license: apache-2.0
+ language:
+ - en
+ ---
+
+ # RL-Struct: Bridging the Structure Gap
+
+ [Chinese version](./README_CN.md)
+
+ We introduce **RL-Struct**, a lightweight reinforcement learning framework designed to close the "Structure Gap": the tension between probabilistic token generation and deterministic structured formats (e.g., JSON). By combining **GRPO (Group Relative Policy Optimization)** with a **multi-dimensional reward function**, the model improves structural reliability without the added inference latency of constrained decoding.
+
+ ## 🚀 Key Features
+
+ - **Multi-dimensional Reward Function**: Decomposes the training objective into five components: Structure, Format, Validity, Correctness, and Length.
+ - **Efficient Training**: Uses GRPO to eliminate the critic network, reducing VRAM usage by ~40% compared to PPO.
+ - **Emergent Curriculum**: Without an explicit curriculum, the model learns syntax (how to speak) before semantics (what to say).
+ - **High Performance**: Achieves **89.7% Structural Accuracy** and **92.1% JSON Validity** on complex recipe generation, outperforming LLaMA-3-8B (SFT) and GPT-3.5 (zero-shot) on both metrics.
+
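The multi-dimensional reward listed above could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: only the five component names come from the README, while the scoring rules, the weights, the `max_chars` budget, and the function names `reward_components` and `total_reward` are assumptions made for the example.

```python
import json

# Hypothetical weights; the README does not state the actual values.
WEIGHTS = {"structure": 1.0, "format": 1.0, "validity": 1.0,
           "correctness": 1.0, "length": 0.5}


def reward_components(completion: str, reference_name: str, max_chars: int = 2000) -> dict:
    """Score one completion on five illustrative components, each 0.0 or 1.0."""
    scores = dict.fromkeys(WEIGHTS, 0.0)
    text = completion.strip()

    # Format: the output is a bare JSON object with no prose or markdown around it.
    scores["format"] = float(text.startswith("{") and text.endswith("}"))

    # Length: full credit within the character budget, none beyond it.
    scores["length"] = float(len(text) <= max_chars)

    # Validity: the whole output parses as JSON.
    try:
        outer = json.loads(text)
    except json.JSONDecodeError:
        return scores
    scores["validity"] = 1.0

    # Structure: the required top-level keys are present.
    if not (isinstance(outer, dict) and outer.keys() >= {"reasoning", "answer"}):
        return scores
    scores["structure"] = 1.0

    # Correctness: the nested answer parses and names the expected recipe.
    try:
        answer = json.loads(outer["answer"])
    except (TypeError, json.JSONDecodeError):
        return scores
    if isinstance(answer, dict) and answer.get("name") == reference_name:
        scores["correctness"] = 1.0
    return scores


def total_reward(completion: str, reference_name: str) -> float:
    """Weighted sum of the component scores."""
    scores = reward_components(completion, reference_name)
    return sum(WEIGHTS[name] * value for name, value in scores.items())
```

Keeping the components separate means a completion that is valid JSON but has the wrong keys still earns partial reward, which is one plausible way a syntax-before-semantics learning order could arise.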
+ ## 📊 Model Details
+
+ - **Base Model:** [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)
+ - **Training Method:** GRPO (Reinforcement Learning) + LoRA
+ - **Task:** Structured Output Generation (JSON Recipes, GSM8K-JSON, ToolUse)
+ - **License:** Apache-2.0
+
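GRPO can drop the critic because it replaces a learned value baseline with statistics from a group of completions sampled for the same prompt. The sketch below shows that advantage computation under the common mean and standard-deviation normalization. The function name and the `eps` constant are illustrative assumptions, and whether the population or sample deviation is used is an implementation detail; in practice a library such as TRL performs this step inside its GRPO trainer.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for a single prompt."""
    baseline = mean(rewards)   # the group mean stands in for a critic's value estimate
    spread = pstdev(rewards)   # population standard deviation of the group (assumed)
    return [(r - baseline) / (spread + eps) for r in rewards]
```

Completions that score above the group mean receive positive advantages and those below receive negative ones; a group with identical rewards yields all-zero advantages and therefore no gradient signal.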
+ ## 🛠️ Usage
+
+ Use the following system prompt:
+
+ ```text
+ You are a precise recipe assistant. Always respond in the following JSON format:
+ {
+ "reasoning": "Your step-by-step reasoning here...",
+ "answer": "{\"name\": \"Recipe Name\", \"nutrition\": \"Calories: ..., Protein: ..., Fat: ...\"}"
+ }
+ Do not include any other text, explanations, or markdown. Only output valid JSON.
+ ```
+
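Because the `answer` field in this format is itself a JSON-encoded string, a response has to be decoded twice. The following is a minimal sketch of that parsing step using only the standard library; the helper name `parse_response` and the sample output are hypothetical and written by hand for illustration.

```python
import json


def parse_response(raw: str) -> dict:
    """Decode a response that follows the system prompt's two-level JSON format."""
    outer = json.loads(raw)                # outer object with "reasoning" and "answer"
    recipe = json.loads(outer["answer"])   # "answer" holds a JSON-encoded string
    return {"reasoning": outer["reasoning"], "recipe": recipe}


# Hypothetical model output, constructed by hand rather than generated by the model.
example = json.dumps({
    "reasoning": "Oats are simmered in milk until thick.",
    "answer": json.dumps({"name": "Porridge",
                          "nutrition": "Calories: 150, Protein: 5g, Fat: 3g"}),
})

parsed = parse_response(example)
print(parsed["recipe"]["name"])
```

Malformed output raises `json.JSONDecodeError` (or `KeyError` if a field is missing), so callers may want to wrap the call in a `try` block.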
+ ## 📈 Performance
+
+ | Method | Structural Acc. | JSON Validity | Content Acc. |
+ | :--- | :---: | :---: | :---: |
+ | GPT-3.5 (Zero-shot) | 45.5% | 82.1% | 88.0% |
+ | LLaMA-3-8B (SFT) | 78.2% | 85.4% | 86.0% |
+ | **RL-Struct (Ours)** | **89.7%** | **92.1%** | **84.5%** |