| | --- |
| | base_model: Qwen/Qwen3-4B-Instruct-2507 |
| | language: |
| | - en |
| | library_name: transformers |
| | license: apache-2.0 |
| | model_name: qwen-json |
| | pipeline_tag: text-generation |
| | tags: |
| | - unsloth |
| | - trl |
| | - grpo |
| | - reinforcement-learning |
| | - json |
| | - recipe |
| | --- |
| | |
| | # RL-Struct: Bridging the Structure Gap |
| |
|
| | [δΈζηζ¬](./README_CN.md) | [π Paper](https://huggingface.co/papers/2512.00319) |
| |
|
| | We introduce **RL-Struct**, a lightweight Reinforcement Learning framework designed to solve the "Structure Gap"βthe tension between probabilistic token generation and deterministic structured formats (e.g., JSON). By leveraging **GRPO (Gradient Regularized Policy Optimization)** and a **Multi-dimensional Reward Function**, our model achieves superior structural reliability without the high inference latency of constrained decoding. |
| |
|
| | ## π Key Features |
| |
|
| | - **Multi-dimensional Reward Function**: Decomposes the objective into Structure, Format, Validity, Correctness, and Length. |
| | - **Efficient Training**: Uses GRPO to eliminate the critic network, reducing VRAM usage by ~40% compared to PPO. |
| | - **Emergent Curriculum**: The model spontaneously learns syntax (how to speak) before semantics (what to say). |
| | - **High Performance**: Achieves **89.7% Structural Accuracy** and **92.1% JSON Validity** on complex recipe generation, outperforming LLaMA-3-8B and GPT-3.5. |
| |
|
| | ## π Model Details |
| |
|
| | - **Base Model:** [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) |
| | - **Training Method:** GRPO (Reinforcement Learning) + LoRA |
| | - **Task:** Structured Output Generation (JSON Recipes, GSM8K-JSON, ToolUse) |
| | - **License:** Apache-2.0 |
| |
|
| | ## π οΈ Usage |
| |
|
| | The following is the system prompt: |
| |
|
| | ```text |
| | You are a precise recipe assistant. Always respond in the following JSON format: |
| | { |
| | "reasoning": "Your step-by-step reasoning here...", |
| | "answer": "{\"name\": \"Recipe Name\", \"nutrition\": \"Calories: ..., Protein: ..., Fat: ...\"}" |
| | } |
| | Do not include any other text, explanations, or markdown. Only output valid JSON. |
| | ``` |
| |
|
| | ## π Performance |
| |
|
| | | Method | Structural Acc. | JSON Validity | Content Acc. | |
| | | :--- | :---: | :---: | :---: | |
| | | GPT-3.5 (Zero-shot) | 45.5% | 82.1% | 88.0% | |
| | | LLaMA-3-8B (SFT) | 78.2% | 85.4% | 86.0% | |
| | | **RL-Struct (Ours)** | **89.7%** | **92.1%** | **84.5%** | |