---
base_model: unsloth/Qwen3-4B-Instruct-2507
datasets:
- u-10bei/dpo-dataset-qwen-cot
- u-10bei/structured_data_with_cot_dataset_512_v5
- daichira/structured-hard-sft-4k
language:
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- dapo
- rlvr
- dora
- unsloth
- qwen
- structured-output
- verifiable-rewards
---

# Qwen3-4B-DAPO-DoRA-StructEval-v1

This model fine-tunes `unsloth/Qwen3-4B-Instruct-2507` for structured output generation using **DAPO (Direct Alignment from Preference Optimization)**, an RLVR (Reinforcement Learning from Verifiable Rewards) approach, combined with **DoRA (Weight-Decomposed Low-Rank Adaptation)**.

## 🎯 Key Innovation: DAPO + DoRA

### What is DAPO?

DAPO extends DPO by incorporating **verifiable rewards** during training:
- Traditional DPO: learns only from preference pairs (chosen vs. rejected)
- **DAPO**: adds automated verification of structured outputs (JSON/XML/YAML validity)
- **Verification weight**: 30% of the loss signal comes from format validation

### Why DoRA for DAPO?

DoRA's weight decomposition (magnitude + direction), sketched below, suits DAPO because:
- Learning stays stable under the stronger reward signal
- The verification-augmented loss converges more reliably
- A low rank (r=32) achieves higher quality than standard LoRA at the same rank
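
The decomposition in question (notation follows the DoRA paper; `B·A` is the usual low-rank update):

```
W' = m ⊙ (W₀ + B·A) / ||W₀ + B·A||_c
```

Here `m` is a learnable magnitude vector (initialized to the column norms of `W₀`) and `||·||_c` is the column-wise norm, so magnitude and direction are trained separately.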

## 📊 Training Pipeline

### Stage 1: SFT + DoRA
- **Data**: 70% v5 (high-quality) + 30% Hard-Mix (complex reasoning)
- **Method**: DoRA (r=32, alpha=64)
- **Focus**: learn structured output generation with CoT masking (a masking sketch follows this list)
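
A minimal sketch of what the CoT masking can look like, assuming the reasoning span's token boundaries are known (e.g., located via `<think>`/`</think>` marker tokens; the helper below is illustrative, not the exact training code):

```python
import torch

IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss


def mask_cot_labels(input_ids: torch.Tensor, cot_start: int, cot_end: int) -> torch.Tensor:
    """Copy input_ids into labels, removing the CoT span [cot_start, cot_end) from the loss."""
    labels = input_ids.clone()
    labels[cot_start:cot_end] = IGNORE_INDEX  # no gradient signal on reasoning tokens
    return labels
```

The effect is that the model still conditions on its chain of thought but is only graded on the final structured answer.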

### Stage 2: DAPO + DoRA (This Model)
- **Data**: DPO preference dataset with CoT reasoning (an example record follows this list)
- **Method**: DAPO with a 30% verification reward
- **Focus**: align outputs to preferred structures and validate their syntax
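
A hypothetical example of one such record, in the trl-style `prompt`/`chosen`/`rejected` layout (the real dataset's schema may differ):

```python
# Illustrative preference pair only; not taken from the actual dataset.
record = {
    "prompt": "Convert this to JSON: Name: Alice, Age: 30",
    "chosen": '{"name": "Alice", "age": 30}',  # parses as JSON -> preferred
    "rejected": "name = Alice, age = 30",      # not valid JSON -> rejected
}
```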

## 🔧 Training Configuration

**DAPO Settings:**
- Learning rate: 2e-05 (chosen for DoRA stability)
- Beta: 0.15 (preference strength)
- Verification weight: 0.3 (30% of the loss from validation rewards)
- Max sequence length: 1536

**DoRA Settings:**
- Rank: 32 (a good operating point for DoRA)
- Alpha: 64 (alpha = 2 × r)
- Dropout: 0 (per the DoRA recommendation)
- Target modules: all attention + MLP projections (a config sketch follows this list)
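
A minimal sketch of the corresponding adapter configuration, assuming peft's `use_dora=True` flag (peft >= 0.9) and Qwen's standard projection-module names:

```python
from peft import LoraConfig

dora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.0,
    use_dora=True,  # weight-decomposed (DoRA) updates
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention
        "gate_proj", "up_proj", "down_proj",     # MLP
    ],
)
```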

**Optimization:**
- Epochs: 1
- Batch size: 2 per device × 4 gradient-accumulation steps = 8 effective
- Weight decay: 0.005 (kept light for DoRA)
- Warmup ratio: 0.15 (for DoRA stability)
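
Taken together, the standard DPO-side hyperparameters above map onto trl's `DPOConfig` roughly as below; the verification weight is custom DAPO logic with no trl equivalent, so that constant is purely illustrative:

```python
from trl import DPOConfig

training_args = DPOConfig(
    output_dir="dapo-dora-qwen-struct",  # illustrative path
    beta=0.15,                           # preference strength
    learning_rate=2e-5,
    max_length=1536,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,       # 2 x 4 = 8 effective
    weight_decay=0.005,
    warmup_ratio=0.15,
)

VERIFICATION_WEIGHT = 0.3  # applied in the custom DAPO loss, not a trl option
```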

## 🚀 Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Shion1124/dapo-dora-qwen-struct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = "Convert this to JSON: Name: Alice, Age: 30, City: Tokyo"
# return_dict=True makes the template return a mapping that can be unpacked
# into generate(); without it, a bare tensor of input ids comes back.
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,   # required for temperature to take effect
    temperature=0.1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## 📈 Expected Performance

Compared to a plain DPO baseline:
- **Format Accuracy**: +5-10% (from verification rewards)
- **Reasoning Quality**: +3-7% (from DoRA stability)
- **Overall Score**: 0.85-0.92 on StructEval-T

## 📚 Training Data

1. **SFT Stage**:
   - u-10bei/structured_data_with_cot_dataset_512_v5
   - daichira/structured-hard-sft-4k

2. **DAPO Stage**:
   - u-10bei/dpo-dataset-qwen-cot (preference pairs)

## ⚖️ License

- **Model**: Apache 2.0
- **Datasets**: MIT License (see the original dataset cards)
- Users must comply with the base model and dataset terms

## 🔬 Technical Details

**Verifiable Rewards** (a reference sketch follows this list):
- JSON validation: `json.loads()` success = 1.0 reward
- XML validation: `ElementTree.fromstring()` success = 1.0 reward
- YAML validation: `yaml.safe_load()` success = 1.0 reward
- Partial credit: 0.3 for an attempted format that fails to parse
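
A minimal sketch of that reward, assuming the expected format is known per sample (the function name and partial-credit heuristic are illustrative):

```python
import json
import xml.etree.ElementTree as ET

import yaml  # PyYAML


def verification_reward(text: str, fmt: str) -> float:
    """Return 1.0 if `text` parses as `fmt`, 0.3 for a failed attempt, else 0.0."""
    parsers = {"json": json.loads, "xml": ET.fromstring, "yaml": yaml.safe_load}
    try:
        parsers[fmt](text)
        return 1.0  # syntactically valid output
    except Exception:
        # Crude heuristic for "attempted the format but got it wrong".
        attempted = text.lstrip().startswith(("{", "[", "<")) or ":" in text
        return 0.3 if attempted else 0.0
```

Note that `yaml.safe_load()` accepts most plain text as a scalar, so a production check would likely be stricter for YAML.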

**Loss Function**:
```
DAPO_loss = (1 - α) × DPO_loss + α × Verification_penalty
where α = 0.3 (verification weight)
```
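
For example, with α = 0.3, a sample with DPO loss 0.8 whose output fails verification (penalty 1.0) scores 0.7 × 0.8 + 0.3 × 1.0 = 0.86, while the same sample with a syntactically valid output (penalty 0.0) scores 0.56.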

---

Built with ❤️ using Unsloth + DAPO + DoRA