nguyenhuy committed
Commit 11a42f5 · verified · Parent(s): 73c5adc

Update model card

Files changed (1): README.md (+10 -7)
README.md CHANGED
@@ -109,16 +109,19 @@ The model is trained to **refuse appropriately** using diverse negative samples:
 ### Two-Stage RLVR Fine-tuning
 
 
-1. **Stage 1**: Accuracy-focused training
+1. **Stage 1**: Accuracy-focused training (V3)
    - Trained from Qwen3-1.7B base
-   - Dataset: 117K samples (positive + negative)
-   - Reward: Correctness + Format + **Strong Refusal Penalty (-1.0 for hallucination)**
-   - 100 steps, LR: 1e-6
+   - Dataset: ~40K samples (stage2.parquet)
+   - Reward: Correctness (1.0) + Format (0.1) + Efficiency (0.3) + Refusal (0.3)
+   - Config: max_steps=5000, LR=5e-7, temp=1.2
+   - **Best checkpoint: step 100** (early stopping, highest accuracy)
 
-2. **Stage 2**: Efficiency optimization
-   - Loaded from V3 checkpoint-100
+2. **Stage 2**: Efficiency optimization (V4)
+   - Loaded from Stage 1 checkpoint-100
    - Focus: Reduce verbosity, discourage `<think>` tags
-   - 3000 steps, LR: 2e-7
+   - Reward weights: Efficiency=1.0, Correctness=0.5, Format=0.1, Refusal=0.3
+   - Config: max_steps=3000, LR=2e-7
+   - **Selected checkpoint: step 1100**
    - **Result**: 36% reduction in response tokens
 
 ### Reward Function Design
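
The updated bullets describe the reward as a weighted combination of the same four component scores, re-balanced between the two stages. The sketch below is a minimal illustration of such a weighted-sum combination, assuming each component score lies in [0, 1]; the component names, function signature, and example scores are illustrative assumptions and are not taken from the repository's actual reward code.

```python
# Illustrative sketch only: the model card does not publish its reward code.
# Only the weight values come from the README diff above; everything else
# (names, structure, example scores) is assumed for demonstration.

STAGE1_WEIGHTS = {"correctness": 1.0, "format": 0.1, "efficiency": 0.3, "refusal": 0.3}
STAGE2_WEIGHTS = {"efficiency": 1.0, "correctness": 0.5, "format": 0.1, "refusal": 0.3}

def composite_reward(components: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of per-component scores, each assumed to be in [0, 1]."""
    return sum(weights[name] * components.get(name, 0.0) for name in weights)

# Example: a correct, well-formatted, but verbose response (low efficiency score)
# is rewarded more under the Stage 1 weights than under the Stage 2 weights.
scores = {"correctness": 1.0, "format": 1.0, "efficiency": 0.2, "refusal": 1.0}
print(composite_reward(scores, STAGE1_WEIGHTS))  # 1.46
print(composite_reward(scores, STAGE2_WEIGHTS))  # 1.10
```

Under these assumed weights, the same verbose-but-correct response scores noticeably lower in Stage 2, which is consistent with the stated 36% reduction in response tokens after the efficiency-focused stage.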