abcsk123 committed on
Commit f6b76b2 · verified · 1 Parent(s): 385acc6

Update README.md

## 🛠️ Core Workflow

- Data Engineering: Implemented streaming collection, three-layer quality filtering, and MinHashLSH-based fuzzy deduplication.
- Instruction Evolution: Utilized DeepSeek APIs for Evol-Instruct difficulty enhancement and diversity expansion.
- Supervised Fine-Tuning (SFT): Applied QLoRA with a custom Instruction Masking strategy (QwenDataCollator) to ensure the model only learns from assistant responses.
- Rejection Sampling (RFT): Developed a high-throughput engine using vLLM for 10-path sampling, verified through a multi-process safe execution sandbox.
- Preference Alignment (DPO): Investigated Direct Preference Optimization, identifying critical failure modes such as length bias and low-quality negative samples.
- Quantization & Deployment: Performed 4-bit activation-aware quantization (AutoAWQ) and deployed the model via a vLLM OpenAI-compatible API.

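The instruction-masking idea behind the SFT step above can be sketched in a few lines. This is a hypothetical simplification (the real QwenDataCollator would also handle tokenization, padding, and batching): prompt positions get PyTorch's cross-entropy ignore index of -100, so only assistant-response tokens contribute to the loss.

```python
# Minimal sketch of instruction masking for SFT. Token ids and the
# assistant-start index are illustrative, not the project's actual format.
IGNORE_INDEX = -100  # PyTorch cross-entropy skips positions with this label


def mask_instruction_labels(input_ids, assistant_start):
    """Return labels where all positions before `assistant_start` are ignored.

    input_ids: full sequence (prompt + assistant response) as token ids.
    assistant_start: index of the first assistant-response token.
    """
    labels = list(input_ids)
    for i in range(min(assistant_start, len(labels))):
        labels[i] = IGNORE_INDEX  # prompt tokens contribute no loss
    return labels


# Example: a 5-token prompt followed by a 3-token assistant response.
seq = [101, 7, 8, 9, 102, 42, 43, 44]
labels = mask_instruction_labels(seq, assistant_start=5)
```

Only the last three positions keep their token ids, so gradient updates come exclusively from the assistant response.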
## 📈 Experimental Results (HumanEval Pass@1)

The project tracked performance gains and losses across multiple iterations:

- Base Model: 0.628
- **SFT v3 (released): 0.671 (+6.8%)** — achieved through precise loss calculation and data cleaning.
- DPO Merged: 0.280 — highlighting the extreme sensitivity of code models to preference data quality.

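For reference, HumanEval pass@1 scores like those above are conventionally computed with the unbiased pass@k estimator from the HumanEval benchmark; a minimal sketch (the sampling counts here are illustrative, not this project's evaluation config):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n samples were drawn per problem and c of them passed."""
    if n - c < k:
        return 1.0  # fewer than k failures: any k-subset contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k=1 this reduces to the raw pass rate c/n for a problem;
# the benchmark score is the mean over all problems.
```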
## ⚠️ Status & Roadmap

This project is actively under development. Currently, the DPO alignment exhibits performance regression (Pass@1 < 0.628) due to preference data sensitivity. We are investigating advanced filtering and reward modeling to resolve this. Optimized weights will be uploaded as soon as the alignment bottleneck is cleared.
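For background on the regression: the standard per-pair DPO objective is -log σ(β · margin), where the margin compares policy and reference log-probabilities of the chosen and rejected responses. A numeric sketch follows (β=0.1 is a common default, not necessarily this project's setting). Because these log-probabilities are summed over tokens, response length can dominate the margin, which is one mechanism behind the length bias noted in the workflow.

```python
import math


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin), where
    margin = (log pi(y_c) - log pi_ref(y_c)) - (log pi(y_r) - log pi_ref(y_r)).
    Inputs are summed log-probs of the chosen/rejected responses under the
    policy and the frozen reference model (values illustrative).
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At zero margin the loss is log 2; it falls as the policy prefers the chosen response more strongly than the reference does. Low-quality negative samples shrink or invert the true margin, pushing the policy toward degenerate solutions, consistent with the 0.280 result above.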
 