## 🛠️ Core Workflow

- **Data Engineering**: Implemented streaming collection, three-layer quality filtering, and MinHashLSH-based fuzzy deduplication.
- **Instruction Evolution**: Utilized DeepSeek APIs for Evol-Instruct difficulty enhancement and diversity expansion.
- **Supervised Fine-Tuning (SFT)**: Applied QLoRA with a custom instruction-masking strategy (`QwenDataCollator`) to ensure the model only learns from assistant responses.
- **Rejection Sampling (RFT)**: Developed a high-throughput engine using vLLM for 10-path sampling, verified through a multi-process safe execution sandbox.
- **Preference Alignment (DPO)**: Investigated Direct Preference Optimization, identifying critical failure modes such as length bias and low-quality negative samples.
- **Quantization & Deployment**: Performed 4-bit activation-aware quantization (AutoAWQ) and deployed the model via a vLLM OpenAI-compatible API.
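The instruction-masking idea in the SFT step can be sketched as follows. This is a minimal illustration of the technique, not the project's actual `QwenDataCollator`; the `-100` ignore index follows the usual Hugging Face convention for positions excluded from the loss.

```python
IGNORE_INDEX = -100  # Hugging Face loss functions skip positions labeled -100

def mask_instruction_labels(input_ids, response_start):
    """Build SFT labels that supervise only the assistant response.

    input_ids:      full token-ID sequence (prompt + response)
    response_start: index where the assistant response begins
    """
    labels = list(input_ids)
    for i in range(response_start):
        labels[i] = IGNORE_INDEX  # no loss on prompt/instruction tokens
    return labels

# Example: a 6-token sequence whose assistant response starts at position 4;
# prompt positions are masked, only response tokens keep their IDs
labels = mask_instruction_labels([101, 8, 9, 10, 42, 43], response_start=4)
```

Without this masking, gradient updates would also fit the prompt text, diluting the signal from the assistant's answers.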
## 📈 Experimental Results (HumanEval Pass@1)

The project tracked performance gains and losses across multiple iterations:

- Base Model: 0.628
- **SFT v3 (released): 0.671 (+6.8%)** — achieved through precise loss calculation and data cleaning.
- DPO Merged: 0.280 — highlighting the extreme sensitivity of code models to preference data quality.
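The Pass@1 numbers above can be computed with the standard unbiased pass@k estimator used for HumanEval. This is a generic sketch; the `n` and `c` values in the example are illustrative, not the project's evaluation code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per task, c of them correct."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample draw without a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10-path sampling (as in the RFT step), 6 correct completions
# give an estimated pass@1 of 0.6 for that task
print(pass_at_k(10, 6, 1))  # → 0.6
```

The per-task estimates are then averaged over the benchmark to produce the scores reported above.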
## ⚠️ Status & Roadmap

This project is actively under development. Currently, the DPO alignment exhibits performance regression (Pass@1 < 0.628) due to preference data sensitivity. We are investigating advanced filtering and reward modeling to resolve this. Optimized weights will be uploaded as soon as the alignment bottleneck is cleared.
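For reference, the objective being debugged can be sketched numerically. This is a minimal illustration of the standard DPO loss on scalar log-probabilities, not the project's training code; `beta` and the log-probability values are made up.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of the chosen/rejected
    response under the policy or the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log sigmoid(logits)

# If policy and reference agree exactly, the loss is -log(0.5) ≈ 0.693
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
```

Because the loss operates on summed log-probabilities, longer responses contribute larger magnitudes, which is one route to the length bias noted in the workflow above.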