abcsk123 committed
Commit 6fd2ebb · verified · 1 Parent(s): ed29bdb

Update README.md

Files changed (1): README.md (+19 -20)

**Author**: Kaige Shi
**Affiliation**: Dalian University of Technology / University of Science and Technology of China
 
---
license: mit
language:
- en
tags:
- code-llm
- qwen
- sft
- dpo
---
# Code-Centric-Align: A Post-Training Pipeline for Code LLMs

This project presents a systematic study of the post-training engineering pipeline for code-specific large language models, using **Qwen2.5-Coder-7B** as the base model. It establishes a "diagnosable and iterative" framework covering the full lifecycle from data engineering to deployment.
## 🛠️ Core Workflow

* **Data Engineering**: Implemented streaming collection, three-layer quality filtering, and MinHashLSH-based fuzzy deduplication.
* **Instruction Evolution**: Utilized DeepSeek APIs for Evol-Instruct difficulty enhancement and diversity expansion.
* **Supervised Fine-Tuning (SFT)**: Applied QLoRA with a custom **Instruction Masking** strategy (QwenDataCollator) to ensure the model learns only from assistant responses.
* **Rejection Sampling (RFT)**: Developed a high-throughput engine using vLLM for 10-path sampling, verified through a multi-process safe-execution sandbox.
* **Preference Alignment (DPO)**: Investigated Direct Preference Optimization, identifying critical failure modes such as length bias and low-quality negative samples.
* **Quantization & Deployment**: Performed 4-bit activation-aware quantization (AutoAWQ) and deployed the model via a vLLM OpenAI-compatible API.
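The MinHashLSH-style fuzzy deduplication in the Data Engineering step can be sketched in pure Python. This is a minimal illustration, not the project's code; a real pipeline would typically use a library such as `datasketch`, and the 5-gram shingling, 64 permutations, and 16 bands below are illustrative choices:

```python
import hashlib

NUM_PERM = 64    # number of simulated hash permutations (illustrative)
NUM_BANDS = 16   # LSH bands; rows per band = NUM_PERM // NUM_BANDS

def shingles(text, n=5):
    """Character n-grams used as the document's feature set."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash_signature(features):
    """One salted hash per 'permutation'; keep the minimum per slot."""
    sig = []
    for seed in range(NUM_PERM):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{f}".encode()).digest()[:8], "big")
            for f in features
        ))
    return sig

def lsh_buckets(sig):
    """Band the signature; docs sharing any band bucket are candidate dups."""
    rows = NUM_PERM // NUM_BANDS
    return [hash((b, tuple(sig[b * rows:(b + 1) * rows]))) for b in range(NUM_BANDS)]

def dedup(docs):
    """Keep the first document seen for every LSH bucket collision."""
    seen, kept = set(), []
    for doc in docs:
        buckets = lsh_buckets(minhash_signature(shingles(doc)))
        if any(b in seen for b in buckets):
            continue  # near-duplicate of something already kept
        seen.update(buckets)
        kept.append(doc)
    return kept

corpus = [
    "def add(a, b): return a + b  # sums two integers",
    "def add(a, b): return a + b  # sums two integers!",  # near-duplicate
    "def matmul(x, y): raise NotImplementedError",
]
unique_docs = dedup(corpus)  # the near-duplicate pair collapses to one entry
```

Banding trades precision for recall: more bands with fewer rows each catch lower-similarity pairs at the cost of more false candidates.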
 
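The Instruction Masking idea behind the QwenDataCollator can be illustrated without any framework dependencies. The sketch below assumes a flattened chat sequence and uses -100 as the ignore index (the convention HuggingFace-style losses use); the token values and span layout are made up for illustration:

```python
IGNORE_INDEX = -100  # positions with this label contribute no loss

def mask_instruction_labels(token_ids, assistant_spans):
    """Copy input_ids into labels, then blank out everything that is not
    part of an assistant response, so loss is computed only on answers.

    assistant_spans: list of (start, end) index pairs (end exclusive)
    marking assistant-generated tokens inside token_ids.
    """
    labels = [IGNORE_INDEX] * len(token_ids)
    for start, end in assistant_spans:
        labels[start:end] = token_ids[start:end]
    return labels

# Toy sequence: [system ...][user ...][assistant ...]
tokens = [1, 15, 16, 2, 31, 32, 33, 3, 51, 52, 53, 4]
labels = mask_instruction_labels(tokens, assistant_spans=[(8, 12)])
```

Without this masking, the model also fits the system and user turns, which wastes capacity and can teach it to echo instructions.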
## 📈 Experimental Results (HumanEval Pass@1)

The project tracked performance gains and losses across multiple iterations:
* **Base Model**: 0.628
* **SFT v3 (released)**: **0.671** (+6.8% relative), achieved through precise loss computation and data cleaning.
* **DPO Merged**: < 0.628, highlighting the extreme sensitivity of code models to preference-data quality.
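For context, HumanEval Pass@k scores such as these are conventionally computed with the unbiased estimator from the Codex paper, pass@k = 1 - C(n-c, k)/C(n, k) for n samples with c correct. A direct transcription (not the project's own evaluation code):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k draws
    (without replacement) from n samples with c correct succeeds."""
    if n - c < k:
        return 1.0  # fewer failures than draws: a success is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples per task (matching the RFT 10-path setup), pass@1
# reduces to the fraction of correct samples:
score = pass_at_k(n=10, c=4, k=1)  # 0.4
```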
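One concrete mitigation for the length-bias failure mode identified in the DPO stage is to drop preference pairs whose chosen and rejected responses diverge too much in length. A minimal sketch; the 1.5x ratio threshold is an illustrative assumption, not the project's actual setting:

```python
def filter_length_balanced(pairs, max_ratio=1.5):
    """Keep only (chosen, rejected) pairs with comparable lengths, so the
    DPO objective cannot learn 'longer is better' as a shortcut."""
    kept = []
    for chosen, rejected in pairs:
        lo, hi = sorted((len(chosen), len(rejected)))
        if lo > 0 and hi / lo <= max_ratio:
            kept.append((chosen, rejected))
    return kept

pairs = [
    ("def f():\n    return 1\n", "def f():\n    return 2\n"),  # balanced
    ("def g():\n    return 1\n", "x" * 400),                   # skewed, dropped
]
balanced = filter_length_balanced(pairs)
```

Length balance alone is not sufficient; as noted above, negatives should also fail functional (unit-test) verification rather than merely run slower or longer.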
## ⚠️ Status & Roadmap

This project is under active development. The DPO alignment stage currently exhibits a performance regression (Pass@1 < 0.628) due to its sensitivity to preference-data quality. We are investigating advanced filtering and reward modeling to resolve this; optimized weights will be uploaded as soon as the alignment bottleneck is cleared.
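The rejection-sampling verification from the Core Workflow can be miniaturized as follows. The real engine samples 10 completions per prompt with vLLM and executes them in a multi-process safe-execution sandbox; this sketch replaces generation with a fixed candidate list and runs each candidate in a subprocess with a timeout (all names and the toy task are illustrative):

```python
import subprocess
import sys

def passes_tests(candidate_src, test_src, timeout=5):
    """Run candidate + unit tests in a separate interpreter process so a
    crash or hang cannot take down the sampler. A real sandbox would also
    restrict filesystem and network access."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", candidate_src + "\n" + test_src],
            capture_output=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False  # hung candidate counts as a failure
    return proc.returncode == 0

def rejection_sample(candidates, test_src):
    """Keep only candidates whose unit tests pass (the RFT accept step)."""
    return [c for c in candidates if passes_tests(c, test_src)]

tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
candidates = [
    "def add(a, b):\n    return a - b",   # wrong: fails the unit tests
    "def add(a, b):\n    return a + b",   # correct: accepted
]
accepted = rejection_sample(candidates, tests)
```

The accepted completions then become SFT-style training targets, which is what lets RFT improve the model without a reward model.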