---
license: mit
language:
- en
tags:
- code-llm
- qwen
- sft
- dpo
---

# Code-Centric-Align: A Post-Training Pipeline for Code LLMs

This project presents a systematic study of the post-training engineering pipeline for code-specific large language models, using **Qwen2.5-Coder-7B** as the base model. It establishes a "diagnosable and iterative" framework covering the full lifecycle from data engineering to deployment.

## 🛠️ Core Workflow

* **Data Engineering**: Implemented streaming collection, three-layer quality filtering, and MinHashLSH-based fuzzy deduplication.
* **Instruction Evolution**: Utilized DeepSeek APIs for Evol-Instruct difficulty enhancement and diversity expansion.
* **Supervised Fine-Tuning (SFT)**: Applied QLoRA with a custom **Instruction Masking** strategy (QwenDataCollator) to ensure the model only learns from assistant responses.
* **Rejection Sampling (RFT)**: Developed a high-throughput engine using vLLM for 10-path sampling, verified through a multi-process safe execution sandbox.
* **Preference Alignment (DPO)**: Investigated Direct Preference Optimization, identifying critical failure modes such as length bias and low-quality negative samples.
* **Quantization & Deployment**: Performed 4-bit activation-aware quantization (AutoAWQ) and deployed the model via a vLLM OpenAI-compatible API.
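The fuzzy-deduplication step above can be illustrated with a stdlib-only sketch of the underlying MinHash idea. This is not the project's code: at scale one would use a library such as `datasketch` (whose `MinHashLSH` buckets signatures to avoid pairwise comparison), and the function names here are illustrative.

```python
import hashlib

def shingles(text, n=3):
    """Character n-grams (after whitespace normalization) as comparison units."""
    t = " ".join(text.split()).lower()
    return {t[i:i + n] for i in range(max(len(t) - n + 1, 1))}

def minhash(items, num_perm=64):
    """One slot per simulated permutation: min over salted hashes of each shingle."""
    return [
        min(int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in items)
        for seed in range(num_perm)
    ]

def est_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedup(docs, threshold=0.8):
    """Greedy near-duplicate filter: keep a doc only if no already-kept doc
    is estimated to be at least `threshold`-similar to it."""
    kept, sigs = [], []
    for doc in docs:
        sig = minhash(shingles(doc))
        if all(est_jaccard(sig, s) < threshold for s in sigs):
            kept.append(doc)
            sigs.append(sig)
    return kept
```

This greedy loop is O(n²) in the number of kept documents; the LSH bucketing step is exactly what makes the real pipeline scale.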
|
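The **Instruction Masking** idea amounts to replacing every label outside an assistant response with the loss-ignore index, so cross-entropy is computed on responses only. The actual QwenDataCollator is not shown in this README; the sketch below (function name and span format are assumptions) captures only the core mechanism.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the CE loss

def mask_labels(token_ids, assistant_spans):
    """Build labels from input ids, hiding everything that is not part of an
    assistant response.

    assistant_spans: list of (start, end) token-index pairs, end exclusive.
    """
    labels = [IGNORE_INDEX] * len(token_ids)
    for start, end in assistant_spans:
        labels[start:end] = token_ids[start:end]
    return labels
```

With packing, the same masking must be applied per packed sample so that prompt tokens of one example never leak into the loss of another.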
| 23 |
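Rejection sampling keeps only the sampled completions that pass their unit tests. A simplified sketch of such a verifier, assuming a subprocess-per-candidate model (the project's sandbox adds multi-process isolation and resource limits beyond what is shown here):

```python
import os
import subprocess
import sys
import tempfile

def passes_tests(candidate_code, test_code, timeout=5):
    """Run a sampled completion plus its unit tests in a separate Python
    process; a non-zero exit code or a timeout counts as failure."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)
```

Running tests rather than merely checking that the code executes is what the README calls functional-correctness verification: a completion that runs without error but returns wrong answers is still rejected.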
|
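One concrete mitigation for the length-bias failure mode is to filter preference pairs by length ratio before DPO training, so the model cannot learn "longer (or shorter) is better" instead of "correct is better". A minimal sketch; the threshold and function name are illustrative, not the project's:

```python
def length_balanced(pairs, max_ratio=1.5):
    """Keep only (chosen, rejected) string pairs whose lengths differ by at
    most a factor of `max_ratio`."""
    kept = []
    for chosen, rejected in pairs:
        lo, hi = sorted((len(chosen), len(rejected)))
        if lo > 0 and hi / lo <= max_ratio:
            kept.append((chosen, rejected))
    return kept
```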

## 📈 Experimental Results (HumanEval Pass@1)

The project tracked performance gains and losses across multiple iterations:

* **Base Model**: 0.628
* **SFT v3 (released)**: **0.671 (+6.8% relative)** — achieved through precise loss calculation and data cleaning.
* **DPO Merged**: < 0.628 — highlighting the extreme sensitivity of code models to preference data quality.
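For reference, HumanEval Pass@1 scores like those above are conventionally computed with the unbiased pass@k estimator from Chen et al. (2021), which with n samples per problem (e.g., the 10-path sampling used here) and c correct samples is 1 − C(n−c, k)/C(n, k):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: probability that at least one of k samples drawn
    (without replacement) from n total samples, c of them correct, passes."""
    if n - c < k:
        return 1.0  # fewer wrong samples than draws: a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging `pass_at_k(n, c, 1)` over all 164 HumanEval problems yields the reported Pass@1.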

## 💡 Key Engineering Insights

* **Format Over Hyperparameters**: Correct data formatting (instruction masking and packing strategies) proved as critical as learning-rate tuning for code SFT.
* **Alignment Challenges**: Code DPO requires strict length balance and functional correctness verification (unit tests) rather than simple runtime checks.

## ⚠️ Status & Roadmap

This project is actively under development. The DPO alignment currently exhibits a performance regression (Pass@1 < 0.628) due to its sensitivity to preference data quality; we are investigating stronger preference filtering and reward modeling to resolve this. Optimized weights will be uploaded once the alignment bottleneck is cleared.

---
**Author**: Kaige Shi
**Affiliation**: Dalian University of Technology / University of Science and Technology of China