## 🛠️ Core Workflow

- **Data Engineering**: Implemented streaming collection, three-layer quality filtering, and MinHashLSH-based fuzzy deduplication.
- **Instruction Evolution**: Utilized DeepSeek APIs for Evol-Instruct difficulty enhancement and diversity expansion.
- **Supervised Fine-Tuning (SFT)**: Applied QLoRA with a custom instruction-masking strategy (`QwenDataCollator`) to ensure the model only learns from assistant responses.
- **Rejection Sampling (RFT)**: Developed a high-throughput engine using vLLM for 10-path sampling, verified through a multi-process safe execution sandbox.
- **Preference Alignment (DPO)**: Investigated Direct Preference Optimization, identifying critical failure modes such as length bias and low-quality negative samples.
- **Quantization & Deployment**: Performed 4-bit activation-aware quantization (AutoAWQ) and deployed the model via a vLLM OpenAI-compatible API.
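The instruction-masking idea in the SFT step can be sketched as follows. This is a minimal illustration of the technique, not the project's actual `QwenDataCollator`; the `-100` ignore index follows the usual Hugging Face convention for positions excluded from the loss.

```python
IGNORE_INDEX = -100  # Hugging Face loss functions skip positions labeled -100

def mask_instruction_labels(input_ids, response_start):
    """Build SFT labels that supervise only the assistant response.

    input_ids:      full token-ID sequence (prompt + response)
    response_start: index where the assistant response begins
    """
    labels = list(input_ids)
    for i in range(response_start):
        labels[i] = IGNORE_INDEX  # no loss on prompt/instruction tokens
    return labels

# Example: a 6-token sequence whose assistant response starts at position 4;
# prompt positions are masked, only response tokens keep their IDs
labels = mask_instruction_labels([101, 8, 9, 10, 42, 43], response_start=4)
```

Without this masking, gradient updates would also fit the prompt text, diluting the signal from the assistant's answers.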
## 📈 Experimental Results (HumanEval Pass@1)

The project tracked performance gains and losses across multiple iterations:

- Base Model: 0.628
- **SFT v3 (released): 0.671 (+6.8%)** — achieved through precise loss calculation and data cleaning.
- DPO Merged: 0.280 — highlighting the extreme sensitivity of code models to preference data quality.
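The Pass@1 numbers above can be computed with the standard unbiased pass@k estimator used for HumanEval. This is a generic sketch; the `n` and `c` values in the example are illustrative, not the project's evaluation code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per task, c of them correct."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample draw without a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10-path sampling (as in the RFT step), 6 correct completions
# give an estimated pass@1 of 0.6 for that task
print(pass_at_k(10, 6, 1))  # → 0.6
```

The per-task estimates are then averaged over the benchmark to produce the scores reported above.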
## ⚠️ Status & Roadmap

This project is actively under development. Currently, the DPO alignment exhibits performance regression (Pass@1 < 0.628) due to preference data sensitivity. We are investigating advanced filtering and reward modeling to resolve this. Optimized weights will be uploaded as soon as the alignment bottleneck is cleared.
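For reference, the objective being debugged can be sketched numerically. This is a minimal illustration of the standard DPO loss on scalar log-probabilities, not the project's training code; `beta` and the log-probability values are made up.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of the chosen/rejected
    response under the policy or the frozen reference model.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log sigmoid(logits)

# If policy and reference agree exactly, the loss is -log(0.5) ≈ 0.693
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
```

Because the loss operates on summed log-probabilities, longer responses contribute larger magnitudes, which is one route to the length bias noted in the workflow above.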