# LLM Pipeline
> An end-to-end automated pipeline for **discovering**, **merging**, **evaluating**, and **fine-tuning** open-source LLMs, with full MLOps integration.
---
## Architecture
```
┌──────────────────────┬──────────────────────────────┐
│ Phase 1: Discovery   │ Scan HF Hub → filter → rank  │
├──────────────────────┼──────────────────────────────┤
│ Phase 2: Merging     │ SLERP · TIES · DARE · TA     │
├──────────────────────┼──────────────────────────────┤
│ Phase 3: Evaluation  │ ROUGE · BERTScore · Judge    │
├──────────────────────┼──────────────────────────────┤
│ Phase 4: Fine-Tuning │ LoRA/QLoRA · Synthetic Data  │
├──────────────────────┼──────────────────────────────┤
│ Phase 5: MLOps       │ vLLM · W&B · MLflow · HF Hub │
└──────────────────────┴──────────────────────────────┘
          ↑__________________________|
           Iterative improvement loop
```
---
## Features
### Phase 1: Model Discovery
- Automated HF Hub crawler with category-based keyword search
- Quality filtering: downloads, likes, parameter count, model card completeness
- Optional lightweight perplexity probe for fast quality estimation
- Composite scoring and ranked shortlist output
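
This README does not spell out the composite score, so the following is only a sketch of how the listed signals might be combined into a ranked shortlist. The `Candidate` record, the weights, and the log scaling are all assumptions made for illustration, not the formula used in `discover.py`.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record holding the quality signals listed above."""
    model_id: str
    downloads: int
    likes: int
    card_completeness: float  # assumed to be normalized to [0, 1]


# Assumed weights; the real pipeline may weigh these signals differently.
WEIGHTS = {"downloads": 0.5, "likes": 0.3, "card": 0.2}


def composite_score(c: Candidate) -> float:
    """Combine log-scaled popularity counts with model card completeness."""
    return (
        WEIGHTS["downloads"] * math.log10(1 + c.downloads)
        + WEIGHTS["likes"] * math.log10(1 + c.likes)
        + WEIGHTS["card"] * c.card_completeness
    )


def shortlist(candidates: list[Candidate], top_k: int = 5) -> list[Candidate]:
    """Return the top_k candidates, highest composite score first."""
    return sorted(candidates, key=composite_score, reverse=True)[:top_k]
```

Log scaling keeps a handful of very popular models from dominating the ranking purely on download counts.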
### Phase 2: Model Composition
- **Union merging** (capability aggregation):
  - SLERP: spherical linear interpolation
  - TIES: trim, elect sign, merge
  - DARE-TIES: random delta dropout with rescaling, followed by TIES
  - Task Arithmetic: delta-weight addition
- **Intersection merging** (conservative):
  - Breadcrumbs: consensus-only parameter updates
- DOM-tree-style architecture introspection (layers, attention heads, MLP blocks)
- mergekit integration + pure-PyTorch fallback
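
To make the SLERP entry concrete, here is a minimal sketch of spherical linear interpolation between two flattened weight vectors, written with plain Python lists so it runs without PyTorch. It is not the code in `merge.py`; the function name and the fallback to linear interpolation for near-parallel vectors are assumptions, and `t` is assumed to play the role of the `--alpha` flag.

```python
import math


def slerp(a: list[float], b: list[float], t: float, eps: float = 1e-8) -> list[float]:
    """Interpolate from a (t=0) to b (t=1) along the arc between them."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a < eps or norm_b < eps:
        # A zero vector has no direction, so fall back to linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]

    # Angle between the two vectors, clamped against rounding error.
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Vectors are (anti)parallel; the arc is undefined, so interpolate linearly.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]

    weight_a = math.sin((1 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]
```

In a real merge this would be applied tensor by tensor across the two checkpoints.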
### Phase 3: Evaluation Framework
- ROUGE (rouge1, rouge2, rougeL)
- BERTScore (semantic similarity)
- Faithfulness + hallucination detection (NLI-based)
- LLM-as-Judge scoring (0–10)
- Multi-model side-by-side comparison
- **Knowledge gap detector**, which feeds Phase 4
### Phase 4: Efficient Fine-Tuning
- QLoRA (4-bit NF4 quantization + LoRA adapters)
- Response-only training (loss on assistant turns only)
- Synthetic data generation per detected gap category
- Delta adapter extraction (merge-ready weights)
- **Iterative improvement loop**: eval → gap detect → generate → train → repeat
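
Response-only training is usually implemented by masking the labels of every non-assistant token so they are skipped by the loss. The sketch below shows that idea on plain lists. The helper name and the boolean `assistant_mask` input are hypothetical; `-100` is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_non_assistant(input_ids: list[int], assistant_mask: list[bool]) -> list[int]:
    """Build labels in which only assistant-turn tokens contribute to the loss.

    assistant_mask[i] is True when token i belongs to an assistant turn.
    How that mask is derived (chat template, offsets) is left out here.
    """
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have the same length")
    return [
        token if is_assistant else IGNORE_INDEX
        for token, is_assistant in zip(input_ids, assistant_mask)
    ]
```

For example, a four-token sequence whose last two tokens are the assistant reply keeps only those two as training targets.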
### Phase 5: MLOps
- vLLM inference with PagedAttention (OpenAI-compatible API)
- Throughput benchmarking
- Dual tracking: Weights & Biases + MLflow
- Auto-generated model cards
- One-command HF Hub deployment
---
## Quick Start
### 1. Install
```bash
git clone https://github.com/YOUR_USERNAME/llm-pipeline.git
cd llm-pipeline
pip install -r requirements.txt
```
### 2. Set environment variables
```bash
export HF_TOKEN="hf_..." # Hugging Face token
export WANDB_API_KEY="..." # W&B token (optional)
```
### 3. Run the full pipeline
```bash
# Full pipeline for reasoning models
python pipeline.py run reasoning
# With iterative improvement loop
python pipeline.py run code --loop --max-iter 3
# Custom merge strategy
python pipeline.py run medical --strategy breadcrumbs --top-k 3
```
---
## Usage: Individual Phases
### Phase 1: Discovery
```bash
python -m phase1_discovery.discover run reasoning --top-k 5
python -m phase1_discovery.discover run code --perplexity # adds perplexity probe
python -m phase1_discovery.discover run --all # all categories
```
### Phase 2: Merging
```bash
# TIES merge (recommended for union)
python -m phase2_merging.merge run ties \
    --model mistralai/Mistral-7B-v0.3 \
    --model teknium/OpenHermes-2.5-Mistral-7B \
    --base mistralai/Mistral-7B-v0.3

# SLERP interpolation
python -m phase2_merging.merge run slerp \
    --model model_a --model model_b --alpha 0.6

# Breadcrumbs (conservative / intersection)
python -m phase2_merging.merge run breadcrumbs \
    --model base --model ft_a --model ft_b --density 0.7

# Inspect architecture (DOM-tree view)
python -m phase2_merging.merge run ties \
    --introspect mistralai/Mistral-7B-v0.3
```
### Phase 3: Evaluation
```bash
# Evaluate on SQuAD v2
python -m phase3_evaluation.evaluate run ./merged_model --dataset squad --n-samples 200

# Compare multiple models
python -m phase3_evaluation.evaluate run model_a \
    --compare model_b --compare model_c

# Disable LLM judge (faster)
python -m phase3_evaluation.evaluate run ./merged --no-judge
```
### Phase 4: Fine-Tuning
```bash
# Fine-tune targeting specific gaps
python -m phase4_finetuning.finetune run \
    --base mistralai/Mistral-7B-v0.3 \
    --gap factual_recall --gap numerical \
    --n-syn 100 --output ./adapters/run1

# Use existing synthetic data
python -m phase4_finetuning.finetune run \
    --base mistralai/Mistral-7B-v0.3 \
    --data-path ./artifacts/data/synthetic_data.jsonl

# Iterative loop
python -m phase4_finetuning.finetune run --loop --max-iter 3
```
### Phase 5: Inference & MLOps
```bash
# Start vLLM server (OpenAI-compatible)
python -m phase5_mlops.serve serve ./merged_model --port 8000

# Benchmark throughput
python -m phase5_mlops.serve serve ./merged_model --bench

# Track experiment
python -m phase5_mlops.serve track my-run \
    --model ./merged --strategy ties \
    --rouge1 0.42 --bertscore 0.71 --judge 7.3

# Deploy to HF Hub
python -m phase5_mlops.serve deploy ./merged_model \
    --repo your-username/my-merged-7b

# View leaderboard
python -m phase5_mlops.serve leaderboard
```
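
Once the server is running, it can be queried like any OpenAI-compatible endpoint. The sketch below uses only the standard library and assumes that `serve` exposes vLLM's usual `/v1/completions` route on `localhost:8000` and registers the model under the path passed on the command line; both are assumptions about this wrapper, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    model: str = "./merged_model",            # assumed served model name
    base_url: str = "http://localhost:8000",  # matches --port 8000 above
    max_tokens: int = 128,
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style text completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> str:
    """Send the request and return the first completion's text."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Calling `complete("Explain TIES merging in one sentence.")` requires the server from the first command to be up; `build_completion_request` alone does not touch the network.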
---
## Project Structure
```
llm-pipeline/
├── pipeline.py              # Master orchestrator
├── requirements.txt
├── configs/
│   └── settings.py          # All config: paths, scale, hyperparams
├── utils/
│   └── logger.py            # Centralized logging
├── phase1_discovery/
│   └── discover.py          # HF Hub crawler + ranking
├── phase2_merging/
│   └── merge.py             # Merging + architecture introspection
├── phase3_evaluation/
│   └── evaluate.py          # Multi-metric eval + gap detection
├── phase4_finetuning/
│   └── finetune.py          # QLoRA + synthetic data + loop
├── phase5_mlops/
│   └── serve.py             # vLLM + W&B + MLflow + HF deploy
└── artifacts/               # Auto-created at runtime
    ├── models/
    ├── merges/
    ├── adapters/
    ├── evaluations/
    └── data/
```
---
## Configuration
Edit `configs/settings.py` to customize:
```python
# Scale preset (currently: medium = 7B, single A100)
SCALE = "medium"
# Categories and keywords for discovery
HF_MODEL_CATEGORIES = {
    "code": ["starcoder", "codellama", "deepseek-coder"],
    "reasoning": ["mistral", "llama", "qwen"],
    # ...
}
# Fine-tuning defaults
FT_BASE_MODEL = "mistralai/Mistral-7B-v0.3"
FT_EPOCHS = 3
FT_LR = 2e-4
# vLLM
VLLM_GPU_MEMORY_UTIL = 0.90
VLLM_MAX_MODEL_LEN = 4096
```
---
## Supported Merge Strategies
| Strategy | Type | Best For |
|---|---|---|
| `slerp` | Union | Two-model smooth interpolation |
| `ties` | Union | Multi-model, removes conflicting deltas |
| `dare_ties` | Union | Aggressive sparsification before TIES |
| `task_arithmetic` | Union | Adding task-specific capabilities |
| `breadcrumbs` | Intersection | Conservative, safety-preserving merge |
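
As an illustration of how `ties` "removes conflicting deltas", the sketch below follows the three steps from the TIES-Merging paper (trim, elect sign, disjoint mean) on flat Python lists. It is a simplified teaching version, not mergekit's implementation or this repository's fallback; the function names and the `lam` scaling parameter are assumptions.

```python
def trim(delta: list[float], density: float) -> list[float]:
    """Keep roughly the top `density` fraction of entries by magnitude."""
    k = max(1, round(len(delta) * density))
    threshold = sorted((abs(x) for x in delta), reverse=True)[k - 1]
    # Entries tied with the threshold are all kept, so more than k may survive.
    return [x if abs(x) >= threshold else 0.0 for x in delta]


def ties_merge(
    base: list[float],
    finetuned: list[list[float]],
    density: float = 0.5,
    lam: float = 1.0,
) -> list[float]:
    """Merge fine-tuned weight vectors into base using trim / elect sign / mean."""
    # Step 1: task vectors (fine-tuned minus base), trimmed to the largest entries.
    deltas = [trim([w - b for w, b in zip(ft, base)], density) for ft in finetuned]

    merged = []
    for i, b in enumerate(base):
        column = [d[i] for d in deltas]
        # Step 2: elect the sign with the larger total magnitude.
        sign_positive = sum(column) >= 0
        # Step 3: average only the non-zero deltas that agree with that sign.
        agreeing = [x for x in column if x != 0.0 and (x > 0) == sign_positive]
        mean = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + lam * mean)
    return merged
```

In the test case below, the third parameter receives deltas of -2.0 and 1.0; the negative sign wins, so only -2.0 is merged and the conflicting 1.0 is discarded.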
---
## Evaluation Metrics
| Metric | Tool | Threshold |
|---|---|---|
| ROUGE-1/2/L | `rouge-score` | ≥ 0.30 |
| BERTScore F1 | `bert-score` | ≥ 0.50 |
| Faithfulness | `cross-encoder/nli-deberta-v3-small` | ≥ 0.50 |
| Hallucination | Heuristic + NLI | < 10% |
| Judge Score | LLM-as-Judge | ≥ 5.0/10 |
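
The thresholds in the table can be read as a simple pass/fail gate. The sketch below encodes them and reports which metrics miss their target; the metric key names and the `failing_metrics` helper are hypothetical and may not match what `evaluate.py` emits.

```python
import operator

# (comparison, limit) per metric, taken from the table above.
# Key names are assumptions; the hallucination rate is expressed as a fraction.
THRESHOLDS = {
    "rouge1": (operator.ge, 0.30),
    "rouge2": (operator.ge, 0.30),
    "rougeL": (operator.ge, 0.30),
    "bertscore_f1": (operator.ge, 0.50),
    "faithfulness": (operator.ge, 0.50),
    "hallucination_rate": (operator.lt, 0.10),
    "judge_score": (operator.ge, 5.0),
}


def failing_metrics(scores: dict[str, float]) -> list[str]:
    """Return the reported metrics that miss their threshold, in table order."""
    return [
        name
        for name, (passes, limit) in THRESHOLDS.items()
        if name in scores and not passes(scores[name], limit)
    ]
```

Metrics absent from `scores` are skipped rather than treated as failures, which suits runs that use `--no-judge`.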
---
## Iterative Improvement Loop
```
┌─ Evaluate model ──────────────────────────────────┐
│  ROUGE / BERTScore / Judge / Faithfulness         │
└───────────────────┬───────────────────────────────┘
                    │ gaps detected?
                    ▼
┌─ Detect knowledge gaps ───────────────────────────┐
│  factual_recall / numerical / code / reasoning    │
└───────────────────┬───────────────────────────────┘
                    │
                    ▼
┌─ Generate synthetic data ─────────────────────────┐
│  LLM generates targeted QA pairs per gap          │
└───────────────────┬───────────────────────────────┘
                    │
                    ▼
┌─ QLoRA fine-tune ─────────────────────────────────┐
│  Response-only loss, 4-bit NF4, paged_adamw       │
└───────────────────┬───────────────────────────────┘
                    │
                    └──────────────► repeat until target ROUGE or max_iter
```
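
The control flow of the loop can be sketched as a small driver that takes each stage as a callable. This is an outline under stated assumptions, not the orchestration code in `pipeline.py`: the function name, the `rouge1` key used as the stopping metric, and the default target of 0.30 are all hypothetical.

```python
from typing import Any, Callable


def improvement_loop(
    model: Any,
    evaluate: Callable[[Any], dict],
    detect_gaps: Callable[[dict], list],
    generate_data: Callable[[list], list],
    finetune: Callable[[Any, list], Any],
    target_rouge: float = 0.30,  # assumed stopping target
    max_iter: int = 3,
) -> tuple[Any, list[dict]]:
    """Run evaluate -> detect gaps -> generate -> fine-tune until a stop condition."""
    history = []
    for _ in range(max_iter):
        scores = evaluate(model)
        history.append(scores)
        if scores["rouge1"] >= target_rouge:
            break  # target reached
        gaps = detect_gaps(scores)
        if not gaps:
            break  # nothing left to target
        model = finetune(model, generate_data(gaps))
    return model, history
```

Passing the stages in as callables keeps the loop testable with stubs and leaves the heavy work (evaluation, generation, QLoRA training) in the phase modules.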
---
## Hardware Requirements
| Scale | GPU | RAM | Notes |
|---|---|---|---|
| Small (1–3B) | Any CUDA GPU | 16GB | CPU possible but slow |
| **Medium (7B)** | **A100 / H100 40GB** | **32GB** | **Recommended** |
| Large (13B+) | 2× A100 80GB | 64GB | Set `tensor_parallel=2` |
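
A rough way to sanity-check these tiers is to estimate the memory taken by the weights alone. The helper below is a back-of-the-envelope sketch with a hypothetical name; it ignores the KV cache, activations, optimizer state, and quantization overhead, so real usage is higher.

```python
# Bytes per parameter for common storage formats.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "nf4": 0.5}


def weight_memory_gb(n_params_billion: float, dtype: str = "fp16") -> float:
    """Approximate weight-only memory in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * BYTES_PER_PARAM[dtype]
```

A 7B model needs about 14 GB for fp16 weights and about 3.5 GB in 4-bit NF4, which is why QLoRA makes fine-tuning at this scale practical on a single GPU.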
---
## Key Dependencies
- [`transformers`](https://github.com/huggingface/transformers): model loading
- [`peft`](https://github.com/huggingface/peft): LoRA/QLoRA adapters
- [`trl`](https://github.com/huggingface/trl): SFTTrainer
- [`mergekit`](https://github.com/arcee-ai/mergekit): TIES, DARE, SLERP
- [`vllm`](https://github.com/vllm-project/vllm): high-throughput inference
- [`bert-score`](https://github.com/Tiiiger/bert_score): semantic evaluation
- [`wandb`](https://wandb.ai) + [`mlflow`](https://mlflow.org): experiment tracking
---
## License
[Apache 2.0](LICENSE)
---
## Acknowledgements
- [mergekit](https://github.com/arcee-ai/mergekit) by Arcee AI
- [TIES-Merging](https://arxiv.org/abs/2306.01708), Yadav et al., 2023
- [DARE](https://arxiv.org/abs/2311.03099), Yu et al., 2023
- [Task Arithmetic](https://arxiv.org/abs/2212.04089), Ilharco et al., 2023
- [QLoRA](https://arxiv.org/abs/2305.14314), Dettmers et al., 2023