# πŸ€– LLM Pipeline
> An end-to-end automated pipeline for **discovering**, **merging**, **evaluating**, and **fine-tuning** open-source LLMs β€” with full MLOps integration.
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
[![HF Hub](https://img.shields.io/badge/πŸ€—-Hugging%20Face-yellow)](https://huggingface.co)
[![W&B](https://img.shields.io/badge/Weights%20%26%20Biases-tracking-orange)](https://wandb.ai)
---
## πŸ—ΊοΈ Architecture
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Phase 1 β€” Discovery β”‚ Scan HF Hub β†’ filter β†’ rank β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Phase 2 β€” Merging β”‚ SLERP Β· TIES Β· DARE Β· TA β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Phase 3 β€” Evaluation β”‚ ROUGE Β· BERTScore Β· Judge β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Phase 4 β€” Fine-Tuning β”‚ LoRA/QLoRA Β· Synthetic Data β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Phase 5 β€” MLOps β”‚ vLLM Β· W&B Β· MLflow Β· HF Hub β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
↑__________________________|
Iterative improvement loop
```
---
## ✨ Features
### Phase 1 β€” Model Discovery
- Automated HF Hub crawler with category-based keyword search
- Quality filtering: downloads, likes, parameter count, model card completeness
- Optional lightweight perplexity probe for fast quality estimation
- Composite scoring and ranked shortlist output
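The discovery stage's ranking can be pictured as a weighted combination of Hub signals. The sketch below is purely illustrative: the weight values, field names, and the `composite_score` helper are hypothetical, not the actual logic in `phase1_discovery/discover.py`.

```python
# Hypothetical sketch of a Phase 1 composite score; the real weights and
# fields live in phase1_discovery/discover.py. Numbers here are illustrative.
import math

def composite_score(downloads: int, likes: int, has_model_card: bool,
                    param_count_b: float, target_b: float = 7.0) -> float:
    """Rank a candidate model on a roughly 0-1 scale from public Hub signals."""
    popularity = math.log10(1 + downloads) / 8.0   # ~1.0 near 100M downloads
    community = math.log10(1 + likes) / 4.0        # ~1.0 near 10k likes
    card = 1.0 if has_model_card else 0.0
    # Penalize distance from the target parameter budget (e.g. 7B).
    size_fit = 1.0 / (1.0 + abs(param_count_b - target_b))
    return 0.4 * popularity + 0.2 * community + 0.2 * card + 0.2 * size_fit

candidates = [
    ("model-a", composite_score(5_000_000, 1_200, True, 7.0)),
    ("model-b", composite_score(20_000, 15, False, 3.0)),
]
ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
```

Log-scaling downloads and likes keeps a single viral model from dominating the shortlist.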
### Phase 2 β€” Model Composition
- **Union merging** (capability aggregation):
- SLERP β€” spherical linear interpolation
- TIES β€” trim, elect sign, merge
- DARE-TIES β€” dropout + TIES
- Task Arithmetic β€” delta-weight addition
- **Intersection merging** (conservative):
- Breadcrumbs β€” consensus-only parameter updates
- DOM-tree-style architecture introspection (layers, attention heads, MLP blocks)
- mergekit integration + pure-PyTorch fallback
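To make the union strategies concrete, here is a minimal NumPy sketch of SLERP on flattened weight vectors. The pipeline's actual fallback operates per-tensor on PyTorch state dicts; this is a simplified stand-in, not the shipped implementation.

```python
# Minimal SLERP sketch on flattened weight vectors (illustrative only).
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float, eps: float = 1e-8) -> np.ndarray:
    """Spherical linear interpolation between weight vectors a and b."""
    a_n = a / (np.linalg.norm(a) + eps)
    b_n = b / (np.linalg.norm(b) + eps)
    dot = np.clip(np.dot(a_n, b_n), -1.0, 1.0)
    omega = np.arccos(dot)                # angle between the two directions
    if omega < eps:                       # nearly parallel: plain lerp is fine
        return (1.0 - t) * a + t * b
    so = np.sin(omega)
    return (np.sin((1.0 - t) * omega) / so) * a + (np.sin(t * omega) / so) * b

w = slerp(np.array([1.0, 0.0]), np.array([0.0, 1.0]), 0.5)
```

Unlike plain linear interpolation, SLERP follows the arc between the two weight directions, which tends to preserve parameter norms better when the endpoints differ substantially.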
### Phase 3 β€” Evaluation Framework
- ROUGE (rouge1, rouge2, rougeL)
- BERTScore (semantic similarity)
- Faithfulness + hallucination detection (NLI-based)
- LLM-as-Judge scoring (0–10)
- Multi-model side-by-side comparison
- **Knowledge gap detector** β†’ feeds Phase 4
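As a reference point for the first metric above, ROUGE-1 F1 reduces to clipped unigram overlap. The pipeline itself uses the `rouge-score` package (which also handles stemming and ROUGE-2/L); this is only a didactic sketch.

```python
# Illustrative ROUGE-1 F1 via clipped unigram overlap (the pipeline uses
# the `rouge-score` package in practice).
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())   # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
```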
### Phase 4 β€” Efficient Fine-Tuning
- QLoRA (4-bit NF4 quantization + LoRA adapters)
- Response-only training (loss on assistant turns only)
- Synthetic data generation per detected gap category
- Delta adapter extraction (merge-ready weights)
- **Iterative improvement loop**: eval β†’ gap detect β†’ generate β†’ train β†’ repeat
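Response-only training boils down to label masking: prompt positions get the ignore index `-100` so only assistant tokens contribute to the loss. The toy helper below shows the idea on fake token IDs; in practice TRL's completion-only collator does the equivalent on real tokenized chat templates.

```python
# Sketch of response-only loss masking: prompt positions are set to -100
# (the ignore index for cross-entropy), so only assistant tokens are trained.
IGNORE_INDEX = -100

def mask_prompt_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Copy input_ids into labels, masking the first prompt_len positions."""
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]

labels = mask_prompt_labels([101, 102, 103, 104, 105], prompt_len=3)
```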
### Phase 5 β€” MLOps
- vLLM inference with PagedAttention (OpenAI-compatible API)
- Throughput benchmarking
- Dual tracking: Weights & Biases + MLflow
- Auto-generated model cards
- One-command HF Hub deployment
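Because the vLLM server speaks the OpenAI chat API, any OpenAI-compatible client can query it. The stdlib-only sketch below builds a request against `http://localhost:8000/v1/chat/completions`; the model name `"merged_model"` is a placeholder and must match whatever the server was launched with.

```python
# Sketch of querying the vLLM server through its OpenAI-compatible route.
# "merged_model" is a placeholder model name.
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

payload = build_chat_request("merged_model", "Explain TIES merging in one sentence.")
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = urllib.request.urlopen(req)  # uncomment with the server running
```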
---
## πŸš€ Quick Start
### 1. Install
```bash
git clone https://github.com/YOUR_USERNAME/llm-pipeline.git
cd llm-pipeline
pip install -r requirements.txt
```
### 2. Set environment variables
```bash
export HF_TOKEN="hf_..." # Hugging Face token
export WANDB_API_KEY="..." # W&B token (optional)
```
### 3. Run the full pipeline
```bash
# Full pipeline for reasoning models
python pipeline.py run reasoning
# With iterative improvement loop
python pipeline.py run code --loop --max-iter 3
# Custom merge strategy
python pipeline.py run medical --strategy breadcrumbs --top-k 3
```
---
## πŸ“– Usage β€” Individual Phases
### Phase 1: Discovery
```bash
python -m phase1_discovery.discover run reasoning --top-k 5
python -m phase1_discovery.discover run code --perplexity # adds perplexity probe
python -m phase1_discovery.discover run --all # all categories
```
### Phase 2: Merging
```bash
# TIES merge (recommended for union)
python -m phase2_merging.merge run ties \
--model mistralai/Mistral-7B-v0.3 \
--model teknium/OpenHermes-2.5-Mistral-7B \
--base mistralai/Mistral-7B-v0.3
# SLERP interpolation
python -m phase2_merging.merge run slerp \
--model model_a --model model_b --alpha 0.6
# Breadcrumbs (conservative / intersection)
python -m phase2_merging.merge run breadcrumbs \
--model base --model ft_a --model ft_b --density 0.7
# Inspect architecture (DOM-tree view)
python -m phase2_merging.merge run ties \
--introspect mistralai/Mistral-7B-v0.3
```
### Phase 3: Evaluation
```bash
# Evaluate on SQuAD v2
python -m phase3_evaluation.evaluate run ./merged_model --dataset squad --n-samples 200
# Compare multiple models
python -m phase3_evaluation.evaluate run model_a \
--compare model_b --compare model_c
# Disable LLM judge (faster)
python -m phase3_evaluation.evaluate run ./merged --no-judge
```
### Phase 4: Fine-Tuning
```bash
# Fine-tune targeting specific gaps
python -m phase4_finetuning.finetune run \
--base mistralai/Mistral-7B-v0.3 \
--gap factual_recall --gap numerical \
--n-syn 100 --output ./adapters/run1
# Use existing synthetic data
python -m phase4_finetuning.finetune run \
--base mistralai/Mistral-7B-v0.3 \
--data-path ./artifacts/data/synthetic_data.jsonl
# Iterative loop
python -m phase4_finetuning.finetune run --loop --max-iter 3
```
### Phase 5: Inference & MLOps
```bash
# Start vLLM server (OpenAI-compatible)
python -m phase5_mlops.serve serve ./merged_model --port 8000
# Benchmark throughput
python -m phase5_mlops.serve serve ./merged_model --bench
# Track experiment
python -m phase5_mlops.serve track my-run \
--model ./merged --strategy ties \
--rouge1 0.42 --bertscore 0.71 --judge 7.3
# Deploy to HF Hub
python -m phase5_mlops.serve deploy ./merged_model \
--repo your-username/my-merged-7b
# View leaderboard
python -m phase5_mlops.serve leaderboard
```
---
## πŸ“ Project Structure
```
llm-pipeline/
β”œβ”€β”€ pipeline.py # Master orchestrator
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ configs/
β”‚ └── settings.py # All config: paths, scale, hyperparams
β”œβ”€β”€ utils/
β”‚ └── logger.py # Centralized logging
β”œβ”€β”€ phase1_discovery/
β”‚ └── discover.py # HF Hub crawler + ranking
β”œβ”€β”€ phase2_merging/
β”‚ └── merge.py # Merging + architecture introspection
β”œβ”€β”€ phase3_evaluation/
β”‚ └── evaluate.py # Multi-metric eval + gap detection
β”œβ”€β”€ phase4_finetuning/
β”‚ └── finetune.py # QLoRA + synthetic data + loop
β”œβ”€β”€ phase5_mlops/
β”‚ └── serve.py # vLLM + W&B + MLflow + HF deploy
└── artifacts/ # Auto-created at runtime
β”œβ”€β”€ models/
β”œβ”€β”€ merges/
β”œβ”€β”€ adapters/
β”œβ”€β”€ evaluations/
└── data/
```
---
## βš™οΈ Configuration
Edit `configs/settings.py` to customize:
```python
# Scale preset (currently: medium = 7B, single A100)
SCALE = "medium"
# Categories and keywords for discovery
HF_MODEL_CATEGORIES = {
"code": ["starcoder", "codellama", "deepseek-coder"],
"reasoning": ["mistral", "llama", "qwen"],
...
}
# Fine-tuning defaults
FT_BASE_MODEL = "mistralai/Mistral-7B-v0.3"
FT_EPOCHS = 3
FT_LR = 2e-4
# vLLM
VLLM_GPU_MEMORY_UTIL = 0.90
VLLM_MAX_MODEL_LEN = 4096
```
---
## 🧩 Supported Merge Strategies
| Strategy | Type | Best For |
|---|---|---|
| `slerp` | Union | Two-model smooth interpolation |
| `ties` | Union | Multi-model, removes conflicting deltas |
| `dare_ties` | Union | Aggressive sparsification before TIES |
| `task_arithmetic` | Union | Adding task-specific capabilities |
| `breadcrumbs` | Intersection | Conservative, safety-preserving merge |
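The three TIES steps in the table (trim, elect sign, merge) can be sketched on toy task-vector deltas (fine-tuned weights minus base). mergekit implements the real algorithm per tensor; this NumPy version only illustrates the mechanics.

```python
# Toy NumPy sketch of TIES on task-vector deltas (illustrative only;
# mergekit does the real per-tensor implementation).
import numpy as np

def ties_merge(deltas: list, density: float = 0.5) -> np.ndarray:
    trimmed = []
    for d in deltas:
        k = max(1, int(density * d.size))       # 1. trim: keep top-k magnitudes
        thresh = np.sort(np.abs(d))[-k]
        trimmed.append(np.where(np.abs(d) >= thresh, d, 0.0))
    stacked = np.stack(trimmed)
    sign = np.sign(stacked.sum(axis=0))         # 2. elect sign: majority by mass
    agree = np.where(np.sign(stacked) == sign, stacked, 0.0)
    counts = (agree != 0).sum(axis=0)           # 3. merge: mean of agreeing deltas
    return agree.sum(axis=0) / np.maximum(counts, 1)

merged = ties_merge(
    [np.array([0.9, -0.1, 0.4]), np.array([0.8, 0.2, -0.5])],
    density=0.67,
)
```

Step 2 is what removes conflicting deltas: a parameter only receives updates whose sign agrees with the elected majority, so two models pulling a weight in opposite directions no longer cancel into noise.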
---
## πŸ“Š Evaluation Metrics
| Metric | Tool | Threshold |
|---|---|---|
| ROUGE-1/2/L | `rouge-score` | β‰₯ 0.30 |
| BERTScore F1 | `bert-score` | β‰₯ 0.50 |
| Faithfulness | `cross-encoder/nli-deberta-v3-small` | β‰₯ 0.50 |
| Hallucination | Heuristic + NLI | < 10% |
| Judge Score | LLM-as-Judge | β‰₯ 5.0/10 |
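A model "passes" evaluation only when every metric clears its gate. The helper below is a hypothetical mirror of the table above (the `THRESHOLDS` dict and key names are assumptions, not the actual config in `phase3_evaluation/evaluate.py`).

```python
# Hypothetical gate check mirroring the thresholds table above.
THRESHOLDS = {
    "rouge1": ("min", 0.30),
    "bertscore_f1": ("min", 0.50),
    "faithfulness": ("min", 0.50),
    "hallucination_rate": ("max", 0.10),   # must stay strictly below 10%
    "judge": ("min", 5.0),
}

def passes_gates(metrics: dict) -> bool:
    for name, (kind, bound) in THRESHOLDS.items():
        value = metrics[name]
        if kind == "min" and value < bound:
            return False
        if kind == "max" and value >= bound:
            return False
    return True

ok = passes_gates({"rouge1": 0.42, "bertscore_f1": 0.71, "faithfulness": 0.80,
                   "hallucination_rate": 0.05, "judge": 7.3})
```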
---
## πŸ”„ Iterative Improvement Loop
```
β”Œβ”€ Evaluate model ──────────────────────────────────┐
β”‚ ROUGE / BERTScore / Judge / Faithfulness β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ gaps detected?
β–Ό
β”Œβ”€ Detect knowledge gaps ───────────────────────────┐
β”‚ factual_recall / numerical / code / reasoning β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β–Ό
β”Œβ”€ Generate synthetic data ─────────────────────────┐
β”‚ LLM generates targeted QA pairs per gap β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β–Ό
β”Œβ”€ QLoRA fine-tune ─────────────────────────────────┐
β”‚ Response-only loss, 4-bit NF4, paged_adamw β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
└──────────────► repeat until target ROUGE or max_iter
```
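The loop above can be skeletonized as a plain control flow with the four phases injected as callables. The stubs below simulate improving ROUGE scores; the real implementations live in `phase3_evaluation` and `phase4_finetuning`, and the `improvement_loop` helper here is a sketch, not the orchestrator's actual code.

```python
# Skeleton of the iterative improvement loop with stubbed phase callables.
def improvement_loop(evaluate, detect_gaps, generate_data, finetune,
                     target_rouge: float = 0.35, max_iter: int = 3) -> list:
    history = []
    for _ in range(max_iter):
        metrics = evaluate()
        history.append(metrics)
        if metrics["rouge1"] >= target_rouge:
            break                       # target reached, stop early
        gaps = detect_gaps(metrics)
        if not gaps:
            break                       # nothing actionable left
        finetune(generate_data(gaps))
    return history

# Stub run: ROUGE improves 0.28 -> 0.33 -> 0.36, stopping at the third eval.
scores = iter([0.28, 0.33, 0.36])
history = improvement_loop(
    evaluate=lambda: {"rouge1": next(scores)},
    detect_gaps=lambda m: ["factual_recall"],
    generate_data=lambda gaps: [{"gap": g} for g in gaps],
    finetune=lambda data: None,
    target_rouge=0.35,
    max_iter=5,
)
```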
---
## πŸ› οΈ Hardware Requirements
| Scale | GPU | RAM | Notes |
|---|---|---|---|
| Small (1–3B) | Any CUDA GPU | 16GB | CPU possible but slow |
| **Medium (7B)** | **A100 / H100 40GB** | **32GB** | **Recommended** |
| Large (13B+) | 2Γ— A100 80GB | 64GB | Set `tensor_parallel=2` |
---
## πŸ“¦ Key Dependencies
- [`transformers`](https://github.com/huggingface/transformers) β€” model loading
- [`peft`](https://github.com/huggingface/peft) β€” LoRA/QLoRA adapters
- [`trl`](https://github.com/huggingface/trl) β€” SFTTrainer
- [`mergekit`](https://github.com/arcee-ai/mergekit) β€” TIES, DARE, SLERP
- [`vllm`](https://github.com/vllm-project/vllm) β€” high-throughput inference
- [`bert-score`](https://github.com/Tiiiger/bert_score) β€” semantic evaluation
- [`wandb`](https://wandb.ai) + [`mlflow`](https://mlflow.org) β€” experiment tracking
---
## πŸ“„ License
[Apache 2.0](LICENSE)
---
## πŸ™ Acknowledgements
- [mergekit](https://github.com/arcee-ai/mergekit) by Arcee AI
- [TIES-Merging](https://arxiv.org/abs/2306.01708) β€” Yadav et al., 2023
- [DARE](https://arxiv.org/abs/2311.03099) β€” Yu et al., 2023
- [Task Arithmetic](https://arxiv.org/abs/2212.04089) β€” Ilharco et al., 2023
- [QLoRA](https://arxiv.org/abs/2305.14314) β€” Dettmers et al., 2023