# 🤖 LLM Pipeline

An end-to-end automated pipeline for discovering, merging, evaluating, and fine-tuning open-source LLMs, with full MLOps integration.

## 🗺️ Architecture

```text
┌─────────┬─────────────┬───────────────────────────────┐
│ Phase 1 │ Discovery   │ Scan HF Hub → filter → rank   │
├─────────┼─────────────┼───────────────────────────────┤
│ Phase 2 │ Merging     │ SLERP · TIES · DARE · TA      │
├─────────┼─────────────┼───────────────────────────────┤
│ Phase 3 │ Evaluation  │ ROUGE · BERTScore · Judge     │
├─────────┼─────────────┼───────────────────────────────┤
│ Phase 4 │ Fine-Tuning │ LoRA/QLoRA · Synthetic Data   │
├─────────┼─────────────┼───────────────────────────────┤
│ Phase 5 │ MLOps       │ vLLM · W&B · MLflow · HF Hub  │
└─────────┴─────────────┴───────────────────────────────┘
       └───────── Iterative improvement loop ─────────┘
```
## ✨ Features

### Phase 1 – Model Discovery
- Automated HF Hub crawler with category-based keyword search
- Quality filtering: downloads, likes, parameter count, model card completeness
- Optional lightweight perplexity probe for fast quality estimation
- Composite scoring and ranked shortlist output
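The composite score can be pictured as a weighted blend of these signals. The weights and log-scaling below are illustrative only, not the pipeline's actual formula:

```python
import math

def composite_score(downloads, likes, params_b, card_complete):
    """Blend popularity, size fit, and documentation into one rank score.

    Illustrative weights: log-scaled popularity signals, a preference for
    models at or below 8B parameters, and a small model-card bonus.
    """
    popularity = math.log10(downloads + 1) / 8          # ~0-1 up to 1e8 downloads
    engagement = math.log10(likes + 1) / 4              # ~0-1 up to 1e4 likes
    size_fit = 1.0 if params_b <= 8 else 8 / params_b   # penalize oversized models
    doc_bonus = 0.1 if card_complete else 0.0
    return round(0.4 * popularity + 0.3 * engagement + 0.3 * size_fit + doc_bonus, 3)

# Rank a tiny candidate pool into a shortlist
ranked = sorted(
    [("model-a", composite_score(2_500_000, 800, 7, True)),
     ("model-b", composite_score(40_000, 90, 13, False))],
    key=lambda x: x[1], reverse=True,
)
print(ranked)
```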
### Phase 2 – Model Composition
- Union merging (capability aggregation):
  - SLERP – spherical linear interpolation
  - TIES – trim, elect sign, merge
  - DARE-TIES – dropout + TIES
  - Task Arithmetic – delta-weight addition
- Intersection merging (conservative):
  - Breadcrumbs – consensus-only parameter updates
- DOM-tree-style architecture introspection (layers, attention heads, MLP blocks)
- mergekit integration + pure-PyTorch fallback
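SLERP, the simplest of these strategies, interpolates along the arc between two weight vectors rather than the straight chord between them. A toy sketch on plain Python lists, not the mergekit kernel (which applies this per-tensor with dtype and broadcasting care):

```python
import math

def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between two weight vectors.

    Falls back to plain linear interpolation when the vectors are
    nearly parallel (the angle omega approaches zero).
    """
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    dot = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    dot = max(-1.0, min(1.0, dot))          # guard acos against rounding
    omega = math.acos(dot)
    if abs(math.sin(omega)) < eps:          # nearly parallel: plain lerp
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1 - t) * omega) / math.sin(omega)
    wb = math.sin(t * omega) / math.sin(omega)
    return [wa * x + wb * y for x, y in zip(a, b)]

merged = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
print(merged)   # midpoint on the unit circle: [0.7071..., 0.7071...]
```

Note that unlike plain averaging, the SLERP midpoint of two unit vectors is still a unit vector, which is the motivation for using it on normalized weight directions.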
### Phase 3 – Evaluation Framework
- ROUGE (rouge1, rouge2, rougeL)
- BERTScore (semantic similarity)
- Faithfulness + hallucination detection (NLI-based)
- LLM-as-Judge scoring (0–10)
- Multi-model side-by-side comparison
- Knowledge gap detector – feeds Phase 4
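The gap detector can be pictured as bucketing below-threshold eval samples by their category tag and ranking the buckets. The categories and threshold here are illustrative:

```python
def detect_gaps(results, threshold=0.30):
    """Group below-threshold eval samples into gap categories for Phase 4.

    `results` pairs each sample's category tag with a per-sample score
    (e.g. ROUGE-1); returns categories ranked by how many samples failed.
    """
    gaps = {}
    for category, score in results:
        if score < threshold:
            gaps.setdefault(category, []).append(score)
    return sorted(gaps, key=lambda c: len(gaps[c]), reverse=True)

results = [
    ("factual_recall", 0.12), ("factual_recall", 0.21),
    ("numerical", 0.18), ("code", 0.45), ("reasoning", 0.33),
]
print(detect_gaps(results))   # ['factual_recall', 'numerical']
```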
### Phase 4 – Efficient Fine-Tuning
- QLoRA (4-bit NF4 quantization + LoRA adapters)
- Response-only training (loss on assistant turns only)
- Synthetic data generation per detected gap category
- Delta adapter extraction (merge-ready weights)
- Iterative improvement loop: eval → gap detect → generate → train → repeat
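Response-only training works by masking prompt tokens out of the loss. A minimal sketch of the label masking, assuming the start index of the assistant turn has already been located (finding it via the chat template's assistant marker is tokenizer-specific and elided here):

```python
IGNORE_INDEX = -100  # labels with this value are skipped by cross-entropy loss

def mask_prompt_tokens(input_ids, response_start):
    """Build labels so loss is computed on the assistant turn only.

    Copies the input ids and overwrites every position before
    `response_start` with IGNORE_INDEX.
    """
    labels = list(input_ids)
    for i in range(response_start):
        labels[i] = IGNORE_INDEX
    return labels

# Prompt occupies positions 0-3; the assistant response starts at position 4
labels = mask_prompt_tokens([101, 7592, 2088, 102, 2023, 2003, 1996, 3437], 4)
print(labels)   # [-100, -100, -100, -100, 2023, 2003, 1996, 3437]
```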
### Phase 5 – MLOps
- vLLM inference with PagedAttention (OpenAI-compatible API)
- Throughput benchmarking
- Dual tracking: Weights & Biases + MLflow
- Auto-generated model cards
- One-command HF Hub deployment
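Throughput benchmarking reduces to tokens emitted over wall-clock time. A minimal sketch with a stubbed `generate` callable standing in for the serving endpoint:

```python
import time

def benchmark_throughput(generate, prompts):
    """Measure tokens/sec for a generate() callable over a prompt batch.

    `generate` returns the number of tokens emitted for one prompt; here
    it is a stub, in practice it would call the serving endpoint.
    """
    start = time.perf_counter()
    total_tokens = sum(generate(p) for p in prompts)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed if elapsed > 0 else 0.0

fake_generate = lambda prompt: 128            # pretend each request emits 128 tokens
tps = benchmark_throughput(fake_generate, ["a", "b", "c", "d"])
print(f"{tps:.0f} tok/s")
```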
## 🚀 Quick Start

### 1. Install

```bash
git clone https://github.com/YOUR_USERNAME/llm-pipeline.git
cd llm-pipeline
pip install -r requirements.txt
```

### 2. Set environment variables

```bash
export HF_TOKEN="hf_..."
export WANDB_API_KEY="..."
```

### 3. Run the full pipeline

```bash
python pipeline.py run reasoning
python pipeline.py run code --loop --max-iter 3
python pipeline.py run medical --strategy breadcrumbs --top-k 3
```
## 📖 Usage – Individual Phases

### Phase 1: Discovery

```bash
# Rank the top 5 reasoning models
python -m phase1_discovery.discover run reasoning --top-k 5

# Add the lightweight perplexity probe for the code category
python -m phase1_discovery.discover run code --perplexity

# Crawl every configured category
python -m phase1_discovery.discover run --all
```
### Phase 2: Merging

```bash
# TIES merge of two fine-tunes onto a shared base
python -m phase2_merging.merge run ties \
    --model mistralai/Mistral-7B-v0.3 \
    --model teknium/OpenHermes-2.5-Mistral-7B \
    --base mistralai/Mistral-7B-v0.3

# SLERP with interpolation weight alpha=0.6
python -m phase2_merging.merge run slerp \
    --model model_a --model model_b --alpha 0.6

# Conservative breadcrumbs merge at 0.7 parameter density
python -m phase2_merging.merge run breadcrumbs \
    --model base --model ft_a --model ft_b --density 0.7

# Inspect a model's architecture before merging
python -m phase2_merging.merge run ties \
    --introspect mistralai/Mistral-7B-v0.3
```
### Phase 3: Evaluation

```bash
# Evaluate a merged model on 200 SQuAD samples
python -m phase3_evaluation.evaluate run ./merged_model --dataset squad --n-samples 200

# Side-by-side comparison of three models
python -m phase3_evaluation.evaluate run model_a \
    --compare model_b --compare model_c

# Skip the LLM-as-Judge pass
python -m phase3_evaluation.evaluate run ./merged --no-judge
```
### Phase 4: Fine-Tuning

```bash
# Target two gap categories with 100 synthetic samples
python -m phase4_finetuning.finetune run \
    --base mistralai/Mistral-7B-v0.3 \
    --gap factual_recall --gap numerical \
    --n-syn 100 --output ./adapters/run1

# Train on an existing synthetic dataset
python -m phase4_finetuning.finetune run \
    --base mistralai/Mistral-7B-v0.3 \
    --data-path ./artifacts/data/synthetic_data.jsonl

# Full eval → generate → train loop, up to 3 iterations
python -m phase4_finetuning.finetune run --loop --max-iter 3
```
### Phase 5: Inference & MLOps

```bash
# Serve with vLLM on port 8000 (OpenAI-compatible API)
python -m phase5_mlops.serve serve ./merged_model --port 8000

# Run the throughput benchmark
python -m phase5_mlops.serve serve ./merged_model --bench

# Log a run to W&B and MLflow
python -m phase5_mlops.serve track my-run \
    --model ./merged --strategy ties \
    --rouge1 0.42 --bertscore 0.71 --judge 7.3

# Push to the Hugging Face Hub
python -m phase5_mlops.serve deploy ./merged_model \
    --repo your-username/my-merged-7b

# Show the local leaderboard
python -m phase5_mlops.serve leaderboard
```
## 📁 Project Structure

```text
llm-pipeline/
├── pipeline.py              # Master orchestrator
├── requirements.txt
├── configs/
│   └── settings.py          # All config: paths, scale, hyperparams
├── utils/
│   └── logger.py            # Centralized logging
├── phase1_discovery/
│   └── discover.py          # HF Hub crawler + ranking
├── phase2_merging/
│   └── merge.py             # Merging + architecture introspection
├── phase3_evaluation/
│   └── evaluate.py          # Multi-metric eval + gap detection
├── phase4_finetuning/
│   └── finetune.py          # QLoRA + synthetic data + loop
├── phase5_mlops/
│   └── serve.py             # vLLM + W&B + MLflow + HF deploy
└── artifacts/               # Auto-created at runtime
    ├── models/
    ├── merges/
    ├── adapters/
    ├── evaluations/
    └── data/
```
## ⚙️ Configuration

Edit `configs/settings.py` to customize:

```python
SCALE = "medium"

HF_MODEL_CATEGORIES = {
    "code": ["starcoder", "codellama", "deepseek-coder"],
    "reasoning": ["mistral", "llama", "qwen"],
    ...
}

FT_BASE_MODEL = "mistralai/Mistral-7B-v0.3"
FT_EPOCHS = 3
FT_LR = 2e-4

VLLM_GPU_MEMORY_UTIL = 0.90
VLLM_MAX_MODEL_LEN = 4096
```
π§© Supported Merge Strategies
| Strategy |
Type |
Best For |
slerp |
Union |
Two-model smooth interpolation |
ties |
Union |
Multi-model, removes conflicting deltas |
dare_ties |
Union |
Aggressive sparsification before TIES |
task_arithmetic |
Union |
Adding task-specific capabilities |
breadcrumbs |
Intersection |
Conservative, safety-preserving merge |
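To make the `ties` row concrete, here is a toy trim/elect/merge pass over per-parameter delta lists. Real TIES operates on full weight tensors; `density` below mirrors the trim fraction and the whole function is a simplified sketch:

```python
def ties_elect_sign(deltas, density=0.5):
    """Toy TIES merge over per-parameter delta lists.

    Trim: keep only the largest-magnitude fraction (`density`) of each delta.
    Elect: pick the majority sign per parameter by summed value.
    Merge: average the surviving deltas that agree with the elected sign.
    """
    n = len(deltas[0])
    trimmed = []
    for d in deltas:
        k = max(1, int(len(d) * density))
        keep = set(sorted(range(len(d)), key=lambda i: abs(d[i]), reverse=True)[:k])
        trimmed.append([d[i] if i in keep else 0.0 for i in range(len(d))])
    merged = []
    for i in range(n):
        col = [d[i] for d in trimmed]
        elected = 1.0 if sum(col) >= 0 else -1.0
        agree = [v for v in col if v != 0.0 and (v > 0) == (elected > 0)]
        merged.append(sum(agree) / len(agree) if agree else 0.0)
    return merged

# Two models agree on param 0, both trimmed param 1, conflict on param 2
print(ties_elect_sign([[0.8, -0.1, 0.3], [0.6, 0.2, -0.9]], density=0.67))
# ≈ [0.7, 0.0, -0.9]
```

The conflicting update at index 2 illustrates the point of sign election: the smaller positive delta is discarded instead of partially cancelling the dominant negative one, which is what plain averaging would do.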
## 📊 Evaluation Metrics

| Metric | Tool | Threshold |
|---|---|---|
| ROUGE-1/2/L | `rouge-score` | ≥ 0.30 |
| BERTScore F1 | `bert-score` | ≥ 0.50 |
| Faithfulness | `cross-encoder/nli-deberta-v3-small` | ≥ 0.50 |
| Hallucination | Heuristic + NLI | < 10% |
| Judge Score | LLM-as-Judge | ≥ 5.0/10 |
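For intuition, ROUGE-1 F1 is just clipped unigram overlap between candidate and reference. The pipeline itself uses the `rouge-score` package (which also applies stemming); this hand-rolled version only counts whitespace tokens:

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """Unigram-overlap ROUGE-1 F1 on lowercased whitespace tokens."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())          # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 3))   # 0.833 — 5 of 6 unigrams match in both directions
```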
## 🔁 Iterative Improvement Loop

```text
┌─ Evaluate model ───────────────────────────────────┐
│  ROUGE / BERTScore / Judge / Faithfulness          │
└───────────────────┬────────────────────────────────┘
                    │ gaps detected?
                    ▼
┌─ Detect knowledge gaps ────────────────────────────┐
│  factual_recall / numerical / code / reasoning     │
└───────────────────┬────────────────────────────────┘
                    │
                    ▼
┌─ Generate synthetic data ──────────────────────────┐
│  LLM generates targeted QA pairs per gap           │
└───────────────────┬────────────────────────────────┘
                    │
                    ▼
┌─ QLoRA fine-tune ──────────────────────────────────┐
│  Response-only loss, 4-bit NF4, paged_adamw        │
└───────────────────┬────────────────────────────────┘
                    │
                    └──────► repeat until target ROUGE or max_iter
```
## 🛠️ Hardware Requirements

| Scale | GPU | RAM | Notes |
|---|---|---|---|
| Small (1–3B) | Any CUDA GPU | 16GB | CPU possible but slow |
| Medium (7B) | A100 / H100 40GB | 32GB | Recommended |
| Large (13B+) | 2× A100 80GB | 64GB | Set `tensor_parallel=2` |
## 📦 Key Dependencies
## 📄 License

Apache 2.0

## 🙏 Acknowledgements