# πŸ€– LLM Pipeline

> An end-to-end automated pipeline for **discovering**, **merging**, **evaluating**, and **fine-tuning** open-source LLMs β€” with full MLOps integration.

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
[![HF Hub](https://img.shields.io/badge/πŸ€—-Hugging%20Face-yellow)](https://huggingface.co)
[![W&B](https://img.shields.io/badge/Weights%20%26%20Biases-tracking-orange)](https://wandb.ai)

---

## πŸ—ΊοΈ Architecture

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Phase 1 β€” Discovery   β”‚ Scan HF Hub β†’ filter β†’ rank  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Phase 2 β€” Merging     β”‚ SLERP Β· TIES Β· DARE Β· TA     β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Phase 3 β€” Evaluation  β”‚ ROUGE Β· BERTScore Β· Judge    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Phase 4 β€” Fine-Tuning β”‚ LoRA/QLoRA Β· Synthetic Data  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Phase 5 β€” MLOps       β”‚ vLLM Β· W&B Β· MLflow Β· HF Hub β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        ↑__________________________|  Iterative improvement loop
```

---

## ✨ Features

### Phase 1 β€” Model Discovery

- Automated HF Hub crawler with category-based keyword search
- Quality filtering: downloads, likes, parameter count, model card completeness
- Optional lightweight perplexity probe for fast quality estimation
- Composite scoring and ranked shortlist output

### Phase 2 β€” Model Composition

- **Union merging** (capability aggregation):
  - SLERP β€” spherical linear interpolation
  - TIES β€” trim, elect sign, merge
  - DARE-TIES β€” dropout + TIES
  - Task Arithmetic β€” delta-weight addition
- **Intersection merging** (conservative):
  - Breadcrumbs β€” consensus-only parameter updates
- DOM-tree-style architecture introspection (layers, attention heads, MLP blocks)
- mergekit integration + pure-PyTorch fallback

### Phase 3 β€” Evaluation Framework

- ROUGE (rouge1, rouge2, rougeL)
- BERTScore (semantic similarity)
- Faithfulness + hallucination detection (NLI-based)
- LLM-as-Judge scoring (0–10)
- Multi-model side-by-side comparison
- **Knowledge gap detector** β†’ feeds Phase 4

### Phase 4 β€” Efficient Fine-Tuning

- QLoRA (4-bit NF4 quantization + LoRA adapters)
- Response-only training (loss on assistant turns only)
- Synthetic data generation per detected gap category
- Delta adapter extraction (merge-ready weights)
- **Iterative improvement loop**: eval β†’ gap detect β†’ generate β†’ train β†’ repeat

### Phase 5 β€” MLOps

- vLLM inference with PagedAttention (OpenAI-compatible API)
- Throughput benchmarking
- Dual tracking: Weights & Biases + MLflow
- Auto-generated model cards
- One-command HF Hub deployment

---

## πŸš€ Quick Start

### 1. Install

```bash
git clone https://github.com/YOUR_USERNAME/llm-pipeline.git
cd llm-pipeline
pip install -r requirements.txt
```

### 2. Set environment variables

```bash
export HF_TOKEN="hf_..."       # Hugging Face token
export WANDB_API_KEY="..."     # W&B token (optional)
```

### 3. Run the full pipeline

```bash
# Full pipeline for reasoning models
python pipeline.py run reasoning

# With iterative improvement loop
python pipeline.py run code --loop --max-iter 3

# Custom merge strategy
python pipeline.py run medical --strategy breadcrumbs --top-k 3
```

---

## πŸ“– Usage β€” Individual Phases

### Phase 1: Discovery

```bash
python -m phase1_discovery.discover run reasoning --top-k 5
python -m phase1_discovery.discover run code --perplexity   # adds perplexity probe
python -m phase1_discovery.discover run --all               # all categories
```

### Phase 2: Merging

```bash
# TIES merge (recommended for union)
python -m phase2_merging.merge run ties \
  --model mistralai/Mistral-7B-v0.3 \
  --model teknium/OpenHermes-2.5-Mistral-7B \
  --base mistralai/Mistral-7B-v0.3

# SLERP interpolation
python -m phase2_merging.merge run slerp \
  --model model_a --model model_b --alpha 0.6

# Breadcrumbs (conservative / intersection)
python -m phase2_merging.merge run breadcrumbs \
  --model base --model ft_a --model ft_b --density 0.7

# Inspect architecture (DOM-tree view)
python -m phase2_merging.merge run ties \
  --introspect mistralai/Mistral-7B-v0.3
```

### Phase 3: Evaluation

```bash
# Evaluate on SQuAD v2
python -m phase3_evaluation.evaluate run ./merged_model --dataset squad --n-samples 200

# Compare multiple models
python -m phase3_evaluation.evaluate run model_a \
  --compare model_b --compare model_c

# Disable LLM judge (faster)
python -m phase3_evaluation.evaluate run ./merged --no-judge
```

### Phase 4: Fine-Tuning

```bash
# Fine-tune targeting specific gaps
python -m phase4_finetuning.finetune run \
  --base mistralai/Mistral-7B-v0.3 \
  --gap factual_recall --gap numerical \
  --n-syn 100 --output ./adapters/run1

# Use existing synthetic data
python -m phase4_finetuning.finetune run \
  --base mistralai/Mistral-7B-v0.3 \
  --data-path ./artifacts/data/synthetic_data.jsonl

# Iterative loop
python -m phase4_finetuning.finetune run --loop --max-iter 3
```

### Phase 5: Inference & MLOps

```bash
# Start vLLM server (OpenAI-compatible)
python -m phase5_mlops.serve serve ./merged_model --port 8000

# Benchmark throughput
python -m phase5_mlops.serve serve ./merged_model --bench

# Track experiment
python -m phase5_mlops.serve track my-run \
  --model ./merged --strategy ties \
  --rouge1 0.42 --bertscore 0.71 --judge 7.3

# Deploy to HF Hub
python -m phase5_mlops.serve deploy ./merged_model \
  --repo your-username/my-merged-7b

# View leaderboard
python -m phase5_mlops.serve leaderboard
```

---

## πŸ“ Project Structure

```
llm-pipeline/
β”œβ”€β”€ pipeline.py              # Master orchestrator
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ configs/
β”‚   └── settings.py          # All config: paths, scale, hyperparams
β”œβ”€β”€ utils/
β”‚   └── logger.py            # Centralized logging
β”œβ”€β”€ phase1_discovery/
β”‚   └── discover.py          # HF Hub crawler + ranking
β”œβ”€β”€ phase2_merging/
β”‚   └── merge.py             # Merging + architecture introspection
β”œβ”€β”€ phase3_evaluation/
β”‚   └── evaluate.py          # Multi-metric eval + gap detection
β”œβ”€β”€ phase4_finetuning/
β”‚   └── finetune.py          # QLoRA + synthetic data + loop
β”œβ”€β”€ phase5_mlops/
β”‚   └── serve.py             # vLLM + W&B + MLflow + HF deploy
└── artifacts/               # Auto-created at runtime
    β”œβ”€β”€ models/
    β”œβ”€β”€ merges/
    β”œβ”€β”€ adapters/
    β”œβ”€β”€ evaluations/
    └── data/
```

---

## βš™οΈ Configuration

Edit `configs/settings.py` to customize:

```python
# Scale preset (currently: medium = 7B, single A100)
SCALE = "medium"

# Categories and keywords for discovery
HF_MODEL_CATEGORIES = {
    "code": ["starcoder", "codellama", "deepseek-coder"],
    "reasoning": ["mistral", "llama", "qwen"],
    ...
}

# Fine-tuning defaults
FT_BASE_MODEL = "mistralai/Mistral-7B-v0.3"
FT_EPOCHS = 3
FT_LR = 2e-4

# vLLM
VLLM_GPU_MEMORY_UTIL = 0.90
VLLM_MAX_MODEL_LEN = 4096
```

---

## 🧩 Supported Merge Strategies

| Strategy | Type | Best For |
|---|---|---|
| `slerp` | Union | Two-model smooth interpolation |
| `ties` | Union | Multi-model, removes conflicting deltas |
| `dare_ties` | Union | Aggressive sparsification before TIES |
| `task_arithmetic` | Union | Adding task-specific capabilities |
| `breadcrumbs` | Intersection | Conservative, safety-preserving merge |

---

## πŸ“Š Evaluation Metrics

| Metric | Tool | Threshold |
|---|---|---|
| ROUGE-1/2/L | `rouge-score` | β‰₯ 0.30 |
| BERTScore F1 | `bert-score` | β‰₯ 0.50 |
| Faithfulness | `cross-encoder/nli-deberta-v3-small` | β‰₯ 0.50 |
| Hallucination | Heuristic + NLI | < 10% |
| Judge Score | LLM-as-Judge | β‰₯ 5.0/10 |

---

## πŸ”„ Iterative Improvement Loop

```
β”Œβ”€ Evaluate model ──────────────────────────────────┐
β”‚ ROUGE / BERTScore / Judge / Faithfulness          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                   β”‚ gaps detected?
                   β–Ό
β”Œβ”€ Detect knowledge gaps ───────────────────────────┐
β”‚ factual_recall / numerical / code / reasoning     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                   β”‚
                   β–Ό
β”Œβ”€ Generate synthetic data ─────────────────────────┐
β”‚ LLM generates targeted QA pairs per gap           β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                   β”‚
                   β–Ό
β”Œβ”€ QLoRA fine-tune ─────────────────────────────────┐
β”‚ Response-only loss, 4-bit NF4, paged_adamw        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                   β”‚
                   └──────────────► repeat until target ROUGE or max_iter
```

---

## πŸ› οΈ Hardware Requirements

| Scale | GPU | RAM | Notes |
|---|---|---|---|
| Small (1–3B) | Any CUDA GPU | 16GB | CPU possible but slow |
| **Medium (7B)** | **A100 / H100 40GB** | **32GB** | **Recommended** |
| Large (13B+) | 2Γ— A100 80GB | 64GB | Set `tensor_parallel=2` |

---

## πŸ“¦ Key Dependencies

- [`transformers`](https://github.com/huggingface/transformers) β€” model loading
- [`peft`](https://github.com/huggingface/peft) β€” LoRA/QLoRA adapters
- [`trl`](https://github.com/huggingface/trl) β€” SFTTrainer
- [`mergekit`](https://github.com/arcee-ai/mergekit) β€” TIES, DARE, SLERP
- [`vllm`](https://github.com/vllm-project/vllm) β€” high-throughput inference
- [`bert-score`](https://github.com/Tiiiger/bert_score) β€” semantic evaluation
- [`wandb`](https://wandb.ai) + [`mlflow`](https://mlflow.org) β€” experiment tracking

---

## πŸ“„ License

[Apache 2.0](LICENSE)

---

## πŸ™ Acknowledgements

- [mergekit](https://github.com/arcee-ai/mergekit) by Arcee AI
- [TIES-Merging](https://arxiv.org/abs/2306.01708) β€” Yadav et al., 2023
- [DARE](https://arxiv.org/abs/2311.03099) β€” Yu et al., 2023
- [Task Arithmetic](https://arxiv.org/abs/2212.04089) β€” Ilharco et al., 2023
- [QLoRA](https://arxiv.org/abs/2305.14314) β€” Dettmers et al., 2023
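To make the Phase 2 strategies concrete: SLERP interpolates along the arc between two weight vectors instead of the straight line, which preserves the overall magnitude of the weights better than plain averaging. The sketch below is an illustrative plain-Python version operating on flat weight vectors; it is not the repo's `phase2_merging/merge.py` implementation, which applies the same formula tensor-by-tensor.

```python
import math

def slerp(w_a, w_b, alpha=0.5, eps=1e-8):
    """Spherical linear interpolation between two flat weight vectors.

    alpha=0 returns w_a, alpha=1 returns w_b; intermediate values move
    along the arc between them.
    """
    dot = sum(a * b for a, b in zip(w_a, w_b))
    norm_a = math.sqrt(sum(a * a for a in w_a)) + eps
    norm_b = math.sqrt(sum(b * b for b in w_b)) + eps
    # Angle between the two vectors, clamped for numerical safety.
    omega = math.acos(max(-1.0, min(1.0, dot / (norm_a * norm_b))))
    so = math.sin(omega)
    if so < eps:  # near-parallel vectors: fall back to linear interpolation
        return [(1 - alpha) * a + alpha * b for a, b in zip(w_a, w_b)]
    f_a = math.sin((1 - alpha) * omega) / so
    f_b = math.sin(alpha * omega) / so
    return [f_a * a + f_b * b for a, b in zip(w_a, w_b)]
```

The near-parallel fallback matters in practice: fine-tunes of the same base model often differ by small deltas, so many tensors are almost collinear and the spherical formula degenerates there.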
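TIES (trim, elect sign, merge) resolves conflicts between multiple fine-tunes of the same base. A simplified plain-Python sketch of the three steps, shown over flat parameter lists rather than real checkpoints (the actual pipeline delegates to mergekit or its PyTorch fallback):

```python
def ties_merge(base, finetuned, density=0.5):
    """TIES sketch: trim small deltas, elect a sign per parameter by total
    magnitude, then average only the deltas agreeing with that sign."""
    # 1. Task vectors: per-model deltas from the shared base.
    deltas = [[f - b for f, b in zip(ft, base)] for ft in finetuned]
    # 2. Trim: keep only the top-`density` fraction of each delta by magnitude.
    trimmed = []
    for d in deltas:
        k = max(1, int(density * len(d)))
        thresh = sorted((abs(x) for x in d), reverse=True)[k - 1]
        trimmed.append([x if abs(x) >= thresh else 0.0 for x in d])
    # 3. Elect sign and merge the agreeing deltas (disjoint mean).
    merged = list(base)
    for j in range(len(base)):
        col = [t[j] for t in trimmed]
        elected = 1.0 if sum(col) >= 0 else -1.0
        agree = [x for x in col if x != 0.0 and (x > 0) == (elected > 0)]
        if agree:
            merged[j] += sum(agree) / len(agree)
    return merged
```

Conflicting deltas (opposite signs on the same parameter) are the usual failure mode of naive averaging; TIES simply drops the minority-sign contributions instead of letting them cancel.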
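Phase 4's "response-only training" means the loss is computed on assistant tokens only: label positions covering the prompt are set to -100, the index PyTorch's cross-entropy ignores. The helper below is an illustrative sketch of that masking idea (similar in spirit to TRL's completion-only collator), with a hypothetical `response_template` marking where the assistant turn begins:

```python
IGNORE_INDEX = -100  # ignored by PyTorch cross-entropy loss

def mask_non_response(input_ids, response_template):
    """Return labels with everything up to and including the response
    template masked out, so loss covers assistant tokens only."""
    labels = list(input_ids)
    n, m = len(input_ids), len(response_template)
    start = -1
    # Find the last occurrence of the template (multi-turn: train on the
    # final assistant response in this simplified sketch).
    for i in range(n - m + 1):
        if input_ids[i:i + m] == response_template:
            start = i + m
    if start == -1:  # no assistant turn found: mask the whole example
        return [IGNORE_INDEX] * n
    for i in range(start):
        labels[i] = IGNORE_INDEX
    return labels
```

Without this mask the model also learns to reproduce the prompt and system text, which wastes capacity and can degrade instruction following.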
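The iterative improvement loop (eval β†’ gap detect β†’ generate β†’ train β†’ repeat) reduces to a small control flow once the phase entry points are treated as callables. This sketch uses injected functions with illustrative signatures, not the actual `pipeline.py` orchestrator; the ROUGE-1 target mirrors the β‰₯ 0.30 threshold from the metrics table:

```python
def improvement_loop(model_path, evaluate, detect_gaps, generate_data,
                     finetune, target_rouge1=0.30, max_iter=3):
    """Run eval -> gap detect -> synth data -> fine-tune until the
    ROUGE-1 target is met, no gaps remain, or max_iter is reached."""
    metrics = {}
    for _ in range(max_iter):
        metrics = evaluate(model_path)          # Phase 3
        if metrics["rouge1"] >= target_rouge1:  # target hit: stop early
            break
        gaps = detect_gaps(metrics)             # knowledge gap detector
        if not gaps:
            break
        data = generate_data(gaps)              # targeted synthetic QA pairs
        model_path = finetune(model_path, data) # Phase 4 QLoRA pass
    return model_path, metrics
```

Stopping on "no gaps detected" as well as on the metric target keeps the loop from burning GPU hours fine-tuning on empty or redundant synthetic data.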
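Because the Phase 5 vLLM server speaks the OpenAI-compatible API, any standard HTTP client can query it at `/v1/completions`. A minimal stdlib-only sketch, assuming the server is running on port 8000 as in the Quick Start (the model name and base URL are illustrative):

```python
import json
import urllib.request

def build_completion_request(prompt, model="./merged_model",
                             max_tokens=128, temperature=0.2):
    """Build an OpenAI-style /v1/completions payload for the vLLM server."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def complete(prompt, base_url="http://localhost:8000"):
    """POST the payload and return the first completion's text."""
    payload = json.dumps(build_completion_request(prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```

The same endpoint shape means the official `openai` client library also works against the server by pointing its `base_url` at `http://localhost:8000/v1`.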