# 🤖 LLM Pipeline

> An end-to-end automated pipeline for **discovering**, **merging**, **evaluating**, and **fine-tuning** open-source LLMs – with full MLOps integration.

[Python](https://www.python.org/downloads/)
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
[Hugging Face](https://huggingface.co)
[Weights & Biases](https://wandb.ai)

---

## 🗺️ Architecture

```
┌──────────────────────────────────────────────────────┐
│ Phase 1 │ Discovery   │ Scan HF Hub → filter → rank  │
├──────────────────────────────────────────────────────┤
│ Phase 2 │ Merging     │ SLERP · TIES · DARE · TA     │
├──────────────────────────────────────────────────────┤
│ Phase 3 │ Evaluation  │ ROUGE · BERTScore · Judge    │
├──────────────────────────────────────────────────────┤
│ Phase 4 │ Fine-Tuning │ LoRA/QLoRA · Synthetic Data  │
├──────────────────────────────────────────────────────┤
│ Phase 5 │ MLOps       │ vLLM · W&B · MLflow · HF Hub │
└──────────────────────────────────────────────────────┘
              └──────────────────────────┘
               Iterative improvement loop
```

---

## ✨ Features

### Phase 1 – Model Discovery
- Automated HF Hub crawler with category-based keyword search
- Quality filtering: downloads, likes, parameter count, model card completeness
- Optional lightweight perplexity probe for fast quality estimation
- Composite scoring and ranked shortlist output

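The ranked shortlist amounts to sorting candidates by a weighted blend of these signals. A minimal sketch – the field names, weights, and log-damping below are illustrative assumptions, not the pipeline's actual scoring in `phase1_discovery/discover.py`:

```python
import math

# Hypothetical weights; the real values may differ.
WEIGHTS = {"downloads": 0.4, "likes": 0.2, "card": 0.2, "perplexity": 0.2}

def composite_score(model: dict) -> float:
    """Blend popularity and quality signals into one rank key."""
    downloads = math.log10(1 + model.get("downloads", 0))  # dampen heavy tails
    likes = math.log10(1 + model.get("likes", 0))
    card = 1.0 if model.get("has_model_card") else 0.0
    ppl = model.get("perplexity")          # lower perplexity is better,
    ppl_term = 1.0 / ppl if ppl else 0.0   # so invert it when the probe ran
    return (WEIGHTS["downloads"] * downloads + WEIGHTS["likes"] * likes
            + WEIGHTS["card"] * card + WEIGHTS["perplexity"] * ppl_term)

candidates = [
    {"id": "strong", "downloads": 1_000_000, "likes": 500, "has_model_card": True},
    {"id": "weak", "downloads": 10_000, "likes": 20, "has_model_card": False},
]
shortlist = sorted(candidates, key=composite_score, reverse=True)
```
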
### Phase 2 – Model Composition
- **Union merging** (capability aggregation):
  - SLERP – spherical linear interpolation
  - TIES – trim, elect sign, merge
  - DARE-TIES – dropout + TIES
  - Task Arithmetic – delta-weight addition
- **Intersection merging** (conservative):
  - Breadcrumbs – consensus-only parameter updates
- DOM-tree-style architecture introspection (layers, attention heads, MLP blocks)
- mergekit integration + pure-PyTorch fallback

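For intuition, the three TIES steps can be demonstrated on toy "task vectors" (fine-tuned weights minus base weights). This pure-Python sketch only illustrates the logic; the pipeline delegates real tensor merges to mergekit:

```python
def ties_merge(deltas: list[list[float]], density: float = 0.5) -> list[float]:
    """Toy TIES: trim each delta, elect a sign per coordinate, merge agreers."""
    n = len(deltas[0])
    # 1. Trim: keep only the top-`density` fraction of each delta by magnitude.
    trimmed = []
    for d in deltas:
        k = max(1, int(len(d) * density))
        cutoff = sorted((abs(x) for x in d), reverse=True)[k - 1]
        trimmed.append([x if abs(x) >= cutoff else 0.0 for x in d])
    merged = []
    for i in range(n):
        col = [t[i] for t in trimmed]
        # 2. Elect sign: the dominant sign carries the larger total mass.
        pos = sum(x for x in col if x > 0)
        neg = -sum(x for x in col if x < 0)
        sign = 1.0 if pos >= neg else -1.0
        # 3. Merge: average only the deltas that agree with the elected sign.
        agree = [x for x in col if x * sign > 0]
        merged.append(sum(agree) / len(agree) if agree else 0.0)
    return merged

# Two task vectors that conflict in sign on the second coordinate:
merged = ties_merge([[0.9, 0.2, 0.0], [0.8, -0.7, 0.1]], density=0.67)
```

The second coordinate keeps only the elected negative delta instead of averaging the conflicting signs to near zero – the point of TIES over naive averaging.
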
### Phase 3 – Evaluation Framework
- ROUGE (rouge1, rouge2, rougeL)
- BERTScore (semantic similarity)
- Faithfulness + hallucination detection (NLI-based)
- LLM-as-Judge scoring (0–10)
- Multi-model side-by-side comparison
- **Knowledge gap detector** – feeds Phase 4

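At its simplest, a gap detector buckets low-scoring evaluation samples by category. A sketch – the category names mirror the `--gap` flags shown in the Phase 4 usage below, while the threshold and field names here are illustrative assumptions:

```python
GAP_THRESHOLD = 0.30  # per-sample ROUGE-L below this counts as a miss (assumed)

def detect_gaps(samples: list[dict]) -> dict[str, int]:
    """Count evaluation misses per category; frequent categories become gaps."""
    counts: dict[str, int] = {}
    for s in samples:
        if s["rougeL"] < GAP_THRESHOLD:
            counts[s["category"]] = counts.get(s["category"], 0) + 1
    return counts

samples = [
    {"category": "factual_recall", "rougeL": 0.12},
    {"category": "factual_recall", "rougeL": 0.55},
    {"category": "numerical", "rougeL": 0.08},
]
gaps = detect_gaps(samples)   # {'factual_recall': 1, 'numerical': 1}
```
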
### Phase 4 – Efficient Fine-Tuning
- QLoRA (4-bit NF4 quantization + LoRA adapters)
- Response-only training (loss on assistant turns only)
- Synthetic data generation per detected gap category
- Delta adapter extraction (merge-ready weights)
- **Iterative improvement loop**: eval → gap detect → generate → train → repeat

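Response-only training hinges on label masking: prompt positions get PyTorch's ignore index, so cross-entropy flows only through assistant tokens. A schematic sketch with dummy token IDs (in practice a collator such as TRL's handles this on real tokenizer output):

```python
IGNORE_INDEX = -100  # the label value PyTorch's CrossEntropyLoss skips

def mask_prompt_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Copy input_ids to labels, hiding the first prompt_len positions."""
    return [IGNORE_INDEX] * prompt_len + input_ids[prompt_len:]

ids = [101, 7592, 2088, 102, 3449, 2003, 1996, 3437]   # prompt + response
labels = mask_prompt_labels(ids, prompt_len=4)
# prompt positions masked, response tokens kept as loss targets
```
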
### Phase 5 – MLOps
- vLLM inference with PagedAttention (OpenAI-compatible API)
- Throughput benchmarking
- Dual tracking: Weights & Biases + MLflow
- Auto-generated model cards
- One-command HF Hub deployment

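Because the vLLM server speaks the OpenAI chat-completions protocol, any OpenAI-style client can talk to it. A stdlib-only sketch of the request, assuming the `./merged_model` path and port 8000 used in the serve example under Usage:

```python
import json
import urllib.request

# OpenAI-style chat completion request aimed at the local vLLM server.
payload = {
    "model": "./merged_model",   # the path passed to `serve`
    "messages": [{"role": "user", "content": "Summarize TIES merging."}],
    "max_tokens": 128,
    "temperature": 0.2,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment with the server running
```
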
---

## 🚀 Quick Start

### 1. Install

```bash
git clone https://github.com/YOUR_USERNAME/llm-pipeline.git
cd llm-pipeline
pip install -r requirements.txt
```

### 2. Set environment variables

```bash
export HF_TOKEN="hf_..."        # Hugging Face token
export WANDB_API_KEY="..."      # W&B token (optional)
```

### 3. Run the full pipeline

```bash
# Full pipeline for reasoning models
python pipeline.py run reasoning

# With iterative improvement loop
python pipeline.py run code --loop --max-iter 3

# Custom merge strategy
python pipeline.py run medical --strategy breadcrumbs --top-k 3
```

---

## 📖 Usage – Individual Phases

### Phase 1: Discovery
```bash
python -m phase1_discovery.discover run reasoning --top-k 5
python -m phase1_discovery.discover run code --perplexity   # adds perplexity probe
python -m phase1_discovery.discover run --all               # all categories
```

### Phase 2: Merging
```bash
# TIES merge (recommended for union)
python -m phase2_merging.merge run ties \
    --model mistralai/Mistral-7B-v0.3 \
    --model teknium/OpenHermes-2.5-Mistral-7B \
    --base mistralai/Mistral-7B-v0.3

# SLERP interpolation
python -m phase2_merging.merge run slerp \
    --model model_a --model model_b --alpha 0.6

# Breadcrumbs (conservative / intersection)
python -m phase2_merging.merge run breadcrumbs \
    --model base --model ft_a --model ft_b --density 0.7

# Inspect architecture (DOM-tree view)
python -m phase2_merging.merge run ties \
    --introspect mistralai/Mistral-7B-v0.3
```

### Phase 3: Evaluation
```bash
# Evaluate on SQuAD v2
python -m phase3_evaluation.evaluate run ./merged_model --dataset squad --n-samples 200

# Compare multiple models
python -m phase3_evaluation.evaluate run model_a \
    --compare model_b --compare model_c

# Disable LLM judge (faster)
python -m phase3_evaluation.evaluate run ./merged --no-judge
```

### Phase 4: Fine-Tuning
```bash
# Fine-tune targeting specific gaps
python -m phase4_finetuning.finetune run \
    --base mistralai/Mistral-7B-v0.3 \
    --gap factual_recall --gap numerical \
    --n-syn 100 --output ./adapters/run1

# Use existing synthetic data
python -m phase4_finetuning.finetune run \
    --base mistralai/Mistral-7B-v0.3 \
    --data-path ./artifacts/data/synthetic_data.jsonl

# Iterative loop
python -m phase4_finetuning.finetune run --loop --max-iter 3
```

### Phase 5: Inference & MLOps
```bash
# Start vLLM server (OpenAI-compatible)
python -m phase5_mlops.serve serve ./merged_model --port 8000

# Benchmark throughput
python -m phase5_mlops.serve serve ./merged_model --bench

# Track experiment
python -m phase5_mlops.serve track my-run \
    --model ./merged --strategy ties \
    --rouge1 0.42 --bertscore 0.71 --judge 7.3

# Deploy to HF Hub
python -m phase5_mlops.serve deploy ./merged_model \
    --repo your-username/my-merged-7b

# View leaderboard
python -m phase5_mlops.serve leaderboard
```

---

## 📁 Project Structure

```
llm-pipeline/
├── pipeline.py              # Master orchestrator
├── requirements.txt
├── configs/
│   └── settings.py          # All config: paths, scale, hyperparams
├── utils/
│   └── logger.py            # Centralized logging
├── phase1_discovery/
│   └── discover.py          # HF Hub crawler + ranking
├── phase2_merging/
│   └── merge.py             # Merging + architecture introspection
├── phase3_evaluation/
│   └── evaluate.py          # Multi-metric eval + gap detection
├── phase4_finetuning/
│   └── finetune.py          # QLoRA + synthetic data + loop
├── phase5_mlops/
│   └── serve.py             # vLLM + W&B + MLflow + HF deploy
└── artifacts/               # Auto-created at runtime
    ├── models/
    ├── merges/
    ├── adapters/
    ├── evaluations/
    └── data/
```

---

## ⚙️ Configuration

Edit `configs/settings.py` to customize:

```python
# Scale preset (currently: medium = 7B, single A100)
SCALE = "medium"

# Categories and keywords for discovery
HF_MODEL_CATEGORIES = {
    "code": ["starcoder", "codellama", "deepseek-coder"],
    "reasoning": ["mistral", "llama", "qwen"],
    ...
}

# Fine-tuning defaults
FT_BASE_MODEL = "mistralai/Mistral-7B-v0.3"
FT_EPOCHS = 3
FT_LR = 2e-4

# vLLM
VLLM_GPU_MEMORY_UTIL = 0.90
VLLM_MAX_MODEL_LEN = 4096
```

---

## 🧩 Supported Merge Strategies

| Strategy | Type | Best For |
|---|---|---|
| `slerp` | Union | Two-model smooth interpolation |
| `ties` | Union | Multi-model, removes conflicting deltas |
| `dare_ties` | Union | Aggressive sparsification before TIES |
| `task_arithmetic` | Union | Adding task-specific capabilities |
| `breadcrumbs` | Intersection | Conservative, safety-preserving merge |

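For intuition on `slerp`: it interpolates along the arc between two weight vectors rather than the straight chord, which preserves their norm. A toy numeric sketch – the real merge applies this per tensor, and `alpha` plays the role of the `--alpha` CLI flag (0 returns model A, 1 returns model B):

```python
import math

def slerp(a: list[float], b: list[float], alpha: float) -> list[float]:
    """Spherical linear interpolation between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    omega = math.acos(max(-1.0, min(1.0, dot / norm)))  # angle between vectors
    if omega < 1e-8:  # near-parallel vectors: fall back to plain lerp
        return [(1 - alpha) * x + alpha * y for x, y in zip(a, b)]
    s = math.sin(omega)
    wa = math.sin((1 - alpha) * omega) / s
    wb = math.sin(alpha * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]

mid = slerp([1.0, 0.0], [0.0, 1.0], alpha=0.5)   # stays on the unit circle
```

Note that plain averaging of these two unit vectors would give `[0.5, 0.5]` (norm ≈ 0.71), while SLERP's midpoint keeps norm 1.
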
---

## 📊 Evaluation Metrics

| Metric | Tool | Threshold |
|---|---|---|
| ROUGE-1/2/L | `rouge-score` | ≥ 0.30 |
| BERTScore F1 | `bert-score` | ≥ 0.50 |
| Faithfulness | `cross-encoder/nli-deberta-v3-small` | ≥ 0.50 |
| Hallucination | Heuristic + NLI | < 10% |
| Judge Score | LLM-as-Judge | ≥ 5.0/10 |

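These thresholds compose naturally into a pass/fail quality gate. A sketch with illustrative key names and an assumed all-must-pass rule (hallucination is an upper bound, everything else a lower bound):

```python
# Lower bounds from the table above; key names are illustrative.
THRESHOLDS = {
    "rouge1": 0.30,
    "bertscore_f1": 0.50,
    "faithfulness": 0.50,
    "judge": 5.0,
}
MAX_HALLUCINATION = 0.10  # upper bound: < 10%

def passes_gate(results: dict[str, float]) -> bool:
    """A run passes only if every metric clears its bound."""
    lower_ok = all(results[k] >= v for k, v in THRESHOLDS.items())
    return lower_ok and results["hallucination"] < MAX_HALLUCINATION

run = {"rouge1": 0.42, "bertscore_f1": 0.71, "faithfulness": 0.80,
       "judge": 7.3, "hallucination": 0.04}
passes_gate(run)   # True for this run
```
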
---

## 🔁 Iterative Improvement Loop

```
┌─ Evaluate model ───────────────────────────────────┐
│  ROUGE / BERTScore / Judge / Faithfulness          │
└───────────────────┬────────────────────────────────┘
                    │ gaps detected?
                    ▼
┌─ Detect knowledge gaps ────────────────────────────┐
│  factual_recall / numerical / code / reasoning     │
└───────────────────┬────────────────────────────────┘
                    │
                    ▼
┌─ Generate synthetic data ──────────────────────────┐
│  LLM generates targeted QA pairs per gap           │
└───────────────────┬────────────────────────────────┘
                    │
                    ▼
┌─ QLoRA fine-tune ──────────────────────────────────┐
│  Response-only loss, 4-bit NF4, paged_adamw        │
└───────────────────┬────────────────────────────────┘
                    │
                    └──────► repeat until target ROUGE or max_iter
```

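The loop above, expressed as control flow. The phase functions here are toy stubs so the sketch runs end-to-end; the real loop calls the Phase 3/4 entry points, and `max_iter` mirrors the `--max-iter` CLI flag:

```python
TARGET_ROUGE = 0.30  # assumed stopping target for this sketch

# Toy stubs: each "fine-tune" nudges the score upward to simulate improvement.
def evaluate(model):        return model["rouge"], model["misses"]
def detect_gaps(samples):   return sorted(set(samples))
def generate_data(gaps):    return [{"gap": g} for g in gaps]
def finetune(model, data):  return {"rouge": model["rouge"] + 0.1,
                                    "misses": model["misses"]}

def run_improvement_loop(model, max_iter: int = 3):
    """eval → gap detect → generate → train, until target ROUGE or max_iter."""
    for _ in range(max_iter):
        score, samples = evaluate(model)            # Phase 3: multi-metric eval
        if score >= TARGET_ROUGE:
            return model                            # quality target met: stop early
        data = generate_data(detect_gaps(samples))  # Phase 3 → Phase 4 handoff
        model = finetune(model, data)               # Phase 4: QLoRA round
    return model

final = run_improvement_loop({"rouge": 0.12, "misses": ["numerical"]})
# 0.12 → 0.22 → 0.32: stops once the target is cleared
```
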
---

## 🛠️ Hardware Requirements

| Scale | GPU | RAM | Notes |
|---|---|---|---|
| Small (1–3B) | Any CUDA GPU | 16GB | CPU possible but slow |
| **Medium (7B)** | **A100 / H100 40GB** | **32GB** | **Recommended** |
| Large (13B+) | 2× A100 80GB | 64GB | Set `tensor_parallel=2` |

---

## 📦 Key Dependencies

- [`transformers`](https://github.com/huggingface/transformers) – model loading
- [`peft`](https://github.com/huggingface/peft) – LoRA/QLoRA adapters
- [`trl`](https://github.com/huggingface/trl) – SFTTrainer
- [`mergekit`](https://github.com/arcee-ai/mergekit) – TIES, DARE, SLERP
- [`vllm`](https://github.com/vllm-project/vllm) – high-throughput inference
- [`bert-score`](https://github.com/Tiiiger/bert_score) – semantic evaluation
- [`wandb`](https://wandb.ai) + [`mlflow`](https://mlflow.org) – experiment tracking

---

## 📄 License

[Apache 2.0](LICENSE)

---

## 🙏 Acknowledgements

- [mergekit](https://github.com/arcee-ai/mergekit) by Arcee AI
- [TIES-Merging](https://arxiv.org/abs/2306.01708) – Yadav et al., 2023
- [DARE](https://arxiv.org/abs/2311.03099) – Yu et al., 2023
- [Task Arithmetic](https://arxiv.org/abs/2212.04089) – Ilharco et al., 2023
- [QLoRA](https://arxiv.org/abs/2305.14314) – Dettmers et al., 2023