# 🤖 LLM Pipeline

An end-to-end automated pipeline for discovering, merging, evaluating, and fine-tuning open-source LLMs, with full MLOps integration.

## 🗺️ Architecture

```text
┌─────────┬─────────────┬───────────────────────────────┐
│ Phase 1 │ Discovery   │ Scan HF Hub → filter → rank   │
├─────────┼─────────────┼───────────────────────────────┤
│ Phase 2 │ Merging     │ SLERP · TIES · DARE · TA      │
├─────────┼─────────────┼───────────────────────────────┤
│ Phase 3 │ Evaluation  │ ROUGE · BERTScore · Judge     │
├─────────┼─────────────┼───────────────────────────────┤
│ Phase 4 │ Fine-Tuning │ LoRA/QLoRA · Synthetic Data   │
├─────────┼─────────────┼───────────────────────────────┤
│ Phase 5 │ MLOps       │ vLLM · W&B · MLflow · HF Hub  │
└─────────┴─────────────┴───────────────────────────────┘
       └───────── Iterative improvement loop ─────────┘
```
## ✨ Features

### Phase 1 – Model Discovery
- Automated HF Hub crawler with category-based keyword search
- Quality filtering: downloads, likes, parameter count, model card completeness
- Optional lightweight perplexity probe for fast quality estimation
- Composite scoring and ranked shortlist output
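The composite score can be pictured as a weighted blend of these signals. The weights and log-scaling below are illustrative only, not the pipeline's actual formula:

```python
import math

def composite_score(downloads, likes, params_b, card_complete):
    """Blend popularity, size fit, and documentation into one rank score.

    Illustrative weights: log-scaled popularity signals, a preference for
    models at or below 8B parameters, and a small model-card bonus.
    """
    popularity = math.log10(downloads + 1) / 8          # ~0-1 up to 1e8 downloads
    engagement = math.log10(likes + 1) / 4              # ~0-1 up to 1e4 likes
    size_fit = 1.0 if params_b <= 8 else 8 / params_b   # penalize oversized models
    doc_bonus = 0.1 if card_complete else 0.0
    return round(0.4 * popularity + 0.3 * engagement + 0.3 * size_fit + doc_bonus, 3)

# Rank a tiny candidate pool into a shortlist
ranked = sorted(
    [("model-a", composite_score(2_500_000, 800, 7, True)),
     ("model-b", composite_score(40_000, 90, 13, False))],
    key=lambda x: x[1], reverse=True,
)
print(ranked)
```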
### Phase 2 – Model Composition
- Union merging (capability aggregation):
  - SLERP – spherical linear interpolation
  - TIES – trim, elect sign, merge
  - DARE-TIES – dropout + TIES
  - Task Arithmetic – delta-weight addition
- Intersection merging (conservative):
  - Breadcrumbs – consensus-only parameter updates
- DOM-tree-style architecture introspection (layers, attention heads, MLP blocks)
- mergekit integration + pure-PyTorch fallback
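SLERP, the simplest of these strategies, interpolates along the arc between two weight vectors rather than the straight chord between them. A toy sketch on plain Python lists, not the mergekit kernel (which applies this per-tensor with dtype and broadcasting care):

```python
import math

def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between two weight vectors.

    Falls back to plain linear interpolation when the vectors are
    nearly parallel (the angle omega approaches zero).
    """
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    dot = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    dot = max(-1.0, min(1.0, dot))          # guard acos against rounding
    omega = math.acos(dot)
    if abs(math.sin(omega)) < eps:          # nearly parallel: plain lerp
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1 - t) * omega) / math.sin(omega)
    wb = math.sin(t * omega) / math.sin(omega)
    return [wa * x + wb * y for x, y in zip(a, b)]

merged = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
print(merged)   # midpoint on the unit circle: [0.7071..., 0.7071...]
```

Note that unlike plain averaging, the SLERP midpoint of two unit vectors is still a unit vector, which is the motivation for using it on normalized weight directions.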
### Phase 3 – Evaluation Framework
- ROUGE (rouge1, rouge2, rougeL)
- BERTScore (semantic similarity)
- Faithfulness + hallucination detection (NLI-based)
- LLM-as-Judge scoring (0–10)
- Multi-model side-by-side comparison
- Knowledge gap detector – feeds Phase 4
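The gap detector can be pictured as bucketing below-threshold eval samples by their category tag and ranking the buckets. The categories and threshold here are illustrative:

```python
def detect_gaps(results, threshold=0.30):
    """Group below-threshold eval samples into gap categories for Phase 4.

    `results` pairs each sample's category tag with a per-sample score
    (e.g. ROUGE-1); returns categories ranked by how many samples failed.
    """
    gaps = {}
    for category, score in results:
        if score < threshold:
            gaps.setdefault(category, []).append(score)
    return sorted(gaps, key=lambda c: len(gaps[c]), reverse=True)

results = [
    ("factual_recall", 0.12), ("factual_recall", 0.21),
    ("numerical", 0.18), ("code", 0.45), ("reasoning", 0.33),
]
print(detect_gaps(results))   # ['factual_recall', 'numerical']
```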
### Phase 4 – Efficient Fine-Tuning
- QLoRA (4-bit NF4 quantization + LoRA adapters)
- Response-only training (loss on assistant turns only)
- Synthetic data generation per detected gap category
- Delta adapter extraction (merge-ready weights)
- Iterative improvement loop: eval → gap detect → generate → train → repeat
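Response-only training works by masking prompt tokens out of the loss. A minimal sketch of the label masking, assuming the start index of the assistant turn has already been located (finding it via the chat template's assistant marker is tokenizer-specific and elided here):

```python
IGNORE_INDEX = -100  # labels with this value are skipped by cross-entropy loss

def mask_prompt_tokens(input_ids, response_start):
    """Build labels so loss is computed on the assistant turn only.

    Copies the input ids and overwrites every position before
    `response_start` with IGNORE_INDEX.
    """
    labels = list(input_ids)
    for i in range(response_start):
        labels[i] = IGNORE_INDEX
    return labels

# Prompt occupies positions 0-3; the assistant response starts at position 4
labels = mask_prompt_tokens([101, 7592, 2088, 102, 2023, 2003, 1996, 3437], 4)
print(labels)   # [-100, -100, -100, -100, 2023, 2003, 1996, 3437]
```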
### Phase 5 – MLOps
- vLLM inference with PagedAttention (OpenAI-compatible API)
- Throughput benchmarking
- Dual tracking: Weights & Biases + MLflow
- Auto-generated model cards
- One-command HF Hub deployment
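Throughput benchmarking reduces to tokens emitted over wall-clock time. A minimal sketch with a stubbed `generate` callable standing in for the serving endpoint:

```python
import time

def benchmark_throughput(generate, prompts):
    """Measure tokens/sec for a generate() callable over a prompt batch.

    `generate` returns the number of tokens emitted for one prompt; here
    it is a stub, in practice it would call the serving endpoint.
    """
    start = time.perf_counter()
    total_tokens = sum(generate(p) for p in prompts)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed if elapsed > 0 else 0.0

fake_generate = lambda prompt: 128            # pretend each request emits 128 tokens
tps = benchmark_throughput(fake_generate, ["a", "b", "c", "d"])
print(f"{tps:.0f} tok/s")
```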
## 🚀 Quick Start

### 1. Install

```bash
git clone https://github.com/YOUR_USERNAME/llm-pipeline.git
cd llm-pipeline
pip install -r requirements.txt
```

### 2. Set environment variables

```bash
export HF_TOKEN="hf_..."
export WANDB_API_KEY="..."
```

### 3. Run the full pipeline

```bash
python pipeline.py run reasoning
python pipeline.py run code --loop --max-iter 3
python pipeline.py run medical --strategy breadcrumbs --top-k 3
```
## 📖 Usage – Individual Phases

### Phase 1: Discovery

```bash
# Rank the top 5 reasoning models
python -m phase1_discovery.discover run reasoning --top-k 5

# Add the lightweight perplexity probe for the code category
python -m phase1_discovery.discover run code --perplexity

# Crawl every configured category
python -m phase1_discovery.discover run --all
```
### Phase 2: Merging

```bash
# TIES merge of two fine-tunes onto a shared base
python -m phase2_merging.merge run ties \
    --model mistralai/Mistral-7B-v0.3 \
    --model teknium/OpenHermes-2.5-Mistral-7B \
    --base mistralai/Mistral-7B-v0.3

# SLERP with interpolation weight alpha=0.6
python -m phase2_merging.merge run slerp \
    --model model_a --model model_b --alpha 0.6

# Conservative breadcrumbs merge at 0.7 parameter density
python -m phase2_merging.merge run breadcrumbs \
    --model base --model ft_a --model ft_b --density 0.7

# Inspect a model's architecture before merging
python -m phase2_merging.merge run ties \
    --introspect mistralai/Mistral-7B-v0.3
```
### Phase 3: Evaluation

```bash
# Evaluate a merged model on 200 SQuAD samples
python -m phase3_evaluation.evaluate run ./merged_model --dataset squad --n-samples 200

# Side-by-side comparison of three models
python -m phase3_evaluation.evaluate run model_a \
    --compare model_b --compare model_c

# Skip the LLM-as-Judge pass
python -m phase3_evaluation.evaluate run ./merged --no-judge
```
### Phase 4: Fine-Tuning

```bash
# Target two gap categories with 100 synthetic samples
python -m phase4_finetuning.finetune run \
    --base mistralai/Mistral-7B-v0.3 \
    --gap factual_recall --gap numerical \
    --n-syn 100 --output ./adapters/run1

# Train on an existing synthetic dataset
python -m phase4_finetuning.finetune run \
    --base mistralai/Mistral-7B-v0.3 \
    --data-path ./artifacts/data/synthetic_data.jsonl

# Full eval → generate → train loop, up to 3 iterations
python -m phase4_finetuning.finetune run --loop --max-iter 3
```
### Phase 5: Inference & MLOps

```bash
# Serve with vLLM on port 8000 (OpenAI-compatible API)
python -m phase5_mlops.serve serve ./merged_model --port 8000

# Run the throughput benchmark
python -m phase5_mlops.serve serve ./merged_model --bench

# Log a run to W&B and MLflow
python -m phase5_mlops.serve track my-run \
    --model ./merged --strategy ties \
    --rouge1 0.42 --bertscore 0.71 --judge 7.3

# Push to the Hugging Face Hub
python -m phase5_mlops.serve deploy ./merged_model \
    --repo your-username/my-merged-7b

# Show the local leaderboard
python -m phase5_mlops.serve leaderboard
```
## 📁 Project Structure

```text
llm-pipeline/
├── pipeline.py              # Master orchestrator
├── requirements.txt
├── configs/
│   └── settings.py          # All config: paths, scale, hyperparams
├── utils/
│   └── logger.py            # Centralized logging
├── phase1_discovery/
│   └── discover.py          # HF Hub crawler + ranking
├── phase2_merging/
│   └── merge.py             # Merging + architecture introspection
├── phase3_evaluation/
│   └── evaluate.py          # Multi-metric eval + gap detection
├── phase4_finetuning/
│   └── finetune.py          # QLoRA + synthetic data + loop
├── phase5_mlops/
│   └── serve.py             # vLLM + W&B + MLflow + HF deploy
└── artifacts/               # Auto-created at runtime
    ├── models/
    ├── merges/
    ├── adapters/
    ├── evaluations/
    └── data/
```
## ⚙️ Configuration

Edit `configs/settings.py` to customize:

```python
SCALE = "medium"

HF_MODEL_CATEGORIES = {
    "code": ["starcoder", "codellama", "deepseek-coder"],
    "reasoning": ["mistral", "llama", "qwen"],
    ...
}

FT_BASE_MODEL = "mistralai/Mistral-7B-v0.3"
FT_EPOCHS = 3
FT_LR = 2e-4

VLLM_GPU_MEMORY_UTIL = 0.90
VLLM_MAX_MODEL_LEN = 4096
```
π§© Supported Merge Strategies
| Strategy |
Type |
Best For |
slerp |
Union |
Two-model smooth interpolation |
ties |
Union |
Multi-model, removes conflicting deltas |
dare_ties |
Union |
Aggressive sparsification before TIES |
task_arithmetic |
Union |
Adding task-specific capabilities |
breadcrumbs |
Intersection |
Conservative, safety-preserving merge |
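To make the `ties` row concrete, here is a toy trim/elect/merge pass over per-parameter delta lists. Real TIES operates on full weight tensors; `density` below mirrors the trim fraction and the whole function is a simplified sketch:

```python
def ties_elect_sign(deltas, density=0.5):
    """Toy TIES merge over per-parameter delta lists.

    Trim: keep only the largest-magnitude fraction (`density`) of each delta.
    Elect: pick the majority sign per parameter by summed value.
    Merge: average the surviving deltas that agree with the elected sign.
    """
    n = len(deltas[0])
    trimmed = []
    for d in deltas:
        k = max(1, int(len(d) * density))
        keep = set(sorted(range(len(d)), key=lambda i: abs(d[i]), reverse=True)[:k])
        trimmed.append([d[i] if i in keep else 0.0 for i in range(len(d))])
    merged = []
    for i in range(n):
        col = [d[i] for d in trimmed]
        elected = 1.0 if sum(col) >= 0 else -1.0
        agree = [v for v in col if v != 0.0 and (v > 0) == (elected > 0)]
        merged.append(sum(agree) / len(agree) if agree else 0.0)
    return merged

# Two models agree on param 0, both trimmed param 1, conflict on param 2
print(ties_elect_sign([[0.8, -0.1, 0.3], [0.6, 0.2, -0.9]], density=0.67))
# ≈ [0.7, 0.0, -0.9]
```

The conflicting update at index 2 illustrates the point of sign election: the smaller positive delta is discarded instead of partially cancelling the dominant negative one, which is what plain averaging would do.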
## 📊 Evaluation Metrics

| Metric | Tool | Threshold |
|---|---|---|
| ROUGE-1/2/L | `rouge-score` | ≥ 0.30 |
| BERTScore F1 | `bert-score` | ≥ 0.50 |
| Faithfulness | `cross-encoder/nli-deberta-v3-small` | ≥ 0.50 |
| Hallucination | Heuristic + NLI | < 10% |
| Judge Score | LLM-as-Judge | ≥ 5.0/10 |
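For intuition, ROUGE-1 F1 is just clipped unigram overlap between candidate and reference. The pipeline itself uses the `rouge-score` package (which also applies stemming); this hand-rolled version only counts whitespace tokens:

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """Unigram-overlap ROUGE-1 F1 on lowercased whitespace tokens."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())          # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 3))   # 0.833 — 5 of 6 unigrams match in both directions
```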
## 🔁 Iterative Improvement Loop

```text
┌─ Evaluate model ───────────────────────────────────┐
│  ROUGE / BERTScore / Judge / Faithfulness          │
└───────────────────┬────────────────────────────────┘
                    │ gaps detected?
                    ▼
┌─ Detect knowledge gaps ────────────────────────────┐
│  factual_recall / numerical / code / reasoning     │
└───────────────────┬────────────────────────────────┘
                    │
                    ▼
┌─ Generate synthetic data ──────────────────────────┐
│  LLM generates targeted QA pairs per gap           │
└───────────────────┬────────────────────────────────┘
                    │
                    ▼
┌─ QLoRA fine-tune ──────────────────────────────────┐
│  Response-only loss, 4-bit NF4, paged_adamw        │
└───────────────────┬────────────────────────────────┘
                    │
                    └──────► repeat until target ROUGE or max_iter
```
## 🛠️ Hardware Requirements

| Scale | GPU | RAM | Notes |
|---|---|---|---|
| Small (1–3B) | Any CUDA GPU | 16GB | CPU possible but slow |
| Medium (7B) | A100 / H100 40GB | 32GB | Recommended |
| Large (13B+) | 2× A100 80GB | 64GB | Set `tensor_parallel=2` |
## 📦 Key Dependencies
## 📄 License

Apache 2.0

## 🙏 Acknowledgements