dbhavery committed
Commit 39fed52 · verified · 1 Parent(s): 6b2478b

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +396 -55
README.md CHANGED
@@ -1,55 +1,396 @@
1
- ---
2
- library_name: peft
3
- tags:
4
- - fine-tuning
5
- - qlora
6
- - lora
7
- - gguf
8
- - ollama
9
- - consumer-gpu
10
- license: mit
11
- ---
12
-
13
- # FineForge — QLoRA Fine-Tuning Pipeline
14
-
15
- End-to-end LoRA/QLoRA fine-tuning pipeline designed for consumer GPUs (RTX 3090, 4090). Curate datasets, train adapters, evaluate, and export to GGUF for local inference with Ollama.
16
-
17
- ## Pipeline
18
-
19
- ```
20
- Dataset Curation -> QLoRA Training -> Evaluation -> GGUF Export -> Ollama Deploy
21
- | | | | |
22
- Filter/clean 4-bit quant Perplexity llama.cpp ollama create
23
- Format/split LoRA adapters Task metrics conversion model:tag
24
- JSONL output Checkpoints Comparisons Quantization Local serve
25
- ```
26
-
27
- ## Features
28
-
29
- - **Dataset curation** — Filter, clean, format, train/val/test split
30
- - **QLoRA training** — 4-bit quantization, LoRA rank/alpha configuration
31
- - **Multi-GPU** — DataParallel for multi-GPU setups
32
- **Evaluation** — Perplexity, task-specific metrics, baseline comparison
33
- - **GGUF export** — Convert to GGUF format for llama.cpp / Ollama
34
- **Ollama integration** — Auto-create Modelfile, register with Ollama
35
-
36
- ## Hardware Requirements
37
-
38
- | GPU | VRAM | Max Model Size |
39
- |-----|------|---------------|
40
- | RTX 3090 | 24GB | 13B (QLoRA) |
41
- | RTX 4090 | 24GB | 13B (QLoRA) |
42
- | RTX 3060 | 12GB | 7B (QLoRA) |
43
- | RTX 4060 | 8GB | 3B (QLoRA) |
44
-
45
- ## Usage
46
-
47
- ```bash
48
- pip install fineforge
49
- fineforge curate --input data.jsonl --output curated/ --strategy quality
50
- fineforge train --config train.yaml --output checkpoints/
51
- fineforge eval --checkpoint checkpoints/best --benchmark mmlu
52
- fineforge export --checkpoint checkpoints/best --format gguf --quant q4_k_m
53
- ```
54
-
55
- **48 tests** | [GitHub](https://github.com/dbhavery/fineforge) | [Author](https://github.com/dbhavery)
1
+ ---
2
+ library_name: peft
3
+ license: mit
4
+ tags:
5
+ - fine-tuning
6
+ - qlora
7
+ - lora
8
+ - gguf
9
+ - ollama
10
+ - consumer-gpu
11
+ - peft
12
+ - quantization
13
+ language:
14
+ - en
15
+ pipeline_tag: text-generation
16
+ ---
17
+
18
+ # FineForge -- QLoRA Fine-Tuning Pipeline for Consumer GPUs
19
+
20
+ An end-to-end CLI pipeline that takes raw chat data and produces a fine-tuned language model running locally in Ollama. FineForge handles dataset curation (validate, score, deduplicate, split), QLoRA training with 4-bit quantization, before/after evaluation, GGUF export, and Ollama registration -- all designed to run on a single consumer GPU.
21
+
22
+ The core problem FineForge solves: fully fine-tuning a 7B parameter model normally needs far more VRAM than any consumer card offers (the FP32 weights alone are 28 GB, before gradients and optimizer states). QLoRA reduces this to under 8 GB by quantizing the base model to 4-bit NormalFloat and training only small rank-decomposition matrices injected into the attention layers. FineForge wraps this technique into a repeatable, config-driven pipeline.
23
+
24
+ **Source**: [github.com/dbhavery/fineforge](https://github.com/dbhavery/fineforge)
25
+
26
+ ---
27
+
28
+ ## What is QLoRA
29
+
30
+ QLoRA (Quantized Low-Rank Adaptation) combines two techniques:
31
+
32
+ **4-bit quantization (NF4)**: The pretrained base model weights are loaded in 4-bit NormalFloat format, a data type specifically designed for normally-distributed neural network weights. Double quantization further compresses the quantization constants themselves. This reduces the memory footprint of a 7B model from ~14 GB (FP16) to ~3.5 GB, freeing VRAM for training.
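The footprint figures above are straightforward arithmetic; a minimal sketch (weight memory only, ignoring the small quantization constants and activation memory):

```python
def model_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory for a model with n_params parameters."""
    return n_params * bits_per_weight / 8 / 1e9

params_7b = 7e9
fp16_gb = model_memory_gb(params_7b, 16)  # FP16: 2 bytes per weight
nf4_gb = model_memory_gb(params_7b, 4)    # NF4: half a byte per weight
print(f"FP16: {fp16_gb:.1f} GB, NF4: {nf4_gb:.1f} GB")
```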
33
+
34
+ **Low-Rank Adaptation (LoRA)**: Instead of updating all model weights during training, LoRA freezes the quantized base model and injects small trainable rank-decomposition matrices into specific layers. For a weight matrix W of shape (d, k), LoRA adds a bypass: `W' = W + BA`, where `B` is (d, r) and `A` is (r, k), with rank `r` much smaller than both `d` and `k`. With r=16 on a 7B model, this means training ~10 million parameters instead of 7 billion -- a 700x reduction.
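As a sanity check on the ~10 million figure, here is a rough count assuming Qwen2.5-7B-like attention shapes (hidden size 3584, GQA key/value dimension 512, 28 layers; these shapes are assumptions for illustration, not read from FineForge):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters in one bypass: A is (r, d_in), B is (d_out, r)."""
    return r * d_in + d_out * r

hidden, kv, layers, r = 3584, 512, 28, 16
per_layer = (
    lora_params(hidden, hidden, r)    # q_proj
    + lora_params(hidden, kv, r)      # k_proj (GQA: smaller output dim)
    + lora_params(hidden, kv, r)      # v_proj
    + lora_params(hidden, hidden, r)  # o_proj
)
total = per_layer * layers
print(f"{total / 1e6:.1f}M trainable vs 7B frozen")
```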
35
+
36
+ The combination means you can fine-tune a 7B model on an 8 GB GPU that would otherwise only be able to run inference. The trained adapter (typically 20-50 MB) can be merged back into the base model for deployment or served as a standalone LoRA layer.
37
+
38
+ **Paper**: [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314) (Dettmers et al., 2023)
39
+
40
+ ---
41
+
42
+ ## Pipeline Architecture
43
+
44
+ ```
45
+ +-------------------+ +-------------------+ +-------------------+
46
+ | | | | | |
47
+ | Raw Chat Data | | Training Config | | Test Prompts |
48
+ | (JSONL) | | (YAML) | | (YAML) |
49
+ | | | | | |
50
+ +--------+----------+ +--------+----------+ +--------+----------+
51
+ | | |
52
+ v v |
53
+ +--------+----------+ +--------+----------+ |
54
+ | | | | |
55
+ | fineforge | | fineforge | |
56
+ | prepare | | train | |
57
+ | | | | |
58
+ | - Validate fmt | | - Load base | |
59
+ | - Score quality | | model (4-bit) | |
60
+ | - Deduplicate | | - Apply LoRA | |
61
+ | - Filter | | - Tokenize data | |
62
+ | - Train/eval | | - SFTTrainer | |
63
+ | split | | - Save adapter | |
64
+ | | | | |
65
+ +--------+----------+ +--------+----------+ |
66
+ | | |
67
+ v v v
68
+ +--------+----------+ +--------+----------+ +--------+----------+
69
+ | | | | | |
70
+ | train.jsonl | | LoRA Adapter +---->+ fineforge |
71
+ | eval.jsonl | | (adapter/) | | eval |
72
+ | | | | | |
73
+ +-------------------+ +--------+----------+ | - Load base |
74
+ | | - Load tuned |
75
+ v | - Generate both |
76
+ +--------+----------+ | - Score & compare|
77
+ | | | |
78
+ | fineforge | +--------+----------+
79
+ | export | |
80
+ | | v
81
+ | - Merge adapter | +--------+----------+
82
+ | - Convert GGUF | | |
83
+ | - Quantize | | Eval Results |
84
+ | - Ollama create | | (JSON) |
85
+ | | | |
86
+ +--------+----------+ +-------------------+
87
+ |
88
+ v
89
+ +--------+----------+
90
+ | |
91
+ | Ollama |
92
+ | ollama run |
93
+ | my-tuned-model |
94
+ | |
95
+ +-------------------+
96
+ ```
97
+
98
+ ---
99
+
100
+ ## Pipeline Stages
101
+
102
+ ### Stage 1: Prepare -- Dataset Curation
103
+
104
+ ```bash
105
+ fineforge prepare my_chats.jsonl --output-dir ./data --min-quality 0.4
106
+ ```
107
+
108
+ The prepare stage applies five operations in sequence:
109
+
110
+ 1. **Format validation**: Each sample must follow the OpenAI chat format -- a JSON object with a `messages` array containing objects with `role` (system/user/assistant) and `content` fields. Samples must have at least one user and one assistant message. Malformed samples are rejected with specific error messages.
111
+
112
+ 2. **Quality scoring**: Each valid sample receives a score from 0.0 to 1.0 based on five heuristics: assistant response length (0-0.3), multi-turn depth (0-0.2), presence of a system prompt (0-0.1), user message quality (0-0.2), and vocabulary diversity in assistant responses (0-0.2).
113
+
114
+ 3. **Deduplication**: SHA-256 content hashing over the role/content pairs. Exact duplicates are removed.
115
+
116
+ 4. **Filtering**: Samples below the minimum quality threshold and samples with very short assistant responses are discarded.
117
+
118
+ 5. **Train/eval split**: The remaining samples are shuffled with a fixed seed and split into training and evaluation sets (default 90/10).
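Steps 3 and 5 can be sketched with the standard library alone (function names here are illustrative, not FineForge's actual API):

```python
import hashlib
import json
import random

def content_hash(sample: dict) -> str:
    """SHA-256 over the ordered role/content pairs, as in the dedup step."""
    key = json.dumps(
        [(m["role"], m["content"]) for m in sample["messages"]],
        ensure_ascii=False,
    )
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def dedup_and_split(samples, eval_ratio=0.1, seed=42):
    """Drop exact duplicates, then shuffle and split with a fixed seed."""
    seen, unique = set(), []
    for s in samples:
        h = content_hash(s)
        if h not in seen:
            seen.add(h)
            unique.append(s)
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_eval = max(1, int(len(unique) * eval_ratio))
    return unique[n_eval:], unique[:n_eval]

samples = [
    {"messages": [{"role": "user", "content": "hi"},
                  {"role": "assistant", "content": "hello"}]},
] * 3 + [
    {"messages": [{"role": "user", "content": "bye"},
                  {"role": "assistant", "content": "goodbye"}]},
]
train_set, eval_set = dedup_and_split(samples)
print(len(train_set), len(eval_set))  # the three duplicates collapse to one
```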
119
+
120
+ ### Stage 2: Train -- QLoRA Fine-Tuning
121
+
122
+ ```bash
123
+ fineforge train config.yaml
124
+ ```
125
+
126
+ Training is fully configured via YAML. The trainer:
127
+
128
+ 1. Validates the configuration (LoRA rank, learning rate ranges, mutual exclusivity of fp16/bf16).
129
+ 2. Checks GPU availability and reports device name, VRAM, and CUDA version.
130
+ 3. Loads the base model with 4-bit NF4 quantization via `BitsAndBytesConfig` with double quantization enabled.
131
+ 4. Applies LoRA adapters to the specified target modules (default: q_proj, k_proj, v_proj, o_proj).
132
+ 5. Reports trainable parameter count (typically ~0.1-0.5% of total parameters).
133
+ 6. Loads and tokenizes the dataset using the model's chat template.
134
+ 7. Trains using `trl.SFTTrainer` with paged AdamW 8-bit optimizer and cosine learning rate schedule.
135
+ 8. Saves the adapter weights, tokenizer, and training metadata (loss, elapsed time, config snapshot).
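Steps 3 and 4 map onto the standard `transformers` and `peft` APIs. A configuration sketch (it mirrors the defaults described here, downloads the base model when run, and is not FineForge's exact code):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # 4-bit NormalFloat
    bnb_4bit_use_double_quant=True,   # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen2.5-7B", quantization_config=bnb, device_map="auto"
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # reports the ~0.1-0.5% figure
```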
136
+
137
+ ### Stage 3: Evaluate -- Before/After Comparison
138
+
139
+ ```bash
140
+ fineforge eval ./output/adapter --prompts test_prompts.yaml --base-model unsloth/Qwen2.5-7B
141
+ ```
142
+
143
+ Evaluation loads both the base model and the fine-tuned model, runs each test prompt through both, and compares the outputs. Responses are scored on length appropriateness, keyword coverage (if expected keywords are defined in the prompt YAML), and vocabulary diversity. Results are displayed as a side-by-side comparison table with per-prompt improvement scores.
144
+
145
+ The base model is unloaded from GPU memory before the fine-tuned model is loaded, so evaluation fits within the same VRAM budget as training.
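A stdlib sketch of such scoring (the weights, length band, and exact heuristics below are illustrative assumptions, not FineForge's actual formula):

```python
def score_response(text: str, expected_keywords=()) -> float:
    """Toy version of the three response heuristics; weights are illustrative."""
    words = text.lower().split()
    # Length appropriateness: reward responses in a sane band.
    length = 1.0 if 20 <= len(words) <= 400 else 0.5
    # Keyword coverage, if the prompt YAML defines expected keywords.
    if expected_keywords:
        hits = sum(1 for k in expected_keywords if k.lower() in text.lower())
        coverage = hits / len(expected_keywords)
    else:
        coverage = 1.0
    # Vocabulary diversity: unique-token ratio.
    diversity = len(set(words)) / max(1, len(words))
    return round(0.4 * length + 0.4 * coverage + 0.2 * diversity, 3)

base = "A mutex is a lock. A mutex is a lock. " * 5
tuned = ("A mutex is a synchronization primitive that serializes access "
         "to a shared resource; a thread acquires it before the critical "
         "section and releases it afterward, blocking others meanwhile.")
print(score_response(base, ["mutex", "thread"]),
      score_response(tuned, ["mutex", "thread"]))
```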
146
+
147
+ ### Stage 4: Export -- GGUF and Ollama Registration
148
+
149
+ ```bash
150
+ fineforge export ./output/adapter \
151
+ --base-model unsloth/Qwen2.5-7B \
152
+ --quantization q4_k_m \
153
+ --ollama-name my-tuned-model
154
+ ```
155
+
156
+ Export performs three steps:
157
+
158
+ 1. **Merge**: Load the base model in FP16, apply the LoRA adapter, call `merge_and_unload()` to fold the adapter weights permanently into the base model, and save the merged model.
159
+
160
+ 2. **GGUF conversion**: Use `llama.cpp`'s `convert-hf-to-gguf` script to convert the merged HuggingFace model to GGUF format (first to f16, then quantize to the target type).
161
+
162
+ 3. **Ollama registration**: Generate a Modelfile with the GGUF path, system prompt, and sampling parameters, then run `ollama create` to register the model locally.
163
+
164
+ Supported quantization types: `q4_k_m` (recommended balance of size/quality), `q5_k_m`, `q8_0`, `f16`.
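Step 3's Modelfile reduces to plain string templating (`make_modelfile` is a hypothetical helper; `FROM`, `SYSTEM`, and `PARAMETER` are standard Ollama Modelfile directives):

```python
def make_modelfile(gguf_path: str, system_prompt: str,
                   temperature: float = 0.7, num_ctx: int = 2048) -> str:
    """Render an Ollama Modelfile pointing at the exported GGUF."""
    return "\n".join([
        f"FROM {gguf_path}",
        f'SYSTEM """{system_prompt}"""',
        f"PARAMETER temperature {temperature}",
        f"PARAMETER num_ctx {num_ctx}",
    ])

mf = make_modelfile("./merged/model-q4_k_m.gguf",
                    "You are a concise technical writer.")
print(mf)
# Then: write it to a file and run `ollama create my-tuned-model -f Modelfile`
```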
165
+
166
+ ---
167
+
168
+ ## Hardware Requirements
169
+
170
+ | GPU VRAM | What You Can Fine-Tune | Notes |
171
+ |----------|------------------------|-------|
172
+ | 8 GB | 7B models (QLoRA 4-bit) | Tight -- reduce batch_size to 1-2, max_seq_length to 1024 |
173
+ | 12 GB | 7B models comfortably | batch_size=4, max_seq_length=2048 |
174
+ | 16 GB | 7B models with headroom, 13B tight | Enough for eval to load base + tuned sequentially |
175
+ | 24 GB | 7B-13B models comfortably | batch_size=8+, longer sequences, larger LoRA rank |
176
+
177
+ | Component | Minimum | Recommended |
178
+ |-----------|---------|-------------|
179
+ | GPU VRAM | 8 GB (NVIDIA, CUDA) | 16-24 GB |
180
+ | System RAM | 16 GB | 32+ GB |
181
+ | Disk | 20 GB (model weights + checkpoints) | 50+ GB |
182
+ | CUDA | 11.8+ | 12.0+ |
183
+
184
+ Tested on NVIDIA RTX 3090 (24 GB VRAM) with Qwen2.5-7B. AMD ROCm GPUs may work via PyTorch ROCm builds but are untested.
185
+
186
+ ---
187
+
188
+ ## Training Configuration Reference
189
+
190
+ ```yaml
191
+ # config.yaml -- all parameters with defaults
192
+ base_model: unsloth/Qwen2.5-7B # HuggingFace model ID or local path
193
+ dataset_path: ./data/train.jsonl # Path to training JSONL
194
+ output_dir: ./output # Output directory
195
+
196
+ # LoRA hyperparameters
197
+ lora_r: 16 # Rank of the low-rank matrices
198
+ lora_alpha: 32 # Scaling factor (effective lr = alpha/r * lr)
199
+ lora_dropout: 0.05 # Dropout on LoRA layers
200
+ lora_target_modules: # Which attention projections to adapt
201
+ - q_proj
202
+ - k_proj
203
+ - v_proj
204
+ - o_proj
205
+
206
+ # Training hyperparameters
207
+ learning_rate: 2e-4 # Peak LR (cosine schedule with warmup)
208
+ num_epochs: 3 # Training epochs
209
+ batch_size: 4 # Per-device batch size
210
+ gradient_accumulation_steps: 4 # Effective batch = batch_size * grad_accum
211
+ max_seq_length: 2048 # Truncation length
212
+ warmup_steps: 10 # Linear LR warmup
213
+ fp16: true # Mixed-precision (use bf16 on Ampere+)
214
+ bf16: false # BF16 -- mutually exclusive with fp16
215
+ logging_steps: 10 # Log loss every N steps
216
+ save_steps: 100 # Checkpoint every N steps
217
+ eval_steps: 0 # 0 = evaluate at end of epoch only
218
+ seed: 42 # Reproducibility seed
219
+
220
+ # Data handling
221
+ chat_template: chatml # Chat template format
222
+ trust_remote_code: false # Allow custom model code from HF Hub
223
+ ```
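The validation pass mentioned in the training stage (rank bounds, learning-rate range, fp16/bf16 exclusivity) reduces to a few checks; a sketch with illustrative bounds, not FineForge's exact rules:

```python
def validate_config(cfg: dict) -> list[str]:
    """Collect configuration errors before any model is loaded."""
    errors = []
    if cfg.get("fp16") and cfg.get("bf16"):
        errors.append("fp16 and bf16 are mutually exclusive")
    if not (1 <= cfg.get("lora_r", 16) <= 256):
        errors.append("lora_r must be in [1, 256]")
    if not (1e-6 <= cfg.get("learning_rate", 2e-4) <= 1e-2):
        errors.append("learning_rate outside sane range")
    return errors

assert validate_config({"fp16": True, "bf16": False}) == []
assert validate_config({"fp16": True, "bf16": True}) != []
```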
224
+
225
+ ### Key Hyperparameter Guidance
226
+
227
+ **LoRA rank (`lora_r`)**: Controls the expressiveness of the adaptation. r=8 is sufficient for style transfer and simple behavioral changes. r=16 (default) handles most instruction tuning. r=32-64 for complex domain adaptation. Higher rank = more trainable parameters = more VRAM = longer training.
228
+
229
+ **`lora_alpha`**: Scaling factor for the LoRA update. The effective learning rate for LoRA parameters is `alpha/r * lr`. The default alpha=32 with r=16 gives a 2x scaling. If training is unstable, reduce alpha.
230
+
231
+ **Effective batch size -- `batch_size` and `gradient_accumulation_steps`**: The effective batch size is `batch_size * gradient_accumulation_steps`. Larger effective batches smooth the gradient but require more memory. With 8 GB VRAM, use batch_size=1 and gradient_accumulation_steps=8-16.
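In numbers, using the 6,511-sample train set from the workflow example further down:

```python
def effective_batch(batch_size: int, grad_accum: int) -> int:
    """Samples contributing to each optimizer step."""
    return batch_size * grad_accum

# Two configs with the same effective batch of 16 but very different peak VRAM:
assert effective_batch(4, 4) == effective_batch(1, 16) == 16

def optimizer_steps_per_epoch(n_samples: int, batch_size: int, grad_accum: int) -> int:
    """One optimizer step per effective batch (remainder dropped for simplicity)."""
    return n_samples // effective_batch(batch_size, grad_accum)

steps = optimizer_steps_per_epoch(6511, 4, 4)
print(steps)
```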
232
+
233
+ ---
234
+
235
+ ## Dataset Format
236
+
237
+ FineForge uses the OpenAI chat completion format. Each line in the JSONL file is one conversation:
238
+
239
+ ```json
240
+ {"messages": [
241
+ {"role": "system", "content": "You are a concise technical writer."},
242
+ {"role": "user", "content": "Explain what a mutex is."},
243
+ {"role": "assistant", "content": "A mutex (mutual exclusion) is a synchronization primitive that prevents multiple threads from accessing a shared resource simultaneously. A thread must acquire the mutex before entering a critical section and release it afterward. If the mutex is already held, the requesting thread blocks until it becomes available."}
244
+ ]}
245
+ ```
246
+
247
+ Multi-turn conversations are supported:
248
+
249
+ ```json
250
+ {"messages": [
251
+ {"role": "system", "content": "You are a Python tutor."},
252
+ {"role": "user", "content": "What is a list comprehension?"},
253
+ {"role": "assistant", "content": "A list comprehension is a concise way to create lists. The syntax is [expression for item in iterable if condition]."},
254
+ {"role": "user", "content": "Give me an example with filtering."},
255
+ {"role": "assistant", "content": "Here is an example that keeps only even numbers: `evens = [x for x in range(20) if x % 2 == 0]`. This produces [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]."}
256
+ ]}
257
+ ```
258
+
259
+ Requirements:
260
+ - Each sample must have at least one `user` and one `assistant` message.
261
+ - `system` message is optional but improves training signal.
262
+ - `role` must be one of: `system`, `user`, `assistant`.
263
+ - `content` must be a non-empty string.
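These rules translate directly into a validator; a stdlib sketch (`validate_sample` is illustrative, not FineForge's actual function):

```python
def validate_sample(sample: dict) -> list[str]:
    """Check one JSONL record against the format rules above."""
    errors = []
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]
    roles = []
    for i, m in enumerate(messages):
        role, content = m.get("role"), m.get("content")
        if role not in ("system", "user", "assistant"):
            errors.append(f"message {i}: invalid role {role!r}")
        if not isinstance(content, str) or not content.strip():
            errors.append(f"message {i}: content must be a non-empty string")
        roles.append(role)
    if "user" not in roles or "assistant" not in roles:
        errors.append("need at least one user and one assistant message")
    return errors

good = {"messages": [{"role": "user", "content": "hi"},
                     {"role": "assistant", "content": "hello"}]}
bad = {"messages": [{"role": "user", "content": "hi"}]}
assert validate_sample(good) == []
assert validate_sample(bad) == ["need at least one user and one assistant message"]
```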
264
+
265
+ ---
266
+
267
+ ## Module Architecture
268
+
269
+ ```
270
+ fineforge/
271
+ __init__.py # Package metadata (__version__)
272
+ cli.py # Click CLI: prepare, train, eval, export commands
273
+ config.py # TrainConfig dataclass with validation + YAML I/O
274
+ dataset.py # JSONL loading, format validation, quality scoring,
275
+ # SHA-256 deduplication, filtering, train/eval splitting
276
+ trainer.py # QLoRA training: BitsAndBytesConfig, LoRA injection,
277
+ # SFTTrainer, adapter + metadata saving
278
+ evaluator.py # Base vs tuned comparison: prompt loading, generation,
279
+ # response scoring, side-by-side results
280
+ exporter.py # LoRA merge, HF-to-GGUF conversion via llama.cpp,
281
+ # Modelfile generation, Ollama registration
282
+ ```
283
+
284
+ ### Design Decisions
285
+
286
+ **Lazy imports**: All heavy ML dependencies (`torch`, `transformers`, `peft`, `trl`, `bitsandbytes`, `datasets`) are imported inside the functions that need them, not at module level. This means the CLI, dataset tools, and test suite all work without a GPU or GPU libraries installed. You can curate datasets on a CPU-only machine and train on a different machine with a GPU.
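The pattern looks like this (a sketch; `curate` and `train` are stand-ins for the real functions):

```python
import sys

def curate(samples):
    """Dataset path: only stdlib imports, resolved at call time."""
    import hashlib
    return len({hashlib.sha256(repr(s).encode()).hexdigest() for s in samples})

def train(config):
    """Training path: torch is imported only when training actually runs."""
    import torch  # noqa: F401 -- heavy dependency, deliberately deferred
    ...

# Calling curate() never touches the ML stack:
n = curate([{"a": 1}, {"a": 1}, {"b": 2}])
print(n, "unique samples; torch loaded:", "torch" in sys.modules)
```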
287
+
288
+ **Config-driven training**: All hyperparameters live in a YAML file, not in code. This makes runs reproducible (commit the config alongside the adapter), diffable (compare two training runs by diffing their configs), and shareable (send someone a config file, not instructions).
289
+
290
+ **Modular stages**: Each pipeline stage is independent. Use `prepare` to curate data without ever training. Use `export` to convert a PEFT adapter from any source, not just FineForge-trained ones. Use `eval` to benchmark any LoRA adapter against its base model.
291
+
292
+ ---
293
+
294
+ ## Installation
295
+
296
+ ```bash
297
+ # Core (dataset tools + CLI) -- no GPU required
298
+ pip install fineforge
299
+
300
+ # With training support (requires NVIDIA GPU + CUDA)
301
+ pip install fineforge[train]
302
+
303
+ # Everything including GGUF export and dev tools
304
+ pip install fineforge[all]
305
+
306
+ # From source
307
+ git clone https://github.com/dbhavery/fineforge.git
308
+ cd fineforge
309
+ pip install -e ".[dev]"
310
+ ```
311
+
312
+ ### Dependencies
313
+
314
+ **Core** (always installed):
315
+ - `click>=8.0` -- CLI framework
316
+ - `pyyaml>=6.0` -- Config file parsing
317
+ - `rich>=13.0` -- Terminal formatting and progress display
318
+
319
+ **Training** (install with `[train]`):
320
+ - `torch>=2.0` -- Tensor computation and CUDA backend
321
+ - `transformers>=4.40` -- Model loading and tokenization
322
+ - `peft>=0.12` -- LoRA adapter injection and management
323
+ - `trl>=0.9` -- SFTTrainer for supervised fine-tuning
324
+ - `bitsandbytes>=0.43` -- 4-bit NF4 quantization
325
+ - `datasets>=2.20` -- Dataset loading utilities
326
+ - `accelerate>=0.30` -- Device placement and mixed precision
327
+
328
+ **Export** (install with `[export]`):
329
+ - `llama-cpp-python>=0.2` -- Python bindings for GGUF operations
330
+
331
+ ---
332
+
333
+ ## Full Workflow Example
334
+
335
+ ```bash
336
+ # 1. Prepare: curate 10,000 chat samples down to high-quality training data
337
+ fineforge prepare raw_conversations.jsonl \
338
+ --output-dir ./data \
339
+ --min-quality 0.4 \
340
+ --eval-ratio 0.1 \
341
+ --seed 42
342
+
343
+ # Output:
344
+ # Dataset Statistics
345
+ # Raw samples: 10,000
346
+ # After filtering: 7,234
347
+ # Duplicates removed: 412
348
+ # Low quality removed: 2,354
349
+ # Avg turns/conversation: 4.2
350
+ # Train set: 6,511 samples -> ./data/train.jsonl
351
+ # Eval set: 723 samples -> ./data/eval.jsonl
352
+
353
+ # 2. Train: fine-tune Qwen2.5-7B with QLoRA
354
+ cat > config.yaml << 'EOF'
355
+ base_model: unsloth/Qwen2.5-7B
356
+ dataset_path: ./data/train.jsonl
357
+ output_dir: ./output
358
+ lora_r: 16
359
+ lora_alpha: 32
360
+ num_epochs: 3
361
+ learning_rate: 2e-4
362
+ batch_size: 4
363
+ max_seq_length: 2048
364
+ EOF
365
+
366
+ fineforge train config.yaml
367
+
368
+ # 3. Evaluate: compare base vs tuned
369
+ fineforge eval ./output/adapter \
370
+ --prompts test_prompts.yaml \
371
+ --base-model unsloth/Qwen2.5-7B \
372
+ --output eval_results.json
373
+
374
+ # 4. Export: GGUF + Ollama
375
+ fineforge export ./output/adapter \
376
+ --base-model unsloth/Qwen2.5-7B \
377
+ --quantization q4_k_m \
378
+ --ollama-name my-tuned-qwen
379
+
380
+ # 5. Use it
381
+ ollama run my-tuned-qwen
382
+ ```
383
+
384
+ ---
385
+
386
+ ## References
387
+
388
+ - Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). *QLoRA: Efficient Finetuning of Quantized LLMs*. [arXiv:2305.14314](https://arxiv.org/abs/2305.14314)
389
+ - Hu, E. J., et al. (2021). *LoRA: Low-Rank Adaptation of Large Language Models*. [arXiv:2106.09685](https://arxiv.org/abs/2106.09685)
390
+ - Dettmers, T., et al. (2022). *LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale*. [arXiv:2208.07339](https://arxiv.org/abs/2208.07339)
391
+
392
+ ---
393
+
394
+ ## License
395
+
396
+ MIT License. See [LICENSE](https://github.com/dbhavery/fineforge/blob/main/LICENSE) for details.