FineForge -- QLoRA Fine-Tuning Pipeline for Consumer GPUs

An end-to-end CLI pipeline that takes raw chat data and produces a fine-tuned language model running locally in Ollama. FineForge handles dataset curation (validate, score, deduplicate, split), QLoRA training with 4-bit quantization, before/after evaluation, GGUF export, and Ollama registration -- all designed to run on a single consumer GPU.

The core problem FineForge solves: fine-tuning a 7B parameter model normally requires well over 28 GB of VRAM -- the FP32 weights alone occupy 28 GB, before gradients and optimizer states are counted. QLoRA reduces this to under 8 GB by quantizing the base model to 4-bit NormalFloat and training only small rank-decomposition matrices injected into the attention layers. FineForge wraps this technique into a repeatable, config-driven pipeline.

Source: github.com/dbhavery/fineforge


What is QLoRA

QLoRA (Quantized Low-Rank Adaptation) combines two techniques:

4-bit quantization (NF4): The pretrained base model weights are loaded in 4-bit NormalFloat format, a data type specifically designed for normally-distributed neural network weights. Double quantization further compresses the quantization constants themselves. This reduces the memory footprint of a 7B model from ~14 GB (FP16) to ~3.5 GB, freeing VRAM for training.

Low-Rank Adaptation (LoRA): Instead of updating all model weights during training, LoRA freezes the quantized base model and injects small trainable rank-decomposition matrices into specific layers. For a weight matrix W of shape (d, k), LoRA adds a bypass: W' = W + BA, where B is (d, r) and A is (r, k), with rank r much smaller than both d and k. With r=16 on a 7B model, this means training ~10 million parameters instead of 7 billion -- a 700x reduction.
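A quick back-of-the-envelope check of that parameter count. The dimensions below are assumptions based on the Qwen2.5-7B architecture (hidden size 3584, 28 layers, grouped-query attention with a KV projection output dimension of 512):

```python
# Back-of-the-envelope LoRA parameter count for W' = W + BA with rank r.
# Dimensions are assumed from the Qwen2.5-7B architecture, not read from
# the model config at runtime.

def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters added to one (d, k) weight: B is (d, r), A is (r, k)."""
    return d * r + r * k

r = 16
hidden, kv_dim, layers = 3584, 512, 28

per_layer = (
    lora_params(hidden, hidden, r)    # q_proj
    + lora_params(kv_dim, hidden, r)  # k_proj (GQA: fewer KV heads)
    + lora_params(kv_dim, hidden, r)  # v_proj
    + lora_params(hidden, hidden, r)  # o_proj
)
total = per_layer * layers
print(f"{total:,} trainable parameters")  # 10,092,544 -- roughly 0.14% of 7B
```

With grouped-query attention the k_proj and v_proj matrices are rectangular, which is why the total lands near 10 million rather than the larger figure a model with four square projections would give.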

The combination means you can fine-tune a 7B model on an 8 GB GPU that would otherwise only be able to run inference. The trained adapter (typically 20-50 MB) can be merged back into the base model for deployment or served as a standalone LoRA layer.

Paper: QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023)


Pipeline Architecture

+-------------------+     +-------------------+     +-------------------+
|                   |     |                   |     |                   |
|   Raw Chat Data   |     |   Training Config |     |   Test Prompts    |
|   (JSONL)         |     |   (YAML)          |     |   (YAML)          |
|                   |     |                   |     |                   |
+--------+----------+     +--------+----------+     +--------+----------+
         |                         |                          |
         v                         v                          |
+--------+----------+     +--------+----------+              |
|                   |     |                   |              |
|  fineforge        |     |  fineforge        |              |
|  prepare          |     |  train            |              |
|                   |     |                   |              |
|  - Validate fmt   |     |  - Load base      |              |
|  - Score quality  |     |    model (4-bit)  |              |
|  - Deduplicate    |     |  - Apply LoRA     |              |
|  - Filter         |     |  - Tokenize data  |              |
|  - Train/eval     |     |  - SFTTrainer     |              |
|    split          |     |  - Save adapter   |              |
|                   |     |                   |              |
+--------+----------+     +--------+----------+              |
         |                         |                          |
         v                         v                          v
+--------+----------+     +--------+----------+     +--------+----------+
|                   |     |                   |     |                   |
|  train.jsonl      |     |  LoRA Adapter     +---->+  fineforge        |
|  eval.jsonl       |     |  (adapter/)       |     |  eval             |
|                   |     |                   |     |                   |
+-------------------+     +--------+----------+     |  - Load base      |
                                   |                |  - Load tuned     |
                                   v                |  - Generate both  |
                          +--------+----------+     |  - Score & compare|
                          |                   |     |                   |
                          |  fineforge        |     +--------+----------+
                          |  export           |              |
                          |                   |              v
                          |  - Merge adapter  |     +--------+----------+
                          |  - Convert GGUF   |     |                   |
                          |  - Quantize       |     |  Eval Results     |
                          |  - Ollama create  |     |  (JSON)           |
                          |                   |     |                   |
                          +--------+----------+     +-------------------+
                                   |
                                   v
                          +--------+----------+
                          |                   |
                          |  Ollama           |
                          |  ollama run       |
                          |  my-tuned-model   |
                          |                   |
                          +-------------------+

Pipeline Stages

Stage 1: Prepare -- Dataset Curation

fineforge prepare my_chats.jsonl --output-dir ./data --min-quality 0.4

The prepare stage applies five operations in sequence:

  1. Format validation: Each sample must follow the OpenAI chat format -- a JSON object with a messages array containing objects with role (system/user/assistant) and content fields. Samples must have at least one user and one assistant message. Malformed samples are rejected with specific error messages.

  2. Quality scoring: Each valid sample receives a score from 0.0 to 1.0 based on five heuristics: assistant response length (0-0.3), multi-turn depth (0-0.2), presence of a system prompt (0-0.1), user message quality (0-0.2), and vocabulary diversity in assistant responses (0-0.2).

  3. Deduplication: SHA-256 content hashing over the role/content pairs. Exact duplicates are removed.

  4. Filtering: Samples below the minimum quality threshold and samples with very short assistant responses are discarded.

  5. Train/eval split: The remaining samples are shuffled with a fixed seed and split into training and evaluation sets (default 90/10).
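Steps 3 and 5 can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the actual FineForge implementation, and the function names are hypothetical:

```python
import hashlib
import random

def content_hash(sample: dict) -> str:
    """SHA-256 over the ordered (role, content) pairs, as in step 3."""
    h = hashlib.sha256()
    for msg in sample["messages"]:
        h.update(msg["role"].encode())
        h.update(b"\x00")
        h.update(msg["content"].encode())
        h.update(b"\x00")
    return h.hexdigest()

def dedup_and_split(samples, eval_ratio=0.1, seed=42):
    """Drop exact duplicates, then shuffle with a fixed seed and split."""
    seen, unique = set(), []
    for s in samples:
        key = content_hash(s)
        if key not in seen:
            seen.add(key)
            unique.append(s)
    random.Random(seed).shuffle(unique)
    n_eval = int(len(unique) * eval_ratio)
    return unique[n_eval:], unique[:n_eval]  # (train, eval)

samples = [
    {"messages": [{"role": "user", "content": "hi"},
                  {"role": "assistant", "content": "hello"}]},
    {"messages": [{"role": "user", "content": "hi"},
                  {"role": "assistant", "content": "hello"}]},  # exact duplicate
    {"messages": [{"role": "user", "content": "bye"},
                  {"role": "assistant", "content": "goodbye"}]},
]
train, eval_set = dedup_and_split(samples)
print(len(train), len(eval_set))  # 2 0
```

Hashing the role/content pairs rather than the raw JSON line means that cosmetic differences such as key ordering or whitespace do not defeat deduplication.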

Stage 2: Train -- QLoRA Fine-Tuning

fineforge train config.yaml

Training is fully configured via YAML. The trainer:

  1. Validates the configuration (LoRA rank, learning rate ranges, mutual exclusivity of fp16/bf16).
  2. Checks GPU availability and reports device name, VRAM, and CUDA version.
  3. Loads the base model with 4-bit NF4 quantization via BitsAndBytesConfig with double quantization enabled.
  4. Applies LoRA adapters to the specified target modules (default: q_proj, k_proj, v_proj, o_proj).
  5. Reports trainable parameter count (typically ~0.1-0.5% of total parameters).
  6. Loads and tokenizes the dataset using the model's chat template.
  7. Trains using trl.SFTTrainer with paged AdamW 8-bit optimizer and cosine learning rate schedule.
  8. Saves the adapter weights, tokenizer, and training metadata (loss, elapsed time, config snapshot).
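Steps 3-5 correspond roughly to the following transformers/peft calls. This is a sketch using the defaults from the configuration reference, not the FineForge source; running it requires a CUDA GPU and downloads the base model:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Step 3: load the base model in 4-bit NF4 with double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen2.5-7B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Step 4: inject LoRA adapters into the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Step 5: report the trainable fraction (typically ~0.1-0.5% of total)
model.print_trainable_parameters()
```

The frozen base weights stay in 4-bit NF4; only the injected A and B matrices receive gradients.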

Stage 3: Evaluate -- Before/After Comparison

fineforge eval ./output/adapter \
  --prompts test_prompts.yaml \
  --base-model unsloth/Qwen2.5-7B

Evaluation loads both the base model and the fine-tuned model, runs each test prompt through both, and compares the outputs. Responses are scored on length appropriateness, keyword coverage (if expected keywords are defined in the prompt YAML), and vocabulary diversity. Results are displayed as a side-by-side comparison table with per-prompt improvement scores.

The base model is unloaded from GPU memory before the fine-tuned model is loaded, so evaluation fits within the same VRAM budget as training.
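The response heuristics can be illustrated with a toy scorer. The weights and thresholds here are illustrative assumptions, not FineForge's actual values:

```python
def score_response(text: str, expected_keywords=()) -> float:
    """Toy version of the heuristics described above: length
    appropriateness, keyword coverage, and vocabulary diversity.
    Weights and the 20-300 word band are illustrative assumptions."""
    words = text.split()
    # Length: reward responses in a reasonable band
    length = min(len(words) / 20, 1.0) if len(words) <= 300 else 0.5
    # Keyword coverage: fraction of expected keywords that appear
    if expected_keywords:
        hits = sum(kw.lower() in text.lower() for kw in expected_keywords)
        coverage = hits / len(expected_keywords)
    else:
        coverage = 1.0
    # Vocabulary diversity: unique-word ratio
    diversity = len(set(w.lower() for w in words)) / max(len(words), 1)
    return round((length + coverage + diversity) / 3, 3)

s = score_response("A mutex prevents concurrent access to shared state.",
                   expected_keywords=["mutex", "concurrent"])
print(s)  # 0.8
```

Running the same scorer over base and tuned outputs for each prompt yields the per-prompt improvement deltas shown in the comparison table.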

Stage 4: Export -- GGUF and Ollama Registration

fineforge export ./output/adapter \
  --base-model unsloth/Qwen2.5-7B \
  --quantization q4_k_m \
  --ollama-name my-tuned-model

Export performs three steps:

  1. Merge: Load the base model in FP16, apply the LoRA adapter, call merge_and_unload() to fold the adapter weights permanently into the base model, and save the merged model.

  2. GGUF conversion: Use llama.cpp's convert_hf_to_gguf.py script to convert the merged HuggingFace model to GGUF format (first to f16, then quantize to the target type).

  3. Ollama registration: Generate a Modelfile with the GGUF path, system prompt, and sampling parameters, then run ollama create to register the model locally.

Supported quantization types: q4_k_m (recommended balance of size/quality), q5_k_m, q8_0, f16.
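Under the hood, steps 2 and 3 amount to roughly the following commands, assuming a local llama.cpp checkout. Script and binary names have changed across llama.cpp versions, and the paths here are hypothetical, so treat this as a sketch:

```shell
# 1. Convert the merged HF model (assumed at ./output/merged) to GGUF at f16
python llama.cpp/convert_hf_to_gguf.py ./output/merged \
  --outfile model-f16.gguf --outtype f16

# 2. Quantize to the target type
llama.cpp/build/bin/llama-quantize model-f16.gguf model-q4_k_m.gguf q4_k_m

# 3. Generate a Modelfile and register with Ollama
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
SYSTEM """You are a concise technical writer."""
PARAMETER temperature 0.7
EOF

ollama create my-tuned-model -f Modelfile
```

FineForge drives these steps programmatically; the sketch only shows what the export stage is doing on your behalf.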


Hardware Requirements

GPU VRAM   What You Can Fine-Tune               Notes
8 GB       7B models (QLoRA 4-bit)              Tight -- reduce batch_size to 1-2, max_seq_length to 1024
12 GB      7B models comfortably                batch_size=4, max_seq_length=2048
16 GB      7B models with headroom, 13B tight   Enough for eval to load base + tuned sequentially
24 GB      7B-13B models comfortably            batch_size=8+, longer sequences, larger LoRA rank

Component    Minimum                               Recommended
GPU VRAM     8 GB (NVIDIA, CUDA)                   16-24 GB
System RAM   16 GB                                 32+ GB
Disk         20 GB (model weights + checkpoints)   50+ GB
CUDA         11.8+                                 12.0+

Tested on NVIDIA RTX 3090 (24 GB VRAM) with Qwen2.5-7B. AMD ROCm GPUs may work via PyTorch ROCm builds but are untested.


Training Configuration Reference

# config.yaml -- all parameters with defaults
base_model: unsloth/Qwen2.5-7B     # HuggingFace model ID or local path
dataset_path: ./data/train.jsonl     # Path to training JSONL
output_dir: ./output                 # Output directory

# LoRA hyperparameters
lora_r: 16                           # Rank of the low-rank matrices
lora_alpha: 32                       # Scaling factor (effective lr = alpha/r * lr)
lora_dropout: 0.05                   # Dropout on LoRA layers
lora_target_modules:                 # Which attention projections to adapt
  - q_proj
  - k_proj
  - v_proj
  - o_proj

# Training hyperparameters
learning_rate: 2e-4                  # Peak LR (cosine schedule with warmup)
num_epochs: 3                        # Training epochs
batch_size: 4                        # Per-device batch size
gradient_accumulation_steps: 4       # Effective batch = batch_size * grad_accum
max_seq_length: 2048                 # Truncation length
warmup_steps: 10                     # Linear LR warmup
fp16: true                           # Mixed-precision (use bf16 on Ampere+)
bf16: false                          # BF16 -- mutually exclusive with fp16
logging_steps: 10                    # Log loss every N steps
save_steps: 100                      # Checkpoint every N steps
eval_steps: 0                        # 0 = evaluate at end of epoch only
seed: 42                             # Reproducibility seed

# Data handling
chat_template: chatml                # Chat template format
trust_remote_code: false             # Allow custom model code from HF Hub

Key Hyperparameter Guidance

LoRA rank (lora_r): Controls the expressiveness of the adaptation. r=8 is sufficient for style transfer and simple behavioral changes. r=16 (default) handles most instruction tuning. r=32-64 for complex domain adaptation. Higher rank = more trainable parameters = more VRAM = longer training.

lora_alpha: Scaling factor for the LoRA update. The effective learning rate for LoRA parameters is alpha/r * lr. The default alpha=32 with r=16 gives a 2x scaling. If training is unstable, reduce alpha.

batch_size and gradient_accumulation_steps: The effective batch size is batch_size * gradient_accumulation_steps. Larger effective batches smooth the gradient estimate but require more memory. With 8 GB VRAM, use batch_size=1 and gradient_accumulation_steps=8-16.


Dataset Format

FineForge uses the OpenAI chat completion format. Each line in the JSONL file is one conversation:

{"messages": [
  {"role": "system", "content": "You are a concise technical writer."},
  {"role": "user", "content": "Explain what a mutex is."},
  {"role": "assistant", "content": "A mutex (mutual exclusion) is a synchronization primitive that prevents multiple threads from accessing a shared resource simultaneously. A thread must acquire the mutex before entering a critical section and release it afterward. If the mutex is already held, the requesting thread blocks until it becomes available."}
]}

Multi-turn conversations are supported:

{"messages": [
  {"role": "system", "content": "You are a Python tutor."},
  {"role": "user", "content": "What is a list comprehension?"},
  {"role": "assistant", "content": "A list comprehension is a concise way to create lists. The syntax is [expression for item in iterable if condition]."},
  {"role": "user", "content": "Give me an example with filtering."},
  {"role": "assistant", "content": "Here is an example that keeps only even numbers: `evens = [x for x in range(20) if x % 2 == 0]`. This produces [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]."}
]}

Requirements:

  • Each sample must have at least one user and one assistant message.
  • A system message is optional but improves the training signal.
  • role must be one of: system, user, assistant.
  • content must be a non-empty string.
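A minimal validator for these rules might look like this (a simplified sketch; FineForge's actual error messages may differ):

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_sample(line: str) -> list[str]:
    """Return a list of error strings; an empty list means the sample is valid."""
    try:
        sample = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    messages = sample.get("messages")
    if not isinstance(messages, list):
        return ["missing or non-list 'messages' field"]
    errors, roles = [], []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            errors.append(f"message {i}: not a JSON object")
            continue
        role, content = msg.get("role"), msg.get("content")
        if role not in VALID_ROLES:
            errors.append(f"message {i}: invalid role {role!r}")
        if not isinstance(content, str) or not content.strip():
            errors.append(f"message {i}: content must be a non-empty string")
        roles.append(role)
    if "user" not in roles or "assistant" not in roles:
        errors.append("sample needs at least one user and one assistant message")
    return errors

ok = ('{"messages": [{"role": "user", "content": "hi"},'
      ' {"role": "assistant", "content": "hello"}]}')
bad = '{"messages": [{"role": "user", "content": "hi"}]}'
print(validate_sample(ok))   # []
print(validate_sample(bad))  # ['sample needs at least one user and one assistant message']
```

Collecting all errors per sample, rather than stopping at the first, is what lets the prepare stage report specific rejection reasons.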

Module Architecture

fineforge/
  __init__.py        # Package metadata (__version__)
  cli.py             # Click CLI: prepare, train, eval, export commands
  config.py          # TrainConfig dataclass with validation + YAML I/O
  dataset.py         # JSONL loading, format validation, quality scoring,
                     #   SHA-256 deduplication, filtering, train/eval splitting
  trainer.py         # QLoRA training: BitsAndBytesConfig, LoRA injection,
                     #   SFTTrainer, adapter + metadata saving
  evaluator.py       # Base vs tuned comparison: prompt loading, generation,
                     #   response scoring, side-by-side results
  exporter.py        # LoRA merge, HF-to-GGUF conversion via llama.cpp,
                     #   Modelfile generation, Ollama registration

Design Decisions

Lazy imports: All heavy ML dependencies (torch, transformers, peft, trl, bitsandbytes, datasets) are imported inside the functions that need them, not at module level. This means the CLI, dataset tools, and test suite all work without a GPU or GPU libraries installed. You can curate datasets on a CPU-only machine and train on a different machine with a GPU.
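The pattern looks like this (a hypothetical, simplified module, not the actual cli.py):

```python
# Sketch of the lazy-import pattern: heavy libraries are imported inside
# the functions that need them, so importing this module -- and running
# the curation path -- never touches torch.

def prepare(path):
    """Curation path: standard library only, works on CPU-only machines."""
    import json, hashlib  # lightweight, always available
    return f"prepared {path}"

def train(config_path):
    """Training path: heavy dependencies load only when this is called."""
    import torch                  # deferred import
    from trl import SFTTrainer    # not needed unless training actually runs
    raise NotImplementedError("sketch only")

print(prepare("data.jsonl"))  # prepared data.jsonl
```

Only calling train() requires the [train] extras; everything else, including the test suite, runs without them.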

Config-driven training: All hyperparameters live in a YAML file, not in code. This makes runs reproducible (commit the config alongside the adapter), diffable (compare two training runs by diffing their configs), and shareable (send someone a config file, not instructions).

Modular stages: Each pipeline stage is independent. Use prepare to curate data without ever training. Use export to convert a PEFT adapter from any source, not just FineForge-trained ones. Use eval to benchmark any LoRA adapter against its base model.


Installation

# Core (dataset tools + CLI) -- no GPU required
pip install fineforge

# With training support (requires NVIDIA GPU + CUDA)
pip install fineforge[train]

# Everything including GGUF export and dev tools
pip install fineforge[all]

# From source
git clone https://github.com/dbhavery/fineforge.git
cd fineforge
pip install -e ".[dev]"

Dependencies

Core (always installed):

  • click>=8.0 -- CLI framework
  • pyyaml>=6.0 -- Config file parsing
  • rich>=13.0 -- Terminal formatting and progress display

Training (install with [train]):

  • torch>=2.0 -- Tensor computation and CUDA backend
  • transformers>=4.40 -- Model loading and tokenization
  • peft>=0.12 -- LoRA adapter injection and management
  • trl>=0.9 -- SFTTrainer for supervised fine-tuning
  • bitsandbytes>=0.43 -- 4-bit NF4 quantization
  • datasets>=2.20 -- Dataset loading utilities
  • accelerate>=0.30 -- Device placement and mixed precision

Export (install with [export]):

  • llama-cpp-python>=0.2 -- Python bindings for GGUF operations

Full Workflow Example

# 1. Prepare: curate 10,000 chat samples down to high-quality training data
fineforge prepare raw_conversations.jsonl \
  --output-dir ./data \
  --min-quality 0.4 \
  --eval-ratio 0.1 \
  --seed 42

# Output:
#   Dataset Statistics
#   Raw samples:           10,000
#   After filtering:        7,234
#   Duplicates removed:       412
#   Low quality removed:    2,354
#   Avg turns/conversation:   4.2
#   Train set: 6,511 samples -> ./data/train.jsonl
#   Eval set:    723 samples -> ./data/eval.jsonl

# 2. Train: fine-tune Qwen2.5-7B with QLoRA
cat > config.yaml << 'EOF'
base_model: unsloth/Qwen2.5-7B
dataset_path: ./data/train.jsonl
output_dir: ./output
lora_r: 16
lora_alpha: 32
num_epochs: 3
learning_rate: 2e-4
batch_size: 4
max_seq_length: 2048
EOF

fineforge train config.yaml

# 3. Evaluate: compare base vs tuned
fineforge eval ./output/adapter \
  --prompts test_prompts.yaml \
  --base-model unsloth/Qwen2.5-7B \
  --output eval_results.json

# 4. Export: GGUF + Ollama
fineforge export ./output/adapter \
  --base-model unsloth/Qwen2.5-7B \
  --quantization q4_k_m \
  --ollama-name my-tuned-qwen

# 5. Use it
ollama run my-tuned-qwen

References

  • Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314
  • Hu, E. J., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685
  • Dettmers, T., et al. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. arXiv:2208.07339

License

MIT License. See LICENSE for details.
