FineForge -- QLoRA Fine-Tuning Pipeline for Consumer GPUs
An end-to-end CLI pipeline that takes raw chat data and produces a fine-tuned language model running locally in Ollama. FineForge handles dataset curation (validate, score, deduplicate, split), QLoRA training with 4-bit quantization, before/after evaluation, GGUF export, and Ollama registration -- all designed to run on a single consumer GPU.
The core problem FineForge solves: fine-tuning a 7B parameter model normally requires 28+ GB of VRAM (full FP32 weights plus optimizer states). QLoRA reduces this to under 8 GB by quantizing the base model to 4-bit NormalFloat and training only small rank-decomposition matrices injected into the attention layers. FineForge wraps this technique into a repeatable, config-driven pipeline.
Source: github.com/dbhavery/fineforge
What is QLoRA
QLoRA (Quantized Low-Rank Adaptation) combines two techniques:
4-bit quantization (NF4): The pretrained base model weights are loaded in 4-bit NormalFloat format, a data type specifically designed for normally-distributed neural network weights. Double quantization further compresses the quantization constants themselves. This reduces the memory footprint of a 7B model from ~14 GB (FP16) to ~3.5 GB, freeing VRAM for training.
Low-Rank Adaptation (LoRA): Instead of updating all model weights during training, LoRA freezes the quantized base model and injects small trainable rank-decomposition matrices into specific layers. For a weight matrix W of shape (d, k), LoRA adds a bypass: W' = W + BA, where B is (d, r) and A is (r, k), with rank r much smaller than both d and k. With r=16 on a 7B model, this means training ~10 million parameters instead of 7 billion -- a 700x reduction.
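A back-of-the-envelope check of these numbers (a sketch only -- the hidden size and layer count below are illustrative for a 7B-class dense model; grouped-query attention in real models shrinks the k/v projections, bringing the count closer to the ~10M figure):

```python
# LoRA trainable-parameter count and base-model memory, illustrative shapes.
hidden, layers, r = 4096, 32, 16

# Each adapted projection W (hidden, hidden) gains B (hidden, r) + A (r, hidden).
per_proj = hidden * r + r * hidden
trainable = layers * 4 * per_proj          # q, k, v, o per layer
print(f"trainable LoRA params: {trainable / 1e6:.1f}M")   # ~17M with these shapes

# Memory for the frozen 7B base model at different precisions.
base_params = 7e9
fp16_gb = base_params * 2 / 1e9            # 2 bytes per parameter
nf4_gb = base_params * 0.5 / 1e9           # ~0.5 bytes per parameter
print(f"FP16: {fp16_gb:.1f} GB, NF4: {nf4_gb:.1f} GB")    # 14.0 GB, 3.5 GB
```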
The combination means you can fine-tune a 7B model on an 8 GB GPU that would otherwise only be able to run inference. The trained adapter (typically 20-50 MB) can be merged back into the base model for deployment or served as a standalone LoRA layer.
Paper: QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023)
Pipeline Architecture
+-------------------+ +-------------------+ +-------------------+
| | | | | |
| Raw Chat Data | | Training Config | | Test Prompts |
| (JSONL) | | (YAML) | | (YAML) |
| | | | | |
+--------+----------+ +--------+----------+ +--------+----------+
| | |
v v |
+--------+----------+ +--------+----------+ |
| | | | |
| fineforge | | fineforge | |
| prepare | | train | |
| | | | |
| - Validate fmt | | - Load base | |
| - Score quality | | model (4-bit) | |
| - Deduplicate | | - Apply LoRA | |
| - Filter | | - Tokenize data | |
| - Train/eval | | - SFTTrainer | |
| split | | - Save adapter | |
| | | | |
+--------+----------+ +--------+----------+ |
| | |
v v v
+--------+----------+ +--------+----------+ +--------+----------+
| | | | | |
| train.jsonl | | LoRA Adapter +---->+ fineforge |
| eval.jsonl | | (adapter/) | | eval |
| | | | | |
+-------------------+ +--------+----------+ | - Load base |
| | - Load tuned |
v | - Generate both |
+--------+----------+ | - Score & compare|
| | | |
| fineforge | +--------+----------+
| export | |
| | v
| - Merge adapter | +--------+----------+
| - Convert GGUF | | |
| - Quantize | | Eval Results |
| - Ollama create | | (JSON) |
| | | |
+--------+----------+ +-------------------+
|
v
+--------+----------+
| |
| Ollama |
| ollama run |
| my-tuned-model |
| |
+-------------------+
Pipeline Stages
Stage 1: Prepare -- Dataset Curation
fineforge prepare my_chats.jsonl --output-dir ./data --min-quality 0.4
The prepare stage applies five operations in sequence:
Format validation: Each sample must follow the OpenAI chat format -- a JSON object with a `messages` array containing objects with `role` (system/user/assistant) and `content` fields. Samples must have at least one user and one assistant message. Malformed samples are rejected with specific error messages.
Quality scoring: Each valid sample receives a score from 0.0 to 1.0 based on five heuristics: assistant response length (0-0.3), multi-turn depth (0-0.2), presence of a system prompt (0-0.1), user message quality (0-0.2), and vocabulary diversity in assistant responses (0-0.2).
Deduplication: SHA-256 content hashing over the role/content pairs. Exact duplicates are removed.
Filtering: Samples below the minimum quality threshold and samples with very short assistant responses are discarded.
Train/eval split: The remaining samples are shuffled with a fixed seed and split into training and evaluation sets (default 90/10).
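The deduplication and split steps can be sketched with the stdlib alone (a minimal illustration of the technique; FineForge's actual implementation lives in dataset.py and may differ in detail):

```python
import hashlib
import json
import random

def content_hash(sample: dict) -> str:
    """SHA-256 over the (role, content) pairs, so exact duplicates collide."""
    canon = json.dumps(
        [(m["role"], m["content"]) for m in sample["messages"]],
        ensure_ascii=False,
    )
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()

def dedupe_and_split(samples, eval_ratio=0.1, seed=42):
    """Drop exact duplicates, then shuffle with a fixed seed and split."""
    seen, unique = set(), []
    for s in samples:
        h = content_hash(s)
        if h not in seen:
            seen.add(h)
            unique.append(s)
    rng = random.Random(seed)          # fixed seed -> reproducible split
    rng.shuffle(unique)
    n_eval = max(1, int(len(unique) * eval_ratio))
    return unique[n_eval:], unique[:n_eval]    # train, eval
```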
Stage 2: Train -- QLoRA Fine-Tuning
fineforge train config.yaml
Training is fully configured via YAML. The trainer:
- Validates the configuration (LoRA rank, learning rate ranges, mutual exclusivity of fp16/bf16).
- Checks GPU availability and reports device name, VRAM, and CUDA version.
- Loads the base model with 4-bit NF4 quantization via `BitsAndBytesConfig` with double quantization enabled.
- Applies LoRA adapters to the specified target modules (default: q_proj, k_proj, v_proj, o_proj).
- Reports trainable parameter count (typically ~0.1-0.5% of total parameters).
- Loads and tokenizes the dataset using the model's chat template.
- Trains using `trl.SFTTrainer` with paged AdamW 8-bit optimizer and cosine learning rate schedule.
- Saves the adapter weights, tokenizer, and training metadata (loss, elapsed time, config snapshot).
Stage 3: Evaluate -- Before/After Comparison
fineforge eval ./output/adapter --prompts test_prompts.yaml --base-model unsloth/Qwen2.5-7B
Evaluation loads both the base model and the fine-tuned model, runs each test prompt through both, and compares the outputs. Responses are scored on length appropriateness, keyword coverage (if expected keywords are defined in the prompt YAML), and vocabulary diversity. Results are displayed as a side-by-side comparison table with per-prompt improvement scores.
The base model is unloaded from GPU memory before the fine-tuned model is loaded, so evaluation fits within the same VRAM budget as training.
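The response scoring can be sketched like this (illustrative only -- the weights and exact metrics here are assumptions, not FineForge's precise formulas):

```python
def score_response(text: str, expected_keywords=None,
                   target_len=(50, 400)) -> float:
    """Score a generated response on length, keyword coverage, and diversity."""
    words = text.split()
    # Length appropriateness: full credit inside the target window.
    lo, hi = target_len
    length_score = 1.0 if lo <= len(words) <= hi else 0.5
    # Keyword coverage: fraction of expected keywords that appear.
    if expected_keywords:
        hits = sum(1 for k in expected_keywords if k.lower() in text.lower())
        keyword_score = hits / len(expected_keywords)
    else:
        keyword_score = 1.0
    # Vocabulary diversity: unique words / total words.
    diversity = len(set(w.lower() for w in words)) / max(1, len(words))
    return round((length_score + keyword_score + diversity) / 3, 3)
```

The same function scores both the base and fine-tuned outputs, so the per-prompt improvement is simply the difference of the two scores.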
Stage 4: Export -- GGUF and Ollama Registration
fineforge export ./output/adapter \
--base-model unsloth/Qwen2.5-7B \
--quantization q4_k_m \
--ollama-name my-tuned-model
Export performs three steps:
Merge: Load the base model in FP16, apply the LoRA adapter, call `merge_and_unload()` to fold the adapter weights permanently into the base model, and save the merged model.
GGUF conversion: Use `llama.cpp`'s `convert-hf-to-gguf` script to convert the merged HuggingFace model to GGUF format (first to f16, then quantize to the target type).
Ollama registration: Generate a Modelfile with the GGUF path, system prompt, and sampling parameters, then run `ollama create` to register the model locally.
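The generated Modelfile looks roughly like this (a sketch -- the GGUF path, system prompt, and parameter values shown here are illustrative; the real ones come from the export options):

```
FROM ./merged-q4_k_m.gguf
SYSTEM "You are a helpful assistant."
PARAMETER temperature 0.7
PARAMETER num_ctx 2048
```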
Supported quantization types: q4_k_m (recommended balance of size/quality), q5_k_m, q8_0, f16.
Hardware Requirements
| GPU VRAM | What You Can Fine-Tune | Notes |
|---|---|---|
| 8 GB | 7B models (QLoRA 4-bit) | Tight -- reduce batch_size to 1-2, max_seq_length to 1024 |
| 12 GB | 7B models comfortably | batch_size=4, max_seq_length=2048 |
| 16 GB | 7B models with headroom, 13B tight | Enough for eval to load base + tuned sequentially |
| 24 GB | 7B-13B models comfortably | batch_size=8+, longer sequences, larger LoRA rank |
| Component | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 8 GB (NVIDIA, CUDA) | 16-24 GB |
| System RAM | 16 GB | 32+ GB |
| Disk | 20 GB (model weights + checkpoints) | 50+ GB |
| CUDA | 11.8+ | 12.0+ |
Tested on NVIDIA RTX 3090 (24 GB VRAM) with Qwen2.5-7B. AMD ROCm GPUs may work via PyTorch ROCm builds but are untested.
Training Configuration Reference
# config.yaml -- all parameters with defaults
base_model: unsloth/Qwen2.5-7B # HuggingFace model ID or local path
dataset_path: ./data/train.jsonl # Path to training JSONL
output_dir: ./output # Output directory
# LoRA hyperparameters
lora_r: 16 # Rank of the low-rank matrices
lora_alpha: 32 # Scaling factor (effective lr = alpha/r * lr)
lora_dropout: 0.05 # Dropout on LoRA layers
lora_target_modules: # Which attention projections to adapt
- q_proj
- k_proj
- v_proj
- o_proj
# Training hyperparameters
learning_rate: 2e-4 # Peak LR (cosine schedule with warmup)
num_epochs: 3 # Training epochs
batch_size: 4 # Per-device batch size
gradient_accumulation_steps: 4 # Effective batch = batch_size * grad_accum
max_seq_length: 2048 # Truncation length
warmup_steps: 10 # Linear LR warmup
fp16: true # Mixed-precision (use bf16 on Ampere+)
bf16: false # BF16 -- mutually exclusive with fp16
logging_steps: 10 # Log loss every N steps
save_steps: 100 # Checkpoint every N steps
eval_steps: 0 # 0 = evaluate at end of epoch only
seed: 42 # Reproducibility seed
# Data handling
chat_template: chatml # Chat template format
trust_remote_code: false # Allow custom model code from HF Hub
Key Hyperparameter Guidance
LoRA rank (lora_r): Controls the expressiveness of the adaptation. r=8 is sufficient for style transfer and simple behavioral changes. r=16 (default) handles most instruction tuning. r=32-64 for complex domain adaptation. Higher rank = more trainable parameters = more VRAM = longer training.
lora_alpha: Scaling factor for the LoRA update. The effective learning rate for LoRA parameters is alpha/r * lr. The default alpha=32 with r=16 gives a 2x scaling. If training is unstable, reduce alpha.
batch_size and gradient_accumulation_steps: The effective batch size is batch_size * gradient_accumulation_steps. Larger effective batches smooth the gradient but require more memory. With 8 GB VRAM, use batch_size=1 and gradient_accumulation_steps=8-16.
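A quick sanity check of the two relationships above, using values from the guidance (plain arithmetic, nothing FineForge-specific):

```python
# Effective batch size: gradients accumulate over several micro-batches
# before each optimizer step, trading steps for memory.
batch_size = 1
gradient_accumulation_steps = 16
effective_batch = batch_size * gradient_accumulation_steps   # 16 per step

# LoRA scaling: the adapter output is scaled by alpha / r, which acts like
# a multiplier on the effective learning rate of the LoRA parameters.
lora_alpha, lora_r, lr = 32, 16, 2e-4
scaling = lora_alpha / lora_r                                # 2.0 at defaults
print(effective_batch, scaling, scaling * lr)
```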
Dataset Format
FineForge uses the OpenAI chat completion format. Each line in the JSONL file is one conversation:
{"messages": [
{"role": "system", "content": "You are a concise technical writer."},
{"role": "user", "content": "Explain what a mutex is."},
{"role": "assistant", "content": "A mutex (mutual exclusion) is a synchronization primitive that prevents multiple threads from accessing a shared resource simultaneously. A thread must acquire the mutex before entering a critical section and release it afterward. If the mutex is already held, the requesting thread blocks until it becomes available."}
]}
Multi-turn conversations are supported:
{"messages": [
{"role": "system", "content": "You are a Python tutor."},
{"role": "user", "content": "What is a list comprehension?"},
{"role": "assistant", "content": "A list comprehension is a concise way to create lists. The syntax is [expression for item in iterable if condition]."},
{"role": "user", "content": "Give me an example with filtering."},
{"role": "assistant", "content": "Here is an example that keeps only even numbers: `evens = [x for x in range(20) if x % 2 == 0]`. This produces [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]."}
]}
Requirements:
- Each sample must have at least one `user` and one `assistant` message.
- A `system` message is optional but improves training signal.
- `role` must be one of: `system`, `user`, `assistant`.
- `content` must be a non-empty string.
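These rules can be checked with a small validator (a sketch of the requirements above, not FineForge's dataset.py implementation):

```python
VALID_ROLES = {"system", "user", "assistant"}

def validate_sample(sample: dict) -> list[str]:
    """Return a list of error messages; an empty list means the sample is valid."""
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]
    errors, roles = [], []
    for i, m in enumerate(messages):
        role, content = m.get("role"), m.get("content")
        if role not in VALID_ROLES:
            errors.append(f"message {i}: invalid role {role!r}")
        if not isinstance(content, str) or not content.strip():
            errors.append(f"message {i}: content must be a non-empty string")
        roles.append(role)
    if "user" not in roles:
        errors.append("no user message")
    if "assistant" not in roles:
        errors.append("no assistant message")
    return errors
```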
Module Architecture
fineforge/
__init__.py # Package metadata (__version__)
cli.py # Click CLI: prepare, train, eval, export commands
config.py # TrainConfig dataclass with validation + YAML I/O
dataset.py # JSONL loading, format validation, quality scoring,
# SHA-256 deduplication, filtering, train/eval splitting
trainer.py # QLoRA training: BitsAndBytesConfig, LoRA injection,
# SFTTrainer, adapter + metadata saving
evaluator.py # Base vs tuned comparison: prompt loading, generation,
# response scoring, side-by-side results
exporter.py # LoRA merge, HF-to-GGUF conversion via llama.cpp,
# Modelfile generation, Ollama registration
Design Decisions
Lazy imports: All heavy ML dependencies (torch, transformers, peft, trl, bitsandbytes, datasets) are imported inside the functions that need them, not at module level. This means the CLI, dataset tools, and test suite all work without a GPU or GPU libraries installed. You can curate datasets on a CPU-only machine and train on a different machine with a GPU.
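The lazy-import pattern looks roughly like this (a schematic of the idea; the function names are illustrative, not FineForge's actual API):

```python
import sys

def prepare(path: str) -> str:
    """Dataset curation needs only the stdlib, so it runs on CPU-only machines."""
    import json  # cheap, always available
    return path

def train(config_path: str):
    """Heavy ML dependencies are imported only when training actually runs."""
    # Deferred import: a missing torch fails here, at train time,
    # not at CLI startup -- so `prepare` still works without GPU libraries.
    import torch
    if not torch.cuda.is_available():
        raise RuntimeError("training requires a CUDA-capable GPU")
```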
Config-driven training: All hyperparameters live in a YAML file, not in code. This makes runs reproducible (commit the config alongside the adapter), diffable (compare two training runs by diffing their configs), and shareable (send someone a config file, not instructions).
Modular stages: Each pipeline stage is independent. Use prepare to curate data without ever training. Use export to convert a PEFT adapter from any source, not just FineForge-trained ones. Use eval to benchmark any LoRA adapter against its base model.
Installation
# Core (dataset tools + CLI) -- no GPU required
pip install fineforge
# With training support (requires NVIDIA GPU + CUDA)
pip install fineforge[train]
# Everything including GGUF export and dev tools
pip install fineforge[all]
# From source
git clone https://github.com/dbhavery/fineforge.git
cd fineforge
pip install -e ".[dev]"
Dependencies
Core (always installed):
- `click>=8.0` -- CLI framework
- `pyyaml>=6.0` -- Config file parsing
- `rich>=13.0` -- Terminal formatting and progress display
Training (install with [train]):
- `torch>=2.0` -- Tensor computation and CUDA backend
- `transformers>=4.40` -- Model loading and tokenization
- `peft>=0.12` -- LoRA adapter injection and management
- `trl>=0.9` -- SFTTrainer for supervised fine-tuning
- `bitsandbytes>=0.43` -- 4-bit NF4 quantization
- `datasets>=2.20` -- Dataset loading utilities
- `accelerate>=0.30` -- Device placement and mixed precision
Export (install with [export]):
- `llama-cpp-python>=0.2` -- Python bindings for GGUF operations
Full Workflow Example
# 1. Prepare: curate 10,000 chat samples down to high-quality training data
fineforge prepare raw_conversations.jsonl \
--output-dir ./data \
--min-quality 0.4 \
--eval-ratio 0.1 \
--seed 42
# Output:
# Dataset Statistics
# Raw samples: 10,000
# After filtering: 7,234
# Duplicates removed: 412
# Low quality removed: 2,354
# Avg turns/conversation: 4.2
# Train set: 6,511 samples -> ./data/train.jsonl
# Eval set: 723 samples -> ./data/eval.jsonl
# 2. Train: fine-tune Qwen2.5-7B with QLoRA
cat > config.yaml << 'EOF'
base_model: unsloth/Qwen2.5-7B
dataset_path: ./data/train.jsonl
output_dir: ./output
lora_r: 16
lora_alpha: 32
num_epochs: 3
learning_rate: 2e-4
batch_size: 4
max_seq_length: 2048
EOF
fineforge train config.yaml
# 3. Evaluate: compare base vs tuned
fineforge eval ./output/adapter \
--prompts test_prompts.yaml \
--base-model unsloth/Qwen2.5-7B \
--output eval_results.json
# 4. Export: GGUF + Ollama
fineforge export ./output/adapter \
--base-model unsloth/Qwen2.5-7B \
--quantization q4_k_m \
--ollama-name my-tuned-qwen
# 5. Use it
ollama run my-tuned-qwen
References
- Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314
- Hu, E. J., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685
- Dettmers, T., et al. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. arXiv:2208.07339
License
MIT License. See LICENSE for details.