dbhavery committed
Commit 39fed52 · verified · 1 Parent(s): 6b2478b

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +396 -55
README.md CHANGED
@@ -1,55 +1,396 @@
1
- ---
2
- library_name: peft
3
- tags:
4
- - fine-tuning
5
- - qlora
6
- - lora
7
- - gguf
8
- - ollama
9
- - consumer-gpu
10
- license: mit
11
- ---
12
-
13
- # FineForge — QLoRA Fine-Tuning Pipeline
14
-
15
- End-to-end LoRA/QLoRA fine-tuning pipeline designed for consumer GPUs (RTX 3090, 4090). Curate datasets, train adapters, evaluate, and export to GGUF for local inference with Ollama.
16
-
17
- ## Pipeline
18
-
19
- ```
20
- Dataset Curation -> QLoRA Training -> Evaluation -> GGUF Export -> Ollama Deploy
21
- | | | | |
22
- Filter/clean 4-bit quant Perplexity llama.cpp ollama create
23
- Format/split LoRA adapters Task metrics conversion model:tag
24
- JSONL output Checkpoints Comparisons Quantization Local serve
25
- ```
26
-
27
- ## Features
28
-
29
- - **Dataset curation** — Filter, clean, format, train/val/test split
30
- - **QLoRA training** — 4-bit quantization, LoRA rank/alpha configuration
31
- - **Multi-GPU** — DataParallel for multi-GPU setups
32
- **Evaluation** — Perplexity, task-specific metrics, baseline comparison
33
- - **GGUF export** — Convert to GGUF format for llama.cpp / Ollama
34
- **Ollama integration** — Auto-create Modelfile, register with Ollama
35
-
36
- ## Hardware Requirements
37
-
38
- | GPU | VRAM | Max Model Size |
39
- |-----|------|---------------|
40
- | RTX 3090 | 24GB | 13B (QLoRA) |
41
- | RTX 4090 | 24GB | 13B (QLoRA) |
42
- | RTX 3060 | 12GB | 7B (QLoRA) |
43
- | RTX 4060 | 8GB | 3B (QLoRA) |
44
-
45
- ## Usage
46
-
47
- ```bash
48
- pip install fineforge
49
- fineforge curate --input data.jsonl --output curated/ --strategy quality
50
- fineforge train --config train.yaml --output checkpoints/
51
- fineforge eval --checkpoint checkpoints/best --benchmark mmlu
52
- fineforge export --checkpoint checkpoints/best --format gguf --quant q4_k_m
53
- ```
54
-
55
- **48 tests** | [GitHub](https://github.com/dbhavery/fineforge) | [Author](https://github.com/dbhavery)
1
+ ---
2
+ library_name: peft
3
+ license: mit
4
+ tags:
5
+ - fine-tuning
6
+ - qlora
7
+ - lora
8
+ - gguf
9
+ - ollama
10
+ - consumer-gpu
11
+ - peft
12
+ - quantization
13
+ language:
14
+ - en
15
+ pipeline_tag: text-generation
16
+ ---
17
+
18
+ # FineForge -- QLoRA Fine-Tuning Pipeline for Consumer GPUs
19
+
20
+ An end-to-end CLI pipeline that takes raw chat data and produces a fine-tuned language model running locally in Ollama. FineForge handles dataset curation (validate, score, deduplicate, split), QLoRA training with 4-bit quantization, before/after evaluation, GGUF export, and Ollama registration -- all designed to run on a single consumer GPU.
21
+
22
+ The core problem FineForge solves: fully fine-tuning a 7B parameter model normally needs far more VRAM than any consumer card offers (the FP32 weights alone are 28 GB, before gradients and optimizer states). QLoRA reduces this to under 8 GB by quantizing the base model to 4-bit NormalFloat and training only small rank-decomposition matrices injected into the attention layers. FineForge wraps this technique into a repeatable, config-driven pipeline.
23
+
24
+ **Source**: [github.com/dbhavery/fineforge](https://github.com/dbhavery/fineforge)
25
+
26
+ ---
27
+
28
+ ## What is QLoRA
29
+
30
+ QLoRA (Quantized Low-Rank Adaptation) combines two techniques:
31
+
32
+ **4-bit quantization (NF4)**: The pretrained base model weights are loaded in 4-bit NormalFloat format, a data type specifically designed for normally-distributed neural network weights. Double quantization further compresses the quantization constants themselves. This reduces the memory footprint of a 7B model from ~14 GB (FP16) to ~3.5 GB, freeing VRAM for training.
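The footprint figures above are straightforward arithmetic; a minimal sketch (weight memory only, ignoring the small quantization constants and activation memory):

```python
def model_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory for a model with n_params parameters."""
    return n_params * bits_per_weight / 8 / 1e9

params_7b = 7e9
fp16_gb = model_memory_gb(params_7b, 16)  # FP16: 2 bytes per weight
nf4_gb = model_memory_gb(params_7b, 4)    # NF4: half a byte per weight
print(f"FP16: {fp16_gb:.1f} GB, NF4: {nf4_gb:.1f} GB")
```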
33
+
34
+ **Low-Rank Adaptation (LoRA)**: Instead of updating all model weights during training, LoRA freezes the quantized base model and injects small trainable rank-decomposition matrices into specific layers. For a weight matrix W of shape (d, k), LoRA adds a bypass: `W' = W + BA`, where `B` is (d, r) and `A` is (r, k), with rank `r` much smaller than both `d` and `k`. With r=16 on a 7B model, this means training ~10 million parameters instead of 7 billion -- a 700x reduction.
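As a sanity check on the ~10 million figure, here is a rough count assuming Qwen2.5-7B-like attention shapes (hidden size 3584, GQA key/value dimension 512, 28 layers; these shapes are assumptions for illustration, not read from FineForge):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters in one bypass: A is (r, d_in), B is (d_out, r)."""
    return r * d_in + d_out * r

hidden, kv, layers, r = 3584, 512, 28, 16
per_layer = (
    lora_params(hidden, hidden, r)    # q_proj
    + lora_params(hidden, kv, r)      # k_proj (GQA: smaller output dim)
    + lora_params(hidden, kv, r)      # v_proj
    + lora_params(hidden, hidden, r)  # o_proj
)
total = per_layer * layers
print(f"{total / 1e6:.1f}M trainable vs 7B frozen")
```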
35
+
36
+ The combination means you can fine-tune a 7B model on an 8 GB GPU that would otherwise only be able to run inference. The trained adapter (typically 20-50 MB) can be merged back into the base model for deployment or served as a standalone LoRA layer.
37
+
38
+ **Paper**: [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314) (Dettmers et al., 2023)
39
+
40
+ ---
41
+
42
+ ## Pipeline Architecture
43
+
44
+ ```
45
+ +-------------------+ +-------------------+ +-------------------+
46
+ | | | | | |
47
+ | Raw Chat Data | | Training Config | | Test Prompts |
48
+ | (JSONL) | | (YAML) | | (YAML) |
49
+ | | | | | |
50
+ +--------+----------+ +--------+----------+ +--------+----------+
51
+ | | |
52
+ v v |
53
+ +--------+----------+ +--------+----------+ |
54
+ | | | | |
55
+ | fineforge | | fineforge | |
56
+ | prepare | | train | |
57
+ | | | | |
58
+ | - Validate fmt | | - Load base | |
59
+ | - Score quality | | model (4-bit) | |
60
+ | - Deduplicate | | - Apply LoRA | |
61
+ | - Filter | | - Tokenize data | |
62
+ | - Train/eval | | - SFTTrainer | |
63
+ | split | | - Save adapter | |
64
+ | | | | |
65
+ +--------+----------+ +--------+----------+ |
66
+ | | |
67
+ v v v
68
+ +--------+----------+ +--------+----------+ +--------+----------+
69
+ | | | | | |
70
+ | train.jsonl | | LoRA Adapter +---->+ fineforge |
71
+ | eval.jsonl | | (adapter/) | | eval |
72
+ | | | | | |
73
+ +-------------------+ +--------+----------+ | - Load base |
74
+ | | - Load tuned |
75
+ v | - Generate both |
76
+ +--------+----------+ | - Score & compare|
77
+ | | | |
78
+ | fineforge | +--------+----------+
79
+ | export | |
80
+ | | v
81
+ | - Merge adapter | +--------+----------+
82
+ | - Convert GGUF | | |
83
+ | - Quantize | | Eval Results |
84
+ | - Ollama create | | (JSON) |
85
+ | | | |
86
+ +--------+----------+ +-------------------+
87
+ |
88
+ v
89
+ +--------+----------+
90
+ | |
91
+ | Ollama |
92
+ | ollama run |
93
+ | my-tuned-model |
94
+ | |
95
+ +-------------------+
96
+ ```
97
+
98
+ ---
99
+
100
+ ## Pipeline Stages
101
+
102
+ ### Stage 1: Prepare -- Dataset Curation
103
+
104
+ ```bash
105
+ fineforge prepare my_chats.jsonl --output-dir ./data --min-quality 0.4
106
+ ```
107
+
108
+ The prepare stage applies five operations in sequence:
109
+
110
+ 1. **Format validation**: Each sample must follow the OpenAI chat format -- a JSON object with a `messages` array containing objects with `role` (system/user/assistant) and `content` fields. Samples must have at least one user and one assistant message. Malformed samples are rejected with specific error messages.
111
+
112
+ 2. **Quality scoring**: Each valid sample receives a score from 0.0 to 1.0 based on five heuristics: assistant response length (0-0.3), multi-turn depth (0-0.2), presence of a system prompt (0-0.1), user message quality (0-0.2), and vocabulary diversity in assistant responses (0-0.2).
113
+
114
+ 3. **Deduplication**: SHA-256 content hashing over the role/content pairs. Exact duplicates are removed.
115
+
116
+ 4. **Filtering**: Samples below the minimum quality threshold and samples with very short assistant responses are discarded.
117
+
118
+ 5. **Train/eval split**: The remaining samples are shuffled with a fixed seed and split into training and evaluation sets (default 90/10).
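Steps 3 and 5 can be sketched with the standard library alone (function names here are illustrative, not FineForge's actual API):

```python
import hashlib
import json
import random

def content_hash(sample: dict) -> str:
    """SHA-256 over the ordered role/content pairs, as in the dedup step."""
    key = json.dumps(
        [(m["role"], m["content"]) for m in sample["messages"]],
        ensure_ascii=False,
    )
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def dedup_and_split(samples, eval_ratio=0.1, seed=42):
    """Drop exact duplicates, then shuffle and split with a fixed seed."""
    seen, unique = set(), []
    for s in samples:
        h = content_hash(s)
        if h not in seen:
            seen.add(h)
            unique.append(s)
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_eval = max(1, int(len(unique) * eval_ratio))
    return unique[n_eval:], unique[:n_eval]

samples = [
    {"messages": [{"role": "user", "content": "hi"},
                  {"role": "assistant", "content": "hello"}]},
] * 3 + [
    {"messages": [{"role": "user", "content": "bye"},
                  {"role": "assistant", "content": "goodbye"}]},
]
train_set, eval_set = dedup_and_split(samples)
print(len(train_set), len(eval_set))  # the three duplicates collapse to one
```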
119
+
120
+ ### Stage 2: Train -- QLoRA Fine-Tuning
121
+
122
+ ```bash
123
+ fineforge train config.yaml
124
+ ```
125
+
126
+ Training is fully configured via YAML. The trainer:
127
+
128
+ 1. Validates the configuration (LoRA rank, learning rate ranges, mutual exclusivity of fp16/bf16).
129
+ 2. Checks GPU availability and reports device name, VRAM, and CUDA version.
130
+ 3. Loads the base model with 4-bit NF4 quantization via `BitsAndBytesConfig` with double quantization enabled.
131
+ 4. Applies LoRA adapters to the specified target modules (default: q_proj, k_proj, v_proj, o_proj).
132
+ 5. Reports trainable parameter count (typically ~0.1-0.5% of total parameters).
133
+ 6. Loads and tokenizes the dataset using the model's chat template.
134
+ 7. Trains using `trl.SFTTrainer` with paged AdamW 8-bit optimizer and cosine learning rate schedule.
135
+ 8. Saves the adapter weights, tokenizer, and training metadata (loss, elapsed time, config snapshot).
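Steps 3 and 4 map onto the standard `transformers` and `peft` APIs. A configuration sketch (it mirrors the defaults described here, downloads the base model when run, and is not FineForge's exact code):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",        # 4-bit NormalFloat
    bnb_4bit_use_double_quant=True,   # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Qwen2.5-7B", quantization_config=bnb, device_map="auto"
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # reports the ~0.1-0.5% figure
```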
136
+
137
+ ### Stage 3: Evaluate -- Before/After Comparison
138
+
139
+ ```bash
140
+ fineforge eval ./output/adapter --prompts test_prompts.yaml --base-model unsloth/Qwen2.5-7B
141
+ ```
142
+
143
+ Evaluation loads both the base model and the fine-tuned model, runs each test prompt through both, and compares the outputs. Responses are scored on length appropriateness, keyword coverage (if expected keywords are defined in the prompt YAML), and vocabulary diversity. Results are displayed as a side-by-side comparison table with per-prompt improvement scores.
144
+
145
+ The base model is unloaded from GPU memory before the fine-tuned model is loaded, so evaluation fits within the same VRAM budget as training.
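A stdlib sketch of such scoring (the weights, length band, and exact heuristics below are illustrative assumptions, not FineForge's actual formula):

```python
def score_response(text: str, expected_keywords=()) -> float:
    """Toy version of the three response heuristics; weights are illustrative."""
    words = text.lower().split()
    # Length appropriateness: reward responses in a sane band.
    length = 1.0 if 20 <= len(words) <= 400 else 0.5
    # Keyword coverage, if the prompt YAML defines expected keywords.
    if expected_keywords:
        hits = sum(1 for k in expected_keywords if k.lower() in text.lower())
        coverage = hits / len(expected_keywords)
    else:
        coverage = 1.0
    # Vocabulary diversity: unique-token ratio.
    diversity = len(set(words)) / max(1, len(words))
    return round(0.4 * length + 0.4 * coverage + 0.2 * diversity, 3)

base = "A mutex is a lock. A mutex is a lock. " * 5
tuned = ("A mutex is a synchronization primitive that serializes access "
         "to a shared resource; a thread acquires it before the critical "
         "section and releases it afterward, blocking others meanwhile.")
print(score_response(base, ["mutex", "thread"]),
      score_response(tuned, ["mutex", "thread"]))
```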
146
+
147
+ ### Stage 4: Export -- GGUF and Ollama Registration
148
+
149
+ ```bash
150
+ fineforge export ./output/adapter \
151
+ --base-model unsloth/Qwen2.5-7B \
152
+ --quantization q4_k_m \
153
+ --ollama-name my-tuned-model
154
+ ```
155
+
156
+ Export performs three steps:
157
+
158
+ 1. **Merge**: Load the base model in FP16, apply the LoRA adapter, call `merge_and_unload()` to fold the adapter weights permanently into the base model, and save the merged model.
159
+
160
+ 2. **GGUF conversion**: Use `llama.cpp`'s `convert-hf-to-gguf` script to convert the merged HuggingFace model to GGUF format (first to f16, then quantize to the target type).
161
+
162
+ 3. **Ollama registration**: Generate a Modelfile with the GGUF path, system prompt, and sampling parameters, then run `ollama create` to register the model locally.
163
+
164
+ Supported quantization types: `q4_k_m` (recommended balance of size/quality), `q5_k_m`, `q8_0`, `f16`.
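Step 3's Modelfile reduces to plain string templating (`make_modelfile` is a hypothetical helper; `FROM`, `SYSTEM`, and `PARAMETER` are standard Ollama Modelfile directives):

```python
def make_modelfile(gguf_path: str, system_prompt: str,
                   temperature: float = 0.7, num_ctx: int = 2048) -> str:
    """Render an Ollama Modelfile pointing at the exported GGUF."""
    return "\n".join([
        f"FROM {gguf_path}",
        f'SYSTEM """{system_prompt}"""',
        f"PARAMETER temperature {temperature}",
        f"PARAMETER num_ctx {num_ctx}",
    ])

mf = make_modelfile("./merged/model-q4_k_m.gguf",
                    "You are a concise technical writer.")
print(mf)
# Then: write it to a file and run `ollama create my-tuned-model -f Modelfile`
```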
165
+
166
+ ---
167
+
168
+ ## Hardware Requirements
169
+
170
+ | GPU VRAM | What You Can Fine-Tune | Notes |
171
+ |----------|------------------------|-------|
172
+ | 8 GB | 7B models (QLoRA 4-bit) | Tight -- reduce batch_size to 1-2, max_seq_length to 1024 |
173
+ | 12 GB | 7B models comfortably | batch_size=4, max_seq_length=2048 |
174
+ | 16 GB | 7B models with headroom, 13B tight | Enough for eval to load base + tuned sequentially |
175
+ | 24 GB | 7B-13B models comfortably | batch_size=8+, longer sequences, larger LoRA rank |
176
+
177
+ | Component | Minimum | Recommended |
178
+ |-----------|---------|-------------|
179
+ | GPU VRAM | 8 GB (NVIDIA, CUDA) | 16-24 GB |
180
+ | System RAM | 16 GB | 32+ GB |
181
+ | Disk | 20 GB (model weights + checkpoints) | 50+ GB |
182
+ | CUDA | 11.8+ | 12.0+ |
183
+
184
+ Tested on NVIDIA RTX 3090 (24 GB VRAM) with Qwen2.5-7B. AMD ROCm GPUs may work via PyTorch ROCm builds but are untested.
185
+
186
+ ---
187
+
188
+ ## Training Configuration Reference
189
+
190
+ ```yaml
191
+ # config.yaml -- all parameters with defaults
192
+ base_model: unsloth/Qwen2.5-7B # HuggingFace model ID or local path
193
+ dataset_path: ./data/train.jsonl # Path to training JSONL
194
+ output_dir: ./output # Output directory
195
+
196
+ # LoRA hyperparameters
197
+ lora_r: 16 # Rank of the low-rank matrices
198
+ lora_alpha: 32 # Scaling factor (effective lr = alpha/r * lr)
199
+ lora_dropout: 0.05 # Dropout on LoRA layers
200
+ lora_target_modules: # Which attention projections to adapt
201
+ - q_proj
202
+ - k_proj
203
+ - v_proj
204
+ - o_proj
205
+
206
+ # Training hyperparameters
207
+ learning_rate: 2e-4 # Peak LR (cosine schedule with warmup)
208
+ num_epochs: 3 # Training epochs
209
+ batch_size: 4 # Per-device batch size
210
+ gradient_accumulation_steps: 4 # Effective batch = batch_size * grad_accum
211
+ max_seq_length: 2048 # Truncation length
212
+ warmup_steps: 10 # Linear LR warmup
213
+ fp16: true # Mixed-precision (use bf16 on Ampere+)
214
+ bf16: false # BF16 -- mutually exclusive with fp16
215
+ logging_steps: 10 # Log loss every N steps
216
+ save_steps: 100 # Checkpoint every N steps
217
+ eval_steps: 0 # 0 = evaluate at end of epoch only
218
+ seed: 42 # Reproducibility seed
219
+
220
+ # Data handling
221
+ chat_template: chatml # Chat template format
222
+ trust_remote_code: false # Allow custom model code from HF Hub
223
+ ```
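The validation pass mentioned in the training stage (rank bounds, learning-rate range, fp16/bf16 exclusivity) reduces to a few checks; a sketch with illustrative bounds, not FineForge's exact rules:

```python
def validate_config(cfg: dict) -> list[str]:
    """Collect configuration errors before any model is loaded."""
    errors = []
    if cfg.get("fp16") and cfg.get("bf16"):
        errors.append("fp16 and bf16 are mutually exclusive")
    if not (1 <= cfg.get("lora_r", 16) <= 256):
        errors.append("lora_r must be in [1, 256]")
    if not (1e-6 <= cfg.get("learning_rate", 2e-4) <= 1e-2):
        errors.append("learning_rate outside sane range")
    return errors

assert validate_config({"fp16": True, "bf16": False}) == []
assert validate_config({"fp16": True, "bf16": True}) != []
```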
224
+
225
+ ### Key Hyperparameter Guidance
226
+
227
+ **LoRA rank (`lora_r`)**: Controls the expressiveness of the adaptation. r=8 is sufficient for style transfer and simple behavioral changes. r=16 (default) handles most instruction tuning. r=32-64 for complex domain adaptation. Higher rank = more trainable parameters = more VRAM = longer training.
228
+
229
+ **`lora_alpha`**: Scaling factor for the LoRA update. The effective learning rate for LoRA parameters is `alpha/r * lr`. The default alpha=32 with r=16 gives a 2x scaling. If training is unstable, reduce alpha.
230
+
231
+ **Effective batch size -- `batch_size` and `gradient_accumulation_steps`**: The effective batch size is `batch_size * gradient_accumulation_steps`. Larger effective batches smooth the gradient but require more memory. With 8 GB VRAM, use batch_size=1 and gradient_accumulation_steps=8-16.
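In numbers, using the 6,511-sample train set from the workflow example further down:

```python
def effective_batch(batch_size: int, grad_accum: int) -> int:
    """Samples contributing to each optimizer step."""
    return batch_size * grad_accum

# Two configs with the same effective batch of 16 but very different peak VRAM:
assert effective_batch(4, 4) == effective_batch(1, 16) == 16

def optimizer_steps_per_epoch(n_samples: int, batch_size: int, grad_accum: int) -> int:
    """One optimizer step per effective batch (remainder dropped for simplicity)."""
    return n_samples // effective_batch(batch_size, grad_accum)

steps = optimizer_steps_per_epoch(6511, 4, 4)
print(steps)
```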
232
+
233
+ ---
234
+
235
+ ## Dataset Format
236
+
237
+ FineForge uses the OpenAI chat completion format. Each line in the JSONL file is one conversation:
238
+
239
+ ```json
240
+ {"messages": [
241
+ {"role": "system", "content": "You are a concise technical writer."},
242
+ {"role": "user", "content": "Explain what a mutex is."},
243
+ {"role": "assistant", "content": "A mutex (mutual exclusion) is a synchronization primitive that prevents multiple threads from accessing a shared resource simultaneously. A thread must acquire the mutex before entering a critical section and release it afterward. If the mutex is already held, the requesting thread blocks until it becomes available."}
244
+ ]}
245
+ ```
246
+
247
+ Multi-turn conversations are supported:
248
+
249
+ ```json
250
+ {"messages": [
251
+ {"role": "system", "content": "You are a Python tutor."},
252
+ {"role": "user", "content": "What is a list comprehension?"},
253
+ {"role": "assistant", "content": "A list comprehension is a concise way to create lists. The syntax is [expression for item in iterable if condition]."},
254
+ {"role": "user", "content": "Give me an example with filtering."},
255
+ {"role": "assistant", "content": "Here is an example that keeps only even numbers: `evens = [x for x in range(20) if x % 2 == 0]`. This produces [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]."}
256
+ ]}
257
+ ```
258
+
259
+ Requirements:
260
+ - Each sample must have at least one `user` and one `assistant` message.
261
+ - `system` message is optional but improves training signal.
262
+ - `role` must be one of: `system`, `user`, `assistant`.
263
+ - `content` must be a non-empty string.
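These rules translate directly into a validator; a stdlib sketch (`validate_sample` is illustrative, not FineForge's actual function):

```python
def validate_sample(sample: dict) -> list[str]:
    """Check one JSONL record against the format rules above."""
    errors = []
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' array"]
    roles = []
    for i, m in enumerate(messages):
        role, content = m.get("role"), m.get("content")
        if role not in ("system", "user", "assistant"):
            errors.append(f"message {i}: invalid role {role!r}")
        if not isinstance(content, str) or not content.strip():
            errors.append(f"message {i}: content must be a non-empty string")
        roles.append(role)
    if "user" not in roles or "assistant" not in roles:
        errors.append("need at least one user and one assistant message")
    return errors

good = {"messages": [{"role": "user", "content": "hi"},
                     {"role": "assistant", "content": "hello"}]}
bad = {"messages": [{"role": "user", "content": "hi"}]}
assert validate_sample(good) == []
assert validate_sample(bad) == ["need at least one user and one assistant message"]
```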
264
+
265
+ ---
266
+
267
+ ## Module Architecture
268
+
269
+ ```
270
+ fineforge/
271
+ __init__.py # Package metadata (__version__)
272
+ cli.py # Click CLI: prepare, train, eval, export commands
273
+ config.py # TrainConfig dataclass with validation + YAML I/O
274
+ dataset.py # JSONL loading, format validation, quality scoring,
275
+ # SHA-256 deduplication, filtering, train/eval splitting
276
+ trainer.py # QLoRA training: BitsAndBytesConfig, LoRA injection,
277
+ # SFTTrainer, adapter + metadata saving
278
+ evaluator.py # Base vs tuned comparison: prompt loading, generation,
279
+ # response scoring, side-by-side results
280
+ exporter.py # LoRA merge, HF-to-GGUF conversion via llama.cpp,
281
+ # Modelfile generation, Ollama registration
282
+ ```
283
+
284
+ ### Design Decisions
285
+
286
+ **Lazy imports**: All heavy ML dependencies (`torch`, `transformers`, `peft`, `trl`, `bitsandbytes`, `datasets`) are imported inside the functions that need them, not at module level. This means the CLI, dataset tools, and test suite all work without a GPU or GPU libraries installed. You can curate datasets on a CPU-only machine and train on a different machine with a GPU.
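The pattern looks like this (a sketch; `curate` and `train` are stand-ins for the real functions):

```python
import sys

def curate(samples):
    """Dataset path: only stdlib imports, resolved at call time."""
    import hashlib
    return len({hashlib.sha256(repr(s).encode()).hexdigest() for s in samples})

def train(config):
    """Training path: torch is imported only when training actually runs."""
    import torch  # noqa: F401 -- heavy dependency, deliberately deferred
    ...

# Calling curate() never touches the ML stack:
n = curate([{"a": 1}, {"a": 1}, {"b": 2}])
print(n, "unique samples; torch loaded:", "torch" in sys.modules)
```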
287
+
288
+ **Config-driven training**: All hyperparameters live in a YAML file, not in code. This makes runs reproducible (commit the config alongside the adapter), diffable (compare two training runs by diffing their configs), and shareable (send someone a config file, not instructions).
289
+
290
+ **Modular stages**: Each pipeline stage is independent. Use `prepare` to curate data without ever training. Use `export` to convert a PEFT adapter from any source, not just FineForge-trained ones. Use `eval` to benchmark any LoRA adapter against its base model.
291
+
292
+ ---
293
+
294
+ ## Installation
295
+
296
+ ```bash
297
+ # Core (dataset tools + CLI) -- no GPU required
298
+ pip install fineforge
299
+
300
+ # With training support (requires NVIDIA GPU + CUDA)
301
+ pip install fineforge[train]
302
+
303
+ # Everything including GGUF export and dev tools
304
+ pip install fineforge[all]
305
+
306
+ # From source
307
+ git clone https://github.com/dbhavery/fineforge.git
308
+ cd fineforge
309
+ pip install -e ".[dev]"
310
+ ```
311
+
312
+ ### Dependencies
313
+
314
+ **Core** (always installed):
315
+ - `click>=8.0` -- CLI framework
316
+ - `pyyaml>=6.0` -- Config file parsing
317
+ - `rich>=13.0` -- Terminal formatting and progress display
318
+
319
+ **Training** (install with `[train]`):
320
+ - `torch>=2.0` -- Tensor computation and CUDA backend
321
+ - `transformers>=4.40` -- Model loading and tokenization
322
+ - `peft>=0.12` -- LoRA adapter injection and management
323
+ - `trl>=0.9` -- SFTTrainer for supervised fine-tuning
324
+ - `bitsandbytes>=0.43` -- 4-bit NF4 quantization
325
+ - `datasets>=2.20` -- Dataset loading utilities
326
+ - `accelerate>=0.30` -- Device placement and mixed precision
327
+
328
+ **Export** (install with `[export]`):
329
+ - `llama-cpp-python>=0.2` -- Python bindings for GGUF operations
330
+
331
+ ---
332
+
333
+ ## Full Workflow Example
334
+
335
+ ```bash
336
+ # 1. Prepare: curate 10,000 chat samples down to high-quality training data
337
+ fineforge prepare raw_conversations.jsonl \
338
+ --output-dir ./data \
339
+ --min-quality 0.4 \
340
+ --eval-ratio 0.1 \
341
+ --seed 42
342
+
343
+ # Output:
344
+ # Dataset Statistics
345
+ # Raw samples: 10,000
346
+ # After filtering: 7,234
347
+ # Duplicates removed: 412
348
+ # Low quality removed: 2,354
349
+ # Avg turns/conversation: 4.2
350
+ # Train set: 6,511 samples -> ./data/train.jsonl
351
+ # Eval set: 723 samples -> ./data/eval.jsonl
352
+
353
+ # 2. Train: fine-tune Qwen2.5-7B with QLoRA
354
+ cat > config.yaml << 'EOF'
355
+ base_model: unsloth/Qwen2.5-7B
356
+ dataset_path: ./data/train.jsonl
357
+ output_dir: ./output
358
+ lora_r: 16
359
+ lora_alpha: 32
360
+ num_epochs: 3
361
+ learning_rate: 2e-4
362
+ batch_size: 4
363
+ max_seq_length: 2048
364
+ EOF
365
+
366
+ fineforge train config.yaml
367
+
368
+ # 3. Evaluate: compare base vs tuned
369
+ fineforge eval ./output/adapter \
370
+ --prompts test_prompts.yaml \
371
+ --base-model unsloth/Qwen2.5-7B \
372
+ --output eval_results.json
373
+
374
+ # 4. Export: GGUF + Ollama
375
+ fineforge export ./output/adapter \
376
+ --base-model unsloth/Qwen2.5-7B \
377
+ --quantization q4_k_m \
378
+ --ollama-name my-tuned-qwen
379
+
380
+ # 5. Use it
381
+ ollama run my-tuned-qwen
382
+ ```
383
+
384
+ ---
385
+
386
+ ## References
387
+
388
+ - Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). *QLoRA: Efficient Finetuning of Quantized LLMs*. [arXiv:2305.14314](https://arxiv.org/abs/2305.14314)
389
+ - Hu, E. J., et al. (2021). *LoRA: Low-Rank Adaptation of Large Language Models*. [arXiv:2106.09685](https://arxiv.org/abs/2106.09685)
390
+ - Dettmers, T., et al. (2022). *LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale*. [arXiv:2208.07339](https://arxiv.org/abs/2208.07339)
391
+
392
+ ---
393
+
394
+ ## License
395
+
396
+ MIT License. See [LICENSE](https://github.com/dbhavery/fineforge/blob/main/LICENSE) for details.