File size: 10,446 Bytes
92c0ea5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
# Code Sources & References

Every code snippet, technique, and configuration used in this project traced back to its original source.
Use this when writing your paper to cite where each technique came from.

---

## 1. Liquid AI — Model & Architecture

### LFM2.5-1.2B-Instruct (Our Model)
```python
model = AutoModelForCausalLM.from_pretrained("LiquidAI/LFM2.5-1.2B-Instruct")
```
- **What:** 1.2 billion parameter instruction-tuned language model
- **HuggingFace:** https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct
- **Company:** https://www.liquid.ai/
- **Architecture:** Liquid Neural Network — hybrid state-space + attention + conv, inspired by biological neural circuits (C. elegans)
- **Paper:** arXiv:2511.23404 — LFM2 technical report
- **Why we use it:** Small enough for a laptop (2.4 GB in bf16), instruction-tuned, HuggingFace compatible

### Liquid AI Official Documentation
- **Main docs:** https://docs.liquid.ai
- **Transformers inference guide:** https://docs.liquid.ai/deployment/gpu-inference/transformers
- **Fine-tuning with TRL:** https://docs.liquid.ai/customization/finetuning-frameworks/trl
- **Fine-tuning with Unsloth:** https://docs.liquid.ai/customization/finetuning-frameworks/unsloth
- **Dataset formats:** https://docs.liquid.ai/customization/finetuning-frameworks/datasets
- **Customization overview:** https://docs.liquid.ai/customization/getting-started/welcome

### Liquid AI Official Cookbook (GitHub)
- **Repository:** https://github.com/Liquid4All/cookbook
- **SFT with TRL notebook:** https://github.com/Liquid4All/cookbook/blob/main/finetuning/notebooks/sft_with_trl.ipynb
  - This is the primary source for our LoRA configuration and training setup
  - Defines target modules for LFM2 architecture: attention + GLU + conv layers
- **SFT with Unsloth notebook:** https://github.com/Liquid4All/cookbook/blob/main/finetuning/notebooks/sft_with_unsloth.ipynb
  - Alternative fine-tuning approach using Unsloth for 2-5x faster training
  - Uses 16-bit LoRA with gradient checkpointing

### Other Liquid AI Models (Evaluated, Not Used)
- **LFM2-8B-A1B (MoE):** https://huggingface.co/LiquidAI/LFM2-8B-A1B
  - 8B total params, 1B active (Mixture of Experts)
  - Considered as teacher model but too large for 24 GB Mac (~16 GB for weights alone)
- **LFM2-2.6B:** https://huggingface.co/LiquidAI/LFM2-2.6B
  - Evaluated as larger alternative, would fit (~5.2 GB) but tight with LoRA + optimizer
- **Full model catalog:** https://huggingface.co/LiquidAI

---

## 2. Fine-Tuning Framework

### TRL — SFTTrainer (Supervised Fine-Tuning)
```python
from trl import SFTConfig, SFTTrainer
trainer = SFTTrainer(model=model, args=training_args, peft_config=peft_config, ...)
```
- **What:** HuggingFace library for training language models with reinforcement learning and SFT
- **Docs:** https://huggingface.co/docs/trl
- **Source:** https://github.com/huggingface/trl
- **SFTTrainer guide:** https://huggingface.co/docs/trl/sft_trainer
- **Why we use it:** Liquid AI's officially recommended fine-tuning method
- **Key feature:** Automatically handles chat template application, tokenization, and prompt masking
- **Version note:** TRL v0.29 renamed `max_seq_length` to `max_length` in SFTConfig

### PEFT — LoRA (Low-Rank Adaptation)
```python
from peft import LoraConfig, PeftModel
```
- **What:** Parameter-Efficient Fine-Tuning library — adds small trainable adapters to frozen models
- **Docs:** https://huggingface.co/docs/peft
- **Source:** https://github.com/huggingface/peft
- **LoRA conceptual guide:** https://huggingface.co/docs/peft/conceptual_guides/lora
- **LoRA paper:** Hu, E., et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." arXiv:2106.09685
- **Why we use it:** Trains only ~1-5% of parameters — makes fine-tuning possible on a laptop

### LoRA Configuration (from Liquid AI Cookbook)
```python
peft_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj", "w1", "w2", "w3", "in_proj"],
)
```
- **Source:** https://github.com/Liquid4All/cookbook/blob/main/finetuning/notebooks/sft_with_trl.ipynb
- **Target modules explained:**
  - `q_proj, k_proj, v_proj, out_proj` — Multi-Head Attention layers
  - `w1, w2, w3` — GLU (Gated Linear Unit) feed-forward layers
  - `in_proj` — Conv block input projection (unique to Liquid AI architecture)
- **Why these modules:** Liquid AI's architecture is not a standard transformer — it has additional conv and GLU layers. Adapting all layer types gives better results than attention-only LoRA.
- **Note:** Standard transformer LoRA typically only targets `q_proj` and `v_proj`. The expanded target list is specific to LFM2 models.

---

## 3. PyTorch & Apple Silicon

### PyTorch MPS Backend
```python
import torch
torch.backends.mps.is_available()  # True on Apple Silicon
```
- **What:** Metal Performance Shaders — PyTorch's backend for Apple Silicon GPU acceleration
- **Docs:** https://pytorch.org/docs/stable/notes/mps.html
- **Why we use it:** Enables GPU-accelerated training on Mac without NVIDIA hardware
- **Key finding:** MPS saturates at batch size 4 for this model — batch size 8 showed no speed improvement (steps halved but each step took 2x longer)

### HuggingFace Accelerate
```python
# device_map="auto" uses accelerate under the hood
model = AutoModelForCausalLM.from_pretrained(..., device_map="auto")
```
- **What:** Automatic device placement library
- **Docs:** https://huggingface.co/docs/accelerate
- **Why we use it:** Automatically places model on MPS (Mac), CUDA (NVIDIA), or CPU

---

## 4. HuggingFace Transformers

### AutoModelForCausalLM / AutoTokenizer
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
```
- **What:** Auto-classes that load any causal language model from HuggingFace Hub
- **Docs:** https://huggingface.co/docs/transformers
- **Source:** https://github.com/huggingface/transformers
- **Chat templates:** https://huggingface.co/docs/transformers/en/chat_templating
- **Why we use it:** Standard interface for loading and running Liquid AI models

### HuggingFace Datasets
```python
from datasets import Dataset
dataset = Dataset.from_list(examples)
```
- **What:** Library for loading and processing datasets
- **Docs:** https://huggingface.co/docs/datasets
- **Why we use it:** SFTTrainer expects HuggingFace Dataset objects with a "messages" column

---

## 5. Training Data

### Dataset Source
- **Origin:** Generated by the MLX sibling project using Qwen3-VL-32B
- **HuggingFace dataset:** `FaroukMoc2/email_spam-qwen3-vl-32b`
  - Source: https://huggingface.co/datasets/FaroukMoc2/email_spam-qwen3-vl-32b
- **Size:** 3,200 training + 800 test examples
- **Format:** JSONL with chat-style messages (`system`, `user`, `assistant` roles)
- **Why reused:** The JSONL chat format is model-agnostic — works with any model that supports chat templates

### Original Email Dataset
- **Source:** Kaggle spam email dataset (193,852 emails)
- **CSV path:** `data/spam_Emails_data.csv` (symlinked from spam-xai-project)

---

## 6. Gradio Web Interface

### Gradio
```python
import gradio as gr
with gr.Blocks() as demo:
    ...
demo.launch()
```
- **What:** Python library for building ML web interfaces
- **Docs:** https://www.gradio.app/docs
- **Source:** https://github.com/gradio-app/gradio
- **Why we use it:** Quick web UI for email classification — same as MLX version for consistency

---

## 7. Performance Findings (Empirical)

These findings were discovered during development on a MacBook Pro M4 Pro with 24 GB unified memory:

| Finding | Details |
|---------|---------|
| MPS batch size sweet spot | Batch size 4 is optimal. Batch size 8 halved steps but doubled time per step — GPU saturated. |
| Memory usage | ~7-8 GB during training (1.2B model bf16 + LoRA + optimizer + activations) |
| Training speed | ~0.34 it/s at batch size 4 on MPS |
| Model load time | 30-60 seconds for initial model loading into memory |
| MLX vs PyTorch MPS | MLX (used in sibling project) is significantly faster for Apple Silicon — purpose-built vs compatibility layer |
| No orphaned ports | Unlike MLX version (which spawns llama-server), PyTorch loads in-process — clean shutdown |
| TRL v0.29 breaking change | `max_seq_length` renamed to `max_length` in SFTConfig |
| LFM2 layer names | Uses `out_proj` (not `o_proj` like standard transformers) |

---

## 8. Comparison with MLX Version

| Aspect | MLX Version | Liquid AI Version |
|--------|-------------|-------------------|
| Model | Qwen3.5-0.8B (4-bit quantized) | LFM2.5-1.2B-Instruct (bf16) |
| Architecture | Transformer | Liquid Neural Network (state-space + attention + conv) |
| Framework | Apple MLX + mlx-lm | PyTorch + HuggingFace Transformers + TRL + PEFT |
| Fine-tuning tool | mlx-lm LoRA CLI | TRL SFTTrainer + PEFT LoRA |
| Training speed | ~10-20 min | ~37 min (1 epoch), ~2 hrs (3 epochs) |
| Memory usage | ~3-4 GB | ~7-8 GB |
| Platform | Apple Silicon only | Any platform (Mac MPS, NVIDIA CUDA, CPU) |
| Model serving | Spawns llama-server (can leak ports) | In-process PyTorch (clean shutdown) |
| LoRA targets | Attention layers only | Attention + GLU + Conv (8 module types) |
| Training data | Same (model-agnostic JSONL format) | Same (copied from MLX project) |
| Gradio UI | Identical | Identical |

---

## Academic Citations (for Paper)

```
Hu, E., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021).
  LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.

Liquid AI. (2025). LFM2: Liquid Foundation Models 2. arXiv:2511.23404.

Liquid AI. (2026). Liquid AI Cookbook: Fine-tuning notebooks.
  https://github.com/Liquid4All/cookbook

Liquid AI. (2026). LFM2.5-1.2B-Instruct model card.
  https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct

von Werra, L., et al. (2020). TRL: Transformer Reinforcement Learning.
  https://github.com/huggingface/trl

Mangrulkar, S., et al. (2022). PEFT: Parameter-Efficient Fine-Tuning.
  https://github.com/huggingface/peft

Wolf, T., et al. (2020). Transformers: State-of-the-Art Natural Language Processing.
  Proceedings of EMNLP 2020 (Systems Demonstrations), pp. 38-45.
  https://github.com/huggingface/transformers

Paszke, A., et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library.
  Advances in Neural Information Processing Systems 32, pp. 8024-8035.
```