---
license: cc-by-nc-sa-4.0
language: en
tags:
  - text-generation
  - causal-lm
  - lora
  - tulu
base_model: allenai/tulu-2-7b
model_type: llama
library_name: transformers
pipeline_tag: text-generation
---

# License: CC BY-NC-SA 4.0. Rights belong to Javad Taghia (taghia.javad@gmail.com).

# Tulu Laptop Finetune + W&B

Minimal setup to finetune a laptop-friendly Tulu checkpoint with QLoRA and track runs in Weights & Biases.

## Prereqs
- A recent NVIDIA GPU with CUDA for 4-bit training (bitsandbytes): set `--use_4bit true`. On CPU/MPS (the default), set `--use_4bit false` and expect much slower, more limited runs.
- Conda (Miniconda/Anaconda).
- A Weights & Biases account + API key.

## Setup
1) Create the env (Conda)
```bash
conda env create -f environment.yml
conda activate deeai
```
2) Add secrets (keep `.env` out of git)
```bash
cp .env.example .env
# Edit .env with your WANDB_API_KEY / project / entity
# Optionally set BASE_MODEL_CACHE to choose where HF downloads models
```
3) Verify packages (or install them with pip if you prefer it over Conda)
```bash
pip install -r requirements.txt
```
- If you see `LlamaTokenizer requires the SentencePiece library`, install it in the env:
```bash
pip install sentencepiece
```
- If you get a `torch.load` vulnerability error, either upgrade torch (>=2.6 when available for your platform) or ensure `safetensors` is installed; this repo prefers safetensors by default:
```bash
pip install safetensors
```

## Run a quick finetune
The defaults use `allenai/tulu-2-7b` with a small instruction dataset (`mlabonne/guanaco-llama2-1k`). On a CUDA GPU, 4-bit QLoRA keeps memory needs closer to what laptop GPUs offer. The example below instead forces CPU and disables 4-bit:
```bash
python train_tulu.py \
  --output_dir outputs/tulu-lora \
  --offload_folder offload \
  --device cpu \
  --max_seq_length 512 \
  --per_device_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --no-use_4bit \
  --instruction_field instruction \
  --input_field input \
  --output_field output
```
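
As a quick sanity check on the batch flags above, here is a small sketch of the arithmetic behind them. The helper names are hypothetical and not part of `train_tulu.py`; the token figure is only an upper bound, since it assumes every sequence reaches `max_seq_length`.

```python
def effective_batch_size(per_device_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Number of examples that contribute to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def max_tokens_per_step(batch_size: int, max_seq_length: int) -> int:
    """Upper bound on tokens per optimizer step (all sequences at full length)."""
    return batch_size * max_seq_length


# Values taken from the command above: batch size 1, 16 accumulation steps, 512 tokens.
batch = effective_batch_size(per_device_batch_size=1, gradient_accumulation_steps=16)
print(batch)                            # 16
print(max_tokens_per_step(batch, 512))  # 8192
```

Raising `--gradient_accumulation_steps` keeps the effective batch size up without holding more examples in memory at once.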

Key flags:
- `--no-use_4bit` disables 4-bit loading; use it when bitsandbytes/CUDA are unavailable, which includes CPU and Mac MPS.
- `--dataset_name` to try another instruction set (any HF dataset with `instruction/input/output` fields).
- `--model_name` if you want a different Tulu variant (e.g., `allenai/tulu-2-dpo-7b`) or a smaller model for constrained hardware (e.g., `TinyLlama/TinyLlama-1.1B-Chat-v1.0` on Mac MPS).
- `--offload_folder` sets where to offload weights when `device_map="auto"` (ensure it has space). Default `offload/` lives in this repo so it stays alongside the project.
- `--instruction_field/--input_field/--output_field` let you match custom dataset column names; defaults assume `instruction/input/output`. For text-only datasets, set `--instruction_field text --output_field text`.
- `--device` can force `cpu`, `mps`, `cuda`, or `auto` (default). Use `--device mps` with a smaller fp16 model (e.g., TinyLlama) to fit memory; offloading is disabled on MPS/CPU.
- `--torch_dtype` can force the dtype (`float16/float32/bfloat16`); on MPS use `float16` to avoid unsupported bf16 weights.
- `--cpu_threads` limits CPU threads (default 4) when running on CPU so you don’t overload your machine.
- MPS (Mac) note: bfloat16 mixed precision isn’t supported, so the script falls back to fp32 automatically on MPS. Keep `--no-use_4bit` on Mac; offloading is disabled on MPS (the model stays on device).
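
The device rules in the flags above can be summarized in a short sketch. This is a hypothetical helper written from the notes in this README, not the logic in `train_tulu.py`; in particular, the dtype chosen for `auto` on each device is an assumption.

```python
def resolve_precision(device: str, requested_dtype: str = "auto",
                      use_4bit: bool = False) -> dict:
    """Hypothetical helper mirroring the README's CPU/MPS/CUDA notes."""
    if device not in ("cpu", "mps", "cuda"):
        raise ValueError(f"unsupported device: {device!r}")

    if device == "mps" and requested_dtype == "bfloat16":
        # bf16 is not supported on MPS; fall back to fp32.
        dtype = "float32"
    elif requested_dtype == "auto":
        # Assumption: fp16 on MPS, fp32 on CPU, and "auto" left as-is on CUDA.
        dtype = {"mps": "float16", "cpu": "float32"}.get(device, "auto")
    else:
        dtype = requested_dtype

    return {
        "device": device,
        "dtype": dtype,
        # 4-bit loading needs bitsandbytes + CUDA.
        "use_4bit": use_4bit and device == "cuda",
        # Offloading is disabled on MPS/CPU.
        "offload": device == "cuda",
    }


print(resolve_precision("mps", "bfloat16", use_4bit=True))
```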

## How W&B is used
- `train_tulu.py` loads `.env`, logs into W&B, and reports through `Trainer(report_to=["wandb"])`.
- Ensure `WANDB_API_KEY`, `WANDB_PROJECT`, and (optionally) `WANDB_ENTITY` are set in `.env`.
- Each run captures hyperparameters and metrics; check the W&B UI for live loss curves and checkpoints.
- Additional summaries are logged: `train_duration_seconds`, `train_examples`, `estimated_tokens`, `precision_mode` (bf16/fp16/fp32), `use_4bit`, `model_name`, `dataset_name`, `per_device_batch_size`, `gradient_accumulation_steps`, and `max_seq_length`.
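
To catch a missing key before a run starts, a pre-flight check along these lines may help. It is a minimal sketch using only the standard library; `missing_wandb_vars` is a hypothetical helper, and it assumes the values from `.env` have already been loaded into the environment.

```python
import os

# WANDB_ENTITY is optional per the notes above, so it is not checked here.
REQUIRED_VARS = ("WANDB_API_KEY", "WANDB_PROJECT")


def missing_wandb_vars(env=None) -> list:
    """Return the required W&B variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with a placeholder key and an empty project name.
example_env = {"WANDB_API_KEY": "placeholder", "WANDB_PROJECT": ""}
print(missing_wandb_vars(example_env))  # ['WANDB_PROJECT']
```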

## Training objective and base model
- Objective: standard causal LM cross-entropy. The model predicts the next token, and the loss is the negative log of the probability it assigns to the true token. Minimizing it (maximum likelihood) encourages the model to imitate the target outputs in your instruction data. No rewards or RLHF are involved; this is pure supervised finetuning.
- Base model: a Tulu checkpoint (LLaMA-style architecture) from the Hub (default `allenai/tulu-2-7b`). We train LoRA adapters on top of the frozen base (optionally 4-bit on CUDA), keeping the adapter small and the base intact.
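
To make the objective concrete, here is a toy version of the next-token cross-entropy in plain Python. It is an illustrative sketch, not the training code (which relies on `Trainer`); the three-token vocabulary and the logits are made up.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of vocabulary scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def causal_lm_loss(logits, token_ids):
    """Mean negative log-likelihood of each next token.

    logits[t] scores the vocabulary after reading token_ids[: t + 1],
    so its target is token_ids[t + 1]; the last row has no target.
    """
    targets = token_ids[1:]
    predictions = logits[:-1]
    nll = [-log_softmax(row)[tgt] for row, tgt in zip(predictions, targets)]
    return sum(nll) / len(nll)


tokens = [0, 1, 2]
uniform = [[0.0, 0.0, 0.0]] * 3                          # no preference
confident = [[0.0, 10.0, 0.0], [0.0, 0.0, 10.0], [0.0, 0.0, 0.0]]

print(causal_lm_loss(uniform, tokens))    # equals log(3), about 1.0986
print(causal_lm_loss(confident, tokens))  # much smaller: right tokens get most mass
```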

## Model cache location
- Base model weights download to the Hugging Face cache. You can point downloads to an external directory by setting `BASE_MODEL_CACHE` in `.env` (e.g., `/Volumes/JTQ-s/______GITLAB____/downloaded_base_models`); the script maps this to `HF_HOME`/`TRANSFORMERS_CACHE` before loading models.
- If `BASE_MODEL_CACHE` is not set, the default HF cache is used (typically `~/.cache/huggingface/hub`).
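
The mapping described above could look roughly like this. It is a sketch under assumptions: `apply_model_cache` is a hypothetical name, `/data/hf-cache` is a placeholder path, and the real script may do this differently. The variables generally need to be set before the Hugging Face libraries are imported.

```python
import os


def apply_model_cache(env: dict) -> dict:
    """Return a copy of env with the HF cache variables pointed at BASE_MODEL_CACHE."""
    updated = dict(env)
    cache_dir = updated.get("BASE_MODEL_CACHE")
    if cache_dir:
        cache_dir = os.path.expanduser(cache_dir)
        updated["HF_HOME"] = cache_dir
        updated["TRANSFORMERS_CACHE"] = cache_dir
    # With BASE_MODEL_CACHE unset, nothing changes and the default HF cache applies.
    return updated


print(apply_model_cache({"BASE_MODEL_CACHE": "/data/hf-cache"}))
# To apply it for real: os.environ.update(apply_model_cache(dict(os.environ)))
```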

## Output
- Finetuned adapters + tokenizer are written to `outputs/tulu-lora` (configurable via `--output_dir`).
- `outputs/` is tracked via Git LFS (`.gitattributes`), so weights can be committed and pushed to the Hub. Run `git lfs install` once, then `git add outputs/...` before committing.

## Evaluation (inference/compare)
- Quick smoke test with the saved adapter (edit `lora_dir` or pass flags):
```bash
python evaluation/simple_inference.py \
  --lora_dir outputs/tinyllama-lora \
  --device auto \
  --torch_dtype auto \
  --max_new_tokens 128 \
  --temperature 0.7 \
  --top_p 0.9
```
- Compare base vs. LoRA outputs side-by-side:
```bash
python evaluation/compare_lora.py \
  --base_model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --lora_dir outputs/tinyllama-lora \
  --prompt "Explain LoRA in one sentence."
```
For CPU or constrained machines, force CPU + fp32 (and add `--offload_dir offload` if using `device_map=auto`):
```bash
python evaluation/compare_lora.py \
  --base_model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --lora_dir outputs/tinyllama-lora \
  --prompt "Explain LoRA in one sentence." \
  --device cpu \
  --torch_dtype float32
```
Optional flags: `--max_new_tokens`, `--temperature`, `--top_p`, `--torch_dtype`, `--device`, `--offload_dir`.

## Troubleshooting
- OOM? Reduce `max_seq_length` or `per_device_batch_size` (raising `gradient_accumulation_steps` to keep the effective batch size), or switch to a smaller dataset (e.g., use a tiny instruction set like `mlabonne/guanaco-llama2-1k`, or subset your dataset with `--dataset_name your/dataset --max_train_samples 500` in code/script).
- bitsandbytes import errors on macOS/CPU: run with `--use_4bit false` or use a Linux+CUDA machine.
- bitsandbytes install error? We pin to `0.42.0`, the latest widely distributed wheel. If you cannot install it (CPU-only/MPS), remove it from `requirements.txt` and set `--use_4bit false`.
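
When deciding between a smaller model and 4-bit loading, a rough estimate of weight memory helps. The sketch below counts weights only and uses nominal parameter counts (7e9 and 1.1e9 are approximations); activations, LoRA optimizer state, and quantization overhead come on top, so treat the numbers as lower bounds.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "bfloat16": 2.0, "4bit": 0.5}


def weight_memory_gib(num_params: float, precision: str) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 1024**3


for name, params in [("7B model", 7e9), ("1.1B model", 1.1e9)]:
    for precision in ("float32", "float16", "4bit"):
        gib = weight_memory_gib(params, precision)
        print(f"{name:10s} {precision:8s} {gib:5.1f} GiB")
```

By this estimate a 7B model needs about 13 GiB for fp16 weights but a little over 3 GiB in 4-bit, which is why 4-bit on CUDA or a 1.1B model on CPU/MPS are the practical laptop options.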


## Example: TinyLlama on CUDA
Upgrade to PyTorch wheels built for CUDA 12.1, plus newer bitsandbytes and transformers:
```bash
pip install --upgrade "torch==2.2.*" "torchvision==0.17.*" "torchaudio==2.2.*" --index-url https://download.pytorch.org/whl/cu121
pip install --upgrade "bitsandbytes>=0.43.1"
pip install --upgrade "transformers>=4.40.0"
```
Then run a one-epoch 4-bit finetune:
```bash
python train_tulu.py \
  --model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --output_dir outputs/tinyllama-lora \
  --offload_folder offload \
  --device cuda \
  --torch_dtype auto \
  --max_seq_length 512 \
  --per_device_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --num_train_epochs 1 \
  --use_4bit \
  --instruction_field instruction \
  --input_field input \
  --output_field output
```

## Example: TinyLlama on CPU only
```bash
python train_tulu.py \
  --model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --output_dir outputs/tinyllama-lora \
  --offload_folder offload \
  --device cpu \
  --torch_dtype auto \
  --max_seq_length 512 \
  --per_device_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --num_train_epochs 1 \
  --no-use_4bit \
  --instruction_field instruction \
  --input_field input \
  --output_field output
```