---
license: cc-by-nc-sa-4.0
language: en
tags:
- text-generation
- causal-lm
- lora
- tulu
base_model: allenai/tulu-2-7b
model_type: llama
library_name: transformers
pipeline_tag: text-generation
---
License: CC BY-NC-SA 4.0. Rights belong to Javad Taghia (taghia.javad@gmail.com).
# Tulu Laptop Finetune + W&B
Minimal setup to finetune a laptop-friendly Tulu checkpoint with QLoRA and track runs in Weights & Biases.
## Prereqs
- A recent NVIDIA GPU with CUDA for 4-bit (bitsandbytes); set `--use_4bit true`. On CPU/MPS (the default), set `--use_4bit false`, but expect much slower/limited runs.
- Conda (Miniconda/Anaconda).
- A Weights & Biases account + API key.
## Setup
1) Create the env (Conda)
```bash
conda env create -f environment.yml
conda activate deeai
```
2) Add secrets (keep `.env` out of git)
```bash
cp .env.example .env
# Edit .env with your WANDB_API_KEY / project / entity
# Optionally set BASE_MODEL_CACHE to choose where HF downloads models
```
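For reference, a filled-in `.env` might look like the fragment below. All values are placeholders; the variable names come from the W&B and model-cache sections later in this file.

```bash
# .env — keep this file out of git
WANDB_API_KEY=your-api-key-here
WANDB_PROJECT=tulu-laptop-finetune
WANDB_ENTITY=your-wandb-username
# Optional: put Hugging Face model downloads on an external drive
BASE_MODEL_CACHE=/path/to/downloaded_base_models
```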
3) Verify packages (optional if you prefer pip)
```bash
pip install -r requirements.txt
```
- If you see `LlamaTokenizer requires the SentencePiece library`, install it in the env:
```bash
pip install sentencepiece
```
- If you get a `torch.load` vulnerability error, either upgrade torch (>=2.6 when available for your platform) or ensure `safetensors` is installed; this repo prefers safetensors by default:
```bash
pip install safetensors
```
## Run a quick finetune
The defaults use `allenai/tulu-2-7b` with a small instruction dataset (`mlabonne/guanaco-llama2-1k`) and 4-bit QLoRA, which keeps memory needs closer to laptop GPUs. The example below overrides those defaults for a CPU-only run:
```bash
python train_tulu.py \
--output_dir outputs/tulu-lora \
--offload_folder offload \
--device cpu \
--max_seq_length 512 \
--per_device_batch_size 1 \
--gradient_accumulation_steps 16 \
--no-use_4bit \
--instruction_field instruction \
--input_field input \
--output_field output
```
Key flags:
- Pass `--no-use_4bit` when bitsandbytes/CUDA are unavailable; 4-bit must stay disabled on Mac (MPS) and CPU.
- `--dataset_name` to try another instruction set (any HF dataset with `instruction/input/output` fields).
- `--model_name` if you want a different Tulu variant (e.g., `allenai/tulu-2-dpo-7b`) or a smaller model for constrained hardware (e.g., `TinyLlama/TinyLlama-1.1B-Chat-v1.0` on Mac MPS).
- `--offload_folder` sets where to offload weights when `device_map="auto"` (ensure it has space). Default `offload/` lives in this repo so it stays alongside the project.
- `--instruction_field/--input_field/--output_field` let you match custom dataset column names; defaults assume `instruction/input/output`. For text-only datasets, set `--instruction_field text --output_field text`.
- `--device` can force `cpu`, `mps`, `cuda`, or `auto` (default). Use `--device mps` with a smaller fp16 model (e.g., TinyLlama) to fit memory; offloading is disabled on MPS/CPU.
- `--torch_dtype` can force the dtype (`float16/float32/bfloat16`); on MPS use `float16` to avoid unsupported bf16 weights.
- `--cpu_threads` limits CPU threads (default 4) when running on CPU so you don’t overload your machine.
- MPS (Mac) note: bf16 mixed precision isn’t supported, so the script automatically falls back to fp32 on MPS. Keep `--no-use_4bit` on Mac; offloading is disabled on MPS (the model stays on device).
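To make the `--device`/`--torch_dtype` fallbacks above concrete, here is a minimal sketch of the selection logic they describe. `resolve_device` and `resolve_dtype` are hypothetical names for illustration, not functions from `train_tulu.py`:

```python
def resolve_device(requested, cuda_available=False, mps_available=False):
    """Pick a device: honor an explicit request, else prefer cuda > mps > cpu."""
    if requested != "auto":
        return requested
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def resolve_dtype(requested, device):
    """Pick a dtype: MPS cannot train in bf16, so fall back to fp32 there."""
    if device == "mps" and requested == "bfloat16":
        return "float32"
    if requested == "auto":
        # half precision on accelerators, full precision on CPU
        return "float16" if device in ("cuda", "mps") else "float32"
    return requested
```

The real script may order its checks differently; the point is that explicit flags win and MPS silently downgrades bf16.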
## How W&B is used
- `train_tulu.py` loads `.env`, logs into W&B, and reports through `Trainer(report_to=["wandb"])`.
- Ensure `WANDB_API_KEY`, `WANDB_PROJECT`, and (optionally) `WANDB_ENTITY` are set in `.env`.
- Each run captures hyperparameters and metrics; check the W&B UI for live loss curves and checkpoints.
- Additional summaries are logged: `train_duration_seconds`, `train_examples`, `estimated_tokens`, `precision_mode` (bf16/fp16/fp32), `use_4bit`, `model_name`, `dataset_name`, `per_device_batch_size`, `gradient_accumulation_steps`, and `max_seq_length`.
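"Loads `.env`" amounts to reading `KEY=VALUE` pairs into the process environment before `wandb` initializes; `python-dotenv` does this for you, but the behavior is roughly this stdlib-only sketch (`load_env` is a hypothetical helper, not the repo's code):

```python
import os


def load_env(path=".env"):
    """Minimal .env loader: KEY=VALUE lines; blanks and '#' comments are skipped.
    Existing environment variables win (setdefault), matching python-dotenv's
    default override=False behavior."""
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```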
## Training objective and base model
- Objective: standard causal LM cross-entropy. The model predicts the next token; cross-entropy measures how much probability mass it assigns to the true token. Minimizing it (maximum likelihood) encourages the model to imitate the target outputs in your instruction data. No rewards/RLHF here—pure supervised finetuning.
- Base model: a Tulu checkpoint (LLaMA-style architecture) from the Hub (default `allenai/tulu-2-7b`). We train LoRA adapters on top of the frozen base (optionally 4-bit on CUDA), keeping the adapter small and the base intact.
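Both points can be made concrete with a few lines of stdlib Python (a toy illustration, not code from this repo):

```python
import math


def next_token_cross_entropy(logits, target):
    """Causal-LM loss at one position: -log softmax(logits)[target].
    Lower loss means more probability mass on the true next token."""
    m = max(logits)  # subtract the max to stabilize the softmax
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def lora_trainable_params(d, k, r):
    """A frozen d x k weight gets a rank-r update B @ A with B (d x r) and
    A (r x k), so only r * (d + k) parameters train instead of d * k."""
    return r * (d + k)


# A uniform guess over 4 tokens costs ln(4) nats; a confident correct guess
# costs near 0. For a 4096 x 4096 projection at rank r=16, LoRA trains
# 131,072 parameters versus 16,777,216 frozen ones (~0.8%).
```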
## Model cache location
- Base model weights download to the Hugging Face cache. You can point downloads to an external directory by setting `BASE_MODEL_CACHE` in `.env` (e.g., `/Volumes/JTQ-s/______GITLAB____/downloaded_base_models`); the script maps this to `HF_HOME`/`TRANSFORMERS_CACHE` before loading models.
- If `BASE_MODEL_CACHE` is not set, the default HF cache is used (typically `~/.cache/huggingface/hub`).
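The cache redirection described above can be sketched as follows (`apply_model_cache` is a hypothetical helper for illustration; `HF_HOME` is honored by current Hugging Face libraries, `TRANSFORMERS_CACHE` by older transformers versions):

```python
import os


def apply_model_cache(env=None):
    """If BASE_MODEL_CACHE is set, point the Hugging Face cache variables at it
    before any model is loaded; otherwise leave the default cache in place."""
    env = os.environ if env is None else env
    cache = env.get("BASE_MODEL_CACHE")
    if cache:
        env.setdefault("HF_HOME", cache)
        env.setdefault("TRANSFORMERS_CACHE", cache)
    return env.get("HF_HOME")
```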
## Output
- Finetuned adapters + tokenizer are written to `outputs/tulu-lora` (configurable via `--output_dir`).
- `outputs/` is tracked via Git LFS (`.gitattributes`), so weights can be committed and pushed to the Hub. Run `git lfs install` once, then `git add outputs/...` before committing.
## Evaluation (inference/compare)
- Quick smoke test with the saved adapter (edit `lora_dir` or pass flags):
```bash
python evaluation/simple_inference.py \
--lora_dir outputs/tinyllama-lora \
--device auto \
--torch_dtype auto \
--max_new_tokens 128 \
--temperature 0.7 \
--top_p 0.9
```
- Compare base vs. LoRA outputs side-by-side:
```bash
python evaluation/compare_lora.py \
--base_model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--lora_dir outputs/tinyllama-lora \
--prompt "Explain LoRA in one sentence."
```
For CPU or constrained machines, force CPU + fp32 (and add `--offload_dir offload` if using `device_map=auto`):
```bash
python evaluation/compare_lora.py \
--base_model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--lora_dir outputs/tinyllama-lora \
--prompt "Explain LoRA in one sentence." \
--device cpu \
--torch_dtype float32
```
Optional flags: `--max_new_tokens`, `--temperature`, `--top_p`, `--torch_dtype`, `--device`, `--offload_dir`.
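For intuition about what `--temperature` and `--top_p` do at generation time, here is a stdlib-only sketch of temperature scaling followed by nucleus (top-p) sampling. It is illustrative only, not the evaluation scripts' actual code:

```python
import math
import random


def sample_top_p(logits, temperature=0.7, top_p=0.9):
    """Sample one token id: divide logits by temperature (must be > 0),
    softmax, keep the smallest set of tokens whose cumulative probability
    reaches top_p, then sample from that renormalized 'nucleus'."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    z = sum(weights)
    probs = [w / z for w in weights]
    # sort token ids by probability, descending
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # sample proportionally within the nucleus
    total = sum(probs[i] for i in kept)
    r = random.random() * total
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

Lower temperature sharpens the distribution; lower top_p shrinks the nucleus. Both push generation toward the model's top choices.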
## Troubleshooting
- OOM? Reduce `--max_seq_length`, increase `--gradient_accumulation_steps`, or train on fewer examples (e.g., a tiny instruction set like `mlabonne/guanaco-llama2-1k`, or cap samples via `--dataset_name your/dataset --max_train_samples 500` in the script).
- bitsandbytes import errors on macOS/CPU: run with `--use_4bit false` or use a Linux+CUDA machine.
- bitsandbytes install error? We pin to `0.42.0`, the latest widely distributed wheel. If you cannot install it (CPU-only/MPS), remove it from `requirements.txt` and set `--use_4bit false`.
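The OOM knobs above trade memory for wall-clock time without changing the effective batch size, which is what gradient accumulation preserves (`effective_batch_size` is a hypothetical helper for illustration):

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps,
                         num_devices=1):
    """Number of examples contributing to each optimizer step: gradients are
    accumulated over this many examples before the weights are updated."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


# The quick-finetune command above: 1 * 16 = 16 examples per optimizer step.
# The TinyLlama commands later in this file: 2 * 8 = 16 as well, so the
# optimization behaves the same while memory use per forward pass differs.
```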
## CUDA quickstart (TinyLlama)
To use 4-bit QLoRA on a CUDA machine, upgrade to CUDA-enabled builds first:
```bash
pip install --upgrade "torch==2.2.*" "torchvision==0.17.*" "torchaudio==2.2.*" --index-url https://download.pytorch.org/whl/cu121
pip install --upgrade "bitsandbytes>=0.43.1"
pip install --upgrade "transformers>=4.40.0"
```
Then run a short 4-bit LoRA finetune:
```bash
python train_tulu.py \
--model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--output_dir outputs/tinyllama-lora \
--offload_folder offload \
--device cuda \
--torch_dtype auto \
--max_seq_length 512 \
--per_device_batch_size 2 \
--gradient_accumulation_steps 8 \
--num_train_epochs 1 \
--use_4bit \
--instruction_field instruction \
--input_field input \
--output_field output
```
## CPU-only variant
The same run on CPU only (4-bit disabled and fp32 forced; expect it to be slow):
```bash
python train_tulu.py \
--model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--output_dir outputs/tinyllama-lora \
--offload_folder offload \
--device cpu \
--torch_dtype float32 \
--max_seq_length 512 \
--per_device_batch_size 2 \
--gradient_accumulation_steps 8 \
--num_train_epochs 1 \
--no-use_4bit \
--instruction_field instruction \
--input_field input \
--output_field output
```