	---
	license: cc-by-nc-sa-4.0
	language: en
	tags:
	- text-generation
	- causal-lm
	- lora
	- tulu
	base_model: allenai/tulu-2-7b
	model_type: llama
	library_name: transformers
	pipeline_tag: text-generation
	---

	# License: CC BY-NC-SA 4.0. Rights belong to Javad Taghia (taghia.javad@gmail.com).

	# Tulu Laptop Finetune + W&B

	Minimal setup to finetune a laptop-friendly Tulu checkpoint with QLoRA and track runs in Weights & Biases.

	## Prereqs
	- A recent NVIDIA GPU with CUDA is needed for 4-bit (bitsandbytes) training; enable it with `--use_4bit true`. On CPU/MPS (default), set `--use_4bit false` and expect much slower, more limited runs.
	- Conda (Miniconda/Anaconda).
	- A Weights & Biases account + API key.

	## Setup
	1) Create the env (Conda)
	```bash
	conda env create -f environment.yml
	conda activate deeai
	```
	2) Add secrets (keep `.env` out of git)
	```bash
	cp .env.example .env
	# Edit .env with your WANDB_API_KEY / project / entity
	# Optionally set BASE_MODEL_CACHE to choose where HF downloads models
	```
	3) Verify packages (optional; use this if you prefer pip)
	```bash
	pip install -r requirements.txt
	```
	- If you see `LlamaTokenizer requires the SentencePiece library`, install it in the env:
	```bash
	pip install sentencepiece
	```
	- If you get a `torch.load` vulnerability error, either upgrade torch (>=2.6 when available for your platform) or ensure `safetensors` is installed; this repo prefers safetensors by default:
	```bash
	pip install safetensors
	```

	## Run a quick finetune
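	By default the script uses `allenai/tulu-2-7b` with a small instruction dataset (`mlabonne/guanaco-llama2-1k`); 4-bit QLoRA on a CUDA GPU keeps memory needs closer to what laptop GPUs offer. The example below instead forces CPU and disables 4-bit, so it runs without CUDA, only much more slowly.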
	```bash
	python train_tulu.py \
	--output_dir outputs/tulu-lora \
	--offload_folder offload \
	--device cpu \
	--max_seq_length 512 \
	--per_device_batch_size 1 \
	--gradient_accumulation_steps 16 \
	--no-use_4bit \
	--instruction_field instruction \
	--input_field input \
	--output_field output
	```
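
The three `--*_field` flags in the command above name the dataset columns that hold the instruction, the optional input, and the target output. This README does not show how `train_tulu.py` joins them into a training string, so the sketch below is only an illustration: the function name `build_example` and the `### Instruction` / `### Input` / `### Response` template are assumptions, not the script's actual format.

```python
def build_example(row, instruction_field="instruction",
                  input_field="input", output_field="output"):
    """Join one dataset row into a single training string (hypothetical template)."""
    instruction = str(row.get(instruction_field, "")).strip()
    # The input column is optional; skip its section when missing or empty.
    extra = str(row.get(input_field) or "").strip()
    target = str(row.get(output_field, "")).strip()

    parts = [f"### Instruction:\n{instruction}"]
    if extra:
        parts.append(f"### Input:\n{extra}")
    parts.append(f"### Response:\n{target}")
    return "\n\n".join(parts)


row = {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
print(build_example(row))
```

Passing different field names (for example a dataset whose columns are called `prompt` and `completion`) only changes which keys are read; the template itself stays the same.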

	Key flags:
	- `--no-use_4bit` disables 4-bit loading; use it when bitsandbytes/CUDA are unavailable. On CPU or Mac MPS, 4-bit must stay disabled.
	- `--dataset_name` to try another instruction set (any HF dataset with `instruction/input/output` fields).
	- `--model_name` if you want a different Tulu variant (e.g., `allenai/tulu-2-dpo-7b`) or a smaller model for constrained hardware (e.g., `TinyLlama/TinyLlama-1.1B-Chat-v1.0` on Mac MPS).
	- `--offload_folder` sets where to offload weights when `device_map="auto"` (ensure it has space). Default `offload/` lives in this repo so it stays alongside the project.
	- `--instruction_field/--input_field/--output_field` let you match custom dataset column names; defaults assume `instruction/input/output`. For text-only datasets, set `--instruction_field text --output_field text`.
	- `--device` can force `cpu`, `mps`, `cuda`, or `auto` (default). Use `--device mps` with a smaller fp16 model (e.g., TinyLlama) to fit memory; offloading is disabled on MPS/CPU.
	- `--torch_dtype` can force the dtype (`float16/float32/bfloat16`); on MPS use `float16` to avoid unsupported bf16 weights.
	- `--cpu_threads` limits CPU threads (default 4) when running on CPU so you don’t overload your machine.
	- MPS (Mac) note: bfloat16 mixed precision isn’t supported, so the script falls back to fp32 automatically on MPS. Keep `--no-use_4bit` on Mac; offloading is disabled on MPS (the model stays on device).
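
The device, dtype, 4-bit, and offload rules in the list above interact, so it can help to see them resolved in one place. The sketch below is not taken from `train_tulu.py`; `RuntimePlan` and `plan_runtime` are hypothetical names, and the mapping of `--torch_dtype auto` (fp16 on CUDA/MPS, fp32 on CPU) is an assumption. Only the 4-bit, bf16-on-MPS, and offload rules come from this README.

```python
from dataclasses import dataclass


@dataclass
class RuntimePlan:
    device: str
    dtype: str
    use_4bit: bool
    offload: bool


def plan_runtime(device, torch_dtype, use_4bit, cuda_available, mps_available):
    """Resolve the CLI flags into one consistent plan (hypothetical helper)."""
    if device == "auto":
        # Prefer CUDA, then Apple MPS, then plain CPU.
        device = "cuda" if cuda_available else "mps" if mps_available else "cpu"

    on_cuda = device == "cuda"
    # 4-bit (bitsandbytes) needs CUDA, so it is switched off on CPU/MPS.
    use_4bit = use_4bit and on_cuda

    if torch_dtype == "auto":
        # Assumption: "auto" means fp16 on accelerators and fp32 on CPU.
        dtype = "float32" if device == "cpu" else "float16"
    elif device == "mps" and torch_dtype == "bfloat16":
        # README rule: bf16 is unsupported on MPS, so fall back to fp32.
        dtype = "float32"
    else:
        dtype = torch_dtype

    # README rule: offloading is disabled on MPS/CPU.
    return RuntimePlan(device=device, dtype=dtype, use_4bit=use_4bit, offload=on_cuda)


print(plan_runtime("auto", "bfloat16", True, cuda_available=False, mps_available=True))
print(plan_runtime("cuda", "auto", True, cuda_available=True, mps_available=False))
```

Keeping the decision in a single pure function makes it easy to check a flag combination before starting a long run.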

	## How W&B is used
	- `train_tulu.py` loads `.env`, logs into W&B, and reports through `Trainer(report_to=["wandb"])`.
	- Ensure `WANDB_API_KEY`, `WANDB_PROJECT`, and (optionally) `WANDB_ENTITY` are set in `.env`.
	- Each run captures hyperparameters and metrics; check the W&B UI for live loss curves and checkpoints.
	- Additional summaries are logged: `train_duration_seconds`, `train_examples`, `estimated_tokens`, `precision_mode` (bf16/fp16/fp32), `use_4bit`, `model_name`, `dataset_name`, `per_device_batch_size`, `gradient_accumulation_steps`, and `max_seq_length`.
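
Before launching a run it can be useful to confirm that `.env` actually contains the W&B settings listed above. The following is a minimal standard-library sketch, not the loader used by `train_tulu.py`; the helper names `parse_env_text` and `missing_wandb_settings` are hypothetical, and the sample values are placeholders.

```python
REQUIRED_KEYS = ("WANDB_API_KEY", "WANDB_PROJECT")  # WANDB_ENTITY is optional


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_wandb_settings(values):
    """Return the required W&B keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


sample = """
# example .env contents (placeholder values)
WANDB_API_KEY=your-key-here
WANDB_PROJECT=tulu-laptop
"""
settings = parse_env_text(sample)
print("missing:", missing_wandb_settings(settings))
```

To check a real file, read it with `open(".env").read()` and pass the text to `parse_env_text`.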

	## Training objective and base model
	- Objective: standard causal LM cross-entropy. The model predicts the next token, and the loss is the negative log of the probability it assigns to the true token, averaged over the tokens being trained on. Minimizing it (maximum likelihood) encourages the model to imitate the target outputs in your instruction data. No rewards or RLHF here; this is pure supervised finetuning.
	- Base model: a Tulu checkpoint (LLaMA-style architecture) from the Hub (default `allenai/tulu-2-7b`). We train LoRA adapters on top of the frozen base (optionally 4-bit on CUDA), keeping the adapter small and the base intact.
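
To make the objective concrete, here is a small worked example of per-token cross-entropy on a toy three-token vocabulary. It uses only the standard library and made-up logits; the real training loop computes the same quantity over the model's full vocabulary.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_cross_entropy(logits, target_index):
    """Negative log-probability assigned to the true next token."""
    return -math.log(softmax(logits)[target_index])


# Toy vocabulary of three tokens; index 0 is the true next token.
confident = token_cross_entropy([4.0, 1.0, 0.0], target_index=0)
unsure = token_cross_entropy([1.0, 1.0, 1.0], target_index=0)
print(f"confident: {confident:.3f}  unsure: {unsure:.3f}")
```

The loss is small when most of the probability sits on the true token and equals `log(3)` when the model spreads it evenly across the three candidates, which is why minimizing it pushes the model toward the target outputs.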

	## Model cache location
	- Base model weights download to the Hugging Face cache. You can point downloads to an external directory by setting `BASE_MODEL_CACHE` in `.env` (e.g., `/Volumes/JTQ-s/______GITLAB____/downloaded_base_models`); the script maps this to `HF_HOME`/`TRANSFORMERS_CACHE` before loading models.
	- If `BASE_MODEL_CACHE` is not set, the default HF cache is used (typically `~/.cache/huggingface/hub`).
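
The cache redirection described above can be sketched as follows. This is an illustration under assumptions, not the code in `train_tulu.py`: the function name `apply_model_cache` is hypothetical, and the example path is a placeholder.

```python
import os


def apply_model_cache(environ):
    """Point the Hugging Face caches at BASE_MODEL_CACHE when it is set."""
    cache_dir = environ.get("BASE_MODEL_CACHE", "").strip()
    if not cache_dir:
        return None  # leave the default HF cache untouched
    cache_dir = os.path.expanduser(cache_dir)
    environ["HF_HOME"] = cache_dir
    environ["TRANSFORMERS_CACHE"] = cache_dir
    return cache_dir


# Demonstrate on a plain dict so the real process environment is not modified.
demo_env = {"BASE_MODEL_CACHE": "/tmp/base_models"}
print(apply_model_cache(demo_env), demo_env["HF_HOME"])
```

In a real script this would be called with `os.environ`, and it should run before the Hugging Face libraries are imported so that they pick up the new location.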

	## Output
	- Finetuned adapters + tokenizer are written to `outputs/tulu-lora` (configurable via `--output_dir`).
	- `outputs/` is tracked via Git LFS (`.gitattributes`), so weights can be committed and pushed to the Hub. Run `git lfs install` once, then `git add outputs/...` before committing.

	## Evaluation (inference/compare)
	- Quick smoke test with the saved adapter (edit `lora_dir` or pass flags):
	```bash
	python evaluation/simple_inference.py \
	--lora_dir outputs/tinyllama-lora \
	--device auto \
	--torch_dtype auto \
	--max_new_tokens 128 \
	--temperature 0.7 \
	--top_p 0.9
	```
	- Compare base vs. LoRA outputs side-by-side:
	```bash
	python evaluation/compare_lora.py \
	--base_model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
	--lora_dir outputs/tinyllama-lora \
	--prompt "Explain LoRA in one sentence."
	```
	For CPU or constrained machines, force CPU + fp32 (and add `--offload_dir offload` if using `device_map=auto`):
	```bash
	python evaluation/compare_lora.py \
	--base_model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
	--lora_dir outputs/tinyllama-lora \
	--prompt "Explain LoRA in one sentence." \
	--device cpu \
	--torch_dtype float32
	```
	Optional flags: `--max_new_tokens`, `--temperature`, `--top_p`, `--torch_dtype`, `--device`, `--offload_dir`.

	## Troubleshooting
	- OOM? Reduce `--max_seq_length`, lower `--per_device_batch_size` (raise `--gradient_accumulation_steps` to keep the same effective batch size), or switch to a smaller model. To shorten runs, use a tiny instruction set such as `mlabonne/guanaco-llama2-1k`, or subset your dataset with `--dataset_name your/dataset --max_train_samples 500` in code/script.
	- bitsandbytes import errors on macOS/CPU: run with `--use_4bit false` or use a Linux+CUDA machine.
	- bitsandbytes install error? We pin to `0.42.0`, the latest widely distributed wheel. If you cannot install it (CPU-only/MPS), remove it from `requirements.txt` and set `--use_4bit false`.
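
When tuning for memory, it helps to separate what each optimizer step sees from what a single forward/backward pass has to hold. The arithmetic below is a rough sketch with hypothetical helper names; it counts examples and tokens only and does not estimate bytes.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Examples contributing to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def max_tokens_per_pass(per_device_batch_size, max_seq_length):
    """Upper bound on tokens held in one forward/backward pass on a device."""
    return per_device_batch_size * max_seq_length


# CPU example from this README: batch 1, accumulation 16, sequence length 512.
print(effective_batch_size(1, 16), max_tokens_per_pass(1, 512))
# TinyLlama CUDA example from this README: batch 2, accumulation 8, sequence length 512.
print(effective_batch_size(2, 8), max_tokens_per_pass(2, 512))
```

Both configurations reach the same effective batch size, but the second holds twice as many tokens per pass, which is the quantity that drives activation memory.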


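	## TinyLlama on CUDA
	Upgrade to the CUDA 12.1 builds of PyTorch and to newer bitsandbytes/transformers (this moves bitsandbytes past the `0.42.0` pin mentioned above):
	```bash
	pip install --upgrade "torch==2.2.*" "torchvision==0.17.*" "torchaudio==2.2.*" --index-url https://download.pytorch.org/whl/cu121
	pip install --upgrade "bitsandbytes>=0.43.1"
	pip install --upgrade "transformers>=4.40.0"
	```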

	Train TinyLlama on a CUDA GPU with 4-bit enabled:
	```bash
	python train_tulu.py \
	--model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
	--output_dir outputs/tinyllama-lora \
	--offload_folder offload \
	--device cuda \
	--torch_dtype auto \
	--max_seq_length 512 \
	--per_device_batch_size 2 \
	--gradient_accumulation_steps 8 \
	--num_train_epochs 1 \
	--use_4bit \
	--instruction_field instruction \
	--input_field input \
	--output_field output
	```

	## TinyLlama on CPU only
	Same run on CPU: force `--device cpu` with fp32 and disable 4-bit, since bitsandbytes needs CUDA:
	```bash
	python train_tulu.py \
	--model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
	--output_dir outputs/tinyllama-lora \
	--offload_folder offload \
	--device cpu \
	--torch_dtype float32 \
	--max_seq_length 512 \
	--per_device_batch_size 2 \
	--gradient_accumulation_steps 8 \
	--num_train_epochs 1 \
	--no-use_4bit \
	--instruction_field instruction \
	--input_field input \
	--output_field output
	```