---
license: cc-by-nc-sa-4.0
language: en
tags:
- text-generation
- causal-lm
- lora
- tulu
base_model: allenai/tulu-2-7b
model_type: llama
library_name: transformers
pipeline_tag: text-generation
---

# License: CC BY-NC-SA 4.0. Rights belong to Javad Taghia (taghia.javad@gmail.com).

# Tulu Laptop Finetune + W&B

Minimal setup to finetune a laptop-friendly Tulu checkpoint with QLoRA and track runs in Weights & Biases.

## Prereqs

- Recent NVIDIA GPU with CUDA for 4-bit (bitsandbytes); set `--use_4bit true`. On CPU/MPS (the default), set `--use_4bit false`, but expect much slower/limited runs.
- Conda (Miniconda/Anaconda).
- A Weights & Biases account + API key.

## Setup

1) Create the env (Conda)

```bash
conda env create -f environment.yml
conda activate deeai
```

2) Add secrets (keep `.env` out of git)

```bash
cp .env.example .env
# Edit .env with your WANDB_API_KEY / project / entity
# Optionally set BASE_MODEL_CACHE to choose where HF downloads models
```

3) Verify packages (optional if you prefer pip)

```bash
pip install -r requirements.txt
```

- If you see `LlamaTokenizer requires the SentencePiece library`, install it in the env:

```bash
pip install sentencepiece
```

- If you get a `torch.load` vulnerability error, either upgrade torch (>=2.6 when available for your platform) or ensure `safetensors` is installed; this repo prefers safetensors by default:

```bash
pip install safetensors
```

## Run a quick finetune

The defaults use `allenai/tulu-2-7b` with a small instruction dataset (`mlabonne/guanaco-llama2-1k`) and 4-bit QLoRA, which keeps memory needs closer to laptop GPUs. The example below forces CPU and disables 4-bit so it also runs without CUDA:

```bash
python train_tulu.py \
  --output_dir outputs/tulu-lora \
  --offload_folder offload \
  --device cpu \
  --max_seq_length 512 \
  --per_device_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --no-use_4bit \
  --instruction_field instruction \
  --input_field input \
  --output_field output
```

Key flags:

- `--no-use_4bit` if bitsandbytes/CUDA are unavailable; keep 4-bit disabled on Mac MPS as well (bitsandbytes needs CUDA).
- `--dataset_name` to try another instruction set (any HF dataset with `instruction/input/output` fields).
- `--model_name` if you want a different Tulu variant (e.g., `allenai/tulu-2-dpo-7b`) or a smaller model for constrained hardware (e.g., `TinyLlama/TinyLlama-1.1B-Chat-v1.0` on Mac MPS).
- `--offload_folder` sets where to offload weights when `device_map="auto"` (ensure it has space). The default `offload/` lives in this repo so it stays alongside the project.
- `--instruction_field/--input_field/--output_field` let you match custom dataset column names; defaults assume `instruction/input/output`. For text-only datasets, set `--instruction_field text --output_field text`. A prompt-formatting sketch follows this list.
- `--device` can force `cpu`, `mps`, `cuda`, or `auto` (default). Use `--device mps` with a smaller fp16 model (e.g., TinyLlama) to fit memory; offloading is disabled on MPS/CPU.
- `--torch_dtype` can force the dtype (`float16`/`float32`/`bfloat16`); on MPS use `float16` to avoid unsupported bf16 weights.
- `--cpu_threads` limits CPU threads (default 4) when running on CPU so you don't overload your machine.
- MPS (Mac) note: mixed precision isn't supported for bfloat16, so the script falls back to fp32 automatically on MPS. Keep `--no-use_4bit` on Mac; offloading is disabled on MPS (the model stays on device).
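The exact prompt template lives in `train_tulu.py`; the snippet below is only a hypothetical illustration of how `instruction/input/output` columns could be flattened into a single training string (the field names and the `### Instruction/Input/Response` layout are assumptions, not the script's actual format):

```python
# Hypothetical illustration only: the real template in train_tulu.py may differ.
def format_example(example, instruction_field="instruction",
                   input_field="input", output_field="output"):
    """Flatten one dataset row into a single causal-LM training string."""
    instruction = example.get(instruction_field, "")
    context = example.get(input_field, "") or ""
    response = example.get(output_field, "")
    if context:
        return (f"### Instruction:\n{instruction}\n\n"
                f"### Input:\n{context}\n\n### Response:\n{response}")
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"

print(format_example({"instruction": "Summarize LoRA.", "input": "",
                      "output": "LoRA trains small low-rank adapters on a frozen base."}))
```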
## How W&B is used

- `train_tulu.py` loads `.env`, logs into W&B, and reports through `Trainer(report_to=["wandb"])`.
- Ensure `WANDB_API_KEY`, `WANDB_PROJECT`, and (optionally) `WANDB_ENTITY` are set in `.env`.
- Each run captures hyperparameters and metrics; check the W&B UI for live loss curves and checkpoints.
- Additional summaries are logged: `train_duration_seconds`, `train_examples`, `estimated_tokens`, `precision_mode` (bf16/fp16/fp32), `use_4bit`, `model_name`, `dataset_name`, `per_device_batch_size`, `gradient_accumulation_steps`, and `max_seq_length`.

## Training objective and base model

- Objective: standard causal LM cross-entropy. The model predicts the next token, and cross-entropy measures how much probability mass it assigns to the true token. Minimizing it (maximum likelihood) encourages the model to imitate the target outputs in your instruction data. No rewards/RLHF here; this is pure supervised finetuning.
- Base model: a Tulu checkpoint (LLaMA-style architecture) from the Hub (default `allenai/tulu-2-7b`). We train LoRA adapters on top of the frozen base (optionally 4-bit on CUDA), keeping the adapter small and the base intact.

## Model cache location

- Base model weights download to the Hugging Face cache. You can point downloads to an external directory by setting `BASE_MODEL_CACHE` in `.env` (e.g., `/Volumes/JTQ-s/______GITLAB____/downloaded_base_models`); the script maps this to `HF_HOME`/`TRANSFORMERS_CACHE` before loading models.
- If `BASE_MODEL_CACHE` is not set, the default HF cache is used (typically `~/.cache/huggingface/hub`).

## Output

- Finetuned adapters + tokenizer are written to `outputs/tulu-lora` (configurable via `--output_dir`).
- `outputs/` is tracked via Git LFS (`.gitattributes`), so weights can be committed and pushed to the Hub. Run `git lfs install` once, then `git add outputs/...` before committing.

## Evaluation (inference/compare)

- Quick smoke test with the saved adapter (edit `lora_dir` or pass flags):

```bash
python evaluation/simple_inference.py \
  --lora_dir outputs/tinyllama-lora \
  --device auto \
  --torch_dtype auto \
  --max_new_tokens 128 \
  --temperature 0.7 \
  --top_p 0.9
```

- Compare base vs. LoRA outputs side by side:

```bash
python evaluation/compare_lora.py \
  --base_model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --lora_dir outputs/tinyllama-lora \
  --prompt "Explain LoRA in one sentence."
```

For CPU or constrained machines, force CPU + fp32 (and add `--offload_dir offload` if using `device_map=auto`):

```bash
python evaluation/compare_lora.py \
  --base_model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --lora_dir outputs/tinyllama-lora \
  --prompt "Explain LoRA in one sentence." \
  --device cpu \
  --torch_dtype float32
```

Optional flags: `--max_new_tokens`, `--temperature`, `--top_p`, `--torch_dtype`, `--device`, `--offload_dir`.

## Troubleshooting

- OOM? Reduce `--max_seq_length`, increase `--gradient_accumulation_steps`, or switch to a smaller dataset (e.g., a tiny instruction set like `mlabonne/guanaco-llama2-1k`, or subset your dataset with `--dataset_name your/dataset --max_train_samples 500` in code/script; a subsetting sketch follows this section).
- bitsandbytes import errors on macOS/CPU: run with `--use_4bit false` or use a Linux+CUDA machine.
- bitsandbytes install error? We pin to `0.42.0`, the latest widely distributed wheel. If you cannot install it (CPU-only/MPS), remove it from `requirements.txt` and set `--use_4bit false`.
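If `--max_train_samples` isn't wired up in your copy of the script, subsetting in code is straightforward with the `datasets` library. A minimal sketch, assuming the default `mlabonne/guanaco-llama2-1k` dataset and a 500-row cap:

```python
# Minimal sketch: cap the number of training rows before handing the dataset to
# the Trainer. Dataset name, split, and the 500-row cap are assumptions.
from datasets import load_dataset

dataset = load_dataset("mlabonne/guanaco-llama2-1k", split="train")
max_train_samples = 500
if len(dataset) > max_train_samples:
    dataset = dataset.select(range(max_train_samples))  # keep only the first 500 rows
print(f"training on {len(dataset)} examples")
```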
## GPU (CUDA) example

Upgrade to the CUDA 12.1 PyTorch wheels plus newer bitsandbytes/transformers, then run a 4-bit QLoRA finetune on the GPU:

```bash
pip install --upgrade "torch==2.2.*" "torchvision==0.17.*" "torchaudio==2.2.*" --index-url https://download.pytorch.org/whl/cu121
pip install --upgrade "bitsandbytes>=0.43.1"
pip install --upgrade "transformers>=4.40.0"
```

```bash
python train_tulu.py \
  --model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --output_dir outputs/tinyllama-lora \
  --offload_folder offload \
  --device cuda \
  --torch_dtype auto \
  --max_seq_length 512 \
  --per_device_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --num_train_epochs 1 \
  --use_4bit \
  --instruction_field instruction \
  --input_field input \
  --output_field output
```

## CPU-only example

The same run without a GPU: force `--device cpu` and disable 4-bit (bitsandbytes needs CUDA):

```bash
python train_tulu.py \
  --model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --output_dir outputs/tinyllama-lora \
  --offload_folder offload \
  --device cpu \
  --torch_dtype float32 \
  --max_seq_length 512 \
  --per_device_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --num_train_epochs 1 \
  --no-use_4bit \
  --instruction_field instruction \
  --input_field input \
  --output_field output
```
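After training, you can sanity-check the saved adapter directly from Python. A minimal sketch using `transformers` + `peft` (the base model name and adapter path are assumptions matching the examples above; `evaluation/simple_inference.py` already wraps this flow):

```python
# Minimal sketch: load the frozen base model, attach the trained LoRA adapter,
# and generate one completion. Model name and paths are assumptions from the
# examples above; adjust them to your run.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
lora_dir = "outputs/tinyllama-lora"

tokenizer = AutoTokenizer.from_pretrained(lora_dir)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float32)
model = PeftModel.from_pretrained(model, lora_dir)  # attach the LoRA adapter
model.eval()

inputs = tokenizer("Explain LoRA in one sentence.", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=True,
                                temperature=0.7, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```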