---
license: cc-by-nc-sa-4.0
language: en
tags:
- text-generation
- causal-lm
- lora
- tulu
base_model: allenai/tulu-2-7b
model_type: llama
library_name: transformers
pipeline_tag: text-generation
---

# License: CC BY-NC-SA 4.0. Rights belong to Javad Taghia (taghia.javad@gmail.com).

# Tulu Laptop Finetune + W&B

Minimal setup to finetune a laptop-friendly Tulu checkpoint with QLoRA and track runs in Weights & Biases.

## Prereqs

- Recent NVIDIA GPU with CUDA for 4-bit (bitsandbytes); set `--use_4bit true`. On CPU/MPS (the default), set `--use_4bit false`, but expect much slower/limited runs.
- Conda (Miniconda/Anaconda).
- A Weights & Biases account + API key.

## Setup

1) Create the env (Conda)

```bash
conda env create -f environment.yml
conda activate deeai
```

2) Add secrets (keep `.env` out of git)

```bash
cp .env.example .env
# Edit .env with your WANDB_API_KEY / project / entity
# Optionally set BASE_MODEL_CACHE to choose where HF downloads models
```

3) Verify packages (optional if you prefer pip)

```bash
pip install -r requirements.txt
```

- If you see `LlamaTokenizer requires the SentencePiece library`, install it in the env:

```bash
pip install sentencepiece
```

- If you get a `torch.load` vulnerability error, either upgrade torch (>=2.6 when available for your platform) or ensure `safetensors` is installed; this repo prefers safetensors by default:

```bash
pip install safetensors
```

## Run a quick finetune

The defaults use `allenai/tulu-2-7b` with a small instruction dataset (`mlabonne/guanaco-llama2-1k`) and 4-bit QLoRA, which keeps memory needs closer to laptop GPUs. The example below forces CPU and disables 4-bit so it also runs without CUDA:

```bash
python train_tulu.py \
  --output_dir outputs/tulu-lora \
  --offload_folder offload \
  --device cpu \
  --max_seq_length 512 \
  --per_device_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --no-use_4bit \
  --instruction_field instruction \
  --input_field input \
  --output_field output
```

Key flags:

- `--no-use_4bit` if bitsandbytes/CUDA are unavailable; keep 4-bit disabled on Mac MPS as well (bitsandbytes needs CUDA).
- `--dataset_name` to try another instruction set (any HF dataset with `instruction/input/output` fields).
- `--model_name` if you want a different Tulu variant (e.g., `allenai/tulu-2-dpo-7b`) or a smaller model for constrained hardware (e.g., `TinyLlama/TinyLlama-1.1B-Chat-v1.0` on Mac MPS).
- `--offload_folder` sets where to offload weights when `device_map="auto"` (ensure it has space). The default `offload/` lives in this repo so it stays alongside the project.
- `--instruction_field/--input_field/--output_field` let you match custom dataset column names; defaults assume `instruction/input/output`. For text-only datasets, set `--instruction_field text --output_field text`. A prompt-formatting sketch follows this list.
- `--device` can force `cpu`, `mps`, `cuda`, or `auto` (default). Use `--device mps` with a smaller fp16 model (e.g., TinyLlama) to fit memory; offloading is disabled on MPS/CPU.
- `--torch_dtype` can force the dtype (`float16`/`float32`/`bfloat16`); on MPS use `float16` to avoid unsupported bf16 weights.
- `--cpu_threads` limits CPU threads (default 4) when running on CPU so you don't overload your machine.
- MPS (Mac) note: mixed precision isn't supported for bfloat16, so the script falls back to fp32 automatically on MPS. Keep `--no-use_4bit` on Mac; offloading is disabled on MPS (the model stays on device).
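The exact prompt template lives in `train_tulu.py`; the snippet below is only a hypothetical illustration of how `instruction/input/output` columns could be flattened into a single training string (the field names and the `### Instruction/Input/Response` layout are assumptions, not the script's actual format):

```python
# Hypothetical illustration only: the real template in train_tulu.py may differ.
def format_example(example, instruction_field="instruction",
                   input_field="input", output_field="output"):
    """Flatten one dataset row into a single causal-LM training string."""
    instruction = example.get(instruction_field, "")
    context = example.get(input_field, "") or ""
    response = example.get(output_field, "")
    if context:
        return (f"### Instruction:\n{instruction}\n\n"
                f"### Input:\n{context}\n\n### Response:\n{response}")
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"

print(format_example({"instruction": "Summarize LoRA.", "input": "",
                      "output": "LoRA trains small low-rank adapters on a frozen base."}))
```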
## How W&B is used

- `train_tulu.py` loads `.env`, logs into W&B, and reports through `Trainer(report_to=["wandb"])`.
- Ensure `WANDB_API_KEY`, `WANDB_PROJECT`, and (optionally) `WANDB_ENTITY` are set in `.env`.
- Each run captures hyperparameters and metrics; check the W&B UI for live loss curves and checkpoints.
- Additional summaries are logged: `train_duration_seconds`, `train_examples`, `estimated_tokens`, `precision_mode` (bf16/fp16/fp32), `use_4bit`, `model_name`, `dataset_name`, `per_device_batch_size`, `gradient_accumulation_steps`, and `max_seq_length`.

## Training objective and base model

- Objective: standard causal LM cross-entropy. The model predicts the next token, and cross-entropy measures how much probability mass it assigns to the true token. Minimizing it (maximum likelihood) encourages the model to imitate the target outputs in your instruction data. No rewards/RLHF here; this is pure supervised finetuning.
- Base model: a Tulu checkpoint (LLaMA-style architecture) from the Hub (default `allenai/tulu-2-7b`). We train LoRA adapters on top of the frozen base (optionally 4-bit on CUDA), keeping the adapter small and the base intact.

## Model cache location

- Base model weights download to the Hugging Face cache. You can point downloads to an external directory by setting `BASE_MODEL_CACHE` in `.env` (e.g., `/Volumes/JTQ-s/______GITLAB____/downloaded_base_models`); the script maps this to `HF_HOME`/`TRANSFORMERS_CACHE` before loading models.
- If `BASE_MODEL_CACHE` is not set, the default HF cache is used (typically `~/.cache/huggingface/hub`).

## Output

- Finetuned adapters + tokenizer are written to `outputs/tulu-lora` (configurable via `--output_dir`).
- `outputs/` is tracked via Git LFS (`.gitattributes`), so weights can be committed and pushed to the Hub. Run `git lfs install` once, then `git add outputs/...` before committing.

## Evaluation (inference/compare)

- Quick smoke test with the saved adapter (edit `lora_dir` or pass flags):

```bash
python evaluation/simple_inference.py \
  --lora_dir outputs/tinyllama-lora \
  --device auto \
  --torch_dtype auto \
  --max_new_tokens 128 \
  --temperature 0.7 \
  --top_p 0.9
```

- Compare base vs. LoRA outputs side by side:

```bash
python evaluation/compare_lora.py \
  --base_model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --lora_dir outputs/tinyllama-lora \
  --prompt "Explain LoRA in one sentence."
```

For CPU or constrained machines, force CPU + fp32 (and add `--offload_dir offload` if using `device_map=auto`):

```bash
python evaluation/compare_lora.py \
  --base_model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --lora_dir outputs/tinyllama-lora \
  --prompt "Explain LoRA in one sentence." \
  --device cpu \
  --torch_dtype float32
```

Optional flags: `--max_new_tokens`, `--temperature`, `--top_p`, `--torch_dtype`, `--device`, `--offload_dir`.

## Troubleshooting

- OOM? Reduce `--max_seq_length`, increase `--gradient_accumulation_steps`, or switch to a smaller dataset (e.g., a tiny instruction set like `mlabonne/guanaco-llama2-1k`, or subset your dataset with `--dataset_name your/dataset --max_train_samples 500` in code/script; a subsetting sketch follows this section).
- bitsandbytes import errors on macOS/CPU: run with `--use_4bit false` or use a Linux+CUDA machine.
- bitsandbytes install error? We pin to `0.42.0`, the latest widely distributed wheel. If you cannot install it (CPU-only/MPS), remove it from `requirements.txt` and set `--use_4bit false`.
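If `--max_train_samples` isn't wired up in your copy of the script, subsetting in code is straightforward with the `datasets` library. A minimal sketch, assuming the default `mlabonne/guanaco-llama2-1k` dataset and a 500-row cap:

```python
# Minimal sketch: cap the number of training rows before handing the dataset to
# the Trainer. Dataset name, split, and the 500-row cap are assumptions.
from datasets import load_dataset

dataset = load_dataset("mlabonne/guanaco-llama2-1k", split="train")
max_train_samples = 500
if len(dataset) > max_train_samples:
    dataset = dataset.select(range(max_train_samples))  # keep only the first 500 rows
print(f"training on {len(dataset)} examples")
```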
## GPU (CUDA) example

Upgrade to the CUDA 12.1 PyTorch wheels plus newer bitsandbytes/transformers, then run a 4-bit QLoRA finetune on the GPU:

```bash
pip install --upgrade "torch==2.2.*" "torchvision==0.17.*" "torchaudio==2.2.*" --index-url https://download.pytorch.org/whl/cu121
pip install --upgrade "bitsandbytes>=0.43.1"
pip install --upgrade "transformers>=4.40.0"
```

```bash
python train_tulu.py \
  --model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --output_dir outputs/tinyllama-lora \
  --offload_folder offload \
  --device cuda \
  --torch_dtype auto \
  --max_seq_length 512 \
  --per_device_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --num_train_epochs 1 \
  --use_4bit \
  --instruction_field instruction \
  --input_field input \
  --output_field output
```

## CPU-only example

The same run without a GPU: force `--device cpu` and disable 4-bit (bitsandbytes needs CUDA):

```bash
python train_tulu.py \
  --model_name TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --output_dir outputs/tinyllama-lora \
  --offload_folder offload \
  --device cpu \
  --torch_dtype float32 \
  --max_seq_length 512 \
  --per_device_batch_size 2 \
  --gradient_accumulation_steps 8 \
  --num_train_epochs 1 \
  --no-use_4bit \
  --instruction_field instruction \
  --input_field input \
  --output_field output
```
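After training, you can sanity-check the saved adapter directly from Python. A minimal sketch using `transformers` + `peft` (the base model name and adapter path are assumptions matching the examples above; `evaluation/simple_inference.py` already wraps this flow):

```python
# Minimal sketch: load the frozen base model, attach the trained LoRA adapter,
# and generate one completion. Model name and paths are assumptions from the
# examples above; adjust them to your run.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
lora_dir = "outputs/tinyllama-lora"

tokenizer = AutoTokenizer.from_pretrained(lora_dir)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float32)
model = PeftModel.from_pretrained(model, lora_dir)  # attach the LoRA adapter
model.eval()

inputs = tokenizer("Explain LoRA in one sentence.", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=True,
                                temperature=0.7, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```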