
	# TD Quick Start — Rent a GPU and Go

	## What You Need (One-Time Setup)

	1. vast.ai account — sign up at vast.ai, add credit ($10-20 to start)
	2. HuggingFace account — sign up at huggingface.co (any username works; it doesn't have to be your real name)
	3. HuggingFace token — Settings → Access Tokens → New Token → Write access
	4. ntfy.sh app on your phone (you already have this)

	## One-Time: Upload Your Code to Private HuggingFace

	Do this once from your computer. After this, your code lives in a private repo that only you can see.

	```bash
	# Install the tool
	pip install huggingface_hub

	# Log in (paste your token when asked)
	huggingface-cli login

	# Upload everything
	HF_USER=your_hf_username bash upload_to_hf.sh
	```

	Now your td_lang, td_fuse, .td files, and deploy script are all in a private HuggingFace repo. Nobody can see them except you.

	When you update your code, run the upload command again (`HF_USER=your_hf_username bash upload_to_hf.sh`). It overwrites the repo with the latest version.
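
	Before uploading, it can help to confirm that the login worked and to see which repo the files will land in. The sketch below assumes the repo is named `td-toolkit` under your username (the name used in the download step of this guide). `repo_id` and `check_login` are hypothetical helpers, not part of `upload_to_hf.sh`.

	```bash
	# Hypothetical helper: build the repo ID from HF_USER, failing loudly if it is unset.
	# Assumes the repo is called td-toolkit, as in the download step of this guide.
	repo_id() {
	  if [ -z "${HF_USER:-}" ]; then
	    echo "HF_USER is not set" >&2
	    return 1
	  fi
	  echo "${HF_USER}/td-toolkit"
	}

	# Hypothetical helper: print the account that the saved token belongs to.
	check_login() {
	  huggingface-cli whoami
	}
	```

	Run `check_login` after `huggingface-cli login`; if it prints your username, the token was saved. `HF_USER=your_hf_username repo_id` then shows the repo name to look for on huggingface.co.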

	## Every Time: Rent GPU → 3 Commands → Done

	### 1. Rent a GPU on vast.ai

	Go to vast.ai → Console → Search for:
	- GPU: RTX 4090 (24GB) or A100 (40GB+)
	- Image: Pick one with PyTorch pre-installed (like `pytorch/pytorch`)
	- Storage: At least 100GB disk
	- Cost: ~$0.40-0.80/hr for a 4090

	Click RENT and wait for it to start (~1-2 minutes).

	### 2. Connect to the GPU

	vast.ai gives you an SSH command for your instance. Copy it and paste it into your terminal. The port and host below are placeholders; yours will differ:
	```
	ssh -p 12345 root@ssh1.vast.ai
	```
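
	Once connected, a quick check that the instance matches what you rented can save a wasted run. This is a minimal sketch, assuming the NVIDIA driver tools are present (they normally are on GPU images) and that your code will live under `/workspace`. `show_gpu`, `avail_kb_for`, and `enough_disk` are hypothetical helpers.

	```bash
	# Show the GPU model and its total memory (expect roughly 24 GB on a 4090).
	show_gpu() {
	  nvidia-smi --query-gpu=name,memory.total --format=csv
	}

	# Read the available space for a path in 1 KB blocks (POSIX df output, 4th column).
	avail_kb_for() {
	  df -Pk "$1" | awk 'NR == 2 { print $4 }'
	}

	# Succeed if an available-space figure in KB is at least the given number of GB.
	enough_disk() {
	  avail_kb="$1"
	  min_gb="$2"
	  [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
	}
	```

	On the GPU, run `show_gpu`, then `enough_disk "$(avail_kb_for /workspace)" 100 || echo "less than 100 GB free"`. A freshly rented 100 GB disk may report slightly under 100 GB available, so lower the threshold if that is all that trips the check.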

	### 3. Run these 3 commands

	```bash
	# Set your token
	export HF_TOKEN=hf_your_token_here

	# Download your code from HuggingFace (takes ~10 seconds)
	pip install huggingface_hub -q && python -c "
	from huggingface_hub import snapshot_download
	snapshot_download('YOUR_USERNAME/td-toolkit', local_dir='/workspace/td')
	"

	# Go!
	cd /workspace/td && bash deploy.sh demo_autopilot.td
	```

	That's it. Put your phone down. ntfy.sh sends you updates as it runs.
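
	If the SSH session drops, a command started in the foreground usually dies with it. One common workaround (not something this guide says `deploy.sh` does for you) is to start the run detached with `nohup` and follow the log. `run_detached` and the log file name are hypothetical names for this sketch.

	```bash
	# Hypothetical helper: run a command in the background, immune to hangups,
	# with stdout and stderr written to a log file. Prints the background PID.
	run_detached() {
	  log="$1"
	  shift
	  nohup "$@" > "$log" 2>&1 &
	  echo "$!"
	}
	```

	For example: `cd /workspace/td && run_detached deploy.log bash deploy.sh demo_autopilot.td`, then `tail -f deploy.log` to watch progress. Closing the terminal after that should leave the run going.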

	### 4. When it's done

	If rclone is set up on the GPU and your .td file saves to a `gdrive:` path, the model is uploaded to Google Drive automatically. Otherwise it stays on the GPU at `final_model/`, so copy it off before you destroy the instance.

	## Setting Up Google Drive (Optional, One-Time per GPU)

	On the GPU machine after SSHing in:
	```bash
	rclone config
	```
	1. Type `n` for new remote
	2. Name it `gdrive`
	3. Pick `Google Drive` from the list
	4. Follow the prompts (it gives you a URL to visit in your browser)
	5. Done — now `save base to "gdrive:TD/models/final"` works in your .td files

	Tip: You can save the rclone config to your HuggingFace repo too, so you don't have to set it up every time.
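
	To confirm the remote works before a long run, and to find the config file the tip above refers to, a few rclone subcommands are handy. This is a sketch, assuming the remote is named `gdrive` as in the setup steps and that the model sits in `final_model/`. The function names are hypothetical.

	```bash
	# Remote name chosen during `rclone config`; override by exporting REMOTE.
	REMOTE="${REMOTE:-gdrive}"

	# Hypothetical helper: build an rclone destination on the remote.
	remote_path() {
	  echo "${REMOTE}:$1"
	}

	# List top-level folders on the remote; an error here means setup did not finish.
	check_drive() {
	  rclone lsd "$(remote_path "")"
	}

	# Copy a finished model to Drive by hand if the .td file did not do it.
	backup_model() {
	  rclone copy final_model "$(remote_path TD/models/final)" --progress
	}

	# Print the location of the rclone config file (the file to keep for next time).
	show_rclone_config() {
	  rclone config file
	}
	```

	The config file holds the access credentials for your Drive, so only store it somewhere private.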

	## Quick Reference

	| Command | What it does |
	|---------|--------------|
	| `bash deploy.sh my_file.td` | Full setup + run |
	| `python -m td_lang check my_file.td` | Check syntax only |
	| `python -m td_lang info my_file.td` | Show plan without running |
	| `python -m td_lang run my_file.td` | Run (skip deploy setup) |
	| `python -m td_lang run my_file.td --dry` | Compile but don't execute |
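
	The commands in the table can be chained so that a syntax error stops things before any GPU time is spent. This is a minimal sketch using only the subcommands listed above. `preflight` is a hypothetical wrapper, and it assumes `check` exits non-zero on a syntax error, which this guide does not confirm.

	```bash
	# Hypothetical wrapper: refuse anything that is not an existing .td file,
	# then run the syntax check and print the plan without executing it.
	preflight() {
	  file="$1"
	  case "$file" in
	    *.td) ;;
	    *) echo "expected a .td file, got: $file" >&2; return 2 ;;
	  esac
	  [ -f "$file" ] || { echo "no such file: $file" >&2; return 2; }
	  python -m td_lang check "$file" && python -m td_lang info "$file"
	}
	```

	Usage: `preflight my_file.td && bash deploy.sh my_file.td`.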

	## If Something Goes Wrong

	- OOM (out of memory): Your .td file's `on_error` block handles this by retrying with smaller batches
	- Model download fails: Check that `HF_TOKEN` is set correctly in the shell you are running from
	- ntfy not working: Check that your phone has the ntfy app and that you're subscribed to the right topic
	- GPU disconnects: SSH back in; your files are still there. Run the same `bash deploy.sh` command again and td_lang picks up from the last snapshot
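
	Two of these checks are easy to script. The sketch below tests that `HF_TOKEN` is set and has the `hf_` prefix used by current HuggingFace user tokens, and sends a test notification through ntfy's plain HTTP publish endpoint. `token_looks_ok` and `ntfy_test` are hypothetical helpers, and the topic is whichever one your run publishes to (this guide does not say where that is configured).

	```bash
	# Hypothetical helper: succeed only if HF_TOKEN is non-empty and starts with hf_.
	# This checks the shape only; it cannot tell whether the token is still valid.
	token_looks_ok() {
	  case "${HF_TOKEN:-}" in
	    hf_?*) return 0 ;;
	    *) return 1 ;;
	  esac
	}

	# Hypothetical helper: publish a test message to an ntfy topic.
	ntfy_test() {
	  topic="$1"
	  curl -fsS -d "test from the GPU box" "https://ntfy.sh/${topic}"
	}
	```

	Run `token_looks_ok || echo "HF_TOKEN is missing or malformed"` on the GPU, and `ntfy_test your_topic` to see whether a notification reaches your phone.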

	## Cost Estimate

	For the full `demo_autopilot.td` pipeline (merge 4 models + 5 training loops):
	- RTX 4090: ~$0.50/hr × ~30-40 hrs = ~$15-20
	- A100 40GB: ~$1.00/hr × ~20-30 hrs = ~$20-30
	- Budget cap in .td file: Set `max_cost = 160.00` to prevent runaway costs
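
	The estimates above are just hourly rate times hours. Here is a small sketch for redoing the arithmetic with the price you were actually quoted. `estimate_cost` is a hypothetical helper that uses `awk` for the decimal math.

	```bash
	# Hypothetical helper: cost = hourly rate (USD) x hours, printed with two decimals.
	estimate_cost() {
	  awk -v rate="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", rate * hours }'
	}

	estimate_cost 0.50 40   # upper end of the 4090 estimate, prints 20.00
	estimate_cost 1.00 30   # upper end of the A100 estimate, prints 30.00
	```

	Plugging in the top of the 4090 price range quoted earlier (`estimate_cost 0.80 40`) shows how far the total can drift from the midpoint figure.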