| # TD Quick Start: Rent a GPU and Go |
|
|
| ## What You Need (One-Time Setup) |
|
|
| 1. **vast.ai account**: sign up at vast.ai, add credit ($10-20 to start) |
| 2. **HuggingFace account**: sign up at huggingface.co (use any username, doesn't have to be your real name) |
| 3. **HuggingFace token**: Settings → Access Tokens → New Token, with **Write** access |
| 4. **ntfy.sh app** on your phone (you already have this) |
|
|
| ## One-Time: Upload Your Code to Private HuggingFace |
|
|
| Do this once from your computer. After this, your code lives in a private repo that only you can see. |
|
|
| ```bash |
| # Install the tool |
| pip install huggingface_hub |
| |
| # Log in (paste your token when asked) |
| huggingface-cli login |
| |
| # Upload everything |
| HF_USER=your_hf_username bash upload_to_hf.sh |
| ``` |
|
|
| Now your td_lang, td_fuse, .td files, and deploy script are all in a private HuggingFace repo. Nobody can see them except you. |
|
|
| **When you update your code**, just run the same `upload_to_hf.sh` command again; it overwrites the repo with the latest version. |
|
|
| ## Every Time: Rent GPU → 3 Commands → Done |
|
|
| ### 1. Rent a GPU on vast.ai |
|
|
| Go to vast.ai → Console, then search for: |
| - **GPU:** RTX 4090 (24GB) or A100 (40GB+) |
| - **Image:** Pick one with PyTorch pre-installed (like `pytorch/pytorch`) |
| - **Storage:** At least 100GB disk |
| - **Cost:** ~$0.40-0.80/hr for a 4090 |
|
|
| Click **RENT** and wait for it to start (~1-2 minutes). |
|
|
| ### 2. Connect to the GPU |
|
|
| vast.ai gives you an SSH command. Copy and paste it into your terminal. It looks something like this (your port and host will be different): |
| ``` |
| ssh -p 12345 root@ssh1.vast.ai |
| ``` |
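
Once you're connected, it can be worth a quick check that the instance really has the disk you asked for. The snippet below is a minimal sketch using only Python's standard library. The 100 GB threshold is the figure from the rental checklist above; the function names and the `/workspace` path are illustrative assumptions, not part of td_lang.

```python
import os
import shutil

REQUIRED_GB = 100  # disk size recommended in the rental step


def disk_total_gb(path: str = "/") -> float:
    """Total size of the filesystem holding `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).total / 1e9


def disk_free_gb(path: str = "/") -> float:
    """Free space on the filesystem holding `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_disk(path: str = "/", required_gb: float = REQUIRED_GB) -> bool:
    """True if the filesystem holding `path` is at least `required_gb` in size."""
    return disk_total_gb(path) >= required_gb


if __name__ == "__main__":
    # Assumption: /workspace is where the code will live; fall back to / if absent.
    path = "/workspace" if os.path.isdir("/workspace") else "/"
    status = "OK" if has_enough_disk(path) else "TOO SMALL"
    print(f"{path}: {disk_total_gb(path):.0f} GB total, "
          f"{disk_free_gb(path):.0f} GB free ({status})")
```

If it reports TOO SMALL, it is cheaper to destroy the instance and rent one with more disk now than to run out of space partway through a run.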
|
|
| ### 3. Run these 3 commands |
|
|
| ```bash |
| # Set your token |
| export HF_TOKEN=hf_your_token_here |
| |
| # Download your code from HuggingFace (takes ~10 seconds) |
| pip install huggingface_hub -q && python -c " |
| from huggingface_hub import snapshot_download |
| snapshot_download('YOUR_USERNAME/td-toolkit', local_dir='/workspace/td') |
| " |
| |
| # Go! |
| cd /workspace/td && bash deploy.sh demo_autopilot.td |
| ``` |
|
|
| That's it. Put your phone down. ntfy.sh sends you updates as it runs. |
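
Before you walk away, you may want to confirm that notifications actually reach your phone. ntfy.sh accepts a plain HTTP POST to `https://ntfy.sh/<topic>` with the message as the request body, so a test message can be sent with the standard library alone. This is a sketch, not what td_lang does internally; `my-td-topic` is a hypothetical placeholder, so use whatever topic your .td file and phone app are set to.

```python
import urllib.request


def build_ntfy_request(topic: str, message: str,
                       title: str = "TD test") -> urllib.request.Request:
    """Build (but do not send) a POST request that publishes `message` to `topic`."""
    return urllib.request.Request(
        f"https://ntfy.sh/{topic}",
        data=message.encode("utf-8"),
        headers={"Title": title},
        method="POST",
    )


def send_test_notification(topic: str, message: str = "Hello from the GPU") -> int:
    """Send the message and return the HTTP status code."""
    request = build_ntfy_request(topic, message)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical topic name; replace it with your own before running.
    print(send_test_notification("my-td-topic"))
```

If the message does not show up on your phone, the topic name is the first thing to compare between the app and the .td file.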
|
|
| ### 4. When it's done |
|
|
| Your model gets saved to Google Drive automatically (if rclone is configured in the .td file). Otherwise it stays on the GPU at `final_model/`. |
|
|
| ## Setting Up Google Drive (Optional, One-Time per GPU) |
|
|
| On the GPU machine after SSHing in: |
| ```bash |
| rclone config |
| ``` |
| 1. Type `n` for new remote |
| 2. Name it `gdrive` |
| 3. Pick `Google Drive` from the list |
| 4. Follow the prompts (it gives you a URL to visit in your browser) |
| 5. Done. Now `save base to "gdrive:TD/models/final"` works in your .td files |
|
|
| **Tip:** You can save the rclone config to your HuggingFace repo too, so you don't have to set it up every time. Keep that repo private, because the config holds the access token for your Google Drive. |
|
|
| ## Quick Reference |
|
|
| | Command | What it does | |
| |---------|-------------| |
| | `bash deploy.sh my_file.td` | Full setup + run | |
| | `python -m td_lang check my_file.td` | Check syntax only | |
| | `python -m td_lang info my_file.td` | Show plan without running | |
| | `python -m td_lang run my_file.td` | Run (skip deploy setup) | |
| | `python -m td_lang run my_file.td --dry` | Compile but don't execute | |
|
|
| ## If Something Goes Wrong |
|
|
| - **OOM (out of memory):** Your .td file's `on_error` block handles this by retrying with smaller batches |
| - **Model download fails:** Check your HF_TOKEN is set correctly |
| - **ntfy not working:** Check your phone has the ntfy app and you're subscribed to the right topic |
| - **GPU disconnects:** Re-SSH in, your files are still there. Run deploy.sh again and td_lang picks up from the last snapshot |
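
For the "model download fails" case, the most common slip is an `HF_TOKEN` that was never exported, still holds the placeholder, or picked up stray whitespace when pasted. The sketch below checks only the shape of the variable, under the assumption that HuggingFace user tokens start with `hf_`; it cannot tell whether the token is valid or has the right access. `check_hf_token` is an illustrative helper, not part of td_lang or huggingface_hub.

```python
import os

PLACEHOLDER = "hf_your_token_here"  # the example value used in this guide


def check_hf_token(env=None) -> str:
    """Return "ok" or a short description of what looks wrong with HF_TOKEN."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "")
    if not token:
        return "HF_TOKEN is not set (run: export HF_TOKEN=...)"
    if token != token.strip():
        return "HF_TOKEN has leading or trailing whitespace"
    if token == PLACEHOLDER:
        return "HF_TOKEN is still the placeholder from the guide"
    if not token.startswith("hf_"):
        return "HF_TOKEN does not start with 'hf_'"
    return "ok"


if __name__ == "__main__":
    print(check_hf_token())
```

Run it in the same SSH session where you ran `export`, since the variable does not carry over to a new session.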
|
|
| ## Cost Estimate |
|
|
| For the full `demo_autopilot.td` pipeline (merge 4 models + 5 training loops): |
| - **RTX 4090:** ~$0.50/hr × ~30-40 hrs = ~$15-20 |
| - **A100 40GB:** ~$1.00/hr × ~20-30 hrs = ~$20-30 |
| - **Budget cap in .td file:** Set `max_cost = 160.00` to prevent runaway costs |
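
The estimates above are just hourly rate times hours, so they are easy to redo when prices change. Here is a small sketch that reproduces them and also shows how many hours a given budget cap would allow. The rates and hour ranges are the rough figures from this section, not live vast.ai prices, and the helper names are illustrative.

```python
def cost_range(rate_per_hr: float, hours_low: float, hours_high: float):
    """Return (low, high) total cost in dollars for a range of run times."""
    return rate_per_hr * hours_low, rate_per_hr * hours_high


def hours_until_cap(max_cost: float, rate_per_hr: float) -> float:
    """How many hours of rental a budget cap allows at a given hourly rate."""
    return max_cost / rate_per_hr


# (hourly rate, shortest run, longest run), taken from the estimates above
ESTIMATES = {
    "RTX 4090": (0.50, 30, 40),
    "A100 40GB": (1.00, 20, 30),
}

if __name__ == "__main__":
    for gpu, (rate, low_hrs, high_hrs) in ESTIMATES.items():
        low, high = cost_range(rate, low_hrs, high_hrs)
        cap_hours = hours_until_cap(160.00, rate)
        print(f"{gpu}: ${low:.2f} to ${high:.2f}; "
              f"a $160.00 cap allows {cap_hours:.0f} hrs")
```

At these rates a `max_cost` of 160.00 sits well above the expected total, so it works as a safety net rather than a tight budget; lower it if you want the run to stop closer to the estimate.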
|
|