Faaz committed
Commit 0026fa2 · 1 Parent(s): 6e8fa2a

trainer: skip completed phases / resume mid-phase + update context


src/training/mindi_trainer.py:
- train(): skip phases with global_step >= phase.end_step
- train(): when resuming mid-phase, set _resume_step_offset so
train_phase() starts from the correct step instead of 0
- train_phase(): honor _resume_step_offset and clear after use

This fixes the resume bug surfaced in Session 2 where Phase 1 resume
restarted from step 0 instead of step 4250. Bug discovered on droplet
165.245.141.141 and fixed remotely; this commit lands the fix locally.

context.md: bring up to date with sessions 2-4
- Session 2 (Apr 16): rate-limit / git-clone data fix, all phases dry
run passed, checkpoint upload to HF, auto-push script
- Session 3 (Apr 19): Phase 1 finish on droplet 165.245.141.141
- Session 4 (Apr 30): frontend bugs, Gradio 5.x SSE v3, ZeroGPU quota,
agent system, training summary across MI300X + Modal A100
- Add bash history-expansion gotcha and data dir-already-exists fix
- Droplet history table
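
The skip/resume control flow described in this commit can be sketched in isolation. This is a minimal standalone model of the logic, not the actual trainer: `Phase` fields, `global_step`, and `_resume_step_offset` mirror names from `src/training/mindi_trainer.py`; everything else is illustrative.

```python
from dataclasses import dataclass

@dataclass
class Phase:
    name: str
    start_step: int
    end_step: int

class Trainer:
    def __init__(self, phases, global_step=0):
        self.phases = phases
        self.global_step = global_step      # restored from checkpoint
        self._resume_step_offset = 0

    def train(self):
        ran = []
        for phase in self.phases:
            if self.global_step >= phase.end_step:
                continue                    # skip phases already completed
            if self.global_step > phase.start_step:
                # resume mid-phase: steps already done within this phase
                self._resume_step_offset = self.global_step - phase.start_step
            ran.append(self.train_phase(phase))
        return ran

    def train_phase(self, phase):
        start = self._resume_step_offset    # honor offset, then clear it
        self._resume_step_offset = 0
        return (phase.name, start)

phases = [Phase('phase1', 0, 5000), Phase('phase2', 5000, 7500)]
print(Trainer(phases, global_step=4250).train())
# → [('phase1', 4250), ('phase2', 0)]
```

With `global_step=4250` restored from a checkpoint, Phase 1 (steps 0–5000) resumes at step 4250 instead of restarting at 0, which is exactly the Session 2 bug this commit fixes.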

Files changed (2)
  1. context.md +270 -32
  2. src/training/mindi_trainer.py +13 -1
context.md CHANGED
@@ -1,6 +1,6 @@
  # MINDI 1.5 Vision-Coder — Complete Project Context
 
- > **Last updated:** April 16, 2026
  > **Purpose:** This file contains ALL context needed to continue development with any AI assistant.
  > It covers architecture decisions, errors encountered, fixes applied, training state, and exact next steps.
 
@@ -285,7 +285,7 @@ Also needed: `apt-get install -y git-lfs && git lfs install`
  2. `echo 1 > /sys/bus/pci/devices/0000:83:00.0/reset` (PCI address from `lspci | grep AMD`)
  3. If GPU% still 100%: `modprobe -r amdgpu && modprobe amdgpu`
  4. Verify `rocm-smi` shows GPU% = 0% before restarting Docker
- **Status:** Droplet was deleted. Will need to handle this on fresh droplet if it recurs.
 
  ### 6.8 HuggingFace Upload Limits
 
@@ -313,6 +313,7 @@ Also needed: `apt-get install -y git-lfs && git lfs install`
  export HF_TOKEN=<your-hf-token> # Get from HF settings page
  export HF_HUB_DISABLE_PROGRESS_BARS=1
  export PYTORCH_ROCM_ARCH=gfx942
  # DO NOT SET: HSA_OVERRIDE_GFX_VERSION (causes GPU hang on ROCm 7.0)
  ```
 
@@ -322,57 +323,56 @@ export PYTORCH_ROCM_ARCH=gfx942
  # 1. SSH into droplet
  ssh root@<DROPLET_IP>
 
- # 2. Start Docker
  docker start rocm
  docker exec -it rocm /bin/bash
 
- # 3. Set environment (inside Docker)
  export HF_TOKEN=<your-hf-token> # Get from HF settings page
  export HF_HUB_DISABLE_PROGRESS_BARS=1
  export PYTORCH_ROCM_ARCH=gfx942
 
- # 4. Quick GPU test
  python3 -c "import torch; print('GPU:', torch.cuda.get_device_name(0)); x=torch.randn(100,device='cuda'); print('OK:', x.sum().item())"
 
- # 5. Install git-lfs
  apt-get update && apt-get install -y git-lfs
  git lfs install
 
- # 6. Clone code repo
  cd /workspace
  git clone https://$HF_TOKEN:$HF_TOKEN@huggingface.co/Mindigenous/MINDI-1.5-Vision-Coder.git
  cd MINDI-1.5-Vision-Coder
 
- # 7. Install requirements
  pip install -r requirements-training.txt
 
- # 8. Download training data from HF dataset repo
- python3 -c "
- from huggingface_hub import snapshot_download
- import os
- # HF_TOKEN must be set in environment
- snapshot_download(
-     repo_id='Mindigenous/MINDI-1.5-training-data',
-     repo_type='dataset',
-     local_dir='data',
-     token=os.environ['HF_TOKEN'],
- )
- print('Data download complete!')
- "
 
- # 9. Verify data
- ls -la data/processed/
- ls -la data/websight/
- ls data/websight/images/ | head
 
- # 10. Run GPU diagnostic
  python3 scripts/gpu_diagnostic.py
 
- # 11. Dry run
  python3 scripts/train.py --dry_run --no_wandb
 
- # 12. Full training
- python3 scripts/train.py --no_wandb
  ```
 
  ### 7.4 GPU Hang Recovery (if it happens again)
@@ -388,6 +388,32 @@ rocm-smi # Should show 0% now
  docker start rocm
  ```
 
  ---
 
  ## 8. HF DATASET REPO STRUCTURE
@@ -461,12 +487,30 @@ cdc806e Fix: register LLM as nn.Module submodule so optimizer finds LoRA params
 
  ## 11. WHAT REMAINS (TODO) ❌
 
- 1. **Complete WebSight upload to HF** — Subdirs 04 and 05 still uploading (re-run `scripts/upload_websight_images.py` if interrupted)
  2. **Full 3-phase dry run** — Phase 2 (WebSight) and Phase 3 (mixed) NOT yet tested with the vision pipeline
  3. **Full production training** — 10,000 steps total (Phase 1: 5K, Phase 2: 2.5K, Phase 3: 2.5K)
  4. **Inference testing** — Generate code from screenshots after training
  5. **Commit `upload_websight_images.py` and `context.md`** — These new files need to be pushed
 
  ---
 
  ## 12. KNOWN ISSUES & GOTCHAS
@@ -548,8 +592,10 @@ When continuing with a new AI assistant:
  ```
  6. **Spin up fresh MI300X droplet** on DigitalOcean
  7. **Follow Section 7.3** for setup procedure
- 8. **Run dry run first** to verify all 3 phases work
- 9. **Then full training** `python3 scripts/train.py --no_wandb`
 
  ---
 
@@ -570,4 +616,196 @@ The `snapshot_download(local_dir='data')` call places everything correctly becau
 
  ---
 
  *This context file was created on April 16, 2026 during Claude Opus 4.6 session to ensure project continuity.*
  # MINDI 1.5 Vision-Coder — Complete Project Context
 
+ > **Last updated:** April 30, 2026 (Session 4)
  > **Purpose:** This file contains ALL context needed to continue development with any AI assistant.
  > It covers architecture decisions, errors encountered, fixes applied, training state, and exact next steps.
 
 
  2. `echo 1 > /sys/bus/pci/devices/0000:83:00.0/reset` (PCI address from `lspci | grep AMD`)
  3. If GPU% still 100%: `modprobe -r amdgpu && modprobe amdgpu`
  4. Verify `rocm-smi` shows GPU% = 0% before restarting Docker
+ **Status:** Droplet was deleted. Session 2 is on `134.199.197.198`.
 
  ### 6.8 HuggingFace Upload Limits
 
 
  export HF_TOKEN=<your-hf-token> # Get from HF settings page
  export HF_HUB_DISABLE_PROGRESS_BARS=1
  export PYTORCH_ROCM_ARCH=gfx942
+ export TOKENIZERS_PARALLELISM=false
  # DO NOT SET: HSA_OVERRIDE_GFX_VERSION (causes GPU hang on ROCm 7.0)
  ```
 
 
  # 1. SSH into droplet
  ssh root@<DROPLET_IP>
 
+ # 2. Verify GPU health on host (must show 0% GPU)
+ rocm-smi
+
+ # 3. Start Docker
  docker start rocm
  docker exec -it rocm /bin/bash
 
+ # 4. Set environment (inside Docker)
  export HF_TOKEN=<your-hf-token> # Get from HF settings page
  export HF_HUB_DISABLE_PROGRESS_BARS=1
  export PYTORCH_ROCM_ARCH=gfx942
+ export TOKENIZERS_PARALLELISM=false
 
+ # 5. Quick GPU test
  python3 -c "import torch; print('GPU:', torch.cuda.get_device_name(0)); x=torch.randn(100,device='cuda'); print('OK:', x.sum().item())"
 
+ # 6. Install git-lfs (ignore AMD artifactory DNS warning — harmless)
  apt-get update && apt-get install -y git-lfs
  git lfs install
 
+ # 7. Clone code repo
  cd /workspace
  git clone https://$HF_TOKEN:$HF_TOKEN@huggingface.co/Mindigenous/MINDI-1.5-Vision-Coder.git
  cd MINDI-1.5-Vision-Coder
 
+ # 8. Install requirements
  pip install -r requirements-training.txt
 
+ # 9. Download training data from HF dataset repo
+ # NOTE: Use git clone, NOT snapshot_download (which hits HTTP 429 rate limits)
+ # NOTE: Must rm -rf data first — code repo creates an empty data/ directory
+ rm -rf data
+ git clone https://$HF_TOKEN:$HF_TOKEN@huggingface.co/datasets/Mindigenous/MINDI-1.5-training-data data
+
+ # 10. Verify data
+ wc -l data/processed/train.jsonl data/processed/val.jsonl
+ wc -l data/websight/train.jsonl data/websight/val.jsonl
+ for d in data/websight/images/0*/; do echo "$d: $(ls $d | wc -l) files"; done
 
+ # 11. Create output directories
+ mkdir -p checkpoints/training checkpoints/best logs/training
 
+ # 12. Run GPU diagnostic
  python3 scripts/gpu_diagnostic.py
 
+ # 13. Dry run (test all 3 phases before full training)
  python3 scripts/train.py --dry_run --no_wandb
 
+ # 14. Full training (background, survives SSH disconnect)
+ nohup python3 scripts/train.py --no_wandb > /workspace/training.log 2>&1 &
  ```
 
  ### 7.4 GPU Hang Recovery (if it happens again)
 
  docker start rocm
  ```
 
+ ### 6.9 HuggingFace snapshot_download Rate Limit (HTTP 429)
+
+ **Symptom:** `HTTP Error 429 thrown while requesting GET .../tree/main` during `snapshot_download()`. Retries endlessly.
+ **Root cause:** The dataset has 52,500+ image files. `snapshot_download` paginates through the HF tree API listing all files, causing rate limiting.
+ **Fix:** Use `git clone` instead of `snapshot_download` for the dataset:
+ ```bash
+ rm -rf data
+ git clone https://$HF_TOKEN:$HF_TOKEN@huggingface.co/datasets/Mindigenous/MINDI-1.5-training-data data
+ ```
+ This downloads everything in a single git connection without hitting the API rate limiter.
+ **Discovered:** April 16, 2026 — Session 2
+
+ ### 6.10 Bash History Expansion with Exclamation Mark
+
+ **Symptom:** `bash: !': event not found` when running `python3 -c "...print('Done!')"` in a single line.
+ **Root cause:** Bash interprets `!'` inside double quotes as history expansion.
+ **Fix:** Use multi-line python commands (with actual newlines between double quotes) instead of single-line. Or use single quotes around the python code.
+ **Discovered:** April 16, 2026 — Session 2
+
+ ### 6.11 Data Directory Already Exists on Clone
+
+ **Symptom:** `fatal: destination path 'data' already exists and is not an empty directory` when trying to `git clone ... data`.
+ **Root cause:** The code repo clone creates an empty `data/` directory structure.
+ **Fix:** `rm -rf data` before cloning the dataset repo.
+ **Discovered:** April 16, 2026 — Session 2
+
  ---
 
  ## 8. HF DATASET REPO STRUCTURE
 
 
  ## 11. WHAT REMAINS (TODO) ❌
 
+ 1. ~~**Complete WebSight upload to HF**~~ — Check if subdirs 04 and 05 are uploaded; re-run upload script if needed
  2. **Full 3-phase dry run** — Phase 2 (WebSight) and Phase 3 (mixed) NOT yet tested with the vision pipeline
  3. **Full production training** — 10,000 steps total (Phase 1: 5K, Phase 2: 2.5K, Phase 3: 2.5K)
  4. **Inference testing** — Generate code from screenshots after training
  5. **Commit `upload_websight_images.py` and `context.md`** — These new files need to be pushed
 
+ ### Session 2 Status (April 16, 2026)
+ - ✅ Fresh droplet spun up at `134.199.197.198`
+ - ✅ Docker container started, GPU healthy (0% util, 45°C)
+ - ✅ Code repo cloned, dependencies installed
+ - ✅ GPU diagnostic: All 6 tests passed (bf16 matmul, 1GB alloc, forward pass)
+ - ⚠️ Data download: multiple rate limits (snapshot_download → git clone → git-lfs → hf_hub_download retries)
+ - ✅ All data downloaded: 1.3M text + 50K WebSight JSONL + 52,500 images
+ - ✅ Phase 1 dry run PASSED: loss 18.87 → 8.05 in 10 steps (10.8 min)
+ - ✅ Phase 2 dry run PASSED: loss 1.46 → 1.19, val_loss 1.32 in 10 steps (6.2 min)
+ - ✅ Phase 3 dry run PASSED: loss 14.10 → 9.71, val_loss 9.72 in 10 steps (8.2 min)
+ - ✅ Checkpoint upload to HF fixed (.gitignore was blocking *.pt, *.safetensors — removed model file patterns)
+ - ✅ Auto-push script running (pushes latest checkpoint to HF every 2 hours — fixed alphabetic sorting bug)
+ - ✅ Resume bug fixed: train() now skips completed phases and resumes mid-phase correctly
+ - ⏳ Phase 1 training: step 4500/5000, val_loss 0.5372 — on 3rd droplet (165.245.141.141)
+ - ⏳ Image download running: ~8300/52500 images (needed for Phase 2)
+ - 💰 Budget: ~$91 on current account, more accounts available
+ - 📋 Plan: finish Phase 1 → Phase 2 → Phase 3, auto-push checkpoints to HF
+
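
The "alphabetic sorting bug" called out in the auto-push status is easy to reproduce: comparing checkpoint directory names as strings ranks `step500` above `step4250`. A sketch of the failure and the numeric-key fix (the checkpoint names here are illustrative, patterned on the `step250 → step5000` naming used in the repo):

```python
import re

ckpts = ['phase1_step1000', 'phase1_step250', 'phase1_step4250', 'phase1_step500']

# Buggy: alphabetic comparison — '5' > '4', so step500 "beats" step4250
print(max(ckpts))               # phase1_step500 — wrong "latest" checkpoint

# Fixed: sort by the numeric step extracted from the name
def step_num(name):
    m = re.search(r'step(\d+)', name)
    return int(m.group(1)) if m else -1

print(max(ckpts, key=step_num)) # phase1_step4250 — the true latest
```

Any script that picks "the latest checkpoint" by `sorted(...)[-1]` on names has this bug; sorting on the extracted integer fixes it.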
  ---
 
  ## 12. KNOWN ISSUES & GOTCHAS
 
  ```
  6. **Spin up fresh MI300X droplet** on DigitalOcean
  7. **Follow Section 7.3** for setup procedure
+ 8. **IMPORTANT:** Use `git clone` for data download (NOT `snapshot_download` — see Section 6.9)
+ 9. **IMPORTANT:** `rm -rf data` before cloning dataset repo (see Section 6.11)
+ 10. **Run dry run first** to verify all 3 phases work
+ 11. **Then full training** — `nohup python3 scripts/train.py --no_wandb > /workspace/training.log 2>&1 &`
 
  ---
 
 
 
  ---
 
+ ## 16. APRIL 16, 2026 — MAIN TRAINING COMMANDS
+
+ ### Data Download (git clone — NOT snapshot_download)
+
+ ```bash
+ # Inside Docker, after cloning code repo:
+ rm -rf data
+ git clone https://$HF_TOKEN:$HF_TOKEN@huggingface.co/datasets/Mindigenous/MINDI-1.5-training-data data
+ ```
+
+ ### Training — Background (Recommended, survives SSH disconnect)
+
+ ```bash
+ # From inside Docker:
+ nohup python3 scripts/train.py --no_wandb > /workspace/training.log 2>&1 &
+ echo $! > /workspace/training.pid
+ ```
+
+ Or from the **host** (also survives SSH disconnect):
+
+ ```bash
+ docker exec -d rocm bash -lc 'cd /workspace/MINDI-1.5-Vision-Coder && export HF_TOKEN=<your-hf-token> && export PYTORCH_ROCM_ARCH=gfx942 && python3 scripts/train.py --no_wandb > /workspace/training.log 2>&1'
+ ```
+
+ ### Training — Interactive (Foreground)
+
+ ```bash
+ python3 scripts/train.py --no_wandb 2>&1 | tee /workspace/training.log
+ ```
+
+ ### Monitoring
+
+ ```bash
+ docker exec rocm tail -f /workspace/training.log   # Live logs
+ docker exec rocm rocm-smi                          # GPU usage
+ docker exec rocm ps aux | grep train.py            # Process check
+ ```
+
+ Notes:
+ - Use the background command if you want the process detached from your SSH session.
+ - The `scripts/train.py` launcher does not accept a `--log_file` flag; redirect output into `/workspace/training.log` instead.
+ - Line-buffered stdout has been added to `src/training/mindi_trainer.py` so logs should appear in near real-time when using `tail -f`.
+
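
The line-buffering note above can be illustrated with a small sketch (generic Python, not necessarily the exact change made in `mindi_trainer.py`):

```python
import io
import sys

# Option 1: force line buffering so each print() is flushed at the newline
# and becomes visible immediately to `tail -f` (Python 3.7+; guarded because
# some replacement stdout streams lack reconfigure()).
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(line_buffering=True)

# Option 2: flush explicitly on each log line (shown here into a StringIO
# so the effect is observable without a real terminal).
buf = io.StringIO()
print("step 100 | loss 0.53", file=buf, flush=True)
print(buf.getvalue(), end="")   # → step 100 | loss 0.53
```

Without either option, stdout redirected into a file is block-buffered, so `tail -f` shows nothing until several kilobytes of logs accumulate.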
+ ## 17. DROPLET HISTORY
+
+ | Session | Date | Droplet IP | Status | Notes |
+ |---------|------|------------|--------|-------|
+ | 1 | April 15, 2026 | `134.199.194.245` | Deleted | Phase 1 dry run passed. GPU hung during heavy I/O. |
+ | 2 | April 16, 2026 | `134.199.197.198` | Deleted | Phase 1 steps 0→4250 completed. Credits exhausted. |
+ | 3 | April 19, 2026 | `165.245.141.141` | Active | Phase 1 resumed at step 4250. Resume bug fixed. |
+
+ ---
+
  *This context file was created on April 16, 2026 during Claude Opus 4.6 session to ensure project continuity.*
+ *Updated on April 16, 2026 — Session 2: snapshot_download 429 fix, bash escaping, fresh droplet setup.*
+ *Updated on April 28, 2026 — Training complete, frontend built, API deployed.*
+ *Updated on April 30, 2026 — Session 4: Fixed critical frontend bugs, Gradio 5.x API protocol, ZeroGPU quota handling.*
+
+ ---
+
+ ## 22. SESSION 4 — April 30, 2026
+
+ ### Bugs Found & Fixed
+
+ **Bug 6.12: `handleSend` ReferenceError (app.js)**
+ - **Symptom:** Agent integration broken on page load — `const _originalSend = handleSend` throws a ReferenceError because `handleSend` was never defined (the actual function is `send`)
+ - **Fix:** Changed to a `let activeSend = send` pattern — `init()` overrides `activeSend = handleSendWithAgent` when MINDIAgent is available. Eliminated duplicate keydown event handlers.
+ - **File:** `frontend/app.js`
+
+ **Bug 6.13: Gradio 5.x API protocol mismatch**
+ - **Symptom:** `POST /api/predict` returns 404 — the frontend used the old Gradio 3.x API format
+ - **Root cause:** The HF Space runs Gradio 5.23.0, which uses the SSE v3 protocol with `/gradio_api/call/{api_name}` (two-step: POST to submit → GET to stream the result)
+ - **Fix:** Rewrote `callGenerate()` to use the Gradio 5.x two-step flow: POST `/gradio_api/call/chat_fn` → get `event_id` → GET `/gradio_api/call/chat_fn/{event_id}` → parse the SSE response for the `event: complete` data
+ - **File:** `frontend/app.js`
+ - **Config reference:** `GET /config` returns `{"api_prefix": "/gradio_api", "protocol": "sse_v3", "dependencies": [{"api_name": "chat_fn"}]}`
+
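
The two-step flow in the Bug 6.13 fix can be sketched language-agnostically in Python with only the standard library. The Space URL below is a placeholder; `chat_fn` and the `/gradio_api/call/...` paths come from the `/config` response quoted above, and the SSE parsing is a simplified illustration, not the actual `callGenerate()` code.

```python
import json
import urllib.request

BASE = "https://example-space.hf.space"   # placeholder Space URL

def parse_sse_complete(sse_text):
    """Return the JSON payload following an `event: complete` line in an SSE body."""
    lines = sse_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "event: complete" and i + 1 < len(lines):
            data_line = lines[i + 1]
            if data_line.startswith("data:"):
                return json.loads(data_line[len("data:"):].strip())
    return None

def call_generate(prompt):
    # Step 1: POST the inputs; Gradio 5.x responds with an event_id
    req = urllib.request.Request(
        f"{BASE}/gradio_api/call/chat_fn",
        data=json.dumps({"data": [prompt]}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        event_id = json.load(resp)["event_id"]
    # Step 2: GET the same path + event_id and read the SSE stream
    with urllib.request.urlopen(f"{BASE}/gradio_api/call/chat_fn/{event_id}") as resp:
        return parse_sse_complete(resp.read().decode())

# SSE parsing, exercised on a canned response (no network):
sse = "event: complete\ndata: [\"<html>...</html>\"]\n\n"
print(parse_sse_complete(sse))
# → ['<html>...</html>']
```

`call_generate()` requires a live Space; the canned-response check at the bottom runs offline.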
+ **Bug 6.14: Health check misdetects Gradio Space as offline**
+ - **Symptom:** Status shows "Demo Mode" even when the Space is running
+ - **Root cause:** `pingHealth()` tried `/api/health` (doesn't exist on Gradio), then `/api/predict` (old format → 404)
+ - **Fix:** For HF Spaces, use `fetch(base, {mode:'no-cors'})`, which succeeds if the Space is reachable
+ - **File:** `frontend/app.js`
+
+ **Improvement: ZeroGPU quota error handling**
+ - Reduced `@spaces.GPU(duration=120)` → `@spaces.GPU(duration=60)` (inference is fast after model cache)
+ - Added try-except in `chat_fn()` to return a clean JSON error instead of crashing when the GPU quota is exceeded
+ - **File:** `hf_space/app.py`
+
+ ### Session 4 Status
+ - ✅ Frontend bugs fixed (handleSend reference, duplicate handlers)
+ - ✅ Gradio 5.x API protocol implemented (SSE v3 two-step flow)
+ - ✅ Health check fixed — shows green "MINDI · HF Space" status
+ - ✅ Space updated on HF — `Mindigenous/mindi-chat`
+ - ⚠️ ZeroGPU daily quota limit can block visitors — PRO users get 8x more quota
+ - ✅ Agent system (agent.js + sandbox.js) scaffolded — Plan→Generate→Execute→Verify→Fix loop
+ - 📋 Next: Wait for quota reset, then test the full end-to-end flow with real model inference
+
+ ### Training Summary
+ All 3 phases of MINDI 1.5 Vision-Coder training are COMPLETE:
+
+ | Phase | Steps | Status | Platform |
+ |-------|-------|--------|----------|
+ | Phase 1 (LoRA) | 5,000 | ✅ Complete | DigitalOcean MI300X |
+ | Phase 2 (Vision Bridge) | 2,500 | ✅ Complete | DigitalOcean MI300X |
+ | Phase 3 (Joint) steps 0-1500 | 1,500 | ✅ Complete | DigitalOcean MI300X |
+ | Phase 3 (Joint) steps 1500-2500 | 1,000 | ✅ Complete | Modal A100-40GB |
+
+ ### Modal Training Details
+ - Resumed from the step 1500 checkpoint on Modal A100-40GB ($2.10/hr)
+ - Config patched at runtime: batch_size=2, max_length=2048 (down from 6/4096)
+ - Total Modal cost: ~$28 (of $30 credits)
+ - Final loss: 0.25–0.40 range
+
+ ### HuggingFace Checkpoints (Mindigenous/MINDI-1.5-Vision-Coder)
+ All checkpoints uploaded to the `checkpoints/` directory:
+ - Phase 1: 16 checkpoints (step250 → step5000)
+ - Phase 2: 10 checkpoints (step250 → step2500)
+ - Phase 3: `phase3_all_step500`, `step1000`, `step1500`, `step2000`, `phase3_all_step2500_final`, `phase3_final`
+
+ ### Model Test Results (April 28, 2026)
+ - ✅ Code generation (text-only): matrix-exponentiation Fibonacci
+ - ✅ HTML/CSS generation: gradient + responsive design
+ - ✅ Vision (image input): processed a dummy image
+ - ✅ Agentic (bug fix): identified a subtraction→addition bug
+ - VRAM usage: 17.2 GB (A100-40GB)
+
+ ---
+
+ ## 19. FRONTEND
+
+ ### Location: `frontend/`
+ - `index.html` — Three-panel layout (sidebar + chat + code preview)
+ - `styles.css` — Premium dark theme with purple/blue gradients
+ - `app.js` — Chat logic, image upload, code extraction, demo mode
+
+ ### Features
+ - Chat interface with code block rendering (Prism.js)
+ - Image upload for vision-to-code
+ - Code preview panel with tabs (Code / Preview / Sections)
+ - Special token parsing (thinking, critique, fix, error)
+ - Demo mode (works without API — simulated responses)
+ - Settings modal (double-click MINDI logo) to configure API endpoint
+ - Responsive design (mobile + desktop)
+
+ ### To Run Locally
+ ```bash
+ cd frontend
+ python -m http.server 8080
+ # Open http://localhost:8080
+ ```
+
+ ---
+
+ ## 20. MODAL API SERVER
+
+ ### File: `modal_api.py`
+ FastAPI web endpoint that:
+ 1. Loads MINDI 1.5 from volume checkpoint on container startup
+ 2. Exposes `/api/generate` (POST) and `/api/health` (GET)
+ 3. Accepts text + optional base64 image
+ 4. Returns response + parsed special token sections
+ 5. CORS enabled for frontend
+
+ ### Deployment
+ ```bash
+ modal deploy modal_api.py
+ # Returns a URL like: https://mindigenous-ai--mindi-api-api.modal.run
+ ```
+
+ ### Cost
+ - A100 @ $2.10/hr, scales to zero when idle
+ - ~$0.01-0.05 per request
+ - Container idle timeout: 5 minutes
+
+ ### Connect Frontend to API
+ 1. Open frontend at http://localhost:8080
+ 2. Double-click the MINDI logo (top-left sidebar)
+ 3. Enter the Modal API URL
+ 4. Save settings
+
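
A client-side sketch of calling the endpoint described above, using only the standard library. The URL is the example from the deployment output; the JSON field names `prompt` and `image_b64` are assumptions made for illustration — check `modal_api.py` for the real request schema.

```python
import base64
import json
from urllib import request

API_URL = "https://mindigenous-ai--mindi-api-api.modal.run"  # example URL from `modal deploy`

def build_generate_payload(prompt, image_path=None):
    """Build a JSON body for POST /api/generate.

    Field names (`prompt`, `image_b64`) are hypothetical — verify against modal_api.py.
    """
    payload = {"prompt": prompt}
    if image_path is not None:
        # The API accepts an optional base64-encoded image alongside the text
        with open(image_path, "rb") as f:
            payload["image_b64"] = base64.b64encode(f.read()).decode("ascii")
    return json.dumps(payload)

def generate(prompt, image_path=None):
    req = request.Request(
        f"{API_URL}/api/generate",
        data=build_generate_payload(prompt, image_path).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:   # network call — requires the API to be deployed
        return json.load(resp)

print(build_generate_payload("Generate a landing page"))
# → {"prompt": "Generate a landing page"}
```

`generate()` needs the Modal app running; `build_generate_payload()` can be exercised offline.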
+ ---
+
+ ## 21. REMAINING BUDGET & NEXT STEPS
+
+ ### Budget
+ - Modal: $2.21 remaining (~1 hour A100 time)
+ - DigitalOcean: exhausted
+
+ ### Next Steps
+ 1. Deploy API when more credits available
+ 2. Host frontend on Vercel/GitHub Pages (free)
+ 3. Consider HuggingFace Spaces (free T4) with 4-bit quantization as alternative
+ 4. Push frontend to GitHub/HF repos
+
src/training/mindi_trainer.py CHANGED
@@ -528,7 +528,10 @@ class MINDITrainer:
 
         self.model.train()
         phase_steps = phase.end_step - phase.start_step
-        step_in_phase = 0
         accum_loss = 0.0
         accum_count = 0
         phase_start_time = time.time()
@@ -679,6 +682,15 @@ class MINDITrainer:
         phase_summaries = []
 
         for phase in self.config.phases:
             summary = self.train_phase(phase)
             phase_summaries.append(summary)
 
 
         self.model.train()
         phase_steps = phase.end_step - phase.start_step
+        step_in_phase = getattr(self, '_resume_step_offset', 0)
+        if step_in_phase > 0:
+            print(f" [{phase.name}] Resuming from step {step_in_phase}/{phase_steps}")
+            self._resume_step_offset = 0  # Clear after use
         accum_loss = 0.0
         accum_count = 0
         phase_start_time = time.time()

         phase_summaries = []
 
         for phase in self.config.phases:
+            # Skip completed phases on resume
+            if self.global_step >= phase.end_step:
+                print(f" Skipping {phase.name} (already completed, global_step={self.global_step})")
+                continue
+            # Resume mid-phase: calculate how many steps are already done
+            if self.global_step > phase.start_step:
+                done_in_phase = self.global_step - phase.start_step
+                self._resume_step_offset = done_in_phase
+                print(f" Resuming {phase.name} at step {done_in_phase}/{phase.end_step - phase.start_step}")
             summary = self.train_phase(phase)
             phase_summaries.append(summary)