AGILLM4_training_script_and_bounded_uploads

Files changed (7)

AGILLM-4.md +11 -7
README.md +8 -2
launch_agillm4_4090_floor_from_v47.sh +6 -3
nB300_agillm4.py +72 -10
run_agillm4_4090_longblock.sh +5 -5
upload_agillm4_checkpoints.py +214 -0
upload_agillm4_checkpoints_loop.sh +23 -0

AGILLM-4.md CHANGED

@@ -37,7 +37,7 @@ Production first-run recipe on 4090:
 ```bash
 AGILLM4_4090_WARMSTART_FROM=/workspace/agillm-4/agillm4_floor_seed_from_v3_v47.pt \
 AGILLM4_4090_PRESET=agillm4_floor \
-AGILLM4_4090_BLOCK=512 \
 AGILLM4_4090_TOKEN_PARAM_RATIO=100 \
 tmux new-session -d -s agillm4_floor \
   /workspace/agillm-4/launch_agillm4_4090_floor_from_v47.sh
@@ -47,14 +47,18 @@ Important: `--sat_every 1 --nat_every 4` keeps SAT trained every step and NAT ac
 Escalation ladder on 4090:
-1. `block=512`
-2. `block=640`
-3. `block=768`
-4. `block=1024`
-5. `block=1280+` only after measured VRAM headroom
 If 8-bit optimizer is unavailable, install `bitsandbytes` rather than dropping the long-block target. SAT remains active every step; NAT should stay enabled with a slower cadence or `--nat_max_tokens` cap on 24GB. The code lowers peak memory by backpropagating AR, SAT, and NAT sequentially, not by deleting heads.
 ## Warm Start from AGILLM-3 (function-preserving)
 The next AGILLM-4 run does **not** start from a random init. `build_v4_seed.py`
@@ -96,7 +100,7 @@ Use with `--warmstart_from`:
 python /workspace/agillm-4/nB300_agillm4.py train \
   --preset agillm4_floor \
   --warmstart_from /workspace/agillm-4/agillm4_floor_seed_from_v3_v47.pt \
-  --batch_size 1 --block 512 --amp --grad_checkpoint --sat_every 1 --nat_every 4 \
   --token_param_ratio 100
 ```

 ```bash
 AGILLM4_4090_WARMSTART_FROM=/workspace/agillm-4/agillm4_floor_seed_from_v3_v47.pt \
 AGILLM4_4090_PRESET=agillm4_floor \
+AGILLM4_4090_BLOCK=1280 \
 AGILLM4_4090_TOKEN_PARAM_RATIO=100 \
 tmux new-session -d -s agillm4_floor \
   /workspace/agillm-4/launch_agillm4_4090_floor_from_v47.sh
 Escalation ladder on 4090:
+1. Start `block=1280`.
+2. If OOM, back off by about 20% at a time (`1024`, `768`, ...), not straight to half.
+3. If stable below 23GB VRAM after real step timing, raise toward `1536`.
 If 8-bit optimizer is unavailable, install `bitsandbytes` rather than dropping the long-block target. SAT remains active every step; NAT should stay enabled with a slower cadence or `--nat_max_tokens` cap on 24GB. The code lowers peak memory by backpropagating AR, SAT, and NAT sequentially, not by deleting heads.
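The escalation ladder above can be sketched as a small schedule function. This is a minimal illustration, assuming the back-off rounds down to a multiple of 128 with a floor of 128, which is what the OOM handler in `nB300_agillm4.py` in this commit appears to do. The names `next_block` and `backoff_schedule` are hypothetical and do not exist in the repository.

```python
from __future__ import annotations


def next_block(block: int, floor: int = 128, step: int = 128) -> int:
    """Shrink `block` by about 20%, rounded down to a multiple of `step`."""
    candidate = (block * 4) // 5                      # roughly 0.8 * block, in integer math
    candidate = max(floor, (candidate // step) * step)
    if candidate >= block:                            # rounding stalled: force one step down
        candidate = max(floor, block - step)
    return candidate


def backoff_schedule(start: int, floor: int = 128) -> list[int]:
    """List the block sizes that would be tried after repeated OOMs."""
    sizes = []
    block = start
    while block > floor:
        block = next_block(block, floor)
        sizes.append(block)
    return sizes


print(backoff_schedule(1280))  # [1024, 768, 512, 384, 256, 128]
```

Starting from `1280`, the first two fallbacks are `1024` and `768`, which matches the ladder in step 2.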
+For the year-scale full run, keep checkpoint retention bounded. The 4090 launcher
+keeps one local full checkpoint and one local delta. The companion uploader
+publishes status/log tails every 30 minutes, uploads the newest delta at most
+daily, uploads full checkpoints at most weekly, and prunes remote AGILLM-4
+training uploads to the configured rolling window.
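The rate limiting described above can be sketched as a single predicate. This is an illustration only, assuming the uploader skips a file it has already uploaded and otherwise waits for the configured interval. `upload_due` and its parameters are hypothetical names, not part of the uploader script.

```python
from __future__ import annotations

DAY = 24 * 3600
WEEK = 7 * DAY


def upload_due(now: float, last_upload: float, interval_sec: int,
               identity: str, last_identity: str | None) -> bool:
    """Return True when a large checkpoint should be uploaded.

    `identity` is any string that changes when the file changes,
    for example name, size, and mtime joined together (an assumption here).
    """
    if identity == last_identity:
        return False          # same file as the previous upload: nothing to do
    if last_upload and now - last_upload < interval_sec:
        return False          # file changed, but the interval has not elapsed
    return True


# A new delta appears 2 hours after the previous upload: still rate-limited.
print(upload_due(2 * 3600, 1.0, DAY, "delta_b", "delta_a"))   # False
# The same delta checked just over a day later: due.
print(upload_due(DAY + 10, 1.0, DAY, "delta_b", "delta_a"))   # True
```

With `interval_sec=DAY` for deltas and `interval_sec=WEEK` for full checkpoints, a 30-minute polling loop can run this check on every tick without exceeding the daily or weekly cadence.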
 ## Warm Start from AGILLM-3 (function-preserving)
 The next AGILLM-4 run does **not** start from a random init. `build_v4_seed.py`
 python /workspace/agillm-4/nB300_agillm4.py train \
   --preset agillm4_floor \
   --warmstart_from /workspace/agillm-4/agillm4_floor_seed_from_v3_v47.pt \
+  --batch_size 1 --block 1280 --amp --grad_checkpoint --sat_every 1 --nat_every 4 \
   --token_param_ratio 100
 ```

README.md CHANGED

@@ -29,10 +29,16 @@ recipes. The current sublinear backend is intentionally experimental: profile it
 against SDPA before using it for a real run.
 On RTX 4090-class 24GB cards, `run_agillm4_4090_longblock.sh` now defaults to
-`agillm4_floor` instead of the AGILLM-3-sized `large` preset. Override
-`AGILLM4_4090_BLOCK` upward only after the first floor run is stable.
 For the current v47 seed, launch tmux with
 `/workspace/agillm-4/launch_agillm4_4090_floor_from_v47.sh`; it writes
 `/workspace/agillm4_floor_train.log`.
 Current harvest status from n1.py is tracked in [N1_HARVEST.md](N1_HARVEST.md).

 against SDPA before using it for a real run.
 On RTX 4090-class 24GB cards, `run_agillm4_4090_longblock.sh` now defaults to
+`agillm4_floor` instead of the AGILLM-3-sized `large` preset, starts at block
+`1280`, and backs off in roughly 20% steps (not by halving) if VRAM is too tight.
 For the current v47 seed, launch tmux with
 `/workspace/agillm-4/launch_agillm4_4090_floor_from_v47.sh`; it writes
 `/workspace/agillm4_floor_train.log`.
+Checkpoint upload policy is intentionally bounded for the public HF storage
+quota: status and log tails upload every 30 minutes, the latest multi-GB delta
+uploads at most daily, and full checkpoints upload at most weekly; only the two
+newest remote files of each kind are kept. Local full saves default to daily and
+local retention is one full plus one delta, so the 64GB Vast disk does not fill.
 Current harvest status from n1.py is tracked in [N1_HARVEST.md](N1_HARVEST.md).

launch_agillm4_4090_floor_from_v47.sh CHANGED

@@ -9,11 +9,14 @@ echo "LAUNCH_AGILLM4_4090_FLOOR_FROM_V47 $(date -u +%Y-%m-%dT%H:%M:%SZ) host=$(h
 export AGILLM4_4090_WARMSTART_FROM="${AGILLM4_4090_WARMSTART_FROM:-/workspace/agillm-4/agillm4_floor_seed_from_v3_v47.pt}"
 export AGILLM4_4090_PRESET="${AGILLM4_4090_PRESET:-agillm4_floor}"
-export AGILLM4_4090_BLOCK="${AGILLM4_4090_BLOCK:-512}"
 export AGILLM4_4090_TOKEN_PARAM_RATIO="${AGILLM4_4090_TOKEN_PARAM_RATIO:-100}"
 export AGILLM4_4090_NAT_EVERY="${AGILLM4_4090_NAT_EVERY:-4}"
-export AGILLM4_4090_NAT_MAX_TOKENS="${AGILLM4_4090_NAT_MAX_TOKENS:-512}"
-export AGILLM4_4090_SAVE_EVERY_SEC="${AGILLM4_4090_SAVE_EVERY_SEC:-21600}"
 export AGILLM4_4090_SAVE_DIR="${AGILLM4_4090_SAVE_DIR:-/workspace/agillm4_4090_ckpts}"
 exec bash /workspace/agillm-4/run_agillm4_4090_longblock.sh

 export AGILLM4_4090_WARMSTART_FROM="${AGILLM4_4090_WARMSTART_FROM:-/workspace/agillm-4/agillm4_floor_seed_from_v3_v47.pt}"
 export AGILLM4_4090_PRESET="${AGILLM4_4090_PRESET:-agillm4_floor}"
+export AGILLM4_4090_BLOCK="${AGILLM4_4090_BLOCK:-1280}"
 export AGILLM4_4090_TOKEN_PARAM_RATIO="${AGILLM4_4090_TOKEN_PARAM_RATIO:-100}"
 export AGILLM4_4090_NAT_EVERY="${AGILLM4_4090_NAT_EVERY:-4}"
+export AGILLM4_4090_NAT_MAX_TOKENS="${AGILLM4_4090_NAT_MAX_TOKENS:-768}"
+export AGILLM4_4090_SAVE_EVERY_SEC="${AGILLM4_4090_SAVE_EVERY_SEC:-86400}"
+export AGILLM4_4090_DELTA_EVERY_STEPS="${AGILLM4_4090_DELTA_EVERY_STEPS:-25000}"
+export AGILLM4_4090_DELTA_MAX_KEEP="${AGILLM4_4090_DELTA_MAX_KEEP:-1}"
+export AGILLM4_4090_MAX_CKPTS="${AGILLM4_4090_MAX_CKPTS:-1}"
 export AGILLM4_4090_SAVE_DIR="${AGILLM4_4090_SAVE_DIR:-/workspace/agillm4_4090_ckpts}"
 exec bash /workspace/agillm-4/run_agillm4_4090_longblock.sh

nB300_agillm4.py CHANGED

@@ -69,7 +69,11 @@ _STATUS_PROGRESS_RE = re.compile(
     r"^\[(?P<percent>\d+(?:\.\d+)?)%\]\s+"
     r"(?P<seen>[\d,]+)/(?P<target>[\d,]+)\s+tok\s+\|\s+"
     r"(?P<tok_s>[\d.]+)\s+tok/s\s+\|\s+"
-    r"loss=(?P<loss>-?[\d.]+)\s+B=(?P<batch>\d+)\s+L=(?P<block>\d+)\s*$"
 )
 _STATUS_DELTA_RE = re.compile(r"\[delta\]\s+saved\s+(?P<name>\S+?\.pt)\s+\((?P<sha>[0-9a-f]+)\.\.\.\)")
 _STATUS_STEP_RE = re.compile(r"step(?P<step>\d+)")
@@ -99,6 +103,30 @@ def _status_human_duration(seconds: Optional[float]) -> Optional[str]:
     return " ".join(parts)
 def _status_format_int(value: Optional[int]) -> str:
     return "?" if value is None else f"{value:,}"
@@ -193,6 +221,9 @@ def _status_parse_progress_line(line: str) -> Optional[Dict[str, Any]]:
         "loss": loss,
         "batch": int(match.group("batch")),
         "block": int(match.group("block")),
     }
@@ -508,12 +539,18 @@ def _format_status_text(status: Dict[str, Any]) -> str:
     progress = status.get("progress")
     if progress:
         lines.append(
             "Progress: "
             f"{progress['percent']:.1f}% | "
             f"{_status_format_int(progress['seen_tokens'])}/{_status_format_int(progress['target_tokens'])} tok | "
             f"{progress['tok_per_sec']} tok/s | loss {progress['loss']:.3f} | "
             f"B={progress['batch']} L={progress['block']}"
         )
     else:
         lines.append("Progress: unavailable")
@@ -590,23 +627,45 @@ from anchor_memory import AnchorMemoryConfig, AnchorMemoryLayer
 # SafeProgress - Claude-safe progress (discrete lines, not single growing line)
 class SafeProgress:
-    def __init__(self, total, initial=0, unit="tok", print_every=500):
         self.total, self.n, self.unit = total, initial, unit
         self.initial = initial
         self.last_print, self.postfix = initial, {}
         self.start_time = __import__('time').time()
     def update(self, n=1):
         self.n += n
-        if self.n - self.last_print >= 1000000:  # print every ~1M tokens
-            self._print(); self.last_print = self.n
     def set_postfix(self, **kwargs): self.postfix = kwargs
-    def _print(self):
-        elapsed = __import__('time').time() - self.start_time
         rate = (self.n - self.initial) / elapsed if elapsed > 0 else 0
         pct = 100 * self.n / self.total if self.total > 0 else 0
         pf = ' '.join(f"{k}={v}" for k,v in self.postfix.items())
-        print(f"[{pct:.1f}%] {self.n:,}/{self.total:,} {self.unit} | {rate:.0f} tok/s | {pf}")
-    def close(self): self._print(); print("Done.")
 import torch.nn as nn
 import torch.nn.functional as F
@@ -2118,7 +2177,10 @@ def _train_phase(
                     BATCH -= 1
                     time.sleep(2)
                 else:
-                    new_block = max(128, BLOCK // 2)
                     print(f"\n[{phase_name} OOM] Reducing Block: {BLOCK} -> {new_block}")
                     BLOCK = new_block
                     time.sleep(2)
@@ -2139,8 +2201,8 @@ def _train_phase(
         oom_retries = 0
         toks_processed = BLOCK * BATCH
         seen_tok += toks_processed
-        pbar.update(toks_processed)
         pbar.set_postfix(loss=f"{loss_value:.3f}", B=BATCH, L=BLOCK)
         if args.save_every_sec > 0:
             now_mono = time.monotonic()
             if now_mono - last_save_mono >= args.save_every_sec:

     r"^\[(?P<percent>\d+(?:\.\d+)?)%\]\s+"
     r"(?P<seen>[\d,]+)/(?P<target>[\d,]+)\s+tok\s+\|\s+"
     r"(?P<tok_s>[\d.]+)\s+tok/s\s+\|\s+"
+    r"loss=(?P<loss>-?[\d.]+)\s+B=(?P<batch>\d+)\s+L=(?P<block>\d+)"
+    r"(?:\s+step=(?P<step>\d+))?"
+    r"(?:\s+eta=(?P<eta>\S+))?"
+    r"(?:\s+elapsed=(?P<elapsed>\S+))?"
+    r"\s*$"
 )
 _STATUS_DELTA_RE = re.compile(r"\[delta\]\s+saved\s+(?P<name>\S+?\.pt)\s+\((?P<sha>[0-9a-f]+)\.\.\.\)")
 _STATUS_STEP_RE = re.compile(r"step(?P<step>\d+)")
     return " ".join(parts)
+def _status_compact_duration(seconds: Optional[float]) -> str:
+    if seconds is None:
+        return "unknown"
+    try:
+        if not math.isfinite(float(seconds)):
+            return "unknown"
+    except Exception:
+        return "unknown"
+    total = max(0, int(seconds))
+    years, rem = divmod(total, 365 * 86400)
+    days, rem = divmod(rem, 86400)
+    hours, rem = divmod(rem, 3600)
+    minutes, secs = divmod(rem, 60)
+    if years:
+        return f"{years}y{days}d{hours}h"
+    if days:
+        return f"{days}d{hours}h{minutes}m"
+    if hours:
+        return f"{hours}h{minutes}m{secs}s"
+    if minutes:
+        return f"{minutes}m{secs}s"
+    return f"{secs}s"
 def _status_format_int(value: Optional[int]) -> str:
     return "?" if value is None else f"{value:,}"
         "loss": loss,
         "batch": int(match.group("batch")),
         "block": int(match.group("block")),
+        "step": int(match.group("step")) if match.group("step") else None,
+        "eta": match.group("eta"),
+        "elapsed": match.group("elapsed"),
     }
     progress = status.get("progress")
     if progress:
+        eta = progress.get("eta")
+        if not eta and progress.get("tok_per_sec"):
+            remaining = max(0, progress["target_tokens"] - progress["seen_tokens"])
+            eta = _status_compact_duration(remaining / float(progress["tok_per_sec"]))
         lines.append(
             "Progress: "
             f"{progress['percent']:.1f}% | "
             f"{_status_format_int(progress['seen_tokens'])}/{_status_format_int(progress['target_tokens'])} tok | "
             f"{progress['tok_per_sec']} tok/s | loss {progress['loss']:.3f} | "
             f"B={progress['batch']} L={progress['block']}"
+            + (f" | step {progress['step']}" if progress.get("step") else "")
+            + (f" | ETA {eta}" if eta else "")
         )
     else:
         lines.append("Progress: unavailable")
 # SafeProgress - Claude-safe progress (discrete lines, not single growing line)
 class SafeProgress:
+    def __init__(self, total, initial=0, unit="tok", print_every=100, print_every_sec=60):
         self.total, self.n, self.unit = total, initial, unit
         self.initial = initial
         self.last_print, self.postfix = initial, {}
+        self.print_every = max(1, int(print_every))
+        self.print_every_sec = max(1, int(print_every_sec))
+        self.step = 0
+        self.last_print_step = 0
         self.start_time = __import__('time').time()
+        self.last_print_time = self.start_time
     def update(self, n=1):
         self.n += n
+        self.step += 1
+        now = __import__('time').time()
+        if (
+            self.step == 1
+            or (self.step - self.last_print_step) >= self.print_every
+            or (now - self.last_print_time) >= self.print_every_sec
+        ):
+            self._print(now)
+            self.last_print = self.n
+            self.last_print_step = self.step
+            self.last_print_time = now
     def set_postfix(self, **kwargs): self.postfix = kwargs
+    def _print(self, now=None):
+        now = now or __import__('time').time()
+        elapsed = now - self.start_time
         rate = (self.n - self.initial) / elapsed if elapsed > 0 else 0
         pct = 100 * self.n / self.total if self.total > 0 else 0
         pf = ' '.join(f"{k}={v}" for k,v in self.postfix.items())
+        remaining = max(0, self.total - self.n)
+        eta = _status_compact_duration(remaining / rate) if rate > 0 else "unknown"
+        elapsed_s = _status_compact_duration(elapsed)
+        print(
+            f"[{pct:.4f}%] {self.n:,}/{self.total:,} {self.unit} | "
+            f"{rate:.2f} tok/s | {pf} step={self.step} eta={eta} elapsed={elapsed_s}",
+            flush=True,
+        )
+    def close(self): self._print(); print("Done.", flush=True)
 import torch.nn as nn
 import torch.nn.functional as F
                     BATCH -= 1
                     time.sleep(2)
                 else:
+                    new_block = max(128, int(BLOCK * 0.8))
+                    new_block = max(128, (new_block // 128) * 128)
+                    if new_block >= BLOCK:
+                        new_block = max(128, BLOCK - 128)
                     print(f"\n[{phase_name} OOM] Reducing Block: {BLOCK} -> {new_block}")
                     BLOCK = new_block
                     time.sleep(2)
         oom_retries = 0
         toks_processed = BLOCK * BATCH
         seen_tok += toks_processed
         pbar.set_postfix(loss=f"{loss_value:.3f}", B=BATCH, L=BLOCK)
+        pbar.update(toks_processed)
         if args.save_every_sec > 0:
             now_mono = time.monotonic()
             if now_mono - last_save_mono >= args.save_every_sec:
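
The extended progress pattern in this file accepts both the old line format and the new one with `step=`, `eta=`, and `elapsed=` suffixes. A minimal sketch of that behaviour follows, copying the pattern from the hunk above. The two sample log lines and the helper `parse_progress` are made up for illustration and are not taken from a real run.

```python
import re

# Pattern copied from the new side of the diff; the trailing fields are optional.
PROGRESS_RE = re.compile(
    r"^\[(?P<percent>\d+(?:\.\d+)?)%\]\s+"
    r"(?P<seen>[\d,]+)/(?P<target>[\d,]+)\s+tok\s+\|\s+"
    r"(?P<tok_s>[\d.]+)\s+tok/s\s+\|\s+"
    r"loss=(?P<loss>-?[\d.]+)\s+B=(?P<batch>\d+)\s+L=(?P<block>\d+)"
    r"(?:\s+step=(?P<step>\d+))?"
    r"(?:\s+eta=(?P<eta>\S+))?"
    r"(?:\s+elapsed=(?P<elapsed>\S+))?"
    r"\s*$"
)


def parse_progress(line):
    """Parse one progress line; return None if it does not match."""
    m = PROGRESS_RE.match(line.strip())
    if not m:
        return None
    seen = int(m.group("seen").replace(",", ""))
    target = int(m.group("target").replace(",", ""))
    tok_s = float(m.group("tok_s"))
    remaining = max(0, target - seen)
    return {
        "seen": seen,
        "target": target,
        "tok_s": tok_s,
        "step": int(m.group("step")) if m.group("step") else None,
        "eta": m.group("eta"),
        # Fallback ETA in seconds, computed from the rate when eta= is absent.
        "eta_seconds": remaining / tok_s if tok_s > 0 else None,
    }


# Hypothetical sample lines in the new and old formats.
new_line = ("[0.0128%] 1,280/10,000,000 tok | 640.00 tok/s | "
            "loss=3.210 B=1 L=1280 step=1 eta=4h20m23s elapsed=2s")
old_line = "[0.1%] 1,000,000/1,000,000,000 tok | 640 tok/s | loss=3.210 B=1 L=512"

print(parse_progress(new_line)["eta"])          # 4h20m23s
print(parse_progress(old_line)["eta_seconds"])  # 1560937.5
```

Because the suffix groups are optional, logs written before this change still parse, and the status formatter can fall back to a rate-based ETA for them.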

run_agillm4_4090_longblock.sh CHANGED

@@ -13,7 +13,7 @@ if [ -f /root/.cache/huggingface/token ]; then
 fi
 PRESET="${AGILLM4_4090_PRESET:-agillm4_floor}"
-BLOCK="${AGILLM4_4090_BLOCK:-512}"
 TOKEN_PARAM_RATIO="${AGILLM4_4090_TOKEN_PARAM_RATIO:-100}"
 SAVE_DIR="${AGILLM4_4090_SAVE_DIR:-/workspace/agillm4_4090_ckpts}"
@@ -41,10 +41,10 @@ exec python -u /workspace/agillm-4/nB300_agillm4.py train \
   --nat_every "${AGILLM4_4090_NAT_EVERY:-4}" \
   --nat_loss_weight "${AGILLM4_4090_NAT_LOSS_WEIGHT:-1.0}" \
   --nat_expand "${AGILLM4_4090_NAT_EXPAND:-2}" \
-  --nat_max_tokens "${AGILLM4_4090_NAT_MAX_TOKENS:-512}" \
   --token_param_ratio "$TOKEN_PARAM_RATIO" \
   --save_dir "$SAVE_DIR" \
-  --save_every_sec "${AGILLM4_4090_SAVE_EVERY_SEC:-21600}" \
   --delta_every_steps "${AGILLM4_4090_DELTA_EVERY_STEPS:-25000}" \
-  --delta_max_keep "${AGILLM4_4090_DELTA_MAX_KEEP:-8}" \
-  --max_ckpts "${AGILLM4_4090_MAX_CKPTS:-3}"

 fi
 PRESET="${AGILLM4_4090_PRESET:-agillm4_floor}"
+BLOCK="${AGILLM4_4090_BLOCK:-1280}"
 TOKEN_PARAM_RATIO="${AGILLM4_4090_TOKEN_PARAM_RATIO:-100}"
 SAVE_DIR="${AGILLM4_4090_SAVE_DIR:-/workspace/agillm4_4090_ckpts}"
   --nat_every "${AGILLM4_4090_NAT_EVERY:-4}" \
   --nat_loss_weight "${AGILLM4_4090_NAT_LOSS_WEIGHT:-1.0}" \
   --nat_expand "${AGILLM4_4090_NAT_EXPAND:-2}" \
+  --nat_max_tokens "${AGILLM4_4090_NAT_MAX_TOKENS:-768}" \
   --token_param_ratio "$TOKEN_PARAM_RATIO" \
   --save_dir "$SAVE_DIR" \
+  --save_every_sec "${AGILLM4_4090_SAVE_EVERY_SEC:-86400}" \
   --delta_every_steps "${AGILLM4_4090_DELTA_EVERY_STEPS:-25000}" \
+  --delta_max_keep "${AGILLM4_4090_DELTA_MAX_KEEP:-1}" \
+  --max_ckpts "${AGILLM4_4090_MAX_CKPTS:-1}"

upload_agillm4_checkpoints.py ADDED

	@@ -0,0 +1,214 @@

+#!/usr/bin/env python3
+from __future__ import annotations
+import argparse
+import hashlib
+import json
+import os
+import shutil
+import subprocess
+import sys
+import time
+from datetime import datetime, timezone
+from pathlib import Path
+from typing import Any
+from huggingface_hub import HfApi
+def iso_now() -> str:
+    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
+def stamp_now() -> str:
+    return datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
+def load_json(path: Path, default: Any) -> Any:
+    try:
+        return json.loads(path.read_text(encoding="utf-8"))
+    except Exception:
+        return default
+def save_json(path: Path, data: Any) -> None:
+    path.parent.mkdir(parents=True, exist_ok=True)
+    path.write_text(json.dumps(data, indent=2, sort_keys=True), encoding="utf-8")
+def sha256_file(path: Path) -> str:
+    h = hashlib.sha256()
+    with path.open("rb") as handle:
+        for chunk in iter(lambda: handle.read(1 << 22), b""):
+            h.update(chunk)
+    return h.hexdigest()
+def upload_file(api: HfApi, repo_id: str, local_path: Path, remote_path: str, message: str) -> None:
+    api.upload_file(
+        repo_id=repo_id,
+        path_or_fileobj=str(local_path),
+        path_in_repo=remote_path,
+        commit_message=message,
+    )
+def delete_remote_not_kept(api: HfApi, repo_id: str, remote_dir: str, keep_basenames: set[str]) -> list[str]:
+    deleted: list[str] = []
+    try:
+        files = api.list_repo_files(repo_id=repo_id)
+    except Exception as exc:
+        print(f"[upload] WARN list_repo_files failed: {exc}", flush=True)
+        return deleted
+    prefix = remote_dir.rstrip("/") + "/"
+    victims = []
+    for file_path in files:
+        if not file_path.startswith(prefix):
+            continue
+        name = Path(file_path).name
+        base = name[:-7] if name.endswith(".sha256") else name
+        if base not in keep_basenames:
+            victims.append(file_path)
+    if victims:
+        try:
+            api.delete_files(repo_id=repo_id, paths=victims, commit_message=f"Prune AGILLM4 uploads under {remote_dir}")
+            deleted.extend(victims)
+        except Exception as exc:
+            print(f"[upload] WARN delete_files failed for {len(victims)} files: {exc}", flush=True)
+    return deleted
+def latest_file(glob_root: Path, pattern: str) -> Path | None:
+    files = [p for p in glob_root.glob(pattern) if p.is_file()]
+    return max(files, key=lambda p: p.stat().st_mtime) if files else None
+def status_json(script: Path, log: Path, save_dir: Path) -> dict[str, Any]:
+    result = subprocess.run(
+        [sys.executable, "-u", str(script), "status", "--json", "--log", str(log), "--save_dir", str(save_dir)],
+        capture_output=True,
+        text=True,
+        timeout=60,
+        check=False,
+    )
+    if result.returncode != 0:
+        return {"checked_at": iso_now(), "error": result.stderr.strip() or result.stdout.strip()}
+    try:
+        return json.loads(result.stdout)
+    except Exception:
+        return {"checked_at": iso_now(), "error": "failed to parse status json", "raw": result.stdout[-4000:]}
+def write_tail(src: Path, dst: Path, lines: int) -> None:
+    dst.parent.mkdir(parents=True, exist_ok=True)
+    if not src.exists():
+        dst.write_text("", encoding="utf-8")
+        return
+    result = subprocess.run(["tail", "-n", str(lines), str(src)], capture_output=True, text=True, encoding="utf-8", errors="replace", check=False)
+    dst.write_text(result.stdout, encoding="utf-8", errors="replace")
+def maybe_upload_large(
+    api: HfApi,
+    repo_id: str,
+    state: dict[str, Any],
+    kind: str,
+    path: Path | None,
+    remote_dir: str,
+    interval_sec: int,
+    keep: int,
+) -> bool:
+    if path is None or not path.exists():
+        print(f"[upload] no {kind} checkpoint yet", flush=True)
+        return False
+    now = time.time()
+    last_t = float(state.get(f"last_{kind}_upload_time", 0) or 0)
+    identity = f"{path.name}:{path.stat().st_size}:{int(path.stat().st_mtime)}"
+    if state.get(f"last_{kind}_identity") == identity:
+        print(f"[upload] {kind} unchanged: {path.name}", flush=True)
+        return False
+    if last_t and now - last_t < interval_sec:
+        remaining = int(interval_sec - (now - last_t))
+        print(f"[upload] {kind} interval not due for {remaining}s: {path.name}", flush=True)
+        return False
+    digest = sha256_file(path)
+    sha_path = path.with_suffix(path.suffix + ".upload.sha256")
+    sha_path.write_text(f"{digest}  {path.name}\n", encoding="utf-8")
+    remote_name = f"{stamp_now()}_{path.name}"
+    remote_path = f"{remote_dir.rstrip('/')}/{remote_name}"
+    print(f"[upload] uploading {kind}: {path} -> {repo_id}/{remote_path}", flush=True)
+    upload_file(api, repo_id, path, remote_path, f"Upload AGILLM4 {kind} checkpoint {path.name}")
+    upload_file(api, repo_id, sha_path, remote_path + ".sha256", f"Upload AGILLM4 {kind} checksum {path.name}")
+    history = list(state.get(f"{kind}_uploads", []))
+    history.append({"name": remote_name, "remote_path": remote_path, "sha256": digest, "uploaded_at": iso_now(), "size": path.stat().st_size})
+    history = history[-max(1, keep):]
+    state[f"{kind}_uploads"] = history
+    state[f"last_{kind}_upload_time"] = now
+    state[f"last_{kind}_identity"] = identity
+    keep_names = {item["name"] for item in history}
+    deleted = delete_remote_not_kept(api, repo_id, remote_dir, keep_names)
+    if deleted:
+        print(f"[upload] pruned {len(deleted)} remote {kind} files", flush=True)
+    return True
+def main() -> int:
+    parser = argparse.ArgumentParser(description="Bounded AGILLM4 checkpoint uploader")
+    parser.add_argument("--repo", default=os.environ.get("AGILLM4_UPLOAD_REPO", "OpenTransformer/AGILLM-4"))
+    parser.add_argument("--prefix", default=os.environ.get("AGILLM4_UPLOAD_PREFIX", "training/agillm4_floor_v47"))
+    parser.add_argument("--save-dir", type=Path, default=Path(os.environ.get("AGILLM4_UPLOAD_SAVE_DIR", "/workspace/agillm4_4090_ckpts")))
+    parser.add_argument("--log", type=Path, default=Path(os.environ.get("AGILLM4_UPLOAD_LOG", "/workspace/agillm4_floor_train.log")))
+    parser.add_argument("--script", type=Path, default=Path(os.environ.get("AGILLM4_UPLOAD_SCRIPT", "/workspace/agillm-4/nB300_agillm4.py")))
+    parser.add_argument("--state", type=Path, default=Path(os.environ.get("AGILLM4_UPLOAD_STATE", "/workspace/agillm4_upload_state.json")))
+    parser.add_argument("--stage", type=Path, default=Path(os.environ.get("AGILLM4_UPLOAD_STAGE", "/workspace/agillm4_upload_stage")))
+    parser.add_argument("--full-interval-sec", type=int, default=int(os.environ.get("AGILLM4_UPLOAD_FULL_INTERVAL_SEC", str(7 * 24 * 3600))))
+    parser.add_argument("--delta-interval-sec", type=int, default=int(os.environ.get("AGILLM4_UPLOAD_DELTA_INTERVAL_SEC", str(24 * 3600))))
+    parser.add_argument("--keep-full", type=int, default=int(os.environ.get("AGILLM4_UPLOAD_KEEP_FULL", "2")))
+    parser.add_argument("--keep-delta", type=int, default=int(os.environ.get("AGILLM4_UPLOAD_KEEP_DELTA", "2")))
+    parser.add_argument("--tail-lines", type=int, default=int(os.environ.get("AGILLM4_UPLOAD_TAIL_LINES", "5000")))
+    args = parser.parse_args()
+    api = HfApi()
+    prefix = args.prefix.strip("/")
+    args.stage.mkdir(parents=True, exist_ok=True)
+    state = load_json(args.state, {})
+    status = status_json(args.script, args.log, args.save_dir)
+    status["upload_policy"] = {
+        "full_interval_sec": args.full_interval_sec,
+        "delta_interval_sec": args.delta_interval_sec,
+        "keep_full_current_files": args.keep_full,
+        "keep_delta_current_files": args.keep_delta,
+        "note": "Small status/log tail uploads are frequent; multi-GB deltas/full checkpoints are rate-limited for HF public storage.",
+    }
+    status_path = args.stage / "status.json"
+    save_json(status_path, status)
+    upload_file(api, args.repo, status_path, f"{prefix}/status/status.json", "Update AGILLM4 training status")
+    tail_path = args.stage / "train_tail.log"
+    write_tail(args.log, tail_path, args.tail_lines)
+    upload_file(api, args.repo, tail_path, f"{prefix}/logs/train_tail.log", "Update AGILLM4 training log tail")
+    latest_json = args.save_dir / "latest.json"
+    if latest_json.exists():
+        shutil.copy2(latest_json, args.stage / "latest.json")
+        upload_file(api, args.repo, args.stage / "latest.json", f"{prefix}/status/latest.json", "Update AGILLM4 latest checkpoint metadata")
+    newest_delta = latest_file(args.save_dir, "*_delta_step*.pt")
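+    # "*_step*.pt" also matches "*_delta_step*.pt", so exclude deltas when picking the newest full checkpoint.
+    full_candidates = [p for p in args.save_dir.glob("*_step*.pt") if p.is_file() and "_delta_step" not in p.name]
+    newest_full = max(full_candidates, key=lambda p: p.stat().st_mtime, default=None)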
+    maybe_upload_large(api, args.repo, state, "delta", newest_delta, f"{prefix}/checkpoints/deltas", args.delta_interval_sec, args.keep_delta)
+    maybe_upload_large(api, args.repo, state, "full", newest_full, f"{prefix}/checkpoints/full", args.full_interval_sec, args.keep_full)
+    state["last_status_upload_at"] = iso_now()
+    save_json(args.state, state)
+    manifest_path = args.stage / "upload_state.json"
+    save_json(manifest_path, state)
+    upload_file(api, args.repo, manifest_path, f"{prefix}/status/upload_state.json", "Update AGILLM4 upload state")
+    print(f"[upload] done {iso_now()} repo={args.repo} prefix={prefix}", flush=True)
+    return 0
+if __name__ == "__main__":
+    raise SystemExit(main())
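
The remote pruning rule in `delete_remote_not_kept` can be exercised without network access by separating the selection step from the API call. The sketch below is an assumption-labelled rewrite of that selection logic as a pure function. `select_prune_victims` and the sample paths are hypothetical and are not part of the uploader.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def select_prune_victims(files: list[str], remote_dir: str,
                         keep_basenames: set[str]) -> list[str]:
    """Return repo paths under `remote_dir` whose checkpoint is not kept.

    A `.sha256` sidecar is kept or pruned together with its checkpoint.
    """
    prefix = remote_dir.rstrip("/") + "/"
    victims = []
    for file_path in files:
        if not file_path.startswith(prefix):
            continue                                  # outside the pruned directory
        name = PurePosixPath(file_path).name
        base = name[: -len(".sha256")] if name.endswith(".sha256") else name
        if base not in keep_basenames:
            victims.append(file_path)
    return victims


# Hypothetical repository listing.
files = [
    "training/x/checkpoints/full/20240101T000000Z_a_step1.pt",
    "training/x/checkpoints/full/20240101T000000Z_a_step1.pt.sha256",
    "training/x/checkpoints/full/20240108T000000Z_a_step2.pt",
    "training/x/checkpoints/full/20240108T000000Z_a_step2.pt.sha256",
    "training/x/status/status.json",
]
keep = {"20240108T000000Z_a_step2.pt"}
victims = select_prune_victims(files, "training/x/checkpoints/full", keep)
print(victims)
```

Only the older checkpoint and its checksum are selected; the status file is untouched because it lives outside the pruned prefix.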

upload_agillm4_checkpoints_loop.sh ADDED

	@@ -0,0 +1,23 @@

+#!/usr/bin/env bash
+set -Eeuo pipefail
+cd /workspace/agillm-4
+LOG="${AGILLM4_UPLOAD_LOOP_LOG:-/workspace/agillm4_upload_loop.log}"
+INTERVAL="${AGILLM4_UPLOAD_LOOP_INTERVAL_SEC:-1800}"
+if [ -f /root/.cache/huggingface/token ]; then
+  export HF_TOKEN="$(tr -d '\r\n' < /root/.cache/huggingface/token)"
+  export HUGGING_FACE_HUB_TOKEN="$HF_TOKEN"
+fi
+mkdir -p "$(dirname "$LOG")"
+exec >> "$LOG" 2>&1
+echo "START_AGILLM4_UPLOAD_LOOP $(date -u +%Y-%m-%dT%H:%M:%SZ) interval=${INTERVAL}s"
+while true; do
+  echo "UPLOAD_TICK $(date -u +%Y-%m-%dT%H:%M:%SZ)"
+  python -u /workspace/agillm-4/upload_agillm4_checkpoints.py || true
+  sleep "$INTERVAL"
+done