Add Codex Colab training workflow

Files changed (10)

AGENTS.md +60 -0
README.md +11 -3
colab/README.md +75 -0
colab/configs/dmhy_char_train.json +42 -0
colab/configs/dmhy_regex_finetune.json +42 -0
colab/start_worker.ipynb +45 -0
colab_client.py +184 -0
colab_train.py +526 -122
colab_worker.py +446 -0
train.py +165 -17

AGENTS.md CHANGED

@@ -67,6 +67,66 @@ Export for Android:
 python export_onnx.py --model-dir checkpoints/dmhy-finetune/final --android-assets-dir ../../scraper/src/main/assets/anime_parser
 ```
+## Codex-Controlled Colab Training
+Free Colab cannot be treated as an always-on remote machine. Use it as a
+short-lived GPU worker only after the user manually opens a Colab runtime and
+starts the worker cell. Do not assume Codex can wake Colab by itself.
+Before relying on the Colab flow, make sure the Colab helper files have been
+pushed to the Hugging Face model repo, or the user has uploaded them manually:
+`colab_worker.py`, `colab_client.py`, `colab_train.py`, and `colab/`.
+Ask the user to start a Colab GPU runtime with:
+```python
+from google.colab import drive
+drive.mount("/content/drive")
+!git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT /content/AniFileBERT || true
+%cd /content/AniFileBERT
+!git pull --ff-only || true
+!git submodule update --init --recursive
+!python colab_worker.py
+```
+The worker prints `COLAB_WORKER_URL=...` and `COLAB_WORKER_TOKEN=...`. After
+the user provides those values, set them for local commands:
+```powershell
+$env:ANIFILEBERT_COLAB_URL="https://...trycloudflare.com"
+$env:ANIFILEBERT_COLAB_TOKEN="..."
+python colab_client.py health
+```
+Submit the default regex fine-tune:
+```powershell
+python colab_client.py submit --profile dmhy_regex_finetune --wait
+```
+Submit the character tokenizer run only when intentional:
+```powershell
+python colab_client.py submit --profile dmhy_char_train --wait
+```
+Useful follow-up commands:
+```powershell
+python colab_client.py jobs
+python colab_client.py status <job-id>
+python colab_client.py logs <job-id> --tail 200
+python colab_client.py manifest <job-id>
+python colab_client.py cancel <job-id>
+```
+The default Colab profiles save checkpoints to Google Drive every 1000 steps
+and resume with `resume_from_checkpoint: "auto"`, so if free Colab disconnects,
+ask the user to restart the worker and submit the same profile again. Artifacts
+land under `MyDrive/AniFileBERT/checkpoints/<profile-name>/`, and worker logs
+land under `MyDrive/AniFileBERT/worker/jobs/<job-id>/`.
 ## Validation Expectations
 - For parser or tokenizer changes, run `python inference.py --model-dir . ...`
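
The hand-off described in the new AGENTS.md section (the worker prints `COLAB_WORKER_URL=...` and `COLAB_WORKER_TOKEN=...`, and the user passes them on) could be scripted roughly as follows. This is a minimal sketch, not code from this commit: `parse_worker_output` and `export_for_client` are hypothetical helpers, and it assumes each value is printed on its own line without embedded whitespace. Only the two printed keys and the two `ANIFILEBERT_COLAB_*` variable names come from the diff.

```python
from __future__ import annotations

import os
import re
from collections.abc import MutableMapping

# Hypothetical helper, not part of the repository: matches the two lines the
# worker cell is documented to print.
_WORKER_LINE = re.compile(r"^(COLAB_WORKER_URL|COLAB_WORKER_TOKEN)=(\S+)$")


def parse_worker_output(text: str) -> dict[str, str]:
    """Collect the URL and token from pasted Colab cell output."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        match = _WORKER_LINE.match(line.strip())
        if match:
            values[match.group(1)] = match.group(2)
    missing = {"COLAB_WORKER_URL", "COLAB_WORKER_TOKEN"} - values.keys()
    if missing:
        raise ValueError(f"Missing worker values: {sorted(missing)}")
    return values


def export_for_client(
    values: dict[str, str],
    environ: MutableMapping[str, str] | None = None,
) -> None:
    """Store the values under the names colab_client.py reads."""
    env = os.environ if environ is None else environ
    env["ANIFILEBERT_COLAB_URL"] = values["COLAB_WORKER_URL"].rstrip("/")
    env["ANIFILEBERT_COLAB_TOKEN"] = values["COLAB_WORKER_TOKEN"]
```

Passing a plain `dict` as `environ` keeps the helper testable without touching the real process environment.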

README.md CHANGED

@@ -199,9 +199,17 @@ python export_onnx.py --model-dir checkpoints/dmhy-finetune/final --output expor
 ## Google Colab Training
-Upload and run [`colab_train.py`](colab_train.py) in a Colab GPU runtime.
-It will mount Google Drive, clone both repos, install dependencies, and run
-the full training pipeline. Checkpoints are saved to your Drive automatically.
+For Codex-controlled short Colab sessions, see [`colab/README.md`](colab/README.md).
+Free Colab still has to be started manually, but once `colab_worker.py` is
+running Codex can submit jobs through `colab_client.py`, tail logs, and inspect
+status. Checkpoints live on Google Drive and default profiles resume from the
+latest checkpoint automatically.
+Manual one-shot runs are also supported:
+```bash
+python colab_train.py --profile dmhy_regex_finetune
+```
 ## Repository Layout

colab/README.md ADDED

	@@ -0,0 +1,75 @@

+# Codex + Colab Training
+Free Colab cannot be used as an always-on remote machine. The practical setup is:
+1. Open a Colab GPU runtime when you want to train.
+2. Start the lightweight worker in one cell.
+3. Give Codex the printed worker URL and token.
+4. Codex submits jobs while that Colab session is alive.
+5. Checkpoints and manifests stay on Google Drive, so the next session can resume.
+## Start a Colab Session
+Run this in a Colab code cell:
+```python
+from google.colab import drive
+drive.mount("/content/drive")
+!git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT /content/AniFileBERT || true
+%cd /content/AniFileBERT
+!git pull --ff-only || true
+!git submodule update --init --recursive
+!python colab_worker.py
+```
+The cell prints:
+```text
+COLAB_WORKER_URL=https://...trycloudflare.com
+COLAB_WORKER_TOKEN=...
+```
+Keep that cell running. If Colab disconnects, start it again; default profiles
+save every 1000 steps and resume from the latest Drive checkpoint because they
+use `checkpoint_steps: 1000` and `resume_from_checkpoint: "auto"`.
+## Let Codex Submit a Job
+On the local machine:
+```powershell
+$env:ANIFILEBERT_COLAB_URL="https://...trycloudflare.com"
+$env:ANIFILEBERT_COLAB_TOKEN="..."
+python colab_client.py health
+python colab_client.py submit --profile dmhy_regex_finetune --wait
+```
+Codex can run the same commands from this repository after you provide the URL
+and token.
+## Profiles
+- `colab/configs/dmhy_regex_finetune.json`: default regex tokenizer fine-tune
+  from the published root checkpoint.
+- `colab/configs/dmhy_char_train.json`: character tokenizer training from
+  scratch.
+You can submit a local edited profile instead of a remote profile:
+```powershell
+python colab_client.py submit --config colab/configs/dmhy_regex_finetune.json --wait
+```
+The worker writes per-job logs under:
+```text
+MyDrive/AniFileBERT/worker/jobs/<job-id>/
+```
+The training runner writes:
+```text
+MyDrive/AniFileBERT/checkpoints/<profile-name>/
+MyDrive/AniFileBERT/last_run_manifest.json
+```
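
The README relies on `resume_from_checkpoint: "auto"` picking up the latest checkpoint on Drive. The resolution logic lives in `train.py`, which is not shown in this part of the diff, so the following is only a sketch of what such a lookup might look like. It assumes checkpoints are saved as `checkpoint-<step>` directories (the Hugging Face Trainer convention); `find_latest_checkpoint` is a hypothetical name.

```python
from __future__ import annotations

import re
from pathlib import Path

# Assumption: checkpoint directories are named "checkpoint-<step>".
_CHECKPOINT_DIR = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(save_dir: str | Path) -> Path | None:
    """Return the checkpoint directory with the highest step, or None."""
    root = Path(save_dir)
    if not root.is_dir():
        return None
    best: tuple[int, Path] | None = None
    for child in root.iterdir():
        match = _CHECKPOINT_DIR.match(child.name)
        if match and child.is_dir():
            step = int(match.group(1))
            # Compare steps numerically so checkpoint-1000 beats checkpoint-900.
            if best is None or step > best[0]:
                best = (step, child)
    return best[1] if best else None
```

Returning `None` when nothing is found lets the caller fall back to a fresh run, which is the behaviour one would expect on the very first session.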

colab/configs/dmhy_char_train.json ADDED

	@@ -0,0 +1,42 @@

+{
+  "name": "dmhy-char-train",
+  "repo_url": "https://huggingface.co/ModerRAS/AniFileBERT",
+  "repo_ref": "main",
+  "repo_dir": "/content/AniFileBERT",
+  "drive_root": "/content/drive/MyDrive/AniFileBERT",
+  "mount_drive": true,
+  "pull": true,
+  "install": {
+    "requirements": true,
+    "git_lfs": true,
+    "extra_packages": []
+  },
+  "training": {
+    "tokenizer": "char",
+    "data_file": "datasets/AnimeName/dmhy_weak_char.jsonl",
+    "vocab_file": "datasets/AnimeName/vocab.char.json",
+    "save_dir": "{drive_root}/checkpoints/{name}",
+    "init_model_dir": null,
+    "epochs": 1,
+    "batch_size": 128,
+    "learning_rate": 0.0003,
+    "warmup_steps": 300,
+    "train_split": 0.9,
+    "max_seq_length": 128,
+    "seed": 42,
+    "resume_from_checkpoint": "auto",
+    "checkpoint_steps": 1000,
+    "save_total_limit": 3
+  },
+  "export": {
+    "enabled": true,
+    "required": false,
+    "output": "{save_dir}/exports/anime_filename_parser.onnx",
+    "max_length": "{max_seq_length}"
+  },
+  "smoke": {
+    "enabled": true,
+    "required": true,
+    "sample": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"
+  }
+}
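
This profile omits several keys that `DEFAULT_CONFIG` in `colab_train.py` defines (for example `no_shuffle` and the whole `artifacts` section). `load_config` fills them in with a recursive merge. The sketch below reproduces the `deep_merge` function from this commit with heavily trimmed, illustrative dictionaries to show the effect, including how `"init_model_dir": null` replaces the default `"."`.

```python
import copy
from collections.abc import Mapping


def deep_merge(base, override):
    """Recursively overlay `override` on `base` without mutating either."""
    merged = copy.deepcopy(dict(base))
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Trimmed stand-ins for DEFAULT_CONFIG and the char profile.
defaults = {
    "training": {
        "tokenizer": "regex",
        "init_model_dir": ".",
        "max_seq_length": 64,
        "no_shuffle": False,
    },
    "artifacts": {"latest_manifest": "{drive_root}/last_run_manifest.json"},
}
char_profile = {
    "training": {"tokenizer": "char", "init_model_dir": None, "max_seq_length": 128},
}

config = deep_merge(defaults, char_profile)
print(config["training"])
```

Because nested mappings are merged key by key, a profile only needs to list the values it changes; sibling keys such as `no_shuffle` survive from the defaults.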

colab/configs/dmhy_regex_finetune.json ADDED

	@@ -0,0 +1,42 @@

+{
+  "name": "dmhy-regex-finetune",
+  "repo_url": "https://huggingface.co/ModerRAS/AniFileBERT",
+  "repo_ref": "main",
+  "repo_dir": "/content/AniFileBERT",
+  "drive_root": "/content/drive/MyDrive/AniFileBERT",
+  "mount_drive": true,
+  "pull": true,
+  "install": {
+    "requirements": true,
+    "git_lfs": true,
+    "extra_packages": []
+  },
+  "training": {
+    "tokenizer": "regex",
+    "data_file": "datasets/AnimeName/dmhy_weak.jsonl",
+    "vocab_file": "datasets/AnimeName/vocab.json",
+    "save_dir": "{drive_root}/checkpoints/{name}",
+    "init_model_dir": ".",
+    "epochs": 1,
+    "batch_size": 128,
+    "learning_rate": 0.0003,
+    "warmup_steps": 300,
+    "train_split": 0.9,
+    "max_seq_length": 64,
+    "seed": 42,
+    "resume_from_checkpoint": "auto",
+    "checkpoint_steps": 1000,
+    "save_total_limit": 3
+  },
+  "export": {
+    "enabled": true,
+    "required": false,
+    "output": "{save_dir}/exports/anime_filename_parser.onnx",
+    "max_length": "{max_seq_length}"
+  },
+  "smoke": {
+    "enabled": true,
+    "required": true,
+    "sample": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"
+  }
+}
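
The `{drive_root}`, `{name}`, `{save_dir}`, and `{max_seq_length}` placeholders in this profile are expanded by `render_templates` and `resolve_config` in `colab_train.py`. The sketch below follows the same `SafeFormatDict` approach from this commit but simplifies the resolution to two passes over a trimmed copy of the profile, so treat it as an illustration rather than the exact runner logic.

```python
class SafeFormatDict(dict):
    """Leave unknown placeholders untouched instead of raising KeyError."""

    def __missing__(self, key):
        return "{" + key + "}"


def render_templates(value, context):
    if isinstance(value, str):
        return value.format_map(SafeFormatDict(context))
    if isinstance(value, list):
        return [render_templates(item, context) for item in value]
    if isinstance(value, dict):
        return {key: render_templates(item, context) for key, item in value.items()}
    return value


# Trimmed copy of the regex profile.
profile = {
    "name": "dmhy-regex-finetune",
    "drive_root": "/content/drive/MyDrive/AniFileBERT",
    "training": {"save_dir": "{drive_root}/checkpoints/{name}", "max_seq_length": 64},
    "export": {
        "output": "{save_dir}/exports/anime_filename_parser.onnx",
        "max_length": "{max_seq_length}",
    },
}

# Pass 1: training values only need the top-level keys.
context = {"name": profile["name"], "drive_root": profile["drive_root"]}
training = render_templates(profile["training"], context)
# Pass 2: export values may refer to the resolved training values.
context.update(training)
export = render_templates(profile["export"], context)
print(training["save_dir"])
print(export)
```

Two details follow from this: the checkpoint directory is named after the profile's `name` field (`dmhy-regex-finetune`, with hyphens) rather than the `dmhy_regex_finetune` file stem, and `max_length` comes out as the string `"64"`, so whatever consumes it presumably needs to accept or convert a string.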

colab/start_worker.ipynb ADDED

	@@ -0,0 +1,45 @@

+{
+  "nbformat": 4,
+  "nbformat_minor": 5,
+  "metadata": {
+    "colab": {
+      "provenance": [],
+      "gpuType": "T4"
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    },
+    "accelerator": "GPU"
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "# AniFileBERT Colab Worker\n",
+        "\n",
+        "Run the next cell in a GPU runtime. Keep it running while Codex submits training jobs. If free Colab disconnects, open the notebook again and rerun the cell; default profiles resume from the latest Drive checkpoint."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "from google.colab import drive\n",
+        "drive.mount('/content/drive')\n",
+        "\n",
+        "!git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT /content/AniFileBERT || true\n",
+        "%cd /content/AniFileBERT\n",
+        "!git pull --ff-only || true\n",
+        "!git submodule update --init --recursive\n",
+        "!python colab_worker.py\n"
+      ]
+    }
+  ]
+}

colab_client.py ADDED

	@@ -0,0 +1,184 @@

+# -*- coding: utf-8 -*-
+"""Local client for controlling an active AniFileBERT Colab worker.
+The worker still has to be started manually in Colab, but once it prints a
+public URL and token this client lets Codex submit training jobs, tail logs, and
+inspect status from the local workspace.
+"""
+from __future__ import annotations
+import argparse
+import json
+import os
+from pathlib import Path
+import sys
+import time
+from typing import Any
+import urllib.error
+import urllib.parse
+import urllib.request
+TERMINAL_STATES = {"success", "failed", "cancelled"}
+def load_json(path: str) -> Any:
+    return json.loads(Path(path).read_text(encoding="utf-8"))
+class ColabClient:
+    def __init__(self, base_url: str, token: str, timeout: int = 30):
+        self.base_url = base_url.rstrip("/")
+        self.token = token
+        self.timeout = timeout
+    def request(self, method: str, path: str, payload: Any | None = None) -> Any:
+        url = self.base_url + path
+        data = None
+        headers = {"Authorization": f"Bearer {self.token}"}
+        if payload is not None:
+            data = json.dumps(payload, ensure_ascii=False).encode("utf-8")
+            headers["Content-Type"] = "application/json; charset=utf-8"
+        req = urllib.request.Request(url, data=data, headers=headers, method=method)
+        try:
+            with urllib.request.urlopen(req, timeout=self.timeout) as response:
+                return json.loads(response.read().decode("utf-8"))
+        except urllib.error.HTTPError as exc:
+            body = exc.read().decode("utf-8", errors="replace")
+            raise RuntimeError(f"{method} {url} failed: HTTP {exc.code}: {body}") from exc
+    def health(self) -> Any:
+        return self.request("GET", "/health")
+    def submit(self, payload: dict[str, Any]) -> Any:
+        return self.request("POST", "/jobs", payload)
+    def jobs(self) -> Any:
+        return self.request("GET", "/jobs")
+    def status(self, job_id: str) -> Any:
+        return self.request("GET", f"/jobs/{job_id}")
+    def logs(self, job_id: str, tail: int) -> Any:
+        query = urllib.parse.urlencode({"tail": tail})
+        return self.request("GET", f"/jobs/{job_id}/logs?{query}")
+    def manifest(self, job_id: str) -> Any:
+        return self.request("GET", f"/jobs/{job_id}/manifest")
+    def cancel(self, job_id: str) -> Any:
+        return self.request("POST", f"/jobs/{job_id}/cancel", {})
+def print_json(data: Any) -> None:
+    print(json.dumps(data, ensure_ascii=False, indent=2))
+def require_connection(args: argparse.Namespace) -> ColabClient:
+    url = args.url or os.environ.get("ANIFILEBERT_COLAB_URL")
+    token = args.token or os.environ.get("ANIFILEBERT_COLAB_TOKEN")
+    if not url or not token:
+        raise SystemExit(
+            "Set ANIFILEBERT_COLAB_URL and ANIFILEBERT_COLAB_TOKEN, "
+            "or pass --url and --token."
+        )
+    return ColabClient(url, token, timeout=args.timeout)
+def build_submit_payload(args: argparse.Namespace) -> dict[str, Any]:
+    payload: dict[str, Any] = {}
+    if args.config:
+        payload["config"] = load_json(args.config)
+    if args.profile:
+        payload["profile"] = args.profile
+    extra_args = list(args.args or []) + list(args.extra_args or [])
+    if extra_args:
+        payload["args"] = extra_args
+    if not payload:
+        payload["profile"] = "dmhy_regex_finetune"
+    return payload
+def wait_for_job(client: ColabClient, job_id: str, poll: int, tail: int) -> dict[str, Any]:
+    last_status = None
+    while True:
+        status = client.status(job_id)
+        if status.get("status") != last_status:
+            print_json(status)
+            last_status = status.get("status")
+        logs = client.logs(job_id, tail=tail)
+        log_text = logs.get("log", "")
+        if log_text:
+            print("\n--- log tail ---")
+            print(log_text.rstrip())
+        if status.get("status") in TERMINAL_STATES:
+            return status
+        time.sleep(poll)
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Control an active AniFileBERT Colab worker")
+    parser.add_argument("--url", help="Worker URL, or ANIFILEBERT_COLAB_URL")
+    parser.add_argument("--token", help="Worker token, or ANIFILEBERT_COLAB_TOKEN")
+    parser.add_argument("--timeout", type=int, default=30)
+    subparsers = parser.add_subparsers(dest="command", required=True)
+    subparsers.add_parser("health", help="Check worker health")
+    subparsers.add_parser("jobs", help="List known jobs")
+    submit = subparsers.add_parser("submit", help="Submit a training job")
+    submit.add_argument("--config", help="Local JSON config to send to the worker")
+    submit.add_argument("--profile", help="Remote profile name under colab/configs")
+    submit.add_argument("--arg", dest="args", action="append", default=[], help="Extra arg for colab_train.py")
+    submit.add_argument("--wait", action="store_true", help="Poll until the job finishes")
+    submit.add_argument("--poll", type=int, default=60, help="Polling interval in seconds")
+    submit.add_argument("--tail", type=int, default=80, help="Log lines to show while waiting")
+    submit.add_argument("extra_args", nargs=argparse.REMAINDER,
+                        help="Arguments after -- are passed to colab_train.py")
+    status = subparsers.add_parser("status", help="Show job status")
+    status.add_argument("job_id")
+    logs = subparsers.add_parser("logs", help="Show job logs")
+    logs.add_argument("job_id")
+    logs.add_argument("--tail", type=int, default=200)
+    manifest = subparsers.add_parser("manifest", help="Show job manifest")
+    manifest.add_argument("job_id")
+    cancel = subparsers.add_parser("cancel", help="Cancel a running job")
+    cancel.add_argument("job_id")
+    return parser.parse_args()
+def main() -> None:
+    args = parse_args()
+    client = require_connection(args)
+    if args.command == "health":
+        print_json(client.health())
+    elif args.command == "jobs":
+        print_json(client.jobs())
+    elif args.command == "submit":
+        job = client.submit(build_submit_payload(args))
+        print_json(job)
+        if args.wait:
+            final_status = wait_for_job(client, job["job_id"], poll=args.poll, tail=args.tail)
+            if final_status.get("status") != "success":
+                sys.exit(1)
+    elif args.command == "status":
+        print_json(client.status(args.job_id))
+    elif args.command == "logs":
+        print(client.logs(args.job_id, args.tail).get("log", ""), end="")
+    elif args.command == "manifest":
+        print_json(client.manifest(args.job_id))
+    elif args.command == "cancel":
+        print_json(client.cancel(args.job_id))
+if __name__ == "__main__":
+    main()
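
`ColabClient.request` sends every call with an `Authorization: Bearer <token>` header and expects a JSON body back. To exercise that contract without a live Colab session, a local stand-in server can help. The following is a hypothetical stub, not `colab_worker.py` (whose implementation is not shown in this part of the diff): the `{"ok": true}` health payload and the 401/404 responses are assumptions made for illustration.

```python
from __future__ import annotations

import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_handler(token: str) -> type[BaseHTTPRequestHandler]:
    class StubWorkerHandler(BaseHTTPRequestHandler):
        def _send_json(self, code: int, payload: dict) -> None:
            body = json.dumps(payload).encode("utf-8")
            self.send_response(code)
            self.send_header("Content-Type", "application/json; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def do_GET(self) -> None:
            # Reject anything without the expected bearer token.
            if self.headers.get("Authorization") != f"Bearer {token}":
                self._send_json(401, {"error": "unauthorized"})
            elif self.path == "/health":
                self._send_json(200, {"ok": True})  # assumed response shape
            else:
                self._send_json(404, {"error": "not found"})

        def log_message(self, format: str, *args: object) -> None:
            pass  # keep the stub quiet

    return StubWorkerHandler


def start_stub_worker(token: str) -> tuple[ThreadingHTTPServer, str]:
    """Serve on an ephemeral localhost port in a background thread."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), make_handler(token))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address[:2]
    return server, f"http://{host}:{port}"


def get_json(base_url: str, path: str, token: str, timeout: int = 5):
    """GET with a bearer token, mirroring the header ColabClient sends."""
    req = urllib.request.Request(
        base_url.rstrip("/") + path,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )
    # Bypass any configured HTTP proxy, since the stub lives on localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(req, timeout=timeout) as response:
            return response.status, json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as exc:
        return exc.code, json.loads(exc.read().decode("utf-8"))
```

With the stub running, `python colab_client.py --url <stub-url> --token <token> health` should print the health payload, while the other subcommands would need matching stub routes. Call `server.shutdown()` and `server.server_close()` when finished.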

colab_train.py CHANGED

@@ -1,139 +1,543 @@
 # -*- coding: utf-8 -*-
-"""AniFileBERT — Google Colab Training Script
-=============================================
-How to use:
-  1. Open https://colab.research.google.com/
-  2. File → Upload notebook → select this file, OR
-     Copy the entire content into a new code cell
-  3. Runtime → Change runtime type → T4 GPU
-  4. Run all
-What it does:
-  - Mounts Google Drive (for persistent checkpoints)
-  - Clones AniFileBERT repo + AnimeName dataset submodule
-  - Installs PyTorch + Transformers dependencies
-  - Runs training: train a character-token model with the full DMHY vocab
-  - Saves final model to Drive
-Output:
-  - Checkpoints saved to: MyDrive/AniFileBERT/checkpoints/
-  - Final model at:       MyDrive/AniFileBERT/checkpoints/dmhy-weak-char/final/
 """
 import os
-import sys
 import subprocess
-import time
-def run(cmd, echo=True):
-    """Run a shell command and print output in real time."""
-    if echo:
-        print(f"\n$ {cmd}")
     proc = subprocess.Popen(
-        cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
-        text=True, bufsize=1
     )
     for line in proc.stdout:
         print(line, end="")
     proc.wait()
-    if proc.returncode != 0:
-        raise RuntimeError(f"Command failed (exit code {proc.returncode}): {cmd}")
     return proc.returncode
-# ── 1. Mount Google Drive ──────────────────────────────────────
-print("=" * 60)
-print("STEP 1: Mount Google Drive")
-print("=" * 60)
-from google.colab import drive
-drive.mount("/content/drive")
-DRIVE_ROOT = "/content/drive/MyDrive/AniFileBERT"
-os.makedirs(DRIVE_ROOT, exist_ok=True)
-print(f"Checkpoints will be saved to: {DRIVE_ROOT}")
-# ── 2. Clone repositories ──────────────────────────────────────
-print("\n" + "=" * 60)
-print("STEP 2: Clone AniFileBERT repository")
-print("=" * 60)
-REPO_DIR = "/content/AniFileBERT"
-if not os.path.isdir(REPO_DIR):
-    os.chdir("/content")
-    run("git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT")
-else:
-    print("Repository already exists, pulling latest...")
-    os.chdir(REPO_DIR)
-    run("git pull")
-    run("git submodule update --init --recursive")
-os.chdir(REPO_DIR)
-# ── 3. Install dependencies ────────────────────────────────────
-print("\n" + "=" * 60)
-print("STEP 3: Install dependencies")
-print("=" * 60)
-# Colab comes with PyTorch + CUDA pre-installed. Just install the extras.
-run("pip install transformers accelerate seqeval onnx onnxruntime onnxscript")
-# ── 4. Verify GPU ──────────────────────────────────────────────
-print("\n" + "=" * 60)
-print("STEP 4: Verify GPU")
-print("=" * 60)
-run("nvidia-smi 2>/dev/null || echo 'No GPU found — training will be slow on CPU'")
-# Single-quote the shell command to avoid bash expanding {torch...}
-run("python -c 'import torch; print(f\"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}\")'")
-# ── 5. Verify vocab ────────────────────────────────────────────
-print("\n" + "=" * 60)
-print("STEP 5: Verify vocabulary")
-print("=" * 60)
-run("python -c 'import json; v=json.load(open(\"vocab.char.json\", encoding=\"utf-8\")); print(f\"Character vocab size: {len(v)} tokens\")'")
-# ── 6. Run training ────────────────────────────────────────────
-print("\n" + "=" * 60)
-print("STEP 6: Train model")
-print("=" * 60)
-# The full DMHY character vocab is only 6199 tokens and covers every character
-# occurrence in dmhy_weak_char.jsonl.
-SAVE_DIR = os.path.join(DRIVE_ROOT, "checkpoints", "dmhy-weak-char")
-run(
-    f"python train.py "
-    f"--tokenizer char "
-    f"--data-file datasets/AnimeName/dmhy_weak_char.jsonl "
-    f"--vocab-file vocab.char.json "
-    f"--save-dir {SAVE_DIR} "
-    f"--epochs 5 --batch-size 128 "
-    f"--learning-rate 0.0003 --warmup-steps 300 "
-    f"--max-seq-length 128 "
-    f"--seed 42 "
-    f"--no-shuffle"
-)
-# ── 7. Export ONNX (optional) ──────────────────────────────────
-print("\n" + "=" * 60)
-print("STEP 7: Export ONNX (optional — skip if it fails)")
-print("=" * 60)
-ONNX_OUT = os.path.join(SAVE_DIR, "..", "anime_filename_parser.onnx")
-try:
     run(
-        f"python export_onnx.py "
-        f"--model-dir {SAVE_DIR}/final "
-        f"--output {ONNX_OUT}"
     )
-except Exception as e:
-    print(f"[WARN] ONNX export skipped: {e}")
-# ── 8. Summary ─────────────────────────────────────────────────
-print("\n" + "=" * 60)
-print("DONE!")
-print("=" * 60)
-print(f"\nCheckpoints:  {SAVE_DIR}/")
-print(f"Final model:  {SAVE_DIR}/final/")
-print(f"ONNX export:  {ONNX_OUT}")
-print(f"\nAll files are on Google Drive — they persist across Colab sessions.")
-print(f"You can also download them from the Drive web UI.")

 # -*- coding: utf-8 -*-
+"""Codex-friendly Google Colab runner for AniFileBERT training.
+Typical Colab usage:
+    python colab_train.py --config colab/configs/dmhy_regex_finetune.json
+This script keeps the Colab side reproducible by putting run parameters in JSON
+profiles. It can clone/update the repo, mount Drive, install dependencies,
+train, optionally export ONNX, run an inference smoke check, and write a run
+manifest that Codex can inspect later.
 """
+from __future__ import annotations
+import argparse
+import copy
+import datetime as dt
+import json
 import os
+from pathlib import Path
+import shlex
+import shutil
 import subprocess
+import sys
+import traceback
+from typing import Any, Mapping, Sequence
+import urllib.request
+DEFAULT_CONFIG: dict[str, Any] = {
+    "name": "dmhy-regex-finetune",
+    "repo_url": "https://huggingface.co/ModerRAS/AniFileBERT",
+    "repo_ref": "main",
+    "repo_dir": "/content/AniFileBERT",
+    "drive_root": "/content/drive/MyDrive/AniFileBERT",
+    "mount_drive": True,
+    "pull": True,
+    "install": {
+        "requirements": True,
+        "git_lfs": True,
+        "extra_packages": [],
+    },
+    "training": {
+        "tokenizer": "regex",
+        "data_file": "datasets/AnimeName/dmhy_weak.jsonl",
+        "vocab_file": "datasets/AnimeName/vocab.json",
+        "save_dir": "{drive_root}/checkpoints/{name}",
+        "init_model_dir": ".",
+        "epochs": 1,
+        "batch_size": 128,
+        "learning_rate": 0.0003,
+        "warmup_steps": 300,
+        "train_split": 0.9,
+        "max_seq_length": 64,
+        "seed": 42,
+        "limit_samples": None,
+        "rebuild_vocab": False,
+        "max_vocab_size": None,
+        "resume_from_checkpoint": "auto",
+        "checkpoint_steps": 1000,
+        "save_total_limit": 3,
+        "cpu": False,
+        "no_shuffle": False,
+        "extra_args": [],
+    },
+    "export": {
+        "enabled": True,
+        "required": False,
+        "output": "{save_dir}/exports/anime_filename_parser.onnx",
+        "max_length": "{max_seq_length}",
+        "sample": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub",
+        "android_assets_dir": None,
+    },
+    "smoke": {
+        "enabled": True,
+        "required": True,
+        "sample": "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub",
+    },
+    "artifacts": {
+        "manifest": "{save_dir}/colab_run_manifest.json",
+        "latest_manifest": "{drive_root}/last_run_manifest.json",
+    },
+}
+COMMAND_LOG: list[dict[str, Any]] = []
+class SafeFormatDict(dict):
+    def __missing__(self, key: str) -> str:
+        return "{" + key + "}"
+def utc_now() -> str:
+    return dt.datetime.now(dt.timezone.utc).replace(microsecond=0).isoformat().replace("+00:00", "Z")
+def deep_merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> dict[str, Any]:
+    merged = copy.deepcopy(dict(base))
+    for key, value in override.items():
+        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
+            merged[key] = deep_merge(merged[key], value)
+        else:
+            merged[key] = copy.deepcopy(value)
+    return merged
+def render_templates(value: Any, context: Mapping[str, Any]) -> Any:
+    if isinstance(value, str):
+        return value.format_map(SafeFormatDict(context))
+    if isinstance(value, list):
+        return [render_templates(item, context) for item in value]
+    if isinstance(value, dict):
+        return {key: render_templates(item, context) for key, item in value.items()}
+    return value
+def command_text(args: str | Sequence[Any]) -> str:
+    if isinstance(args, str):
+        return args
+    return " ".join(shlex.quote(str(arg)) for arg in args)
+def run(
+    args: str | Sequence[Any],
+    *,
+    cwd: str | os.PathLike[str] | None = None,
+    check: bool = True,
+    dry_run: bool = False,
+) -> int:
+    text = command_text(args)
+    entry: dict[str, Any] = {
+        "cmd": text,
+        "cwd": os.fspath(cwd) if cwd is not None else None,
+        "started_at": utc_now(),
+        "dry_run": dry_run,
+    }
+    COMMAND_LOG.append(entry)
+    print(f"\n$ {text}")
+    if dry_run:
+        entry["returncode"] = 0
+        entry["finished_at"] = utc_now()
+        return 0
     proc = subprocess.Popen(
+        args,
+        cwd=cwd,
+        shell=isinstance(args, str),
+        stdout=subprocess.PIPE,
+        stderr=subprocess.STDOUT,
+        text=True,
+        encoding="utf-8",
+        errors="replace",
+        bufsize=1,
     )
+    assert proc.stdout is not None
     for line in proc.stdout:
         print(line, end="")
     proc.wait()
+    entry["returncode"] = proc.returncode
+    entry["finished_at"] = utc_now()
+    if check and proc.returncode != 0:
+        raise RuntimeError(f"Command failed with exit code {proc.returncode}: {text}")
     return proc.returncode
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Run AniFileBERT training in Colab")
+    parser.add_argument("--config", help="JSON profile path or URL")
+    parser.add_argument("--profile", help="Profile name under colab/configs without .json")
+    parser.add_argument("--repo-url", help="Override repository URL")
+    parser.add_argument("--repo-ref", help="Override branch, tag, or commit to checkout")
+    parser.add_argument("--repo-dir", help="Override Colab repository directory")
+    parser.add_argument("--drive-root", help="Override Google Drive output root")
+    parser.add_argument("--save-dir", help="Override checkpoint output directory")
+    parser.add_argument("--epochs", type=float, help="Override training epochs")
+    parser.add_argument("--batch-size", type=int, help="Override per-device batch size")
+    parser.add_argument("--learning-rate", type=float, help="Override learning rate")
+    parser.add_argument("--warmup-steps", type=int, help="Override warmup steps")
+    parser.add_argument("--limit-samples", type=int, help="Use only the first N dataset rows")
+    parser.add_argument("--skip-install", action="store_true", help="Do not install pip or git-lfs dependencies")
+    parser.add_argument("--skip-export", action="store_true", help="Do not run ONNX export")
+    parser.add_argument("--skip-smoke", action="store_true", help="Do not run inference smoke check")
+    parser.add_argument("--no-mount-drive", action="store_true", help="Do not mount Google Drive")
+    parser.add_argument("--no-pull", action="store_true", help="Do not pull an existing checkout")
+    parser.add_argument("--dry-run", action="store_true", help="Print commands and write no training outputs")
+    parser.add_argument("--print-config", action="store_true", help="Print resolved config before running")
+    return parser.parse_args()
+def load_json_source(source: str | None, *, required: bool) -> dict[str, Any]:
+    if not source:
+        return {}
+    if source.startswith(("http://", "https://")):
+        with urllib.request.urlopen(source) as response:
+            return json.loads(response.read().decode("utf-8"))
+    candidates = [Path(source), Path(__file__).resolve().parent / source]
+    for candidate in candidates:
+        if candidate.is_file():
+            return json.loads(candidate.read_text(encoding="utf-8"))
+    if required:
+        raise FileNotFoundError(f"Config file not found: {source}")
+    return {}
+def load_config(args: argparse.Namespace) -> dict[str, Any]:
+    config_source = args.config
+    required = bool(args.config)
+    if config_source is None and args.profile:
+        config_source = os.fspath(Path("colab") / "configs" / f"{args.profile}.json")
+        required = True
+    profile_config = load_json_source(config_source, required=required)
+    config = deep_merge(DEFAULT_CONFIG, profile_config)
+    if args.repo_url:
+        config["repo_url"] = args.repo_url
+    if args.repo_ref:
+        config["repo_ref"] = args.repo_ref
+    if args.repo_dir:
+        config["repo_dir"] = args.repo_dir
+    if args.drive_root:
+        config["drive_root"] = args.drive_root
+    if args.no_mount_drive:
+        config["mount_drive"] = False
+    if args.no_pull:
+        config["pull"] = False
+    if args.skip_install:
+        config["install"]["requirements"] = False
+        config["install"]["git_lfs"] = False
+        config["install"]["extra_packages"] = []
+    if args.skip_export:
+        config["export"]["enabled"] = False
+    if args.skip_smoke:
+        config["smoke"]["enabled"] = False
+    training = config["training"]
+    for arg_name, key in [
+        ("save_dir", "save_dir"),
+        ("epochs", "epochs"),
+        ("batch_size", "batch_size"),
+        ("learning_rate", "learning_rate"),
+        ("warmup_steps", "warmup_steps"),
+        ("limit_samples", "limit_samples"),
+    ]:
+        value = getattr(args, arg_name)
+        if value is not None:
+            training[key] = value
+    return resolve_config(config)
+def resolve_config(config: dict[str, Any]) -> dict[str, Any]:
+    context: dict[str, Any] = {
+        "name": config["name"],
+        "repo_url": config["repo_url"],
+        "repo_ref": config.get("repo_ref") or "",
+        "repo_dir": config["repo_dir"],
+        "drive_root": config["drive_root"],
+    }
+    training = render_templates(config["training"], context)
+    context.update(training)
+    if not training.get("save_dir"):
+        training["save_dir"] = os.path.join(config["drive_root"], "checkpoints", config["name"])
+    training = render_templates(training, {**context, **training})
+    context.update(training)
+    context["save_dir"] = training["save_dir"]
+    context["final_model_dir"] = os.path.join(training["save_dir"], "final")
+    resolved = copy.deepcopy(config)
+    resolved["training"] = training
+    resolved["export"] = render_templates(config["export"], context)
+    resolved["smoke"] = render_templates(config["smoke"], context)
+    resolved["artifacts"] = render_templates(config["artifacts"], context)
+    return resolved
+def maybe_mount_drive(config: Mapping[str, Any]) -> None:
+    if not config.get("mount_drive", True):
+        print("Google Drive mount disabled.")
+        return
+    try:
+        from google.colab import drive  # type: ignore
+    except Exception:
+        print("[WARN] google.colab is unavailable; skipping Drive mount.")
+        return
+    print("Mounting Google Drive...")
+    drive.mount("/content/drive")
+def install_git_lfs_if_needed(config: Mapping[str, Any], *, dry_run: bool) -> None:
+    if not config.get("install", {}).get("git_lfs", True):
+        return
+    if shutil.which("git-lfs"):
+        run(["git", "lfs", "install"], check=False, dry_run=dry_run)
+        return
+    if Path("/content").exists():
+        print("Installing git-lfs for Hugging Face model artifacts...")
+        run(["apt-get", "update"], check=False, dry_run=dry_run)
+        run(["apt-get", "install", "-y", "git-lfs"], dry_run=dry_run)
+        run(["git", "lfs", "install"], check=False, dry_run=dry_run)
+    else:
+        print("[WARN] git-lfs not found. Existing LFS pointers may not contain model weights.")
+def is_git_repo(path: Path) -> bool:
+    return (path / ".git").exists()
+def prepare_repo(config: Mapping[str, Any], *, dry_run: bool) -> Path:
+    repo_dir = Path(config["repo_dir"])
+    repo_url = config["repo_url"]
+    repo_ref = config.get("repo_ref")
+    if not is_git_repo(repo_dir):
+        if repo_dir.exists() and any(repo_dir.iterdir()):
+            raise RuntimeError(f"{repo_dir} exists but is not a git checkout")
+        if not dry_run:  # keep --dry-run free of filesystem side effects, as run_training does
+            repo_dir.parent.mkdir(parents=True, exist_ok=True)
+        run(["git", "clone", "--recursive", repo_url, os.fspath(repo_dir)], dry_run=dry_run)
+    else:
+        print(f"Using existing repository checkout: {repo_dir}")
+    if repo_ref:
+        run(["git", "fetch", "--all", "--tags"], cwd=repo_dir, check=False, dry_run=dry_run)
+        run(["git", "checkout", str(repo_ref)], cwd=repo_dir, dry_run=dry_run)
+    if config.get("pull", True):
+        run(["git", "pull", "--ff-only"], cwd=repo_dir, check=False, dry_run=dry_run)
+    run(["git", "submodule", "update", "--init", "--recursive"], cwd=repo_dir, dry_run=dry_run)
+    if shutil.which("git-lfs"):
+        run(["git", "lfs", "pull"], cwd=repo_dir, check=False, dry_run=dry_run)
+    return repo_dir
+def install_python_deps(config: Mapping[str, Any], repo_dir: Path, *, dry_run: bool) -> None:
+    install = config.get("install", {})
+    if install.get("requirements", True):
+        run([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"], cwd=repo_dir, dry_run=dry_run)
+    for package in install.get("extra_packages", []):
+        run([sys.executable, "-m", "pip", "install", str(package)], cwd=repo_dir, dry_run=dry_run)
+def verify_runtime(repo_dir: Path, *, dry_run: bool) -> None:
+    run(["nvidia-smi"], cwd=repo_dir, check=False, dry_run=dry_run)
     run(
+        [
+            sys.executable,
+            "-c",
+            "import torch; print(f'PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}')",
+        ],
+        cwd=repo_dir,
+        check=False,
+        dry_run=dry_run,
     )
+def add_arg(cmd: list[str], flag: str, value: Any) -> None:
+    if value is None or value is False:
+        return
+    if value is True:
+        cmd.append(flag)
+    else:
+        cmd.extend([flag, str(value)])
+def build_train_command(training: Mapping[str, Any]) -> list[str]:
+    cmd = [sys.executable, "train.py"]
+    for key, flag in [
+        ("tokenizer", "--tokenizer"),
+        ("data_file", "--data-file"),
+        ("vocab_file", "--vocab-file"),
+        ("save_dir", "--save-dir"),
+        ("init_model_dir", "--init-model-dir"),
+        ("epochs", "--epochs"),
+        ("batch_size", "--batch-size"),
+        ("learning_rate", "--learning-rate"),
+        ("warmup_steps", "--warmup-steps"),
+        ("train_split", "--train-split"),
+        ("max_seq_length", "--max-seq-length"),
+        ("seed", "--seed"),
+        ("limit_samples", "--limit-samples"),
+        ("max_vocab_size", "--max-vocab-size"),
+        ("resume_from_checkpoint", "--resume-from-checkpoint"),
+        ("checkpoint_steps", "--checkpoint-steps"),
+        ("save_total_limit", "--save-total-limit"),
+    ]:
+        add_arg(cmd, flag, training.get(key))
+    add_arg(cmd, "--rebuild-vocab", training.get("rebuild_vocab"))
+    add_arg(cmd, "--cpu", training.get("cpu"))
+    add_arg(cmd, "--no-shuffle", training.get("no_shuffle"))
+    cmd.extend(str(arg) for arg in training.get("extra_args", []))
+    return cmd
+def run_training(config: Mapping[str, Any], repo_dir: Path, *, dry_run: bool) -> None:
+    training = config["training"]
+    if not dry_run:
+        Path(training["save_dir"]).mkdir(parents=True, exist_ok=True)
+    run(build_train_command(training), cwd=repo_dir, dry_run=dry_run)
+def run_export(config: Mapping[str, Any], repo_dir: Path, *, dry_run: bool) -> None:
+    export = config["export"]
+    if not export.get("enabled", True):
+        print("ONNX export disabled.")
+        return
+    cmd = [
+        sys.executable,
+        "export_onnx.py",
+        "--model-dir",
+        os.path.join(config["training"]["save_dir"], "final"),
+        "--output",
+        export["output"],
+        "--max-length",
+        str(export["max_length"]),
+    ]
+    add_arg(cmd, "--sample", export.get("sample"))
+    add_arg(cmd, "--android-assets-dir", export.get("android_assets_dir"))
+    try:
+        run(cmd, cwd=repo_dir, dry_run=dry_run)
+    except Exception:
+        if export.get("required", False):
+            raise
+        print("[WARN] ONNX export failed, but export.required is false.")
+        traceback.print_exc()
+def run_smoke(config: Mapping[str, Any], repo_dir: Path, *, dry_run: bool) -> None:
+    smoke = config["smoke"]
+    if not smoke.get("enabled", True):
+        print("Inference smoke check disabled.")
+        return
+    cmd = [
+        sys.executable,
+        "inference.py",
+        "--model-dir",
+        os.path.join(config["training"]["save_dir"], "final"),
+        smoke["sample"],
+    ]
+    try:
+        run(cmd, cwd=repo_dir, dry_run=dry_run)
+    except Exception:
+        if smoke.get("required", True):
+            raise
+        print("[WARN] Smoke check failed, but smoke.required is false.")
+        traceback.print_exc()
+def git_commit(repo_dir: Path, *, dry_run: bool) -> str | None:
+    if dry_run:
+        return None
+    try:
+        return subprocess.check_output(
+            ["git", "rev-parse", "HEAD"],
+            cwd=repo_dir,
+            text=True,
+            encoding="utf-8",
+            errors="replace",
+        ).strip()
+    except Exception:
+        return None
+def write_json(path: str | os.PathLike[str], data: Mapping[str, Any], *, dry_run: bool) -> None:
+    print(f"Writing manifest: {path}")
+    if dry_run:
+        return
+    output_path = Path(path)
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    output_path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
+def write_manifests(
+    config: Mapping[str, Any],
+    repo_dir: Path,
+    *,
+    status: str,
+    started_at: str,
+    error: str | None,
+    dry_run: bool,
+) -> None:
+    save_dir = config["training"]["save_dir"]
+    manifest = {
+        "status": status,
+        "name": config["name"],
+        "started_at": started_at,
+        "finished_at": utc_now(),
+        "repo_url": config["repo_url"],
+        "repo_ref": config.get("repo_ref"),
+        "repo_commit": git_commit(repo_dir, dry_run=dry_run),
+        "repo_dir": os.fspath(repo_dir),
+        "save_dir": save_dir,
+        "final_model_dir": os.path.join(save_dir, "final"),
+        "onnx_output": config["export"].get("output") if config["export"].get("enabled") else None,
+        "config": config,
+        "commands": COMMAND_LOG,
+        "error": error,
+    }
+    artifacts = config["artifacts"]
+    write_json(artifacts["manifest"], manifest, dry_run=dry_run)
+    if artifacts.get("latest_manifest"):
+        write_json(artifacts["latest_manifest"], manifest, dry_run=dry_run)
+def main() -> None:
+    args = parse_args()
+    started_at = utc_now()
+    config = load_config(args)
+    if args.print_config:
+        print(json.dumps(config, ensure_ascii=False, indent=2))
+    repo_dir = Path(config["repo_dir"])
+    status = "failed"
+    error: str | None = None
+    try:
+        maybe_mount_drive(config)
+        install_git_lfs_if_needed(config, dry_run=args.dry_run)
+        repo_dir = prepare_repo(config, dry_run=args.dry_run)
+        install_python_deps(config, repo_dir, dry_run=args.dry_run)
+        verify_runtime(repo_dir, dry_run=args.dry_run)
+        run_training(config, repo_dir, dry_run=args.dry_run)
+        run_export(config, repo_dir, dry_run=args.dry_run)
+        run_smoke(config, repo_dir, dry_run=args.dry_run)
+        status = "success"
+    except Exception as exc:
+        error = f"{type(exc).__name__}: {exc}"
+        raise
+    finally:
+        write_manifests(config, repo_dir, status=status, started_at=started_at, error=error, dry_run=args.dry_run)
+    print("\nDone.")
+    print(f"Final model: {os.path.join(config['training']['save_dir'], 'final')}")
+    print(f"Manifest: {config['artifacts']['manifest']}")
+if __name__ == "__main__":
+    main()
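
To make the flag mapping in `add_arg` and `build_train_command` concrete, here is a minimal standalone sketch. It copies the `add_arg` rule as it appears in the diff and applies it to a small hypothetical `training` dict; the dict values and the shortened flag table are illustrative assumptions, not the contents of the shipped profiles.

```python
from __future__ import annotations

import sys
from typing import Any


def add_arg(cmd: list[str], flag: str, value: Any) -> None:
    """Same rule as in the diff: skip None/False, bare flag for True, else flag + str(value)."""
    if value is None or value is False:
        return
    if value is True:
        cmd.append(flag)
    else:
        cmd.extend([flag, str(value)])


# Hypothetical training section; the values are illustrative only.
training = {
    "tokenizer": "regex",
    "epochs": 3,
    "limit_samples": None,   # unset value, so the flag is omitted
    "rebuild_vocab": True,   # boolean switch, emitted as a bare flag
    "cpu": False,            # disabled switch, omitted
}

cmd = [sys.executable, "train.py"]
for key, flag in [
    ("tokenizer", "--tokenizer"),
    ("epochs", "--epochs"),
    ("limit_samples", "--limit-samples"),
]:
    add_arg(cmd, flag, training.get(key))
add_arg(cmd, "--rebuild-vocab", training.get("rebuild_vocab"))
add_arg(cmd, "--cpu", training.get("cpu"))

print(cmd[1:])  # ['train.py', '--tokenizer', 'regex', '--epochs', '3', '--rebuild-vocab']
```

Because only `None` and `False` are skipped, a numeric `0` is still forwarded as `"0"`, so a profile that wants to leave a numeric option unset should use `null` rather than `0`.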

colab_worker.py ADDED

	@@ -0,0 +1,446 @@

+# -*- coding: utf-8 -*-
+"""Small HTTP worker for running AniFileBERT training jobs on Google Colab.
+Start this inside a Colab runtime:
+    python colab_worker.py
+The worker exposes a token-protected local HTTP API and, by default, starts a
+Cloudflare Quick Tunnel so Codex on your local machine can submit jobs.
+"""
+from __future__ import annotations
+import argparse
+import json
+import os
+from pathlib import Path
+import platform
+import re
+import secrets
+import shutil
+import signal
+import subprocess
+import sys
+import threading
+import time
+import traceback
+from http import HTTPStatus
+from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
+from typing import Any
+from urllib.parse import parse_qs, urlparse
+import urllib.request
+TERMINAL_STATES = {"success", "failed", "cancelled"}
+TUNNEL_URL_RE = re.compile(r"https://[-a-zA-Z0-9.]+\.trycloudflare\.com")
+def utc_timestamp() -> str:
+    return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
+def json_dumps(data: Any) -> str:
+    return json.dumps(data, ensure_ascii=False, indent=2)
+def read_tail(path: Path, lines: int) -> str:
+    if not path.is_file():
+        return ""
+    if lines <= 0:
+        return path.read_text(encoding="utf-8", errors="replace")
+    chunk_size = 8192
+    data = b""
+    with path.open("rb") as f:
+        f.seek(0, os.SEEK_END)
+        pos = f.tell()
+        while pos > 0 and data.count(b"\n") <= lines:
+            read_size = min(chunk_size, pos)
+            pos -= read_size
+            f.seek(pos)
+            data = f.read(read_size) + data
+    return b"\n".join(data.splitlines()[-lines:]).decode("utf-8", errors="replace")
+def download_cloudflared(path: Path) -> Path:
+    if path.is_file():
+        return path
+    existing = shutil.which("cloudflared")
+    if existing:
+        return Path(existing)
+    arch = platform.machine().lower()
+    if arch in {"x86_64", "amd64"}:
+        suffix = "linux-amd64"
+    elif arch in {"aarch64", "arm64"}:
+        suffix = "linux-arm64"
+    else:
+        raise RuntimeError(f"Unsupported CPU architecture for cloudflared: {arch}")
+    url = f"https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-{suffix}"
+    print(f"Downloading cloudflared: {url}", flush=True)
+    path.parent.mkdir(parents=True, exist_ok=True)
+    urllib.request.urlretrieve(url, path)
+    path.chmod(0o755)
+    return path
+class WorkerState:
+    def __init__(self, repo_dir: Path, jobs_dir: Path):
+        self.repo_dir = repo_dir
+        self.jobs_dir = jobs_dir
+        self.jobs_dir.mkdir(parents=True, exist_ok=True)
+        self.jobs: dict[str, dict[str, Any]] = {}
+        self.lock = threading.RLock()
+    def list_jobs(self) -> list[dict[str, Any]]:
+        with self.lock:
+            return [self._public_job(job) for job in self.jobs.values()]
+    def get_job(self, job_id: str) -> dict[str, Any] | None:
+        with self.lock:
+            job = self.jobs.get(job_id)
+            return self._public_job(job) if job else None
+    def get_job_internal(self, job_id: str) -> dict[str, Any] | None:
+        with self.lock:
+            return self.jobs.get(job_id)
+    def active_job(self) -> dict[str, Any] | None:
+        with self.lock:
+            for job in self.jobs.values():
+                if job["status"] not in TERMINAL_STATES:
+                    return job
+        return None
+    def start_job(self, payload: dict[str, Any]) -> dict[str, Any]:
+        with self.lock:
+            active = self.active_job()
+            if active is not None:
+                raise RuntimeError(f"Job already running: {active['job_id']}")
+            job_id = time.strftime("%Y%m%d-%H%M%S", time.gmtime()) + "-" + secrets.token_hex(3)
+            job_dir = self.jobs_dir / job_id
+            job_dir.mkdir(parents=True, exist_ok=True)
+            log_path = job_dir / "worker.log"
+            config_path: Path | None = None
+            cmd = [sys.executable, "colab_train.py"]
+            config = self._job_config(payload)
+            config.setdefault("artifacts", {})
+            config["artifacts"]["manifest"] = os.fspath(job_dir / "colab_run_manifest.json")
+            config_path = job_dir / "config.json"
+            config_path.write_text(json_dumps(config), encoding="utf-8")
+            cmd.extend(["--config", os.fspath(config_path)])
+            for arg in payload.get("args", []):
+                cmd.append(str(arg))
+            job = {
+                "job_id": job_id,
+                "status": "queued",
+                "created_at": utc_timestamp(),
+                "started_at": None,
+                "finished_at": None,
+                "returncode": None,
+                "cmd": cmd,
+                "cwd": os.fspath(self.repo_dir),
+                "job_dir": os.fspath(job_dir),
+                "log_path": os.fspath(log_path),
+                "config_path": os.fspath(config_path) if config_path else None,
+                "error": None,
+                "process": None,
+            }
+            self.jobs[job_id] = job
+        thread = threading.Thread(target=self._run_job, args=(job_id,), daemon=True)
+        thread.start()
+        return self._public_job(job)
+    def _job_config(self, payload: dict[str, Any]) -> dict[str, Any]:
+        if "config" in payload:
+            return json.loads(json.dumps(payload["config"], ensure_ascii=False))
+        profile = str(payload.get("profile", "dmhy_regex_finetune"))
+        profile_path = self.repo_dir / "colab" / "configs" / f"{profile}.json"
+        if not profile_path.is_file():
+            raise FileNotFoundError(f"Profile not found: {profile_path}")
+        return json.loads(profile_path.read_text(encoding="utf-8"))
+    def cancel_job(self, job_id: str) -> dict[str, Any]:
+        with self.lock:
+            job = self.jobs.get(job_id)
+            if job is None:
+                raise KeyError(job_id)
+            process: subprocess.Popen[str] | None = job.get("process")
+            if job["status"] in TERMINAL_STATES:
+                return self._public_job(job)
+            job["status"] = "cancelled"
+            job["finished_at"] = utc_timestamp()
+        if process and process.poll() is None:
+            try:
+                os.killpg(os.getpgid(process.pid), signal.SIGTERM)
+            except Exception:
+                process.terminate()
+        return self.get_job(job_id) or {}
+    def _run_job(self, job_id: str) -> None:
+        job = self.get_job_internal(job_id)
+        if job is None:
+            return
+        log_path = Path(job["log_path"])
+        try:
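+            with self.lock:
+                if job["status"] == "cancelled":
+                    return  # cancelled while still queued; do not start the process
+                job["status"] = "running"
+                job["started_at"] = utc_timestamp()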
+            with log_path.open("w", encoding="utf-8", errors="replace") as log:
+                log.write(f"job_id={job_id}\n")
+                log.write(f"cwd={job['cwd']}\n")
+                log.write("$ " + " ".join(job["cmd"]) + "\n\n")
+                log.flush()
+                process = subprocess.Popen(
+                    job["cmd"],
+                    cwd=job["cwd"],
+                    stdout=subprocess.PIPE,
+                    stderr=subprocess.STDOUT,
+                    text=True,
+                    encoding="utf-8",
+                    errors="replace",
+                    bufsize=1,
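+                    # Own session/process group so cancel_job can signal the whole tree;
+                    # start_new_session replaces preexec_fn, which is unsafe with threads.
+                    start_new_session=True,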
+                )
+                with self.lock:
+                    job["process"] = process
+                assert process.stdout is not None
+                for line in process.stdout:
+                    log.write(line)
+                    log.flush()
+                    print(line, end="", flush=True)
+                process.wait()
+            with self.lock:
+                job["returncode"] = process.returncode
+                if job["status"] != "cancelled":
+                    job["status"] = "success" if process.returncode == 0 else "failed"
+                job["finished_at"] = utc_timestamp()
+                job["process"] = None
+        except Exception as exc:
+            with log_path.open("a", encoding="utf-8", errors="replace") as log:
+                traceback.print_exc(file=log)
+            with self.lock:
+                job["status"] = "failed"
+                job["finished_at"] = utc_timestamp()
+                job["error"] = f"{type(exc).__name__}: {exc}"
+                job["process"] = None
+    def _public_job(self, job: dict[str, Any]) -> dict[str, Any]:
+        public = {key: value for key, value in job.items() if key != "process"}
+        return public
+def make_handler(state: WorkerState, token: str):
+    class Handler(BaseHTTPRequestHandler):
+        server_version = "AniFileBERTColabWorker/1.0"
+        def log_message(self, fmt: str, *args: Any) -> None:
+            print(f"[{utc_timestamp()}] {self.address_string()} {fmt % args}", flush=True)
+        def do_GET(self) -> None:
+            self._handle("GET")
+        def do_POST(self) -> None:
+            self._handle("POST")
+        def _handle(self, method: str) -> None:
+            parsed = urlparse(self.path)
+            path = parsed.path.rstrip("/") or "/"
+            parts = [part for part in path.split("/") if part]
+            try:
+                if not self._authorized():
+                    self._send({"error": "unauthorized"}, HTTPStatus.UNAUTHORIZED)
+                    return
+                if method == "GET" and path == "/health":
+                    # Read the active job once; it may finish between two separate calls.
+                    active = state.active_job()
+                    self._send(
+                        {
+                            "ok": True,
+                            "repo_dir": os.fspath(state.repo_dir),
+                            "jobs_dir": os.fspath(state.jobs_dir),
+                            "active_job": active["job_id"] if active else None,
+                        }
+                    )
+                    return
+                if method == "GET" and path == "/jobs":
+                    self._send({"jobs": state.list_jobs()})
+                    return
+                if method == "POST" and path == "/jobs":
+                    payload = self._read_json()
+                    job = state.start_job(payload)
+                    self._send(job, HTTPStatus.ACCEPTED)
+                    return
+                if len(parts) >= 2 and parts[0] == "jobs":
+                    job_id = parts[1]
+                    if method == "GET" and len(parts) == 2:
+                        job = state.get_job(job_id)
+                        if job is None:
+                            self._send({"error": "job not found"}, HTTPStatus.NOT_FOUND)
+                        else:
+                            self._send(job)
+                        return
+                    if method == "GET" and len(parts) == 3 and parts[2] == "logs":
+                        query = parse_qs(parsed.query)
+                        tail = int(query.get("tail", ["200"])[0])
+                        job = state.get_job_internal(job_id)
+                        if job is None:
+                            self._send({"error": "job not found"}, HTTPStatus.NOT_FOUND)
+                        else:
+                            self._send({"job_id": job_id, "log": read_tail(Path(job["log_path"]), tail)})
+                        return
+                    if method == "GET" and len(parts) == 3 and parts[2] == "manifest":
+                        job = state.get_job_internal(job_id)
+                        if job is None:
+                            self._send({"error": "job not found"}, HTTPStatus.NOT_FOUND)
+                        else:
+                            manifest = self._find_manifest(job)
+                            if manifest is None:
+                                self._send({"error": "manifest not found"}, HTTPStatus.NOT_FOUND)
+                            else:
+                                self._send(json.loads(manifest.read_text(encoding="utf-8")))
+                        return
+                    if method == "POST" and len(parts) == 3 and parts[2] == "cancel":
+                        try:
+                            self._send(state.cancel_job(job_id))
+                        except KeyError:
+                            self._send({"error": "job not found"}, HTTPStatus.NOT_FOUND)
+                        return
+                self._send({"error": "not found"}, HTTPStatus.NOT_FOUND)
+            except Exception as exc:
+                traceback.print_exc()
+                self._send({"error": f"{type(exc).__name__}: {exc}"}, HTTPStatus.INTERNAL_SERVER_ERROR)
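+        def _authorized(self) -> bool:
+            # Compare in constant time so the token cannot be probed via response timing.
+            expected = token.encode("utf-8")
+            header = self.headers.get("Authorization", "")
+            if secrets.compare_digest(header.encode("utf-8"), b"Bearer " + expected):
+                return True
+            alt = self.headers.get("X-Colab-Token", "")
+            return secrets.compare_digest(alt.encode("utf-8"), expected)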
+        def _read_json(self) -> dict[str, Any]:
+            length = int(self.headers.get("Content-Length", "0"))
+            if length == 0:
+                return {}
+            raw = self.rfile.read(length)
+            return json.loads(raw.decode("utf-8"))
+        def _find_manifest(self, job: dict[str, Any]) -> Path | None:
+            config_path = job.get("config_path")
+            if config_path and Path(config_path).is_file():
+                config = json.loads(Path(config_path).read_text(encoding="utf-8"))
+                training = config.get("training", {})
+                save_dir = training.get("save_dir")
+                if save_dir:
+                    manifest = Path(save_dir) / "colab_run_manifest.json"
+                    if manifest.is_file():
+                        return manifest
+            job_manifest = Path(job["job_dir"]) / "colab_run_manifest.json"
+            return job_manifest if job_manifest.is_file() else None
+        def _send(self, data: Any, status: HTTPStatus = HTTPStatus.OK) -> None:
+            raw = json_dumps(data).encode("utf-8")
+            self.send_response(status.value)
+            self.send_header("Content-Type", "application/json; charset=utf-8")
+            self.send_header("Content-Length", str(len(raw)))
+            self.end_headers()
+            self.wfile.write(raw)
+    return Handler
+def start_tunnel(port: int, binary_path: Path) -> subprocess.Popen[str]:
+    cloudflared = download_cloudflared(binary_path)
+    cmd = [
+        os.fspath(cloudflared),
+        "tunnel",
+        "--url",
+        f"http://127.0.0.1:{port}",
+        "--no-autoupdate",
+    ]
+    proc = subprocess.Popen(
+        cmd,
+        stdout=subprocess.PIPE,
+        stderr=subprocess.STDOUT,
+        text=True,
+        encoding="utf-8",
+        errors="replace",
+        bufsize=1,
+    )
+    def pump() -> None:
+        assert proc.stdout is not None
+        for line in proc.stdout:
+            print(line, end="", flush=True)
+            match = TUNNEL_URL_RE.search(line)
+            if match:
+                print("\nCOLAB_WORKER_URL=" + match.group(0), flush=True)
+    threading.Thread(target=pump, daemon=True).start()
+    return proc
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Start the AniFileBERT Colab worker")
+    parser.add_argument("--host", default="127.0.0.1", help="HTTP bind host")
+    parser.add_argument("--port", type=int, default=7860, help="HTTP bind port")
+    parser.add_argument("--repo-dir", default="/content/AniFileBERT", help="AniFileBERT checkout path in Colab")
+    parser.add_argument("--jobs-dir", default="/content/drive/MyDrive/AniFileBERT/worker/jobs")
+    parser.add_argument("--token", default=os.environ.get("ANIFILEBERT_COLAB_TOKEN"))
+    parser.add_argument("--tunnel", choices=["cloudflare", "none"], default="cloudflare")
+    parser.add_argument("--cloudflared-path", default="/tmp/anifilebert-cloudflared")
+    return parser.parse_args()
+def main() -> None:
+    args = parse_args()
+    token = args.token or secrets.token_urlsafe(24)
+    repo_dir = Path(args.repo_dir)
+    if not repo_dir.is_dir():
+        raise RuntimeError(f"Repo directory does not exist: {repo_dir}")
+    state = WorkerState(repo_dir=repo_dir, jobs_dir=Path(args.jobs_dir))
+    server = ThreadingHTTPServer((args.host, args.port), make_handler(state, token))
+    tunnel_proc: subprocess.Popen[str] | None = None
+    print("=" * 72)
+    print("AniFileBERT Colab worker is starting")
+    print(f"Local URL: http://{args.host}:{args.port}")
+    print(f"COLAB_WORKER_TOKEN={token}")
+    print("Keep this Colab cell running while Codex uses the worker.")
+    print("=" * 72, flush=True)
+    if args.tunnel == "cloudflare":
+        tunnel_proc = start_tunnel(args.port, Path(args.cloudflared_path))
+    else:
+        print("Tunnel disabled. Use the local URL from inside the Colab runtime.", flush=True)
+    try:
+        server.serve_forever()
+    finally:
+        server.server_close()
+        if tunnel_proc and tunnel_proc.poll() is None:
+            tunnel_proc.terminate()
+if __name__ == "__main__":
+    main()
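
The routes handled above (`POST /jobs`, `GET /jobs/<job_id>`, `GET /jobs/<job_id>/logs?tail=N`) can be exercised with the standard library alone. The sketch below is not the `colab_client.py` from this change; it is a hypothetical minimal client, and `build_request`, `call`, `BASE_URL`, `TOKEN`, and the `JOB_ID` placeholder are names assumed for illustration. It only builds the requests, so it runs without a live worker.

```python
from __future__ import annotations

import json
import urllib.request
from typing import Any


def build_request(base_url: str, token: str, method: str, path: str,
                  payload: dict[str, Any] | None = None) -> urllib.request.Request:
    """Build an authenticated request for the worker API (no network I/O here)."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Authorization": f"Bearer {token}"}  # header checked by _authorized
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        base_url.rstrip("/") + path, data=data, headers=headers, method=method
    )


def call(req: urllib.request.Request, timeout: float = 30.0) -> Any:
    """Send the request and decode the JSON body returned by the worker."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Placeholder values; use the URL and token printed by the worker cell.
BASE_URL = "https://example.trycloudflare.com"
TOKEN = "replace-me"

submit = build_request(BASE_URL, TOKEN, "POST", "/jobs", {"profile": "dmhy_regex_finetune"})
logs = build_request(BASE_URL, TOKEN, "GET", "/jobs/JOB_ID/logs?tail=50")

# With a live worker, call(submit) would return the public job dict,
# whose "job_id" field replaces the JOB_ID placeholder above.
```

Since `start_job` refuses a new job while another is active and the handler turns that `RuntimeError` into a 500 response, a real client would likely check `/health` for `active_job` before submitting.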

train.py CHANGED

@@ -27,7 +27,7 @@ from transformers import (
 from seqeval.metrics import classification_report, accuracy_score, f1_score, precision_score, recall_score
 from config import Config
-from tokenizer import AnimeTokenizer, create_tokenizer
 from model import create_model, print_model_summary, count_parameters
 from dataset import AnimeDataset, align_tokens_for_tokenizer
@@ -64,8 +64,8 @@ def compute_metrics(p):
 def parse_args() -> argparse.Namespace:
     parser = argparse.ArgumentParser(description="Train anime filename parser")
-    parser.add_argument("--tokenizer", choices=["regex", "char"], default="regex",
-                        help="Tokenizer variant for A/B testing")
     parser.add_argument("--data-file", default=None, help="Training JSONL file")
     parser.add_argument("--vocab-file", default=None,
                         help="Tokenizer vocab JSON. Defaults to data/vocab.json or data/vocab.char.json")
@@ -84,11 +84,58 @@ def parse_args() -> argparse.Namespace:
                         help="Rebuild vocab from the selected data file before training")
     parser.add_argument("--max-vocab-size", type=int, default=None,
                         help="Optional vocab cap used with --rebuild-vocab")
     parser.add_argument("--cpu", action="store_true", help="Force CPU training")
     parser.add_argument("--no-shuffle", action="store_true", help="Do not shuffle before train/eval split")
     return parser.parse_args()
 def resolve_vocab_path(data_file: str, tokenizer_variant: str, explicit_path: Optional[str]) -> str:
     if explicit_path:
         return explicit_path
@@ -96,6 +143,79 @@ def resolve_vocab_path(data_file: str, tokenizer_variant: str, explicit_path: Op
     return os.path.join(os.path.dirname(data_file), name)
 def build_vocab_from_data(data: List[Dict], tokenizer: AnimeTokenizer, vocab_path: str,
                          max_size: Optional[int] = None) -> None:
     token_lists: List[List[str]] = []
@@ -115,9 +235,10 @@ def main():
     config = Config()
     if args.data_file is not None:
         config.data_file = args.data_file
     if args.save_dir is not None:
         config.save_dir = args.save_dir
-    elif args.tokenizer == "char":
         config.save_dir = "./checkpoints_char"
     if args.epochs is not None:
         config.num_epochs = args.epochs
@@ -131,6 +252,8 @@ def main():
         config.train_split = args.train_split
     if args.max_seq_length is not None:
         config.max_seq_length = args.max_seq_length
     random.seed(args.seed)
     np.random.seed(args.seed)
@@ -143,18 +266,20 @@ def main():
         all_data = all_data[:args.limit_samples]
     if not args.no_shuffle:
         random.shuffle(all_data)
     # Load tokenizer
     print("Loading tokenizer...")
-    vocab_path = resolve_vocab_path(config.data_file, args.tokenizer, args.vocab_file)
-    tokenizer = create_tokenizer(args.tokenizer)
     if args.rebuild_vocab or not os.path.isfile(vocab_path):
         max_vocab_size = args.max_vocab_size if args.max_vocab_size is not None else config.vocab_size
-        print(f"  Building {args.tokenizer} vocab: {vocab_path} (max_size={max_vocab_size})")
         build_vocab_from_data(all_data, tokenizer, vocab_path, max_size=max_vocab_size)
-    tokenizer = create_tokenizer(args.tokenizer, vocab_file=vocab_path)
-    print(f"  Variant: {args.tokenizer}")
     print(f"  Vocab size: {tokenizer.vocab_size}")
     # Update config with actual vocab size
     config.vocab_size = tokenizer.vocab_size
@@ -163,9 +288,22 @@ def main():
     if args.init_model_dir:
         print(f"Loading model for fine-tuning: {args.init_model_dir}")
         model = BertForTokenClassification.from_pretrained(args.init_model_dir)
-        if model.config.vocab_size != config.vocab_size:
-            print(f"  Resizing token embeddings: {model.config.vocab_size} -> {config.vocab_size}")
-            model.resize_token_embeddings(config.vocab_size)
         model.config.num_labels = config.num_labels
         model.config.id2label = config.id2label
         model.config.label2id = config.label2id
@@ -212,6 +350,8 @@ def main():
     use_cpu = args.cpu or not torch.cuda.is_available()
     use_fp16 = not use_cpu
     print(f"  Device: {'CPU' if use_cpu else 'CUDA'}")
     # Training arguments
     training_args = TrainingArguments(
@@ -220,15 +360,16 @@ def main():
         per_device_train_batch_size=config.batch_size,
         per_device_eval_batch_size=config.batch_size,
         eval_strategy="epoch",
-        save_strategy="epoch",
         logging_steps=config.log_interval,
         learning_rate=config.learning_rate,
         weight_decay=config.weight_decay,
         warmup_steps=config.warmup_steps,
         use_cpu=use_cpu,
         report_to="none",
-        save_total_limit=2,
-        load_best_model_at_end=True,
         metric_for_best_model="f1",
         greater_is_better=True,
         dataloader_num_workers=config.num_workers,
@@ -250,12 +391,19 @@ def main():
     # Train
     print("Starting training...")
-    trainer.train()
     # Set proper label mappings in model config before saving
     model.config.id2label = config.id2label
     model.config.label2id = config.label2id
-    model.config.tokenizer_variant = args.tokenizer
     model.config.max_seq_length = config.max_seq_length
     # Save final model

 from seqeval.metrics import classification_report, accuracy_score, f1_score, precision_score, recall_score
 from config import Config
+from tokenizer import AnimeTokenizer, create_tokenizer, load_tokenizer
 from model import create_model, print_model_summary, count_parameters
 from dataset import AnimeDataset, align_tokens_for_tokenizer
 def parse_args() -> argparse.Namespace:
     parser = argparse.ArgumentParser(description="Train anime filename parser")
+    parser.add_argument("--tokenizer", choices=["regex", "char"], default=None,
+                        help="Tokenizer variant for A/B testing. Defaults to dataset metadata")
     parser.add_argument("--data-file", default=None, help="Training JSONL file")
     parser.add_argument("--vocab-file", default=None,
                         help="Tokenizer vocab JSON. Defaults to data/vocab.json or data/vocab.char.json")
                         help="Rebuild vocab from the selected data file before training")
     parser.add_argument("--max-vocab-size", type=int, default=None,
                         help="Optional vocab cap used with --rebuild-vocab")
+    parser.add_argument("--checkpoint-steps", type=int, default=None,
+                        help="Save resumable checkpoints every N steps instead of only at epoch end")
+    parser.add_argument("--save-total-limit", type=int, default=2,
+                        help="Maximum number of checkpoints to keep")
     parser.add_argument("--cpu", action="store_true", help="Force CPU training")
     parser.add_argument("--no-shuffle", action="store_true", help="Do not shuffle before train/eval split")
+    parser.add_argument("--resume-from-checkpoint", default=None,
+                        help="Resume Trainer state from a checkpoint directory, or 'auto' for the latest checkpoint")
     return parser.parse_args()
+def detect_tokenizer_variant(
+    data_file: str,
+    explicit_variant: Optional[str],
+    explicit_vocab_path: Optional[str],
+    sample_size: int = 256,
+) -> str:
+    """Infer the tokenizer variant from the CLI flag, dataset metadata, vocab filename, or a char-token heuristic."""
+    if explicit_variant:
+        return explicit_variant
+    variants = set()
+    char_like = 0
+    inspected = 0
+    with open(data_file, "r", encoding="utf-8") as f:
+        for line in f:
+            if inspected >= sample_size:
+                break
+            line = line.strip()
+            if not line:
+                continue
+            item = json.loads(line)
+            inspected += 1
+            variant = item.get("tokenizer_variant")
+            if variant:
+                variants.add(variant)
+            tokens = item.get("tokens", [])
+            filename = item.get("filename")
+            if filename is not None and tokens == list(filename):
+                char_like += 1
+    if len(variants) == 1:
+        return next(iter(variants))
+    if len(variants) > 1:
+        raise ValueError(f"Mixed tokenizer_variant values in {data_file}: {sorted(variants)}")
+    if explicit_vocab_path and ".char" in os.path.basename(explicit_vocab_path).lower():
+        return "char"
+    if inspected and char_like / inspected >= 0.95:
+        return "char"
+    return "regex"
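Review note: the final fallback in `detect_tokenizer_variant` counts a record as character-tokenized when its `tokens` list equals `list(filename)`, and picks `char` once at least 95% of the sampled records match. Below is a minimal standalone sketch of that check. `char_like_ratio` and the sample records are hypothetical, made up for illustration, and are not part of `train.py`.

```python
from typing import Dict, List


def char_like_ratio(records: List[Dict]) -> float:
    """Fraction of records whose tokens are exactly the characters of the filename."""
    if not records:
        return 0.0
    hits = 0
    for item in records:
        filename = item.get("filename")
        if filename is not None and item.get("tokens", []) == list(filename):
            hits += 1
    return hits / len(records)


# Made-up records: the first is character-level, the second is word-level.
char_record = {"filename": "Ep 01", "tokens": ["E", "p", " ", "0", "1"]}
word_record = {"filename": "Ep 01", "tokens": ["Ep", " ", "01"]}

ratio_char_only = char_like_ratio([char_record, char_record])
ratio_mixed = char_like_ratio([char_record, word_record])
print(ratio_char_only, ratio_mixed)
```

A dataset that mixes both styles falls well below the 0.95 threshold, so it would fall through to `regex` unless metadata or the vocab filename says otherwise.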
 def resolve_vocab_path(data_file: str, tokenizer_variant: str, explicit_path: Optional[str]) -> str:
     if explicit_path:
         return explicit_path
     return os.path.join(os.path.dirname(data_file), name)
+def latest_checkpoint(save_dir: str) -> Optional[str]:
+    """Return the checkpoint-<step> directory with the highest step, or None if there is none."""
+    if not os.path.isdir(save_dir):
+        return None
+    checkpoints = []
+    for name in os.listdir(save_dir):
+        if not name.startswith("checkpoint-"):
+            continue
+        path = os.path.join(save_dir, name)
+        if not os.path.isdir(path):
+            continue
+        try:
+            step = int(name.split("-")[-1])
+        except ValueError:
+            continue
+        checkpoints.append((step, path))
+    if not checkpoints:
+        return None
+    return max(checkpoints)[1]
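Review note: `latest_checkpoint` parses the step as an integer rather than sorting directory names, which matters because string ordering puts `checkpoint-9` after `checkpoint-100`. The sketch below reproduces the idea in isolation. `pick_latest_checkpoint` and the directory names are hypothetical and exist only for this illustration.

```python
import os
import tempfile
from typing import Optional


def pick_latest_checkpoint(save_dir: str) -> Optional[str]:
    """Return the checkpoint-<step> directory with the largest integer step."""
    best_step, best_path = -1, None
    for name in os.listdir(save_dir):
        path = os.path.join(save_dir, name)
        if not name.startswith("checkpoint-") or not os.path.isdir(path):
            continue
        try:
            step = int(name.split("-")[-1])
        except ValueError:
            continue  # e.g. "checkpoint-final" has no numeric step
        if step > best_step:
            best_step, best_path = step, path
    return best_path


with tempfile.TemporaryDirectory() as tmp:
    for name in ("checkpoint-9", "checkpoint-100", "checkpoint-final"):
        os.mkdir(os.path.join(tmp, name))
    # Plain string ordering would rank "checkpoint-9" above "checkpoint-100".
    lexicographic_pick = max(n for n in os.listdir(tmp) if n[-1].isdigit())
    numeric_pick = os.path.basename(pick_latest_checkpoint(tmp))
    print(lexicographic_pick, numeric_pick)
```

Names without a numeric suffix are skipped rather than treated as errors, which matches the `ValueError` handling in the diff.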
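+def validate_dataset_tokenizer_metadata(data: List[Dict], tokenizer_variant: str) -> None:
+    """Raise ValueError if any tokenizer_variant recorded in the data differs from the selected one."""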
+    variants = {item.get("tokenizer_variant") for item in data if item.get("tokenizer_variant")}
+    if variants and variants != {tokenizer_variant}:
+        raise ValueError(
+            f"Dataset tokenizer_variant {sorted(variants)} does not match selected tokenizer "
+            f"'{tokenizer_variant}'. Pass --tokenizer explicitly only when this is intentional."
+        )
+def remap_token_embeddings(
+    model: BertForTokenClassification,
+    old_vocab: Dict[str, int],
+    new_vocab: Dict[str, int],
+    pad_token_id: Optional[int],
+) -> int:
+    """
+    Replace the input embedding table for a changed vocabulary.
+    resize_token_embeddings() preserves rows by numeric ID, which is unsafe when
+    two tokenizers assign different tokens to the same ID. This remaps by token
+    string and randomly initializes tokens that do not exist in the old vocab.
+    """
+    old_embeddings = model.get_input_embeddings()
+    old_weight = old_embeddings.weight.data
+    embedding_dim = old_weight.shape[1]
+    new_embeddings = torch.nn.Embedding(
+        len(new_vocab),
+        embedding_dim,
+        padding_idx=pad_token_id,
+        device=old_weight.device,
+        dtype=old_weight.dtype,
+    )
+    torch.nn.init.normal_(
+        new_embeddings.weight,
+        mean=0.0,
+        std=getattr(model.config, "initializer_range", 0.02),
+    )
+    if pad_token_id is not None and 0 <= pad_token_id < len(new_vocab):
+        new_embeddings.weight.data[pad_token_id].zero_()
+    copied = 0
+    for token, new_id in new_vocab.items():
+        old_id = old_vocab.get(token)
+        if old_id is None or old_id >= old_weight.shape[0]:
+            continue
+        new_embeddings.weight.data[new_id].copy_(old_weight[old_id])
+        copied += 1
+    model.set_input_embeddings(new_embeddings)
+    model.config.vocab_size = len(new_vocab)
+    return copied
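Review note: the docstring's point is that carrying rows over by numeric ID is unsafe once two vocabularies assign different tokens to the same ID. The sketch below shows the string-keyed remap on plain Python lists so it runs without PyTorch. `remap_rows`, the toy vocabularies, and the constant `fill` row (standing in for random initialization) are all assumptions made for illustration, not the code in `train.py`.

```python
from typing import Dict, List, Tuple


def remap_rows(
    old_rows: List[List[float]],
    old_vocab: Dict[str, int],
    new_vocab: Dict[str, int],
    fill: List[float],
) -> Tuple[List[List[float]], int]:
    """Build a new row table keyed by token string instead of numeric ID."""
    # Start every row from the fill value (a stand-in for random init).
    new_rows = [list(fill) for _ in range(len(new_vocab))]
    copied = 0
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is None or old_id >= len(old_rows):
            continue  # token is new, keep the fill row
        new_rows[new_id] = list(old_rows[old_id])
        copied += 1
    return new_rows, copied


old_vocab = {"[PAD]": 0, "a": 1, "b": 2}
new_vocab = {"[PAD]": 0, "b": 1, "c": 2}
old_rows = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]

new_rows, copied = remap_rows(old_rows, old_vocab, new_vocab, fill=[9.0, 9.0])
print(new_rows, copied)
```

In this toy case ID 1 now means `b`, so it receives the old `b` row `[2.0, 2.0]`. Keeping rows by ID would have left it with the row trained for `a`.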
 def build_vocab_from_data(data: List[Dict], tokenizer: AnimeTokenizer, vocab_path: str,
                          max_size: Optional[int] = None) -> None:
     token_lists: List[List[str]] = []
     config = Config()
     if args.data_file is not None:
         config.data_file = args.data_file
+    tokenizer_variant = detect_tokenizer_variant(config.data_file, args.tokenizer, args.vocab_file)
     if args.save_dir is not None:
         config.save_dir = args.save_dir
+    elif tokenizer_variant == "char":
         config.save_dir = "./checkpoints_char"
     if args.epochs is not None:
         config.num_epochs = args.epochs
         config.train_split = args.train_split
     if args.max_seq_length is not None:
         config.max_seq_length = args.max_seq_length
+    elif tokenizer_variant == "char":
+        config.max_seq_length = max(config.max_seq_length, 128)
     random.seed(args.seed)
     np.random.seed(args.seed)
         all_data = all_data[:args.limit_samples]
     if not args.no_shuffle:
         random.shuffle(all_data)
+    validate_dataset_tokenizer_metadata(all_data, tokenizer_variant)
     # Load tokenizer
     print("Loading tokenizer...")
+    vocab_path = resolve_vocab_path(config.data_file, tokenizer_variant, args.vocab_file)
+    tokenizer = create_tokenizer(tokenizer_variant)
     if args.rebuild_vocab or not os.path.isfile(vocab_path):
         max_vocab_size = args.max_vocab_size if args.max_vocab_size is not None else config.vocab_size
+        print(f"  Building {tokenizer_variant} vocab: {vocab_path} (max_size={max_vocab_size})")
         build_vocab_from_data(all_data, tokenizer, vocab_path, max_size=max_vocab_size)
+    tokenizer = create_tokenizer(tokenizer_variant, vocab_file=vocab_path)
+    print(f"  Variant: {tokenizer_variant}")
     print(f"  Vocab size: {tokenizer.vocab_size}")
+    print(f"  Max sequence length: {config.max_seq_length}")
     # Update config with actual vocab size
     config.vocab_size = tokenizer.vocab_size
     if args.init_model_dir:
         print(f"Loading model for fine-tuning: {args.init_model_dir}")
         model = BertForTokenClassification.from_pretrained(args.init_model_dir)
+        init_tokenizer = load_tokenizer(args.init_model_dir)
+        init_variant = getattr(init_tokenizer, "tokenizer_variant", None)
+        if init_variant != tokenizer_variant:
+            print(f"  WARNING: tokenizer variant changes during fine-tune: {init_variant} -> {tokenizer_variant}")
+            print("  Token embeddings will be remapped by token string; unmatched tokens are newly initialized.")
+        if model.config.vocab_size != config.vocab_size or init_tokenizer.get_vocab() != tokenizer.get_vocab():
+            copied = remap_token_embeddings(
+                model=model,
+                old_vocab=init_tokenizer.get_vocab(),
+                new_vocab=tokenizer.get_vocab(),
+                pad_token_id=tokenizer.pad_token_id,
+            )
+            print(
+                f"  Remapped token embeddings: copied {copied:,}/{config.vocab_size:,} "
+                f"tokens from init checkpoint"
+            )
         model.config.num_labels = config.num_labels
         model.config.id2label = config.id2label
         model.config.label2id = config.label2id
     use_cpu = args.cpu or not torch.cuda.is_available()
     use_fp16 = not use_cpu
     print(f"  Device: {'CPU' if use_cpu else 'CUDA'}")
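+    # Trainer requires matching save and eval strategies for best-model reloading,
+    # so it is only enabled when checkpoints are saved per epoch.
+    save_strategy = "steps" if args.checkpoint_steps else "epoch"
+    load_best_model_at_end = save_strategy == "epoch"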
     # Training arguments
     training_args = TrainingArguments(
         per_device_train_batch_size=config.batch_size,
         per_device_eval_batch_size=config.batch_size,
         eval_strategy="epoch",
+        save_strategy=save_strategy,
+        save_steps=args.checkpoint_steps,
         logging_steps=config.log_interval,
         learning_rate=config.learning_rate,
         weight_decay=config.weight_decay,
         warmup_steps=config.warmup_steps,
         use_cpu=use_cpu,
         report_to="none",
+        save_total_limit=args.save_total_limit,
+        load_best_model_at_end=load_best_model_at_end,
         metric_for_best_model="f1",
         greater_is_better=True,
         dataloader_num_workers=config.num_workers,
     # Train
     print("Starting training...")
+    resume_from_checkpoint = args.resume_from_checkpoint
+    if resume_from_checkpoint == "auto":
+        resume_from_checkpoint = latest_checkpoint(config.save_dir)
+        if resume_from_checkpoint:
+            print(f"Resuming from latest checkpoint: {resume_from_checkpoint}")
+        else:
+            print("No checkpoint found; starting a fresh training run.")
+    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
     # Set proper label mappings in model config before saving
     model.config.id2label = config.id2label
     model.config.label2id = config.label2id
+    model.config.tokenizer_variant = tokenizer_variant
     model.config.max_seq_length = config.max_seq_length
     # Save final model