Codex + Colab Training

Free Colab runtimes disconnect when idle and have a limited lifetime, so they cannot serve as an always-on remote machine. The practical setup is:

Open a Colab GPU runtime when you want to train.
Start the lightweight worker in one cell.
Give Codex the printed worker URL and token.
Codex submits jobs while that Colab session is alive.
Checkpoints and manifests stay on Google Drive, so the next session can resume.

Start a Colab Session

Run this in a Colab code cell:

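from google.colab import drive

# Mount Google Drive so checkpoints and manifests outlive this session.
drive.mount("/content/drive")

# Clone on first use; `|| true` lets the cell continue if the checkout already exists.
!git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT /content/AniFileBERT || true
%cd /content/AniFileBERT
# Bring an existing checkout and its submodules up to date.
!git pull --ff-only || true
!git submodule update --init --recursive
# Start the worker; it prints the URL and token, and the cell keeps running.
!python -m tools.colab_worker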

The cell prints:

COLAB_WORKER_URL=https://...trycloudflare.com
COLAB_WORKER_TOKEN=...
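If you copy the cell output into a script rather than retyping the values, the two lines can be parsed mechanically. The following is a minimal sketch, not part of the repository: parse_worker_output and client_env are hypothetical helper names, and the mapping onto ANIFILEBERT_COLAB_URL and ANIFILEBERT_COLAB_TOKEN assumes the client reads exactly the two environment variables used in the local commands in this document.

```python
def parse_worker_output(text: str) -> dict:
    """Collect the two KEY=VALUE lines the worker cell prints.

    Hypothetical helper: it only assumes the output format shown above.
    """
    wanted = {"COLAB_WORKER_URL", "COLAB_WORKER_TOKEN"}
    found = {}
    for line in text.splitlines():
        # Split on the first "=" only, so a token containing "=" stays intact.
        key, sep, value = line.strip().partition("=")
        if sep and key in wanted:
            found[key] = value.strip()
    return found


def client_env(worker: dict) -> dict:
    """Map the printed worker values onto the client's environment variables."""
    return {
        "ANIFILEBERT_COLAB_URL": worker["COLAB_WORKER_URL"],
        "ANIFILEBERT_COLAB_TOKEN": worker["COLAB_WORKER_TOKEN"],
    }
```

The result of client_env can be merged into os.environ before calling the client. A missing line raises KeyError, which is usually preferable to submitting with an empty token.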

Keep that cell running. If Colab disconnects, run the cell again. The default profiles set checkpoint_steps: 1000 and resume_from_checkpoint: "auto", so they save a checkpoint every 1000 steps and resume from the latest checkpoint on Drive.
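To make the resume behaviour concrete, here is one way "resume from the latest checkpoint" could be resolved. This is an illustrative sketch, not the repository's code: latest_checkpoint is a hypothetical function, and the checkpoint-<step> directory naming is an assumption borrowed from the common Hugging Face Trainer convention. The actual runner may name or select checkpoints differently.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming convention: one subdirectory per save, e.g. "checkpoint-1000".
CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(checkpoint_dir: Path) -> Optional[Path]:
    """Return the checkpoint subdirectory with the highest step, or None."""
    if not checkpoint_dir.is_dir():
        return None
    best_step, best_path = -1, None
    for child in checkpoint_dir.iterdir():
        match = CHECKPOINT_RE.match(child.name)
        # Compare step numbers as integers so 10000 sorts after 2000.
        if match and child.is_dir() and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path
```

Returning None when nothing is found lets a caller fall back to a fresh start, which is what a first session on an empty Drive folder needs.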

Let Codex Submit a Job

On the local machine (PowerShell syntax shown):

$env:ANIFILEBERT_COLAB_URL="https://...trycloudflare.com"
$env:ANIFILEBERT_COLAB_TOKEN="..."
python -m tools.colab_client health
python -m tools.colab_client submit --profile dmhy_regex_finetune --wait

Codex can run the same commands from this repository after you provide the URL and token.
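The commands above set the variables with PowerShell's $env: syntax. On other shells, or from a script, the same two client calls can be wrapped in Python. This is a sketch under assumptions: client_command, run_client, and submit_default_profile are hypothetical names, the script is run from the repository root so that tools.colab_client is importable, and the health command is assumed to exit with a nonzero status when the worker is unreachable.

```python
import os
import subprocess
import sys


def client_command(*args: str) -> list:
    """Build the argument list for `python -m tools.colab_client ...`."""
    return [sys.executable, "-m", "tools.colab_client", *args]


def run_client(url: str, token: str, *args: str) -> int:
    """Run one client command with the worker URL and token in its environment."""
    env = dict(os.environ, ANIFILEBERT_COLAB_URL=url, ANIFILEBERT_COLAB_TOKEN=token)
    return subprocess.run(client_command(*args), env=env).returncode


def submit_default_profile(url: str, token: str) -> int:
    """Check worker health first, then submit the default fine-tune profile."""
    # Assumption: a nonzero exit code from `health` means the worker is not usable.
    if run_client(url, token, "health") != 0:
        return 1
    return run_client(url, token, "submit", "--profile", "dmhy_regex_finetune", "--wait")
```

Passing the values through env rather than on the command line keeps the token out of the process argument list.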

Profiles

colab/configs/dmhy_regex_finetune.json: default regex tokenizer fine-tune from the published root checkpoint.
colab/configs/dmhy_char_train.json: character tokenizer training from scratch.

To submit a locally edited profile instead of a named remote profile, pass its path with --config:

python -m tools.colab_client submit --config colab/configs/dmhy_regex_finetune.json --wait
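If you would rather not edit a shipped profile in place, you can write a modified copy and submit that. The sketch below is illustrative only: write_profile_variant is a hypothetical helper, and it assumes each profile is a JSON object whose settings, such as checkpoint_steps, are top-level keys. Check the actual profile layout before relying on it.

```python
import json
from pathlib import Path


def write_profile_variant(source: Path, target: Path, overrides: dict) -> dict:
    """Copy a profile JSON file to `target`, replacing top-level keys."""
    profile = json.loads(source.read_text(encoding="utf-8"))
    # Assumption: settings are top-level keys, so a shallow update is enough.
    profile.update(overrides)
    target.write_text(json.dumps(profile, indent=2) + "\n", encoding="utf-8")
    return profile
```

For example, calling it with colab/configs/dmhy_regex_finetune.json as the source, a hypothetical colab/configs/my_finetune.json as the target, and {"checkpoint_steps": 500} as the overrides produces a file you can pass to submit --config.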

The worker writes per-job logs under:

MyDrive/AniFileBERT/worker/jobs/<job-id>/

The training runner writes checkpoints and the manifest of the last run to:

MyDrive/AniFileBERT/checkpoints/<profile-name>/
MyDrive/AniFileBERT/last_run_manifest.json
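Inside Colab, with Drive mounted at /content/drive as in the session cell, MyDrive resolves to /content/drive/MyDrive. The sketch below builds the three locations listed above and loads the manifest if it exists. The function names are hypothetical, and nothing is assumed about the manifest's contents beyond it being valid JSON.

```python
import json
from pathlib import Path
from typing import Any, Optional

# Drive mount point from the session cell plus the project folder.
DRIVE_ROOT = Path("/content/drive/MyDrive/AniFileBERT")


def job_log_dir(job_id: str, root: Path = DRIVE_ROOT) -> Path:
    """Directory holding the worker's logs for one job."""
    return root / "worker" / "jobs" / job_id


def checkpoint_dir(profile_name: str, root: Path = DRIVE_ROOT) -> Path:
    """Directory holding the checkpoints for one profile."""
    return root / "checkpoints" / profile_name


def read_manifest(root: Path = DRIVE_ROOT) -> Optional[Any]:
    """Load last_run_manifest.json, or return None if no run has written it yet."""
    path = root / "last_run_manifest.json"
    if not path.is_file():
        return None
    return json.loads(path.read_text(encoding="utf-8"))
```

On a local machine with Drive synced to disk, pass that folder as root instead of the Colab default.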