	# Repository Guidelines

	This repository is `AniFileBERT`, the Python model, dataset, training, inference,
	and ONNX export workspace used by MiruPlay as `tools/anime_parser`.

	## Project Shape

	- Root model artifacts (`config.json`, `model.safetensors`, `vocab.json`,
	`tokenizer_config.json`, `training_args.bin`) are the published default
	checkpoint.
	- Core code lives in `train.py`, `dataset.py`, `tokenizer.py`, `model.py`,
	`inference.py`, and `export_onnx.py`.
	- Dataset generation and labeling helpers live in `data_generator.py`,
	`dmhy_dataset.py`, `mix_datasets.py`, `llm_labeler.py`,
	`semantic_labeler.py`, and `convert_to_char_dataset.py`.
	- `datasets/AnimeName` is a nested dataset submodule and should be treated as
	the authoritative dataset snapshot when present. Use either
	`dmhy_weak.jsonl` for the regex tokenizer or `dmhy_weak_char.jsonl` for the
	character tokenizer; the other dataset files are legacy snapshots.
	- `exports/` contains Android-facing ONNX artifacts. Keep it in sync when
	changing export behavior or the published checkpoint.

	## Setup

	```bash
	python -m pip install -r requirements.txt
	```

	For local GPU training, install a CUDA-compatible PyTorch build first, then
	install the remaining requirements.

	If the dataset submodule is missing, initialize it:

	```bash
	git submodule update --init --recursive
	```

	## Common Commands

	Run a parser smoke check:

	```bash
	python inference.py --model-dir . "Witch.Hat.Atelier.S01E07.1080p.NF.WEB-DL.JPN.AAC2.0.H.264.MSubs-ToonsHub"
	```

	Run the lightweight training pipeline check:

	```bash
	python test_train_small.py --limit-samples 5000 --epochs 2
	```

	Train the default regex tokenizer from the dataset submodule:

	```bash
	python train.py --data-file datasets/AnimeName/dmhy_weak.jsonl --vocab-file datasets/AnimeName/vocab.json --save-dir checkpoints/dmhy-finetune --init-model-dir . --epochs 1 --batch-size 128 --learning-rate 0.0003 --warmup-steps 300 --seed 42
	```

	Train the character tokenizer only when that variant is intentional:

	```bash
	python train.py --tokenizer char --data-file datasets/AnimeName/dmhy_weak_char.jsonl --vocab-file datasets/AnimeName/vocab.char.json --save-dir checkpoints/dmhy-weak-char --epochs 1 --batch-size 64 --learning-rate 0.0003 --warmup-steps 300 --max-seq-length 128 --seed 42
	```

	Export for Android:

	```bash
	python export_onnx.py --model-dir checkpoints/dmhy-finetune/final --android-assets-dir ../../scraper/src/main/assets/anime_parser
	```

	## Codex-Controlled Colab Training

	Free Colab cannot be treated as an always-on remote machine. Use it as a
	short-lived GPU worker only after the user manually opens a Colab runtime and
	starts the worker cell. Do not assume Codex can wake Colab by itself.

	Before relying on the Colab flow, make sure the Colab helper files have been
	pushed to the Hugging Face model repo, or the user has uploaded them manually:
	`colab_worker.py`, `colab_client.py`, `colab_train.py`, and `colab/`.

	Ask the user to start a Colab GPU runtime with:

	```python
	from google.colab import drive
	drive.mount("/content/drive")

	!git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT /content/AniFileBERT || true
	%cd /content/AniFileBERT
	!git pull --ff-only || true
	!git submodule update --init --recursive
	!python colab_worker.py
	```

	The worker prints `COLAB_WORKER_URL=...` and `COLAB_WORKER_TOKEN=...`. After
	the user provides those values, set them for local commands:

	```powershell
	$env:ANIFILEBERT_COLAB_URL="https://...trycloudflare.com"
	$env:ANIFILEBERT_COLAB_TOKEN="..."
	python colab_client.py health
	```
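
Before calling `colab_client.py`, it can help to confirm that both variables are actually set in the current shell. The following is a minimal sketch, not code from this repository: `load_colab_settings` is a hypothetical helper, and only the two environment variable names come from the instructions above. How `colab_client.py` itself reads them may differ.

```python
import os


def load_colab_settings(env=None):
    """Return (url, token) from the environment, or raise if either is missing.

    Hypothetical helper for illustration only; colab_client.py may read
    these variables differently.
    """
    env = os.environ if env is None else env
    # Normalize whitespace and a trailing slash so later URL joins are predictable.
    url = (env.get("ANIFILEBERT_COLAB_URL") or "").strip().rstrip("/")
    token = (env.get("ANIFILEBERT_COLAB_TOKEN") or "").strip()
    missing = [
        name
        for name, value in (
            ("ANIFILEBERT_COLAB_URL", url),
            ("ANIFILEBERT_COLAB_TOKEN", token),
        )
        if not value
    ]
    if missing:
        raise RuntimeError("Missing Colab settings: " + ", ".join(missing))
    return url, token
```

Failing early with the name of the missing variable is usually easier to diagnose than a connection error from a stale or empty tunnel URL.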

	Submit the default regex fine-tune:

	```powershell
	python colab_client.py submit --profile dmhy_regex_finetune --wait
	```

	Submit the character tokenizer run only when intentional:

	```powershell
	python colab_client.py submit --profile dmhy_char_train --wait
	```

	Useful follow-up commands:

	```powershell
	python colab_client.py jobs
	python colab_client.py status <job-id>
	python colab_client.py logs <job-id> --tail 200
	python colab_client.py manifest <job-id>
	python colab_client.py cancel <job-id>
	```

	The default Colab profiles save checkpoints to Google Drive every 1000 steps
	and resume with `resume_from_checkpoint: "auto"`. If free Colab disconnects,
	ask the user to restart the worker and submit the same profile again so the
	run picks up from the last saved checkpoint. Artifacts land under
	`MyDrive/AniFileBERT/checkpoints/<profile-name>/`, and worker logs land under
	`MyDrive/AniFileBERT/worker/jobs/<job-id>/`.
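
How `"auto"` is resolved is not spelled out here. One plausible reading is "use the highest-numbered checkpoint directory, or start fresh if none exists". The sketch below illustrates that idea under the assumption that checkpoints are saved as `checkpoint-<step>` directories (the Hugging Face Trainer convention). `find_latest_checkpoint` is a hypothetical name, not a function from `train.py` or `colab_train.py`.

```python
import re
from pathlib import Path

# Assumed naming scheme: checkpoint-1000, checkpoint-2000, ...
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(save_dir):
    """Return the checkpoint directory with the highest step, or None."""
    root = Path(save_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path
```

Comparing the step as an integer matters: a plain string sort would rank `checkpoint-9000` above `checkpoint-10000`.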

	## Validation Expectations

	- For parser or tokenizer changes, run `python inference.py --model-dir . ...`
	with at least one realistic filename.
	- For dataset alignment, tokenizer, model, or training-loop changes, run
	`python test_train_small.py --limit-samples 5000 --epochs 2` when practical.
	- For export changes, run `python export_onnx.py ...` and confirm the exporter
	reports a small PyTorch/ONNX logits difference.
	- Full training is expensive; do not start long multi-epoch runs unless the
	task explicitly requires it.
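
The "small PyTorch/ONNX logits difference" check amounts to comparing the two logits tensors element by element. The sketch below shows that comparison on plain nested lists so it runs without PyTorch or ONNX Runtime. `max_abs_diff` and the `1e-4` tolerance are illustrative assumptions, not values taken from `export_onnx.py`.

```python
def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two nested lists."""
    a_seq = isinstance(a, (list, tuple))
    b_seq = isinstance(b, (list, tuple))
    if a_seq != b_seq:
        raise ValueError("shape mismatch")
    if a_seq:
        if len(a) != len(b):
            raise ValueError("shape mismatch")
        return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)
    return abs(float(a) - float(b))


# Toy logits shaped [tokens][labels]; real values would come from the
# PyTorch model and the exported ONNX session.
torch_logits = [[2.5, -1.0, 0.25], [0.5, 3.0, -2.0]]
onnx_logits = [[2.5, -1.0, 0.25], [0.5, 3.00001, -2.0]]

diff = max_abs_diff(torch_logits, onnx_logits)
assert diff < 1e-4, f"logits diverged: {diff}"
```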

	## Data And Artifact Rules

	- Avoid committing generated checkpoint directories such as `checkpoints/`,
	`test_checkpoints/`, and `ab_checkpoints/`.
	- Most `data/*.jsonl` files are generated and ignored. The small checked-in
	fixtures are `data/synthetic_small.jsonl` and `data/test_smoke.jsonl`.
	- For real training, choose exactly one current dataset:
	`datasets/AnimeName/dmhy_weak.jsonl` for regex tokenization or
	`datasets/AnimeName/dmhy_weak_char.jsonl` for character tokenization.
	Treat `mixed_train.jsonl`, `ab_mix_100k.jsonl`, and other alternate JSONL
	files as legacy unless a task explicitly asks to inspect them.
	- Large binary artifacts are tracked through Git LFS by `.gitattributes`.
	Preserve LFS handling for `.safetensors`, `.onnx`, `.bin`, and related model
	files.
	- When publishing a new checkpoint, copy the final checkpoint files to the
	repository root as described in `MAINTENANCE.md`.
	- When updating `datasets/AnimeName`, commit the submodule pointer in this repo
	and then update the parent MiruPlay submodule pointer.
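
The "exactly one current dataset" rule above can be captured as a small lookup. This is only a sketch: `current_dataset` is a hypothetical helper, the `"regex"` key is just a label for the default tokenizer, and the file names are the ones used in the `train.py` commands in this file.

```python
from pathlib import PurePosixPath

DATASET_ROOT = PurePosixPath("datasets/AnimeName")

# tokenizer variant -> (data file, vocab file), as used in the train.py commands
_CURRENT = {
    "regex": ("dmhy_weak.jsonl", "vocab.json"),
    "char": ("dmhy_weak_char.jsonl", "vocab.char.json"),
}


def current_dataset(tokenizer):
    """Return (data_file, vocab_file) paths for the given tokenizer variant."""
    try:
        data_name, vocab_name = _CURRENT[tokenizer]
    except KeyError:
        raise ValueError(f"unknown tokenizer variant: {tokenizer!r}") from None
    return str(DATASET_ROOT / data_name), str(DATASET_ROOT / vocab_name)
```

Keeping the data file and vocabulary paired in one place makes it harder to train one tokenizer variant against the other variant's vocabulary by accident.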

	## Coding Notes

	- Keep the custom tokenizer contract stable: Android runtime tokenization must
	continue to match the exported vocabulary and model metadata.
	- Preserve label names and BIO behavior unless a task explicitly changes the
	model schema; Android expects the current fields for title, season, episode,
	group, resolution, source, and special tags.
	- Prefer deterministic dataset and training changes. Keep seed handling intact.
	- Use UTF-8 for files that contain Japanese, Chinese, or release-name examples.
	- Keep command examples Windows-friendly where paths reference MiruPlay.
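
For readers unfamiliar with the BIO behavior mentioned above, the sketch below shows how per-token tags are typically grouped into fields. The tag strings (`B-TITLE`, `I-TITLE`, `B-SEASON`, `B-EPISODE`, `O`) and the `decode_bio` helper are assumptions for illustration. The actual label names come from the model configuration, and `inference.py` may group tokens differently.

```python
def decode_bio(tokens, tags):
    """Group tokens into (label, text) spans following BIO tags."""
    spans, label, parts = [], None, []
    for token, tag in zip(tokens, tags):
        prefix, _, name = tag.partition("-")
        if prefix == "I" and name == label:
            # Continuation of the span that is currently open.
            parts.append(token)
            continue
        if label is not None:
            spans.append((label, "".join(parts)))
        if prefix in ("B", "I") and name:
            # A stray I- tag without a matching open span starts a new one.
            label, parts = name, [token]
        else:
            label, parts = None, []
    if label is not None:
        spans.append((label, "".join(parts)))
    return spans


tokens = ["Witch", ".", "Hat", ".", "S01", "E07"]
tags = ["B-TITLE", "I-TITLE", "I-TITLE", "O", "B-SEASON", "B-EPISODE"]
print(decode_bio(tokens, tags))
# [('TITLE', 'Witch.Hat'), ('SEASON', 'S01'), ('EPISODE', 'E07')]
```

Renaming a label or changing where `B-` and `I-` boundaries fall would change these spans, which is why such changes count as a schema change for the Android consumer.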