
AniFileBERT Maintenance

This repository is the standalone Hugging Face model repo that MiruPlay consumes as the tools/anime_parser submodule.

Related Repositories

| Repository | URL | Purpose |
| --- | --- | --- |
| AniFileBERT | https://huggingface.co/ModerRAS/AniFileBERT | Model, training scripts, ONNX export |
| AnimeName | https://huggingface.co/datasets/ModerRAS/AnimeName | Training datasets and manifests |
| MiruPlay | https://github.com/ModerRAS/MiruPlay | Android app and runtime integration |

Nested structure:

AniFileBERT
  datasets/AnimeName -> ModerRAS/AnimeName

Clone

git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT

If the repository was cloned without --recursive, initialize the submodules afterwards:

git submodule update --init --recursive
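
To confirm that the nested dataset was actually checked out, the output of `git submodule status --recursive` can be inspected: git prefixes a line with `-` when that submodule is not initialized. The following is a minimal Python sketch of that check. The helper names `uninitialized_submodules` and `check_repo` are hypothetical and not part of this repository, and the parser assumes submodule paths contain no spaces.

```python
import subprocess


def uninitialized_submodules(status_output: str) -> list[str]:
    """Return the paths that `git submodule status` marks with a leading '-'."""
    missing = []
    for line in status_output.splitlines():
        if line.startswith("-"):
            # Line layout: "-<sha1> <path>"; drop the marker, then split on whitespace.
            parts = line[1:].split()
            if len(parts) >= 2:
                missing.append(parts[1])
    return missing


def check_repo(repo_dir: str = ".") -> list[str]:
    """Run git in repo_dir and list the submodules that still need initializing."""
    result = subprocess.run(
        ["git", "submodule", "status", "--recursive"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return uninitialized_submodules(result.stdout)
```

An empty list from `check_repo()` suggests that datasets/AnimeName is initialized; a non-empty list names the paths that still need `git submodule update --init --recursive`.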

Dataset Waterline

Current DMHY snapshot:

labeled_samples: 632002
char_vocab_size: 6199
strict_bio_violations: 0

The authoritative dataset files live in datasets/AnimeName.
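
The `strict_bio_violations` figure presumably counts labels that break the BIO tagging scheme. How this repository computes it is not shown here, so the following is only a sketch of one common strict definition: an `I-X` tag is valid only directly after `B-X` or `I-X` of the same entity type. The function name and the `TITLE` / `EPISODE` label names are illustrative assumptions, not the dataset's actual label set.

```python
def count_strict_bio_violations(tags: list[str]) -> int:
    """Count I- tags that do not continue an entity of the same type."""
    violations = 0
    prev = "O"  # treat the start of the sequence as "outside"
    for tag in tags:
        if tag.startswith("I-"):
            entity = tag[2:]
            # Strict rule: I-X may only follow B-X or I-X.
            if prev not in (f"B-{entity}", f"I-{entity}"):
                violations += 1
        prev = tag
    return violations


# Hypothetical label names, for illustration only.
print(count_strict_bio_violations(["B-TITLE", "I-TITLE", "O", "B-EPISODE"]))  # 0
print(count_strict_bio_violations(["O", "I-TITLE"]))  # 1
```

Summing this count over every labeled sample and getting 0 would match the waterline above, under the stated definition.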

Train

uv sync
uv run python train.py \
  --tokenizer char \
  --data-file datasets/AnimeName/dmhy_weak_char.jsonl \
  --vocab-file datasets/AnimeName/vocab.char.json \
  --save-dir checkpoints/dmhy-char-full-relabel \
  --init-model-dir . \
  --epochs 2 \
  --batch-size 256 \
  --learning-rate 0.00008 \
  --warmup-steps 300 \
  --max-seq-length 128 \
  --checkpoint-steps 1000 \
  --parse-eval-limit 2048 \
  --seed 48
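
For orientation, the flags above imply a rough step budget. The sketch below assumes that all 632002 labeled samples are used for training, that each batch of 256 is one optimizer step (single device, no gradient accumulation), and that a partial final batch is kept. train.py may hold out evaluation data or behave differently, so treat the numbers as an estimate.

```python
import math

labeled_samples = 632002   # from the dataset waterline
batch_size = 256           # --batch-size
epochs = 2                 # --epochs
warmup_steps = 300         # --warmup-steps
checkpoint_steps = 1000    # --checkpoint-steps

steps_per_epoch = math.ceil(labeled_samples / batch_size)
total_steps = steps_per_epoch * epochs
# Number of times the step counter crosses a multiple of checkpoint_steps.
intermediate_checkpoints = total_steps // checkpoint_steps
warmup_fraction = warmup_steps / total_steps

print(steps_per_epoch, total_steps, intermediate_checkpoints)  # 2469 4938 4
print(f"{warmup_fraction:.1%}")  # 6.1%
```

Under these assumptions the run takes about 4938 steps, spends roughly the first 6% of them in warmup, and writes about four intermediate checkpoints before the final one.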

Publish a New Checkpoint

Copy the final checkpoint to the repository root (the commands below are PowerShell):

Copy-Item checkpoints/dmhy-char-full-relabel/final/config.json . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/model.safetensors . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/tokenizer_config.json . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/training_args.bin . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/vocab.json . -Force
Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/run_metadata.json . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/trainer_eval_metrics.json . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/parse_eval_metrics.json . -Force

The repository does not track a separate model/ copy. The checkpoint files in the repository root are what gets published; the checkpoints/ directories are git-ignored training artifacts.
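
The copy step uses PowerShell's `Copy-Item`, so on other shells a small Python script can do the same thing. This is a hedged sketch, not a script shipped with the repository: `publish_checkpoint` and `CHECKPOINT_FILES` are hypothetical names, and the file list is taken from the commands in this section.

```python
import shutil
from pathlib import Path

# Files copied from the final checkpoint directory, per the list in this section.
CHECKPOINT_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "training_args.bin",
    "vocab.json",
    "run_metadata.json",
    "trainer_eval_metrics.json",
    "parse_eval_metrics.json",
]


def publish_checkpoint(repo_root: Path, final_dir: Path, vocab_file: Path) -> list[Path]:
    """Copy the checkpoint files and the char vocab into repo_root."""
    sources = [final_dir / name for name in CHECKPOINT_FILES] + [vocab_file]
    # Fail before copying anything, so the root never holds a half-updated checkpoint.
    missing = [src for src in sources if not src.is_file()]
    if missing:
        raise FileNotFoundError(f"missing checkpoint files: {missing}")
    copied = []
    for src in sources:
        dst = repo_root / src.name
        shutil.copy2(src, dst)  # overwrites an existing file, similar to Copy-Item -Force
        copied.append(dst)
    return copied
```

From the repository root it could be called as `publish_checkpoint(Path("."), Path("checkpoints/dmhy-char-full-relabel/final"), Path("datasets/AnimeName/vocab.char.json"))`.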

Then commit and push:

git add .
git commit -m "Update AniFileBERT checkpoint"
git push origin main

Update the Dataset Submodule

After pushing new files to ModerRAS/AnimeName, update the nested pointer:

git submodule update --remote datasets/AnimeName
git add datasets/AnimeName
git commit -m "Update AnimeName dataset pointer"
git push origin main

Update MiruPlay

From the MiruPlay root:

git submodule update --remote --recursive tools/anime_parser
git add tools/anime_parser
git commit -m "Update AniFileBERT submodule"
git push origin master

If a new ONNX export changed Android runtime assets, also stage:

scraper/src/main/assets/anime_parser/anime_filename_parser.onnx
scraper/src/main/assets/anime_parser/config.json
scraper/src/main/assets/anime_parser/vocab.json