# AniFileBERT Maintenance

This repository is the standalone Hugging Face model repo used by MiruPlay as `tools/anime_parser`.

## Related Repositories

| Repository | URL | Purpose |
|------------|-----|---------|
| AniFileBERT | `https://huggingface.co/ModerRAS/AniFileBERT` | Model, training scripts, ONNX export |
| AnimeName | `https://huggingface.co/datasets/ModerRAS/AnimeName` | Training datasets and manifests |
| MiruPlay | `https://github.com/ModerRAS/MiruPlay` | Android app and runtime integration |

Nested structure:

```text
AniFileBERT
  datasets/AnimeName -> ModerRAS/AnimeName
```

## Clone

```bash
git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
```

After a normal clone:

```bash
git submodule update --init --recursive
```

## Dataset Waterline

Current DMHY snapshot:

```text
labeled_samples: 632002
char_vocab_size: 6199
strict_bio_violations: 0
```

The authoritative dataset files live in `datasets/AnimeName`.

## Train

```bash
uv sync
uv run python train.py \
  --tokenizer char \
  --data-file datasets/AnimeName/dmhy_weak_char.jsonl \
  --vocab-file datasets/AnimeName/vocab.char.json \
  --save-dir checkpoints/dmhy-char-full-relabel \
  --init-model-dir . \
  --epochs 2 \
  --batch-size 256 \
  --learning-rate 0.00008 \
  --warmup-steps 300 \
  --max-seq-length 128 \
  --checkpoint-steps 1000 \
  --parse-eval-limit 2048 \
  --seed 48
```

## Publish a New Checkpoint

Copy the final checkpoint to the repository root:

```powershell
Copy-Item checkpoints/dmhy-char-full-relabel/final/config.json . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/model.safetensors . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/tokenizer_config.json . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/training_args.bin . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/vocab.json . -Force
Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/run_metadata.json . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/trainer_eval_metrics.json . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/parse_eval_metrics.json . -Force
```

There is no tracked `model/` duplicate. The root checkpoint is the publishing surface; ignored `checkpoints/` directories are training artifacts.

Then commit and push:

```bash
git add .
git commit -m "Update AniFileBERT checkpoint"
git push origin main
```

## Update the Dataset Submodule

After pushing new files to `ModerRAS/AnimeName`, update the nested pointer:

```bash
git submodule update --remote datasets/AnimeName
git add datasets/AnimeName
git commit -m "Update AnimeName dataset pointer"
git push origin main
```

## Update MiruPlay

From the MiruPlay root:

```bash
git submodule update --remote --recursive tools/anime_parser
git add tools/anime_parser
git commit -m "Update AniFileBERT submodule"
git push origin master
```

If a new ONNX export changed Android runtime assets, also stage:

```text
scraper/src/main/assets/anime_parser/anime_filename_parser.onnx
scraper/src/main/assets/anime_parser/config.json
scraper/src/main/assets/anime_parser/vocab.json
```
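
Staging the Android runtime assets listed above can be sketched as a small bash helper, run from the MiruPlay root. This is a minimal sketch, not part of either repository: the function name `stage_anime_parser_assets` and the choice to skip missing files with a warning (rather than fail) are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: stage the three Android runtime assets named in this document.
# Assumptions: run from the MiruPlay root; the function name is hypothetical;
# missing files are skipped with a warning instead of aborting.

stage_anime_parser_assets() {
  local asset_dir="scraper/src/main/assets/anime_parser"
  local name path
  for name in anime_filename_parser.onnx config.json vocab.json; do
    path="$asset_dir/$name"
    if [ -f "$path" ]; then
      # Stage the asset; "--" keeps the path from being parsed as an option.
      git add -- "$path"
    else
      echo "skipped (not found): $path" >&2
    fi
  done
}
```

Calling `stage_anime_parser_assets` before the `git commit` step stages whichever of the three files exist, so the asset changes can land in the same commit as the submodule pointer.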