# AniFileBERT Maintenance
This repository is the standalone Hugging Face model repo for AniFileBERT.
MiruPlay consumes it as the `tools/anime_parser` submodule.
## Related Repositories
| Repository | URL | Purpose |
|------------|-----|---------|
| AniFileBERT | `https://huggingface.co/ModerRAS/AniFileBERT` | Model, training scripts, ONNX export |
| AnimeName | `https://huggingface.co/datasets/ModerRAS/AnimeName` | Training datasets and manifests |
| MiruPlay | `https://github.com/ModerRAS/MiruPlay` | Android app and runtime integration |
Nested structure:
```text
AniFileBERT
  datasets/AnimeName -> ModerRAS/AnimeName
```
## Clone
```bash
git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
```
If you cloned without `--recursive`, initialize the dataset submodule afterwards:
```bash
git submodule update --init --recursive
```
## Dataset Waterline
Current DMHY snapshot:
```text
labeled_samples: 632002
char_vocab_size: 6199
strict_bio_violations: 0
```
The authoritative dataset files live in `datasets/AnimeName`.
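The snapshot figures can be recomputed from a labeled JSONL file. The following is a minimal sketch, not the repository's own tooling: it assumes each line is a JSON object with a list of BIO tags under a `labels` key (a hypothetical field name), and it takes "strict" to mean that every `I-X` tag must directly follow a `B-X` or `I-X` tag of the same type.

```python
import json
from pathlib import Path


def count_bio_violations(labels):
    """Count I-X tags that do not continue a preceding B-X or I-X tag."""
    violations = 0
    previous = "O"
    for tag in labels:
        if tag.startswith("I-"):
            entity = tag[2:]
            if previous not in (f"B-{entity}", f"I-{entity}"):
                violations += 1
        previous = tag
    return violations


def summarize(jsonl_path, label_key="labels"):
    """Return sample and violation counts for a JSONL file of labeled samples.

    label_key is an assumed field name; adjust it to the real schema.
    """
    samples = 0
    violations = 0
    with Path(jsonl_path).open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines between records
            record = json.loads(line)
            samples += 1
            violations += count_bio_violations(record[label_key])
    return {"labeled_samples": samples, "strict_bio_violations": violations}
```

Calling `summarize("datasets/AnimeName/dmhy_weak_char.jsonl")` would then report both counts in one pass, provided the file follows the assumed schema.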
## Train
```bash
uv sync
uv run python train.py \
  --tokenizer char \
  --data-file datasets/AnimeName/dmhy_weak_char.jsonl \
  --vocab-file datasets/AnimeName/vocab.char.json \
  --save-dir checkpoints/dmhy-char-full-relabel \
  --init-model-dir . \
  --epochs 2 \
  --batch-size 256 \
  --learning-rate 0.00008 \
  --warmup-steps 300 \
  --max-seq-length 128 \
  --checkpoint-steps 1000 \
  --parse-eval-limit 2048 \
  --seed 48
```
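To get a feel for how long this run is, the step counts can be estimated from the flags and the snapshot size. This is rough arithmetic under stated assumptions, not a description of what `train.py` actually does: it assumes one optimizer step per batch (no gradient accumulation) and that all labeled samples are used for training with no held-out split. If `train.py` reserves an evaluation split, the real numbers will be somewhat lower.

```python
import math


def estimate_schedule(samples, batch_size, epochs, warmup_steps, checkpoint_steps):
    """Estimate step counts, assuming one step per batch and no held-out split."""
    steps_per_epoch = math.ceil(samples / batch_size)
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_fraction": warmup_steps / total_steps,
        "periodic_checkpoints": total_steps // checkpoint_steps,
    }


# Values from the Dataset Waterline snapshot and the training command above.
plan = estimate_schedule(
    samples=632002, batch_size=256, epochs=2, warmup_steps=300, checkpoint_steps=1000
)
print(plan)
```

Under these assumptions the run is 2469 steps per epoch and 4938 steps in total, so warmup covers roughly 6% of training and `--checkpoint-steps 1000` yields four periodic checkpoints before the final one.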
## Publish a New Checkpoint
Copy the final checkpoint to the repository root:
```powershell
Copy-Item checkpoints/dmhy-char-full-relabel/final/config.json . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/model.safetensors . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/tokenizer_config.json . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/training_args.bin . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/vocab.json . -Force
Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/run_metadata.json . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/trainer_eval_metrics.json . -Force
Copy-Item checkpoints/dmhy-char-full-relabel/final/parse_eval_metrics.json . -Force
```
The repository root is the only published copy of the checkpoint: there is no
tracked `model/` duplicate, and the `checkpoints/` directories are git-ignored
training artifacts.
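On Linux or macOS, the same copy step could be done with a small Python helper instead of PowerShell. This is a sketch rather than a script that ships with the repository: `publish_checkpoint` is a hypothetical name, and the file list simply mirrors the `Copy-Item` commands above. It checks that every source file exists before copying anything, so a partial checkpoint never overwrites only some of the root files.

```python
import shutil
from pathlib import Path

# File names taken from the Copy-Item commands in this section.
CHECKPOINT_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "training_args.bin",
    "vocab.json",
    "run_metadata.json",
    "trainer_eval_metrics.json",
    "parse_eval_metrics.json",
]


def publish_checkpoint(final_dir, vocab_file, repo_root="."):
    """Copy the final checkpoint files and the char vocab into repo_root."""
    final_dir, repo_root = Path(final_dir), Path(repo_root)
    sources = [final_dir / name for name in CHECKPOINT_FILES] + [Path(vocab_file)]

    # Fail before touching the root if anything is missing.
    missing = [str(path) for path in sources if not path.is_file()]
    if missing:
        raise FileNotFoundError(f"missing checkpoint files: {missing}")

    copied = []
    for source in sources:
        target = repo_root / source.name
        shutil.copy2(source, target)  # overwrites, like Copy-Item -Force
        copied.append(target)
    return copied
```

From the repository root, the call would be `publish_checkpoint("checkpoints/dmhy-char-full-relabel/final", "datasets/AnimeName/vocab.char.json")`.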
Then commit and push:
```bash
git add .
git commit -m "Update AniFileBERT checkpoint"
git push origin main
```
## Update the Dataset Submodule
After pushing new files to `ModerRAS/AnimeName`, update the nested pointer:
```bash
git submodule update --remote datasets/AnimeName
git add datasets/AnimeName
git commit -m "Update AnimeName dataset pointer"
git push origin main
```
## Update MiruPlay
From the MiruPlay root:
```bash
git submodule update --remote --recursive tools/anime_parser
git add tools/anime_parser
git commit -m "Update AniFileBERT submodule"
git push origin master
```
If a new ONNX export changed the Android runtime assets, also stage these files before committing:
```text
scraper/src/main/assets/anime_parser/anime_filename_parser.onnx
scraper/src/main/assets/anime_parser/config.json
scraper/src/main/assets/anime_parser/vocab.json
```