# AniFileBERT Maintenance

This repository is the standalone Hugging Face model repository that MiruPlay
consumes as the `tools/anime_parser` submodule.

## Related Repositories

| Repository | URL | Purpose |
| --- | --- | --- |
| AniFileBERT | `https://huggingface.co/ModerRAS/AniFileBERT` | Model, scripts, ONNX export |
| AnimeName | `https://huggingface.co/datasets/ModerRAS/AnimeName` | Dataset snapshot |
| MiruPlay | `https://github.com/ModerRAS/MiruPlay` | Android integration |

Nested structure (AnimeName is a Git submodule of AniFileBERT):

```text
AniFileBERT
  datasets/AnimeName -> ModerRAS/AnimeName
```

## Clone

```powershell
git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT
```

After a plain clone (without `--recursive`), initialize the submodule and install dependencies:

```powershell
git submodule update --init --recursive
uv sync
```
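To confirm that the dataset submodule was actually initialized, the output of `git submodule status` can be inspected. The following is a minimal sketch, not part of this repository: the helper names `parse_submodule_status` and `check_submodules` are hypothetical, and the parser assumes submodule paths without spaces. It relies on the prefix characters Git documents for this command (`-` not initialized, `+` checked-out commit differs from the recorded one, `U` merge conflict).

```python
import subprocess

# Prefix characters documented for `git submodule status`.
PREFIXES = {
    "-": "not initialized",
    "+": "checked-out commit differs from the recorded pointer",
    "U": "merge conflict",
    " ": "ok",
}


def parse_submodule_status(output: str) -> dict[str, str]:
    """Map each submodule path to a human-readable state."""
    states = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        prefix, rest = line[0], line[1:]
        fields = rest.split()
        # fields[0] is the commit SHA, fields[1] the submodule path.
        states[fields[1]] = PREFIXES.get(prefix, "unknown")
    return states


def check_submodules(repo_dir: str = ".") -> dict[str, str]:
    """Run git in repo_dir and report the state of every submodule."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "submodule", "status", "--recursive"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_submodule_status(result.stdout)
```

If `datasets/AnimeName` is reported as `not initialized`, running `git submodule update --init --recursive` should resolve it.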

## Publishing Surface

The repository root is the only published Hugging Face checkpoint location:

```text
config.json
model.safetensors
tokenizer_config.json
training_args.bin
vocab.json
vocab.char.json
run_metadata.json
trainer_eval_metrics.json
parse_eval_metrics.json
case_metrics.json
```

The repository no longer tracks the old `model/` duplicate of the checkpoint.
The Git-ignored `checkpoints/` directories hold local training artifacts only.
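A quick way to verify that the root holds the complete publishing surface is to check each file name from the list above. This is a minimal sketch; the function name `missing_published_files` is hypothetical and does not exist in the repository.

```python
from pathlib import Path

# File names copied from the publishing surface list in this section.
PUBLISHED_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "training_args.bin",
    "vocab.json",
    "vocab.char.json",
    "run_metadata.json",
    "trainer_eval_metrics.json",
    "parse_eval_metrics.json",
    "case_metrics.json",
]


def missing_published_files(repo_root: str = ".") -> list[str]:
    """Return the published file names that are absent from repo_root."""
    root = Path(repo_root)
    return [name for name in PUBLISHED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_published_files(".")
    if missing:
        print("Missing from repository root:", ", ".join(missing))
    else:
        print("All published files are present.")
```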

## Standard Training

For full details, see [`docs/training.md`](docs/training.md).

Recommended full training command:

```powershell
uv run python train.py --tokenizer char `
  --data-file datasets/AnimeName/dmhy_weak_char.jsonl `
  --vocab-file datasets/AnimeName/vocab.char.json `
  --save-dir checkpoints/dmhy-char-full `
  --init-model-dir . `
  --epochs 2 `
  --batch-size 256 `
  --learning-rate 0.00008 `
  --warmup-steps 300 `
  --max-seq-length 128 `
  --train-split 0.98 `
  --num-workers 4 `
  --checkpoint-steps 1000 `
  --save-total-limit 3 `
  --parse-eval-limit 2048 `
  --case-eval-file data/parser_regression_cases.json `
  --seed 52 `
  --experiment-name dmhy-char-full
```
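To judge whether `--warmup-steps 300` and `--checkpoint-steps 1000` are sensible for a given dataset size, the total number of optimizer steps can be estimated. The sketch below is plain arithmetic under stated assumptions: a single device, no gradient accumulation, and a final partial batch that is kept. How `train.py` actually splits and batches the data is not described here, and the example count passed in at the end is a placeholder, not the real dataset size.

```python
import math


def training_steps(
    num_examples: int,
    train_split: float = 0.98,
    batch_size: int = 256,
    epochs: int = 2,
) -> int:
    """Estimate optimizer steps; defaults mirror the command above."""
    # Assumption: the train split is the rounded fraction of all examples.
    train_examples = round(num_examples * train_split)
    # Assumption: the last partial batch of each epoch still counts as a step.
    steps_per_epoch = math.ceil(train_examples / batch_size)
    return steps_per_epoch * epochs


if __name__ == "__main__":
    # Hypothetical dataset size, for illustration only.
    total = training_steps(num_examples=1_000_000)
    print(f"Estimated optimizer steps: {total}")
    print(f"Warmup share of training: {300 / total:.1%}")
```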

## Publish a New Checkpoint

Copy the files from the training run's `final` directory to the repository
root. `vocab.char.json` comes from the dataset submodule rather than from `final`:

```powershell
$final = "checkpoints/dmhy-char-full/final"
Copy-Item "$final/config.json" . -Force
Copy-Item "$final/model.safetensors" . -Force
Copy-Item "$final/tokenizer_config.json" . -Force
Copy-Item "$final/training_args.bin" . -Force
Copy-Item "$final/vocab.json" . -Force
Copy-Item "$final/run_metadata.json" . -Force
Copy-Item "$final/trainer_eval_metrics.json" . -Force
Copy-Item "$final/parse_eval_metrics.json" . -Force
Copy-Item "$final/case_metrics.json" . -Force
Copy-Item datasets/AnimeName/vocab.char.json .\vocab.char.json -Force
```

Export ONNX:

```powershell
uv run python export_onnx.py --model-dir . --output exports/anime_filename_parser.onnx --max-length 128
```

Validate the checkpoint against the regression cases, then smoke-test ONNX
inference on a sample filename:

```powershell
uv run python evaluate_parser_cases.py --model-dir . --case-file data/parser_regression_cases.json --output case_metrics.json
uv run python onnx_inference.py "[GM-Team][国漫][神印王座][Throne of Seal][2022][200][AVC][GB][1080P].mp4"
```

## Dataset Submodule

If `datasets/AnimeName` changed, commit and push it first:

```powershell
git -C datasets/AnimeName add dmhy_weak.jsonl dmhy_weak.manifest.json dmhy_weak_char.jsonl dmhy_weak_char.manifest.json vocab.json vocab.char.json
git -C datasets/AnimeName commit -m "Update anime filename labels"
git -C datasets/AnimeName lfs push origin main --all
git -C datasets/AnimeName push origin main
```

Then commit the updated submodule pointer in this repository:

```powershell
git add datasets/AnimeName
git commit -m "Update AnimeName dataset pointer"
```

## LFS Push Order

Large model artifacts are tracked with Git LFS. If Hugging Face rejects a push
because an LFS pointer refers to a missing object, upload the LFS objects first:

```powershell
git lfs push origin main --all
git push origin main
```

For dataset changes, apply the same order inside the submodule:

```powershell
git -C datasets/AnimeName lfs push origin main --all
git -C datasets/AnimeName push origin main
```
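When diagnosing LFS problems, it can help to check whether a local file such as `model.safetensors` holds real content or only an LFS pointer stub (for example after cloning without Git LFS installed). The sketch below is illustrative: the helper name `is_lfs_pointer` is hypothetical, and the check relies only on the fact that Git LFS pointer files begin with the line `version https://git-lfs.github.com/spec/v1`.

```python
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_SPEC_LINE = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path: str) -> bool:
    """Return True if the file starts with the Git LFS pointer header."""
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_SPEC_LINE))
    return head == LFS_SPEC_LINE


if __name__ == "__main__":
    target = Path("model.safetensors")
    if target.is_file() and is_lfs_pointer(str(target)):
        print(f"{target} is an LFS pointer; run `git lfs pull` to fetch it.")
```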

## Update MiruPlay

From the MiruPlay root:

```powershell
git submodule update --remote --recursive tools/anime_parser
git add tools/anime_parser
git commit -m "Update AniFileBERT submodule"
```

If the Android assets changed, also stage these files before committing:

```text
scraper/src/main/assets/anime_parser/anime_filename_parser.onnx
scraper/src/main/assets/anime_parser/config.json
scraper/src/main/assets/anime_parser/vocab.json
```
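One way to decide whether the Android assets changed is to compare SHA-256 hashes between the AniFileBERT artifacts and the asset copies. The sketch below is only an illustration: the function names are hypothetical, and the mapping from each asset to its source file (in particular that the asset `vocab.json` corresponds to the root `vocab.json`, and the ONNX asset to `exports/anime_filename_parser.onnx`) is an assumption that should be checked against the MiruPlay build.

```python
import hashlib
from pathlib import Path

# Assumed mapping: asset file name -> path relative to the AniFileBERT root.
ASSET_SOURCES = {
    "anime_filename_parser.onnx": "exports/anime_filename_parser.onnx",
    "config.json": "config.json",
    "vocab.json": "vocab.json",
}


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large ONNX files are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_assets(parser_root, assets_dir) -> list[str]:
    """List asset names that are missing or differ from their source file.

    Raises FileNotFoundError if a source file itself is missing.
    """
    changed = []
    for asset_name, source_rel in ASSET_SOURCES.items():
        source = Path(parser_root) / source_rel
        asset = Path(assets_dir) / asset_name
        if not asset.is_file() or sha256_of(source) != sha256_of(asset):
            changed.append(asset_name)
    return changed


if __name__ == "__main__":
    # Paths taken from this document; run from the MiruPlay root.
    names = changed_assets(
        "tools/anime_parser", "scraper/src/main/assets/anime_parser"
    )
    print("Assets to refresh and stage:", names or "none")
```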