Token Classification
Transformers
ONNX
Safetensors
English
Japanese
Chinese
bert
anime
filename-parsing
Eval Results (legacy)
Instructions to use ModerRAS/AniFileBERT with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use ModerRAS/AniFileBERT with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("token-classification", model="ModerRAS/AniFileBERT")

    # Load model directly
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    tokenizer = AutoTokenizer.from_pretrained("ModerRAS/AniFileBERT")
    model = AutoModelForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")
- Notebooks
- Google Colab
- Kaggle
Document training lineage
- README.md +3 -2
- training_lineage.json +55 -0
README.md
CHANGED
@@ -54,6 +54,7 @@ This repository is the Hugging Face model repo used by MiruPlay as `tools/anime_
 | Labels / 标签 | BIO labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, `SPECIAL` |
 | Default checkpoint / 默认权重 | Repository root files (`config.json`, `model.safetensors`, `vocab.json`, `tokenizer_config.json`) |
 | ONNX export / ONNX 导出 | `exports/anime_filename_parser.onnx` |
+| Training lineage / 训练链路 | `training_lineage.json` |
 
 **中文**:根目录就是发布 checkpoint,不再保留旧的 `model/` 重复副本。默认解析路径是“模型 logits + 约束 BIO + 薄字段规范化”,不再默认启用重结构规则;直接 `from_pretrained()` 只能加载 token-classification 权重。
 
@@ -147,9 +148,9 @@ Current published checkpoint:
 | ONNX parity / ONNX 误差 | max abs diff `4.0531e-05` |
 | CPU thin-runtime latency / CPU 薄层运行时延迟 | ONNX avg `13.08 ms`, P95 `15.95 ms` |
 
-**中文**:当前发布模型是“全量
+**中文**:当前发布模型是“两阶段训练”产物:先在 `datasets/AnimeName/dmhy_weak_char.jsonl` 上全量 CUDA 重训,再做 thin hard-case focus 微调。细节见 `training_lineage.json`。README 主指标以 `model-only` 和默认薄层 `normalized-only` 为准;`--rule-assist` 只保留为兼容/诊断对照,不再作为模型质量标准。
 
-**English**: The published checkpoint
+**English**: The published checkpoint was trained in two stages: a full CUDA fine-tune on `datasets/AnimeName/dmhy_weak_char.jsonl`, followed by a thin hard-case focus fine-tune. See `training_lineage.json` for details. README quality numbers prioritize `model-only` and the default thin `normalized-only` runtime; `--rule-assist` is retained only for compatibility/diagnostics.
 
 Run regression:
 
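The README context in the first hunk describes the default parsing path as model logits, then constrained BIO decoding, then thin field normalization. The repository's actual decoder is not shown in this diff, so the following is only a minimal sketch of what a constrained BIO step could look like: a Viterbi pass over per-token scores that forbids an `I-X` tag unless it follows `B-X` or `I-X`. The label order, the `O` tag, and the function names `allowed` and `constrained_bio_decode` are assumptions made for illustration; only the entity type names come from the README.

```python
from typing import List, Sequence

# Entity types come from the README label row. The "O" tag, the B-/I- naming,
# and this label ordering are assumptions; the real order lives in config.json.
ENTITY_TYPES = ["TITLE", "SEASON", "EPISODE", "GROUP", "RESOLUTION", "SOURCE", "SPECIAL"]
LABELS = ["O"] + [f"{prefix}-{name}" for name in ENTITY_TYPES for prefix in ("B", "I")]


def allowed(prev: str, cur: str) -> bool:
    """An I-X tag may only follow B-X or I-X; every other transition is allowed."""
    if not cur.startswith("I-"):
        return True
    return prev in ("B-" + cur[2:], "I-" + cur[2:])


def constrained_bio_decode(
    scores: Sequence[Sequence[float]], labels: Sequence[str] = LABELS
) -> List[str]:
    """Viterbi decode over per-token scores with hard BIO transition constraints."""
    if not scores:
        return []
    neg_inf = float("-inf")
    n = len(labels)
    # A sequence may not start with an I- tag.
    best = [scores[0][j] if not labels[j].startswith("I-") else neg_inf for j in range(n)]
    back: List[List[int]] = []
    for row in scores[1:]:
        new_best, pointers = [], []
        for j in range(n):
            # Best predecessor for label j among the transitions that are legal.
            candidates = [best[i] if allowed(labels[i], labels[j]) else neg_inf for i in range(n)]
            i_best = max(range(n), key=candidates.__getitem__)
            new_best.append(candidates[i_best] + row[j])
            pointers.append(i_best)
        best = new_best
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    j = max(range(n), key=best.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [labels[k] for k in path]


if __name__ == "__main__":
    # Plain argmax on these rows would yield I-TITLE, I-TITLE, O, which is not valid BIO.
    rows = [[0.0] * len(LABELS) for _ in range(3)]
    rows[0][LABELS.index("I-TITLE")] = 5.0
    rows[0][LABELS.index("B-TITLE")] = 4.0
    rows[1][LABELS.index("I-TITLE")] = 5.0
    rows[2][LABELS.index("O")] = 5.0
    print(constrained_bio_decode(rows))  # ['B-TITLE', 'I-TITLE', 'O']
```

In practice the scores would be the logits from `AutoModelForTokenClassification`, with `labels` taken from the checkpoint's `id2label` mapping rather than the assumed ordering above.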
training_lineage.json
ADDED
@@ -0,0 +1,55 @@
+{
+  "published_checkpoint": "repository_root",
+  "summary": "The published checkpoint was produced in two stages: a full-dataset CUDA fine-tune on dmhy_weak_char.jsonl, followed by a thin-runtime hard-case focus fine-tune.",
+  "summary_zh": "当前发布 checkpoint 是两阶段产物:先在 dmhy_weak_char.jsonl 上做全量 CUDA 微调,再做薄层运行时困难样本微调。",
+  "stages": [
+    {
+      "name": "dmhy-char-thin-gpu",
+      "type": "full_dataset_finetune",
+      "machine": "adqew@192.168.63.157",
+      "data_file": "datasets/AnimeName/dmhy_weak_char.jsonl",
+      "tokenizer_variant": "char",
+      "vocab_file": "datasets/AnimeName/vocab.char.json",
+      "vocab_size": 6199,
+      "max_seq_length": 128,
+      "train_samples": 619361,
+      "eval_samples": 12641,
+      "epochs": 2.0,
+      "batch_size": 256,
+      "learning_rate": 0.00006,
+      "warmup_steps": 300,
+      "seed": 55,
+      "device": "cuda",
+      "fp16": true,
+      "eval_f1": 0.9962419041019217,
+      "eval_accuracy": 0.9991988685517916,
+      "role": "Base checkpoint for the final hard-case focus stage."
+    },
+    {
+      "name": "dmhy-char-thin-hardfocus",
+      "type": "hard_case_focus_finetune",
+      "machine": "adqew@192.168.63.157",
+      "data_file": "data/thin_hard_focus_char.jsonl",
+      "tokenizer_variant": "char",
+      "vocab_file": "datasets/AnimeName/vocab.char.json",
+      "vocab_size": 6199,
+      "max_seq_length": 128,
+      "train_samples": 117089,
+      "eval_samples": 6163,
+      "epochs": 2.0,
+      "batch_size": 256,
+      "learning_rate": 0.00004,
+      "warmup_steps": 80,
+      "seed": 58,
+      "device": "cuda",
+      "fp16": true,
+      "eval_f1": 0.9972066016906769,
+      "eval_accuracy": 0.9994733938512463,
+      "fixed_regression_model_only": "25/26",
+      "fixed_regression_normalized_only": "26/26",
+      "heldout_model_only": "1014/1024",
+      "heldout_normalized_only": "1017/1024",
+      "role": "Published repository-root checkpoint."
+    }
+  ]
+}
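The lineage file stores its regression and held-out results as `passed/total` strings and does not record step counts, so a reader may want to derive both. The sketch below parses a trimmed copy of the JSON (only the fields it uses) and computes pass rates plus an approximate optimizer step count per stage. The step estimate assumes a single device, no gradient accumulation, and a kept final partial batch; none of those are stated in the file. The helper names `ratio`, `approx_optimizer_steps`, and `summarize` are hypothetical.

```python
import json
import math

# Trimmed copy of training_lineage.json: only the fields used below are kept.
LINEAGE_JSON = """
{
  "stages": [
    {"name": "dmhy-char-thin-gpu", "train_samples": 619361,
     "batch_size": 256, "epochs": 2.0, "warmup_steps": 300},
    {"name": "dmhy-char-thin-hardfocus", "train_samples": 117089,
     "batch_size": 256, "epochs": 2.0, "warmup_steps": 80,
     "fixed_regression_model_only": "25/26",
     "fixed_regression_normalized_only": "26/26",
     "heldout_model_only": "1014/1024",
     "heldout_normalized_only": "1017/1024"}
  ]
}
"""


def ratio(text: str) -> float:
    """Convert a 'passed/total' string such as '1014/1024' into a fraction."""
    passed, total = (int(part) for part in text.split("/"))
    return passed / total


def approx_optimizer_steps(stage: dict) -> int:
    """Estimate steps, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(stage["train_samples"] / stage["batch_size"])
    return int(steps_per_epoch * stage["epochs"])


def summarize(lineage: dict) -> dict:
    """Collect the step estimate and every pass-rate field for each stage."""
    result = {}
    for stage in lineage["stages"]:
        info = {"approx_steps": approx_optimizer_steps(stage)}
        for key, value in stage.items():
            if key.startswith(("heldout_", "fixed_regression_")):
                info[key] = ratio(value)
        result[stage["name"]] = info
    return result


summary = summarize(json.loads(LINEAGE_JSON))
for name, info in summary.items():
    print(name, info)
```

Under these assumptions the first stage comes to roughly 4840 steps and the second to roughly 916, and the held-out pass rates are 1014/1024 ≈ 0.9902 for `model-only` and 1017/1024 ≈ 0.9932 for `normalized-only`.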