ModerRAS committed on
Commit f2ec095 · 1 Parent(s): f712f4b

Document training lineage

Files changed (2)
  1. README.md +3 -2
  2. training_lineage.json +55 -0
README.md CHANGED
@@ -54,6 +54,7 @@ This repository is the Hugging Face model repo used by MiruPlay as `tools/anime_
  | Labels / 标签 | BIO labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, `SPECIAL` |
  | Default checkpoint / 默认权重 | Repository root files (`config.json`, `model.safetensors`, `vocab.json`, `tokenizer_config.json`) |
  | ONNX export / ONNX 导出 | `exports/anime_filename_parser.onnx` |
+ | Training lineage / 训练链路 | `training_lineage.json` |

  **中文**:根目录就是发布 checkpoint,不再保留旧的 `model/` 重复副本。默认解析路径是“模型 logits + 约束 BIO + 薄字段规范化”,不再默认启用重结构规则;直接 `from_pretrained()` 只能加载 token-classification 权重。

@@ -147,9 +148,9 @@ Current published checkpoint:
  | ONNX parity / ONNX 误差 | max abs diff `4.0531e-05` |
  | CPU thin-runtime latency / CPU 薄层运行时延迟 | ONNX avg `13.08 ms`, P95 `15.95 ms` |

- **中文**:当前发布模型是“全量重标注 char 模型 + thin hard-case focus 微调。README 主指标以 `model-only` 和默认薄层 `normalized-only` 为准;`--rule-assist` 只保留为兼容/诊断对照,不再作为模型质量标准。
+ **中文**:当前发布模型是“两阶段训练”产物:先在 `datasets/AnimeName/dmhy_weak_char.jsonl` 上全量 CUDA 重训,再做 thin hard-case focus 微调。细节见 `training_lineage.json`。README 主指标以 `model-only` 和默认薄层 `normalized-only` 为准;`--rule-assist` 只保留为兼容/诊断对照,不再作为模型质量标准。

- **English**: The published checkpoint is the full-relabel character model plus a thin hard-case focus fine-tune. README quality numbers prioritize `model-only` and the default thin `normalized-only` runtime; `--rule-assist` is retained only for compatibility/diagnostics.
+ **English**: The published checkpoint was trained in two stages: a full CUDA fine-tune on `datasets/AnimeName/dmhy_weak_char.jsonl`, followed by a thin hard-case focus fine-tune. See `training_lineage.json` for details. README quality numbers prioritize `model-only` and the default thin `normalized-only` runtime; `--rule-assist` is retained only for compatibility/diagnostics.

  Run regression:

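Reviewer note: the README context in this hunk describes the default parsing path as model logits, then constrained BIO decoding, then thin field normalization. The repository's decoder is not part of this diff, so the following is only a minimal sketch of what a constrained BIO step could look like. The label order in `LABELS`, the helper names, and the greedy strategy are all assumptions made for illustration; the real label mapping lives in `config.json`.

```python
from typing import List, Sequence

# Entity types as listed in the README table.
ENTITY_TYPES = ["TITLE", "SEASON", "EPISODE", "GROUP", "RESOLUTION", "SOURCE", "SPECIAL"]

# Hypothetical label order; the published id2label mapping may differ.
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]


def is_allowed(prev_label: str, label: str) -> bool:
    """An I-X label may only continue a span that is already of type X."""
    if not label.startswith("I-"):
        return True  # "O" and every "B-X" can follow anything
    etype = label[2:]
    return prev_label in (f"B-{etype}", f"I-{etype}")


def constrained_greedy_decode(logits: Sequence[Sequence[float]]) -> List[str]:
    """Pick, per token, the highest-scoring label that keeps the BIO sequence valid."""
    prev_label = "O"  # sequence start behaves like "outside any span"
    decoded: List[str] = []
    for row in logits:
        candidates = [i for i, lab in enumerate(LABELS) if is_allowed(prev_label, lab)]
        best = max(candidates, key=lambda i: row[i])
        prev_label = LABELS[best]
        decoded.append(prev_label)
    return decoded
```

A greedy pass is the simplest variant; a Viterbi search over the same transition constraint would also be a reasonable choice if the best overall sequence matters more than per-token speed.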
training_lineage.json ADDED
@@ -0,0 +1,55 @@
+ {
+   "published_checkpoint": "repository_root",
+   "summary": "The published checkpoint was produced in two stages: a full-dataset CUDA fine-tune on dmhy_weak_char.jsonl, followed by a thin-runtime hard-case focus fine-tune.",
+   "summary_zh": "当前发布 checkpoint 是两阶段产物:先在 dmhy_weak_char.jsonl 上做全量 CUDA 微调,再做薄层运行时困难样本微调。",
+   "stages": [
+     {
+       "name": "dmhy-char-thin-gpu",
+       "type": "full_dataset_finetune",
+       "machine": "adqew@192.168.63.157",
+       "data_file": "datasets/AnimeName/dmhy_weak_char.jsonl",
+       "tokenizer_variant": "char",
+       "vocab_file": "datasets/AnimeName/vocab.char.json",
+       "vocab_size": 6199,
+       "max_seq_length": 128,
+       "train_samples": 619361,
+       "eval_samples": 12641,
+       "epochs": 2.0,
+       "batch_size": 256,
+       "learning_rate": 0.00006,
+       "warmup_steps": 300,
+       "seed": 55,
+       "device": "cuda",
+       "fp16": true,
+       "eval_f1": 0.9962419041019217,
+       "eval_accuracy": 0.9991988685517916,
+       "role": "Base checkpoint for the final hard-case focus stage."
+     },
+     {
+       "name": "dmhy-char-thin-hardfocus",
+       "type": "hard_case_focus_finetune",
+       "machine": "adqew@192.168.63.157",
+       "data_file": "data/thin_hard_focus_char.jsonl",
+       "tokenizer_variant": "char",
+       "vocab_file": "datasets/AnimeName/vocab.char.json",
+       "vocab_size": 6199,
+       "max_seq_length": 128,
+       "train_samples": 117089,
+       "eval_samples": 6163,
+       "epochs": 2.0,
+       "batch_size": 256,
+       "learning_rate": 0.00004,
+       "warmup_steps": 80,
+       "seed": 58,
+       "device": "cuda",
+       "fp16": true,
+       "eval_f1": 0.9972066016906769,
+       "eval_accuracy": 0.9994733938512463,
+       "fixed_regression_model_only": "25/26",
+       "fixed_regression_normalized_only": "26/26",
+       "heldout_model_only": "1014/1024",
+       "heldout_normalized_only": "1017/1024",
+       "role": "Published repository-root checkpoint."
+     }
+   ]
+ }
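
Reviewer note: the lineage file stores regression results as `"passed/total"` strings and records sample counts, batch size, and epochs per stage. As a rough sanity check, the sketch below turns those strings into pass rates and estimates the number of optimizer steps per stage. It embeds a trimmed copy of the fields from this commit rather than reading the file. The helper names are hypothetical, and the step estimate assumes `batch_size` is the effective global batch with no gradient accumulation and that the last partial batch is kept, none of which the file confirms.

```python
import json
import math
from fractions import Fraction

# Trimmed copy of the fields added in training_lineage.json by this commit.
LINEAGE_JSON = """
{
  "stages": [
    {"name": "dmhy-char-thin-gpu",
     "train_samples": 619361, "epochs": 2.0, "batch_size": 256},
    {"name": "dmhy-char-thin-hardfocus",
     "train_samples": 117089, "epochs": 2.0, "batch_size": 256,
     "fixed_regression_model_only": "25/26",
     "fixed_regression_normalized_only": "26/26",
     "heldout_model_only": "1014/1024",
     "heldout_normalized_only": "1017/1024"}
  ]
}
"""


def pass_rate(score: str) -> float:
    """Convert a 'passed/total' string such as '1014/1024' into a ratio."""
    return float(Fraction(score))


def optimizer_steps(stage: dict) -> int:
    """Estimate total steps, assuming no gradient accumulation (see note above)."""
    steps_per_epoch = math.ceil(stage["train_samples"] / stage["batch_size"])
    return int(steps_per_epoch * stage["epochs"])


stages = json.loads(LINEAGE_JSON)["stages"]

for stage in stages:
    print(f"{stage['name']}: ~{optimizer_steps(stage)} optimizer steps")
    for key, value in stage.items():
        # Score fields are the only string values containing a slash here.
        if isinstance(value, str) and "/" in value:
            print(f"  {key}: {value} = {pass_rate(value):.4f}")
```

Under those assumptions, the configured warmup (300 and 80 steps) covers only a small fraction of each stage's estimated step count.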