ModerRAS committed on
Commit
410e000
·
1 Parent(s): 24a2cb6

Rebuild vocabulary: take the top 8000 tokens by frequency from the 632K-sample dmhy_weak.jsonl, covering 96.2%

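- Expand the vocabulary from 3000 to 8000 tokens; add '[', ']', and common fansub group names (Snow, LoliHouse, KTXP, etc.)
- OOV rate drops from 25% to 3.8%; fix the token mismatch between training and inference
- Update the default vocab_size in config.py; fix build_vocab_from_data so that it passes max_size
- Add colab_train.py, an automated training script
- Update the README training instructions and CUDA 12.6 setup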
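
Per the README text in this commit, the coverage figure is measured over token occurrences. As a minimal sketch of how such a number can be computed for a top-k vocabulary, assuming each sample is a dict with a `tokens` list (as in the README's rebuild snippet). `coverage_at` and the toy samples are hypothetical and not code from this commit, and reserved special tokens are ignored for simplicity:

```python
from collections import Counter


def coverage_at(samples, k):
    """Fraction of token occurrences covered by the k most frequent tokens."""
    counts = Counter()
    for sample in samples:
        counts.update(sample["tokens"])
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(freq for _, freq in counts.most_common(k))
    return covered / total


# Toy rows standing in for dmhy_weak.jsonl entries (hypothetical data).
toy_samples = [
    {"tokens": ["[", "Snow", "]", "Title", "01"]},
    {"tokens": ["[", "KTXP", "]", "Title", "02"]},
]
cov = coverage_at(toy_samples, k=3)
print(f"coverage={cov:.1%}, uncovered={1 - cov:.1%}")
```

Running the same function with `k=3000` and `k=8000` over the real dataset would be one way to reproduce a before/after comparison.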

Files changed (8)
  1. README.md +64 -10
  2. colab_train.py +134 -0
  3. config.py +1 -1
  4. data/dmhy/vocab.json +0 -0
  5. data/vocab.json +0 -0
  6. model/vocab.json +0 -0
  7. train.py +5 -4
  8. vocab.json +0 -0
README.md CHANGED
@@ -30,7 +30,7 @@ The checkpoint in this repository is the DMHY weak-label fine-tuned regex-tokeni
  - Labels: BIO token labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, and `SPECIAL`
  - Tokenizer: custom regex/structure tokenizer implemented in `tokenizer.py`
  - Max sequence length: 64
- - Parameters: about 4M
+ - Parameters: about 5M

  The model files are stored at the repository root so `BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")` can load the weights. Use `inference.py` for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.

@@ -38,12 +38,27 @@ The model files are stored at the repository root so `BertForTokenClassification

  Training data snapshots are published separately in [`ModerRAS/AnimeName`](https://huggingface.co/datasets/ModerRAS/AnimeName), and this repository includes it as a nested git submodule at `datasets/AnimeName`.

- Current DMHY export waterline:
+ Current DMHY export waterline (from `datasets/AnimeName`):

- - Last exported `files.id`: `689304`
- - Next incremental export: `--min-id 689305`
- - Weak-labeled samples: `263042`
- - Mixed training samples: `363042`
+ - Last exported `files.id`: `1675184`
+ - Next incremental export: `--min-id 1675185`
+ - Weak-labeled samples: `632002`
+ - Mixed training samples: `732002`
+
+ ## Vocabulary
+
+ The default `vocab.json` contains **8000 tokens** (up from 3000) built from frequency
+ analysis of the full 632K DMHY weak-label dataset. Tokens not in the vocabulary
+ become `[UNK]`, so a larger vocabulary directly improves coverage:
+
+ | Vocab size | Coverage | Model params |
+ |------------|----------|-------------|
+ | 3000 (old) | 90.4% | ~4.0M |
+ | 8000 (current) | 96.2% | ~5.3M |
+
+ Common fansub group names (`Snow`, `LoliHouse`, `DMG`, `KTXP`, `Sakurato`, etc.)
+ and individual bracket characters (`[`, `]`, `(`, `)`) are included in the new
+ vocabulary.

  ## Evaluation

@@ -99,7 +114,32 @@ git submodule update --init --recursive

  ## Training

- Regenerate or export datasets:
+ ### Prerequisites (Windows / Local GPU)
+
+ PyTorch 2.11+ with CUDA 12.6 is required for GPU training:
+
+ ```bash
+ pip install torch --index-url https://download.pytorch.org/whl/cu126
+ pip install -r requirements.txt
+ ```
+
+ ### Fine-tune with rebuilt vocabulary
+
+ ```bash
+ python train.py --data-file datasets/AnimeName/dmhy_weak.jsonl \
+   --vocab-file datasets/AnimeName/vocab.json \
+   --save-dir checkpoints/dmhy-finetune \
+   --init-model-dir . \
+   --epochs 10 --batch-size 128 \
+   --learning-rate 0.0003 --warmup-steps 300 --seed 42
+ ```
+
+ The model loads the old 3000-token checkpoint, `resize_token_embeddings()` adds
+ 5000 new randomly-initialized slots for the new vocabulary, and fine-tuning
+ trains the full model. About 96% of token occurrences are now covered (vs 90%
+ with the old 3000-token vocabulary).
+
+ ### Regenerate datasets from source

  ```bash
  python data_generator.py --num-samples 100000
@@ -107,18 +147,32 @@ python dmhy_dataset.py --db D:/WorkSpace/Python/dmhy-parser/dmhy_anime.db --outp
  python mix_datasets.py --synthetic data/synthetic.jsonl --dmhy data/dmhy/dmhy_weak.jsonl --output data/dmhy/mixed_train.jsonl
  ```

- Fine-tune from the synthetic checkpoint or train from scratch:
+ ### Rebuild vocabulary (if needed)

  ```bash
- python train.py --data-file data/dmhy/mixed_train.jsonl --save-dir checkpoints/dmhy-finetune --init-model-dir checkpoints/final --epochs 1 --batch-size 128 --learning-rate 0.0003 --warmup-steps 300 --seed 42
+ python -c "
+ import json, collections
+ tokens = collections.Counter()
+ [ tokens.update(item['tokens']) for item in [json.loads(l) for l in open('datasets/AnimeName/dmhy_weak.jsonl', encoding='utf-8')] if item ]
+ vocab = {t:i for i,t in enumerate(['[PAD]','[UNK]','[CLS]','[SEP]'] + [t for t,_ in tokens.most_common(7996)])}
+ json.dump(vocab, open('vocab.json','w', encoding='utf-8'), ensure_ascii=False, indent=2)
+ "
  ```

- Export ONNX for MiruPlay Android assets:
+ ### Export ONNX for MiruPlay Android

  ```bash
  python export_onnx.py --model-dir checkpoints/dmhy-finetune/final --output exports/anime_filename_parser.onnx
  ```

+ ---
+
+ ## Google Colab Training
+
+ Upload and run [`colab_train.py`](colab_train.py) in a Colab GPU runtime.
+ It will mount Google Drive, clone both repos, install dependencies, and run
+ the full training pipeline. Checkpoints are saved to your Drive automatically.
+
  ## Repository Layout

  - `model.safetensors`, `config.json`, `vocab.json`: default fine-tuned model
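
The README's note on `resize_token_embeddings()` describes growing the embedding table from 3000 to 8000 rows. The following is a conceptual, dependency-free sketch of that idea, not the Transformers implementation: existing rows are kept and the extra rows are randomly initialized. `resize_embeddings`, the 0.02 standard deviation, and the tiny sizes are illustrative assumptions.

```python
import random


def resize_embeddings(table, new_size, std=0.02, seed=0):
    """Return a copy of `table` with `new_size` rows.

    Existing rows are preserved; extra rows are drawn from N(0, std^2).
    """
    rng = random.Random(seed)
    dim = len(table[0])
    resized = [row[:] for row in table[:new_size]]
    while len(resized) < new_size:
        resized.append([rng.gauss(0.0, std) for _ in range(dim)])
    return resized


old = [[float(i)] * 4 for i in range(3)]   # 3 "trained" rows, hidden size 4
new = resize_embeddings(old, new_size=8)   # 5 fresh rows appended
print(len(new), new[:3] == old)
```

Preserved rows are only useful if the old token ids keep their meaning in the new vocabulary. Whether the rebuilt `vocab.json` keeps the old ordering is not visible in this diff.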
colab_train.py ADDED
@@ -0,0 +1,134 @@
+ # -*- coding: utf-8 -*-
+ """AniFileBERT - Google Colab Training Script
+ =============================================
+
+ How to use:
+   1. Open https://colab.research.google.com/
+   2. File → Upload notebook → select this file, OR
+      Copy the entire content into a new code cell
+   3. Runtime → Change runtime type → T4 GPU
+   4. Run all
+
+ What it does:
+   - Mounts Google Drive (for persistent checkpoints)
+   - Clones AniFileBERT repo + AnimeName dataset submodule
+   - Installs PyTorch + Transformers dependencies
+   - Runs training: fine-tune from current checkpoint with 8000-token vocab
+   - Saves final model to Drive
+
+ Output:
+   - Checkpoints saved to: MyDrive/AniFileBERT/checkpoints/
+   - Final model at: MyDrive/AniFileBERT/checkpoints/dmhy-finetune/final/
+ """
+
+ import os
+ import sys
+ import subprocess
+ import time
+
+
+ def run(cmd, echo=True):
+     """Run a shell command and print output in real time."""
+     if echo:
+         print(f"\n$ {cmd}")
+     proc = subprocess.Popen(
+         cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
+         text=True, bufsize=1
+     )
+     for line in proc.stdout:
+         print(line, end="")
+     proc.wait()
+     if proc.returncode != 0:
+         raise RuntimeError(f"Command failed (exit code {proc.returncode}): {cmd}")
+     return proc.returncode
+
+
+ # ── 1. Mount Google Drive ──────────────────────────────────────
+ print("=" * 60)
+ print("STEP 1: Mount Google Drive")
+ print("=" * 60)
+ from google.colab import drive
+ drive.mount("/content/drive")
+
+ DRIVE_ROOT = "/content/drive/MyDrive/AniFileBERT"
+ os.makedirs(DRIVE_ROOT, exist_ok=True)
+ print(f"Checkpoints will be saved to: {DRIVE_ROOT}")
+
+ # ── 2. Clone repositories ──────────────────────────────────────
+ print("\n" + "=" * 60)
+ print("STEP 2: Clone AniFileBERT repository")
+ print("=" * 60)
+
+ REPO_DIR = "/content/AniFileBERT"
+ if not os.path.isdir(REPO_DIR):
+     os.chdir("/content")
+     run("git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT")
+ else:
+     print("Repository already exists, pulling latest...")
+     os.chdir(REPO_DIR)
+     run("git pull")
+     run("git submodule update --init --recursive")
+
+ os.chdir(REPO_DIR)
+
+ # ── 3. Install dependencies ────────────────────────────────────
+ print("\n" + "=" * 60)
+ print("STEP 3: Install dependencies")
+ print("=" * 60)
+ # Colab comes with PyTorch + CUDA pre-installed. Just install the extras.
+ run("pip install transformers accelerate seqeval")
+
+ # ── 4. Verify GPU ──────────────────────────────────────────────
+ print("\n" + "=" * 60)
+ print("STEP 4: Verify GPU")
+ print("=" * 60)
+ run("nvidia-smi 2>/dev/null || echo 'No GPU found - training will be slow on CPU'")
+ # Single quotes inside the snippet keep the shell's double-quoted argument intact.
+ run("python -c \"import torch; print('PyTorch %s, CUDA available: %s' % (torch.__version__, torch.cuda.is_available()))\"")
+
+ # ── 5. Verify vocab ────────────────────────────────────────────
+ print("\n" + "=" * 60)
+ print("STEP 5: Verify vocabulary")
+ print("=" * 60)
+ run("python -c \"import json; v = json.load(open('vocab.json', encoding='utf-8')); print('Vocab size: %d tokens' % len(v)); print('Key tokens present: [=%r, ]=%r' % ('[' in v, ']' in v))\"")
+
+ # ── 6. Run training ────────────────────────────────────────────
+ print("\n" + "=" * 60)
+ print("STEP 6: Train model")
+ print("=" * 60)
+
+ # The 8000-token vocab is already in datasets/AnimeName/vocab.json.
+ # The old checkpoint (3000-token embedding) gets resized automatically.
+ SAVE_DIR = os.path.join(DRIVE_ROOT, "checkpoints", "dmhy-finetune")
+
+ run(
+     f"python train.py "
+     f"--data-file datasets/AnimeName/dmhy_weak.jsonl "
+     f"--vocab-file datasets/AnimeName/vocab.json "
+     f"--save-dir {SAVE_DIR} "
+     f"--init-model-dir . "
+     f"--epochs 10 --batch-size 128 "
+     f"--learning-rate 0.0003 --warmup-steps 300 "
+     f"--seed 42 "
+     f"--no-shuffle"
+ )
+
+ # ── 7. Export ONNX ─────────────────────────────────────────────
+ print("\n" + "=" * 60)
+ print("STEP 7: Export ONNX (optional)")
+ print("=" * 60)
+ ONNX_OUT = os.path.join(SAVE_DIR, "..", "anime_filename_parser.onnx")
+ run(
+     f"python export_onnx.py "
+     f"--model-dir {SAVE_DIR}/final "
+     f"--output {ONNX_OUT}"
+ )
+
+ # ── 8. Summary ─────────────────────────────────────────────────
+ print("\n" + "=" * 60)
+ print("DONE!")
+ print("=" * 60)
+ print(f"\nCheckpoints: {SAVE_DIR}/")
+ print(f"Final model: {SAVE_DIR}/final/")
+ print(f"ONNX export: {ONNX_OUT}")
+ print(f"\nAll files are on Google Drive - they persist across Colab sessions.")
+ print(f"You can also download them from the Drive web UI.")
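
Steps 4 and 5 of the script pass inline Python through `python -c` inside a shell string, where nested quotes are easy to get wrong. A minimal sketch of an alternative, assuming it is acceptable to bypass the shell for these checks: pass the arguments as a list so that no shell quoting is involved. `run_python_snippet` is a hypothetical helper, not part of this commit.

```python
import subprocess
import sys


def run_python_snippet(code):
    """Run `code` in a child interpreter without going through a shell."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Double quotes inside the snippet need no escaping because no shell parses them.
out = run_python_snippet('import json; print(json.dumps({"[": True, "]": True}))')
print(out, end="")
```

Using `sys.executable` also ensures the child process runs under the same interpreter as the calling script.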
config.py CHANGED
@@ -42,7 +42,7 @@ class Config:
      max_seq_length: int = 64

      # Vocabulary (set dynamically from tokenizer)
-     vocab_size: int = 3000 # placeholder, overridden after tokenizer vocab is built
+     vocab_size: int = 8000 # placeholder, overridden after tokenizer vocab is built

      # Special tokens
      pad_token: str = "[PAD]"
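
Raising `vocab_size` mainly grows the token embedding table, which holds `vocab_size × hidden_size` weights. As a rough check against the README's ~4.0M and ~5.3M parameter figures, the sketch below assumes that only the embedding table changes. The hidden size is not shown in this diff, so it is inferred here rather than quoted, and `implied_hidden_size` is a hypothetical helper.

```python
def implied_hidden_size(params_before, params_after, vocab_before, vocab_after):
    """Hidden size implied if every extra parameter is a new embedding weight."""
    return (params_after - params_before) / (vocab_after - vocab_before)


h = implied_hidden_size(4.0e6, 5.3e6, 3000, 8000)
print(f"implied hidden size: about {h:.0f}")
```

Because both parameter counts are rounded, the result only indicates a hidden size in the neighborhood of 256; the model's `config.json` would give the exact value.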
data/dmhy/vocab.json CHANGED
The diff for this file is too large to render. See raw diff
 
data/vocab.json CHANGED
The diff for this file is too large to render. See raw diff
 
model/vocab.json CHANGED
The diff for this file is too large to render. See raw diff
 
train.py CHANGED
@@ -93,13 +93,14 @@ def resolve_vocab_path(data_file: str, tokenizer_variant: str, explicit_path: Op
      return os.path.join(os.path.dirname(data_file), name)


- def build_vocab_from_data(data: List[Dict], tokenizer: AnimeTokenizer, vocab_path: str) -> None:
+ def build_vocab_from_data(data: List[Dict], tokenizer: AnimeTokenizer, vocab_path: str,
+                           max_size: Optional[int] = None) -> None:
      token_lists: List[List[str]] = []
      for item in data:
          tokens, labels = align_tokens_for_tokenizer(item["tokens"], item["labels"], tokenizer)
          token_lists.append(tokens)

-     tokenizer.build_vocab(token_lists)
+     tokenizer.build_vocab(token_lists, max_size=max_size)
      save_dir = os.path.dirname(vocab_path) or "."
      os.makedirs(save_dir, exist_ok=True)
      with open(vocab_path, "w", encoding="utf-8") as f:
@@ -145,8 +146,8 @@ def main():
      vocab_path = resolve_vocab_path(config.data_file, args.tokenizer, args.vocab_file)
      tokenizer = create_tokenizer(args.tokenizer)
      if args.rebuild_vocab or not os.path.isfile(vocab_path):
-         print(f" Building {args.tokenizer} vocab: {vocab_path}")
-         build_vocab_from_data(all_data, tokenizer, vocab_path)
+         print(f" Building {args.tokenizer} vocab: {vocab_path} (max_size={config.vocab_size})")
+         build_vocab_from_data(all_data, tokenizer, vocab_path, max_size=config.vocab_size)
      tokenizer = create_tokenizer(args.tokenizer, vocab_file=vocab_path)
      print(f" Variant: {args.tokenizer}")
      print(f" Vocab size: {tokenizer.vocab_size}")
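
`AnimeTokenizer.build_vocab` itself is not part of this diff, so how it interprets `max_size` is not visible here. A minimal sketch of one plausible behavior, assuming special tokens count toward the cap (as in the README's rebuild snippet, which keeps 4 special tokens plus the 7996 most frequent tokens). `build_capped_vocab` is a hypothetical stand-in, not the repository's method.

```python
from collections import Counter
from typing import Dict, List, Optional

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]


def build_capped_vocab(token_lists: List[List[str]],
                       max_size: Optional[int] = None) -> Dict[str, int]:
    """Map tokens to ids: special tokens first, then tokens by descending frequency."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    for special in SPECIAL_TOKENS:
        counts.pop(special, None)  # special tokens always get the first ids
    # max_size=None keeps every token; otherwise reserve room for the specials.
    budget = None if max_size is None else max(max_size - len(SPECIAL_TOKENS), 0)
    kept = [tok for tok, _ in counts.most_common(budget)]
    return {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS + kept)}


vocab = build_capped_vocab([["[", "A", "]"], ["[", "B", "]"], ["[", "A", "]"]], max_size=6)
print(vocab)
```

With a cap of 6, only the two most frequent tokens survive next to the four special tokens; rarer tokens such as `B` would map to `[UNK]` at tokenization time.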
vocab.json CHANGED
The diff for this file is too large to render. See raw diff