重建词表：从632K dmhy_weak.jsonl统计频率取top8000，覆盖96.2%

- 词表从3000扩展到8000，新增'['、']'、常见字幕组名(Snow/LoliHouse/KTXP等)
- OOV率从25%降到3.8%，修复训练/推理token不一致问题
- 更新config.py默认vocab_size，修复build_vocab_from_data传递max_size
- 添加colab_train.py自动训练脚本
- 更新README训练说明和CUDA 12.6配置

Files changed (8) hide show

README.md +64 -10
colab_train.py +134 -0
config.py +1 -1
data/dmhy/vocab.json +0 -0
data/vocab.json +0 -0
model/vocab.json +0 -0
train.py +5 -4
vocab.json +0 -0

README.md CHANGED Viewed

@@ -30,7 +30,7 @@ The checkpoint in this repository is the DMHY weak-label fine-tuned regex-tokeni
 - Labels: BIO token labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, and `SPECIAL`
 - Tokenizer: custom regex/structure tokenizer implemented in `tokenizer.py`
 - Max sequence length: 64
-- Parameters: about 4M
 The model files are stored at the repository root so `BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")` can load the weights. Use `inference.py` for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.
@@ -38,12 +38,27 @@ The model files are stored at the repository root so `BertForTokenClassification
 Training data snapshots are published separately in [`ModerRAS/AnimeName`](https://huggingface.co/datasets/ModerRAS/AnimeName), and this repository includes it as a nested git submodule at `datasets/AnimeName`.
-Current DMHY export waterline:
-- Last exported `files.id`: `689304`
-- Next incremental export: `--min-id 689305`
-- Weak-labeled samples: `263042`
-- Mixed training samples: `363042`
 ## Evaluation
@@ -99,7 +114,32 @@ git submodule update --init --recursive
 ## Training
-Regenerate or export datasets:
 ```bash
 python data_generator.py --num-samples 100000
@@ -107,18 +147,32 @@ python dmhy_dataset.py --db D:/WorkSpace/Python/dmhy-parser/dmhy_anime.db --outp
 python mix_datasets.py --synthetic data/synthetic.jsonl --dmhy data/dmhy/dmhy_weak.jsonl --output data/dmhy/mixed_train.jsonl
 ```
-Fine-tune from the synthetic checkpoint or train from scratch:
 ```bash
-python train.py --data-file data/dmhy/mixed_train.jsonl --save-dir checkpoints/dmhy-finetune --init-model-dir checkpoints/final --epochs 1 --batch-size 128 --learning-rate 0.0003 --warmup-steps 300 --seed 42
 ```
-Export ONNX for MiruPlay Android assets:
 ```bash
 python export_onnx.py --model-dir checkpoints/dmhy-finetune/final --output exports/anime_filename_parser.onnx
 ```
 ## Repository Layout
 - `model.safetensors`, `config.json`, `vocab.json`: default fine-tuned model

 - Labels: BIO token labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, and `SPECIAL`
 - Tokenizer: custom regex/structure tokenizer implemented in `tokenizer.py`
 - Max sequence length: 64
+- Parameters: about 5M
 The model files are stored at the repository root so `BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")` can load the weights. Use `inference.py` for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.
 Training data snapshots are published separately in [`ModerRAS/AnimeName`](https://huggingface.co/datasets/ModerRAS/AnimeName), and this repository includes it as a nested git submodule at `datasets/AnimeName`.
+Current DMHY export waterline (from `datasets/AnimeName`):
+- Last exported `files.id`: `1675184`
+- Next incremental export: `--min-id 1675185`
+- Weak-labeled samples: `632002`
+- Mixed training samples: `732002`
+## Vocabulary
+The default `vocab.json` contains **8000 tokens** (up from 3000) built from frequency
+analysis of the full 632K DMHY weak-label dataset. Tokens not in the vocabulary
+become `[UNK]`, so larger vocabulary directly improves coverage:
+| Vocab size | Coverage | Model params |
+|------------|----------|-------------|
+| 3000 (old) | 90.4% | ~4.0M |
+| 8000 (current) | 96.2% | ~5.3M |
+Common fansub group names (`Snow`, `LoliHouse`, `DMG`, `KTXP`, `Sakurato`, etc.)
+and individual bracket characters (`[`, `]`, `(`, `)`) are included in the new
+vocabulary.
 ## Evaluation
 ## Training
+### Prerequisites (Windows / Local GPU)
+PyTorch 2.11+ with CUDA 12.6 is required for GPU training:
+```bash
+pip install torch --index-url https://download.pytorch.org/whl/cu126
+pip install -r requirements.txt
+```
+### Fine-tune with rebuilt vocabulary
+```bash
+python train.py --data-file datasets/AnimeName/dmhy_weak.jsonl \
+  --vocab-file datasets/AnimeName/vocab.json \
+  --save-dir checkpoints/dmhy-finetune \
+  --init-model-dir . \
+  --epochs 10 --batch-size 128 \
+  --learning-rate 0.0003 --warmup-steps 300 --seed 42
+```
+The model loads the old 3000-token checkpoint, `resize_token_embeddings()` adds
+5000 new randomly-initialized slots for the new vocabulary, and fine-tuning
+trains the full model. About 96% of token occurrences are now covered (vs 90%
+with the old 3000-token vocabulary).
+### Regenerate datasets from source
 ```bash
 python data_generator.py --num-samples 100000
 python mix_datasets.py --synthetic data/synthetic.jsonl --dmhy data/dmhy/dmhy_weak.jsonl --output data/dmhy/mixed_train.jsonl
 ```
+### Rebuild vocabulary (if needed)
 ```bash
+python -c "
+import json, collections
+tokens = collections.Counter()
+[ tokens.update(item['tokens']) for item in [json.loads(l) for l in open('datasets/AnimeName/dmhy_weak.jsonl')] if item ]
+vocab = {t:i for i,t in enumerate(['[PAD]','[UNK]','[CLS]','[SEP]'] + [t for t,_ in tokens.most_common(7996)])}
+json.dump(vocab, open('vocab.json','w'), ensure_ascii=False, indent=2)
+"
 ```
+### Export ONNX for MiruPlay Android
 ```bash
 python export_onnx.py --model-dir checkpoints/dmhy-finetune/final --output exports/anime_filename_parser.onnx
 ```
+---
+## Google Colab Training
+Upload and run [`colab_train.py`](colab_train.py) in a Colab GPU runtime.
+It will mount Google Drive, clone both repos, install dependencies, and run
+the full training pipeline. Checkpoints are saved to your Drive automatically.
 ## Repository Layout
 - `model.safetensors`, `config.json`, `vocab.json`: default fine-tuned model

colab_train.py ADDED Viewed

	@@ -0,0 +1,134 @@

+# -*- coding: utf-8 -*-
+"""AniFileBERT — Google Colab Training Script
+=============================================
+How to use:
+  1. Open https://colab.research.google.com/
+  2. File → Upload notebook → select this file, OR
+     Copy the entire content into a new code cell
+  3. Runtime → Change runtime type → T4 GPU
+  4. Run all
+What it does:
+  - Mounts Google Drive (for persistent checkpoints)
+  - Clones AniFileBERT repo + AnimeName dataset submodule
+  - Installs PyTorch + Transformers dependencies
+  - Runs training: fine-tune from current checkpoint with 8000-token vocab
+  - Saves final model to Drive
+Output:
+  - Checkpoints saved to: MyDrive/AniFileBERT/checkpoints/
+  - Final model at:       MyDrive/AniFileBERT/checkpoints/dmhy-finetune/final/
+"""
+import os
+import sys
+import subprocess
+import time
+def run(cmd, echo=True):
+    """Run a shell command and print output in real time."""
+    if echo:
+        print(f"\n$ {cmd}")
+    proc = subprocess.Popen(
+        cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
+        text=True, bufsize=1
+    )
+    for line in proc.stdout:
+        print(line, end="")
+    proc.wait()
+    if proc.returncode != 0:
+        raise RuntimeError(f"Command failed (exit code {proc.returncode}): {cmd}")
+    return proc.returncode
+# ── 1. Mount Google Drive ──────────────────────────────────────
+print("=" * 60)
+print("STEP 1: Mount Google Drive")
+print("=" * 60)
+from google.colab import drive
+drive.mount("/content/drive")
+DRIVE_ROOT = "/content/drive/MyDrive/AniFileBERT"
+os.makedirs(DRIVE_ROOT, exist_ok=True)
+print(f"Checkpoints will be saved to: {DRIVE_ROOT}")
+# ── 2. Clone repositories ──────────────────────────────────────
+print("\n" + "=" * 60)
+print("STEP 2: Clone AniFileBERT repository")
+print("=" * 60)
+REPO_DIR = "/content/AniFileBERT"
+if not os.path.isdir(REPO_DIR):
+    os.chdir("/content")
+    run("git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT")
+else:
+    print("Repository already exists, pulling latest...")
+    os.chdir(REPO_DIR)
+    run("git pull")
+    run("git submodule update --init --recursive")
+os.chdir(REPO_DIR)
+# ── 3. Install dependencies ────────────────────────────────────
+print("\n" + "=" * 60)
+print("STEP 3: Install dependencies")
+print("=" * 60)
+# Colab comes with PyTorch + CUDA pre-installed. Just install the extras.
+run("pip install transformers accelerate seqeval")
+# ── 4. Verify GPU ──────────────────────────────────────────────
+print("\n" + "=" * 60)
+print("STEP 4: Verify GPU")
+print("=" * 60)
+run("nvidia-smi 2>/dev/null || echo 'No GPU found — training will be slow on CPU'")
+run('python -c "import torch; print(f\"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}\")"')
+# ── 5. Verify vocab ────────────────────────────────────────────
+print("\n" + "=" * 60)
+print("STEP 5: Verify vocabulary")
+print("=" * 60)
+run('python -c "import json; v=json.load(open(\"vocab.json\")); print(f\"Vocab size: {len(v)} tokens\"); print(f\"Key tokens present: [={repr(\"[\" in v)}, ]={repr(\"]\" in v)}\" )"')
+# ── 6. Run training ────────────────────────────────────────────
+print("\n" + "=" * 60)
+print("STEP 6: Train model")
+print("=" * 60)
+# The 8000-token vocab is already in datasets/AnimeName/vocab.json.
+# The old checkpoint (3000-token embedding) gets resized automatically.
+SAVE_DIR = os.path.join(DRIVE_ROOT, "checkpoints", "dmhy-finetune")
+run(
+    f"python train.py "
+    f"--data-file datasets/AnimeName/dmhy_weak.jsonl "
+    f"--vocab-file datasets/AnimeName/vocab.json "
+    f"--save-dir {SAVE_DIR} "
+    f"--init-model-dir . "
+    f"--epochs 10 --batch-size 128 "
+    f"--learning-rate 0.0003 --warmup-steps 300 "
+    f"--seed 42 "
+    f"--no-shuffle"
+)
+# ── 7. Export ONNX ─────────────────────────────────────────────
+print("\n" + "=" * 60)
+print("STEP 7: Export ONNX (optional)")
+print("=" * 60)
+ONNX_OUT = os.path.join(SAVE_DIR, "..", "anime_filename_parser.onnx")
+run(
+    f"python export_onnx.py "
+    f"--model-dir {SAVE_DIR}/final "
+    f"--output {ONNX_OUT}"
+)
+# ── 8. Summary ─────────────────────────────────────────────────
+print("\n" + "=" * 60)
+print("DONE!")
+print("=" * 60)
+print(f"\nCheckpoints:  {SAVE_DIR}/")
+print(f"Final model:  {SAVE_DIR}/final/")
+print(f"ONNX export:  {ONNX_OUT}")
+print(f"\nAll files are on Google Drive — they persist across Colab sessions.")
+print(f"You can also download them from the Drive web UI.")

config.py CHANGED Viewed

@@ -42,7 +42,7 @@ class Config:
     max_seq_length: int = 64
     # Vocabulary (set dynamically from tokenizer)
-    vocab_size: int = 3000  # placeholder, overridden after tokenizer vocab is built
     # Special tokens
     pad_token: str = "[PAD]"

     max_seq_length: int = 64
     # Vocabulary (set dynamically from tokenizer)
+    vocab_size: int = 8000  # placeholder, overridden after tokenizer vocab is built
     # Special tokens
     pad_token: str = "[PAD]"

data/dmhy/vocab.json CHANGED Viewed

The diff for this file is too large to render. See raw diff

data/vocab.json CHANGED Viewed

The diff for this file is too large to render. See raw diff

model/vocab.json CHANGED Viewed

The diff for this file is too large to render. See raw diff

train.py CHANGED Viewed

@@ -93,13 +93,14 @@ def resolve_vocab_path(data_file: str, tokenizer_variant: str, explicit_path: Op
     return os.path.join(os.path.dirname(data_file), name)
-def build_vocab_from_data(data: List[Dict], tokenizer: AnimeTokenizer, vocab_path: str) -> None:
     token_lists: List[List[str]] = []
     for item in data:
         tokens, labels = align_tokens_for_tokenizer(item["tokens"], item["labels"], tokenizer)
         token_lists.append(tokens)
-    tokenizer.build_vocab(token_lists)
     save_dir = os.path.dirname(vocab_path) or "."
     os.makedirs(save_dir, exist_ok=True)
     with open(vocab_path, "w", encoding="utf-8") as f:
@@ -145,8 +146,8 @@ def main():
     vocab_path = resolve_vocab_path(config.data_file, args.tokenizer, args.vocab_file)
     tokenizer = create_tokenizer(args.tokenizer)
     if args.rebuild_vocab or not os.path.isfile(vocab_path):
-        print(f"  Building {args.tokenizer} vocab: {vocab_path}")
-        build_vocab_from_data(all_data, tokenizer, vocab_path)
     tokenizer = create_tokenizer(args.tokenizer, vocab_file=vocab_path)
     print(f"  Variant: {args.tokenizer}")
     print(f"  Vocab size: {tokenizer.vocab_size}")

     return os.path.join(os.path.dirname(data_file), name)
+def build_vocab_from_data(data: List[Dict], tokenizer: AnimeTokenizer, vocab_path: str,
+                         max_size: Optional[int] = None) -> None:
     token_lists: List[List[str]] = []
     for item in data:
         tokens, labels = align_tokens_for_tokenizer(item["tokens"], item["labels"], tokenizer)
         token_lists.append(tokens)
+    tokenizer.build_vocab(token_lists, max_size=max_size)
     save_dir = os.path.dirname(vocab_path) or "."
     os.makedirs(save_dir, exist_ok=True)
     with open(vocab_path, "w", encoding="utf-8") as f:
     vocab_path = resolve_vocab_path(config.data_file, args.tokenizer, args.vocab_file)
     tokenizer = create_tokenizer(args.tokenizer)
     if args.rebuild_vocab or not os.path.isfile(vocab_path):
+        print(f"  Building {args.tokenizer} vocab: {vocab_path} (max_size={config.vocab_size})")
+        build_vocab_from_data(all_data, tokenizer, vocab_path, max_size=config.vocab_size)
     tokenizer = create_tokenizer(args.tokenizer, vocab_file=vocab_path)
     print(f"  Variant: {args.tokenizer}")
     print(f"  Vocab size: {tokenizer.vocab_size}")

vocab.json CHANGED Viewed

The diff for this file is too large to render. See raw diff