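Rebuild vocabulary: count token frequencies over the 632K-sample dmhy_weak.jsonl and keep the top 8000 tokens, covering 96.2%
- Expand the vocabulary from 3000 to 8000 tokens, adding '[', ']', and common fansub group names (Snow, LoliHouse, KTXP, etc.)
- OOV rate drops from 25% to 3.8%; fixes the token mismatch between training and inference
- Update the default vocab_size in config.py; fix build_vocab_from_data so that it passes max_size
- Add the colab_train.py automated training script
- Update the README training instructions and CUDA 12.6 setup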
- README.md +64 -10
- colab_train.py +134 -0
- config.py +1 -1
- data/dmhy/vocab.json +0 -0
- data/vocab.json +0 -0
- model/vocab.json +0 -0
- train.py +5 -4
- vocab.json +0 -0
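The coverage figure quoted in the commit message is the share of token occurrences that fall inside the kept vocabulary. A minimal sketch of that calculation follows, assuming (as the README snippet in this commit does) that four special tokens are reserved out of the vocabulary budget. The function name `top_k_coverage` and the sample token lists are hypothetical and only serve as illustration; this is not the repository's own code.

```python
from collections import Counter

# Reserved entries, matching the special tokens used in the README snippet.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]


def top_k_coverage(token_lists, vocab_size):
    """Fraction of token occurrences covered by the most frequent tokens.

    `vocab_size` includes the reserved special tokens, so only
    `vocab_size - len(SPECIAL_TOKENS)` slots are left for data tokens.
    """
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    keep = max(vocab_size - len(SPECIAL_TOKENS), 0)
    covered = sum(count for _, count in counts.most_common(keep))
    return covered / total


# Hypothetical, hand-written samples (not taken from dmhy_weak.jsonl).
samples = [
    ["[", "LoliHouse", "]", "Title", "-", "01", "[", "1080p", "]"],
    ["[", "Snow", "]", "Title", "-", "02", "[", "720p", "]"],
]

for size in (6, 8, 14):
    coverage = top_k_coverage(samples, size)
    print(f"vocab_size={size}: coverage={coverage:.1%}, OOV={1 - coverage:.1%}")
```

The OOV rate is simply one minus the coverage under this definition, which is why a larger vocabulary lowers it.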
README.md
CHANGED
|
@@ -30,7 +30,7 @@ The checkpoint in this repository is the DMHY weak-label fine-tuned regex-tokeni
|
|
| 30 |
- Labels: BIO token labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, and `SPECIAL`
|
| 31 |
- Tokenizer: custom regex/structure tokenizer implemented in `tokenizer.py`
|
| 32 |
- Max sequence length: 64
|
| 33 |
-
- Parameters: about
|
| 34 |
|
| 35 |
The model files are stored at the repository root so `BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")` can load the weights. Use `inference.py` for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.
|
| 36 |
|
|
@@ -38,12 +38,27 @@ The model files are stored at the repository root so `BertForTokenClassification
|
|
| 38 |
|
| 39 |
Training data snapshots are published separately in [`ModerRAS/AnimeName`](https://huggingface.co/datasets/ModerRAS/AnimeName), and this repository includes it as a nested git submodule at `datasets/AnimeName`.
|
| 40 |
|
| 41 |
-
Current DMHY export waterline:
|
| 42 |
|
| 43 |
-
- Last exported `files.id`: `
|
| 44 |
-
- Next incremental export: `--min-id
|
| 45 |
-
- Weak-labeled samples: `
|
| 46 |
-
- Mixed training samples: `
|
| 47 |
|
| 48 |
## Evaluation
|
| 49 |
|
|
@@ -99,7 +114,32 @@ git submodule update --init --recursive
|
|
| 99 |
|
| 100 |
## Training
|
| 101 |
|
| 102 |
-
|
| 103 |
|
| 104 |
```bash
|
| 105 |
python data_generator.py --num-samples 100000
|
|
@@ -107,18 +147,32 @@ python dmhy_dataset.py --db D:/WorkSpace/Python/dmhy-parser/dmhy_anime.db --outp
|
|
| 107 |
python mix_datasets.py --synthetic data/synthetic.jsonl --dmhy data/dmhy/dmhy_weak.jsonl --output data/dmhy/mixed_train.jsonl
|
| 108 |
```
|
| 109 |
|
| 110 |
-
|
| 111 |
|
| 112 |
```bash
|
| 113 |
-
python
|
| 114 |
```
|
| 115 |
|
| 116 |
-
Export ONNX for MiruPlay Android
|
| 117 |
|
| 118 |
```bash
|
| 119 |
python export_onnx.py --model-dir checkpoints/dmhy-finetune/final --output exports/anime_filename_parser.onnx
|
| 120 |
```
|
| 121 |
|
| 122 |
## Repository Layout
|
| 123 |
|
| 124 |
- `model.safetensors`, `config.json`, `vocab.json`: default fine-tuned model
|
|
|
|
| 30 |
- Labels: BIO token labels for `TITLE`, `SEASON`, `EPISODE`, `GROUP`, `RESOLUTION`, `SOURCE`, and `SPECIAL`
|
| 31 |
- Tokenizer: custom regex/structure tokenizer implemented in `tokenizer.py`
|
| 32 |
- Max sequence length: 64
|
| 33 |
+
- Parameters: about 5M
|
| 34 |
|
| 35 |
The model files are stored at the repository root so `BertForTokenClassification.from_pretrained("ModerRAS/AniFileBERT")` can load the weights. Use `inference.py` for end-to-end parsing because the tokenizer is custom rather than a standard WordPiece tokenizer.
|
| 36 |
|
|
|
|
| 38 |
|
| 39 |
Training data snapshots are published separately in [`ModerRAS/AnimeName`](https://huggingface.co/datasets/ModerRAS/AnimeName), and this repository includes it as a nested git submodule at `datasets/AnimeName`.
|
| 40 |
|
| 41 |
+
Current DMHY export waterline (from `datasets/AnimeName`):
|
| 42 |
|
| 43 |
+
- Last exported `files.id`: `1675184`
|
| 44 |
+
- Next incremental export: `--min-id 1675185`
|
| 45 |
+
- Weak-labeled samples: `632002`
|
| 46 |
+
- Mixed training samples: `732002`
|
| 47 |
+
|
| 48 |
+
## Vocabulary
|
| 49 |
+
|
| 50 |
+
The default `vocab.json` contains **8000 tokens** (up from 3000) built from frequency
|
| 51 |
+
analysis of the full 632K DMHY weak-label dataset. Tokens not in the vocabulary
|
| 52 |
+
become `[UNK]`, so a larger vocabulary directly improves coverage:
|
| 53 |
+
|
| 54 |
+
| Vocab size | Coverage | Model params |
|
| 55 |
+
|------------|----------|-------------|
|
| 56 |
+
| 3000 (old) | 90.4% | ~4.0M |
|
| 57 |
+
| 8000 (current) | 96.2% | ~5.3M |
|
| 58 |
+
|
| 59 |
+
Common fansub group names (`Snow`, `LoliHouse`, `DMG`, `KTXP`, `Sakurato`, etc.)
|
| 60 |
+
and individual bracket characters (`[`, `]`, `(`, `)`) are included in the new
|
| 61 |
+
vocabulary.
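
The parameter counts in the table can be sanity-checked with a little arithmetic. In a BERT token-classification model only the word-embedding matrix scales with vocabulary size, so each added token costs one embedding row. The sketch below derives the implied row width from the table's rounded figures; the hidden size of 256 is an assumption inferred from those numbers, not a value stated in this README.

```python
def implied_hidden_size(old_vocab, new_vocab, old_params, new_params):
    """Parameters added per new vocabulary entry, i.e. one embedding row."""
    added_rows = new_vocab - old_vocab
    return (new_params - old_params) / added_rows


# Figures from the table above (both parameter counts are rounded).
estimate = implied_hidden_size(3000, 8000, 4.0e6, 5.3e6)
print(f"Implied parameters per added token: {estimate:.0f}")

# Assumption: a hidden size of 256 would account for about 1.28M of the growth.
print(f"Embedding growth at hidden size 256: {5000 * 256:,} parameters")
```

The two results agree to within the rounding of the table, which suggests that the growth from ~4.0M to ~5.3M parameters comes almost entirely from the larger embedding table.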
|
| 62 |
|
| 63 |
## Evaluation
|
| 64 |
|
|
|
|
| 114 |
|
| 115 |
## Training
|
| 116 |
|
| 117 |
+
### Prerequisites (Windows / Local GPU)
|
| 118 |
+
|
| 119 |
+
PyTorch 2.11+ with CUDA 12.6 is required for GPU training:
|
| 120 |
+
|
| 121 |
+
```bash
|
| 122 |
+
pip install torch --index-url https://download.pytorch.org/whl/cu126
|
| 123 |
+
pip install -r requirements.txt
|
| 124 |
+
```
|
| 125 |
+
|
| 126 |
+
### Fine-tune with rebuilt vocabulary
|
| 127 |
+
|
| 128 |
+
```bash
|
| 129 |
+
python train.py --data-file datasets/AnimeName/dmhy_weak.jsonl \
|
| 130 |
+
--vocab-file datasets/AnimeName/vocab.json \
|
| 131 |
+
--save-dir checkpoints/dmhy-finetune \
|
| 132 |
+
--init-model-dir . \
|
| 133 |
+
--epochs 10 --batch-size 128 \
|
| 134 |
+
--learning-rate 0.0003 --warmup-steps 300 --seed 42
|
| 135 |
+
```
|
| 136 |
+
|
| 137 |
+
The model loads the old 3000-token checkpoint; `resize_token_embeddings()` then adds
|
| 138 |
+
5000 new randomly initialized slots for the new vocabulary, and fine-tuning
|
| 139 |
+
trains the full model. About 96% of token occurrences are now covered (vs. 90%
|
| 140 |
+
with the old 3000-token vocabulary).
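
Conceptually, resizing an embedding table keeps the existing rows at their original indices and appends freshly initialized rows for the new ids. The following dependency-free sketch illustrates that idea with plain lists. It is not the Transformers implementation, and the helper name `resize_embedding_rows`, the seed, and the initialization scale of 0.02 are assumptions chosen for illustration.

```python
import random


def resize_embedding_rows(rows, new_size, hidden_size, seed=42, std=0.02):
    """Return a copy of `rows` grown (or truncated) to `new_size` rows.

    Existing rows keep their index; new rows are drawn from a small
    normal distribution, standing in for random initialization.
    """
    rng = random.Random(seed)
    resized = [list(row) for row in rows[:new_size]]
    while len(resized) < new_size:
        resized.append([rng.gauss(0.0, std) for _ in range(hidden_size)])
    return resized


# Toy example: 3 "trained" rows grown to 8 rows of width 4.
old_rows = [[float(i)] * 4 for i in range(3)]
new_rows = resize_embedding_rows(old_rows, new_size=8, hidden_size=4)

print(len(new_rows))               # number of rows after resizing
print(new_rows[:3] == old_rows)    # trained rows are preserved by index
```

Because rows are preserved by index, the copied embeddings stay meaningful only for tokens whose ids are the same in the old and new `vocab.json`; it is worth checking that before relying on the old weights.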
|
| 141 |
+
|
| 142 |
+
### Regenerate datasets from source
|
| 143 |
|
| 144 |
```bash
|
| 145 |
python data_generator.py --num-samples 100000
|
|
|
|
| 147 |
python mix_datasets.py --synthetic data/synthetic.jsonl --dmhy data/dmhy/dmhy_weak.jsonl --output data/dmhy/mixed_train.jsonl
|
| 148 |
```
|
| 149 |
|
| 150 |
+
### Rebuild vocabulary (if needed)
|
| 151 |
|
| 152 |
```bash
|
| 153 |
+
python -c "
|
| 154 |
+
import json, collections
|
| 155 |
+
tokens = collections.Counter()
|
| 156 |
+
[ tokens.update(item['tokens']) for item in [json.loads(l) for l in open('datasets/AnimeName/dmhy_weak.jsonl', encoding='utf-8')] if item ]
|
| 157 |
+
vocab = {t:i for i,t in enumerate(['[PAD]','[UNK]','[CLS]','[SEP]'] + [t for t,_ in tokens.most_common(7996)])}
|
| 158 |
+
json.dump(vocab, open('vocab.json', 'w', encoding='utf-8'), ensure_ascii=False, indent=2)
|
| 159 |
+
"
|
| 160 |
```
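
The inline command above can also be unpacked into a small, more readable script. This is a sketch under the same assumptions as the one-liner (JSONL records with a `tokens` list, four special tokens first, most frequent tokens after them); it does not reproduce `tokenizer.build_vocab` from this repository, and the `build_vocab` and `encode` helpers as well as the demo records are hypothetical.

```python
import json
from collections import Counter

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]"]


def build_vocab(jsonl_lines, max_size=8000):
    """Map tokens to ids: special tokens first, then the most frequent tokens."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        item = json.loads(line)
        if item:
            counts.update(item["tokens"])
    keep = max(max_size - len(SPECIAL_TOKENS), 0)
    tokens = SPECIAL_TOKENS + [tok for tok, _ in counts.most_common(keep)]
    return {tok: idx for idx, tok in enumerate(tokens)}


def encode(tokens, vocab):
    """Look up ids, falling back to [UNK] for out-of-vocabulary tokens."""
    unk_id = vocab["[UNK]"]
    return [vocab.get(tok, unk_id) for tok in tokens]


# Hypothetical records standing in for lines of dmhy_weak.jsonl.
demo_lines = [
    '{"tokens": ["[", "Snow", "]", "Title", "01"]}',
    '{"tokens": ["[", "KTXP", "]", "Title", "02"]}',
]
demo_vocab = build_vocab(demo_lines, max_size=7)
print(demo_vocab)
print(encode(["[", "Title", "]", "03"], demo_vocab))
```

For the real dataset, the same function can be fed an open file handle, for example `with open(path, encoding="utf-8") as f: vocab = build_vocab(f)`. Opening files with an explicit UTF-8 encoding matters on Windows, where the default encoding may not handle Japanese or Chinese tokens.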
|
| 161 |
|
| 162 |
+
### Export ONNX for MiruPlay Android
|
| 163 |
|
| 164 |
```bash
|
| 165 |
python export_onnx.py --model-dir checkpoints/dmhy-finetune/final --output exports/anime_filename_parser.onnx
|
| 166 |
```
|
| 167 |
|
| 168 |
+
---
|
| 169 |
+
|
| 170 |
+
## Google Colab Training
|
| 171 |
+
|
| 172 |
+
Upload and run [`colab_train.py`](colab_train.py) in a Colab GPU runtime.
|
| 173 |
+
It will mount Google Drive, clone both repos, install dependencies, and run
|
| 174 |
+
the full training pipeline. Checkpoints are saved to your Drive automatically.
|
| 175 |
+
|
| 176 |
## Repository Layout
|
| 177 |
|
| 178 |
- `model.safetensors`, `config.json`, `vocab.json`: default fine-tuned model
|
colab_train.py
ADDED
|
@@ -0,0 +1,134 @@
|
|
| 1 |
+
# -*- coding: utf-8 -*-
|
| 2 |
+
"""AniFileBERT — Google Colab Training Script
|
| 3 |
+
=============================================
|
| 4 |
+
|
| 5 |
+
How to use:
|
| 6 |
+
1. Open https://colab.research.google.com/
|
| 7 |
+
2. File → Upload notebook → select this file, OR
|
| 8 |
+
Copy the entire content into a new code cell
|
| 9 |
+
3. Runtime → Change runtime type → T4 GPU
|
| 10 |
+
4. Run all
|
| 11 |
+
|
| 12 |
+
What it does:
|
| 13 |
+
- Mounts Google Drive (for persistent checkpoints)
|
| 14 |
+
- Clones AniFileBERT repo + AnimeName dataset submodule
|
| 15 |
+
- Installs PyTorch + Transformers dependencies
|
| 16 |
+
- Runs training: fine-tune from current checkpoint with 8000-token vocab
|
| 17 |
+
- Saves final model to Drive
|
| 18 |
+
|
| 19 |
+
Output:
|
| 20 |
+
- Checkpoints saved to: MyDrive/AniFileBERT/checkpoints/
|
| 21 |
+
- Final model at: MyDrive/AniFileBERT/checkpoints/dmhy-finetune/final/
|
| 22 |
+
"""
|
| 23 |
+
|
| 24 |
+
import os
|
| 25 |
+
import sys
|
| 26 |
+
import subprocess
|
| 27 |
+
import time
|
| 28 |
+
|
| 29 |
+
|
| 30 |
+
def run(cmd, echo=True):
|
| 31 |
+
    """Run a shell command and print output in real time."""
|
| 32 |
+
    if echo:
|
| 33 |
+
        print(f"\n$ {cmd}")
|
| 34 |
+
    proc = subprocess.Popen(
|
| 35 |
+
        cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
|
| 36 |
+
        text=True, bufsize=1
|
| 37 |
+
    )
|
| 38 |
+
    for line in proc.stdout:
|
| 39 |
+
        print(line, end="")
|
| 40 |
+
    proc.wait()
|
| 41 |
+
    if proc.returncode != 0:
|
| 42 |
+
        raise RuntimeError(f"Command failed (exit code {proc.returncode}): {cmd}")
|
| 43 |
+
    return proc.returncode
|
| 44 |
+
|
| 45 |
+
|
| 46 |
+
# ── 1. Mount Google Drive ──────────────────────────────────────
|
| 47 |
+
print("=" * 60)
|
| 48 |
+
print("STEP 1: Mount Google Drive")
|
| 49 |
+
print("=" * 60)
|
| 50 |
+
from google.colab import drive
|
| 51 |
+
drive.mount("/content/drive")
|
| 52 |
+
|
| 53 |
+
DRIVE_ROOT = "/content/drive/MyDrive/AniFileBERT"
|
| 54 |
+
os.makedirs(DRIVE_ROOT, exist_ok=True)
|
| 55 |
+
print(f"Checkpoints will be saved to: {DRIVE_ROOT}")
|
| 56 |
+
|
| 57 |
+
# ── 2. Clone repositories ──────────────────────────────────────
|
| 58 |
+
print("\n" + "=" * 60)
|
| 59 |
+
print("STEP 2: Clone AniFileBERT repository")
|
| 60 |
+
print("=" * 60)
|
| 61 |
+
|
| 62 |
+
REPO_DIR = "/content/AniFileBERT"
|
| 63 |
+
if not os.path.isdir(REPO_DIR):
|
| 64 |
+
    os.chdir("/content")
|
| 65 |
+
    run("git clone --recursive https://huggingface.co/ModerRAS/AniFileBERT")
|
| 66 |
+
else:
|
| 67 |
+
    print("Repository already exists, pulling latest...")
|
| 68 |
+
    os.chdir(REPO_DIR)
|
| 69 |
+
    run("git pull")
|
| 70 |
+
    run("git submodule update --init --recursive")
|
| 71 |
+
|
| 72 |
+
os.chdir(REPO_DIR)
|
| 73 |
+
|
| 74 |
+
# ── 3. Install dependencies ────────────────────────────────────
|
| 75 |
+
print("\n" + "=" * 60)
|
| 76 |
+
print("STEP 3: Install dependencies")
|
| 77 |
+
print("=" * 60)
|
| 78 |
+
# Colab comes with PyTorch + CUDA pre-installed. Just install the extras.
|
| 79 |
+
run("pip install transformers accelerate seqeval")
|
| 80 |
+
|
| 81 |
+
# ── 4. Verify GPU ──────────────────────────────────────────────
|
| 82 |
+
print("\n" + "=" * 60)
|
| 83 |
+
print("STEP 4: Verify GPU")
|
| 84 |
+
print("=" * 60)
|
| 85 |
+
run("nvidia-smi 2>/dev/null || echo 'No GPU found — training will be slow on CPU'")
|
| 86 |
+
run("python -c \"import torch; print('PyTorch', torch.__version__, 'CUDA available:', torch.cuda.is_available())\"")
|
| 87 |
+
|
| 88 |
+
# ── 5. Verify vocab ────────────────────────────────────────────
|
| 89 |
+
print("\n" + "=" * 60)
|
| 90 |
+
print("STEP 5: Verify vocabulary")
|
| 91 |
+
print("=" * 60)
|
| 92 |
+
run("python -c \"import json; v = json.load(open('vocab.json', encoding='utf-8')); print('Vocab size:', len(v), 'tokens'); print('Key tokens present: [ =', '[' in v, '| ] =', ']' in v)\"")
|
| 93 |
+
|
| 94 |
+
# ── 6. Run training ────────────────────────────────────────────
|
| 95 |
+
print("\n" + "=" * 60)
|
| 96 |
+
print("STEP 6: Train model")
|
| 97 |
+
print("=" * 60)
|
| 98 |
+
|
| 99 |
+
# The 8000-token vocab is already in datasets/AnimeName/vocab.json.
|
| 100 |
+
# The old checkpoint (3000-token embedding) gets resized automatically.
|
| 101 |
+
SAVE_DIR = os.path.join(DRIVE_ROOT, "checkpoints", "dmhy-finetune")
|
| 102 |
+
|
| 103 |
+
run(
|
| 104 |
+
    f"python train.py "
|
| 105 |
+
    f"--data-file datasets/AnimeName/dmhy_weak.jsonl "
|
| 106 |
+
    f"--vocab-file datasets/AnimeName/vocab.json "
|
| 107 |
+
    f"--save-dir {SAVE_DIR} "
|
| 108 |
+
    f"--init-model-dir . "
|
| 109 |
+
    f"--epochs 10 --batch-size 128 "
|
| 110 |
+
    f"--learning-rate 0.0003 --warmup-steps 300 "
|
| 111 |
+
    f"--seed 42 "
|
| 112 |
+
    f"--no-shuffle"
|
| 113 |
+
)
|
| 114 |
+
|
| 115 |
+
# ── 7. Export ONNX ─────────────────────────────────────────────
|
| 116 |
+
print("\n" + "=" * 60)
|
| 117 |
+
print("STEP 7: Export ONNX (optional)")
|
| 118 |
+
print("=" * 60)
|
| 119 |
+
ONNX_OUT = os.path.join(SAVE_DIR, "..", "anime_filename_parser.onnx")
|
| 120 |
+
run(
|
| 121 |
+
    f"python export_onnx.py "
|
| 122 |
+
    f"--model-dir {SAVE_DIR}/final "
|
| 123 |
+
    f"--output {ONNX_OUT}"
|
| 124 |
+
)
|
| 125 |
+
|
| 126 |
+
# ── 8. Summary ─────────────────────────────────────────────────
|
| 127 |
+
print("\n" + "=" * 60)
|
| 128 |
+
print("DONE!")
|
| 129 |
+
print("=" * 60)
|
| 130 |
+
print(f"\nCheckpoints: {SAVE_DIR}/")
|
| 131 |
+
print(f"Final model: {SAVE_DIR}/final/")
|
| 132 |
+
print(f"ONNX export: {ONNX_OUT}")
|
| 133 |
+
print(f"\nAll files are on Google Drive — they persist across Colab sessions.")
|
| 134 |
+
print(f"You can also download them from the Drive web UI.")
|
config.py
CHANGED
|
@@ -42,7 +42,7 @@ class Config:
|
|
| 42 |
    max_seq_length: int = 64
|
| 43 |
|
| 44 |
    # Vocabulary (set dynamically from tokenizer)
|
| 45 |
-
    vocab_size: int =
|
| 46 |
|
| 47 |
    # Special tokens
|
| 48 |
    pad_token: str = "[PAD]"
|
|
|
|
| 42 |
    max_seq_length: int = 64
|
| 43 |
|
| 44 |
    # Vocabulary (set dynamically from tokenizer)
|
| 45 |
+
    vocab_size: int = 8000  # placeholder, overridden after tokenizer vocab is built
|
| 46 |
|
| 47 |
    # Special tokens
|
| 48 |
    pad_token: str = "[PAD]"
|
data/dmhy/vocab.json
CHANGED
|
The diff for this file is too large to render.
See raw diff
|
|
|
data/vocab.json
CHANGED
|
The diff for this file is too large to render.
See raw diff
|
|
|
model/vocab.json
CHANGED
|
The diff for this file is too large to render.
See raw diff
|
|
|
train.py
CHANGED
|
@@ -93,13 +93,14 @@ def resolve_vocab_path(data_file: str, tokenizer_variant: str, explicit_path: Op
|
|
| 93 |
    return os.path.join(os.path.dirname(data_file), name)
|
| 94 |
|
| 95 |
|
| 96 |
-
def build_vocab_from_data(data: List[Dict], tokenizer: AnimeTokenizer, vocab_path: str
|
|
|
|
| 97 |
    token_lists: List[List[str]] = []
|
| 98 |
    for item in data:
|
| 99 |
        tokens, labels = align_tokens_for_tokenizer(item["tokens"], item["labels"], tokenizer)
|
| 100 |
        token_lists.append(tokens)
|
| 101 |
|
| 102 |
-
    tokenizer.build_vocab(token_lists)
|
| 103 |
    save_dir = os.path.dirname(vocab_path) or "."
|
| 104 |
    os.makedirs(save_dir, exist_ok=True)
|
| 105 |
    with open(vocab_path, "w", encoding="utf-8") as f:
|
|
@@ -145,8 +146,8 @@ def main():
|
|
| 145 |
    vocab_path = resolve_vocab_path(config.data_file, args.tokenizer, args.vocab_file)
|
| 146 |
    tokenizer = create_tokenizer(args.tokenizer)
|
| 147 |
    if args.rebuild_vocab or not os.path.isfile(vocab_path):
|
| 148 |
-
        print(f" Building {args.tokenizer} vocab: {vocab_path}")
|
| 149 |
-
        build_vocab_from_data(all_data, tokenizer, vocab_path)
|
| 150 |
    tokenizer = create_tokenizer(args.tokenizer, vocab_file=vocab_path)
|
| 151 |
    print(f" Variant: {args.tokenizer}")
|
| 152 |
    print(f" Vocab size: {tokenizer.vocab_size}")
|
|
|
|
| 93 |
    return os.path.join(os.path.dirname(data_file), name)
|
| 94 |
|
| 95 |
|
| 96 |
+
def build_vocab_from_data(data: List[Dict], tokenizer: AnimeTokenizer, vocab_path: str,
|
| 97 |
+
                          max_size: Optional[int] = None) -> None:
|
| 98 |
    token_lists: List[List[str]] = []
|
| 99 |
    for item in data:
|
| 100 |
        tokens, labels = align_tokens_for_tokenizer(item["tokens"], item["labels"], tokenizer)
|
| 101 |
        token_lists.append(tokens)
|
| 102 |
|
| 103 |
+
    tokenizer.build_vocab(token_lists, max_size=max_size)
|
| 104 |
    save_dir = os.path.dirname(vocab_path) or "."
|
| 105 |
    os.makedirs(save_dir, exist_ok=True)
|
| 106 |
    with open(vocab_path, "w", encoding="utf-8") as f:
|
|
|
|
| 146 |
    vocab_path = resolve_vocab_path(config.data_file, args.tokenizer, args.vocab_file)
|
| 147 |
    tokenizer = create_tokenizer(args.tokenizer)
|
| 148 |
    if args.rebuild_vocab or not os.path.isfile(vocab_path):
|
| 149 |
+
        print(f" Building {args.tokenizer} vocab: {vocab_path} (max_size={config.vocab_size})")
|
| 150 |
+
        build_vocab_from_data(all_data, tokenizer, vocab_path, max_size=config.vocab_size)
|
| 151 |
    tokenizer = create_tokenizer(args.tokenizer, vocab_file=vocab_path)
|
| 152 |
    print(f" Variant: {args.tokenizer}")
|
| 153 |
    print(f" Vocab size: {tokenizer.vocab_size}")
|
vocab.json
CHANGED
|
The diff for this file is too large to render.
See raw diff
|
|
|