# AniFileBERT encoded dataset cache
Builds sharded `.npy` caches, split into train and eval sets, for `anifilebert.train`.

The tool mirrors the Python char-tokenizer training encoder for JSONL rows with
`filename`, `tokens`, and `labels` fields. That includes projecting source-token
labels onto characters and applying the structural media-label repairs used by training.
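
The token-to-character projection mentioned above can be pictured with a small Python sketch. This is an illustration under assumptions, not the tool's actual encoder: the function name `project_to_char_labels`, the `"O"` fallback label, the left-to-right substring matching, and the label names in the usage note are all hypothetical, and the media-label repairs are not modeled.

```python
from typing import List


def project_to_char_labels(
    filename: str,
    tokens: List[str],
    labels: List[str],
    outside: str = "O",  # assumed fallback label, not taken from label_schema.json
) -> List[str]:
    """Give each character of `filename` the label of the source token covering it."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    char_labels = [outside] * len(filename)
    cursor = 0  # search position; tokens are assumed to appear in order
    for token, label in zip(tokens, labels):
        start = filename.find(token, cursor)
        if start < 0:
            # Token not found verbatim: leave those characters as `outside`.
            continue
        for i in range(start, start + len(token)):
            char_labels[i] = label
        cursor = start + len(token)
    return char_labels
```

For a hypothetical row with `filename` `"[Sub] Show - 01.mkv"`, `tokens` `["[", "Sub", "]", "Show", "-", "01", ".mkv"]`, and `labels` `["O", "GROUP", "O", "TITLE", "O", "EPISODE", "O"]`, the characters of `Sub`, `Show`, and `01` receive `GROUP`, `TITLE`, and `EPISODE`, and every other character (including the spaces between tokens) stays `O`.
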
Example:
```powershell
cargo run --release --manifest-path tools\encoded_dataset_cache\Cargo.toml -- `
--input data\schema_v2_hard_focus_char_seed63.jsonl `
--vocab-file datasets\AnimeName\vocab.char.json `
--label-schema-file label_schema.json `
--output-dir data\encoded_cache\schema_v2_hard_focus_char_seed63 `
--max-length 128 `
--train-split 0.95 `
--seed 63 `
--shard-size 25000 `
--threads 16
```
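As a rough mental model of how `--train-split`, `--seed`, and `--shard-size` interact, the Python sketch below shuffles row indices with a seeded generator, cuts them at the train fraction, and chunks each side into shards. It is only an assumption about the general technique: the helper name `split_and_shard` is hypothetical, and the real tool's random number generator, rounding, and shard layout may differ, so this will not reproduce its exact split.

```python
import random
from typing import List, Sequence, Tuple


def split_and_shard(
    rows: Sequence,
    train_split: float,
    seed: int,
    shard_size: int,
) -> Tuple[List[list], List[list]]:
    """Return (train_shards, eval_shards) from a seeded, repeatable shuffle."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)  # same seed gives the same order

    cut = int(len(order) * train_split)  # rows before `cut` go to train

    def shards(indices: List[int]) -> List[list]:
        # Chunk the selected rows into lists of at most `shard_size` items.
        return [
            [rows[i] for i in indices[s:s + shard_size]]
            for s in range(0, len(indices), shard_size)
        ]

    return shards(order[:cut]), shards(order[cut:])
```

Under this model, rerunning with the same inputs and the same seed yields identical shards, which is the property a cache shared between runs depends on.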
Multiple JSONL inputs can be encoded into one deterministic train/eval split.
Pass `--input-repeat` once per `--input` when an augmentation source should be
upweighted:
```powershell
cargo run --release --manifest-path tools\encoded_dataset_cache\Cargo.toml -- `
--input data\schema_v2_hard_focus_char_seed63.jsonl `
--input data\schema_v2_synthetic_aug.jsonl `
--input-repeat 1 `
--input-repeat 3 `
--vocab-file datasets\AnimeName\vocab.char.json `
--label-schema-file label_schema.json `
--output-dir data\encoded_cache\schema_v2_hard_focus_seed63_synth_pathleaf_repeat3 `
--max-length 128 `
--train-split 0.995 `
--seed 63 `
--shard-size 25000 `
--threads 16
```
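One plausible reading of `--input-repeat` is that each input's rows are duplicated the given number of times before the combined set is split. The Python sketch below shows that reading only. The helper `expand_inputs` is hypothetical, and whether the tool repeats rows before or after splitting is an assumption, not something this README states.

```python
from typing import List, Sequence


def expand_inputs(inputs: Sequence[Sequence], repeats: Sequence[int]) -> List:
    """Concatenate the inputs, repeating input i exactly repeats[i] times."""
    if len(inputs) != len(repeats):
        raise ValueError("expected one repeat count per input")

    combined: List = []
    for rows, repeat in zip(inputs, repeats):
        if repeat < 1:
            raise ValueError("repeat counts must be at least 1")
        for _ in range(repeat):
            combined.extend(rows)  # a repeat of 3 adds the rows three times
    return combined
```

With repeat counts of 1 and 3, as in the command above, every row of the second input would appear three times in the combined data and so carry three times the weight of a row from the first input.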
Use the cache in training:
```powershell
.\.venv\Scripts\python.exe -m anifilebert.train `
--tokenizer char `
--data-file data\schema_v2_hard_focus_char_seed63.jsonl `
--vocab-file datasets\AnimeName\vocab.char.json `
--encoded-cache-dir data\encoded_cache\schema_v2_hard_focus_char_seed63 `
--max-seq-length 128
```