This repository contains a Byte-Pair Encoding (BPE) tokenizer fine-tuned on the Josephgflowers/Finance-Instruct-500k dataset, starting from the base tokenizer yakul259/english-bpe-tokenizer-60k.
It is tailored for financial domain text processing, capturing domain-specific terminology and patterns while maintaining efficient subword segmentation.
Key Features:
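- Post-processing with <cls> and <sep> special tokens.

Dataset split: train

Special tokens:
- <cls>: Classification token
- <sep>: Separator token
- <unk>: Unknown token
- <pad>: Padding token
- <mask>: Masking token (MLM tasks)

Post-processing templates:
- Single sequence: $A:0 <sep>:0 <cls>:2
- Sequence pair: $A:0 <sep>:0 $B:1 <sep>:1 <cls>:2

Trainer: BpeTrainer from the Hugging Face tokenizers library
Trainer special tokens: <cls>, <sep>, <unk>, <pad>, <mask>

License:
This tokenizer is released under the MIT License.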
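The two post-processing templates can be read as instructions for where the special tokens go and which type ID each position receives. The sketch below is a plain-Python illustration of that reading, not the tokenizers library's implementation: apply_template is a hypothetical helper, and the subword pieces are made up, so the real tokenizer's segmentation will differ. In the Hugging Face tokenizers library, templates in this format are normally passed to the TemplateProcessing post-processor.

```python
# Illustrative only: a plain-Python reading of the post-processing templates.
# apply_template is a hypothetical helper, not part of the tokenizers library.

SINGLE = "$A:0 <sep>:0 <cls>:2"
PAIR = "$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2"


def apply_template(template, seq_a, seq_b=None):
    """Expand a template into (tokens, type_ids).

    Each template piece is "<name>:<type_id>". "$A" and "$B" stand for the
    first and second input sequences; anything else is a literal special token.
    """
    sequences = {"$A": seq_a, "$B": seq_b}
    tokens, type_ids = [], []
    for piece in template.split():
        name, type_id = piece.rsplit(":", 1)
        if name in sequences:
            if sequences[name] is None:
                raise ValueError(f"template needs {name} but it was not given")
            expanded = sequences[name]
        else:
            expanded = [name]  # literal special token such as <sep> or <cls>
        tokens.extend(expanded)
        type_ids.extend([int(type_id)] * len(expanded))
    return tokens, type_ids


# Made-up subword pieces; the real tokenizer's segmentation will differ.
tokens, type_ids = apply_template(PAIR, ["net", "income"], ["rose"])
print(tokens)    # ['net', 'income', '<sep>', 'rose', '<sep>', '<cls>']
print(type_ids)  # [0, 0, 0, 1, 1, 2]
```

Note that <cls> is appended at the end of the sequence with its own type ID 2, rather than being placed at the start.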
If you use this tokenizer, please cite:
title = Finance BPE Tokenizer Fine-tuned on Finance-Instruct-500k
author = yakul259
year = 2025
publisher = Hugging Face