---
license: mit
datasets:
  - Josephgflowers/Finance-Instruct-500k
language:
  - en
base_model:
  - openai-community/gpt2
library_name: transformers
tags:
  - finance
  - banking
  - tokeniser
---

# Finance-Instruct-Tokenizer

## Model Description

This is a custom tokenizer built by retraining the GPT-2 byte-level BPE tokenizer on the Josephgflowers/Finance-Instruct-500k dataset.
It is optimized for financial and economic instruction-based language tasks, including question answering, summarization, and conversational agents in the finance domain.

**Key Features**

- Vocabulary size: 25,000 tokens
- Domain-specific token coverage for finance, banking, and investment terminology
- Compatible with GPT-2 and other models that use byte-level BPE tokenization
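As a quick smoke test, the tokenizer can be loaded and applied like any Hugging Face fast tokenizer. The snippet below is a sketch: the published Hub repo id for this tokenizer is not stated in this card, so it loads the `gpt2` base as a stand-in; substitute the actual repo id or a local save directory.

```python
from transformers import AutoTokenizer

# Stand-in: load the GPT-2 base tokenizer so the snippet runs as-is.
# Replace "gpt2" with this tokenizer's Hub repo id or local directory.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "The Federal Reserve raised interest rates by 25 basis points."
encoding = tokenizer(text)
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
print(tokens)
```

With the domain tokenizer loaded instead of the base, finance terms should split into fewer subword pieces.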

## Disclaimer

This tokenizer was created purely for experimental and personal use.
It is not production-ready and should not be used in critical systems or commercial applications.
Performance, safety, and bias have not been fully evaluated.


## Intended Uses & Limitations

**Intended uses**

- Tokenization for finance-related chatbots
- Preprocessing for financial text classification, question-answering, or summarization models
- Training or fine-tuning language models on domain-specific data

**Limitations**

- Domain-specific: may underperform on unrelated general-domain tasks
- Derived from the pre-trained GPT-2 tokenizer's configuration; not an entirely novel tokenization scheme
- The training data is primarily English; performance on other languages may be limited

## Training Data

The tokenizer was trained on text extracted from the Josephgflowers/Finance-Instruct-500k dataset, an English-language collection of roughly 500,000 finance-domain instruction examples covering tasks such as question answering, summarization, and conversation.
## Training Procedure

The tokenizer was trained using:

- `tokenizer.train_new_from_iterator()` on text extracted from the dataset
- Target vocabulary size: 25,000 tokens
- Special tokens inherited from the GPT-2 tokenizer
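The procedure above can be sketched as follows. This is a minimal, self-contained illustration: it uses a tiny in-memory stand-in corpus instead of streaming the actual Finance-Instruct-500k dataset (whose column names are not documented in this card), so the resulting vocabulary will be far smaller than the 25,000-token target.

```python
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("gpt2")  # fast tokenizer; required for retraining

# Tiny stand-in corpus. In practice, yield batches of text extracted from
# Josephgflowers/Finance-Instruct-500k via datasets.load_dataset(...).
corpus = [
    "The quarterly earnings report beat analyst expectations.",
    "Diversify a portfolio across equities, bonds, and commodities.",
    "The central bank held the benchmark interest rate steady.",
] * 100

def text_iterator(batch_size=64):
    # train_new_from_iterator expects an iterator over batches of strings.
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

# Target vocabulary size of 25,000; special tokens (e.g. <|endoftext|>)
# are carried over from the GPT-2 base automatically.
new_tok = base.train_new_from_iterator(text_iterator(), vocab_size=25_000)
new_tok.save_pretrained("finance-instruct-tokenizer")
```

`train_new_from_iterator` reuses the base tokenizer's algorithm and normalization but learns merges from the new corpus, which is why the result stays GPT-2-compatible.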

## Evaluation

Evaluation was performed qualitatively by checking token coverage and vocabulary composition on finance-related texts.
Quantitative evaluation can be performed by measuring:

- Token coverage rate on unseen financial documents
- Average tokens per sentence compared to the baseline GPT-2 tokenizer
- Perplexity when integrated with a fine-tuned language model
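The average-tokens-per-sentence comparison, for instance, can be computed as below. This is a sketch: because the custom tokenizer's repo id is not given in this card, the snippet compares the GPT-2 baseline against itself; point `candidate` at the retrained tokenizer's Hub path or local directory for a real comparison, where a domain tokenizer should produce fewer tokens per sentence on finance text.

```python
from transformers import AutoTokenizer

baseline = AutoTokenizer.from_pretrained("gpt2")
# Replace with the custom tokenizer's repo id or local path; the baseline
# is reused here only so the snippet is runnable as-is.
candidate = AutoTokenizer.from_pretrained("gpt2")

sentences = [
    "The ETF's expense ratio was lowered to 0.03 percent.",
    "Amortization schedules differ for fixed-rate and adjustable-rate mortgages.",
]

def avg_tokens_per_sentence(tok, texts):
    # Mean number of token ids produced per input sentence.
    return sum(len(tok(t)["input_ids"]) for t in texts) / len(texts)

print("baseline :", avg_tokens_per_sentence(baseline, sentences))
print("candidate:", avg_tokens_per_sentence(candidate, sentences))
```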

## Developed by

- Yakùl (tokenizer developer)

## Acknowledgements

Thanks to Josephgflowers for the Finance-Instruct-500k dataset and to the maintainers of the GPT-2 tokenizer (openai-community/gpt2) on which this work is based.
## License

This tokenizer is released under the MIT license. It is derived from the GPT-2 tokenizer and the Finance-Instruct-500k dataset; please review their respective licenses before any commercial use.