---
license: mit
datasets:
  - Josephgflowers/Finance-Instruct-500k
language:
  - en
base_model:
  - openai-community/gpt2
library_name: transformers
tags:
  - finance
  - banking
  - tokeniser
---

# Finance-Instruct-Tokenizer

## Model Description

This is a custom tokenizer built by retraining the GPT-2 byte-level BPE tokenizer on the Josephgflowers/Finance-Instruct-500k dataset.
It is optimized for financial and economic instruction-based language tasks, including question answering, summarization, and conversational agents in the finance domain.

**Key Features**

- Vocabulary size: 25,000 tokens
- Domain-specific token coverage for finance, banking, and investment terminology
- Compatible with GPT-2 and other models that use byte-level BPE tokenization
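As a quick smoke test, the tokenizer can be loaded and applied like any Hugging Face fast tokenizer. The snippet below is a sketch: the published Hub repo id for this tokenizer is not stated in this card, so it loads the `gpt2` base as a stand-in; substitute the actual repo id or a local save directory.

```python
from transformers import AutoTokenizer

# Stand-in: load the GPT-2 base tokenizer so the snippet runs as-is.
# Replace "gpt2" with this tokenizer's Hub repo id or local directory.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "The Federal Reserve raised interest rates by 25 basis points."
encoding = tokenizer(text)
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
print(tokens)
```

With the domain tokenizer loaded instead of the base, finance terms should split into fewer subword pieces.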

## Disclaimer

This tokenizer was created purely for experimental and personal use.
It is not production-ready and should not be used in critical systems or commercial applications.
Performance, safety, and bias have not been fully evaluated.


## Intended Uses & Limitations

**Intended uses**

- Tokenization for finance-related chatbots
- Preprocessing for financial text classification, question-answering, or summarization models
- Training or fine-tuning language models on domain-specific data

**Limitations**

- Domain-specific: may underperform on unrelated general-domain tasks
- Derived from the pre-trained GPT-2 tokenizer's configuration; not an entirely novel tokenization scheme
- The training data is primarily English; performance on other languages may be limited

## Training Data

The tokenizer was trained on text extracted from the Josephgflowers/Finance-Instruct-500k dataset, an English-language collection of roughly 500,000 finance-domain instruction examples covering tasks such as question answering, summarization, and conversation.
## Training Procedure

The tokenizer was trained using:

- `tokenizer.train_new_from_iterator()` on text extracted from the dataset
- Target vocabulary size: 25,000 tokens
- Special tokens inherited from the GPT-2 tokenizer
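The procedure above can be sketched as follows. This is a minimal, self-contained illustration: it uses a tiny in-memory stand-in corpus instead of streaming the actual Finance-Instruct-500k dataset (whose column names are not documented in this card), so the resulting vocabulary will be far smaller than the 25,000-token target.

```python
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("gpt2")  # fast tokenizer; required for retraining

# Tiny stand-in corpus. In practice, yield batches of text extracted from
# Josephgflowers/Finance-Instruct-500k via datasets.load_dataset(...).
corpus = [
    "The quarterly earnings report beat analyst expectations.",
    "Diversify a portfolio across equities, bonds, and commodities.",
    "The central bank held the benchmark interest rate steady.",
] * 100

def text_iterator(batch_size=64):
    # train_new_from_iterator expects an iterator over batches of strings.
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

# Target vocabulary size of 25,000; special tokens (e.g. <|endoftext|>)
# are carried over from the GPT-2 base automatically.
new_tok = base.train_new_from_iterator(text_iterator(), vocab_size=25_000)
new_tok.save_pretrained("finance-instruct-tokenizer")
```

`train_new_from_iterator` reuses the base tokenizer's algorithm and normalization but learns merges from the new corpus, which is why the result stays GPT-2-compatible.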

## Evaluation

Evaluation was performed qualitatively by checking token coverage and vocabulary composition on finance-related texts.
Quantitative evaluation can be performed by measuring:

- Token coverage rate on unseen financial documents
- Average tokens per sentence compared to the baseline GPT-2 tokenizer
- Perplexity when integrated with a fine-tuned language model
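The average-tokens-per-sentence comparison, for instance, can be computed as below. This is a sketch: because the custom tokenizer's repo id is not given in this card, the snippet compares the GPT-2 baseline against itself; point `candidate` at the retrained tokenizer's Hub path or local directory for a real comparison, where a domain tokenizer should produce fewer tokens per sentence on finance text.

```python
from transformers import AutoTokenizer

baseline = AutoTokenizer.from_pretrained("gpt2")
# Replace with the custom tokenizer's repo id or local path; the baseline
# is reused here only so the snippet is runnable as-is.
candidate = AutoTokenizer.from_pretrained("gpt2")

sentences = [
    "The ETF's expense ratio was lowered to 0.03 percent.",
    "Amortization schedules differ for fixed-rate and adjustable-rate mortgages.",
]

def avg_tokens_per_sentence(tok, texts):
    # Mean number of token ids produced per input sentence.
    return sum(len(tok(t)["input_ids"]) for t in texts) / len(texts)

print("baseline :", avg_tokens_per_sentence(baseline, sentences))
print("candidate:", avg_tokens_per_sentence(candidate, sentences))
```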

## Developed by

- Yakùl (tokenizer developer)

## Acknowledgements

Thanks to Josephgflowers for the Finance-Instruct-500k dataset and to the maintainers of the GPT-2 tokenizer (openai-community/gpt2) on which this work is based.
## License

This tokenizer is released under the MIT license. It is derived from the GPT-2 tokenizer and the Finance-Instruct-500k dataset; please review their respective licenses before any commercial use.