---
language:
- en
license: apache-2.0
tags:
- financial-analysis
- transformer
- classification
- finbert
- financial-statements
base_model: yiyanghkust/finbert-pretrain
model-index:
- name: tiger-transformer
  results: []
---

# Tiger Transformer (Standardizing Financial Statements)

This model is a fine-tuned version of [yiyanghkust/finbert-pretrain](https://huggingface.co/yiyanghkust/finbert-pretrain) designed to standardize financial statement line items from Balance Sheets and Income Statements into a unified schema.

**Full Source Code & Training Data**: [GitHub - Ruinius/tiger-transformer](https://github.com/Ruinius/tiger-transformer)

## Model Description

The **Tiger Transformer** serves as a specialized classification engine for financial-analysis AI agents. It addresses the inconsistency of general-purpose LLMs when mapping diverse raw line items (e.g., "Cash & Equivalents", "Cash and due from banks") to standardized accounting categories.

### Key Features

- **Context-Aware Classification**: Rather than relying on simple keyword matching, the model uses a context window of two lines before and two lines after the target line to refine predictions.
- **Architecture**: Fine-tuned `BertForSequenceClassification` on the FinBERT base.
- **Quantization Support**: A quantized version (`pytorch_model_quantized.pt`) is available for low-latency CPU inference.
## Intended Uses & Limitations |
|
|
|
|
|
### Intended Use |
|
|
Standardizing raw line items extracted from 10-K, 10-Q, and other financial reports into a consistent format for downstream financial modeling (DCF, ROIC analysis, etc.). |
|
|
|
|
|
### Training Data Strategy |
|
|
The model was trained on a painstakingly curated dataset of manually cleaned financial statement labels. To maximize performance on a niche dataset, the model utilizes all available high-quality labels for training, with validation performed iteratively against new unseen batches. |
|
|
|
|
|
### Performance |
|
|
- **Accuracy**: 90-95% on modern financial reports. |
|
|
- **Robustness**: High accuracy on critical fields (Subtotals and Totals), which are essential for structural validation. |
|
|
- **Limitations**: Accuracy may decrease for companies in highly specialized industries or niche regions with non-standard terminology not present in the training set. |
|
|
|
|
|
## Training Procedure |
|
|
|
|
|
### Input Format |
|
|
The model expects input strings formatted with surrounding context: |
|
|
`[PREV_2] [PREV_1] [SECTION] [RAW_NAME] [NEXT_1] [NEXT_2]` |
|
|
|
|
|
* `[SECTION]`: Balance Sheet or Income Statement. |
|
|
* `[RAW_NAME]`: The line item name to be classified. |
|
|
* `[PREV/NEXT]`: Surrounding line items providing structural context. |
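
Assembling this input string can be sketched as below. This is an illustrative helper, not part of the released code: the `build_input` name and the `" [SEP] "` separator (taken from the example in the Usage section) are assumptions, as is the empty-string padding for line items near the top or bottom of a statement.

```python
def build_input(section, raw_name, prev_lines, next_lines, sep=" [SEP] "):
    """Build the context-formatted input string:
    [PREV_2] [PREV_1] [SECTION] [RAW_NAME] [NEXT_1] [NEXT_2].
    Missing context slots are padded with empty strings (an assumption)."""
    prev = (["", ""] + list(prev_lines))[-2:]  # keep the last two preceding lines
    nxt = (list(next_lines) + ["", ""])[:2]    # keep the first two following lines
    return sep.join(prev + [section, raw_name] + nxt)


text = build_input(
    "Balance Sheet",
    "Accounts Receivable",
    prev_lines=["Cash and Short-term Investments", "Cash and Equivalents"],
    next_lines=["Inventory", "Prepaid Expenses"],
)
# → matches the example string used in the Usage section below
```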

### Hyperparameters

- **Base Model**: FinBERT (`yiyanghkust/finbert-pretrain`)
- **Quantization**: Dynamic quantization (int8) applied to Linear layers for optimized CPU performance.
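
A minimal sketch of how such an int8 dynamic quantization can be applied with PyTorch. The exact procedure used to produce `pytorch_model_quantized.pt` is not documented here, so this is shown on a stand-in module; in practice the same call would be applied to the loaded `AutoModelForSequenceClassification`.

```python
import torch
import torch.nn as nn

# Stand-in for the fine-tuned classifier; in practice you would load
# AutoModelForSequenceClassification.from_pretrained("Ruinius/tiger-transformer").
model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 64))
model.eval()

# int8 dynamic quantization of all Linear layers, for CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# The quantized model is called exactly like the original one.
with torch.no_grad():
    out = quantized(torch.randn(1, 768))
```

Dynamic quantization converts Linear weights to int8 while quantizing activations on the fly, which is why it suits CPU inference without requiring a calibration dataset.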

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Ruinius/tiger-transformer")
model = AutoModelForSequenceClassification.from_pretrained("Ruinius/tiger-transformer")

# Example input with context:
# [PREV_2] [PREV_1] [SECTION] [RAW_NAME] [NEXT_1] [NEXT_2]
text = "Cash and Short-term Investments [SEP] Cash and Equivalents [SEP] Balance Sheet [SEP] Accounts Receivable [SEP] Inventory [SEP] Prepaid Expenses"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Map the predicted ID back to a label via model.config.id2label
predicted_class_id = logits.argmax().item()
predicted_label = model.config.id2label[predicted_class_id]
print(predicted_label)
```

## Acknowledgments & Licensing

This project is a fine-tuned version of the FinBERT-Pretrain model developed by Yang et al. (HKUST).

Licensed under the **Apache License 2.0**, the same license as the base FinBERT model.