---
language:
- en
license: apache-2.0
tags:
- financial-analysis
- transformer
- classification
- finbert
- financial-statements
base_model: yiyanghkust/finbert-pretrain
model-index:
- name: tiger-transformer
  results: []
---

# Tiger Transformer (Standardizing Financial Statements)

This model is a fine-tuned version of [yiyanghkust/finbert-pretrain](https://huggingface.co/yiyanghkust/finbert-pretrain) designed to standardize financial statement line items from Balance Sheets and Income Statements into a unified schema.

## Model Description

The **Tiger Transformer** serves as a specialized classification engine for financial analysis AI agents. It addresses the inconsistency found in broad-purpose LLMs when mapping diverse, raw line items (e.g., "Cash & Equivalents", "Cash and due from banks") to standardized accounting categories.

### Key Features

- **Context-Aware Classification**: Unlike simple keyword matching, this model uses a context window of two lines before and two lines after the target line to refine predictions.
- **Architecture**: Fine-tuned `BertForSequenceClassification` using the FinBERT base.
- **Quantization Support**: A quantized version (`pytorch_model_quantized.pt`) is available for low-latency CPU inference.

## Intended Uses & Limitations

### Intended Use

Standardizing raw line items extracted from 10-K, 10-Q, and other financial reports into a consistent format for downstream financial modeling (DCF, ROIC analysis, etc.).

### Training Data Strategy

The model was trained on a carefully curated dataset of manually cleaned financial statement labels. To maximize performance on a niche dataset, all available high-quality labels were used for training, with validation performed iteratively against new, unseen batches.

### Performance

- **Accuracy**: 90-95% on modern financial reports.
- **Robustness**: High accuracy on critical fields (subtotals and totals), which are essential for structural validation.
- **Limitations**: Accuracy may decrease for companies in highly specialized industries or niche regions whose non-standard terminology is absent from the training set.

## Training Procedure

### Input Format

The model expects input strings formatted with surrounding context:

`[PREV_2] [PREV_1] [SECTION] [RAW_NAME] [NEXT_1] [NEXT_2]`

* `[SECTION]`: Balance Sheet or Income Statement.
* `[RAW_NAME]`: The line item name to be classified.
* `[PREV/NEXT]`: Surrounding line items providing structural context.

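Assembling this input string from a parsed statement can be sketched as follows. The `build_input` helper and the use of a literal `[SEP]` separator are inferred from the usage example in this card, not part of any released code:

```python
def build_input(lines, idx, section):
    """Build the context string for the line item at position idx.

    Joins [PREV_2] [PREV_1] [SECTION] [RAW_NAME] [NEXT_1] [NEXT_2]
    with " [SEP] ", skipping neighbors that fall outside the statement.
    """
    prev = lines[max(0, idx - 2):idx]   # up to two lines before
    nxt = lines[idx + 1:idx + 3]        # up to two lines after
    return " [SEP] ".join(prev + [section, lines[idx]] + nxt)


rows = ["Cash and Short-term Investments", "Cash and Equivalents",
        "Accounts Receivable", "Inventory", "Prepaid Expenses"]
print(build_input(rows, 2, "Balance Sheet"))
# Cash and Short-term Investments [SEP] Cash and Equivalents [SEP] Balance Sheet [SEP] Accounts Receivable [SEP] Inventory [SEP] Prepaid Expenses
```

Lines near the top or bottom of a statement simply get fewer context neighbors, so the same helper works at statement boundaries.
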
### Hyperparameters

- **Base Model**: FinBERT
- **Quantization**: Dynamic quantization (int8) applied to Linear layers for optimized CPU performance.

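Dynamic int8 quantization of the Linear layers follows PyTorch's standard workflow. A minimal sketch on a stand-in classifier head — the layer sizes and class count below are illustrative, not the model's actual dimensions:

```python
import torch
import torch.nn as nn

# Stand-in for the classifier; the real model's Linear layers
# (attention projections, FFN, classification head) are handled the same way.
model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 40))
model.eval()

# Dynamic quantization: Linear weights are stored as int8, and
# activations are quantized on the fly at inference time (CPU only).
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    out = quantized(torch.randn(1, 768))
print(out.shape)  # torch.Size([1, 40])
```

Only the weights are pre-quantized; inputs stay float, which is why no calibration dataset is needed for this mode.
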
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Ruinius/tiger-transformer")
model = AutoModelForSequenceClassification.from_pretrained("Ruinius/tiger-transformer")

# Example input with context
text = "Cash and Short-term Investments [SEP] Cash and Equivalents [SEP] Balance Sheet [SEP] Accounts Receivable [SEP] Inventory [SEP] Prepaid Expenses"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
predicted_class_id = logits.argmax(dim=-1).item()

# Map the ID back to its label
predicted_label = model.config.id2label[predicted_class_id]
```

## Acknowledgments & Licensing

This project is a fine-tuned version of the FinBERT-Pretrain model developed by Yang et al. (HKUST).

Licensed under the **Apache License 2.0**, the same license as the base FinBERT model.