Ruinius committed (verified)
Commit 9d5e293 · Parent(s): 095fda4

Update README.md

Files changed (1): README.md (+79 -2)
---
language:
- en
license: apache-2.0
tags:
- financial-analysis
- transformer
- classification
- finbert
- financial-statements
base_model: yiyanghkust/finbert-pretrain
model-index:
- name: tiger-transformer
  results: []
---

# Tiger Transformer (Standardizing Financial Statements)

This model is a fine-tuned version of [yiyanghkust/finbert-pretrain](https://huggingface.co/yiyanghkust/finbert-pretrain) designed to standardize financial statement line items from Balance Sheets and Income Statements into a unified schema.

## Model Description

The **Tiger Transformer** is a specialized classification engine for financial-analysis AI agents. It addresses the inconsistency of general-purpose LLMs when mapping diverse raw line items (e.g., "Cash & Equivalents", "Cash and due from banks") to standardized accounting categories.

### Key Features
- **Context-Aware Classification**: Unlike simple keyword matching, the model uses a context window of two lines before and two lines after the target line to refine predictions.
- **Architecture**: Fine-tuned `BertForSequenceClassification` on the FinBERT base.
- **Quantization Support**: A quantized version (`pytorch_model_quantized.pt`) is available for low-latency CPU inference.

## Intended Uses & Limitations

### Intended Use
Standardizing raw line items extracted from 10-K, 10-Q, and other financial reports into a consistent format for downstream financial modeling (DCF, ROIC analysis, etc.).

### Training Data Strategy
The model was trained on a carefully curated dataset of manually cleaned financial statement labels. To maximize performance on a niche dataset, all available high-quality labels were used for training, with validation performed iteratively against new, unseen batches.

### Performance
- **Accuracy**: 90-95% on modern financial reports.
- **Robustness**: High accuracy on critical fields (Subtotals and Totals), which are essential for structural validation.
- **Limitations**: Accuracy may decrease for companies in highly specialized industries or niche regions with non-standard terminology not present in the training set.

## Training Procedure

### Input Format
The model expects input strings formatted with surrounding context (see the sketch after this list):

`[PREV_2] [PREV_1] [SECTION] [RAW_NAME] [NEXT_1] [NEXT_2]`

* `[SECTION]`: Balance Sheet or Income Statement.
* `[RAW_NAME]`: The line item name to be classified.
* `[PREV/NEXT]`: Surrounding line items providing structural context.
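
As an illustration, the snippet below assembles this string from a window of neighboring line items. The `build_input` helper and the `" [SEP] "` delimiter are assumptions inferred from the Usage example further down, not a documented preprocessing API:

```python
# Hypothetical helper (not part of this repo): build the context-window input.
# The " [SEP] " delimiter mirrors the Usage example below; the delimiter
# actually used during training is an assumption.
def build_input(lines, idx, section):
    def get(i):
        # Out-of-range context slots are left empty at statement boundaries.
        return lines[i] if 0 <= i < len(lines) else ""
    parts = [get(idx - 2), get(idx - 1), section,
             lines[idx], get(idx + 1), get(idx + 2)]
    return " [SEP] ".join(parts)

statement = [
    "Cash and Short-term Investments",
    "Cash and Equivalents",
    "Accounts Receivable",
    "Inventory",
    "Prepaid Expenses",
]
print(build_input(statement, 2, "Balance Sheet"))
# -> Cash and Short-term Investments [SEP] Cash and Equivalents [SEP]
#    Balance Sheet [SEP] Accounts Receivable [SEP] Inventory [SEP] Prepaid Expenses
```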

### Hyperparameters
- **Base Model**: FinBERT (`yiyanghkust/finbert-pretrain`)
- **Quantization**: Dynamic quantization (int8) applied to Linear layers for optimized CPU performance, as sketched below.
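
A minimal sketch of how such a quantized variant can be produced with standard PyTorch dynamic quantization; the exact recipe used to export `pytorch_model_quantized.pt` is not documented here, so treat this as an assumption:

```python
import torch
from transformers import AutoModelForSequenceClassification

# Load the fp32 model, then quantize its Linear layers to int8 dynamically.
# Assumption: this mirrors how pytorch_model_quantized.pt was produced.
model = AutoModelForSequenceClassification.from_pretrained("Ruinius/tiger-transformer")
model.eval()

quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
torch.save(quantized, "pytorch_model_quantized.pt")
```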

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Ruinius/tiger-transformer")
model = AutoModelForSequenceClassification.from_pretrained("Ruinius/tiger-transformer")

# Example input with context: [PREV_2] [PREV_1] [SECTION] [RAW_NAME] [NEXT_1] [NEXT_2]
text = "Cash and Short-term Investments [SEP] Cash and Equivalents [SEP] Balance Sheet [SEP] Accounts Receivable [SEP] Inventory [SEP] Prepaid Expenses"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
predicted_class_id = logits.argmax().item()

# Map the ID back to a standardized label
label = model.config.id2label[predicted_class_id]
```
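
To run the same inference on CPU with the quantized checkpoint, one option is to download and load it directly. This is a sketch under the assumption that the file stores the full pickled quantized module (saved via `torch.save(model, ...)`); adjust if it holds a state dict instead:

```python
from huggingface_hub import hf_hub_download
import torch

# Assumption: pytorch_model_quantized.pt is a pickled nn.Module, so
# weights_only=False is required on recent PyTorch versions.
path = hf_hub_download("Ruinius/tiger-transformer", "pytorch_model_quantized.pt")
quantized_model = torch.load(path, map_location="cpu", weights_only=False)
quantized_model.eval()

with torch.no_grad():
    logits = quantized_model(**inputs).logits  # reuses `inputs` from above
predicted_class_id = logits.argmax().item()
```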

## Acknowledgments & Licensing
This project is a fine-tuned version of the FinBERT-Pretrain model developed by Yang et al. (HKUST).
Licensed under the **Apache License 2.0**, the same license as the base FinBERT model.