---
tags:
- financial-filings
- classification
- xgboost
- jina-embeddings-v3
- finance
- nlp
library_name: xgboost
metrics:
- f1: 0.935
- accuracy: 0.95
model-index:
- name: hierarchical-filing-classifier
  results:
  - task:
      type: text-classification
      name: Financial Document Classification
    metrics:
    - type: f1
      value: 0.935
      name: Weighted F1
    - type: accuracy
      value: 0.973
      name: Top-2 Router Accuracy
---

# Financial Reports Hierarchical Classifier

This is a production-grade Hierarchical Cascade Classifier designed to categorize Global and European financial filings into **29 distinct classes**. It powers the classification engine for **FinancialReports**.

## Performance Highlights

| Metric | Score | Interpretation |
| :--- | :--- | :--- |
| **Global Weighted F1** | **93.5%** | State-of-the-art performance for unstructured financial text. |
| **Top-2 Router Accuracy** | **97.3%** | The correct specialist is consulted 97.3% of the time. |
| **Call Transcript Precision** | **100%** | Zero false positives for transcripts. |
| **Delisting Precision** | **100%** | High-precision signal for critical negative corporate events. |

### Detailed Performance by Filing Type

*Scores based on a hold-out test set of ~5,500 documents.*

| Filing Type | Precision | Recall | F1-Score |
| :--- | :--- | :--- | :--- |
| **Interest Rate Update/Notice** | 98.9% | 98.1% | **0.99** |
| **Proxy Solicitation** | 98.6% | 94.4% | **0.96** |
| **Annual Report** | 96.7% | 95.6% | **0.96** |
| **Investor Presentation** | 97.3% | 94.2% | **0.96** |
| **Voting Results** | 94.9% | 96.4% | **0.96** |
| **Audit Report** | 94.7% | 96.4% | **0.96** |
| **Director's Dealing** | 95.9% | 95.0% | **0.95** |
| **Dividend Notice** | 97.8% | 93.0% | **0.95** |
| **Fund Factsheet** | 96.0% | 94.4% | **0.95** |
| **Net Asset Value (NAV)** | 92.6% | 97.6% | **0.95** |
| **Interim / Quarterly Report** | 93.7% | 96.3% | **0.95** |
| **AGM Information** | 95.0% | 93.9% | **0.94** |
| **Remuneration Info** | 97.3% | 91.4% | **0.94** |
| **Report Publication Announcement** | 93.5% | 94.8% | **0.94** |
| **Earnings Release** | 93.2% | 94.0% | **0.94** |
| **ESG / Sustainability Info** | 96.3% | 90.7% | **0.93** |
| **Governance Info** | 97.1% | 89.5% | **0.93** |
| **Capital/Financing Update** | 97.0% | 89.2% | **0.93** |
| **Call Transcript** | **100.0%** | 86.7% | **0.93** |
| **Major Shareholding Notification** | 93.0% | 92.5% | **0.93** |
| **Board/Management Info** | 91.8% | 93.4% | **0.93** |
| **Transaction in Own Shares** | 90.0% | 94.8% | **0.92** |
| **Legal Proceedings** | 92.7% | 90.3% | **0.91** |
| **Regulatory Filings (Generic)** | 89.3% | 93.2% | **0.91** |
| **Management Reports** | 90.8% | 88.2% | **0.89** |
| **M&A Activity** | 95.1% | 81.4% | **0.88** |
| **Share Issue/Capital Change** | 86.0% | 89.3% | **0.88** |
| **Delisting Announcement** | **100.0%** | 75.9% | **0.86** |

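
The F1 column is the harmonic mean of precision and recall, which is why a class with perfect precision can still score below 0.90 when recall is lower. A minimal sketch that reproduces two rows of the per-class table from their rounded precision and recall values; the `f1_score` helper is illustrative and not part of this repository:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the per-class table (rounded values).
delisting_f1 = f1_score(1.000, 0.759)
call_transcript_f1 = f1_score(1.000, 0.867)

print(round(delisting_f1, 2))        # 0.86
print(round(call_transcript_f1, 2))  # 0.93
```

Because the published precision and recall are themselves rounded, recomputed F1 values can differ from the table in the last digit for some rows.
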

---

## Architecture

The system uses a **2-Stage Soft-Routing Architecture** to break the "Semantic Ceiling" often found in flat classifiers:

1. **Level 1 (The Router):** A Jina-V3 embedding model feeds an XGBoost Router that predicts one of 8 Main Categories (e.g., "Financial Reporting", "Equity Info").
2. **Level 2 (The Specialists):** The document is passed to the top-2 most likely Specialist Models, which compete to assign the final fine-grained label.

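
The two-stage routing can be sketched in plain Python. This card does not document how the two specialists "compete", so the sketch assumes the final score is the router probability multiplied by the specialist's top probability; the function name `route_top2` and the toy inputs are hypothetical, not the repository's API:

```python
def route_top2(router_probs, specialist_probs, specialist_labels):
    """Pick a final label by consulting the two most likely specialists.

    router_probs:      list of router probabilities, one per main category.
    specialist_probs:  list (one entry per category) of that specialist's
                       probabilities over its own fine-grained labels.
    specialist_labels: list of label lists, aligned with specialist_probs.
    """
    # Indices of the two highest router probabilities, best first.
    top2 = sorted(range(len(router_probs)),
                  key=lambda i: router_probs[i], reverse=True)[:2]

    best = None
    for cat in top2:
        probs = specialist_probs[cat]
        idx = max(range(len(probs)), key=lambda j: probs[j])
        # ASSUMPTION: router and specialist confidence are combined by multiplication.
        score = router_probs[cat] * probs[idx]
        if best is None or score > best["score"]:
            best = {"category": cat,
                    "label": specialist_labels[cat][idx],
                    "score": score}
    return best


# Toy example with 3 categories: the router slightly prefers category 0,
# but the category-1 specialist is far more confident in its answer.
router = [0.50, 0.45, 0.05]
specialists = [[0.40, 0.35, 0.25], [0.95, 0.05], [1.0]]
labels = [["Annual Report", "Earnings Release", "Audit Report"],
          ["Investor Presentation", "Call Transcript"],
          ["M&A Activity"]]

result = route_top2(router, specialists, labels)
print(result["label"])  # Investor Presentation
```

Consulting two specialists instead of one lets a confident second-choice specialist override a near-tie at the router level, which is the case shown in the toy example.
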
## Critical Usage Note: The "Wrapper Effect"

Financial documents are often massive (500+ pages) but must be truncated to fit into GPU memory for embedding. However, **Document Length** is a critical feature for distinguishing a full *Annual Report* from a short *Press Release* announcing it.

**To reproduce the reported 93.5% weighted F1, you must decouple embedding text from feature engineering:**

1. **Embedding (GPU):** Pass the truncated text (e.g., first 32k characters) to Jina-V3.
2. **Feature Vector (XGBoost):** Calculate `log1p(length)` using the **True Original Length** of the document, not the truncated string length.

*If you do not provide the original length, the model will assume the document is short and may misclassify massive Annual Reports as simple Press Releases.*

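
A minimal sketch of the decoupling described in this note, using only the standard library. The helper names and the choice to append the length feature to the end of the embedding vector are assumptions for illustration; the actual feature layout is defined by the repository's `inference_wrapper`:

```python
import math

MAX_EMBED_CHARS = 32_000  # example truncation limit ("first 32k characters")


def prepare_inputs(full_text: str):
    """Return (text_for_embedding, length_feature) for one document."""
    true_length = len(full_text)             # measured BEFORE truncation
    truncated = full_text[:MAX_EMBED_CHARS]  # only this part goes to the encoder
    length_feature = math.log1p(true_length)
    return truncated, length_feature


def build_feature_vector(embedding, length_feature):
    """ASSUMPTION: the length feature is appended after the embedding values."""
    return list(embedding) + [length_feature]


doc = "x" * 2_500_000  # stand-in for a 2.5-million-character Annual Report
text, correct_feature = prepare_inputs(doc)
wrong_feature = math.log1p(len(text))  # what you get if you measure after truncating

print(len(text))                  # 32000
print(round(correct_feature, 2))  # 14.73
print(round(wrong_feature, 2))    # 10.37
```

The gap between the two values is what separates a full report from a short announcement in the length feature; measuring after truncation collapses every long document to the same value.
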
## Usage

```python
import sys

from huggingface_hub import snapshot_download

# 1. Download models
model_path = snapshot_download(repo_id="FinancialReports/hierarchical-filing-classifier")

# 2. Add path and import wrapper
sys.path.append(model_path)
from inference_wrapper import FinancialFilingClassifier

# 3. Initialize
classifier = FinancialFilingClassifier(model_path)

# 4. Scenario: a large Annual Report (~2.5 million characters)
real_doc_length = 2_500_000
truncated_text = "ACME CORP ANNUAL REPORT 2024... [Truncated at 32k chars]"

# 5. Predict (ensure your wrapper/API handles the length argument)
result = classifier.predict(
    text=truncated_text,
    # NOTE: `original_length` is a placeholder keyword; check inference_wrapper.py
    # for the actual parameter name. The classifier must apply log1p to this value
    # instead of len(truncated_text) before passing it to XGBoost.
    original_length=real_doc_length,
)

print(result)
# Output:
# {
#   'category': 'Financial Reporting',
#   'label': 'Annual Report',
#   'score': 0.985,
# }
```

## Taxonomy (29 Classes)

The model classifies documents into this hierarchy:

| **Financial Reporting** | **Equity Information** | **Listing & Regulatory** |
| :--- | :--- | :--- |
| • Annual Report<br>• Earnings Release<br>• Interim / Quarterly Report<br>• Audit Report | • Major Shareholding Notification<br>• Transaction in Own Shares (Buyback)<br>• Share Issue / Capital Change<br>• Notice of Dividend Amount | • Regulatory Filings (RNS)<br>• Delisting Announcement<br>• Prospectus<br>• Registration Form |

| **AGM Information** | **Management** | **Investor Comm** |
| :--- | :--- | :--- |
| • AGM Information (Pre/Post)<br>• Voting Results<br>• Proxy Solicitation | • Director's Dealing<br>• Management Reports<br>• Remuneration Info<br>• Board Changes | • Investor Presentation<br>• Call Transcript<br>• Report Publication Announcement |

| **M&A and Legal** | **Debt Information** | **Investment Vehicle** |
| :--- | :--- | :--- |
| • M&A Activity<br>• Legal Proceedings Report | • Capital/Financing Update<br>• Interest Rate Notice | • Net Asset Value (NAV)<br>• Fund Factsheet |

## The Standard: Financial Reporting Classification Framework (FRCF)

The taxonomy used by this model is based on the **[Financial Reporting Classification Framework (FRCF)](https://financialreports.eu/financial-reporting-classification-framework/)**, an open-source standard designed to organize corporate disclosures in a consistent, cross-jurisdictional format.

Unlike fragmented regulatory schemes, the FRCF organizes disclosures by **functional purpose**, ensuring comparability across markets (e.g., mapping a US *10-K* and a European *Annual Financial Report* to the same standardized `Annual Report` category).

* **[Explore the Framework](https://financialreports.eu/financial-reporting-classification-framework/)**
* **[Download Methodology (PDF)](https://financialreports.eu/download/frcf-methodology.pdf)**

## Training Data

The model was trained on a proprietary **Golden Dataset of 27,671 financial filings**, manually curated to represent the diverse landscape of global corporate reporting.

* **Source:** Real-world filings from listed companies across **Europe (primary focus)**, North America, and Asia.
* **Multilingual:** Includes documents in English, French, German, and other major European languages (leveraging the multilingual capabilities of Jina-V3).
* **Diversity:** The dataset preserves the natural "long-tail" distribution of financial data, ranging from massive 500+ page **Annual Reports** to single-page **Press Releases** and complex **ESG Disclosures**.
* **Quality Control:** Mapped to a strict 2-level hierarchy to resolve semantic ambiguities common in regulatory filings (e.g., distinguishing a *Share Buyback* announcement from a *Director's Dealing* notification).

## Deployment & Hardware

This model is optimized for **GPU Inference** due to the heavy 8192-token context window of the Jina encoder. While CPU inference is possible, it is significantly slower.

### Recommended Configuration

| Component | Recommendation | Notes |
| :--- | :--- | :--- |
| **GPU** | **NVIDIA T4 (16GB)** | The "Sweet Spot" for cost/performance. Capable of ~50 docs/sec in batch mode. |
| **Alternative** | NVIDIA L4 / A10 | Recommended for high-concurrency production APIs. |
| **VRAM** | 16 GB Minimum | Required to embed long documents without OOM errors. |
| **System RAM** | 16 GB+ | Standard requirement for PyTorch + XGBoost overhead. |

### Critical Environment Settings

The underlying Jina-V3 model ships custom modeling code, so you **must** allow remote code execution by setting the following environment variable in your deployment (Docker, Kubernetes, or Hugging Face Endpoints):

```bash
HF_TRUST_REMOTE_CODE=True
```

### Throughput Benchmarks (T4 GPU)

* **Live API Latency:** ~200–500 ms per document.
* **Batch Processing:** ~40–50 documents per second (Batch Size: 64).

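
For capacity planning, the batch figures translate directly into wall-clock estimates. The sketch below is illustrative only: the `batches` and `estimated_seconds` helpers are hypothetical, and the estimate assumes the quoted 40–50 documents per second holds for your documents and hardware:

```python
def batches(items, batch_size=64):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def estimated_seconds(n_docs, docs_per_second):
    """Rough processing time at a constant throughput."""
    return n_docs / docs_per_second


docs = [f"doc-{i}" for i in range(150)]
sizes = [len(batch) for batch in batches(docs)]
print(sizes)  # [64, 64, 22]

# 100,000 documents at the quoted throughput range:
fast_minutes = estimated_seconds(100_000, 50) / 60
slow_minutes = estimated_seconds(100_000, 40) / 60
print(round(fast_minutes, 1))  # 33.3
print(round(slow_minutes, 1))  # 41.7
```
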