---
tags:
- financial-filings
- classification
- xgboost
- jina-embeddings-v3
- finance
- nlp
library_name: xgboost
metrics:
- f1: 0.935
- accuracy: 0.95
model-index:
- name: hierarchical-filing-classifier
results:
- task:
type: text-classification
name: Financial Document Classification
metrics:
- type: f1
value: 0.935
name: Weighted F1
- type: accuracy
value: 0.973
name: Top-2 Router Accuracy
---
# Financial Reports Hierarchical Classifier
This is a production-grade Hierarchical Cascade Classifier designed to categorize global financial filings, with a primary focus on European issuers, into **29 distinct classes**. It powers the classification engine for **FinancialReports**.
## 🚀 Performance Highlights
| Metric | Score | Interpretation |
| :--- | :--- | :--- |
| **Global Weighted F1** | **93.5%** | State-of-the-art performance for unstructured financial text. |
| **Top-2 Router Accuracy** | **97.3%** | The correct specialist is consulted 97.3% of the time. |
| **Call Transcript Precision** | **100%** | Zero false positives for transcripts. |
| **Delisting Precision** | **100%** | High-precision signal for critical negative corporate events. |
### Detailed Performance by Filing Type
*Scores based on a hold-out test set of ~5,500 documents.*
| Filing Type | Precision | Recall | F1-Score |
| :--- | :--- | :--- | :--- |
| **Interest Rate Update/Notice** | 98.9% | 98.1% | **0.99** |
| **Proxy Solicitation** | 98.6% | 94.4% | **0.96** |
| **Annual Report** | 96.7% | 95.6% | **0.96** |
| **Investor Presentation** | 97.3% | 94.2% | **0.96** |
| **Voting Results** | 94.9% | 96.4% | **0.96** |
| **Audit Report** | 94.7% | 96.4% | **0.96** |
| **Director's Dealing** | 95.9% | 95.0% | **0.95** |
| **Dividend Notice** | 97.8% | 93.0% | **0.95** |
| **Fund Factsheet** | 96.0% | 94.4% | **0.95** |
| **Net Asset Value (NAV)** | 92.6% | 97.6% | **0.95** |
| **Interim / Quarterly Report** | 93.7% | 96.3% | **0.95** |
| **AGM Information** | 95.0% | 93.9% | **0.94** |
| **Remuneration Info** | 97.3% | 91.4% | **0.94** |
| **Report Publication Announcement** | 93.5% | 94.8% | **0.94** |
| **Earnings Release** | 93.2% | 94.0% | **0.94** |
| **ESG / Sustainability Info** | 96.3% | 90.7% | **0.93** |
| **Governance Info** | 97.1% | 89.5% | **0.93** |
| **Capital/Financing Update** | 97.0% | 89.2% | **0.93** |
| **Call Transcript** | **100.0%** | 86.7% | **0.93** |
| **Major Shareholding Notification** | 93.0% | 92.5% | **0.93** |
| **Board/Management Info** | 91.8% | 93.4% | **0.93** |
| **Transaction in Own Shares** | 90.0% | 94.8% | **0.92** |
| **Legal Proceedings** | 92.7% | 90.3% | **0.91** |
| **Regulatory Filings (Generic)** | 89.3% | 93.2% | **0.91** |
| **Management Reports** | 90.8% | 88.2% | **0.89** |
| **M&A Activity** | 95.1% | 81.4% | **0.88** |
| **Share Issue/Capital Change** | 86.0% | 89.3% | **0.88** |
| **Delisting Announcement** | **100.0%** | 75.9% | **0.86** |
---
## 🏗️ Architecture
The system uses a **2-Stage Soft-Routing Architecture** to break the "Semantic Ceiling" often found in flat classifiers:
1. **Level 1 (The Router):** A Jina-V3 embedding model feeds an XGBoost Router that predicts one of 8 Main Categories (e.g., "Financial Reporting", "Equity Info").
2. **Level 2 (The Specialists):** The document is passed to the top-2 most likely Specialist Models, which compete to assign the final fine-grained label.
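The two-stage routing logic can be sketched as follows. This is a minimal illustration with mocked router and specialist outputs; the real pipeline uses Jina-V3 embeddings and XGBoost models, and the function and variable names here (including weighting the specialist score by the router probability) are assumptions, not the repository's actual implementation:

```python
import numpy as np

def route_and_classify(router_probs, specialists, embedding):
    """Soft routing: consult the specialists of the top-2 main categories
    and keep the most confident fine-grained label."""
    top2 = np.argsort(router_probs)[-2:][::-1]  # indices of the 2 likeliest categories
    best_label, best_score = None, -1.0
    for cat in top2:
        label, score = specialists[cat](embedding)  # each specialist returns its best label
        # Weight the specialist's confidence by the router's belief in its category
        weighted = score * router_probs[cat]
        if weighted > best_score:
            best_label, best_score = label, weighted
    return best_label, best_score

# Mocked example: the router favors category 1 ("Financial Reporting")
router_probs = np.array([0.30, 0.55, 0.10, 0.05])
specialists = {
    0: lambda e: ("Investor Presentation", 0.70),
    1: lambda e: ("Annual Report", 0.90),
}
label, score = route_and_classify(router_probs, specialists, embedding=None)
print(label)  # Annual Report
```

Because only the top-2 specialists are consulted, a mildly uncertain router still recovers the correct label as long as the true category is among its two best guesses, which per the table above happens 97.3% of the time.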
## ⚠️ Critical Usage Note: The "Wrapper Effect"
Financial documents are often massive (500+ pages) but must be truncated to fit into GPU memory for embedding. However, **Document Length** is a critical feature for distinguishing a full *Annual Report* from a short *Press Release* announcing it.
**To achieve the reported performance, you must decouple the text you embed from the features you engineer:**
1. **Embedding (GPU):** Pass the truncated text (e.g., first 32k characters) to Jina-V3.
2. **Feature Vector (XGBoost):** Calculate `log1p(length)` using the **True Original Length** of the document, not the truncated string length.
*If you do not provide the original length, the model will assume the document is short and may misclassify massive Annual Reports as simple Press Releases.*
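The decoupling above can be sketched in a few lines (the 32k-character truncation limit and document size are taken from the usage example below; `log1p` is numpy's):

```python
import numpy as np

TRUNCATION_LIMIT = 32_000  # characters sent to the embedding model

full_text = "A" * 2_500_000             # a 2.5M-character annual report
truncated = full_text[:TRUNCATION_LIMIT]  # this is what Jina-V3 embeds

# WRONG: the feature saturates at log1p(32000) for every large document,
# so an Annual Report and a long Press Release become indistinguishable
wrong_feature = np.log1p(len(truncated))

# RIGHT: use the true original length, captured before truncation
length_feature = np.log1p(len(full_text))

print(round(wrong_feature, 2), round(length_feature, 2))
```

In practice this means recording `len(document)` at ingestion time, before any truncation for the GPU, and carrying that number alongside the embedding into the XGBoost feature vector.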
## 💻 Usage
```python
from huggingface_hub import snapshot_download
import sys
# 1. Download Models
model_path = snapshot_download(repo_id="FinancialReports/hierarchical-filing-classifier")
# 2. Add path and import wrapper
sys.path.append(model_path)
from inference_wrapper import FinancialFilingClassifier
# 3. Initialize
classifier = FinancialFilingClassifier(model_path)
# 4. Scenario: A 2MB Annual Report
real_doc_length = 2500000 # 2.5 Million chars
truncated_text = "ACME CORP ANNUAL REPORT 2024... [Truncated at 32k chars]"
# 5. Predict (ensure your wrapper/API accepts the true document length)
result = classifier.predict(
    text=truncated_text,
    # Pass the TRUE original length (the keyword name depends on your
    # wrapper's signature); the classifier must apply log1p to this value
    # instead of len(truncated_text) before passing it to XGBoost.
    original_length=real_doc_length,
)
print(result)
# Output:
# {
# 'category': 'Financial Reporting',
# 'label': 'Annual Report',
# 'score': 0.985,
# }
```
## 📂 Taxonomy (29 Classes)
The model classifies documents into this hierarchy:
| **Financial Reporting** | **Equity Information** | **Listing & Regulatory** |
| :--- | :--- | :--- |
| • Annual Report<br>• Earnings Release<br>• Interim / Quarterly Report<br>• Audit Report | • Major Shareholding Notification<br>• Transaction in Own Shares (Buyback)<br>• Share Issue / Capital Change<br>• Notice of Dividend Amount | • Regulatory Filings (RNS)<br>• Delisting Announcement<br>• Prospectus<br>• Registration Form |

| **AGM Information** | **Management** | **Investor Comm** |
| :--- | :--- | :--- |
| • AGM Information (Pre/Post)<br>• Voting Results<br>• Proxy Solicitation | • Director's Dealing<br>• Management Reports<br>• Remuneration Info<br>• Board Changes | • Investor Presentation<br>• Call Transcript<br>• Report Publication Announcement |

| **M&A and Legal** | **Debt Information** | **Investment Vehicle** |
| :--- | :--- | :--- |
| • M&A Activity<br>• Legal Proceedings Report | • Capital/Financing Update<br>• Interest Rate Notice | • Net Asset Value (NAV)<br>• Fund Factsheet |
## 📜 The Standard: Financial Reporting Classification Framework (FRCF)
The taxonomy used by this model is based on the **[Financial Reporting Classification Framework (FRCF)](https://financialreports.eu/financial-reporting-classification-framework/)**, an open-source standard designed to organize corporate disclosures in a consistent, cross-jurisdictional format.
Unlike fragmented regulatory schemes, the FRCF organizes disclosures by **functional purpose**, ensuring comparability across markets (e.g., mapping a US *10-K* and a European *Annual Financial Report* to the same standardized `Annual Report` category).
* **[Explore the Framework](https://financialreports.eu/financial-reporting-classification-framework/)**
* **[Download Methodology (PDF)](https://financialreports.eu/download/frcf-methodology.pdf)**
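As an illustration of the cross-jurisdictional mapping described above, a normalization step might look like this. The dictionary and function below are a hypothetical sketch for clarity, not part of the published FRCF:

```python
# Hypothetical mapping from jurisdiction-specific form types to FRCF categories
FRCF_MAP = {
    "10-K": "Annual Report",                     # US (SEC)
    "Annual Financial Report": "Annual Report",  # EU (Transparency Directive)
    "10-Q": "Interim / Quarterly Report",        # US quarterly filing
    "Halbjahresfinanzbericht": "Interim / Quarterly Report",  # German half-year report
}

def to_frcf(form_type: str) -> str:
    """Map a local form type to its FRCF category by functional purpose."""
    return FRCF_MAP.get(form_type, "Regulatory Filings (Generic)")

print(to_frcf("10-K"), "|", to_frcf("Annual Financial Report"))
# Annual Report | Annual Report
```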
## 📚 Training Data
The model was trained on a proprietary **Golden Dataset of 27,671 financial filings**, manually curated to represent the diverse landscape of global corporate reporting.
* **Source:** Real-world filings from listed companies across **Europe (primary focus)**, North America, and Asia.
* **Multilingual:** Includes documents in English, French, German, and other major European languages (leveraging the multilingual capabilities of Jina-V3).
* **Diversity:** The dataset preserves the natural "long-tail" distribution of financial data, ranging from massive 500+ page **Annual Reports** to single-page **Press Releases** and complex **ESG Disclosures**.
* **Quality Control:** Mapped to a strict 2-level hierarchy to resolve semantic ambiguities common in regulatory filings (e.g., distinguishing a *Share Buyback* announcement from a *Director's Dealing* notification).
## ⚙️ Deployment & Hardware
This model is optimized for **GPU Inference** due to the Jina encoder's long 8192-token context window. While CPU inference is possible, it is significantly slower.
### Recommended Configuration
| Component | Recommendation | Notes |
| :--- | :--- | :--- |
| **GPU** | **NVIDIA T4 (16GB)** | The "Sweet Spot" for cost/performance. Capable of ~50 docs/sec in batch mode. |
| **Alternative** | NVIDIA L4 / A10 | Recommended for high-concurrency production APIs. |
| **VRAM** | 16 GB Minimum | Required to embed long documents without OOM errors. |
| **System RAM** | 16 GB+ | Standard requirement for PyTorch + XGBoost overhead. |
### Critical Environment Settings
To load the underlying Jina-V3 model, you **must** enable remote code execution via an environment variable (in Docker, Kubernetes, or Hugging Face Endpoints):
```bash
HF_TRUST_REMOTE_CODE=True
```
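If you cannot set the variable at the container level, it can also be set in-process, as a sketch; it must be assigned before the Jina-V3 encoder is loaded for it to take effect:

```python
import os

# Must be set before the classifier (and its Jina-V3 encoder) is loaded
os.environ["HF_TRUST_REMOTE_CODE"] = "True"
```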
### Throughput Benchmarks (T4 GPU)
* **Live API Latency:** ~200ms – 500ms per document.
* **Batch Processing:** ~40 – 50 documents per second (Batch Size: 64).
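The batch figures above assume documents are grouped before being sent to the GPU. A minimal batching sketch (batch size 64, matching the benchmark; the chunking helper is illustrative and not part of the repository):

```python
from typing import Iterator

BATCH_SIZE = 64  # the batch size used in the benchmark above

def batched(docs: list, size: int = BATCH_SIZE) -> Iterator[list]:
    """Yield fixed-size chunks so the GPU embeds documents in groups."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

docs = [f"doc-{i}" for i in range(150)]
sizes = [len(b) for b in batched(docs)]
print(sizes)  # [64, 64, 22]
```

At ~40-50 documents per second, each full batch of 64 clears in roughly 1.3-1.6 seconds on a T4.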