---
tags:
- financial-filings
- classification
- xgboost
- jina-embeddings-v3
- finance
- nlp
library_name: xgboost
metrics:
- f1: 0.935
- accuracy: 0.95
model-index:
- name: hierarchical-filing-classifier
  results:
  - task:
      type: text-classification
      name: Financial Document Classification
    metrics:
    - type: f1
      value: 0.935
      name: Weighted F1
    - type: accuracy
      value: 0.973
      name: Top-2 Router Accuracy
---

# Financial Reports Hierarchical Classifier

This is a production-grade hierarchical cascade classifier that categorizes global and European financial filings into **29 distinct classes**. It powers the classification engine for **FinancialReports**.

## 🚀 Performance Highlights

| Metric | Score | Interpretation |
| :--- | :--- | :--- |
| **Global Weighted F1** | **93.5%** | Support-weighted F1 across all classes on the hold-out test set. |
| **Top-2 Router Accuracy** | **97.3%** | The correct specialist is consulted 97.3% of the time. |
| **Call Transcript Precision** | **100%** | Zero false positives for transcripts. |
| **Delisting Precision** | **100%** | High-precision signal for critical negative corporate events. |

### Detailed Performance by Filing Type

*Scores based on a hold-out test set of ~5,500 documents.*

| Filing Type | Precision | Recall | F1-Score |
| :--- | :--- | :--- | :--- |
| **Interest Rate Update/Notice** | 98.9% | 98.1% | **0.99** |
| **Proxy Solicitation** | 98.6% | 94.4% | **0.96** |
| **Annual Report** | 96.7% | 95.6% | **0.96** |
| **Investor Presentation** | 97.3% | 94.2% | **0.96** |
| **Voting Results** | 94.9% | 96.4% | **0.96** |
| **Audit Report** | 94.7% | 96.4% | **0.96** |
| **Director's Dealing** | 95.9% | 95.0% | **0.95** |
| **Dividend Notice** | 97.8% | 93.0% | **0.95** |
| **Fund Factsheet** | 96.0% | 94.4% | **0.95** |
| **Net Asset Value (NAV)** | 92.6% | 97.6% | **0.95** |
| **Interim / Quarterly Report** | 93.7% | 96.3% | **0.95** |
| **AGM Information** | 95.0% | 93.9% | **0.94** |
| **Remuneration Info** | 97.3% | 91.4% | **0.94** |
| **Report Publication Announcement** | 93.5% | 94.8% | **0.94** |
| **Earnings Release** | 93.2% | 94.0% | **0.94** |
| **ESG / Sustainability Info** | 96.3% | 90.7% | **0.93** |
| **Governance Info** | 97.1% | 89.5% | **0.93** |
| **Capital/Financing Update** | 97.0% | 89.2% | **0.93** |
| **Call Transcript** | **100.0%** | 86.7% | **0.93** |
| **Major Shareholding Notification** | 93.0% | 92.5% | **0.93** |
| **Board/Management Info** | 91.8% | 93.4% | **0.93** |
| **Transaction in Own Shares** | 90.0% | 94.8% | **0.92** |
| **Legal Proceedings** | 92.7% | 90.3% | **0.91** |
| **Regulatory Filings (Generic)** | 89.3% | 93.2% | **0.91** |
| **Management Reports** | 90.8% | 88.2% | **0.89** |
| **M&A Activity** | 95.1% | 81.4% | **0.88** |
| **Share Issue/Capital Change** | 86.0% | 89.3% | **0.88** |
| **Delisting Announcement** | **100.0%** | 75.9% | **0.86** |
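
The F1 column can be re-derived from the precision and recall columns, since F1 is their harmonic mean. The sketch below checks three rows from the table; the function name `f1_score` is only an illustrative helper, not part of this repository.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows copied from the table above, converted from percentages to fractions.
rows = {
    "Call Transcript": (1.000, 0.867),
    "M&A Activity": (0.951, 0.814),
    "Delisting Announcement": (1.000, 0.759),
}

for name, (precision, recall) in rows.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.2f}")
```

This also shows why the two classes with 100% precision still land below the top of the table: their lower recall pulls the harmonic mean down.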

---

## 🏗️ Architecture

The system uses a **2-Stage Soft-Routing Architecture** to break the "Semantic Ceiling" often found in flat classifiers:

1.  **Level 1 (The Router):** A Jina-V3 embedding model feeds an XGBoost Router that predicts one of 8 Main Categories (e.g., "Financial Reporting", "Equity Information").
2.  **Level 2 (The Specialists):** The document is passed to the top-2 most likely Specialist Models, which compete to assign the final fine-grained label.
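
The two stages above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the function `route_top2`, the toy probabilities, and in particular the scoring rule (router probability multiplied by specialist probability) are assumptions, since the card does not state how the specialists "compete".

```python
from typing import Callable, Dict, List, Tuple

# A specialist maps a feature vector to {fine_grained_label: probability}.
Specialist = Callable[[List[float]], Dict[str, float]]


def route_top2(
    router_probs: Dict[str, float],
    specialists: Dict[str, Specialist],
    features: List[float],
) -> Tuple[str, str, float]:
    """Return (category, label, score) using top-2 soft routing."""
    # Keep the two main categories the router considers most likely.
    top2 = sorted(router_probs.items(), key=lambda kv: kv[1], reverse=True)[:2]

    best = ("", "", -1.0)
    for category, p_category in top2:
        for label, p_label in specialists[category](features).items():
            # Assumed scoring rule: joint probability of category and label.
            score = p_category * p_label
            if score > best[2]:
                best = (category, label, score)
    return best


# Toy stand-ins for the trained router and specialist models.
router_probs = {"Financial Reporting": 0.55, "Investor Comm": 0.40, "Debt Information": 0.05}
specialists = {
    "Financial Reporting": lambda x: {"Annual Report": 0.50, "Earnings Release": 0.50},
    "Investor Comm": lambda x: {"Report Publication Announcement": 0.90, "Call Transcript": 0.10},
    "Debt Information": lambda x: {"Interest Rate Notice": 1.0},
}

print(route_top2(router_probs, specialists, features=[]))
```

In this toy case the router's second choice wins because its specialist is far more confident, which is the situation a hard top-1 router cannot recover from.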

## ⚠️ Critical Usage Note: The "Wrapper Effect"

Financial documents are often massive (500+ pages) but must be truncated to fit into GPU memory for embedding. However, **Document Length** is a critical feature for distinguishing a full *Annual Report* from a short *Press Release* announcing it.

**To reproduce the reported 93.5% weighted F1, you must decouple embedding text from feature engineering:**

1.  **Embedding (GPU):** Pass the truncated text (e.g., first 32k characters) to Jina-V3.
2.  **Feature Vector (XGBoost):** Calculate `log1p(length)` using the **True Original Length** of the document, not the truncated string length.

*If you do not provide the original length, the model will assume the document is short and may misclassify massive Annual Reports as simple Press Releases.*
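
A minimal sketch of the decoupling described above, under stated assumptions: the helper names `prepare_document` and `build_feature_vector` are hypothetical, and appending the length feature at the end of the embedding is a guess at the layout, not something this card specifies.

```python
import math

MAX_EMBED_CHARS = 32_000  # truncation limit taken from the example above


def prepare_document(full_text: str, max_chars: int = MAX_EMBED_CHARS):
    """Return (text_for_embedding, length_feature) for one document."""
    original_length = len(full_text)      # measure BEFORE truncating
    truncated = full_text[:max_chars]     # only this part goes to the encoder
    length_feature = math.log1p(original_length)
    return truncated, length_feature


def build_feature_vector(embedding, length_feature):
    """Append the length feature to the embedding (position is an assumption)."""
    return list(embedding) + [length_feature]


# A synthetic 2.5-million-character document.
full_text = "x" * 2_500_000
truncated, length_feature = prepare_document(full_text)

# What the model would see if the length were taken from the truncated text.
wrong_feature = math.log1p(len(truncated))
print(f"true length feature: {length_feature:.2f}, truncated: {wrong_feature:.2f}")
```

The true-length feature comes out at roughly 14.7, while the truncated one is about 10.4, the same value every document longer than 32k characters would receive.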

## 💻 Usage

```python
from huggingface_hub import snapshot_download
import sys

# 1. Download Models
model_path = snapshot_download(repo_id="FinancialReports/hierarchical-filing-classifier")

# 2. Add path and import wrapper
sys.path.append(model_path)
from inference_wrapper import FinancialFilingClassifier

# 3. Initialize
classifier = FinancialFilingClassifier(model_path)

# 4. Scenario: a ~2.5 MB Annual Report
real_doc_length = 2_500_000  # true length: 2.5 million characters
truncated_text = "ACME CORP ANNUAL REPORT 2024... [Truncated at 32k chars]"

# 5. Predict. The call below only passes the text; real_doc_length must also
#    reach the classifier through whatever length argument your wrapper/API
#    exposes (its name is not shown in this card).
result = classifier.predict(
    text=truncated_text,
    # Logic note: ensure the classifier applies log1p to real_doc_length
    # instead of len(truncated_text) before passing it to XGBoost.
)

print(result)
# Output:
# {
#   'category': 'Financial Reporting', 
#   'label': 'Annual Report', 
#   'score': 0.985,
# }
```

## 📂 Taxonomy (29 Classes)

The model classifies documents into this hierarchy:

| **Financial Reporting** | **Equity Information** | **Listing & Regulatory** |
| :--- | :--- | :--- |
| • Annual Report<br>• Earnings Release<br>• Interim / Quarterly Report<br>• Audit Report | • Major Shareholding Notification<br>• Transaction in Own Shares (Buyback)<br>• Share Issue / Capital Change<br>• Notice of Dividend Amount | • Regulatory Filings (RNS)<br>• Delisting Announcement<br>• Prospectus<br>• Registration Form |

| **AGM Information** | **Management** | **Investor Comm** |
| :--- | :--- | :--- |
| • AGM Information (Pre/Post)<br>• Voting Results<br>• Proxy Solicitation | • Director's Dealing<br>• Management Reports<br>• Remuneration Info<br>• Board Changes | • Investor Presentation<br>• Call Transcript<br>• Report Publication Announcement |

| **M&A and Legal** | **Debt Information** | **Investment Vehicle** |
| :--- | :--- | :--- |
| • M&A Activity<br>• Legal Proceedings Report | • Capital/Financing Update<br>• Interest Rate Notice | • Net Asset Value (NAV)<br>• Fund Factsheet |

## 📜 The Standard: Financial Reporting Classification Framework (FRCF)

The taxonomy used by this model is based on the **[Financial Reporting Classification Framework (FRCF)](https://financialreports.eu/financial-reporting-classification-framework/)**, an open-source standard designed to organize corporate disclosures in a consistent, cross-jurisdictional format.

Unlike fragmented regulatory schemes, the FRCF organizes disclosures by **functional purpose**, ensuring comparability across markets (e.g., mapping a US *10-K* and a European *Annual Financial Report* to the same standardized `Annual Report` category).

* **[Explore the Framework](https://financialreports.eu/financial-reporting-classification-framework/)**
* **[Download Methodology (PDF)](https://financialreports.eu/download/frcf-methodology.pdf)**
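
The functional mapping described in this section can be pictured as a lookup from jurisdiction-specific form names to FRCF categories. The sketch below is illustrative only: it contains just the two entries given as an example in the text, and the names `JURISDICTION_TO_FRCF` and `to_frcf` are hypothetical, not part of the framework or this repository.

```python
from typing import Optional

# Illustrative lookup: both entries come from the example in the text above.
# A real mapping would follow the published FRCF methodology.
JURISDICTION_TO_FRCF = {
    ("US", "10-K"): "Annual Report",
    ("EU", "Annual Financial Report"): "Annual Report",
}


def to_frcf(jurisdiction: str, local_form: str) -> Optional[str]:
    """Return the FRCF category for a local form type, or None if unmapped."""
    return JURISDICTION_TO_FRCF.get((jurisdiction, local_form))


print(to_frcf("US", "10-K"))
print(to_frcf("EU", "Annual Financial Report"))
```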

## 📚 Training Data

The model was trained on a proprietary **Golden Dataset of 27,671 financial filings**, manually curated to represent the diverse landscape of global corporate reporting.

* **Source:** Real-world filings from listed companies across **Europe (primary focus)**, North America, and Asia.
* **Multilingual:** Includes documents in English, French, German, and other major European languages (leveraging the multilingual capabilities of Jina-V3).
* **Diversity:** The dataset preserves the natural "long-tail" distribution of financial data, ranging from massive 500+ page **Annual Reports** to single-page **Press Releases** and complex **ESG Disclosures**.
* **Quality Control:** Mapped to a strict 2-level hierarchy to resolve semantic ambiguities common in regulatory filings (e.g., distinguishing a *Share Buyback* announcement from a *Director's Dealing* notification).

## ⚙️ Deployment & Hardware

This model is optimized for **GPU inference** because the Jina encoder processes a context window of up to 8,192 tokens. CPU inference is possible but significantly slower.

### Recommended Configuration

| Component | Recommendation | Notes |
| :--- | :--- | :--- |
| **GPU** | **NVIDIA T4 (16GB)** | The "Sweet Spot" for cost/performance. Capable of ~40–50 docs/sec in batch mode (batch size 64). |
| **Alternative** | NVIDIA L4 / A10 | Recommended for high-concurrency production APIs. |
| **VRAM** | 16 GB Minimum | Required to embed long documents without OOM errors. |
| **System RAM** | 16 GB+ | Standard requirement for PyTorch + XGBoost overhead. |

### Critical Environment Settings

To load the underlying Jina-V3 model, you **must** allow remote code execution by setting the following environment variable in your deployment (Docker, Kubernetes, or Hugging Face Endpoints):

```bash
HF_TRUST_REMOTE_CODE=True
```

### Throughput Benchmarks (T4 GPU)
* **Live API Latency:** ~200–500 ms per document.
* **Batch Processing:** ~40 – 50 documents per second (Batch Size: 64).
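
For capacity planning, the batch figures above translate into wall-clock time with simple arithmetic. The sketch below is a rough estimate that assumes the quoted throughput is sustained for the whole run; the helper `batch_time_seconds` is hypothetical, and the workload size simply reuses the 27,671-document dataset size as an example.

```python
def batch_time_seconds(num_docs: int, docs_per_second: float) -> float:
    """Estimated processing time, assuming a constant throughput."""
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return num_docs / docs_per_second


num_docs = 27_671  # example workload: same size as the training dataset

# Lower and upper end of the quoted T4 batch throughput.
for rate in (40, 50):
    minutes = batch_time_seconds(num_docs, rate) / 60
    print(f"{rate} docs/s -> {minutes:.1f} min")
```

Under these assumptions, a workload of that size takes roughly 9 to 12 minutes on a single T4.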