---
tags:
- financial-filings
- classification
- xgboost
- jina-embeddings-v3
- finance
- nlp
library_name: xgboost
metrics:
- f1: 0.935
- accuracy: 0.95
model-index:
- name: hierarchical-filing-classifier
  results:
  - task:
      type: text-classification
      name: Financial Document Classification
    metrics:
    - type: f1
      value: 0.935
      name: Weighted F1
    - type: accuracy
      value: 0.973
      name: Top-2 Router Accuracy
---

# Financial Reports Hierarchical Classifier

This is a production-grade Hierarchical Cascade Classifier designed to categorize Global and European financial filings into **29 distinct classes**. It powers the classification engine for **FinancialReports**.

## 🚀 Performance Highlights

| Metric | Score | Interpretation |
| :--- | :--- | :--- |
| **Global Weighted F1** | **93.5%** | State-of-the-art performance for unstructured financial text. |
| **Top-2 Router Accuracy** | **97.3%** | The correct specialist is among the two consulted for 97.3% of documents. |
| **Call Transcript Precision** | **100%** | Zero false positives for transcripts on the test set. |
| **Delisting Precision** | **100%** | High-precision signal for critical negative corporate events. |

### Detailed Performance by Filing Type

*Scores based on a hold-out test set of ~5,500 documents.*

| Filing Type | Precision | Recall | F1-Score |
| :--- | :--- | :--- | :--- |
| **Interest Rate Update/Notice** | 98.9% | 98.1% | **0.99** |
| **Proxy Solicitation** | 98.6% | 94.4% | **0.96** |
| **Annual Report** | 96.7% | 95.6% | **0.96** |
| **Investor Presentation** | 97.3% | 94.2% | **0.96** |
| **Voting Results** | 94.9% | 96.4% | **0.96** |
| **Audit Report** | 94.7% | 96.4% | **0.96** |
| **Director's Dealing** | 95.9% | 95.0% | **0.95** |
| **Dividend Notice** | 97.8% | 93.0% | **0.95** |
| **Fund Factsheet** | 96.0% | 94.4% | **0.95** |
| **Net Asset Value (NAV)** | 92.6% | 97.6% | **0.95** |
| **Interim / Quarterly Report** | 93.7% | 96.3% | **0.95** |
| **AGM Information** | 95.0% | 93.9% | **0.94** |
| **Remuneration Info** | 97.3% | 91.4% | **0.94** |
| **Report Publication Announcement** | 93.5% | 94.8% | **0.94** |
| **Earnings Release** | 93.2% | 94.0% | **0.94** |
| **ESG / Sustainability Info** | 96.3% | 90.7% | **0.93** |
| **Governance Info** | 97.1% | 89.5% | **0.93** |
| **Capital/Financing Update** | 97.0% | 89.2% | **0.93** |
| **Call Transcript** | **100.0%** | 86.7% | **0.93** |
| **Major Shareholding Notification** | 93.0% | 92.5% | **0.93** |
| **Board/Management Info** | 91.8% | 93.4% | **0.93** |
| **Transaction in Own Shares** | 90.0% | 94.8% | **0.92** |
| **Legal Proceedings** | 92.7% | 90.3% | **0.91** |
| **Regulatory Filings (Generic)** | 89.3% | 93.2% | **0.91** |
| **Management Reports** | 90.8% | 88.2% | **0.89** |
| **M&A Activity** | 95.1% | 81.4% | **0.88** |
| **Share Issue/Capital Change** | 86.0% | 89.3% | **0.88** |
| **Delisting Announcement** | **100.0%** | 75.9% | **0.86** |

---

## 🏗️ Architecture

The system uses a **2-Stage Soft-Routing Architecture** to break the "Semantic Ceiling" often found in flat classifiers:

1. **Level 1 (The Router):** A Jina-V3 embedding model feeds an XGBoost Router that predicts one of 8 Main Categories (e.g., "Financial Reporting", "Equity Info").
2. **Level 2 (The Specialists):** The document is passed to the top-2 most likely Specialist Models, which compete to assign the final fine-grained label.

## ⚠️ Critical Usage Note: The "Wrapper Effect"

Financial documents are often massive (500+ pages) and must be truncated to fit into GPU memory for embedding. However, **Document Length** is a critical feature for distinguishing a full *Annual Report* from a short *Press Release* announcing it.

**To reproduce the reported 93.5% weighted F1, you must decouple embedding text from feature engineering:**

1. **Embedding (GPU):** Pass the truncated text (e.g., first 32k characters) to Jina-V3.
2. **Feature Vector (XGBoost):** Calculate `log1p(length)` using the **True Original Length** of the document, not the truncated string length.

*If you do not provide the original length, the model will assume the document is short and may misclassify massive Annual Reports as simple Press Releases.*

## 💻 Usage

```python
from huggingface_hub import snapshot_download
import sys

# 1. Download Models
model_path = snapshot_download(repo_id="FinancialReports/hierarchical-filing-classifier")

# 2. Add path and import wrapper
sys.path.append(model_path)
from inference_wrapper import FinancialFilingClassifier

# 3. Initialize
classifier = FinancialFilingClassifier(model_path)

# 4. Scenario: a large Annual Report (~2.5 MB of text)
real_doc_length = 2500000  # 2.5 million chars
truncated_text = "ACME CORP ANNUAL REPORT 2024... [Truncated at 32k chars]"

# 5. Predict (Ensure your wrapper/API handles the length argument)
result = classifier.predict(
    text=truncated_text,
    # Logic note: real_doc_length must reach the classifier so that it applies
    # log1p to the true length instead of len(truncated_text) before passing
    # to XGBoost. Check the inference_wrapper module for the length argument.
)

print(result)
# Output:
# {
#   'category': 'Financial Reporting',
#   'label': 'Annual Report',
#   'score': 0.985,
# }
```

## 📂 Taxonomy (29 Classes)

The model classifies documents into this hierarchy:

| **Financial Reporting** | **Equity Information** | **Listing & Regulatory** |
| :--- | :--- | :--- |
| • Annual Report<br>• Earnings Release<br>• Interim / Quarterly Report<br>• Audit Report | • Major Shareholding Notification<br>• Transaction in Own Shares (Buyback)<br>• Share Issue / Capital Change<br>• Notice of Dividend Amount | • Regulatory Filings (RNS)<br>• Delisting Announcement<br>• Prospectus<br>• Registration Form |

| **AGM Information** | **Management** | **Investor Comm** |
| :--- | :--- | :--- |
| • AGM Information (Pre/Post)<br>• Voting Results<br>• Proxy Solicitation | • Director's Dealing<br>• Management Reports<br>• Remuneration Info<br>• Board Changes | • Investor Presentation<br>• Call Transcript<br>• Report Publication Announcement |

| **M&A and Legal** | **Debt Information** | **Investment Vehicle** |
| :--- | :--- | :--- |
| • M&A Activity<br>• Legal Proceedings Report | • Capital/Financing Update<br>• Interest Rate Notice | • Net Asset Value (NAV)<br>• Fund Factsheet |

## 📜 The Standard: Financial Reporting Classification Framework (FRCF)

The taxonomy used by this model is based on the **[Financial Reporting Classification Framework (FRCF)](https://financialreports.eu/financial-reporting-classification-framework/)**, an open-source standard designed to organize corporate disclosures in a consistent, cross-jurisdictional format.

Unlike fragmented regulatory schemes, the FRCF organizes disclosures by **functional purpose**, ensuring comparability across markets (e.g., mapping a US *10-K* and a European *Annual Financial Report* to the same standardized `Annual Report` category).

* **[Explore the Framework](https://financialreports.eu/financial-reporting-classification-framework/)**
* **[Download Methodology (PDF)](https://financialreports.eu/download/frcf-methodology.pdf)**

## 📚 Training Data

The model was trained on a proprietary **Golden Dataset of 27,671 financial filings**, manually curated to represent the diverse landscape of global corporate reporting.

* **Source:** Real-world filings from listed companies across **Europe (primary focus)**, North America, and Asia.
* **Multilingual:** Includes documents in English, French, German, and other major European languages (leveraging the multilingual capabilities of Jina-V3).
* **Diversity:** The dataset preserves the natural "long-tail" distribution of financial data, ranging from massive 500+ page **Annual Reports** to single-page **Press Releases** and complex **ESG Disclosures**.
* **Quality Control:** Mapped to a strict 2-level hierarchy to resolve semantic ambiguities common in regulatory filings (e.g., distinguishing a *Share Buyback* announcement from a *Director's Dealing* notification).

## ⚙️ Deployment & Hardware

This model is optimized for **GPU Inference** due to the heavy 8192-token context window of the Jina encoder. While CPU inference is possible, it is significantly slower.
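
Only the embedding step needs the GPU; the length feature described in the Wrapper Effect note can be computed on the CPU before the text is truncated. The sketch below is one minimal way to do that, not the project's own code: `prepare_document` and `MAX_EMBED_CHARS` are hypothetical names, and the 32,000-character limit simply mirrors the "first 32k characters" example given earlier.

```python
import math
from typing import Tuple

# Assumed truncation limit, taken from the "first 32k characters" example.
MAX_EMBED_CHARS = 32_000


def prepare_document(full_text: str, max_chars: int = MAX_EMBED_CHARS) -> Tuple[str, float]:
    """Return (text to embed, length feature) for a single document."""
    # Measure the length BEFORE truncating, so the feature reflects the
    # true original document rather than the shortened string.
    length_feature = math.log1p(len(full_text))
    truncated_text = full_text[:max_chars]
    return truncated_text, length_feature


# Two synthetic documents: a very long report and a short press release.
annual_report = "x" * 2_500_000
press_release = "x" * 3_000

text_a, feat_a = prepare_document(annual_report)
text_b, feat_b = prepare_document(press_release)

print(len(text_a), round(feat_a, 2))
print(len(text_b), round(feat_b, 2))
```

Both documents would look similar in size after truncation if the feature were computed from the truncated string; computing it first keeps the two clearly separated.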
### Recommended Configuration

| Component | Recommendation | Notes |
| :--- | :--- | :--- |
| **GPU** | **NVIDIA T4 (16GB)** | The "Sweet Spot" for cost/performance. Capable of ~40–50 docs/sec in batch mode. |
| **Alternative** | NVIDIA L4 / A10 | Recommended for high-concurrency production APIs. |
| **VRAM** | 16 GB Minimum | Required to embed long documents without OOM errors. |
| **System RAM** | 16 GB+ | Standard requirement for PyTorch + XGBoost overhead. |

### Critical Environment Settings

To load the underlying Jina-V3 model, you **must** allow remote code execution by setting the following environment variable in your deployment (Docker, Kubernetes, or Hugging Face Endpoints):

```bash
HF_TRUST_REMOTE_CODE=True
```

### Throughput Benchmarks (T4 GPU)

* **Live API Latency:** ~200ms – 500ms per document.
* **Batch Processing:** ~40 – 50 documents per second (Batch Size: 64).
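
For capacity planning, the batch figures above can be turned into a rough estimate. The following is a minimal sketch under simplifying assumptions: `iter_batches` and `estimate_seconds` are hypothetical helpers, throughput is assumed to scale linearly, and model loading and I/O time are ignored.

```python
from typing import Iterator, List, Sequence

BATCH_SIZE = 64  # batch size quoted in the benchmark above


def iter_batches(items: Sequence[str], batch_size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def estimate_seconds(n_docs: int, docs_per_second: float) -> float:
    """Rough processing time, assuming constant throughput."""
    return n_docs / docs_per_second


# 150 placeholder documents split into batches of 64.
docs = [f"document {i}" for i in range(150)]
batch_sizes = [len(batch) for batch in iter_batches(docs)]
print(batch_sizes)  # [64, 64, 22]

# Time range for 10,000 documents at the quoted 40-50 docs/sec.
slow = estimate_seconds(10_000, 40)
fast = estimate_seconds(10_000, 50)
print(slow, fast)  # 250.0 200.0
```

Real throughput depends on document length and GPU load, so treat these numbers as a lower bound on wall-clock time.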