---
tags:
- financial-filings
- classification
- xgboost
- jina-embeddings-v3
- finance
- nlp
library_name: xgboost
metrics:
- f1: 0.935
- accuracy: 0.95
model-index:
- name: hierarchical-filing-classifier
  results:
  - task:
      type: text-classification
      name: Financial Document Classification
    metrics:
    - type: f1
      value: 0.935
      name: Weighted F1
    - type: accuracy
      value: 0.973
      name: Top-2 Router Accuracy
---

	# Financial Reports Hierarchical Classifier

This is a production-grade hierarchical cascade classifier that categorizes global and European financial filings into 29 distinct classes. It powers the classification engine for FinancialReports.

	## 🚀 Performance Highlights

| Metric | Score | Interpretation |
| :--- | :--- | :--- |
| Global Weighted F1 | 93.5% | State-of-the-art performance for unstructured financial text. |
| Top-2 Router Accuracy | 97.3% | The correct specialist is consulted 97.3% of the time. |
| Call Transcript Precision | 100% | Zero false positives for transcripts. |
| Delisting Precision | 100% | High-precision signal for critical negative corporate events. |
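
Top-2 router accuracy counts a document as correctly routed when its true main category is among the router's two most probable categories. A minimal sketch of that metric in plain Python follows; the `top_k_accuracy` helper and all probabilities are illustrative assumptions, not the project's evaluation code.

```python
def top_k_accuracy(probabilities, true_labels, k=2):
    """Fraction of samples whose true label is among the k most probable labels.

    probabilities: list of dicts mapping category name -> router probability.
    true_labels:   list with the true category name of each sample.
    """
    if not true_labels or len(probabilities) != len(true_labels):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = 0
    for probs, truth in zip(probabilities, true_labels):
        # Sort categories by descending probability and keep the first k.
        top_k = sorted(probs, key=probs.get, reverse=True)[:k]
        hits += truth in top_k
    return hits / len(true_labels)


# Invented router outputs for three documents.
router_probs = [
    {"Financial Reporting": 0.70, "Investor Comm": 0.20, "Equity Information": 0.10},
    {"Financial Reporting": 0.45, "Investor Comm": 0.40, "Equity Information": 0.15},
    {"Financial Reporting": 0.50, "Investor Comm": 0.30, "Equity Information": 0.20},
]
true_categories = ["Financial Reporting", "Investor Comm", "Equity Information"]

top1 = top_k_accuracy(router_probs, true_categories, k=1)
top2 = top_k_accuracy(router_probs, true_categories, k=2)
print(f"top-1: {top1:.2f}, top-2: {top2:.2f}")
```

In this toy data the second document is misrouted under top-1 but recovered under top-2, which is the gap the two-specialist design is meant to close.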

	### Detailed Performance by Filing Type

Scores are based on a hold-out test set of ~5,500 documents.

| Filing Type | Precision | Recall | F1-Score |
| :--- | :--- | :--- | :--- |
| Interest Rate Update/Notice | 98.9% | 98.1% | 0.99 |
| Proxy Solicitation | 98.6% | 94.4% | 0.96 |
| Annual Report | 96.7% | 95.6% | 0.96 |
| Investor Presentation | 97.3% | 94.2% | 0.96 |
| Voting Results | 94.9% | 96.4% | 0.96 |
| Audit Report | 94.7% | 96.4% | 0.96 |
| Director's Dealing | 95.9% | 95.0% | 0.95 |
| Dividend Notice | 97.8% | 93.0% | 0.95 |
| Fund Factsheet | 96.0% | 94.4% | 0.95 |
| Net Asset Value (NAV) | 92.6% | 97.6% | 0.95 |
| Interim / Quarterly Report | 93.7% | 96.3% | 0.95 |
| AGM Information | 95.0% | 93.9% | 0.94 |
| Remuneration Info | 97.3% | 91.4% | 0.94 |
| Report Publication Announcement | 93.5% | 94.8% | 0.94 |
| Earnings Release | 93.2% | 94.0% | 0.94 |
| ESG / Sustainability Info | 96.3% | 90.7% | 0.93 |
| Governance Info | 97.1% | 89.5% | 0.93 |
| Capital/Financing Update | 97.0% | 89.2% | 0.93 |
| Call Transcript | 100.0% | 86.7% | 0.93 |
| Major Shareholding Notification | 93.0% | 92.5% | 0.93 |
| Board/Management Info | 91.8% | 93.4% | 0.93 |
| Transaction in Own Shares | 90.0% | 94.8% | 0.92 |
| Legal Proceedings | 92.7% | 90.3% | 0.91 |
| Regulatory Filings (Generic) | 89.3% | 93.2% | 0.91 |
| Management Reports | 90.8% | 88.2% | 0.89 |
| M&A Activity | 95.1% | 81.4% | 0.88 |
| Share Issue/Capital Change | 86.0% | 89.3% | 0.88 |
| Delisting Announcement | 100.0% | 75.9% | 0.86 |
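
Each F1 score is the harmonic mean of the precision and recall in the same row, F1 = 2PR / (P + R). As a quick check, the sketch below recomputes three rows from the table above. Small differences from published values are possible in general, because the precision and recall shown are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table, converted to fractions.
rows = {
    "Call Transcript": (1.000, 0.867),
    "M&A Activity": (0.951, 0.814),
    "Delisting Announcement": (1.000, 0.759),
}

for filing_type, (p, r) in rows.items():
    print(f"{filing_type}: F1 = {f1_score(p, r):.2f}")
```

This reproduces the 0.93, 0.88, and 0.86 listed for those rows. It also makes the trade-off visible: perfect precision for Delisting Announcements comes with a recall of 75.9%, so roughly one in four delisting documents in the test set received a different label.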

	---

	## 🏗️ Architecture

	The system uses a 2-Stage Soft-Routing Architecture to break the "Semantic Ceiling" often found in flat classifiers:

	1. Level 1 (The Router): A Jina-V3 embedding model feeds an XGBoost Router that predicts one of 8 Main Categories (e.g., "Financial Reporting", "Equity Info").
	2. Level 2 (The Specialists): The document is passed to the top-2 most likely Specialist Models, which compete to assign the final fine-grained label.
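
The two stages above can be sketched as follows. This is an illustration under stated assumptions, not the shipped `inference_wrapper` code: the router and specialists are replaced by hard-coded probability tables, and the rule for how specialists "compete" (multiply the router probability by the specialist probability and keep the best product) is one plausible choice that this card does not confirm.

```python
def route_top_k(router_probs, k=2):
    """Return the k main categories with the highest router probability."""
    return sorted(router_probs, key=router_probs.get, reverse=True)[:k]


def classify(router_probs, specialist_probs, k=2):
    """Soft-routing cascade: consult the top-k specialists and keep the best label.

    router_probs:     dict category -> probability from the Level 1 router.
    specialist_probs: dict category -> (dict label -> probability) from Level 2.
    Scoring rule (assumption): score = P(category) * P(label | category).
    """
    best = None
    for category in route_top_k(router_probs, k):
        for label, p_label in specialist_probs[category].items():
            score = router_probs[category] * p_label
            if best is None or score > best["score"]:
                best = {"category": category, "label": label, "score": score}
    return best


# Hard-coded stand-ins for model outputs (all numbers are invented).
router_probs = {
    "Investor Comm": 0.48,
    "Financial Reporting": 0.45,
    "Equity Information": 0.07,
}
specialist_probs = {
    "Investor Comm": {
        "Report Publication Announcement": 0.55,
        "Investor Presentation": 0.45,
    },
    "Financial Reporting": {"Annual Report": 0.90, "Earnings Release": 0.10},
    "Equity Information": {"Major Shareholding Notification": 1.00},
}

result = classify(router_probs, specialist_probs)
print(result["category"], "/", result["label"])
```

Here the router's first choice is "Investor Comm", yet the final label comes from the second-ranked "Financial Reporting" specialist. A hard top-1 router would have had no way to recover from that near-tie.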

	## ⚠️ Critical Usage Note: The "Wrapper Effect"

	Financial documents are often massive (500+ pages) but must be truncated to fit into GPU memory for embedding. However, Document Length is a critical feature for distinguishing a full Annual Report from a short Press Release announcing it.

To reproduce the reported 93.5% weighted F1, you must decouple the embedding input from feature engineering:

	1. Embedding (GPU): Pass the truncated text (e.g., first 32k characters) to Jina-V3.
	2. Feature Vector (XGBoost): Calculate `log1p(length)` using the True Original Length of the document, not the truncated string length.

	If you do not provide the original length, the model will assume the document is short and may misclassify massive Annual Reports as simple Press Releases.
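
The decoupling can be sketched in a few lines. The encoder here is a hypothetical stand-in (`fake_embed`), and appending the length feature at the end of the embedding is an assumption about the feature layout; the real layout is defined by the repository's wrapper.

```python
import math

MAX_EMBED_CHARS = 32_000  # truncation limit taken from the example in the text


def fake_embed(text, dim=4):
    """Hypothetical stand-in for the Jina-V3 encoder: returns a fixed-size vector."""
    return [(len(text) % (i + 7)) / (i + 7) for i in range(dim)]


def build_features(full_text, embed=fake_embed):
    """Embed the truncated text, then append log1p of the ORIGINAL length."""
    original_length = len(full_text)         # measured before truncation
    truncated = full_text[:MAX_EMBED_CHARS]  # only this part is embedded
    embedding = embed(truncated)
    return embedding + [math.log1p(original_length)]


annual_report = "x" * 2_500_000  # stand-in for a 2.5M-character filing
press_release = "x" * 32_000     # a short document of exactly 32k characters

report_features = build_features(annual_report)
release_features = build_features(press_release)

# Both documents truncate to the same text, so their embeddings are identical;
# only the length feature tells them apart.
print("length feature (report): ", round(report_features[-1], 2))
print("length feature (release):", round(release_features[-1], 2))
```

If the length were taken from the truncated string instead, both documents would produce exactly the same feature vector, which is the failure mode described above.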

	## 💻 Usage

```python
import sys

from huggingface_hub import snapshot_download

# 1. Download the model files
model_path = snapshot_download(repo_id="FinancialReports/hierarchical-filing-classifier")

# 2. Add the download path to sys.path and import the wrapper
sys.path.append(model_path)
from inference_wrapper import FinancialFilingClassifier

# 3. Initialize
classifier = FinancialFilingClassifier(model_path)

# 4. Scenario: a 2.5-million-character Annual Report
real_doc_length = 2_500_000  # true length of the full document, in characters
truncated_text = "ACME CORP ANNUAL REPORT 2024... [Truncated at 32k chars]"

# 5. Predict (ensure your wrapper/API handles the length argument)
result = classifier.predict(
    text=truncated_text,
    # Assumed parameter name; check inference_wrapper.py for the actual one.
    original_length=real_doc_length,
    # Logic note: ensure the classifier applies log1p to this value
    # instead of len(truncated_text) before passing it to XGBoost.
)

print(result)
# Output:
# {
#     'category': 'Financial Reporting',
#     'label': 'Annual Report',
#     'score': 0.985,
# }
```

	## 📂 Taxonomy (29 Classes)

	The model classifies documents into this hierarchy:

| Financial Reporting | Equity Information | Listing & Regulatory |
| :--- | :--- | :--- |
| • Annual Report<br>• Earnings Release<br>• Interim / Quarterly Report<br>• Audit Report | • Major Shareholding Notification<br>• Transaction in Own Shares (Buyback)<br>• Share Issue / Capital Change<br>• Notice of Dividend Amount | • Regulatory Filings (RNS)<br>• Delisting Announcement<br>• Prospectus<br>• Registration Form |

| AGM Information | Management | Investor Comm |
| :--- | :--- | :--- |
| • AGM Information (Pre/Post)<br>• Voting Results<br>• Proxy Solicitation | • Director's Dealing<br>• Management Reports<br>• Remuneration Info<br>• Board Changes | • Investor Presentation<br>• Call Transcript<br>• Report Publication Announcement |

| M&A and Legal | Debt Information | Investment Vehicle |
| :--- | :--- | :--- |
| • M&A Activity<br>• Legal Proceedings Report | • Capital/Financing Update<br>• Interest Rate Notice | • Net Asset Value (NAV)<br>• Fund Factsheet |
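
For downstream code it can help to hold this hierarchy as a plain mapping. The sketch below transcribes only the labels listed in the tables above; the dict layout and the `parent_category` helper are illustrative, and the exact label strings returned by the model may differ.

```python
TAXONOMY = {
    "Financial Reporting": [
        "Annual Report",
        "Earnings Release",
        "Interim / Quarterly Report",
        "Audit Report",
    ],
    "Equity Information": [
        "Major Shareholding Notification",
        "Transaction in Own Shares (Buyback)",
        "Share Issue / Capital Change",
        "Notice of Dividend Amount",
    ],
    "Listing & Regulatory": [
        "Regulatory Filings (RNS)",
        "Delisting Announcement",
        "Prospectus",
        "Registration Form",
    ],
    "AGM Information": [
        "AGM Information (Pre/Post)",
        "Voting Results",
        "Proxy Solicitation",
    ],
    "Management": [
        "Director's Dealing",
        "Management Reports",
        "Remuneration Info",
        "Board Changes",
    ],
    "Investor Comm": [
        "Investor Presentation",
        "Call Transcript",
        "Report Publication Announcement",
    ],
    "M&A and Legal": ["M&A Activity", "Legal Proceedings Report"],
    "Debt Information": ["Capital/Financing Update", "Interest Rate Notice"],
    "Investment Vehicle": ["Net Asset Value (NAV)", "Fund Factsheet"],
}

# Reverse index: fine-grained label -> main category.
LABEL_TO_CATEGORY = {
    label: category for category, labels in TAXONOMY.items() for label in labels
}


def parent_category(label):
    """Look up the main category for a fine-grained label (KeyError if unknown)."""
    return LABEL_TO_CATEGORY[label]


print(parent_category("Call Transcript"))
```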

	## 📜 The Standard: Financial Reporting Classification Framework (FRCF)

	The taxonomy used by this model is based on the [Financial Reporting Classification Framework (FRCF)](https://financialreports.eu/financial-reporting-classification-framework/), an open-source standard designed to organize corporate disclosures in a consistent, cross-jurisdictional format.

	Unlike fragmented regulatory schemes, the FRCF organizes disclosures by functional purpose, ensuring comparability across markets (e.g., mapping a US 10-K and a European Annual Financial Report to the same standardized `Annual Report` category).

	* [Explore the Framework](https://financialreports.eu/financial-reporting-classification-framework/)
	* [Download Methodology (PDF)](https://financialreports.eu/download/frcf-methodology.pdf)
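
The cross-jurisdictional mapping described above amounts to a lookup from a local form name to a standardized category. A minimal sketch, containing only the two mappings named in the text; the dict, the jurisdiction codes, and the `to_frcf` helper are hypothetical and not an official FRCF lookup table.

```python
# Jurisdiction-specific form names -> standardized FRCF category.
# Only the two examples mentioned in this card are included.
FRCF_MAPPING = {
    ("US", "10-K"): "Annual Report",
    ("EU", "Annual Financial Report"): "Annual Report",
}


def to_frcf(jurisdiction, form_name):
    """Return the standardized category, or None when the form is not mapped."""
    return FRCF_MAPPING.get((jurisdiction, form_name))


us = to_frcf("US", "10-K")
eu = to_frcf("EU", "Annual Financial Report")
print(us == eu, us)
```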

	## 📚 Training Data

	The model was trained on a proprietary Golden Dataset of 27,671 financial filings, manually curated to represent the diverse landscape of global corporate reporting.

	* Source: Real-world filings from listed companies across Europe (primary focus), North America, and Asia.
	* Multilingual: Includes documents in English, French, German, and other major European languages (leveraging the multilingual capabilities of Jina-V3).
	* Diversity: The dataset preserves the natural "long-tail" distribution of financial data, ranging from massive 500+ page Annual Reports to single-page Press Releases and complex ESG Disclosures.
	* Quality Control: Mapped to a strict 2-level hierarchy to resolve semantic ambiguities common in regulatory filings (e.g., distinguishing a Share Buyback announcement from a Director's Dealing notification).

	## ⚙️ Deployment & Hardware

	This model is optimized for GPU Inference due to the heavy 8192-token context window of the Jina encoder. While CPU inference is possible, it is significantly slower.

	### Recommended Configuration

| Component | Recommendation | Notes |
| :--- | :--- | :--- |
| GPU | NVIDIA T4 (16GB) | The "Sweet Spot" for cost/performance. Capable of up to ~50 docs/sec in batch mode. |
| Alternative | NVIDIA L4 / A10 | Recommended for high-concurrency production APIs. |
| VRAM | 16 GB Minimum | Required to embed long documents without OOM errors. |
| System RAM | 16 GB+ | Standard requirement for PyTorch + XGBoost overhead. |

	### Critical Environment Settings

To load the underlying Jina-V3 model, you must allow remote code execution by setting the following environment variable in your deployment (Docker, Kubernetes, or Hugging Face Endpoints):

	```bash
	HF_TRUST_REMOTE_CODE=True
	```
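
A service can fail fast when the variable is missing rather than erroring partway through model loading. The following is a hypothetical startup guard, not part of this repository; which component reads the variable, and which spellings it accepts, depends on your serving stack.

```python
import os

TRUTHY = {"1", "true", "yes"}  # accepted spellings are an assumption of this sketch


def remote_code_allowed(env=os.environ):
    """Return True when HF_TRUST_REMOTE_CODE is set to a truthy value."""
    return env.get("HF_TRUST_REMOTE_CODE", "").strip().lower() in TRUTHY


def check_environment(env=os.environ):
    """Raise a clear error at startup instead of failing later during model load."""
    if not remote_code_allowed(env):
        raise RuntimeError(
            "HF_TRUST_REMOTE_CODE is not enabled; the Jina-V3 encoder cannot be loaded."
        )


# Explicit dictionaries keep the example independent of the real environment.
print(remote_code_allowed({"HF_TRUST_REMOTE_CODE": "True"}))
print(remote_code_allowed({}))
```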

	### Throughput Benchmarks (T4 GPU)
* Live API Latency: ~200–500 ms per document.
* Batch Processing: ~40–50 documents per second (batch size: 64).
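
These figures translate into a rough capacity estimate for backfills. The sketch below assumes throughput stays within the quoted 40–50 docs/sec for the whole run and ignores I/O, text extraction, and model loading, so treat the result as a lower bound on wall-clock time.

```python
import math


def batch_plan(num_docs, batch_size=64, docs_per_sec_low=40, docs_per_sec_high=50):
    """Estimate the batch count and wall-clock range (hours) for num_docs documents."""
    batches = math.ceil(num_docs / batch_size)
    fastest_hours = num_docs / docs_per_sec_high / 3600
    slowest_hours = num_docs / docs_per_sec_low / 3600
    return batches, fastest_hours, slowest_hours


batches, fastest, slowest = batch_plan(1_000_000)
print(f"{batches} batches, roughly {fastest:.1f} to {slowest:.1f} hours on one T4")
```

For one million documents this works out to 15,625 batches and about 5.6 to 6.9 hours on a single T4.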
