Financial Document Classifier (Jina-v3 Hybrid)

A production-grade classifier for regulatory financial documents, designed to categorize filings into 29 high-value classes.

The model uses a hybrid architecture:

  1. Semantic Encoder: A fine-tuned jina-embeddings-v3 (8192-token context) trained with contrastive learning (triplet loss).
  2. Classification Head: An XGBoost classifier trained on the semantic vectors plus a metadata feature (document length).

Repository: FinancialReports/jina-v3-financial-classifier

📊 Performance Metrics

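This model is designed for high-stakes financial environments where precision is paramount. It supports a confidence-gating workflow: predictions with confidence above 0.75 are accepted automatically, and the rest are routed to manual review.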

| Metric | Score | Description |
|---|---|---|
| Production Reliability | 90.0% | Accuracy on documents with confidence > 75%. |
| Automation Coverage | 89.7% | Percentage of documents handled automatically. |
| Raw Accuracy | 85.44% | Baseline accuracy across all 29 classes. |
| Macro F1 | 0.86 | Balanced performance across rare and common classes. |
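
The gated metrics above can be computed from held-out predictions along the following lines. This is a minimal sketch, not the evaluation code behind the reported numbers: the function name `gating_report` and the toy inputs are hypothetical, and only the 0.75 threshold comes from this card.

```python
def gating_report(confidences, predictions, labels, threshold=0.75):
    """Return (coverage, gated_accuracy) for a confidence threshold.

    coverage: share of documents whose top-class confidence exceeds the threshold.
    gated_accuracy: accuracy measured only on those accepted documents.
    """
    if not (len(confidences) == len(predictions) == len(labels)):
        raise ValueError("inputs must have the same length")
    accepted = [i for i, c in enumerate(confidences) if c > threshold]
    if not accepted:
        return 0.0, float("nan")
    correct = sum(1 for i in accepted if predictions[i] == labels[i])
    coverage = len(accepted) / len(confidences)
    return coverage, correct / len(accepted)


# Toy data for illustration only (not results from this model).
confidences = [0.95, 0.80, 0.60, 0.99]
predictions = ["Annual Report", "Net Asset Value", "Earnings Release", "Annual Report"]
labels = ["Annual Report", "Earnings Release", "Earnings Release", "Annual Report"]
coverage, gated_accuracy = gating_report(confidences, predictions, labels)
```

Raising the threshold typically trades coverage for accuracy on the accepted documents, so both numbers are worth tracking together.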

Key Class Performance

| Class | Precision | Recall | F1-Score |
|---|---|---|---|
| Director's Dealing | 96% | 98% | 0.97 |
| Remuneration Info | 98% | 90% | 0.94 |
| Net Asset Value | 94% | 99% | 0.96 |
| Voting Results | 95% | 92% | 0.93 |
| Annual Report | 90% | 86% | 0.88 |
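
Each F1-Score above is the harmonic mean of the precision and recall in the same row. A quick check, using the Director's Dealing row as input (`f1_score` is a hypothetical helper name, not part of this repository):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Director's Dealing row: precision 96%, recall 98%.
directors_dealing_f1 = round(f1_score(0.96, 0.98), 2)  # 0.97, matching the table
```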

🚀 Usage (Python)

Because this is a Hybrid Model (Transformer + Gradient Boosted Tree), you must load the encoder and the classification head separately.

Installation

pip install sentence-transformers xgboost huggingface_hub numpy joblib scikit-learn

Inference Script

import joblib
import xgboost as xgb
import numpy as np
from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer

class FinancialClassifier:
    def __init__(self, repo_id="FinancialReports/jina-v3-financial-classifier"):
        print(f"Loading model from {repo_id}...")
        
        # 1. Load the Brain (Jina Encoder)
        self.encoder = SentenceTransformer(repo_id, trust_remote_code=True)
        self.encoder.max_seq_length = 8192 # Full context window
        
        # 2. Download & Load the Head (XGBoost)
        classifier_path = hf_hub_download(repo_id=repo_id, filename="financial_classifier_final_v1.json")
        self.classifier = xgb.XGBClassifier()
        self.classifier.load_model(classifier_path)
        
        # 3. Download & Load the Label Decoder
        decoder_path = hf_hub_download(repo_id=repo_id, filename="label_decoder_final_v1.pkl")
        self.id2label = joblib.load(decoder_path)
        print("✅ System Ready.")

    def predict(self, text):
        # 1. Feature Extraction (Text + Log-Length)
        # We inject document length to distinguish short Earnings Releases from long Interim Reports.
        embedding = self.encoder.encode([text])[0]
        length_feature = np.log1p(len(text))
        length_norm = length_feature / 12.0 # Normalized scale
        
        # Combine features
        features = np.hstack([embedding, [length_norm]])
        
        # 2. Inference
        # predict_proba expects a 2D array of shape (n_samples, n_features)
        probs = self.classifier.predict_proba(features.reshape(1, -1))[0]
        pred_id = int(np.argmax(probs))
        confidence = float(probs[pred_id])
        label = self.id2label[pred_id]
        
        return {
            "label": label,
            "confidence": round(confidence, 4),
            "status": "accept" if confidence > 0.75 else "manual_review"
        }

# Example
clf = FinancialClassifier()
doc = "We are pleased to announce the acquisition of..."
result = clf.predict(doc)
print(result)
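
For many documents at once, the feature step in `predict` can be vectorized. The sketch below mirrors that step (log1p of the character length, divided by 12.0, appended to each embedding). `build_features` is a hypothetical helper, and the random array merely stands in for encoder output.

```python
import numpy as np


def build_features(embeddings, texts, length_scale=12.0):
    """Append the normalized log-length feature to each embedding row.

    embeddings: array of shape (n_docs, dim), e.g. from encoder.encode(texts).
    texts: the n_docs raw strings the embeddings were computed from.
    """
    embeddings = np.asarray(embeddings)
    if embeddings.ndim != 2 or embeddings.shape[0] != len(texts):
        raise ValueError("expected one embedding row per text")
    lengths = np.array([len(t) for t in texts], dtype=float)
    length_norm = np.log1p(lengths) / length_scale
    return np.hstack([embeddings, length_norm[:, None]])


# Stand-in embeddings; a real call would use the encoder's output instead.
texts = ["Short notice.", "A much longer interim report body " * 50]
fake_embeddings = np.random.default_rng(0).normal(size=(2, 8))
features = build_features(fake_embeddings, texts)  # shape (2, 9)
```

The resulting matrix has one row per document, so it can be passed to the classifier's `predict_proba` in a single call.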

📂 Taxonomy (29 Classes)

The model assigns each document to one of 29 classes, including the following, grouped by category:

  • Financial Reporting: Annual Report, Earnings Release, Interim / Quarterly Report, Periodic Financial Results.
  • Transactions: M&A Activity, Transaction in Own Shares, Share Issue/Capital Change, Capital/Financing Update.
  • Governance: Board/Management Information, Director's Dealing, Remuneration Information, Governance Information, Proxy Solicitation.
  • Shareholder Meetings: AGM Information, Declaration of Voting Results.
  • Funds: Net Asset Value, Fund Information / Factsheet, Notice of Dividend Amount.
  • Legal/Compliance: Regulatory Filings, Legal Proceedings Report, Delisting Announcement.

🔧 Training Details

  • Base Model: jinaai/jina-embeddings-v3
  • Context Window: 8192 Tokens (Truncation Strategy: Tail)
  • Training Data: 27,000+ synthetic and augmented financial filings.
  • Objective: Batch All Triplet Loss (Contrastive Learning).
  • Hardware: Trained on NVIDIA A100 (40GB).
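
The Batch All Triplet Loss objective averages a hinge loss over every valid (anchor, positive, negative) triplet in a batch that violates the margin. A minimal NumPy sketch of that idea follows. It is not the training code for this model: the Euclidean distance, the margin of 1.0, and the function name are assumptions chosen for illustration.

```python
import numpy as np


def batch_all_triplet_loss(embeddings, labels, margin=1.0):
    """Mean hinge loss over all valid triplets with a positive loss.

    A triplet (a, p, n) is valid when a != p, labels[a] == labels[p],
    and labels[a] != labels[n].
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)

    # Pairwise Euclidean distances, shape (batch, batch).
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))

    # loss[a, p, n] = d(a, p) - d(a, n) + margin
    loss = dist[:, :, None] - dist[:, None, :] + margin

    # Mask out triplets that are not valid.
    same = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    valid = (same & not_self)[:, :, None] & (~same)[:, None, :]

    loss = np.where(valid, np.maximum(loss, 0.0), 0.0)
    num_active = np.count_nonzero(loss > 1e-16)
    return loss.sum() / max(num_active, 1)


# Two well-separated classes: no triplet violates the margin.
separated = batch_all_triplet_loss([[0, 0], [0, 1], [3, 0], [3, 1]], [0, 0, 1, 1])
# A negative sitting between two positives: both valid triplets are penalized.
overlapping = batch_all_triplet_loss([[0, 0], [2, 0], [1, 0]], [0, 0, 1])
```

In the second toy batch the negative is closer to each anchor than the positive is, which is the situation the loss is meant to push the encoder away from.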

Limitations

  • Management Reports: The model occasionally confuses generic "Management Reports" with "Periodic Financial Results" due to high semantic overlap. Confidence gating is recommended.
  • Hybrid Requirement: This model cannot be loaded with AutoModelForSequenceClassification alone; it requires the accompanying XGBoost artifacts found in this repository.
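
One way to apply confidence gating to the confusion noted above is to require a higher confidence for the affected classes. The following is a sketch under stated assumptions: the 0.75 threshold is the one used in the inference script, while the 0.90 strict threshold, the `route` helper, and the exact label strings are hypothetical and should be matched to the label decoder shipped in the repository.

```python
# Labels named in the limitation above; the exact strings are an assumption.
CONFUSABLE = {"Management Reports", "Periodic Financial Results"}


def route(label, confidence, threshold=0.75, strict_threshold=0.90):
    """Return 'accept' or 'manual_review' for one prediction.

    Predictions for the confusable classes must clear the stricter threshold.
    """
    required = strict_threshold if label in CONFUSABLE else threshold
    return "accept" if confidence > required else "manual_review"


status_regular = route("Annual Report", 0.80)
status_confusable = route("Periodic Financial Results", 0.80)
```

With these illustrative settings, the same 0.80 confidence is accepted for an Annual Report but sent to manual review for Periodic Financial Results.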