Financial Document Classifier (Jina-v3 Hybrid)

A production-grade classifier for regulatory financial documents, designed to categorize filings into 29 high-value classes.

This model uses a hybrid architecture with two components:

Semantic Encoder: A fine-tuned jina-embeddings-v3 (8192 token context) trained via Contrastive Learning (Triplet Loss).
Classification Head: An XGBoost classifier trained on the semantic vectors plus one metadata feature (log-scaled document length).

Repository: FinancialReports/jina-v3-financial-classifier

📊 Performance Metrics

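This model is designed for high-stakes financial environments where precision is paramount. It supports a confidence-gating workflow: predictions with confidence above 0.75 are accepted automatically, and everything else is routed to manual review.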

Metric	Score	Description
Production Reliability	90.0%	Accuracy on documents with confidence > 75%.
Automation Coverage	89.7%	Percentage of documents handled automatically.
Raw Accuracy	85.44%	Baseline accuracy across all 29 classes.
Macro F1	0.86	Balanced performance across rare and common classes.
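
The gating metrics in the table above can be reproduced on any held-out set from two arrays: the per-document confidence and whether the prediction was correct. The sketch below is illustrative only; the function name `gating_report` and the toy inputs are assumptions, not part of this repository or its evaluation data.

```python
import numpy as np

def gating_report(confidences, correct, threshold=0.75):
    """Summarize a confidence-gated workflow (hypothetical helper).

    confidences: max predicted probability per document.
    correct:     True where the predicted label matched the reference label.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=bool)

    accepted = confidences > threshold      # same rule as the "accept" status
    coverage = accepted.mean()              # share of documents handled automatically
    # Accuracy restricted to accepted documents; undefined if nothing is accepted.
    reliability = correct[accepted].mean() if accepted.any() else float("nan")
    raw_accuracy = correct.mean()           # accuracy with no gating at all

    return {
        "coverage": float(coverage),
        "reliability": float(reliability),
        "raw_accuracy": float(raw_accuracy),
    }

# Toy values for illustration only (not the model's evaluation data).
report = gating_report(
    confidences=[0.95, 0.80, 0.60, 0.99, 0.70],
    correct=[True, True, False, True, True],
)
print(report)
```

Sweeping `threshold` over a validation set shows the trade-off directly: a higher threshold raises reliability and lowers coverage.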

Key Class Performance

Class	Precision	Recall	F1-Score
Director's Dealing	96%	98%	0.97
Remuneration Info	98%	90%	0.94
Net Asset Value	94%	99%	0.96
Voting Results	95%	92%	0.93
Annual Report	90%	86%	0.88
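
The F1 column is the harmonic mean of precision and recall. As a quick consistency check, the sketch below recomputes it from the rounded percentages in the table; the helper name `f1_score` is just for illustration, and small differences from the published values can arise because the inputs are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) copied from the table above.
rows = {
    "Director's Dealing": (0.96, 0.98),
    "Remuneration Info": (0.98, 0.90),
    "Net Asset Value": (0.94, 0.99),
    "Voting Results": (0.95, 0.92),
    "Annual Report": (0.90, 0.86),
}

for name, (p, r) in rows.items():
    print(f"{name}: {f1_score(p, r):.2f}")
```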

🚀 Usage (Python)

Because this is a Hybrid Model (Transformer + Gradient Boosted Tree), you must load the encoder and the classification head separately.

Installation

pip install sentence-transformers xgboost huggingface_hub numpy joblib

Inference Script

import joblib
import xgboost as xgb
import numpy as np
from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer

class FinancialClassifier:
    def __init__(self, repo_id="FinancialReports/jina-v3-financial-classifier"):
        print(f"Loading model from {repo_id}...")
        
        # 1. Load the Brain (Jina Encoder)
        self.encoder = SentenceTransformer(repo_id, trust_remote_code=True)
        self.encoder.max_seq_length = 8192 # Full context window
        
        # 2. Download & Load the Head (XGBoost)
        classifier_path = hf_hub_download(repo_id=repo_id, filename="financial_classifier_final_v1.json")
        self.classifier = xgb.XGBClassifier()
        self.classifier.load_model(classifier_path)
        
        # 3. Download & Load the Label Decoder
        decoder_path = hf_hub_download(repo_id=repo_id, filename="label_decoder_final_v1.pkl")
        self.id2label = joblib.load(decoder_path)
        print("✅ System Ready.")

    def predict(self, text):
        # 1. Feature Extraction (Text + Log-Length)
        # We inject document length to distinguish short Earnings Releases from long Interim Reports.
        embedding = self.encoder.encode([text])[0]
        length_feature = np.log1p(len(text))  # log of the character count (not the token count)
        length_norm = length_feature / 12.0  # Normalized scale
        
        # Combine features
        features = np.hstack([embedding, [length_norm]])
        
        # 2. Inference
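        # XGBoost expects a 2D array of shape (n_samples, n_features)
        probs = self.classifier.predict_proba(features.reshape(1, -1))[0]
        pred_id = int(np.argmax(probs))  # plain int, safe as a dict key or list index
        confidence = float(probs[pred_id])
        label = self.id2label[pred_id]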
        
        return {
            "label": label,
            "confidence": round(confidence, 4),
            "status": "accept" if confidence > 0.75 else "manual_review"
        }

# Example
clf = FinancialClassifier()
doc = "We are pleased to announce the acquisition of..."
result = clf.predict(doc)
print(result)
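
For more than a handful of documents, it is usually cheaper to build the feature matrix for a whole batch at once. The sketch below shows the same feature layout as `predict` (embedding followed by the normalized log-length) in vectorized form. `build_features` is a hypothetical helper, and the zero-filled array stands in for real encoder output so the example runs without downloading the model.

```python
import numpy as np

def build_features(embeddings, texts, length_scale=12.0):
    """Append the normalized log-length column to a batch of embeddings.

    embeddings: array of shape (n_docs, dim), e.g. the encoder output.
    texts:      the n_docs raw strings the embeddings were computed from.
    """
    embeddings = np.asarray(embeddings, dtype=np.float32)
    if embeddings.ndim != 2 or embeddings.shape[0] != len(texts):
        raise ValueError("expected one embedding row per text")

    # Character counts, log-scaled and divided by the same constant as predict().
    lengths = np.array([len(t) for t in texts], dtype=np.float32)
    length_norm = np.log1p(lengths) / length_scale

    # Result has shape (n_docs, dim + 1); the length feature is the last column.
    return np.hstack([embeddings, length_norm[:, None]])

# Stand-in data: real embeddings would come from the encoder.
texts = ["short notice", "a much longer filing " * 50]
fake_embeddings = np.zeros((2, 4), dtype=np.float32)
features = build_features(fake_embeddings, texts)
print(features.shape)
```

With the class above, the real inputs would presumably be `clf.encoder.encode(texts)` for the embeddings and `clf.classifier.predict_proba(features)` for the batch probabilities.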

📂 Taxonomy (29 Classes)

The model classifies text into one of the following categories:

Financial Reporting: Annual Report, Earnings Release, Interim / Quarterly Report, Periodic Financial Results.
Transactions: M&A Activity, Transaction in Own Shares, Share Issue/Capital Change, Capital/Financing Update.
Governance: Board/Management Information, Director's Dealing, Remuneration Information, Governance Information, Proxy Solicitation.
Shareholder Meetings: AGM Information, Declaration of Voting Results.
Funds: Net Asset Value, Fund Information / Factsheet, Notice of Dividend Amount.
Legal/Compliance: Regulatory Filings, Legal Proceedings Report, Delisting Announcement.

🔧 Training Details

Base Model: jinaai/jina-embeddings-v3
Context Window: 8192 Tokens (Truncation Strategy: Tail)
Training Data: 27,000+ synthetic and augmented financial filings.
Objective: Batch All Triplet Loss (Contrastive Learning).
Hardware: Trained on NVIDIA A100 (40GB).
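
To make the objective concrete, here is a minimal NumPy sketch of a batch-all triplet loss: every valid (anchor, positive, negative) combination in a batch contributes, and the loss is averaged over the triplets that still violate the margin. The Euclidean distance, the `margin=1.0` default, and the averaging rule are assumptions for illustration; the training configuration actually used is not stated here. sentence-transformers provides a `BatchAllTripletLoss` class for real training runs.

```python
import numpy as np

def batch_all_triplet_loss(embeddings, labels, margin=1.0):
    """Batch-all triplet loss over one batch (illustrative sketch)."""
    x = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)

    # Pairwise Euclidean distances, shape (n, n).
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    same = y[:, None] == y[None, :]
    not_self = ~np.eye(len(y), dtype=bool)
    pos_mask = same & not_self   # (anchor, positive): same label, different item
    neg_mask = ~same             # (anchor, negative): different label

    # loss[a, p, n] = d(a, p) - d(a, n) + margin, clipped at zero.
    loss = dist[:, :, None] - dist[:, None, :] + margin
    valid = pos_mask[:, :, None] & neg_mask[:, None, :]
    loss = np.where(valid, np.maximum(loss, 0.0), 0.0)

    # Average over triplets that still violate the margin.
    n_active = int((loss > 1e-16).sum())
    return float(loss.sum() / n_active) if n_active else 0.0

# Two well-separated classes: no triplet violates the margin.
separated = batch_all_triplet_loss(
    [[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]], [0, 0, 1, 1]
)
# A negative sitting between two positives: both triplets are active.
overlapping = batch_all_triplet_loss(
    [[0.0, 0.0], [1.0, 0.0], [0.5, 0.0]], [0, 0, 1]
)
print(separated, overlapping)
```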

Limitations

Management Reports: The model occasionally confuses generic "Management Reports" with "Periodic Financial Results" due to high semantic overlap. Confidence gating is recommended.
Hybrid Requirement: This model cannot be loaded with AutoModelForSequenceClassification alone; it requires the accompanying XGBoost artifacts found in this repository.

Model size: 0.6B params (Safetensors, tensor type BF16)
