---
# Model Card generated based on AutoTrain run
# Date: 2025-04-07
language:
- en # Primarily English from EDGAR
- multilingual # Assumed multilingual from European sources & XLM-R base
library_name: transformers
license: apache-2.0 # Or appropriate license
tags:
- text-classification
- financial-filings
- xlm-roberta
- autotrain
pipeline_tag: text-classification
base_model: FacebookAI/xlm-roberta-large
widget:
- text: "ACME Corp today announced its results for the fourth quarter..."
  example_title: "Example Filing Snippet"
datasets:
- custom # Combined Labelbox and EDGAR data
model-index:
- name: FinancialReports/filing-classification-xlmr # Model Repo ID
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: Combined Financial Filings (Labelbox + EDGAR)
      split: validation
    metrics:
      - type: accuracy
        value: 0.9617
        name: Accuracy
      - type: f1
        value: 0.6470
        name: F1 (Macro)
      - type: f1
        value: 0.9597
        name: F1 (Weighted)
      - type: loss
        value: 0.1687
        name: Loss
---

# Model Card: FinancialReports Filing Classifier

## Model Details

* **Model Name:** `FinancialReports/filing-classification-xlmr` (Assumed Repo ID based on AutoTrain project & org)
* **Description:** This model is a fine-tuned version of `FacebookAI/xlm-roberta-large` designed for multi-class text classification of financial filing documents. It classifies input text (expected in markdown format) into one of 37 predefined filing type categories.
* **Base Model:** [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large)
* **Developed by:** FinancialReports ([financialreports.eu](https://financialreports.eu))
* **Model Version:** 1.0
* **Fine-tuning Framework:** Hugging Face AutoTrain

## Intended Use

* **Primary Use:** To automatically classify financial filing documents based on their textual content into one of 37 categories (e.g., Annual Report, Quarterly Report, Directors' Dealings, etc.).
* **Primary Users:** Financial analysts, data providers, regulatory compliance teams, researchers associated with FinancialReports.
* **Out-of-Scope Uses:** This model is not designed for sentiment analysis, named entity recognition, or classification tasks outside the defined 37 financial filing types. Performance on filing types significantly different from those in the training data is not guaranteed.

## Training Data

* **Dataset:** The model was fine-tuned on a combined dataset of approximately 14,233 financial filing documents.
* **Sources:**
    * ~9,700 documents custom-labeled via Labelbox, likely originating from European companies (potentially multilingual).
    * ~4,500 documents sourced from the US EDGAR database (English).
* **Preprocessing:** Document text was converted to Markdown format before training. AutoTrain handled the train/validation split (typically 80/20 or 90/10).
* **Labels:** The dataset covers 37 distinct filing type classifications. Due to the data sources, there is an imbalance, with some filing types being much more frequent than others.

## Training Procedure

* **Framework:** Hugging Face AutoTrain UI running within a Hugging Face Space.
* **Hardware:** Nvidia T4 GPU (small configuration).
* **Base Model:** `FacebookAI/xlm-roberta-large`
* **Key Hyperparameters (from AutoTrain):**
    * Epochs: 3
    * Batch Size: 8
    * Learning Rate: 5e-5
    * Max Sequence Length: 512
    * Optimizer: AdamW
    * Scheduler: Linear warmup
    * Mixed Precision: fp16
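
To make the hyperparameters above more concrete, the following sketch estimates the number of optimizer steps and traces a linear warmup-then-decay learning-rate curve. The 80/20 split, the 10% warmup ratio, the absence of gradient accumulation, and the helper names are assumptions for illustration only; the run's actual values are not stated here, and AutoTrain's schedule may differ.

```python
import math

TOTAL_DOCS = 14_233    # approximate dataset size from the Training Data section
TRAIN_FRACTION = 0.8   # assumption: 80/20 train/validation split
BATCH_SIZE = 8
EPOCHS = 3
PEAK_LR = 5e-5
WARMUP_RATIO = 0.1     # assumption: not reported for this run


def total_training_steps(total_docs, train_fraction, batch_size, epochs):
    """Optimizer steps, assuming one step per batch (no gradient accumulation)."""
    train_docs = int(total_docs * train_fraction)
    steps_per_epoch = math.ceil(train_docs / batch_size)
    return steps_per_epoch * epochs


def linear_warmup_lr(step, total_steps, peak_lr, warmup_ratio):
    """Ramp linearly from 0 to peak_lr, then decay linearly back to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


total_steps = total_training_steps(TOTAL_DOCS, TRAIN_FRACTION, BATCH_SIZE, EPOCHS)
for step in (0, total_steps // 10, total_steps // 2, total_steps):
    lr = linear_warmup_lr(step, total_steps, PEAK_LR, WARMUP_RATIO)
    print(f"step {step:>5}: lr = {lr:.2e}")
```

Under these assumptions the run takes 4,272 optimizer steps. A 90/10 split would raise that figure slightly.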

## Evaluation Results

The following metrics were reported by AutoTrain based on its internal validation split:

* **Loss:** 0.1687
* **Accuracy / F1 Micro:** 0.9617 (96.2%)
* **F1 Weighted:** 0.9597 (96.0%)
* **F1 Macro:** 0.6470 (64.7%)
* *(Precision/Recall scores show a similar pattern)*

**Interpretation:**

The model achieves very high overall accuracy and weighted F1 score, indicating excellent performance on the most common filing types within the dataset. However, the significantly lower **Macro F1 score (64.7%)** reveals a key limitation: the model struggles considerably with **less frequent (minority) filing types**. The high overall accuracy is largely driven by correctly classifying the majority classes. Performance across *all* 37 classes is uneven due to the inherent class imbalance in the training data.
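
The gap between weighted and macro F1 can be reproduced on a toy example. The sketch below uses invented labels and counts (90 / 8 / 2 documents across three classes), not the model's actual validation data, and computes the three averages in plain Python to show how a missed rare class barely moves accuracy while pulling macro F1 down sharply.

```python
from collections import Counter

# Toy data (hypothetical): one dominant class and two rare ones.
y_true = ["Annual Report"] * 90 + ["Quarterly Report"] * 8 + ["Directors' Dealings"] * 2
# The classifier gets every majority document right, misses 2 of 8
# quarterly reports, and never predicts the rarest class.
y_pred = (
    ["Annual Report"] * 90
    + ["Quarterly Report"] * 6 + ["Annual Report"] * 2
    + ["Annual Report"] * 2
)


def per_class_f1(y_true, y_pred):
    """F1 per label, computed as 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


f1 = per_class_f1(y_true, y_pred)
support = Counter(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1.values()) / len(f1)                                   # every class counts equally
weighted_f1 = sum(f1[c] * support[c] for c in f1) / len(y_true)         # classes weighted by frequency

print(f"accuracy={accuracy:.3f} weighted_f1={weighted_f1:.3f} macro_f1={macro_f1:.3f}")
```

On this toy data the accuracy is 0.960, the weighted F1 is about 0.949, and the macro F1 is about 0.612. The pattern resembles the reported metrics: the rare class with an F1 of 0 contributes one third of the macro average but only 2% of the weighted one.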

## Limitations and Bias

* **Performance on Rare Classes:** As highlighted by the evaluation metrics, the model's ability to correctly identify infrequent filing types is significantly lower than for common types. Users should be cautious when relying on predictions for rare categories and consider using the confidence scores to flag low-confidence predictions for manual review.
* **Data Source Bias:** The training data primarily comes from European and US sources. The model's performance on filings from other geographical regions, or on filings written in languages that are poorly represented in XLM-RoBERTa's pretraining data or in the fine-tuning data, is unknown and likely lower.
* **Markdown Formatting:** The model expects input text in Markdown format, similar to the training data. Performance may degrade on plain text or other formats.
* **Out-of-Distribution Data:** The model can only classify documents into the 37 types it was trained on. It cannot identify entirely new or unforeseen filing types.
* **Ambiguity:** Some filings may be genuinely ambiguous or borderline between categories, potentially leading to lower confidence predictions or misclassifications.
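
One way to act on the rare-class and ambiguity caveats above is to route uncertain predictions to a human reviewer. The sketch below is a minimal illustration that works on the list-of-dicts format returned by the `text-classification` pipeline with `top_k`; the function name, the 0.80 score threshold, and the 0.20 margin are hypothetical choices that would need tuning on held-out data.

```python
REVIEW_THRESHOLD = 0.80  # assumption: minimum top score to accept automatically
MIN_MARGIN = 0.20        # assumption: minimum gap between the top two scores


def route_prediction(predictions, threshold=REVIEW_THRESHOLD, min_margin=MIN_MARGIN):
    """Pick the top label and flag it for manual review if it looks uncertain.

    `predictions` is a list of {"label": str, "score": float} dicts.
    """
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    top = ranked[0]
    runner_up_score = ranked[1]["score"] if len(ranked) > 1 else 0.0
    needs_review = (
        top["score"] < threshold
        or (top["score"] - runner_up_score) < min_margin
    )
    return {"label": top["label"], "score": top["score"], "needs_review": needs_review}


# Mock pipeline outputs (illustrative values, not real model predictions).
confident = [
    {"label": "Quarterly Report", "score": 0.98},
    {"label": "Earnings Release", "score": 0.01},
]
borderline = [
    {"label": "Annual Report", "score": 0.55},
    {"label": "Quarterly Report", "score": 0.40},
]

print(route_prediction(confident))
print(route_prediction(borderline))
```

Softmax scores are not calibrated probabilities, so a fixed threshold is only a rough filter. For rare filing types in particular, a stricter threshold may be appropriate.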

## How to Use

You can use this model via the Hugging Face `transformers` library:

```python
from transformers import pipeline

# Load the classifier pipeline (replace with your actual model repo ID on the Hub)
model_repo_id = "FinancialReports/filing-classification-xlmr"
classifier = pipeline("text-classification", model=model_repo_id)

# Example usage
filing_text = """
## ACME Corp Q4 Results

ACME Corporation today announced financial results for its fourth quarter ended December 31...
(Insert markdown filing text here)
"""

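# Get top predictions with scores (confidence).
# truncation/max_length keep long filings within the 512-token limit used in training.
predictions = classifier(filing_text, top_k=5, truncation=True, max_length=512)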
print(predictions)
# Expected output format:
# [{'label': 'Quarterly Report', 'score': 0.98}, {'label': 'Earnings Release', 'score': 0.01}, ...]

# --- To get probabilities for all classes ---
# from transformers import AutoTokenizer, AutoModelForSequenceClassification
# import torch
#
# tokenizer = AutoTokenizer.from_pretrained(model_repo_id)
# model = AutoModelForSequenceClassification.from_pretrained(model_repo_id)
# inputs = tokenizer(filing_text, return_tensors="pt", truncation=True, padding=True, max_length=512)
# with torch.no_grad():
#     logits = model(**inputs).logits
# probabilities = torch.softmax(logits, dim=-1)[0] # Get probabilities for first item
# results = [{"label": model.config.id2label[i], "score": prob.item()} for i, prob in enumerate(probabilities)]
# results.sort(key=lambda x: x["score"], reverse=True)
# print(results)
```

## Citation

The repository URL in this entry is assumed.

```bibtex
@misc{financialreports_filing_classifier_2025,
  author       = {FinancialReports},
  title        = {XLM-RoBERTa-Large Financial Filing Classifier},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/FinancialReports/filing-classification-xlmr}}
}
```