silashundhausen committed on
Commit 9e1cc31 · verified · 1 Parent(s): 82e8cc8

Update README.md

Files changed (1): README.md +135 -17

README.md CHANGED
@@ -1,37 +1,155 @@
-
  ---
  library_name: transformers
  tags:
- - autotrain
  - text-classification
  base_model: FacebookAI/xlm-roberta-large
  widget:
- - text: "I love AutoTrain"
  ---

- # Model Trained Using AutoTrain

- - Problem type: Text Classification

- ## Validation Metrics
- loss: 0.16869813203811646
- f1_macro: 0.6470233113292341
- f1_micro: 0.9617140850017563
- f1_weighted: 0.9597252404005653
- precision_macro: 0.6657138827178418
- precision_micro: 0.9617140850017563
- precision_weighted: 0.9600327052750102
- recall_macro: 0.6540179851686874
- recall_micro: 0.9617140850017563
- recall_weighted: 0.9617140850017563
- accuracy: 0.9617140850017563
  ---
+ # Model Card generated based on AutoTrain run
+ # Date: 2025-04-05 (Please update with actual date)
+ language:
+ - en # Primarily English from EDGAR
+ - multilingual # Corrected special value
  library_name: transformers
+ license: apache-2.0 # Or appropriate license if you choose one
  tags:
  - text-classification
+ - financial-filings
+ - xlm-roberta
+ - autotrain
+ pipeline_tag: text-classification
  base_model: FacebookAI/xlm-roberta-large
  widget:
+ - text: "ACME Corp today announced its results for the fourth quarter..."
+   example_title: "Example Filing Snippet"
+ datasets:
+ - custom # Combined Labelbox and EDGAR data
+ model-index:
+ - name: xlm-roberta-large-fin-filing-classification # Example Name
+   results:
+   - task:
+       type: text-classification
+       name: Text Classification
+     dataset:
+       type: custom
+       name: Combined Financial Filings (Labelbox + EDGAR)
+       split: validation
+     # Corrected metrics format (array of objects, removed config object)
+     metrics:
+     - type: accuracy
+       value: 0.9617
+       name: Accuracy
+     - type: f1
+       value: 0.6470
+       name: F1 (Macro) # Averaging specified in name
+     - type: f1
+       value: 0.9597
+       name: F1 (Weighted) # Averaging specified in name
+     - type: loss
+       value: 0.1687
+       name: Loss
  ---

+ # Model Card: XLM-RoBERTa-Large Financial Filing Classifier
+
+ ## Model Details
+
+ * **Model Name:** `xlm-roberta-large-fin-filing-classification` (Example - Replace with your chosen Hub repo name)
+ * **Description:** This model is a fine-tuned version of `FacebookAI/xlm-roberta-large` designed for multi-class text classification of financial filing documents. It classifies input text (expected in markdown format) into one of 37 predefined filing type categories.
+ * **Base Model:** [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large)
+ * **Developed by:** [Your Name/Organization - e.g., silashundhausen]
+ * **Model Version:** 1.0 (Example)
+ * **Fine-tuning Framework:** Hugging Face AutoTrain
+
+ ## Intended Use
+
+ * **Primary Use:** To automatically classify financial filing documents based on their textual content into one of 37 categories (e.g., Annual Report, Quarterly Report, Directors' Dealings, etc.).
+ * **Primary Users:** Financial analysts, data providers, regulatory compliance teams, researchers.
+ * **Out-of-Scope Uses:** This model is not designed for sentiment analysis, named entity recognition, or classification tasks outside the defined 37 financial filing types. Performance on filing types significantly different from those in the training data is not guaranteed.
+
+ ## Training Data
+
+ * **Dataset:** The model was fine-tuned on a combined dataset of approximately 14,233 financial filing documents.
+ * **Sources:**
+   * ~9,700 documents custom-labeled via Labelbox, likely originating from European companies (potentially multilingual).
+   * ~4,500 documents sourced from the US EDGAR database (English).
+ * **Preprocessing:** Document text was converted to Markdown format before training. AutoTrain handled the train/validation split (typically 80/20 or 90/10).
+ * **Labels:** The dataset covers 37 distinct filing type classifications. Due to the data sources, there is an imbalance, with some filing types being much more frequent than others.
+
+ ## Training Procedure
+
+ * **Framework:** Hugging Face AutoTrain UI running within a Hugging Face Space.
+ * **Hardware:** Nvidia T4 GPU (small configuration).
+ * **Base Model:** `FacebookAI/xlm-roberta-large`
+ * **Key Hyperparameters (from AutoTrain):**
+   * Epochs: 3
+   * Batch Size: 8
+   * Learning Rate: 5e-5
+   * Max Sequence Length: 512
+   * Optimizer: AdamW
+   * Scheduler: Linear warmup
+   * Mixed Precision: fp16
+
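The hyperparameters above imply a rough optimizer step count. This is a back-of-envelope sketch only: the 80/20 split is an assumption (AutoTrain may have used 90/10), and the real step count depends on the exact split.

```python
import math

total_docs = 14_233                         # combined dataset size from the card
train_docs = int(total_docs * 0.8)          # assumed 80/20 split
steps_per_epoch = math.ceil(train_docs / 8) # batch size 8
total_steps = steps_per_epoch * 3           # 3 epochs
print(train_docs, steps_per_epoch, total_steps)  # → 11386 1424 4272
```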
+ ## Evaluation Results
+
+ The following metrics were reported by AutoTrain based on its internal validation split:
+
+ * **Loss:** 0.1687
+ * **Accuracy / F1 Micro:** 0.9617 (96.2%)
+ * **F1 Weighted:** 0.9597 (96.0%)
+ * **F1 Macro:** 0.6470 (64.7%)
+ * *(Precision/Recall scores show a similar pattern)*
+
+ **Interpretation:**
+
+ The model achieves very high overall accuracy and weighted F1 score, indicating excellent performance on the most common filing types within the dataset. However, the significantly lower **Macro F1 score (64.7%)** reveals a key limitation: the model struggles considerably with **less frequent (minority) filing types**. The high overall accuracy is largely driven by correctly classifying the majority classes. Performance across *all* 37 classes is uneven due to the inherent class imbalance in the training data.
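The macro/weighted gap described above is a generic property of imbalanced evaluation. A self-contained toy sketch (hypothetical counts, not this model's validation data) reproduces the pattern:

```python
# Toy illustration with made-up counts: one dominant class the classifier
# gets right, one rare class it mostly misses.
y_true = ["common"] * 90 + ["rare"] * 10
y_pred = ["common"] * 98 + ["rare"] * 2  # 8 of the 10 rare docs mislabeled

def f1(label):
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    pred = sum(p == label for p in y_pred)
    true = sum(t == label for t in y_true)
    prec = tp / pred if pred else 0.0
    rec = tp / true if true else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

labels = ["common", "rare"]
macro_f1 = sum(f1(l) for l in labels) / len(labels)
weighted_f1 = sum(f1(l) * sum(t == l for t in y_true) for l in labels) / len(y_true)

print(f"macro F1:    {macro_f1:.3f}")     # dragged down by the rare class
print(f"weighted F1: {weighted_f1:.3f}")  # dominated by the common class
```

Macro F1 averages the per-class scores equally, so a single badly-handled rare class pulls it far below the weighted score, exactly the 0.65-vs-0.96 split seen in this card's metrics.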
 
+ ## Limitations and Bias
+
+ * **Performance on Rare Classes:** As highlighted by the evaluation metrics, the model's ability to correctly identify infrequent filing types is significantly lower than for common types. Users should be cautious when relying on predictions for rare categories and consider using the confidence scores.
+ * **Data Source Bias:** The training data primarily comes from European and US sources. The model's performance on filings from other geographical regions or those written in languages not well-represented by XLM-RoBERTa or the training data is unknown and likely lower.
+ * **Markdown Formatting:** The model expects input text in Markdown format, similar to the training data. Performance may degrade on plain text or other formats.
+ * **Out-of-Distribution Data:** The model can only classify documents into the 37 types it was trained on. It cannot identify entirely new or unforeseen filing types.
+ * **Ambiguity:** Some filings may be genuinely ambiguous or borderline between categories, potentially leading to lower confidence predictions or misclassifications.
+ ## How to Use
+
+ You can use this model via the Hugging Face `transformers` library:
+
+ ```python
+ from transformers import pipeline
+
+ # Load the classifier pipeline (replace with your actual model repo ID)
+ model_repo_id = "silashundhausen/filing-classification-xlmr"  # Example ID
+ classifier = pipeline("text-classification", model=model_repo_id)
+
+ # Example usage
+ filing_text = """
+ ## ACME Corp Q4 Results
+
+ ACME Corporation today announced financial results for its fourth quarter ended December 31...
+ (Insert markdown filing text here)
+ """
+
+ # Get top predictions with scores (confidence)
+ predictions = classifier(filing_text, top_k=5)
+ print(predictions)
+ # Expected output format:
+ # [{'label': 'Quarterly Report', 'score': 0.98}, {'label': 'Earnings Release', 'score': 0.01}, ...]
+
+ # --- To get probabilities for all classes ---
+ # from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ # import torch
+ #
+ # tokenizer = AutoTokenizer.from_pretrained(model_repo_id)
+ # model = AutoModelForSequenceClassification.from_pretrained(model_repo_id)
+ # inputs = tokenizer(filing_text, return_tensors="pt", truncation=True, padding=True, max_length=512)
+ # with torch.no_grad():
+ #     logits = model(**inputs).logits
+ # probabilities = torch.softmax(logits, dim=-1)[0]  # Probabilities for the first (only) input
+ # results = [{"label": model.config.id2label[i], "score": prob.item()} for i, prob in enumerate(probabilities)]
+ # results.sort(key=lambda x: x["score"], reverse=True)
+ # print(results)
+ ```
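Filings are often far longer than the 512-token limit noted above, and anything past the truncation point is invisible to the model. One possible workaround (an assumption of this sketch, not part of the original training setup) is to classify overlapping chunks and aggregate the per-label scores; the chunking step might look like:

```python
def chunk_words(text: str, chunk_size: int = 350, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-level chunks.

    chunk_size is in words, chosen conservatively so that each chunk
    stays under the model's 512-subword limit after tokenization.
    """
    words = text.split()
    if len(words) <= chunk_size:
        return [text]
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words) - overlap, step)]

# A 1,000-word document yields four overlapping chunks; each chunk would
# then be passed to `classifier(...)` and the scores averaged per label.
chunks = chunk_words(" ".join(f"w{i}" for i in range(1000)))
print(len(chunks))  # → 4
```

Averaging (or max-pooling) chunk scores is a heuristic; for filings whose type is signalled mostly in the opening pages, plain truncation may work just as well.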
 
+ ## Citation
+
+ ```bibtex
+ @misc{your_model_citation_tag,  % Consider creating one
+   author = {[Your Name/Organization]},
+   title = {XLM-RoBERTa-Large Financial Filing Classifier},
+   year = {2025},
+   publisher = {Hugging Face},
+   journal = {Hugging Face Model Hub},
+   howpublished = {\url{https://huggingface.co/[your-username]/[your-repo-name]}},  % Replace URL
+ }
+ ```