silashundhausen committed on
Commit 9e1cc31 · verified · 1 Parent(s): 82e8cc8

Update README.md

Files changed (1): README.md +135 -17

README.md CHANGED
@@ -1,37 +1,155 @@
-
  ---
  library_name: transformers
  tags:
- - autotrain
  - text-classification
  base_model: FacebookAI/xlm-roberta-large
  widget:
- - text: "I love AutoTrain"
  ---

- # Model Trained Using AutoTrain

- - Problem type: Text Classification

- ## Validation Metrics
- loss: 0.16869813203811646
- f1_macro: 0.6470233113292341
- f1_micro: 0.9617140850017563
- f1_weighted: 0.9597252404005653
- precision_macro: 0.6657138827178418
- precision_micro: 0.9617140850017563
- precision_weighted: 0.9600327052750102
- recall_macro: 0.6540179851686874
- recall_micro: 0.9617140850017563
- recall_weighted: 0.9617140850017563
- accuracy: 0.9617140850017563
  ---
+ # Model Card generated based on AutoTrain run
+ # Date: 2025-04-05 (Please update with actual date)
+ language:
+ - en # Primarily English from EDGAR
+ - multilingual # Corrected special value
  library_name: transformers
+ license: apache-2.0 # Or appropriate license if you choose one
  tags:
  - text-classification
+ - financial-filings
+ - xlm-roberta
+ - autotrain
+ pipeline_tag: text-classification
  base_model: FacebookAI/xlm-roberta-large
  widget:
+ - text: "ACME Corp today announced its results for the fourth quarter..."
+   example_title: "Example Filing Snippet"
+ datasets:
+ - custom # Combined Labelbox and EDGAR data
+ model-index:
+ - name: xlm-roberta-large-fin-filing-classification # Example Name
+   results:
+   - task:
+       type: text-classification
+       name: Text Classification
+     dataset:
+       type: custom
+       name: Combined Financial Filings (Labelbox + EDGAR)
+       split: validation
+     # Corrected metrics format (array of objects, removed config object)
+     metrics:
+     - type: accuracy
+       value: 0.9617
+       name: Accuracy
+     - type: f1
+       value: 0.6470
+       name: F1 (Macro) # Averaging specified in name
+     - type: f1
+       value: 0.9597
+       name: F1 (Weighted) # Averaging specified in name
+     - type: loss
+       value: 0.1687
+       name: Loss
  ---

+ # Model Card: XLM-RoBERTa-Large Financial Filing Classifier
+
+ ## Model Details
+
+ * **Model Name:** `xlm-roberta-large-fin-filing-classification` (Example - Replace with your chosen Hub repo name)
+ * **Description:** This model is a fine-tuned version of `FacebookAI/xlm-roberta-large` designed for multi-class text classification of financial filing documents. It classifies input text (expected in markdown format) into one of 37 predefined filing type categories.
+ * **Base Model:** [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large)
+ * **Developed by:** [Your Name/Organization - e.g., silashundhausen]
+ * **Model Version:** 1.0 (Example)
+ * **Fine-tuning Framework:** Hugging Face AutoTrain
+
+ ## Intended Use
+
+ * **Primary Use:** To automatically classify financial filing documents based on their textual content into one of 37 categories (e.g., Annual Report, Quarterly Report, Directors' Dealings, etc.).
+ * **Primary Users:** Financial analysts, data providers, regulatory compliance teams, researchers.
+ * **Out-of-Scope Uses:** This model is not designed for sentiment analysis, named entity recognition, or classification tasks outside the defined 37 financial filing types. Performance on filing types significantly different from those in the training data is not guaranteed.
+
+ ## Training Data
+
+ * **Dataset:** The model was fine-tuned on a combined dataset of approximately 14,233 financial filing documents.
+ * **Sources:**
+   * ~9,700 documents custom-labeled via Labelbox, likely originating from European companies (potentially multilingual).
+   * ~4,500 documents sourced from the US EDGAR database (English).
+ * **Preprocessing:** Document text was converted to Markdown format before training. AutoTrain handled the train/validation split (typically 80/20 or 90/10).
+ * **Labels:** The dataset covers 37 distinct filing type classifications. Due to the data sources, there is an imbalance, with some filing types being much more frequent than others.
+
+ ## Training Procedure
+
+ * **Framework:** Hugging Face AutoTrain UI running within a Hugging Face Space.
+ * **Hardware:** Nvidia T4 GPU (small configuration).
+ * **Base Model:** `FacebookAI/xlm-roberta-large`
+ * **Key Hyperparameters (from AutoTrain):**
+   * Epochs: 3
+   * Batch Size: 8
+   * Learning Rate: 5e-5
+   * Max Sequence Length: 512
+   * Optimizer: AdamW
+   * Scheduler: Linear warmup
+   * Mixed Precision: fp16
+
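The hyperparameters above imply a rough optimizer step count. This is a back-of-envelope sketch only: the 80/20 split is an assumption (AutoTrain may have used 90/10), and the real step count depends on the exact split.

```python
import math

total_docs = 14_233                         # combined dataset size from the card
train_docs = int(total_docs * 0.8)          # assumed 80/20 split
steps_per_epoch = math.ceil(train_docs / 8) # batch size 8
total_steps = steps_per_epoch * 3           # 3 epochs
print(train_docs, steps_per_epoch, total_steps)  # → 11386 1424 4272
```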
+ ## Evaluation Results
+
+ The following metrics were reported by AutoTrain based on its internal validation split:
+
+ * **Loss:** 0.1687
+ * **Accuracy / F1 Micro:** 0.9617 (96.2%)
+ * **F1 Weighted:** 0.9597 (96.0%)
+ * **F1 Macro:** 0.6470 (64.7%)
+ * *(Precision/Recall scores show a similar pattern)*
+
+ **Interpretation:**
+
+ The model achieves very high overall accuracy and weighted F1 score, indicating excellent performance on the most common filing types within the dataset. However, the significantly lower **Macro F1 score (64.7%)** reveals a key limitation: the model struggles considerably with **less frequent (minority) filing types**. The high overall accuracy is largely driven by correctly classifying the majority classes. Performance across *all* 37 classes is uneven due to the inherent class imbalance in the training data.
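The macro/weighted gap described above is a generic property of imbalanced evaluation. A self-contained toy sketch (hypothetical counts, not this model's validation data) reproduces the pattern:

```python
# Toy illustration with made-up counts: one dominant class the classifier
# gets right, one rare class it mostly misses.
y_true = ["common"] * 90 + ["rare"] * 10
y_pred = ["common"] * 98 + ["rare"] * 2  # 8 of the 10 rare docs mislabeled

def f1(label):
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    pred = sum(p == label for p in y_pred)
    true = sum(t == label for t in y_true)
    prec = tp / pred if pred else 0.0
    rec = tp / true if true else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

labels = ["common", "rare"]
macro_f1 = sum(f1(l) for l in labels) / len(labels)
weighted_f1 = sum(f1(l) * sum(t == l for t in y_true) for l in labels) / len(y_true)

print(f"macro F1:    {macro_f1:.3f}")     # dragged down by the rare class
print(f"weighted F1: {weighted_f1:.3f}")  # dominated by the common class
```

Macro F1 averages the per-class scores equally, so a single badly-handled rare class pulls it far below the weighted score, exactly the 0.65-vs-0.96 split seen in this card's metrics.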
 
+ ## Limitations and Bias
+
+ * **Performance on Rare Classes:** As highlighted by the evaluation metrics, the model's ability to correctly identify infrequent filing types is significantly lower than for common types. Users should be cautious when relying on predictions for rare categories and consider using the confidence scores.
+ * **Data Source Bias:** The training data primarily comes from European and US sources. The model's performance on filings from other geographical regions or those written in languages not well-represented by XLM-RoBERTa or the training data is unknown and likely lower.
+ * **Markdown Formatting:** The model expects input text in Markdown format, similar to the training data. Performance may degrade on plain text or other formats.
+ * **Out-of-Distribution Data:** The model can only classify documents into the 37 types it was trained on. It cannot identify entirely new or unforeseen filing types.
+ * **Ambiguity:** Some filings may be genuinely ambiguous or borderline between categories, potentially leading to lower confidence predictions or misclassifications.
+ ## How to Use
+
+ You can use this model via the Hugging Face `transformers` library:
+
+ ```python
+ from transformers import pipeline
+
+ # Load the classifier pipeline (replace with your actual model repo ID)
+ model_repo_id = "silashundhausen/filing-classification-xlmr"  # Example ID
+ classifier = pipeline("text-classification", model=model_repo_id)
+
+ # Example usage
+ filing_text = """
+ ## ACME Corp Q4 Results
+
+ ACME Corporation today announced financial results for its fourth quarter ended December 31...
+ (Insert markdown filing text here)
+ """
+
+ # Get top predictions with scores (confidence)
+ predictions = classifier(filing_text, top_k=5)
+ print(predictions)
+ # Expected output format:
+ # [{'label': 'Quarterly Report', 'score': 0.98}, {'label': 'Earnings Release', 'score': 0.01}, ...]
+
+ # --- To get probabilities for all classes ---
+ # from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ # import torch
+ #
+ # tokenizer = AutoTokenizer.from_pretrained(model_repo_id)
+ # model = AutoModelForSequenceClassification.from_pretrained(model_repo_id)
+ # inputs = tokenizer(filing_text, return_tensors="pt", truncation=True, padding=True, max_length=512)
+ # with torch.no_grad():
+ #     logits = model(**inputs).logits
+ # probabilities = torch.softmax(logits, dim=-1)[0]  # Probabilities for the first (only) input
+ # results = [{"label": model.config.id2label[i], "score": prob.item()} for i, prob in enumerate(probabilities)]
+ # results.sort(key=lambda x: x["score"], reverse=True)
+ # print(results)
+ ```
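Filings are often far longer than the 512-token limit noted above, and anything past the truncation point is invisible to the model. One possible workaround (an assumption of this sketch, not part of the original training setup) is to classify overlapping chunks and aggregate the per-label scores; the chunking step might look like:

```python
def chunk_words(text: str, chunk_size: int = 350, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-level chunks.

    chunk_size is in words, chosen conservatively so that each chunk
    stays under the model's 512-subword limit after tokenization.
    """
    words = text.split()
    if len(words) <= chunk_size:
        return [text]
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words) - overlap, step)]

# A 1,000-word document yields four overlapping chunks; each chunk would
# then be passed to `classifier(...)` and the scores averaged per label.
chunks = chunk_words(" ".join(f"w{i}" for i in range(1000)))
print(len(chunks))  # → 4
```

Averaging (or max-pooling) chunk scores is a heuristic; for filings whose type is signalled mostly in the opening pages, plain truncation may work just as well.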
 
+ ## Citation
+
+ ```bibtex
+ @misc{your_model_citation_tag,  % Consider creating one
+   author = {[Your Name/Organization]},
+   title = {XLM-RoBERTa-Large Financial Filing Classifier},
+   year = {2025},
+   publisher = {Hugging Face},
+   journal = {Hugging Face Model Hub},
+   howpublished = {\url{https://huggingface.co/[your-username]/[your-repo-name]}},  % Replace URL
+ }
+ ```