---
license: apache-2.0
---

# Finance Document Classifier

This repository contains a classifier for determining whether a document is finance-related.

## Model Overview

- A regression-based classifier with two classes: financial (1) and non-financial (0).
- Uses `Snowflake/snowflake-arctic-embed-m` as the embedding model with a classification head. The model is trained with a regression objective.
- We used `Qwen/Qwen2.5-72B-Instruct` to annotate 110k CulturaX documents with a score between 0 and 5. For training, scores in [0, 2] are converted to 0 and scores in [3, 5] to 1. The model was then trained on 108k documents and tested on the remaining 2k.

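The label mapping described above can be sketched as follows. This is a minimal illustration, assuming the annotation scores are integers from 0 to 5; the function name `binarize_score` is hypothetical and not part of the released code.

```python
def binarize_score(score: int) -> int:
    """Map a 0-5 annotation score to a binary label (hypothetical helper)."""
    if not 0 <= score <= 5:
        raise ValueError(f"score must be between 0 and 5, got {score}")
    # Scores in [0, 2] become 0 (non-financial); scores in [3, 5] become 1 (financial).
    return 1 if score >= 3 else 0


labels = [binarize_score(s) for s in range(6)]
print(labels)  # [0, 0, 0, 1, 1, 1]
```
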
## How to Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("DragonLLM/ClassiFin")
model = AutoModelForSequenceClassification.from_pretrained("DragonLLM/ClassiFin")

# Example text
text = "This is a test sentence."

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)

# Get model outputs
outputs = model(**inputs)
logits = outputs.logits.float().detach().cpu().numpy()
logits = logits.ravel().tolist()

# Convert logits to class labels: clamp each value to [0, 1], then round
int_scores = [int(round(max(0, min(logit, 1)))) for logit in logits]  # 0 for non-financial, 1 for financial
```

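When scoring many documents, the clamp-and-round step can be factored into a small helper. The sketch below is only an illustration: `scores_to_labels` and its `threshold` parameter are hypothetical names, not part of the released code, and the example scores are made up. With the default threshold of 0.5 it gives the same labels as the rounding in the snippet above, since Python's `round` maps exactly 0.5 to 0.

```python
from typing import List


def scores_to_labels(scores: List[float], threshold: float = 0.5) -> List[int]:
    """Turn raw regression scores into 0/1 labels (hypothetical helper)."""
    # A score strictly above the threshold is labelled financial (1).
    return [1 if score > threshold else 0 for score in scores]


example_scores = [-0.3, 0.5, 0.51, 1.7]  # made-up values for illustration
print(scores_to_labels(example_scores))       # [0, 0, 1, 1]
print(scores_to_labels(example_scores, 0.6))  # [0, 0, 0, 1]
```

Moving the threshold trades precision against recall, so any value other than 0.5 should be checked against labelled data before use.
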
## Model Performance

- Evaluated on the test set of 2000 samples.

```
              precision    recall  f1-score   support

           0       0.95      0.99      0.97      1750
           1       0.92      0.62      0.74       250

    accuracy                           0.95      2000
   macro avg       0.93      0.81      0.85      2000
weighted avg       0.94      0.95      0.94      2000
```

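As a sanity check on the report, the F1 scores and the averaged rows can be recomputed from the per-class precision, recall, and support. This is a sketch that uses the rounded values printed in the report, so recomputed averages may differ from it in the last digit; the variable and function names are illustrative.

```python
# Per-class rows copied from the report: (precision, recall, support).
rows = {0: (0.95, 0.99, 1750), 1: (0.92, 0.62, 250)}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


total = sum(support for _, _, support in rows.values())

for label, (p, r, _) in rows.items():
    print(f"class {label}: f1 = {f1(p, r):.2f}")

# Macro average: unweighted mean over classes.
macro_recall = sum(r for _, r, _ in rows.values()) / len(rows)
# Weighted average: mean over classes, weighted by support.
weighted_recall = sum(r * s for _, r, s in rows.values()) / total

print(f"macro recall = {macro_recall:.3f}")
print(f"weighted recall = {weighted_recall:.3f}")
```
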
## Citation

If you use this model in your research or applications, please cite this repository.

```
@misc{ClassiFin,
  title={ClassiFin: Finance Document Classifier},
  author={Liu, Jingshu and Qader, Raheel and Caillaut, Gaëtan and Nakhle, Mariam and Barthelemy, Jean-Gabriel and Sadoune, Arezki and Foly, Sabine},
  url={https://huggingface.co/DragonLLM/ClassiFin},
  year={2025}
}
```