---
language: en
license: apache-2.0
library_name: transformers
tags:
- finance
- nlp
- sentiment-analysis
- token-classification
- ner
- transformers
pipeline_tag: text-classification
task_categories:
- text-classification
- token-classification
---

# πŸ’Ή Finance NLP Toolkit

**Finance NLP Toolkit** is a practical starter pack for analyzing financial text with Transformers.  
It supports two core tasks:

1) **Sentiment Analysis** β€” positive / neutral / negative market tone  
2) **Named Entity Recognition (NER)** β€” companies, tickers, money, dates, etc.

This repository includes:
- Ready-to-run **inference snippets**
- **Training scripts** for fine-tuning on your datasets
- Label mapping examples and utilities

> **Note:** Initial release ships training + inference scaffolding.  
> Plug in your dataset and fine-tune, or point to an existing finance model.
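
The label mapping files themselves are not shown in this snapshot. For a 3-class sentiment head, a typical mapping (names assumed here, matching the `id2label`/`label2id` fields of a `transformers` `config.json`) might look like:

```python
# Hypothetical 3-class sentiment label mapping, as commonly stored
# in a model's config.json via id2label / label2id.
id2label = {0: "NEGATIVE", 1: "NEUTRAL", 2: "POSITIVE"}
label2id = {v: k for k, v in id2label.items()}

print(label2id["POSITIVE"])  # 2
```

Passing these dicts to `AutoModelForSequenceClassification.from_pretrained(..., id2label=id2label, label2id=label2id)` makes pipeline outputs use the readable names.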

---

## πŸš€ Quickstart (inference)

Install dependencies:

```bash
pip install -r requirements.txt
```
Sentiment:

```python
from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="Proooof/Finance-NLP-Toolkit",   # after you push your fine-tuned weights
    tokenizer="Proooof/Finance-NLP-Toolkit",
)
print(sentiment("The company reported record profits and raised guidance."))
```
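
For downstream use (e.g., averaging tone over many headlines), it can help to collapse each prediction into a single signed score. A small hypothetical helper, assuming the label names match the model's config:

```python
# Hypothetical post-processing: map one pipeline prediction to [-1, 1].
# Label names are assumed; adjust to your model's id2label.
label_sign = {"POSITIVE": 1.0, "NEUTRAL": 0.0, "NEGATIVE": -1.0}

def signed_score(pred):
    """Sign the confidence by the predicted polarity."""
    return label_sign[pred["label"].upper()] * pred["score"]

print(signed_score({"label": "POSITIVE", "score": 0.98}))  # 0.98
```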

NER:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tok = AutoTokenizer.from_pretrained("YOUR-USERNAME/Finance-NLP-Toolkit", revision="ner")
ner_model = AutoModelForTokenClassification.from_pretrained("YOUR-USERNAME/Finance-NLP-Toolkit", revision="ner")
ner = pipeline("token-classification", model=ner_model, tokenizer=tok, aggregation_strategy="simple")
print(ner("Apple Inc. reported a $10 billion revenue increase in Q2 2025."))
```

> **Tip:** Use branches to host multiple checkpoints in one repo:
>
> - `main` → sentiment model
> - `ner` → NER model
>
> Push each set of weights to its respective branch.

---

## 🧠 Training

### Sentiment (3-class)

```bash
python training/train_sentiment.py \
  --model_name distilbert-base-uncased \
  --train_csv /path/train.csv \
  --eval_csv /path/valid.csv \
  --text_col text --label_col label \
  --output_dir ./outputs/sentiment \
  --epochs 3 --batch_size 16 --lr 5e-5
```
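
The exact CSV schema is set by the `--text_col text --label_col label` flags above. A hypothetical minimal `train.csv`, built in memory for illustration (integer labels are an assumption; use whatever encoding your label mapping defines):

```python
import csv
import io

# Hypothetical rows matching --text_col text --label_col label
# (0 = negative, 1 = neutral, 2 = positive is an assumed encoding).
rows = [
    {"text": "Shares plunged after the earnings miss.", "label": 0},
    {"text": "The board will meet on Tuesday.", "label": 1},
    {"text": "Revenue beat estimates and guidance was raised.", "label": 2},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["text", "label"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```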

### NER (BIO tags)

```bash
python training/train_ner.py \
  --model_name bert-base-cased \
  --train_json /path/train.jsonl \
  --eval_json /path/valid.jsonl \
  --text_col tokens --label_col ner_tags \
  --labels_file training/labels_ner.json \
  --output_dir ./outputs/ner \
  --epochs 5 --batch_size 8 --lr 3e-5
```
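
The contents of `training/labels_ner.json` are not shown here. A plausible BIO label list, covering the entity types that appear in the expected output below (`ORG`, `MONEY`, `DATE`), might look like:

```json
[
  "O",
  "B-ORG", "I-ORG",
  "B-MONEY", "I-MONEY",
  "B-DATE", "I-DATE"
]
```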


After training, push weights to the repo (e.g., `git push origin main` for sentiment and `git push origin ner` for NER).

πŸ“Š Expected outputs

Sentiment:

[{'label': 'POSITIVE', 'score': 0.98}]


NER:

[
  {'entity_group': 'ORG', 'word': 'Apple Inc.', 'score': 0.99},
  {'entity_group': 'MONEY', 'word': '$10 billion', 'score': 0.99},
  {'entity_group': 'DATE', 'word': 'Q2 2025', 'score': 0.98}
]
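
The entity groups above come from `aggregation_strategy="simple"`, which merges token-level BIO predictions into spans. A minimal pure-Python sketch of that grouping idea (a hypothetical helper, not part of this repo; the real pipeline also averages scores and handles subword tokens):

```python
def group_bio(tokens, tags):
    """Merge token-level BIO tags into (entity_type, text) spans."""
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = [tag[2:], tok]          # start a new entity span
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1] += " " + tok           # continue the current span
        else:
            if current:
                spans.append(current)
            current = None                    # "O" tag ends any open span
    if current:
        spans.append(current)
    return [tuple(s) for s in spans]

tokens = ["Apple", "Inc.", "reported", "a", "$10", "billion", "increase"]
tags   = ["B-ORG", "I-ORG", "O", "O", "B-MONEY", "I-MONEY", "O"]
print(group_bio(tokens, tags))  # [('ORG', 'Apple Inc.'), ('MONEY', '$10 billion')]
```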

---

## ⚠️ Limitations

- English-focused; domain shift may reduce accuracy
- Sarcasm and idioms can confound sentiment
- NER needs domain-specific labels for best performance

πŸ“œ License

Apache-2.0