Ranjit0034 committed
Commit e8fe888 · verified · 1 Parent(s): 082dc5c

Upload docs/model_cards/finee-dataset-README.md with huggingface_hub

docs/model_cards/finee-dataset-README.md ADDED
@@ -0,0 +1,210 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ - hi
+ - ta
+ - te
+ - bn
+ - kn
+ size_categories:
+ - 100K<n<1M
+ task_categories:
+ - token-classification
+ - text-generation
+ tags:
+ - finance
+ - banking
+ - entity-extraction
+ - indian-banking
+ - sms
+ - synthetic
+ pretty_name: FinEE Dataset - Indian Financial Entity Extraction
+ ---
+
+ # FinEE Dataset
+
+ <p align="center">
+   <img src="https://img.shields.io/badge/Samples-152K%2B-blue" alt="Samples">
+   <img src="https://img.shields.io/badge/Languages-6-orange" alt="Languages">
+   <img src="https://img.shields.io/badge/Banks-25%2B-green" alt="Banks">
+ </p>
+
+ ## Dataset Description
+
+ A dataset for training financial entity extraction models on Indian banking messages. It contains 152,519 samples covering SMS, email, and transaction notifications from major Indian banks.
+
+ ### Languages
+
+ - English (en) - 86%
+ - Hindi (hi) - 3%
+ - Tamil (ta) - 3%
+ - Telugu (te) - 3%
+ - Bengali (bn) - 3%
+ - Kannada (kn) - 2%
+
+ ### Supported Transaction Types
+
+ - UPI payments (PhonePe, GPay, Paytm)
+ - NEFT/IMPS/RTGS transfers
+ - Credit card transactions
+ - Debit card transactions
+ - EMI payments
+ - Refunds and reversals
+ - Salary credits
+ - Bill payments
+
+ ### Covered Banks
+
+ HDFC, ICICI, SBI, Axis, Kotak, PNB, BOB, Canara, Union, IDBI, IndusInd, Yes Bank, Federal, South Indian, Karur Vysya, and more.
+
+ ### Covered Merchants
+
+ - Food: Swiggy, Zomato, Zepto, BigBasket
+ - Shopping: Amazon, Flipkart, Myntra, Meesho
+ - Travel: Uber, Ola, IRCTC, MakeMyTrip
+ - Investment: Zerodha, Groww, Upstox, Angel One
+ - Bills: Airtel, Jio, electricity, gas
+ - Entertainment: Netflix, BookMyShow, Hotstar
+
+ ## Dataset Structure
+
+ ### Data Fields
+
+ ```json
+ {
+   "input": "HDFC Bank: Rs.2,500 debited from A/c XX1234...",
+   "output": {
+     "amount": 2500.0,
+     "type": "debit",
+     "account": "1234",
+     "bank": "HDFC",
+     "merchant": "Swiggy",
+     "category": "food",
+     "is_p2m": true
+   }
+ }
+ ```
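
A record in this shape can be read into a typed structure. The following is a minimal sketch, assuming each record arrives as one JSON line with the `input` and `output` keys shown above. The `FineeRecord` class and the `parse_record` helper are hypothetical names introduced for illustration; they are not part of the dataset or the `finee` package.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class FineeRecord:
    """Hypothetical container for one sample (not part of the dataset tooling)."""
    text: str
    entities: Dict[str, Any]


def parse_record(line: str) -> FineeRecord:
    """Parse one JSON line that carries the `input` and `output` keys."""
    raw = json.loads(line)
    output = raw["output"]
    # Assumption: `output` may be stored as a nested object or as a JSON string.
    if isinstance(output, str):
        output = json.loads(output)
    return FineeRecord(text=raw["input"], entities=output)


sample = json.dumps({
    "input": "HDFC Bank: Rs.2,500 debited from A/c XX1234...",
    "output": {"amount": 2500.0, "type": "debit", "account": "1234",
               "bank": "HDFC", "merchant": "Swiggy", "category": "food",
               "is_p2m": True},
})
record = parse_record(sample)
print(record.entities["amount"], record.entities["merchant"])
```

Keeping the entities as a plain dictionary avoids committing to which fields are present, since the example record carries only a subset of the output schema.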
+
+ ### Instruction Format (ChatML)
+
+ ```json
+ {
+   "messages": [
+     {"role": "system", "content": "You are a financial entity extraction assistant..."},
+     {"role": "user", "content": "Extract financial entities from: ..."},
+     {"role": "assistant", "content": "{\"amount\": 2500.0, ...}"}
+   ]
+ }
+ ```
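
The relationship between the two formats can be sketched as a small conversion function. This is an illustration under assumptions, not the script used to build the dataset: the card elides the full prompts with `...`, so `SYSTEM_PROMPT` and `USER_PREFIX` below are placeholders, and `to_chatml` is a hypothetical helper name.

```python
import json
from typing import Any, Dict, List

# Assumption: the real prompts are elided in the card, so these are placeholders.
SYSTEM_PROMPT = "You are a financial entity extraction assistant..."
USER_PREFIX = "Extract financial entities from: "


def to_chatml(record: Dict[str, Any]) -> Dict[str, List[Dict[str, str]]]:
    """Wrap one {"input", "output"} record in the three-message layout."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_PREFIX + record["input"]},
            # The assistant turn carries the entities serialized as a JSON string.
            {"role": "assistant",
             "content": json.dumps(record["output"], ensure_ascii=False)},
        ]
    }


example = to_chatml({
    "input": "HDFC Bank: Rs.2,500 debited from A/c XX1234...",
    "output": {"amount": 2500.0, "type": "debit"},
})
print(example["messages"][2]["content"])
```

`ensure_ascii=False` keeps Hindi, Tamil, Telugu, Bengali, and Kannada text readable in the serialized answer instead of escaping it.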
+
+ ### Splits
+
+ | Split | Samples | Description |
+ |-------|---------|-------------|
+ | train | 137,267 | Training data |
+ | valid | 7,625 | Validation data |
+ | test | 7,627 | Test data (held out) |
+
+ ## Data Sources
+
+ 1. **Real Data** (2,419 samples)
+    - Anonymized ICICI Bank SMS messages
+    - Manually verified labels
+
+ 2. **Synthetic Data** (100,000 samples)
+    - Grammar-based generation
+    - Covers all bank templates
+    - Realistic amount distributions
+
+ 3. **Multilingual Synthetic** (50,100 samples)
+    - Hindi, Tamil, Telugu, Bengali, Kannada
+    - Markov chain for realistic flow
+    - Edge case oversampling
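
The split sizes and the source counts describe the same samples, so they should add up to the same total. The short check below uses only the numbers stated in this card: both sum to 152,519, with roughly 90% of samples in `train` and about 5% each in `valid` and `test`.

```python
# Counts copied from the Splits table and the Data Sources list.
splits = {"train": 137_267, "valid": 7_625, "test": 7_627}
sources = {"real": 2_419, "synthetic": 100_000, "multilingual_synthetic": 50_100}

total = sum(splits.values())
# Both breakdowns should describe the same set of samples.
assert total == sum(sources.values())

print(f"total samples: {total}")
for name, count in splits.items():
    print(f"split {name}: {count / total:.1%}")
for name, count in sources.items():
    print(f"source {name}: {count / total:.1%}")
```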
+
+ ## Usage
+
+ ### Load with Datasets
+
+ ```python
+ from datasets import load_dataset
+
+ dataset = load_dataset("Ranjit0034/finee-dataset")
+
+ # Access splits
+ train = dataset["train"]
+ valid = dataset["valid"]
+ test = dataset["test"]
+
+ # Iterate over examples: each has the raw message and its labeled entities
+ for example in train:
+     print(example["input"])
+     print(example["output"])
+ ```
+
+ ### Load for Fine-tuning
+
+ ```python
+ from datasets import load_dataset
+
+ # Load instruction format
+ dataset = load_dataset("Ranjit0034/finee-dataset", data_files={
+     "train": "instruction/train.jsonl",
+     "valid": "instruction/valid.jsonl"
+ })
+ ```
+
+ ## Output Schema
+
+ | Field | Type | Description |
+ |-------|------|-------------|
+ | amount | float | Transaction amount in INR |
+ | type | string | "debit" or "credit" |
+ | account | string | Last 4 digits of account |
+ | bank | string | Bank name |
+ | date | string | Transaction date (YYYY-MM-DD) |
+ | time | string | Transaction time (HH:MM) |
+ | reference | string | UPI/NEFT reference number |
+ | merchant | string | Merchant name (P2M) |
+ | beneficiary | string | Person name (P2P) |
+ | vpa | string | UPI VPA address |
+ | category | string | Transaction category |
+ | is_p2m | boolean | true if merchant, false if P2P |
+ | balance | float | Balance after transaction |
+ | status | string | success/failed/pending |
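
The table can be turned into a lightweight check for model predictions or new labels. The sketch below is one possible reading of the schema, not an official validator: it treats every field as optional (the example record under Data Fields omits several), accepts `null` for absent entities as an assumption, and `validate_output` is a hypothetical helper name.

```python
from typing import Any, Dict, List

# Field names and types copied from the Output Schema table.
SCHEMA = {
    "amount": float, "type": str, "account": str, "bank": str,
    "date": str, "time": str, "reference": str, "merchant": str,
    "beneficiary": str, "vpa": str, "category": str, "is_p2m": bool,
    "balance": float, "status": str,
}


def validate_output(output: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the dict fits the schema."""
    problems = []
    for key, value in output.items():
        expected = SCHEMA.get(key)
        if expected is None:
            problems.append(f"unknown field: {key}")
        elif value is None:
            continue  # assumption: absent entities may be stored as null
        elif expected is float:
            # Accept ints for float fields, but reject booleans.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"{key}: expected float, got {type(value).__name__}")
        elif not isinstance(value, expected):
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}")
    if output.get("type") not in (None, "debit", "credit"):
        problems.append("type: expected 'debit' or 'credit'")
    return problems


print(validate_output({"amount": 2500.0, "type": "debit", "is_p2m": True}))
print(validate_output({"amount": "2500", "type": "refund"}))
```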
+
+ ## Categories
+
+ - `food` - Restaurants, delivery
+ - `grocery` - Supermarkets
+ - `shopping` - E-commerce, retail
+ - `transport` - Cab, fuel
+ - `travel` - Flights, hotels
+ - `bills` - Utilities, recharge
+ - `entertainment` - Movies, streaming
+ - `healthcare` - Medical, pharmacy
+ - `investment` - Stocks, mutual funds
+ - `transfer` - P2P transfers
+ - `salary` - Income
+ - `emi` - Loan payments
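
When inspecting a split, it can help to tally how often each category appears and to flag labels outside the list above. This is a small sketch over plain dictionaries shaped like the Data Fields example; `category_counts` and the `<unknown>` and `<missing>` buckets are illustrative names, not dataset conventions.

```python
from collections import Counter
from typing import Any, Dict, Iterable

# Category labels copied from the list above.
CATEGORIES = frozenset({
    "food", "grocery", "shopping", "transport", "travel", "bills",
    "entertainment", "healthcare", "investment", "transfer", "salary", "emi",
})


def category_counts(records: Iterable[Dict[str, Any]]) -> Counter:
    """Count categories, grouping unlisted labels and absent ones separately."""
    counts: Counter = Counter()
    for record in records:
        label = record["output"].get("category")
        if label is None:
            counts["<missing>"] += 1
        elif label in CATEGORIES:
            counts[label] += 1
        else:
            counts["<unknown>"] += 1
    return counts


records = [
    {"input": "...", "output": {"category": "food"}},
    {"input": "...", "output": {"category": "food"}},
    {"input": "...", "output": {"category": "transfer"}},
    {"input": "...", "output": {"category": "crypto"}},  # not in the list
    {"input": "...", "output": {"amount": 10.0}},        # no category at all
]
counts = category_counts(records)
print(counts.most_common())
```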
+
+ ## Citation
+
+ ```bibtex
+ @dataset{finee_dataset,
+   title={FinEE Dataset: Indian Financial Entity Extraction},
+   author={Ranjit Behera},
+   year={2026},
+   url={https://huggingface.co/datasets/Ranjit0034/finee-dataset}
+ }
+ ```
+
+ ## License
+
+ Apache 2.0
+
+ ## Related
+
+ - 🤖 [FinEE Llama 8B](https://huggingface.co/Ranjit0034/finee-llama-8b) - Fine-tuned model
+ - 📦 [FinEE Package](https://pypi.org/project/finee/) - Python library
+ - 💻 [GitHub](https://github.com/Ranjitbehera0034/Finance-Entity-Extractor)