Ranjit Behera committed on
Commit
810c162
·
1 Parent(s): 2b1ff82

docs: clarify hybrid architecture (Regex default + optional LLM)

Files changed (1)
  1. README.md +74 -90
README.md CHANGED
@@ -21,17 +21,30 @@ pipeline_tag: text-generation
21
  [![PyPI](https://img.shields.io/pypi/v/finee?style=for-the-badge&logo=pypi&logoColor=white)](https://pypi.org/project/finee/)
22
  [![Tests](https://github.com/Ranjitbehera0034/Finance-Entity-Extractor/actions/workflows/tests.yml/badge.svg)](https://github.com/Ranjitbehera0034/Finance-Entity-Extractor/actions/workflows/tests.yml)
23
  [![License](https://img.shields.io/badge/License-MIT-green?style=for-the-badge)](https://opensource.org/licenses/MIT)
 
24
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Ranjitbehera0034/Finance-Entity-Extractor/blob/main/examples/demo.ipynb)
25
 
26
- **Extract structured financial data from Indian banking messages.**
27
  <br>
28
- *94.5% field accuracy. <1ms latency. Zero setup.*
29
 
30
  </div>
31
 
32
  ---
33
 
34
- ## ⚡ Install & Run in 10 Seconds
 
 
 
 
 
 
 
 
 
 
 
 
35
 
36
  ```bash
37
  pip install finee
@@ -47,7 +60,25 @@ print(r.merchant) # "Swiggy"
47
  print(r.category) # "food"
48
  ```
49
 
50
- **No model download. No API keys. Works offline.**
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
51
 
52
  ---
53
 
@@ -65,9 +96,7 @@ Every extraction returns this **guaranteed JSON structure**:
65
  "reference": "534567891234",// string - UPI/NEFT ref
66
  "merchant": "Swiggy", // string - Normalized name
67
  "category": "food", // string - food|shopping|transport|...
68
- "vpa": "swiggy@ybl", // string - Raw VPA
69
- "confidence": 0.95, // float - 0.0 to 1.0
70
- "confidence_level": "HIGH" // "LOW" | "MEDIUM" | "HIGH"
71
  }
72
  ```
73
 
@@ -75,55 +104,46 @@ Every extraction returns this **guaranteed JSON structure**:
75
 
76
  ## 🔬 Verify Accuracy Yourself
77
 
78
- Don't trust "99% accuracy" claims. **Run the benchmark:**
79
-
80
  ```bash
81
- # Clone and test
82
  git clone https://github.com/Ranjitbehera0034/Finance-Entity-Extractor.git
83
  cd Finance-Entity-Extractor
84
  pip install finee
85
-
86
- # Run benchmark
87
  python benchmark.py --all
88
  ```
89
 
90
- **Test on YOUR data:**
91
- ```bash
92
- python benchmark.py --file your_transactions.jsonl
93
- ```
94
-
95
  ---
96
 
97
- ## 💀 Torture Test (Edge Cases)
98
 
99
- Real bank SMS is messy. Here's how FinEE handles the chaos:
 
 
 
 
 
100
 
101
- | Edge Case | Input | Result |
102
- |-----------|-------|--------|
103
- | **Missing spaces** | `Rs.500.00debited from A/c1234` | ✅ amount=500.0 |
104
- | **Weird formatting** | `Rs 2,500/-debited dt:28/12/25` | ✅ amount=2500.0 |
105
- | **Mixed case** | `RS. 1500 DEBITED from ACCT` | ✅ amount=1500.0, type=debit |
106
- | **Unicode symbols** | `₹2,500 debited from •••• 3545` | ✅ amount=2500.0 |
107
- | **Multiple amounts** | `Rs.500 debited. Bal: Rs.15,000` | ✅ amount=500.0 (first) |
108
- | **Truncated SMS** | `Rs.2500 debited from A/c...3545 to swi...` | ✅ amount=2500.0 |
109
- | **Extra noise** | `ALERT! Dear Customer, Rs.500 debited... Ignore if done by you.` | ✅ amount=500.0 |
110
 
111
- **Run torture tests:**
112
- ```bash
113
- python benchmark.py --torture
114
- ```
 
 
 
 
 
115
 
116
  ---
117
 
118
- ## 🏦 Supported Banks
119
 
120
- | Bank | Debit | Credit | UPI | NEFT/IMPS |
121
- |------|:-----:|:------:|:---:|:---------:|
122
- | HDFC | ✅ | ✅ | ✅ | ✅ |
123
- | ICICI | ✅ | ✅ | ✅ | ✅ |
124
- | SBI | ✅ | ✅ | ✅ | ✅ |
125
- | Axis | ✅ | ✅ | ✅ | ✅ |
126
- | Kotak | ✅ | ✅ | ✅ | ✅ |
127
 
128
  ---
129
 
@@ -139,20 +159,18 @@ Input Text
139
  │
140
  ▼
141
  ┌─────────────────────────────────────────────────────────────┐
142
- │ TIER 1: Regex Engine (50+ battle-tested patterns) │
143
- │ Extract: amount, date, reference, account, vpa, type │
144
  └─────────────────────────────────────────────────────────────┘
145
  │
146
  ▼
147
  ┌─────────────────────────────────────────────────────────────┐
148
  │ TIER 2: Rule-Based Mapping (200+ VPA → merchant) │
149
- │ Map: vpa → merchant, merchant → category │
150
  └─────────────────────────────────────────────────────────────┘
151
  │
152
  ▼
153
  ┌─────────────────────────────────────────────────────────────┐
154
- │ TIER 3: LLM (Optional, for edge cases) │
155
- │ Targeted prompts for: merchant, category only │
156
  └─────────────────────────────────────────────────────────────┘
157
  │
158
  ▼
@@ -161,66 +179,32 @@ ExtractionResult (Guaranteed Schema)
161
 
162
  ---
163
 
164
- ## 📊 Benchmark Results
165
-
166
- | Metric | Value |
167
- |--------|-------|
168
- | **Field Accuracy** | 94.5% |
169
- | **Latency (Regex)** | <1ms |
170
- | **Latency (LLM)** | ~50ms |
171
- | **Throughput** | 50,000+ msg/sec |
172
- | **Banks Tested** | 5 (HDFC, ICICI, SBI, Axis, Kotak) |
173
-
174
- ---
175
-
176
- ## 💻 CLI Usage
177
-
178
- ```bash
179
- # Extract from text
180
- finee extract "Rs.500 debited from A/c 1234"
181
-
182
- # Show version
183
- finee --version
184
-
185
- # Check available backends
186
- finee backends
187
- ```
188
-
189
- ---
190
-
191
  ## 📁 Repository Structure
192
 
193
  ```
194
  Finance-Entity-Extractor/
195
- ├── src/finee/ # Core package (16 modules)
196
- │ ├── extractor.py # Pipeline orchestrator
197
- │ ├── regex_engine.py # 50+ regex patterns
198
- │ ├── merchants.py # 200+ VPA mappings
199
- │ └── backends/ # MLX, PyTorch, GGUF
200
  ├── tests/ # 88 unit tests
201
- ├── examples/ # Colab notebook
202
- ├── experiments/ # Research notebooks
203
- ├── benchmark.py # ⭐ Verify accuracy yourself
204
- ├── pyproject.toml
205
- └── README.md
206
  ```
207
 
208
  ---
209
 
210
  ## 🤝 Contributing
211
 
212
- ```bash
213
- git clone https://github.com/Ranjitbehera0034/Finance-Entity-Extractor.git
214
- cd Finance-Entity-Extractor
215
- pip install -e ".[dev]"
216
- pytest tests/
217
- ```
218
 
219
  ---
220
 
221
  ## 📄 License
222
 
223
- MIT License - see [LICENSE](LICENSE)
224
 
225
  ---
226
 
@@ -228,6 +212,6 @@ MIT License - see [LICENSE](LICENSE)
228
 
229
  **Made with ❤️ by Ranjit Behera**
230
 
231
- [PyPI](https://pypi.org/project/finee/) · [GitHub](https://github.com/Ranjitbehera0034/Finance-Entity-Extractor) · [Hugging Face](https://huggingface.co/Ranjit0034/finance-entity-extractor)
232
 
233
  </div>
 
21
  [![PyPI](https://img.shields.io/pypi/v/finee?style=for-the-badge&logo=pypi&logoColor=white)](https://pypi.org/project/finee/)
22
  [![Tests](https://github.com/Ranjitbehera0034/Finance-Entity-Extractor/actions/workflows/tests.yml/badge.svg)](https://github.com/Ranjitbehera0034/Finance-Entity-Extractor/actions/workflows/tests.yml)
23
  [![License](https://img.shields.io/badge/License-MIT-green?style=for-the-badge)](https://opensource.org/licenses/MIT)
24
+
25
  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Ranjitbehera0034/Finance-Entity-Extractor/blob/main/examples/demo.ipynb)
26
 
27
+ **Production-grade Finance NER for Indian Banks**
28
  <br>
29
+ *Hybrid Regex + Phi-3 LLM • 94.5% accuracy • <1ms latency*
30
 
31
  </div>
32
 
33
  ---
34
 
35
+ ## 🔥 Hybrid Architecture
36
+
37
+ > **Runs 100% offline using Regex by default.**
38
+ > **Optional 3.8B LLM auto-downloads only for complex edge cases.**
39
+
40
+ | Mode | Latency | Accuracy | Model Download |
41
+ |------|---------|----------|----------------|
42
+ | **Regex (Default)** | <1ms | 87% | ❌ None |
43
+ | **Regex + LLM** | ~50ms | 94.5% | ✅ 7GB (one-time) |
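
The two modes in the table can be pictured as a fallback chain: regex first, a lookup table second, and the LLM only when a field is still empty. The sketch below is a minimal illustration under stated assumptions, not FinEE's actual code: `Result`, `VPA_TO_MERCHANT`, `extract`, and both regex patterns are hypothetical stand-ins, and the `llm` argument is any callable you supply.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    amount: Optional[float] = None
    merchant: Optional[str] = None


# Hypothetical two-entry lookup standing in for the README's "200+ VPA -> merchant" table.
VPA_TO_MERCHANT = {"swiggy@ybl": "Swiggy", "zomato@paytm": "Zomato"}

# Hypothetical patterns, far simpler than a production regex set.
AMOUNT_RE = re.compile(r"(?:rs\.?|inr|₹)\s*(\d+(?:,\d+)*(?:\.\d+)?)", re.IGNORECASE)
VPA_RE = re.compile(r"[\w.\-]+@[a-z]+", re.IGNORECASE)


def extract(text: str, llm: Optional[Callable[[str], str]] = None) -> Result:
    result = Result()

    # Tier 1: regex pulls out the fields that have a predictable shape.
    amount = AMOUNT_RE.search(text)
    if amount:
        result.amount = float(amount.group(1).replace(",", ""))
    vpa = VPA_RE.search(text)

    # Tier 2: rule-based mapping from VPA to a normalized merchant name.
    if vpa:
        result.merchant = VPA_TO_MERCHANT.get(vpa.group(0).lower())

    # Tier 3: the optional LLM is consulted only if a field is still missing.
    if result.merchant is None and llm is not None:
        result.merchant = llm(text)
    return result
```

With `llm=None` the function never leaves the regex and lookup tiers, which mirrors the "Regex (Default)" row; passing a callable corresponds to the "Regex + LLM" row.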
44
+
45
+ ---
46
+
47
+ ## ⚡ Install in 10 Seconds
48
 
49
  ```bash
50
  pip install finee
 
60
  print(r.category) # "food"
61
  ```
62
 
63
+ **Try it now:** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Ranjitbehera0034/Finance-Entity-Extractor/blob/main/examples/demo.ipynb)
64
+
65
+ ---
66
+
67
+ ## 🧠 Enable LLM Mode (For Edge Cases)
68
+
69
+ ```python
70
+ from finee import FinEE
71
+ from finee.schema import ExtractionConfig
72
+
73
+ # Downloads 7GB model once, then runs locally
74
+ extractor = FinEE(ExtractionConfig(use_llm=True))
75
+ result = extractor.extract("Your complex bank message...")
76
+ ```
77
+
78
+ **Supported Backends:**
79
+ - Apple Silicon → MLX (fastest)
80
+ - NVIDIA GPU → PyTorch/CUDA
81
+ - CPU → llama.cpp (GGUF)
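
The backend list above reads as a simple priority order. As a rough sketch of that decision (an assumption about how such a choice could be made, not how FinEE selects its backend), the hypothetical `pick_backend` helper below takes the platform facts as arguments so it stays testable on any machine:

```python
import platform


def pick_backend(system: str, machine: str, has_cuda: bool) -> str:
    """Return a backend label following the priority order in the list above."""
    # Apple Silicon reports "Darwin" / "arm64" via the platform module.
    if system == "Darwin" and machine == "arm64":
        return "mlx"
    # An NVIDIA GPU is assumed to be detected elsewhere and passed in as a flag.
    if has_cuda:
        return "pytorch"
    # Everything else falls back to CPU inference with a GGUF model.
    return "gguf"


# Example call for the current machine; CUDA detection is left to the caller.
current = pick_backend(platform.system(), platform.machine(), has_cuda=False)
```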
82
 
83
  ---
84
 
 
96
  "reference": "534567891234",// string - UPI/NEFT ref
97
  "merchant": "Swiggy", // string - Normalized name
98
  "category": "food", // string - food|shopping|transport|...
99
+ "confidence": 0.95 // float - 0.0 to 1.0
 
 
100
  }
101
  ```
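
A consumer of this output may want to check the shape before trusting it. The sketch below validates only the four fields visible in this hunk (`reference`, `merchant`, `category`, `confidence`); `validate_result` is a hypothetical helper, not part of the finee package, and the full schema has more fields than are shown here.

```python
from typing import Any, Dict, List


def validate_result(payload: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the visible fields look right."""
    problems: List[str] = []

    # The three string fields shown in the schema excerpt.
    for key in ("reference", "merchant", "category"):
        if not isinstance(payload.get(key), str):
            problems.append(f"{key} should be a string")

    # confidence is documented as a float between 0.0 and 1.0.
    conf = payload.get("confidence")
    is_number = isinstance(conf, (int, float)) and not isinstance(conf, bool)
    if not is_number or not 0.0 <= conf <= 1.0:
        problems.append("confidence should be a float between 0.0 and 1.0")
    return problems
```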
102
 
 
104
 
105
  ## 🔬 Verify Accuracy Yourself
106
 
 
 
107
  ```bash
 
108
  git clone https://github.com/Ranjitbehera0034/Finance-Entity-Extractor.git
109
  cd Finance-Entity-Extractor
110
  pip install finee
 
 
111
  python benchmark.py --all
112
  ```
113
 
 
 
 
 
 
114
  ---
115
 
116
+ ## 💀 Edge Case Handling
117
 
118
+ | Input | Result |
119
+ |-------|--------|
120
+ | `Rs.500.00debited from A/c1234` (no spaces) | ✅ amount=500.0 |
121
+ | `₹2,500 debited` (Unicode) | ✅ amount=2500.0 |
122
+ | `1.5 Lakh credited` (Lakhs) | ✅ amount=150000.0 |
123
+ | `Rs.500 debited. Bal: Rs.15,000` (multiple) | ✅ amount=500.0 |
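
To make the four rows concrete, here is one way a single pattern could cover them. This is an illustrative sketch and not one of FinEE's own patterns: `NUMBER`, `AMOUNT_RE`, and `parse_amount` are hypothetical names, and the rule "take the first match" is assumed from the last row of the table.

```python
import re
from typing import Optional

# A number with optional thousands separators (Indian or Western grouping) and decimals.
NUMBER = r"\d+(?:,\d+)*(?:\.\d+)?"

# Either a currency marker followed by a number, or a number followed by "lakh"/"lac".
AMOUNT_RE = re.compile(
    rf"(?:rs\.?|inr|₹)\s*({NUMBER})|({NUMBER})\s*(?:lakhs?|lacs?)\b",
    re.IGNORECASE,
)


def parse_amount(text: str) -> Optional[float]:
    """Return the first amount found in the message, or None."""
    match = AMOUNT_RE.search(text)
    if match is None:
        return None
    if match.group(1) is not None:
        # Currency-prefixed amount: strip separators and convert.
        return float(match.group(1).replace(",", ""))
    # Lakh-suffixed amount: 1 lakh = 100,000.
    return float(match.group(2).replace(",", "")) * 100_000


for sms in (
    "Rs.500.00debited from A/c1234",
    "₹2,500 debited",
    "1.5 Lakh credited",
    "Rs.500 debited. Bal: Rs.15,000",
):
    print(sms, "->", parse_amount(sms))
```

Requiring either a currency marker or a lakh suffix keeps bare digit runs such as the account number in `A/c1234` from being read as an amount.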
124
 
125
+ ---
 
 
 
 
 
 
 
 
126
 
127
+ ## 🏦 Supported Banks
128
+
129
+ | Bank | Status |
130
+ |------|--------|
131
+ | HDFC | ✅ |
132
+ | ICICI | ✅ |
133
+ | SBI | ✅ |
134
+ | Axis | ✅ |
135
+ | Kotak | ✅ |
136
 
137
  ---
138
 
139
+ ## 📊 Benchmark
140
 
141
+ | Metric | Value |
142
+ |--------|-------|
143
+ | **Field Accuracy** | 94.5% (with LLM) |
144
+ | **Regex-only Accuracy** | 87.5% |
145
+ | **Latency (Regex)** | <1ms |
146
+ | **Throughput** | 50,000+ msg/sec |
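
The README does not say how "field accuracy" is computed. One plausible reading, offered here purely as an assumption and not as what `benchmark.py` does, is the share of expected fields that the extractor reproduces exactly. The hypothetical `field_accuracy` function below sketches that definition:

```python
from typing import Any, Dict, List


def field_accuracy(
    predictions: List[Dict[str, Any]], expected: List[Dict[str, Any]]
) -> float:
    """Fraction of expected fields whose predicted value matches exactly."""
    total = 0
    correct = 0
    for pred, gold in zip(predictions, expected):
        for field, value in gold.items():
            total += 1
            if pred.get(field) == value:
                correct += 1
    # Avoid dividing by zero when there is nothing to score.
    return correct / total if total else 0.0
```

Under this definition a message with one wrong field out of four still contributes three correct fields, so the score is more forgiving than whole-message exact match.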
 
147
 
148
  ---
149
 
 
159
  │
160
  ▼
161
  ┌─────────────────────────────────────────────────────────────┐
162
+ │ TIER 1: Regex Engine (50+ patterns) │

163
  └─────────────────────────────────────────────────────────────┘
164
  │
165
  ▼
166
  ┌─────────────────────────────────────────────────────────────┐
167
  │ TIER 2: Rule-Based Mapping (200+ VPA → merchant) │

168
  └─────────────────────────────────────────────────────────────┘
169
  │
170
  ▼
171
  ┌─────────────────────────────────────────────────────────────┐
172
+ │ TIER 3: Phi-3 LLM (Optional - downloads 7GB model) │
173
+ │ Only called for edge cases │
174
  └─────────────────────────────────────────────────────────────┘
175
  │
176
  ▼
 
179
 
180
  ---
181
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
182
  ## 📁 Repository Structure
183
 
184
  ```
185
  Finance-Entity-Extractor/
186
+ ├── src/finee/ # Core package
 
 
 
 
187
  ├── tests/ # 88 unit tests
188
+ ├── examples/demo.ipynb # 👈 Try in Colab!
189
+ ├── benchmark.py # Verify accuracy
190
+ ├── CHANGELOG.md # Release history
191
+ └── CONTRIBUTING.md # How to contribute
 
192
  ```
193
 
194
  ---
195
 
196
  ## 🤝 Contributing
197
 
198
+ See [CONTRIBUTING.md](CONTRIBUTING.md) for:
199
+ - Git Flow branching strategy
200
+ - How to run tests
201
+ - Release process
 
 
202
 
203
  ---
204
 
205
  ## 📄 License
206
 
207
+ MIT License
208
 
209
  ---
210
 
 
212
 
213
  **Made with ❤️ by Ranjit Behera**
214
 
215
+ [PyPI](https://pypi.org/project/finee/) • [GitHub](https://github.com/Ranjitbehera0034/Finance-Entity-Extractor) • [Hugging Face](https://huggingface.co/Ranjit0034/finance-entity-extractor)
216
 
217
  </div>