Ranjit Behera committed on
Commit 438d5f9 · 1 parent: 9101d7e
docs: Update CHANGELOG for v1.1.0 release
CHANGELOG.md +36 -4
@@ -9,11 +9,43 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Added
 - (Next features go here)
 
-
-- (Changes to existing features)
+---
 
-##
-
+## [1.1.0] - 2026-01-12
+### Added
+- **Complete Data Pipeline** (`scripts/data_pipeline/`)
+  - `step1_unify.py`: Unifies MBOX, JSON, CSV, XML sources
+  - `step2_filter.py`: Removes OTPs, spam, marketing messages
+  - `step3_baseline.py`: Tests regex extractor accuracy
+  - `step4_label.py`: Creates labeled training data with ground truth
+
+- **Synthetic Data Generator**
+  - `generate_synthetic.py`: Production-grade grammar-based generator
+    - 100K+ realistic Indian bank transactions
+    - All major banks (HDFC, ICICI, SBI, Axis, Kotak, PNB, BOB, etc.)
+    - Brokerages (Zerodha, Groww, Upstox, Angel One, 5Paisa, etc.)
+    - E-commerce, food, travel, utilities, entertainment categories
+  - `generate_advanced.py`: Advanced features
+    - Markov Chain for realistic message flow
+    - Real data calibration from actual samples
+    - Multilingual support (Hindi, Tamil, Telugu, Bengali, Kannada)
+    - Data augmentation and edge case oversampling
+
+- **LLM Fine-tuning Pipeline** (`scripts/finetune.py`)
+  - Supports MLX (Apple Silicon) and PyTorch backends
+  - LoRA fine-tuning with automatic data preparation
+  - Model fusion and evaluation utilities
+
+### Performance
+- Trained on 152,519 records (2,419 real + 100K synthetic + 50K multilingual)
+- Val loss: 2.42 → 0.46 (81% reduction)
+- 100% JSON parsing accuracy on test cases
+- Multilingual extraction working (Hindi, Tamil, Telugu, Bengali, Kannada)
+- Fine-tuned model: 7.6GB (Phi-3-mini + LoRA fused)
+
+### Models
+- Fine-tuned model: `finetuned-v1/` on Hugging Face
+- LoRA adapters: `lora-adapters/` on Hugging Face
 
 ---
 
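The "81% reduction" quoted in the Performance entry follows from the two validation-loss values given there. A minimal sketch of that arithmetic is below; the helper name `relative_reduction` is hypothetical and is not part of the repository.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional drop from `before` to `after` (hypothetical helper)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Validation-loss values as stated in the changelog entry.
val_loss_start = 2.42
val_loss_end = 0.46

reduction = relative_reduction(val_loss_start, val_loss_end)
print(f"{reduction:.0%}")  # prints 81%
```

The reduction is computed relative to the starting loss, which is the usual reading of "X → Y (N% reduction)".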