Update README.md
README.md
CHANGED
|
@@ -8,117 +8,12 @@ pinned: true
|
|
| 8 |
short_description: WalletSync DUPLICATE TRANSACTION DETECTION
|
| 9 |
---
|
| 10 |
|
| 11 |
-
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
|
| 12 |
|
| 13 |
Auto Expense Categorization – Duplicate Detection
|
| 14 |
=================================================
|
| 15 |
|
| 16 |
-
This mini-service connects to the `expense` MongoDB database and surfaces *soft* merge suggestions whenever two or more expense entries look like the same purchase. The rules currently implemented are:
|
| 17 |
|
| 18 |
-
|
| 19 |
-
* Timestamp difference within a configurable ±N minutes window (default: 10 min)
|
| 20 |
-
* Merchant names that are either identical once normalised or mapped through a merchant-alias table
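The two rules above can be sketched roughly as follows. This is only an illustration, not the code in `src/`: the function names, the `date`/`merchant` keys, the `"exact"` label, and the shape of the alias mapping (normalised alias to normalised canonical name) are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so ' Star  Bucks ' equals 'star bucks'."""
    return " ".join(name.lower().split())


def merchants_match(a: str, b: str, aliases: dict) -> "str | None":
    """Return 'exact', 'alias', or None (hypothetical rule labels)."""
    na, nb = normalise(a), normalise(b)
    if na == nb:
        return "exact"
    # Assumed alias shape: normalised alias -> normalised canonical name.
    if aliases.get(na, na) == aliases.get(nb, nb):
        return "alias"
    return None


def looks_like_duplicate(e1: dict, e2: dict, aliases: dict, minutes: int = 10) -> bool:
    """True when both entries fall inside the +/- N minute window and the merchants match."""
    within_window = abs(e1["date"] - e2["date"]) <= timedelta(minutes=minutes)
    return within_window and merchants_match(e1["merchant"], e2["merchant"], aliases) is not None


if __name__ == "__main__":
    t = datetime(2024, 1, 1, 12, 0)
    first = {"date": t, "merchant": "Starbucks"}
    second = {"date": t + timedelta(minutes=4), "merchant": " SBUX "}
    print(looks_like_duplicate(first, second, {"sbux": "starbucks"}))
```

Keeping the merchant check separate from the time check makes it easy to record which rule fired, which is what the `merchant_match_rule` field in a suggestion appears to capture.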
|
| 21 |
|
| 22 |
-
Instead of deleting or editing any expense documents, the service writes a merge suggestion into the `merge_suggestions` collection so that an operator (or another automation) can perform the actual merge later.
|
| 23 |
|
| 24 |
-
|
| 25 |
-
-----------
|
| 26 |
-
|
| 27 |
-
1. Create a virtual environment and install dependencies:
|
| 28 |
-
|
| 29 |
-
```
|
| 30 |
-
python3 -m venv .venv
|
| 31 |
-
source .venv/bin/activate   # on Windows: .\.venv\Scripts\activate
|
| 32 |
-
python3 -m pip install -r requirements.txt
|
| 33 |
-
```
|
| 34 |
-
|
| 35 |
-
2. Copy `.env.example` to `.env` and set the Mongo connection string if you do not want to rely on the baked-in default.
|
| 36 |
-
|
| 37 |
-
3. Run the detector (the default config scans the last 48 h of data and writes suggestions only). For the historical `transactions` collection you may want to bump the lookback window:
|
| 38 |
-
|
| 39 |
-
```
|
| 40 |
-
python3 -m src.main --minutes 30 --lookback-hours 720
|
| 41 |
-
```
|
| 42 |
-
|
| 43 |
-
You will see log lines such as:
|
| 44 |
-
|
| 45 |
-
```
|
| 46 |
-
INFO DuplicateDetector Identified 2 duplicates, suggestion 673a...
|
| 47 |
-
```
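One way to picture the lookback window from step 3 is as a cutoff timestamp turned into a MongoDB query filter. The sketch below is an assumption about how such a filter could be built, not the detector's actual query: `lookback_filter` is a hypothetical helper, and it presumes the timestamp is stored as a BSON date in the `date` field.

```python
from datetime import datetime, timedelta, timezone


def lookback_filter(lookback_hours: int = 48, time_field: str = "date", now=None) -> dict:
    """Build a filter matching documents no older than `lookback_hours`."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=lookback_hours)
    # `$gte` is the standard MongoDB "greater than or equal" query operator.
    return {time_field: {"$gte": cutoff}}


if __name__ == "__main__":
    # 720 hours is 30 days, matching the --lookback-hours 720 example.
    print(lookback_filter(720))
```

A filter like this could be passed to a collection's `find()` call so that only recent documents are compared.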
|
| 48 |
-
|
| 49 |
-
API Server
|
| 50 |
-
----------
|
| 51 |
-
|
| 52 |
-
Run the HTTP service with FastAPI/uvicorn:
|
| 53 |
-
|
| 54 |
-
```
|
| 55 |
-
python3 -m uvicorn src.api:app --reload
|
| 56 |
-
```
|
| 57 |
-
|
| 58 |
-
Endpoints:
|
| 59 |
-
|
| 60 |
-
* `GET /health` – readiness probe.
|
| 61 |
-
* `POST /duplicates/detect` – kicks off a scan (body can override `lookback_hours`, `limit`, `amount_pct`, `minutes`).
|
| 62 |
-
* `GET /suggestions?limit=50` – lists recent merge suggestions so the UI can ask “These seem similar. Would you like to merge them?”.
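As a rough illustration of calling `POST /duplicates/detect`, the sketch below builds the request with the standard library only. It assumes the body is a flat JSON object carrying the override names listed above and that the server listens on uvicorn's default `http://127.0.0.1:8000`; `build_detect_request` is a hypothetical helper, not part of this repository.

```python
import json
from urllib import request

ALLOWED_OVERRIDES = {"lookback_hours", "limit", "amount_pct", "minutes"}


def build_detect_request(base_url: str, **overrides) -> request.Request:
    """Prepare (but do not send) a POST /duplicates/detect request."""
    unknown = set(overrides) - ALLOWED_OVERRIDES
    if unknown:
        raise ValueError(f"unsupported override(s): {sorted(unknown)}")
    return request.Request(
        url=base_url.rstrip("/") + "/duplicates/detect",
        data=json.dumps(overrides).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_detect_request("http://127.0.0.1:8000", minutes=30, lookback_hours=720)
    # Sending requires the API server to be running.
    with request.urlopen(req) as resp:
        print(json.load(resp))
```

Rejecting unknown keys on the client side is a design choice for the example; the server may well ignore or validate them itself.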
|
| 63 |
-
|
| 64 |
-
Collections
|
| 65 |
-
-----------
|
| 66 |
-
|
| 67 |
-
* `transactions` (default): source data. The detector automatically maps the `date`/`createdAt` timestamp and `note`/`paymentType` merchant fields so you still get near-duplicate detection without reshaping your documents. Entries are only compared if they belong to the same `user`.
|
| 68 |
-
* `merchant_aliases`: optional alias definitions (`name`, `aliases`).
|
| 69 |
-
* `merge_suggestions`: the service writes documents shaped as:
|
| 70 |
-
|
| 71 |
-
```
|
| 72 |
-
{
|
| 73 |
-
"_id": ObjectId(...),
|
| 74 |
-
"candidate_ids": [...],
|
| 75 |
-
"message": "These seem similar. Would you like to merge them?",
|
| 76 |
-
"details": {
|
| 77 |
-
"amount_delta_pct": 0.53,
|
| 78 |
-
"time_delta_minutes": 4.2,
|
| 79 |
-
"merchant_match_rule": "alias"
|
| 80 |
-
},
|
| 81 |
-
"audit": {
|
| 82 |
-
"generated_by": "duplicate-detector",
|
| 83 |
-
"generated_at": ISODate(...)
|
| 84 |
-
},
|
| 85 |
-
"status": "pending"
|
| 86 |
-
}
|
| 87 |
-
```
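A document of this shape could be assembled along the following lines. This is a minimal sketch under stated assumptions: `build_suggestion` is a hypothetical helper, and `_id` is left out on the assumption that MongoDB assigns it on insert.

```python
from datetime import datetime, timezone


def build_suggestion(candidate_ids, amount_delta_pct, time_delta_minutes, rule, now=None) -> dict:
    """Assemble a pending merge suggestion mirroring the shape shown above."""
    return {
        "candidate_ids": list(candidate_ids),
        "message": "These seem similar. Would you like to merge them?",
        "details": {
            "amount_delta_pct": amount_delta_pct,
            "time_delta_minutes": time_delta_minutes,
            "merchant_match_rule": rule,
        },
        "audit": {
            "generated_by": "duplicate-detector",
            "generated_at": now or datetime.now(timezone.utc),
        },
        # New suggestions start as pending until an operator acts on them.
        "status": "pending",
    }


if __name__ == "__main__":
    print(build_suggestion(["id-1", "id-2"], 0.53, 4.2, "alias"))
```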
|
| 88 |
-
|
| 89 |
-
Configuration
|
| 90 |
-
-------------
|
| 91 |
-
|
| 92 |
-
All tunables live in `src/config.py`. Environment variables take precedence, so you can tune tolerances per deployment without editing code.
|
| 93 |
-
|
| 94 |
-
| Variable | Description | Default |
|
| 95 |
-
| --- | --- | --- |
|
| 96 |
-
| `MONGO_URI` | Mongo connection string | Provided URI |
|
| 97 |
-
| `MONGO_DB` | Database name | `expense` |
|
| 98 |
-
| `MONGO_EXPENSE_COLLECTION` | Expenses collection | `transactions` |
|
| 99 |
-
| `MONGO_ALIAS_COLLECTION` | Merchant alias collection | `merchant_aliases` |
|
| 100 |
-
| `MONGO_SUGGESTION_COLLECTION` | Merge-suggestion collection | `merge_suggestions` |
|
| 101 |
-
| `AMOUNT_TOLERANCE_PCT` | Amount delta percentage | `1.0` |
|
| 102 |
-
| `TIME_TOLERANCE_MINUTES` | Time delta minutes | `10` |
|
| 103 |
-
| `DEFAULT_LOOKBACK_HOURS` | How far back to scan | `48` |
|
| 104 |
-
| `TIME_FIELDS` | CSV priority order for timestamps | `date,expense_time,createdAt` |
|
| 105 |
-
| `MERCHANT_FIELDS` | CSV priority order for merchant labels | `merchant,note,paymentType,type,to` |
|
| 106 |
-
| `USER_FIELD` | Source field that stores the user id (inferred automatically) | `user` |
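To illustrate how environment variables can take precedence over the defaults in the table, here is a small sketch. It is not the contents of `src/config.py`: the `Settings` class and `load_settings` function are hypothetical, and only a subset of the variables is shown.

```python
import os
from dataclasses import dataclass


def _csv(value: str) -> list:
    """Split a comma-separated priority list, dropping empty entries."""
    return [part.strip() for part in value.split(",") if part.strip()]


@dataclass(frozen=True)
class Settings:
    mongo_db: str
    amount_tolerance_pct: float
    time_tolerance_minutes: int
    default_lookback_hours: int
    time_fields: list
    merchant_fields: list


def load_settings(env=None) -> Settings:
    """Read each tunable from the environment, falling back to the documented default."""
    env = os.environ if env is None else env
    return Settings(
        mongo_db=env.get("MONGO_DB", "expense"),
        amount_tolerance_pct=float(env.get("AMOUNT_TOLERANCE_PCT", "1.0")),
        time_tolerance_minutes=int(env.get("TIME_TOLERANCE_MINUTES", "10")),
        default_lookback_hours=int(env.get("DEFAULT_LOOKBACK_HOURS", "48")),
        time_fields=_csv(env.get("TIME_FIELDS", "date,expense_time,createdAt")),
        merchant_fields=_csv(env.get("MERCHANT_FIELDS", "merchant,note,paymentType,type,to")),
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing a plain dict as `env` makes the loader easy to exercise in tests without touching the real process environment.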
|
| 107 |
-
|
| 108 |
-
Smoke Test
|
| 109 |
-
----------
|
| 110 |
-
|
| 111 |
-
Use the bundled `test.py` script to hit the running API (locally or on the Hugging Face Space) via the base URL:
|
| 112 |
-
|
| 113 |
-
```
|
| 114 |
-
python3 test.py --base-url https://LogicGoInfotechSpaces-duplicate-transaction-detection.hf.space --lookback-hours 720 --limit 5000
|
| 115 |
-
```
|
| 116 |
-
|
| 117 |
-
The script calls `/health`, `/duplicates/detect`, and `/suggestions` in sequence and prints the responses so you can quickly verify the deployment.
|
| 118 |
-
|
| 119 |
-
Next Steps
|
| 120 |
-
----------
|
| 121 |
-
|
| 122 |
-
* Wire this module into your ingestion pipeline so suggestions are generated immediately after a new expense is stored.
|
| 123 |
-
* Surface the `merge_suggestions` collection in your UI to show prompts such as “These seem similar. Would you like to merge them?”
|
| 124 |
-
* Extend `MerchantAliasResolver` to sync aliases from your upstream ERP or ML model.
|
|
|
|
| 8 |
short_description: WalletSync DUPLICATE TRANSACTION DETECTION
|
| 9 |
---
|
| 10 |
|
|
|
|
| 11 |
|
| 12 |
Auto Expense Categorization – Duplicate Detection
|
| 13 |
=================================================
|
| 14 |
|
|
|
|
| 15 |
|
| 16 |
+
|
|
|
|
|
|
|
| 17 |
|
|
|
|
| 18 |
|
| 19 |
+
==