File size: 52,810 Bytes
31f0e50
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
1429
1430
1431
1432
1433
1434
1435
1436
1437
1438
1439
1440
1441
1442
1443
1444
1445
1446
1447
1448
1449
1450
1451
1452
1453
1454
1455
1456
1457
1458
1459
1460
1461
1462
1463
1464
1465
1466
1467
1468
1469
1470
1471
1472
1473
1474
1475
1476
1477
1478
1479
1480
1481
1482
1483
1484
1485
1486
1487
1488
1489
1490
1491
1492
1493
1494
1495
1496
1497
1498
1499
1500
1501
1502
1503
1504
1505
1506
1507
1508
1509
1510
1511
1512
1513
1514
1515
1516
1517
1518
1519
1520
1521
1522
1523
1524
1525
1526
1527
1528
1529
1530
1531
1532
1533
1534
1535
1536
1537
1538
1539
1540
1541
1542
1543
1544
1545
1546
1547
1548
1549
1550
1551
1552
1553
1554
1555
1556
1557
1558
1559
1560
1561
1562
1563
1564
1565
1566
1567
1568
1569
1570
1571
1572
1573
1574
1575
1576
1577
1578
1579
1580
1581
1582
1583
1584
1585
1586
1587
1588
1589
1590
1591
1592
1593
1594
1595
1596
1597
1598
1599
1600
1601
1602
1603
1604
1605
1606
1607
1608
1609
1610
1611
1612
1613
1614
1615
1616
1617
1618
1619
1620
1621
1622
1623
1624
1625
1626
1627
1628
1629
1630
1631
1632
1633
1634
1635
1636
1637
1638
1639
1640
1641
1642
1643
1644
1645
1646
1647
1648
1649
1650
1651
1652
1653
1654
1655
1656
1657
1658
1659
1660
1661
1662
1663
1664
1665
1666
1667
1668
# ScamShield AI: Agentic Honeypot System - FINAL IMPLEMENTATION SPECIFICATION
## India AI Impact Buildathon 2026 - Challenge 2
## Target: TOP 10 from 40,000 Participants

**Author/Team Lead:** Shivam Bhuva (@shivambhuva8866)  
**Date:** January 26, 2026  
**Submission Deadline:** February 5, 2026, 11:59 PM  
**Challenge:** Agentic Honey-Pot for Scam Detection & Intelligence Extraction  
**Testing Mode:** API Endpoint Submission (Mock Scammer API Integration)

---

## EXECUTIVE SUMMARY

This document provides the **production-ready implementation specification** for ScamShield AI's Agentic Honeypot System. Built exclusively for Challenge 2 of the India AI Impact Buildathon 2026, this system autonomously detects scam messages, engages scammers with believable AI personas, and extracts actionable intelligence (bank accounts, UPI IDs, phishing links).

**Key Differentiators:**
-**100% FREE TIER** - Zero paid APIs or services
-**Phased Implementation** - Text-first (Phase 1), Audio-later (Phase 2)
-**API-First Design** - Direct Mock Scammer API integration
-**Bilingual Focus** - English + Hindi only (high accuracy)
-**Structured JSON Outputs** - Competition-ready response format
-**Production-Grade** - LangGraph ReAct agents with state persistence

**AI Usage:** 90% (detection, agentic engagement, extraction)  
**Non-AI:** 10% (API wrappers, data preprocessing)

---

## PROBLEM STATEMENT (Official Challenge 2 Requirements)

**Objective:** Design an autonomous AI honeypot system that:
1. Detects scam messages accurately
2. Actively engages scammers using believable personas
3. Extracts intelligence: bank accounts, UPI IDs, phishing links
4. Integrates with Mock Scammer API for testing
5. Returns structured JSON outputs

**India's Scam Crisis Context:**
- 500,000+ scam calls/messages daily (TRAI 2025)
- ₹60+ crore daily losses
- 3+ spam messages per citizen
- Predominant scams: UPI fraud, fake loans, police/bank impersonation
- 47% Indians affected by or know victims of scam fraud

---

## IMPLEMENTATION ARCHITECTURE

### **PHASE 1: TEXT-BASED HONEYPOT (Priority for Feb 5 Submission)**

```

┌─────────────────────────────────────────────────────────────────┐

│                      INPUT LAYER                                │

│  Mock Scammer API → JSON Message → ScamShield API Endpoint     │

└─────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────┐

│                   DETECTION MODULE                              │

│  • IndicBERT (ai4bharat/indic-bert) - Scam Classification      │

│  • Keyword Matching (UPI, OTP, bank, police, arrest)           │

│  • Language Detection (langdetect) - English/Hindi             │

│  • Confidence Scoring (>0.7 = scam trigger)                    │

└─────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────┐

│                  HAND-OFF DECISION                              │

│  IF scam_confidence > 0.7:                                      │

│     → Trigger Honeypot Engagement                              │

│  ELSE:                                                          │

│     → Return "not_scam" response                               │

└─────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────┐

│               AGENTIC ENGAGEMENT MODULE                         │

│  Framework: LangGraph + ReAct Loop                             │

│  LLM: Groq Llama 3.1 70B (FREE API, 30 req/min)               │

│                                                                 │

│  Agent Components:                                              │

│  ├─ Persona Generator (elderly/gullible/confused)              │

│  ├─ Response Planner (stalling tactics, probing questions)     │

│  ├─ Context Tracker (conversation state in ChromaDB)           │

│  ├─ Safety Monitor (avoid escalation)                          │

│  └─ Termination Logic (max 20 turns or intel extracted)        │

│                                                                 │

│  Engagement Strategy:                                           │

│  • Turn 1-5: Show interest, ask clarifying questions           │

│  • Turn 6-12: Express confusion, request details repeatedly    │

│  • Turn 13-20: Probe for bank/UPI/links with urgency          │

└─────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────┐

│            INTELLIGENCE EXTRACTION MODULE                       │

│  • NER: spaCy (en_core_web_sm) for entities                    │

│  • Regex Patterns:                                              │

│    - UPI: [a-zA-Z0-9._]+@[a-zA-Z]+                            │

│    - Bank Account: \d{9,18}                                    │

│    - IFSC: [A-Z]{4}0[A-Z0-9]{6}                               │

│    - Phone: \+91[\s-]?\d{10}|\d{10}                           │

│    - URLs: https?://[^\s]+                                     │

│  • Confidence Scoring per entity (0.0-1.0)                     │

│  • Validation: Check format correctness                        │

└─────────────────────────────────────────────────────────────────┘


┌─────────────────────────────────────────────────────────────────┐

│                  OUTPUT LAYER (JSON)                            │

│  {                                                              │

│    "scam_detected": true,                                       │

│    "confidence": 0.95,                                          │

│    "language": "hindi",                                         │

│    "conversation_transcript": [...],                            │

│    "extracted_intelligence": {                                  │

│      "upi_ids": ["scammer@paytm"],                             │

│      "bank_accounts": ["1234567890"],                          │

│      "ifsc_codes": ["SBIN0001234"],                            │

│      "phone_numbers": ["+919876543210"],                       │

│      "phishing_links": ["http://fake-bank.com"]                │

│    },                                                           │

│    "engagement_turns": 15,                                      │

│    "extraction_confidence": 0.87                                │

│  }                                                              │

└─────────────────────────────────────────────────────────────────┘

```

---

## TECH STACK (100% FREE TIER)

### **1. Core AI/ML Models**

| Component | Model/Library | Source | Free Tier Limits | Why Chosen |
|-----------|--------------|--------|------------------|------------|
| **LLM (Agentic Core)** | Groq Llama 3.1 70B | Groq Cloud API | 30 req/min, 6000/day | Fastest inference (280 tokens/sec), excellent Hindi support |
| **Scam Detection** | ai4bharat/indic-bert | Hugging Face | Unlimited (local) | Best for Hindi-English code-mixed text |
| **NER Extraction** | en_core_web_sm | spaCy | Unlimited (local) | Fast, accurate entity recognition |

| **Language Detection** | langdetect | PyPI | Unlimited (local) | 99%+ accuracy for Hindi/English |

| **Embeddings** | sentence-transformers/all-MiniLM-L6-v2 | Hugging Face | Unlimited (local) | Lightweight, 384-dim embeddings |



### **2. Agentic AI Framework**



```python

# LangGraph + LangChain (Open Source)

langgraph==0.0.20        # State graph orchestration

langchain==0.1.0         # ReAct agent framework

langchain-groq==0.0.1    # Groq LLM integration

```



**Why LangGraph:**

- Built-in state persistence (conversation memory)

- ReAct loop support (Reason → Act → Observe)

- Conditional branching for dynamic engagement

- Multi-turn conversation handling

- Free and open-source



### **3. Vector Database & Storage**



| Component | Tool | Free Tier | Purpose |

|-----------|------|-----------|---------|

| **Vector DB** | ChromaDB | Unlimited (local) | Store conversation embeddings, scam patterns |

| **Relational DB** | PostgreSQL | 1GB (Supabase free) | Conversation logs, scammer profiles |

| **Cache** | Redis | 30MB (Redis Cloud free) | Real-time state, session management |



### **4. API Framework**



```python

# FastAPI + Production Tools

fastapi==0.104.1         # High-performance API

uvicorn==0.24.0          # ASGI server

pydantic==2.5.0          # Data validation

```



### **5. Supporting Libraries**



```python

# NLP & Text Processing

spacy==3.7.2

transformers==4.35.0

torch==2.1.0

sentence-transformers==2.2.2

langdetect==1.0.9



# Vector & Data Storage

chromadb==0.4.18

psycopg2-binary==2.9.9

redis==5.0.1

sqlalchemy==2.0.23



# Utils

python-dotenv==1.0.0

requests==2.31.0

numpy==1.24.3

pandas==2.0.3

```



---



## API ENDPOINT SPECIFICATION



### **1. Competition Submission Endpoint**



**Base URL:** `https://your-api-domain.com/api/v1`



**Endpoint:** `POST /honeypot/engage`



**Request Format:**

```json

{

  "message": "आप जीत गए हैं 10 लाख रुपये! अपना OTP शेयर करें।",

  "session_id": "optional-session-123",
  "language": "auto"
}
```



**Response Format (Scam Detected):**

```json

{

  "status": "success",

  "scam_detected": true,

  "confidence": 0.92,

  "language_detected": "hindi",

  "session_id": "session-123",

  "engagement": {

    "agent_response": "वाह! बहुत अच्छी खबर है। मुझे OTP कहाँ भेजना है?",

    "turn_count": 1,

    "max_turns_reached": false,

    "strategy": "show_interest"

  },

  "extracted_intelligence": {

    "upi_ids": [],

    "bank_accounts": [],

    "ifsc_codes": [],

    "phone_numbers": [],

    "phishing_links": [],

    "extraction_confidence": 0.0

  },

  "conversation_history": [

    {

      "turn": 1,

      "sender": "scammer",

      "message": "आप जीत गए हैं 10 लाख रुपये! अपना OTP शेयर करें।",

      "timestamp": "2026-01-26T10:30:00Z"

    },

    {

      "turn": 1,

      "sender": "agent",

      "message": "वाह! बहुत अच्छी खबर है। मुझे OTP कहाँ भेजना है?",

      "timestamp": "2026-01-26T10:30:02Z"

    }

  ],

  "metadata": {

    "processing_time_ms": 850,

    "model_version": "v1.0.0",

    "detection_model": "indic-bert",

    "engagement_model": "groq-llama-3.1-70b"

  }

}

```

**Response Format (Not Scam):**
```json

{

  "status": "success",

  "scam_detected": false,

  "confidence": 0.15,

  "language_detected": "english",

  "session_id": "session-456",

  "message": "No scam detected. Message appears legitimate."

}

```

### **2. Mock Scammer API Integration**

**How Competition Testing Works:**
1. Competition provides Mock Scammer API endpoint
2. Your system detects scam → sends engagement response
3. Mock Scammer API replies → your system continues conversation
4. Loop continues until intelligence extracted or max turns
5. Your API returns final JSON with extracted data

**Integration Architecture:**
```python

# Your API receives initial message

POST /honeypot/engage

{

  "message": "Send money to UPI: scammer@paytm",

  "mock_scammer_callback": "https://competition-api.com/scammer/reply"

}



# Your system detects scam, engages

# For each turn, call Mock Scammer API:

POST https://competition-api.com/scammer/reply

{

  "session_id": "xyz",

  "agent_message": "Which UPI ID should I use?"

}



# Mock Scammer responds

{

  "scammer_reply": "Use scammer@paytm and send 5000 rupees"

}



# Your system extracts: upi_ids=["scammer@paytm"], continues or terminates

```

### **3. Additional Testing Endpoints**

```python

# Health Check

GET /health

Response: {"status": "healthy", "version": "1.0.0"}



# Batch Processing (if needed)

POST /honeypot/batch

{

  "messages": [

    {"id": "1", "message": "..."},

    {"id": "2", "message": "..."}

  ]

}



# Get Conversation History

GET /honeypot/session/{session_id}

Response: {conversation history JSON}

```

---

## MULTILINGUAL SUPPORT (ENGLISH + HINDI)

### **Language Detection Pipeline**

```python

# Step 1: Detect language

import langdetect

detected_lang = langdetect.detect(message)  # 'en' or 'hi'



# Step 2: Route to appropriate processing

if detected_lang == 'hi':

    # Use IndicBERT for detection

    # Use Hindi persona prompts for engagement

    # Apply Hindi regex patterns

elif detected_lang == 'en':

    # Use IndicBERT (supports English too)

    # Use English persona prompts

    # Apply English regex patterns

```

### **Hindi Support Specifications**

**Models:**
- **Detection:** `ai4bharat/indic-bert` (pre-trained on 12 Indic languages)
- **LLM:** Groq Llama 3.1 70B (strong Hindi capabilities)
- **Alternative:** Gemma-2-9B-it (good Hindi, can run locally)

**Hindi Regex Patterns:**
```python

# Hindi text patterns

HINDI_UPI_KEYWORDS = ['भेजें', 'ट्रांसफर', 'पैसे', 'यूपीआई', 'खाता']

HINDI_SCAM_KEYWORDS = ['जीत', 'ईनाम', 'लॉटरी', 'ओटीपी', 'पुलिस', 'गिरफ्तार']



# Numeric patterns work same (0-9 digits in Hindi text)

UPI_PATTERN = r'[a-zA-Z0-9._]+@[a-zA-Z]+'

ACCOUNT_PATTERN = r'\d{9,18}'

```

**Hindi Persona Examples:**
```

Persona 1 (Elderly):

"अरे वाह! बहुत अच्छा है। लेकिन मुझे समझ नहीं आ रहा, कैसे करूँ?"



Persona 2 (Confused):

"जी हाँ, मैं पैसे भेज दूंगा। पर कौन सा बटन दबाऊं?"



Persona 3 (Eager):

"हाँ हाँ, मुझे पैसे चाहिए। आप कहाँ भेजूं?"

```

### **English Support Specifications**

**Models:**
- Same as Hindi (IndicBERT is multilingual)
- Groq Llama 3.1 70B (native English)

**English Persona Examples:**
```

Persona 1 (Elderly):

"Oh wonderful! But I'm not very good with technology. Can you help me?"



Persona 2 (Confused):

"I want to claim my prize. Where do I send the money again?"



Persona 3 (Eager):

"Yes, I'm ready to transfer. What's your account number?"

```

### **Code-Mixed (Hinglish) Support**

Many Indian scams use code-mixed Hindi-English. IndicBERT handles this well:

```

Input: "Aapne jeeta 10 lakh rupees! Send OTP to claim prize"

Language: hinglish (auto-detected as 'hi' or 'en')

Processing: IndicBERT detects scam patterns in mixed text

Engagement: Respond in same mixing pattern

```

---

## AGENTIC ENGAGEMENT STRATEGY

### **LangGraph ReAct Agent Architecture**

```python

from langgraph.graph import StateGraph, END

from langchain_groq import ChatGroq



# State Definition

class HoneypotState(TypedDict):

    messages: List[dict]

    scam_confidence: float

    turn_count: int

    extracted_intel: dict

    strategy: str

    language: str



# Agent Nodes

def detect_scam(state):

    """Classify if message is scam"""

    # IndicBERT classification

    # Return updated state with confidence



def plan_response(state):

    """Decide engagement strategy"""

    if state['turn_count'] < 5:

        strategy = "show_interest"

    elif state['turn_count'] < 12:

        strategy = "express_confusion"

    else:

        strategy = "probe_details"

    return {"strategy": strategy}



def generate_response(state):

    """LLM generates believable reply"""

    # Groq Llama 3.1 with persona prompt

    # Returns agent message



def extract_intelligence(state):

    """Extract financial details"""

    # spaCy NER + Regex

    # Update extracted_intel



def should_continue(state):

    """Termination logic"""

    if state['turn_count'] >= 20:

        return "end"

    if len(state['extracted_intel']['upi_ids']) > 0:

        return "end"

    return "continue"



# Build Graph

workflow = StateGraph(HoneypotState)

workflow.add_node("detect", detect_scam)

workflow.add_node("plan", plan_response)

workflow.add_node("generate", generate_response)

workflow.add_node("extract", extract_intelligence)



workflow.add_edge("detect", "plan")

workflow.add_edge("plan", "generate")

workflow.add_edge("generate", "extract")

workflow.add_conditional_edges(

    "extract",

    should_continue,

    {

        "continue": "plan",

        "end": END

    }

)



workflow.set_entry_point("detect")

agent = workflow.compile()

```

### **Persona Management**

**Persona Types:**
1. **Elderly Person (60+ years)**
   - Slow to understand technology
   - Trusting and polite
   - Asks basic questions
   - Expresses confusion often

2. **Middle-Aged Eager Victim**
   - Excited about prizes/offers
   - Willing to comply
   - Asks for step-by-step instructions
   - Shows urgency

3. **Young Confused User**
   - Familiar with tech but cautious
   - Asks verification questions
   - Requests proof/links
   - Seeks reassurance

**Persona Selection Logic:**
```python

def select_persona(language, scam_type):

    if "prize" in scam_type or "lottery" in scam_type:

        return "eager_victim"

    elif "police" in scam_type or "arrest" in scam_type:

        return "elderly_fearful"

    else:

        return "confused_user"

```

### **Stalling Tactics**

**Goal:** Keep scammer engaged to extract more information

**Tactics:**
1. **Repeated Clarification:** "I didn't understand, can you repeat?"
2. **Technical Confusion:** "Which button do I press? My phone is old."
3. **Fake Delays:** "Let me find my card. Hold on..."
4. **Partial Compliance:** "I sent something, did you receive?"
5. **Request Verification:** "Can you send me official proof?"

**Turn-by-Turn Strategy:**
```python

ENGAGEMENT_STRATEGY = {

    "turns_1_5": {

        "goal": "Build trust, show interest",

        "tactics": ["express_excitement", "ask_basic_questions"],

        "example": "Really? I won? How do I claim it?"

    },

    "turns_6_12": {

        "goal": "Extract contact/payment info",

        "tactics": ["request_details", "express_confusion"],

        "example": "Should I send money to your account? What's the number?"

    },

    "turns_13_20": {

        "goal": "Force reveal of bank/UPI/links",

        "tactics": ["fake_compliance", "probe_urgently"],

        "example": "I'm ready to transfer. Send me your UPI ID again?"

    }

}

```

---

## INTELLIGENCE EXTRACTION

### **Extraction Targets (Competition Requirements)**

1. **UPI IDs:** `user@paytm`, `9876543210@ybl`, etc.
2. **Bank Account Numbers:** 9-18 digit sequences
3. **IFSC Codes:** 11-character bank codes
4. **Phone Numbers:** +91 or 10-digit Indian numbers
5. **Phishing Links:** URLs to fake websites

### **Extraction Pipeline**

```python

import spacy

import re



nlp = spacy.load("en_core_web_sm")



def extract_intelligence(text):

    intel = {

        "upi_ids": [],

        "bank_accounts": [],

        "ifsc_codes": [],

        "phone_numbers": [],

        "phishing_links": []

    }

    

    # UPI IDs

    upi_pattern = r'\b[a-zA-Z0-9._-]+@[a-zA-Z]+\b'

    intel['upi_ids'] = re.findall(upi_pattern, text)

    

    # Bank Accounts (9-18 digits, with validation)

    account_pattern = r'\b\d{9,18}\b'

    accounts = re.findall(account_pattern, text)

    intel['bank_accounts'] = [acc for acc in accounts if validate_account(acc)]

    

    # IFSC Codes

    ifsc_pattern = r'\b[A-Z]{4}0[A-Z0-9]{6}\b'

    intel['ifsc_codes'] = re.findall(ifsc_pattern, text)

    

    # Phone Numbers

    phone_pattern = r'(?:\+91[\s-]?)?[6-9]\d{9}\b'

    intel['phone_numbers'] = re.findall(phone_pattern, text)

    

    # Phishing Links

    url_pattern = r'https?://[^\s<>"{}|\\^`\[\]]+'

    intel['phishing_links'] = re.findall(url_pattern, text)

    

    # SpaCy NER for additional entities

    doc = nlp(text)

    for ent in doc.ents:

        if ent.label_ == "CARDINAL" and len(ent.text) >= 9:

            # Possible account number

            if ent.text not in intel['bank_accounts']:

                intel['bank_accounts'].append(ent.text)

    

    # Calculate confidence

    confidence = calculate_extraction_confidence(intel)

    

    return intel, confidence



def validate_account(account_number):

    """Basic validation for bank account"""

    if len(account_number) < 9 or len(account_number) > 18:

        return False

    # Add checksum validation if needed

    return True



def calculate_extraction_confidence(intel):

    """Score based on number and quality of extractions"""

    score = 0

    weights = {

        'upi_ids': 0.3,

        'bank_accounts': 0.3,

        'ifsc_codes': 0.2,

        'phone_numbers': 0.1,

        'phishing_links': 0.1

    }

    

    for key, weight in weights.items():

        if len(intel[key]) > 0:

            score += weight

    

    return min(score, 1.0)

```

### **Hindi Text Extraction Challenges**

**Challenge:** Numbers in Hindi text (Devanagari script: ०१२३...)
**Solution:** Devanagari digits to ASCII conversion

```python

def convert_devanagari_to_ascii(text):

    """Convert Devanagari digits to ASCII"""

    devanagari_to_ascii = {

        '०': '0', '१': '1', '२': '2', '३': '3', '४': '4',

        '५': '5', '६': '6', '७': '7', '८': '8', '९': '9'

    }

    for dev, asc in devanagari_to_ascii.items():

        text = text.replace(dev, asc)

    return text



# Apply before extraction

text = convert_devanagari_to_ascii(hindi_text)

intel = extract_intelligence(text)

```

---

## DATABASE & STATE MANAGEMENT

### **PostgreSQL Schema**

```sql

-- Conversations Table

CREATE TABLE conversations (

    id SERIAL PRIMARY KEY,

    session_id VARCHAR(255) UNIQUE NOT NULL,

    language VARCHAR(10) NOT NULL,

    scam_detected BOOLEAN DEFAULT FALSE,

    confidence FLOAT,

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP

);



-- Messages Table

CREATE TABLE messages (

    id SERIAL PRIMARY KEY,

    conversation_id INTEGER REFERENCES conversations(id),

    turn_number INTEGER NOT NULL,

    sender VARCHAR(50) NOT NULL, -- 'scammer' or 'agent'

    message TEXT NOT NULL,

    timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP

);



-- Extracted Intelligence Table

CREATE TABLE extracted_intelligence (

    id SERIAL PRIMARY KEY,

    conversation_id INTEGER REFERENCES conversations(id),

    upi_ids TEXT[], -- PostgreSQL array

    bank_accounts TEXT[],

    ifsc_codes TEXT[],

    phone_numbers TEXT[],

    phishing_links TEXT[],

    extraction_confidence FLOAT,

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP

);



-- Scammer Profiles (for analytics)

CREATE TABLE scammer_profiles (

    id SERIAL PRIMARY KEY,

    phone_hash VARCHAR(64), -- Hashed for privacy

    scam_tactics TEXT[],

    languages_used TEXT[],

    total_conversations INTEGER DEFAULT 1,

    first_seen TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    last_seen TIMESTAMP DEFAULT CURRENT_TIMESTAMP

);

```

### **ChromaDB for Vector Storage**

```python

import chromadb

from sentence_transformers import SentenceTransformer



# Initialize

client = chromadb.Client()

collection = client.create_collection("scam_conversations")

embedder = SentenceTransformer('all-MiniLM-L6-v2')



# Store conversation embeddings

def store_conversation(session_id, messages):

    # Combine messages into single text

    full_text = " ".join([msg['message'] for msg in messages])

    

    # Generate embedding

    embedding = embedder.encode(full_text)

    

    # Store in ChromaDB

    collection.add(

        embeddings=[embedding.tolist()],

        documents=[full_text],

        ids=[session_id],

        metadatas=[{

            "session_id": session_id,

            "turn_count": len(messages),

            "language": messages[0].get('language', 'unknown')

        }]

    )



# Query similar conversations (for learning)

def find_similar_scams(query_text, n_results=5):

    query_embedding = embedder.encode(query_text)

    results = collection.query(

        query_embeddings=[query_embedding.tolist()],

        n_results=n_results

    )

    return results

```

### **Redis for Session State**

```python

import redis

import json



# Initialize Redis

redis_client = redis.Redis(

    host='redis-free-tier.cloud.redislabs.com',

    port=12345,

    password='your-password',

    decode_responses=True

)



# Store session state

def save_session_state(session_id, state):

    redis_client.setex(

        f"session:{session_id}",

        3600,  # 1 hour expiry

        json.dumps(state)

    )



# Retrieve session state

def get_session_state(session_id):

    data = redis_client.get(f"session:{session_id}")

    return json.loads(data) if data else None



# Update turn count

def increment_turn(session_id):

    redis_client.incr(f"session:{session_id}:turns")

```

---

## DEPLOYMENT ARCHITECTURE

### **Free Tier Hosting Options**

| Service | Free Tier | Best For | Limits |
|---------|-----------|----------|--------|
| **Render** | 750 hours/month | FastAPI deployment | Sleep after 15min inactivity |
| **Railway** | $5 credit/month | PostgreSQL + API | 500 hours |
| **Fly.io** | 3 shared VMs | Low-latency API | 160GB transfer/month |
| **Supabase** | 500MB PostgreSQL | Database | 2GB transfer/month |
| **Redis Cloud** | 30MB Redis | Cache/sessions | 30 connections |

### **Recommended Stack for Competition**

```

API Server: Render (or Railway)

├─ FastAPI application

├─ Uvicorn ASGI server

├─ 512MB RAM, 0.1 CPU

└─ Environment: Python 3.11



Database: Supabase PostgreSQL

├─ 500MB storage

├─ 2GB transfer/month

└─ Connection pooling



Cache: Redis Cloud

├─ 30MB storage

├─ Session state management

└─ 30 concurrent connections



Vector DB: ChromaDB (Local)

├─ Embedded in API server

├─ Persistent storage in Docker volume

└─ No external service needed



Model Hosting: Hugging Face (Local)

├─ IndicBERT loaded at startup

├─ spaCy models bundled

└─ 2GB disk space for models



LLM API: Groq Cloud

├─ 30 requests/minute

├─ 6000 requests/day

└─ Zero cost

```

### **Docker Configuration**

```dockerfile

# Dockerfile

FROM python:3.11-slim



WORKDIR /app



# Install system dependencies

RUN apt-get update && apt-get install -y \

    build-essential \

    && rm -rf /var/lib/apt/lists/*



# Copy requirements

COPY requirements.txt .

RUN pip install --no-cache-dir -r requirements.txt



# Download models at build time

RUN python -c "from transformers import AutoModel, AutoTokenizer; \

    AutoModel.from_pretrained('ai4bharat/indic-bert'); \

    AutoTokenizer.from_pretrained('ai4bharat/indic-bert')"

RUN python -m spacy download en_core_web_sm



# Copy application code

COPY . .



# Expose port

EXPOSE 8000



# Run application

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

```

```yaml

# docker-compose.yml (for local testing)

version: '3.8'



services:

  api:

    build: .

    ports:

      - "8000:8000"

    environment:

      - GROQ_API_KEY=${GROQ_API_KEY}

      - POSTGRES_URL=${POSTGRES_URL}

      - REDIS_URL=${REDIS_URL}

    volumes:

      - chromadb_data:/app/chromadb

    depends_on:

      - postgres

      - redis



  postgres:

    image: postgres:15

    environment:

      - POSTGRES_DB=scamshield

      - POSTGRES_USER=admin

      - POSTGRES_PASSWORD=securepass

    volumes:

      - postgres_data:/var/lib/postgresql/data



  redis:

    image: redis:7-alpine

    volumes:

      - redis_data:/data



volumes:

  chromadb_data:

  postgres_data:

  redis_data:

```

---

## GROQ API INTEGRATION

### **Setup & Configuration**

```python

# .env file

GROQ_API_KEY=gsk_your_free_api_key_here

GROQ_MODEL=llama-3.1-70b-versatile



# config.py

import os

from dotenv import load_dotenv



load_dotenv()



GROQ_API_KEY = os.getenv("GROQ_API_KEY")

GROQ_MODEL = os.getenv("GROQ_MODEL", "llama-3.1-70b-versatile")

GROQ_TEMPERATURE = 0.7

GROQ_MAX_TOKENS = 500

GROQ_TIMEOUT = 30

```

### **LangChain Integration**

```python

from langchain_groq import ChatGroq

from langchain.prompts import ChatPromptTemplate



# Initialize Groq LLM

llm = ChatGroq(

    model=GROQ_MODEL,

    api_key=GROQ_API_KEY,

    temperature=GROQ_TEMPERATURE,

    max_tokens=GROQ_MAX_TOKENS

)



# Persona Prompts

ELDERLY_HINDI_PROMPT = """

आप एक 65 वर्षीय व्यक्ति हैं जो टेक्नोलॉजी में बहुत अच्छे नहीं हैं।

आप विनम्र और भरोसेमंद हैं। आप अक्सर सवाल पूछते हैं और भ्रमित होते हैं।



आपका लक्ष्य: घोटालेबाज से बैंक डिटेल्स, UPI ID, या लिंक निकालना, लेकिन 

असली व्यक्ति की तरह व्यवहार करना।



बातचीत का इतिहास:

{conversation_history}



घोटालेबाज का संदेश: {scammer_message}



आपका जवाब (केवल एक संदेश, 1-2 वाक्य):

"""



ELDERLY_ENGLISH_PROMPT = """

You are a 65-year-old person who is not tech-savvy.

You are polite, trusting, and often confused about technology.



Your goal: Extract bank details, UPI IDs, or phishing links from the scammer

while acting like a real elderly person.



Conversation history:

{conversation_history}



Scammer's message: {scammer_message}



Your response (only one message, 1-2 sentences):

"""



# Generate response

def generate_agent_response(language, persona, conversation_history, scammer_message):

    if language == "hindi":

        prompt_template = ELDERLY_HINDI_PROMPT

    else:

        prompt_template = ELDERLY_ENGLISH_PROMPT

    

    prompt = ChatPromptTemplate.from_template(prompt_template)

    chain = prompt | llm

    

    response = chain.invoke({

        "conversation_history": format_conversation(conversation_history),

        "scammer_message": scammer_message

    })

    

    return response.content



def format_conversation(history):

    """Format conversation for prompt"""

    formatted = []

    for msg in history[-5:]:  # Last 5 turns for context

        sender = "Scammer" if msg['sender'] == 'scammer' else "You"

        formatted.append(f"{sender}: {msg['message']}")

    return "\n".join(formatted)

```

### **Rate Limiting & Retry Logic**

```python

import time

from functools import wraps



def rate_limited_groq_call(max_retries=3, backoff=2):

    """Decorator for handling Groq rate limits"""

    def decorator(func):

        @wraps(func)

        def wrapper(*args, **kwargs):

            for attempt in range(max_retries):

                try:

                    return func(*args, **kwargs)

                except Exception as e:

                    if "rate_limit" in str(e).lower():

                        wait_time = backoff ** attempt

                        print(f"Rate limited. Waiting {wait_time}s...")

                        time.sleep(wait_time)

                    else:

                        raise e

            raise Exception("Max retries exceeded")

        return wrapper

    return decorator



@rate_limited_groq_call(max_retries=3)

def call_groq_api(messages):

    """Safe Groq API call with retry"""

    return llm.invoke(messages)

```

---

## EVALUATION METRICS

### **Competition Judging Criteria (Predicted)**

Based on Challenge 2 requirements, judges will likely evaluate:

1. **Scam Detection Accuracy (25%)**
   - True Positive Rate (detecting actual scams)
   - False Positive Rate (avoiding false alarms)
   - Target: >90% accuracy

2. **Engagement Quality (25%)**
   - Naturalness of conversation (believable persona)
   - Number of turns sustained
   - Avoidance of detection by scammer
   - Target: >10 turns average

3. **Intelligence Extraction (30%)**
   - UPI IDs extracted correctly
   - Bank accounts extracted correctly
   - Phishing links identified
   - Extraction accuracy (no false positives)
   - Target: >85% precision, >80% recall

4. **Response Time (10%)**
   - API latency
   - Time to extract intelligence
   - Target: <2s per response

5. **System Robustness (10%)**
   - Handle edge cases (empty messages, very long texts)
   - Error handling
   - API availability

### **Testing Framework**

```python

# test_metrics.py

import pytest

from app.honeypot import HoneypotAgent



def test_scam_detection_accuracy():

    """Test detection accuracy on known scam messages"""

    test_cases = [

        {

            "message": "You won 10 lakh! Send OTP to claim.",

            "expected_scam": True,

            "expected_confidence": 0.9

        },

        {

            "message": "Hi, how are you doing today?",

            "expected_scam": False,

            "expected_confidence": 0.1

        },

        # Add 100+ test cases

    ]

    

    agent = HoneypotAgent()

    correct = 0

    

    for case in test_cases:

        result = agent.detect_scam(case["message"])

        if result["scam_detected"] == case["expected_scam"]:

            correct += 1

    

    accuracy = correct / len(test_cases)

    assert accuracy >= 0.90, f"Accuracy {accuracy} below 90% threshold"



def test_intelligence_extraction():

    """Test extraction accuracy"""

    test_text = """

    Please send 5000 rupees to my UPI: scammer@paytm

    My bank account is 1234567890123 with IFSC SBIN0001234

    Call me at +919876543210 or visit http://fake-bank.com

    """

    

    agent = HoneypotAgent()

    result = agent.extract_intelligence(test_text)

    

    assert "scammer@paytm" in result["upi_ids"]

    assert "1234567890123" in result["bank_accounts"]

    assert "SBIN0001234" in result["ifsc_codes"]

    assert "+919876543210" in result["phone_numbers"]

    assert "http://fake-bank.com" in result["phishing_links"]



def test_response_latency():

    """Test API response time"""

    import time

    agent = HoneypotAgent()

    

    start = time.time()

    result = agent.engage("Send money to win prize!")

    latency = time.time() - start

    

    assert latency < 2.0, f"Latency {latency}s exceeds 2s threshold"

```

### **Monitoring Dashboard**

```python

# metrics.py

from prometheus_client import Counter, Histogram, Gauge



# Define metrics

scam_detection_total = Counter(

    'scam_detection_total',

    'Total number of scam detections',

    ['language', 'result']

)



intelligence_extracted = Counter(

    'intelligence_extracted_total',

    'Total pieces of intelligence extracted',

    ['type']  # upi, bank_account, etc.

)



response_time = Histogram(

    'response_time_seconds',

    'Response time in seconds',

    buckets=[0.1, 0.5, 1.0, 2.0, 5.0]

)



active_sessions = Gauge(

    'active_honeypot_sessions',

    'Number of active honeypot sessions'

)



# Use in code

scam_detection_total.labels(language='hindi', result='scam').inc()

intelligence_extracted.labels(type='upi_id').inc()

response_time.observe(1.2)

```

---

## PHASE 2: AUDIO INTEGRATION (POST-TEXT IMPLEMENTATION)

**Note:** Only proceed with Phase 2 after Phase 1 is fully tested and working.

### **Phase 2 Additions**

```python

# Additional requirements for Phase 2

openai-whisper==20231117   # ASR for audio transcription

torchaudio==2.1.0          # Audio processing

librosa==0.10.1            # Audio features

soundfile==0.12.1          # Audio I/O

pydub==0.25.1              # Audio format conversion

```

### **Audio Endpoint**

```python

# POST /honeypot/engage-audio

# Accept: multipart/form-data or JSON with base64 audio



@app.post("/honeypot/engage-audio")

async def engage_audio(audio_file: UploadFile):

    # Step 1: Save audio temporarily

    temp_path = f"/tmp/{audio_file.filename}"

    with open(temp_path, "wb") as f:

        f.write(await audio_file.read())

    

    # Step 2: Transcribe with Whisper

    import whisper

    model = whisper.load_model("base")

    result = model.transcribe(temp_path, language="hi")  # or "en"

    transcribed_text = result["text"]

    detected_language = result["language"]

    

    # Step 3: Process as text (reuse Phase 1 pipeline)

    response = await engage_text({

        "message": transcribed_text,

        "language": detected_language,

        "source": "audio"

    })

    

    # Step 4: Add audio-specific metadata

    response["audio_metadata"] = {

        "original_filename": audio_file.filename,

        "transcription": transcribed_text,

        "detected_language": detected_language,

        "transcription_confidence": result.get("confidence", 1.0)

    }

    

    return response

```

### **Audio-Specific Scam Detection**

```python

# Voice deepfake detection (Phase 2 only)

from resemblyzer import preprocess_wav, VoiceEncoder



encoder = VoiceEncoder()



def detect_synthetic_voice(audio_path):

    """Detect if voice is AI-generated"""

    wav = preprocess_wav(audio_path)

    embed = encoder.embed_utterance(wav)

    

    # Compare against known synthetic voice embeddings

    # Return confidence score

    

    return {

        "is_synthetic": False,  # Placeholder

        "confidence": 0.95

    }

```

---

## RESPONSIBLE AI & COMPLIANCE

### **Privacy & Data Protection**

**DPDP Act 2023 Compliance:**
1. **Consent:** User must opt-in to honeypot engagement
2. **Data Minimization:** Store only essential data
3. **Anonymization:** Hash PII (phone numbers, etc.)
4. **Retention:** Delete logs after 30 days
5. **Right to Erasure:** Provide deletion API

```python

import hashlib



def anonymize_phone(phone_number):

    """Hash phone numbers for privacy"""

    return hashlib.sha256(phone_number.encode()).hexdigest()



def schedule_data_deletion(session_id, days=30):

    """Schedule automatic data deletion"""

    deletion_date = datetime.now() + timedelta(days=days)

    # Store in deletion_queue table

    db.execute("""

        INSERT INTO deletion_queue (session_id, deletion_date)

        VALUES (%s, %s)

    """, (session_id, deletion_date))

```

### **Safety Guidelines**

**Agent Behavior Rules:**
1. **No Escalation:** Never threaten or provoke scammer
2. **No Personal Info:** Never share real personal details
3. **No Financial Transactions:** Never actually transfer money
4. **Termination:** End conversation if scammer becomes violent
5. **Legal Compliance:** Ensure honeynet operations are legal

```python

def safety_check(agent_message):

    """Ensure agent response is safe"""

    unsafe_patterns = [

        r'I will kill',

        r'real address',

        r'my actual bank',

        # Add more unsafe patterns

    ]

    

    for pattern in unsafe_patterns:

        if re.search(pattern, agent_message, re.IGNORECASE):

            return False, "Unsafe content detected"

    

    return True, "Safe"

```

### **Bias Mitigation**

**Addressing Potential Biases:**
1. **Language Bias:** Equal performance for Hindi and English
2. **Dialect Bias:** Test on multiple Indian accents
3. **Age Bias:** Personas represent diverse age groups
4. **Regional Bias:** Test on North and South Indian scam patterns

```python

def test_language_fairness():

    """Ensure equal accuracy across languages"""

    hindi_accuracy = evaluate_on_dataset("hindi_test_set")

    english_accuracy = evaluate_on_dataset("english_test_set")

    

    difference = abs(hindi_accuracy - english_accuracy)

    assert difference < 0.05, "Language bias detected"

```

---

## WINNING STRATEGY

### **What Makes This Solution Top 10 Material**

1. **Technical Excellence (40%)**
   - ✅ Production-grade LangGraph architecture
   - ✅ State-of-the-art models (IndicBERT, Llama 3.1)
   - ✅ Robust state management (PostgreSQL + Redis + ChromaDB)
   - ✅ Free tier deployment (cost-effective)

2. **Innovation (30%)**
   - ✅ Multi-turn agentic engagement (beyond simple detection)
   - ✅ Dynamic persona adaptation
   - ✅ Intelligent stalling tactics
   - ✅ Real-time intelligence extraction

3. **India-Specific (20%)**
   - ✅ Hindi + English bilingual support
   - ✅ UPI/IFSC/Indian bank patterns
   - ✅ Local scam tactics knowledge
   - ✅ Cultural context in personas

4. **Execution Quality (10%)**
   - ✅ Clean API design
   - ✅ Comprehensive documentation
   - ✅ Testing framework
   - ✅ Production deployment

### **Competition Day Checklist**

**Before Submission (Feb 4):**
- [ ] API deployed and publicly accessible
- [ ] Health check endpoint working
- [ ] Test with 100+ sample scam messages
- [ ] Verify JSON response format matches requirements
- [ ] Check response latency (<2s average)
- [ ] Ensure Groq API key has remaining credits
- [ ] Database backups configured
- [ ] Monitoring dashboard active
- [ ] Documentation complete
- [ ] Demo video recorded (if required)

**Submission Day (Feb 5):**
- [ ] Submit API endpoint URL
- [ ] Verify endpoint is reachable from external networks
- [ ] Monitor logs for incoming test requests
- [ ] Have fallback plan (backup deployment)
- [ ] Team available for support
- [ ] Response to any judge queries within 2 hours

### **Differentiation from Competitors**

Most participants will likely build:
- Simple rule-based detection (keyword matching only)
- Single-turn response (no multi-turn engagement)
- Basic regex extraction (no NER)
- No state management (stateless API)

**Your Edge:**
-**Agentic AI** with LangGraph (complex, adaptive)
-**Multi-turn engagement** (up to 20 turns)
-**State persistence** (remembers conversation)
-**Dynamic personas** (believable, adaptive)
-**Hybrid extraction** (NER + regex + validation)
-**Production architecture** (scalable, monitored)

---

## PROJECT STRUCTURE

```

scamshield-ai/

├── README.md

├── requirements.txt

├── Dockerfile

├── docker-compose.yml

├── .env.example

├── .gitignore


├── app/

│   ├── __init__.py

│   ├── main.py                    # FastAPI app

│   ├── config.py                  # Configuration

│   │

│   ├── api/

│   │   ├── __init__.py

│   │   ├── endpoints.py           # API routes

│   │   └── schemas.py             # Pydantic models

│   │

│   ├── models/

│   │   ├── __init__.py

│   │   ├── detector.py            # IndicBERT scam detection

│   │   ├── extractor.py           # Intelligence extraction

│   │   └── language.py            # Language detection

│   │

│   ├── agent/

│   │   ├── __init__.py

│   │   ├── honeypot.py            # LangGraph agent

│   │   ├── personas.py            # Persona definitions

│   │   ├── prompts.py             # LLM prompts

│   │   └── strategies.py          # Engagement strategies

│   │

│   ├── database/

│   │   ├── __init__.py

│   │   ├── postgres.py            # PostgreSQL connection

│   │   ├── redis_client.py        # Redis connection

│   │   ├── chromadb_client.py     # ChromaDB connection

│   │   └── models.py              # SQLAlchemy models

│   │

│   └── utils/

│       ├── __init__.py

│       ├── preprocessing.py       # Text preprocessing

│       ├── validation.py          # Input validation

│       ├── metrics.py             # Prometheus metrics

│       └── logger.py              # Logging configuration


├── tests/

│   ├── __init__.py

│   ├── test_detection.py

│   ├── test_extraction.py

│   ├── test_engagement.py

│   └── test_api.py


├── scripts/

│   ├── setup_models.py            # Download ML models

│   ├── init_database.py           # Initialize DB schema

│   └── test_deployment.py         # Deployment smoke tests


├── data/

│   ├── test_scam_messages.json    # Test dataset

│   ├── personas.json              # Persona definitions

│   └── scam_patterns.json         # Known scam patterns


└── docs/

    ├── API_DOCUMENTATION.md

    ├── DEPLOYMENT_GUIDE.md

    └── TESTING_GUIDE.md

```

---

## IMPLEMENTATION TIMELINE

### **Week 1 (Jan 26 - Feb 1): Core Development**

**Day 1-2: Project Setup**
- Initialize repository
- Setup virtual environment
- Install dependencies
- Configure PostgreSQL, Redis, ChromaDB
- Obtain Groq API key

**Day 3-4: Detection Module**
- Implement IndicBERT integration
- Build language detection
- Create keyword matching
- Test on sample messages

**Day 5-6: Agentic Module**
- Build LangGraph workflow
- Integrate Groq Llama 3.1
- Implement persona system
- Test engagement loop

**Day 7: Extraction Module**
- Implement spaCy NER
- Build regex patterns
- Create validation logic
- Test on sample extractions

### **Week 2 (Feb 2 - Feb 5): Testing & Deployment**

**Day 8: Integration**
- Connect all modules
- Build FastAPI endpoints
- Implement database operations
- End-to-end testing

**Day 9: Testing**
- Unit tests (80%+ coverage)
- Integration tests
- Load testing (100 req/min)
- Fix bugs

**Day 10: Deployment**
- Deploy to Render/Railway
- Configure environment variables
- Setup monitoring
- Test production endpoint

**Day 11: Buffer & Submission**
- Final testing
- Documentation review
- Submit API endpoint
- Monitor for test requests

---

## SUCCESS METRICS

### **Minimum Viable Product (MVP) Criteria**

**Must Have:**
- [x] Scam detection with >85% accuracy
- [x] Multi-turn engagement (at least 10 turns)
- [x] Extract at least 1 UPI ID or bank account
- [x] API response time <3s
- [x] Handle Hindi and English
- [x] Structured JSON output

**Should Have:**
- [x] >90% detection accuracy
- [x] 15+ turn average engagement
- [x] Extract 2+ intelligence types per conversation
- [x] API response time <2s
- [x] State persistence across sessions
- [x] Monitoring and metrics

**Nice to Have:**
- [ ] >95% detection accuracy
- [ ] 20 turn max engagement
- [ ] Extract all 5 intelligence types
- [ ] API response time <1s
- [ ] Advanced persona adaptation
- [ ] Voice deepfake detection (Phase 2)

---

## RISK MITIGATION

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| **Groq API rate limits** | High | High | Implement retry logic, use backoff, cache responses |
| **Model loading time** | Medium | Medium | Load models at startup, not per request |
| **Database connection loss** | Low | High | Connection pooling, auto-reconnect, fallback to local storage |
| **Competition API changes** | Medium | High | Flexible schema design, comprehensive testing |
| **Deployment downtime** | Low | Critical | Multiple hosting options, health checks, auto-restart |
| **Hindi detection accuracy** | Medium | Medium | Extensive testing, fallback to English, hybrid approach |

---

## CONCLUSION

This specification provides a **complete, production-ready blueprint** for building a winning Agentic Honeypot System for the India AI Impact Buildathon 2026.

**Key Strengths:**
1.**100% Free Tier** - No cost barriers
2.**Phased Approach** - Text first, audio later
3.**Production Architecture** - Scalable, monitored, robust
4.**India-Specific** - Hindi support, local scam patterns
5.**Competition-Ready** - API-first, JSON outputs
6.**Technically Superior** - LangGraph, state management, multi-turn
7.**Well-Documented** - Clear implementation path

**Expected Ranking:** TOP 10 from 40,000 participants

**Next Steps:**
1. Begin implementation following project structure
2. Test continuously against test dataset
3. Deploy early to identify issues
4. Iterate based on testing feedback
5. Submit before deadline with confidence

**Team Focus:**
- Speed of execution (10 days remaining)
- Quality over features (MVP first)
- Testing, testing, testing
- Documentation for judges
- Monitoring for competition day

---

## APPENDIX

### **A. Sample API Requests/Responses**

See API ENDPOINT SPECIFICATION section above.

### **B. Groq API Key Setup**

1. Visit: https://console.groq.com/
2. Sign up with email
3. Navigate to API Keys section
4. Generate new API key
5. Free tier: 30 requests/minute, 6000/day

### **C. Supabase PostgreSQL Setup**

1. Visit: https://supabase.com/
2. Create account
3. New project → Select free tier
4. Copy connection string
5. Use in DATABASE_URL environment variable



### **D. Redis Cloud Setup**



1. Visit: https://redis.com/try-free/

2. Create account

3. New database → Free 30MB

4. Copy connection details

5. Use in REDIS_URL environment variable

### **E. Testing Commands**

```bash

# Run all tests

pytest tests/ -v



# Run specific test

pytest tests/test_detection.py::test_hindi_scam_detection



# Test API locally

curl -X POST http://localhost:8000/honeypot/engage \

  -H "Content-Type: application/json" \

  -d '{"message": "You won 10 lakh rupees!"}'



# Load test

locust -f tests/load_test.py --host http://localhost:8000

```

### **F. Deployment Commands**

```bash

# Build Docker image

docker build -t scamshield-ai .



# Run locally

docker-compose up



# Deploy to Render

git push render main



# Check logs

render logs --tail 100

```

---

**Document Version:** 1.0  
**Last Updated:** January 26, 2026  
**Status:** Ready for Implementation  
**Approved By:** Shivam Bhuva (Team Lead)

**For Questions/Support:**
- GitHub Issues: [Your Repo]
- Email: shivambhuva8866@gmail.com
- Team Channel: [Your Communication Channel]

---

**END OF DOCUMENT**