---
title: NyayaSetu
emoji: ⚖️
colorFrom: indigo
colorTo: blue
sdk: docker
pinned: false
---

# NyayaSetu — Indian Legal RAG Agent

> Retrieval-Augmented Generation over 26,688 Supreme Court of India judgments (1950–2024).  
> Ask a legal question. Get a cited answer grounded in real case law.  
> 1,025,764 chunks indexed (SC judgments, HC judgments, bare acts, constitution, legal references).  
> V2 agent with a 3-pass reasoning loop and conversation memory.

[![Live Demo](https://img.shields.io/badge/🤗%20HuggingFace-Live%20Demo-blue)](https://huggingface.co/spaces/CaffeinatedCoding/nyayasetu)
[![GitHub Actions](https://github.com/devangmishra1424/nyayasetu/actions/workflows/ci.yml/badge.svg)](https://github.com/devangmishra1424/nyayasetu/actions)
![Python](https://img.shields.io/badge/python-3.11-blue)
![Version](https://img.shields.io/badge/version-1.0-green)

---

> **NOT legal advice.** This is a portfolio project. Always consult a qualified advocate.

---

## What It Does

A user types a legal question. The system:

1. Runs **Named Entity Recognition** (fine-tuned DistilBERT) to extract legal entities — judges, statutes, provisions, case numbers
2. Augments the query with extracted entities and embeds it using **MiniLM** (384-dim)
3. Searches a **FAISS index** of 443,598 judgment chunks for the most relevant excerpts
4. Assembles **1024-token context windows** from the parent judgments around each matched chunk
5. Makes a **single LLM call** (Groq — Llama-3.3-70b) with a strict "answer only from provided excerpts" prompt
6. Runs **deterministic citation verification** — checks whether quoted phrases in the answer appear verbatim in the retrieved context
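
Stitched together, the six steps above amount to one orchestration function. A toy sketch with stub components so it runs end to end (all names are hypothetical stand-ins; the real logic lives in `src/agent.py`):

```python
# Toy stand-ins so the sketch runs; each replaces a real component.
def extract_entities(query):            # 1. DistilBERT NER
    return ["STATUTE"] if "article" in query.lower() else []

def embed(query, entities):             # 2. MiniLM, 384-dim
    return [0.0] * 384

def search_index(vector, k=5):          # 3. FAISS top-k chunks
    return [{"id": "SC_2017_2363", "text": "right to privacy excerpt"}] * k

def assemble_context(chunks, window=1024):  # 4. parent-document windows
    return "\n".join(c["text"] for c in chunks)

def call_llm(query, context):           # 5. single grounded Groq call
    return f'"{context.splitlines()[0]}" (SC_2017_2363)'

def verify_citations(answer, context):  # 6. deterministic substring check
    return "Verified" if context else "Unverified"

def answer_query(query: str) -> dict:
    """End-to-end pipeline: NER -> embed -> retrieve -> generate -> verify."""
    entities = extract_entities(query)
    vector = embed(query, entities)
    chunks = search_index(vector)
    context = assemble_context(chunks)
    answer = call_llm(query, context)
    return {"answer": answer,
            "verification_status": verify_citations(answer, context),
            "entities": entities,
            "num_sources": len(chunks)}
```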

---

## Architecture

```
User Query
     │
     ▼
┌─────────────────────────────────────────┐
│  NER Layer (DistilBERT fine-tuned)      │
│  Extracts: JUDGE, COURT, STATUTE,       │
│  PROVISION, CASE_NUMBER, DATE           │
└──────────────────┬──────────────────────┘
                   │ augmented query
                   ▼
┌─────────────────────────────────────────┐
│  Embedding Layer (MiniLM-L6-v2)         │
│  384-dim sentence embedding             │
└──────────────────┬──────────────────────┘
                   │ query vector
                   ▼
┌─────────────────────────────────────────┐
│  FAISS Retrieval (IndexFlatL2)          │
│  443,598 chunks — 26,688 SC judgments   │
│  Memory-mapped — index never fully      │
│  loaded into RAM                        │
└──────────────────┬──────────────────────┘
                   │ top-5 chunks + parent context
                   ▼
┌─────────────────────────────────────────┐
│  LLM Generation (Groq — Llama-3.3-70b)  │
│  Single call, strict grounding prompt   │
│  Gemini as fallback                     │
└──────────────────┬──────────────────────┘
                   │ answer
                   ▼
┌─────────────────────────────────────────┐
│  Citation Verification (deterministic)  │
│  Verified ✓ / ⚠ Unverified              │
└──────────────────┬──────────────────────┘
                   │
                   ▼
            JSON Response
```

**Deployment:** Docker container on HuggingFace Spaces (port 7860). Models downloaded from HF Hub at startup — not bundled in the image.

---

## Technical Decisions

**Why no LangChain?**
I built the chunking pipeline, FAISS retrieval, agent loop, and citation verification from scratch in plain Python. This means I can debug each component independently and explain exactly what each one does. I know what LangChain abstracts because I built what it abstracts. I am fully prepared to use LangChain or LangGraph in a team setting.

**Why DistilBERT for NER?**
DistilBERT is 40% smaller and 60% faster than BERT with 97% of its performance. For a token classification task like NER, this tradeoff is correct — the speed matters at inference time and the accuracy loss is negligible for legal entity types.
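
For context, a token-classification NER model emits BIO tags that must be merged back into entity spans. A self-contained sketch of that aggregation step (illustrative only, not the project's actual post-processing code):

```python
def merge_bio(tokens, tags):
    """Merge token-level BIO tags (B-X begins an entity, I-X continues it,
    O is outside) into (entity_text, label) spans."""
    spans, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                      # close the previous entity
                spans.append((" ".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(tok)              # continue the open entity
        else:
            if current:                      # O tag or label mismatch ends it
                spans.append((" ".join(current), label))
            current, label = [], None
    if current:
        spans.append((" ".join(current), label))
    return spans

# merge_bio(["Justice", "Puttaswamy", "ruled", "under", "Article", "21"],
#           ["B-JUDGE", "I-JUDGE", "O", "O", "B-PROVISION", "I-PROVISION"])
# -> [("Justice Puttaswamy", "JUDGE"), ("Article 21", "PROVISION")]
```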

**Why FAISS IndexFlatL2?**
Exact nearest neighbour search over 443,598 vectors. Approximate methods (HNSW, IVF) trade accuracy for speed — unnecessary at this corpus size. Memory mapping keeps the 650MB index off RAM until a query needs it.
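
What `IndexFlatL2` computes is nothing more than exhaustive L2 nearest-neighbour search. A numpy-only sketch of the same computation on toy data (FAISS does this far faster and, notably, returns *squared* L2 distances):

```python
import numpy as np

def flat_l2_search(index_vectors: np.ndarray, query: np.ndarray, k: int = 5):
    """Exhaustive search: squared L2 distance to every indexed vector,
    then the k smallest — result-equivalent to faiss.IndexFlatL2.search."""
    dists = np.sum((index_vectors - query) ** 2, axis=1)  # squared L2, like FAISS
    top = np.argsort(dists)[:k]
    return dists[top], top

rng = np.random.default_rng(0)
xb = rng.normal(size=(1000, 384)).astype("float32")  # toy index, 384-dim like MiniLM
xq = xb[42] + 0.001                                  # near-duplicate of vector 42
d, i = flat_l2_search(xb, xq)                        # i[0] should be 42
```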

**Why MiniLM for embeddings?**
`all-MiniLM-L6-v2` is designed specifically for semantic similarity tasks. 384 dimensions gives a good balance between retrieval quality and index size. Runs entirely on CPU — no GPU dependency at inference time.

**Why a single LLM call per query?**
Multi-step chains add latency, introduce more failure points, and make hallucination harder to trace. One call with a strict grounding prompt is simpler, faster, and easier to debug. The citation verifier is the safety layer, not a second LLM call.
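
The "strict grounding prompt" is plain string assembly. A hedged sketch of the pattern (the exact wording and field names in `src/llm.py` may differ):

```python
def build_grounded_prompt(query: str, excerpts: list[dict]) -> str:
    """Assemble one prompt that restricts the model to the retrieved
    excerpts and demands judgment-ID citations. Illustrative wording only."""
    blocks = "\n\n".join(f"[{e['judgment_id']}] {e['text']}" for e in excerpts)
    return (
        "Answer ONLY from the Supreme Court judgment excerpts below. "
        "Cite the judgment ID after every claim. If the excerpts do not "
        "contain the answer, say so.\n\n"
        f"EXCERPTS:\n{blocks}\n\n"
        f"QUESTION: {query}"
    )
```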

**Why deterministic citation verification?**
NLI-based verification requires loading a second model (~500MB) and adds ~300ms latency per query. For a portfolio project on a free tier, deterministic substring matching after normalisation gives 80% of the value at 0% of the cost. The limitation — correct paraphrases no longer match verbatim and get flagged Unverified — is documented.
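
The deterministic check reduces to normalising whitespace and case, then testing substring membership. A minimal sketch, assuming quotes in the answer are delimited with double quotes as in the sample outputs:

```python
import re

def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences
    don't break the match."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def verify_answer(answer: str, context: str) -> str:
    """Verified only if every quoted phrase in the answer appears
    verbatim (after normalisation) in the retrieved context."""
    quotes = re.findall(r'"([^"]+)"', answer)
    if not quotes:
        return "No verifiable claims"
    ctx = normalise(context)
    unverified = [q for q in quotes if normalise(q) not in ctx]
    return "Unverified" if unverified else "Verified"
```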

**Why parent document retrieval?**
Chunks are 256 tokens — good for retrieval precision. But 256 tokens is often mid-sentence with no surrounding context. The LLM needs more. The system retrieves a 1024-token window centred on each matched chunk from the full parent judgment, giving the LLM enough context to answer correctly.
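
A sketch of the window assembly, assuming each chunk stores its token offset into the parent judgment (parameter names hypothetical):

```python
def parent_window(parent_tokens: list[str], chunk_start: int,
                  chunk_len: int = 256, window: int = 1024) -> list[str]:
    """Return a `window`-token span from the parent judgment, centred on
    the matched chunk and clamped to the document bounds."""
    centre = chunk_start + chunk_len // 2
    start = max(0, centre - window // 2)
    end = min(len(parent_tokens), start + window)
    start = max(0, end - window)  # re-clamp so spans near the end stay full-width
    return parent_tokens[start:end]
```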

---

## Performance

| Metric | Value |
|---|---|
| NER F1 (overall) | 0.777 |
| Index size | 443,598 chunks from 26,688 judgments |
| FAISS index size on disk | ~650MB |
| Embedding dimensions | 384 |
| Typical query latency | 1,000–1,800ms |
| LLM | Groq Llama-3.3-70b-versatile |
| Deployment | HuggingFace Spaces, CPU only, free tier |

Latency breakdown: ~5ms FAISS search, ~50ms NER + embedding, ~900–1500ms Groq API call, ~10ms citation verification.

---

## Live Query Examples

**Health check:**
```
PS> Invoke-RestMethod -Uri "https://caffeinatedcoding-nyayasetu.hf.space/health"

status  service   version
------  -------   -------
ok      NyayaSetu 1.0.0
```

---

**Query: Fundamental rights under the Indian Constitution**
```
PS> Invoke-RestMethod -Uri "https://caffeinatedcoding-nyayasetu.hf.space/query" `
      -Method POST -ContentType "application/json" `
      -Body '{"query": "What are the fundamental rights guaranteed under the Indian Constitution?"}'

query               : What are the fundamental rights guaranteed under the Indian Constitution?
answer              : The fundamental rights guaranteed under the Indian Constitution are divided
                      into seven categories:
                      "right to equality - arts. 14 to 18;
                      right to freedom - arts. 19 to 22;
                      right against exploitation - arts. 23 and 24;
                      right to freedom of religion arts. 25 to 28;
                      cultural and educational rights arts. 29 and 30;
                      right to property - arts. 31, 31 a and 31b;
                      and right to constitutional remedies arts. 32 to 35" (SC_1958_9972).
                      These fundamental rights are "still reserved to the people after the
                      delegation of rights by the people to the institutions of government"
                      (SC_1958_9972).
                      The Constitution "confirms their existence and gives them protection"
                      (SC_2017_2363).

                      NOTE: This is not legal advice. Consult a qualified advocate.

sources             : SC_2017_2363 (Justice K S Puttaswamy Retd And Anr vs Union Of India, 2017)
                      SC_1958_9972 (Basheshar Nath vs The Commissioner Of Income Tax Delhi, 1958)
                      SC_1992_25797 (Life Insurance Corpn Of India vs Prof Manubhai D Shah, 1992)
                      SC_1962_10537 (Prem Chand Garg vs Excise Commissioner U P Allahabad, 1962)
verification_status : Unverified
entities            : STATUTE
num_sources         : 5
truncated           : False
latency_ms          : 1768.34
```

---

**Query: Right to privacy**
```
PS> Invoke-RestMethod -Uri "https://caffeinatedcoding-nyayasetu.hf.space/query" `
      -Method POST -ContentType "application/json" `
      -Body '{"query": "What is the right to privacy in India and how did the Supreme Court rule on it?"}'

query               : What is the right to privacy in India and how did the Supreme Court rule on it?
answer              : The right to privacy in India is "not absolute" and is "subject to certain
                      reasonable restrictions on the basis of compelling social, moral and public
                      interest" as stated in Justice K S Puttaswamy Retd And Anr vs Union Of India
                      And Ors (ID: SC_2017_2363). According to the same judgment, "the right to
                      privacy has been implied in articles 19 (1) (a) and (d) and article 21" of
                      the Constitution.

                      As noted in Distt Registrar Collector vs Canara Bank Etc (ID: SC_2004_4562),
                      "the right to privacy has been widely accepted as implied in our constitution"
                      and is "the right to be let alone".

                      The Supreme Court has ruled that the right to privacy is a fundamental right
                      emanating from Article 21 of the Constitution, as stated in Justice K S
                      Puttaswamy Retd And Anr vs Union Of India And Ors (ID: SC_2017_2363).

                      NOTE: This is not legal advice. Consult a qualified advocate.

sources             : SC_2017_2363 (Justice K S Puttaswamy Retd And Anr vs Union Of India, 2017)
                      SC_2018_24210 (Justice K S Puttaswamy Retd vs Union Of India, 2018)
                      SC_2004_4562 (Distt Registrar Collector vs Canara Bank Etc, 2004)
verification_status : Unverified
entities            : GPE, COURT
num_sources         : 5
truncated           : False
latency_ms          : 1051.71
```

---

**Query: Doctrine of proportionality**
```
PS> Invoke-RestMethod -Uri "https://caffeinatedcoding-nyayasetu.hf.space/query" `
      -Method POST -ContentType "application/json" `
      -Body '{"query": "What is the doctrine of proportionality and how is it applied in fundamental rights cases?"}'

query               : What is the doctrine of proportionality and how is it applied in
                      fundamental rights cases?
answer              : The doctrine of proportionality is a principle that guides the limitation of
                      fundamental rights. As stated in Anuradha Bhasin vs Union Of India
                      (ID: SC_2020_1572), "the proportionality principle, can be easily summarized
                      by lord diplock's aphorism — you must not use a steam hammer to crack a nut,
                      if a nutcracker would do?"

                      According to Justice K S Puttaswamy Retd vs Union Of India (ID: SC_2018_24210),
                      the proportionality test involves four stages: "a legitimate goal stage";
                      "a suitability or rational connection stage"; "a necessity stage"; and
                      "a balancing stage".

                      In Modern Dental College Res Cen Ors vs State Of Madhya Pradesh Ors
                      (ID: SC_2016_19144), "when a law limits a constitutional right, such a
                      limitation is constitutional if it is proportional".

                      NOTE: This is not legal advice. Consult a qualified advocate.

sources             : SC_2020_1572 (Anuradha Bhasin vs Union Of India, 2020)
                      SC_2018_24210 (Justice K S Puttaswamy Retd vs Union Of India, 2018)
                      SC_2016_19144 (Modern Dental College Res Cen vs State Of Madhya Pradesh, 2016)
                      SC_2023_16817 (Ramesh Chandra Sharma vs The State Of Uttar Pradesh, 2023)
verification_status : Unverified
entities            : (none extracted)
num_sources         : 5
truncated           : False
latency_ms          : 1511.71
```

---

**Validation — query too short (fails fast, model never called):**
```
PS> Invoke-RestMethod -Uri "https://caffeinatedcoding-nyayasetu.hf.space/query" `
      -Method POST -ContentType "application/json" `
      -Body '{"query": "help"}'

Invoke-RestMethod : {"detail":"Query too short — minimum 10 characters"}
StatusCode        : 400
```

---

**Out-of-domain query — LLM correctly refuses:**
```
PS> Invoke-RestMethod -Uri "https://caffeinatedcoding-nyayasetu.hf.space/query" `
      -Method POST -ContentType "application/json" `
      -Body '{"query": "Who won the IPL cricket tournament this year?"}'

answer              : The provided Supreme Court judgment excerpts do not contain any information
                      about the IPL cricket tournament or its winners. The excerpts appear to be
                      court judgments with case information, judge names, and dates, but they do
                      not mention the IPL or any related topics.
verification_status : No verifiable claims
entities            : ORG
num_sources         : 5
latency_ms          : 571.68
```

---

## API

**POST /query**
```json
{
  "query": "What is the doctrine of proportionality in fundamental rights cases?"
}
```

Response:
```json
{
  "query": "...",
  "answer": "The doctrine of proportionality... (SC_2018_24210)",
  "sources": [
    {
      "judgment_id": "SC_2018_24210",
      "title": "Justice K S Puttaswamy Retd vs Union Of India",
      "year": "2018",
      "similarity_score": 0.689,
      "excerpt": "..."
    }
  ],
  "verification_status": "Verified",
  "unverified_quotes": [],
  "entities": {"COURT": ["Supreme Court"]},
  "num_sources": 5,
  "truncated": false,
  "latency_ms": 1511.71
}
```
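
For non-PowerShell users, an equivalent client in Python (stdlib only; endpoint URL taken from the examples above):

```python
import json
import urllib.request

BASE_URL = "https://caffeinatedcoding-nyayasetu.hf.space"

def build_request(query: str) -> urllib.request.Request:
    """Build the POST /query request with the JSON body shown above."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(query: str) -> dict:
    """Send the query and decode the JSON response."""
    with urllib.request.urlopen(build_request(query), timeout=60) as resp:
        return json.load(resp)

if __name__ == "__main__":
    print(ask("What is the doctrine of proportionality?")["answer"])
```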

**GET /health** — returns `{"status": "ok", "service": "NyayaSetu", "version": "1.0.0"}`

**GET /** — app info and endpoint list

---

## Project Structure

```
NyayaSetu/
├── preprocessing/
│   ├── clean.py              ← text cleaning, OCR error fixing
│   ├── chunk.py              ← recursive splitter, 256 tokens, 50 overlap
│   ├── embed.py              ← MiniLM batch embedding
│   └── build_index.py        ← FAISS IndexFlatL2 construction
├── src/
│   ├── ner.py                ← DistilBERT NER inference
│   ├── retrieval.py          ← FAISS search + parent context assembly
│   ├── agent.py              ← single-pass query pipeline
│   ├── llm.py                ← Groq API call + tenacity retry
│   └── verify.py             ← deterministic citation verification
├── api/
│   ├── main.py               ← FastAPI, 3 endpoints, model download at startup
│   └── schemas.py            ← Pydantic request/response models
├── tests/
│   ├── test_retriever.py
│   ├── test_agent.py
│   ├── test_verify.py
│   └── test_api.py
├── .github/workflows/ci.yml  ← pytest → lint → docker build → HF deploy → smoke test
└── docker/Dockerfile
```
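
The splitter parameters noted above for `preprocessing/chunk.py` (256-token chunks, 50-token overlap) correspond to a sliding window. A simplified sketch — the real splitter recurses on separators first:

```python
def sliding_chunks(tokens: list[str], size: int = 256, overlap: int = 50):
    """Yield fixed-size token windows sharing `overlap` tokens between
    consecutive chunks, so no passage is cut off at a boundary."""
    step = size - overlap  # 206 fresh tokens per chunk
    for start in range(0, max(len(tokens) - overlap, 1), step):
        yield tokens[start:start + size]
```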

## V2 Agent Architecture

The single-call pipeline described above is the V1 design; the V2 agent replaces it with a three-pass loop plus per-session conversation memory.

**Pass 1 — Analyse:** LLM call to understand the message, detect tone/stage, 
build structured fact web, update hypotheses, form targeted FAISS queries.

**Pass 2 — Retrieve:** Parallel FAISS search across 3 queries. No LLM call. ~5ms.

**Pass 3 — Respond:** Dynamically assembled prompt based on tone, stage, and 
format needs + full case state + retrieved context.

**Conversation Memory:** Each session maintains a compressed summary + structured 
fact web (parties, events, documents, amounts, hypotheses) updated every turn.

---

## Setup & Reproduction

```bash
git clone https://github.com/devangmishra1424/nyayasetu
cd nyayasetu

pip install -r requirements.txt

# Set environment variables
export GROQ_API_KEY=your_key_here
export HF_TOKEN=your_token_here

# Models (~2.7GB) download automatically from HF Hub at startup
uvicorn api.main:app --host 0.0.0.0 --port 7860
```

---

## Limitations

**Data scope:** Supreme Court of India judgments only, 1950–2024. No High Court judgments, no legislation, no legal commentary.

**Citation verification:** The verifier does exact substring matching after normalisation. LLM paraphrases are flagged Unverified even when the underlying claim is correct. Full paraphrase detection would require NLI inference — out of scope for v1.

**Out-of-domain queries:** The similarity threshold blocks most irrelevant queries. Queries that share vocabulary with legal text may still pass through to the LLM, which will correctly report no relevant information found.

**Not a legal database:** This system cannot be used as a substitute for Westlaw, SCC Online, or Indian Kanoon. It is a portfolio demonstration of RAG pipeline engineering.

**Planned improvements (beyond v1):**
- Gradio frontend for non-technical users
- MLflow experiment tracking for NER training runs
- Evidently drift monitoring on query logs
- High Court judgment coverage
- Re-ranking layer (cross-encoder) between FAISS retrieval and LLM call

---

## Bug Log

**Bug 1 — `snapshot_download` with `allow_patterns` fetching 0 files**
The FAISS index files were uploaded to HuggingFace Hub under a `faiss_index/` subfolder. The `snapshot_download` call with `allow_patterns="faiss_index/*"` returned 0 files — it couldn't match the pattern against the subfolder structure. Fixed by switching to `hf_hub_download` with explicit `filename` paths per file. Lesson: `snapshot_download` pattern matching behaves differently for nested paths than expected.

**Bug 2 — L2 distance threshold logic inverted**
The similarity threshold in `retrieval.py` used `if best_score < SIMILARITY_THRESHOLD: return []`. This is correct for cosine similarity (higher = better) but wrong for L2 distance (lower = better). The condition was blocking good legal queries and letting through out-of-domain queries. Fixed by flipping to `if best_score > SIMILARITY_THRESHOLD` and setting threshold to 0.85. Lesson: always verify which direction your distance metric runs before writing threshold logic.
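
The corrected guard, for reference — a sketch using the threshold value from the fix, with hypothetical variable names standing in for the FAISS results:

```python
SIMILARITY_THRESHOLD = 0.85  # L2 distance: LOWER means MORE similar

def filter_hits(distances, ids):
    """Reject the whole result set when even the best (smallest) L2
    distance exceeds the threshold — nothing in the corpus is close
    enough — otherwise keep only the hits under it."""
    best = min(distances)
    if best > SIMILARITY_THRESHOLD:   # note '>', not '<' as in the bug
        return []
    return [i for d, i in zip(distances, ids) if d <= SIMILARITY_THRESHOLD]
```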

**Bug 3 — `api/__init__.py` contained a shell command**
The `api/__init__.py` file contained `echo ""` — a leftover from a PowerShell command accidentally piped into the file. Python threw a syntax error at startup. Fixed by overwriting with an empty string. Lesson: on Windows, `echo "" > file` writes the shell command into the file. Use `"" | Out-File -FilePath file -Encoding utf8` instead.