---
title: Pluto Pipeline
emoji: "📄"
colorFrom: gray
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
---

# Pluto: Real Mode-Switching Extraction Pipeline

Pluto is a document question-answering system built for research and technical documents. Instead of sending an entire paper to one model and hoping for the best, Pluto separates document understanding from query-time reasoning, routes only relevant chunks, extracts structured claims, merges them into an answer, and verifies support before returning the result.

The project includes a FastAPI backend, a one-page dashboard, scoped corpus selection, live pipeline progress streaming, evidence-backed answers, confidence reporting, trace summaries, and a baseline comparison view.

## Why Pluto

Traditional one-shot PDF chat often struggles with long documents, tables, figures, and answer traceability. Pluto is designed to make that workflow more inspectable and more efficient for project-scale document QA.

Key goals:

- query only the relevant parts of a document corpus
- switch model behavior by chunk type and task difficulty
- keep document processing reusable across multiple questions
- surface evidence, agent activity, and confidence to the user
- support scoped queries to one selected corpus document or the full corpus

## What The App Does

- uploads `PDF`, `DOCX/DOC`, `TXT`, and `MD` files into a local corpus
- converts uploaded files to Markdown and chunks them for retrieval
- classifies chunks as text, table, figure, code, references, and more
- runs a staged pipeline: `Route -> Extract -> Merge -> EvidenceCheck`
- streams live status updates through Server-Sent Events
- returns a final answer with sections, evidence, trace, confidence, and gaps
- compares Pluto against a simpler single-model baseline in the benchmark panel
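
The chunk classification step can be pictured with a few plain heuristics over the converted Markdown. This is a minimal sketch only: the function name `classify_chunk` and every rule in it are hypothetical illustrations, not Pluto's actual classifier, which may use different signals or a model.

```python
import re

# Hypothetical heuristic classifier; Pluto's real classifier may differ.
def classify_chunk(text: str) -> str:
    """Return a coarse chunk type for a piece of Markdown text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "text"
    # Fenced code blocks in the converted Markdown.
    if lines[0].startswith("```"):
        return "code"
    # Markdown tables: at least half of the lines are pipe-delimited rows.
    table_lines = sum(1 for ln in lines if ln.startswith("|") and ln.endswith("|"))
    if table_lines >= 2 and table_lines >= len(lines) / 2:
        return "table"
    # Figure captions such as "Figure 3:" or "Fig. 2".
    if re.match(r"(figure|fig\.)\s*\d+", lines[0], re.IGNORECASE):
        return "figure"
    # Reference lists: most lines start with a bracketed number like "[12]".
    ref_lines = sum(1 for ln in lines if re.match(r"\[\d+\]\s+\S", ln))
    if ref_lines >= 2 and ref_lines >= len(lines) * 0.8:
        return "references"
    return "text"
```

A type label like this is what would let a router send tables or figures to a different model than plain prose.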

## Architecture

```mermaid
flowchart LR
    A["Frontend Dashboard"] --> B["FastAPI Server"]
    B --> C["Upload + Corpus APIs"]
    B --> D["PipelineRunner"]
    D --> E["S0 Route"]
    D --> F["S1 Extract"]
    D --> G["S2 Merge"]
    D --> H["S3 EvidenceCheck"]
    C --> I["DocIndex"]
    C --> J["Corpus Files"]
    F --> K["ExtractionCache"]
    D --> L["Tracer + MessageBus"]
    B --> M["SSE Progress Stream"]
```

## Pipeline Overview

Pluto operates in two broad phases:

1. Document understanding
2. Query-time extraction and answer synthesis

At query time the main flow is:

1. `S0 Route`
   Picks relevant chunks, applies document scope, and assigns a processing mode.
2. `S1 Extract`
   Extracts structured claims from selected chunks and reuses cached extraction results when possible.
3. `S2 Merge`
   Combines claims into answer sections, open gaps, and key claims.
4. `S3 EvidenceCheck`
   Checks whether synthesized claims are present in retrieved chunk text using token overlap and an optional LLM confirmation call.
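
The token-overlap part of `S3 EvidenceCheck` can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the names `tokenize`, `overlap_score`, and `check_claim`, the two thresholds, and the three result labels are all hypothetical. The middle label marks where an optional LLM confirmation call could be made.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase alphanumeric tokens, ignoring tokens of one or two characters."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if len(t) > 2}

def overlap_score(claim: str, chunk_text: str) -> float:
    """Fraction of the claim's tokens that also appear in the chunk."""
    claim_tokens = tokenize(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & tokenize(chunk_text)) / len(claim_tokens)

def check_claim(claim, chunks, supported=0.8, borderline=0.5):
    """Classify a claim by its best overlap with any retrieved chunk.

    The thresholds are assumed values chosen for illustration.
    """
    best = max((overlap_score(claim, c) for c in chunks), default=0.0)
    if best >= supported:
        return "supported"
    if best >= borderline:
        # Borderline overlap: a candidate for the optional LLM confirmation.
        return "needs_llm_confirmation"
    return "unsupported"
```

Reserving the LLM call for borderline cases keeps the check cheap when overlap is clearly high or clearly low.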

## Tech Stack

- Backend: `FastAPI`, `Uvicorn`, `Pydantic`
- Frontend: custom `HTML + CSS + vanilla JavaScript`
- Document parsing: `pdfplumber`, `python-docx`
- Runtime config: `python-dotenv`
- Testing: `pytest`
- Providers: NVIDIA-hosted models when available, with Groq and Mistral fallback paths in the runtime

## Repo Layout

```text
mini-project_3rd_yr-main/
├─ Dockerfile
├─ README.md
├─ pytest.ini
├─ hf_space/
└─ mp1/
   ├─ main.py
   ├─ requirements.txt
   ├─ frontend/
   ├─ pluto/
   ├─ benchmark/
   ├─ scripts/
   ├─ corpus/
   └─ test_*.py
```

Important directories:

- `mp1/frontend/`: dashboard UI
- `mp1/pluto/`: backend server, pipeline, stages, routing, caching, tracing
- `mp1/benchmark/`: Pluto vs baseline comparison logic
- `mp1/corpus/`: local document corpus and generated corpus state
- `mp1/scripts/`: utility scripts such as the one-page PDF generator

## Quick Start

### 1. Install dependencies

```bash
pip install -r mp1/requirements.txt
```

### 2. Create your environment file

Copy the example file [`mp1/.env.example`](mp1/.env.example) to `mp1/.env` and fill in your API keys.

Minimum practical setup:

- set `NVIDIA_API_KEY` for the NVIDIA-backed stack
- or set `GROQ_API_KEY` for the fallback stack

### 3. Run the dashboard

```bash
python mp1/main.py --serve --port 8000
```

Open `http://127.0.0.1:8000`.

### 4. Optional CLI run

```bash
python mp1/main.py --query "What is this paper about?" --corpus mp1/corpus --output mp1/output
```

## Environment Variables

Runtime code in the repo references these variables:

- `NVIDIA_API_KEY`
- `NVIDIA_API_KEY_NANO`
- `NVIDIA_API_KEY_SUPER`
- `NVIDIA_API_KEY_VL`
- `NVIDIA_API_KEY_EMBED`
- `NVIDIA_API_KEY_RERANK`
- `NVIDIA_API_KEY_ULTRA`
- `GROQ_API_KEY`
- `MISTRAL_API_KEY`

In practice, the simplest starting point is either:

- one NVIDIA key through `NVIDIA_API_KEY`
- or one Groq key through `GROQ_API_KEY`
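
One way to picture the fallback behaviour is a small helper that picks the first provider whose key is set. This is a sketch only: `pick_provider` and `PROVIDER_KEYS` are hypothetical names, and the Groq-before-Mistral order is an assumption. The runtime's real selection logic, including the per-model `NVIDIA_API_KEY_*` variables, may work differently.

```python
import os
from typing import Mapping, Optional

# Assumed preference order: NVIDIA first, then the fallback providers.
PROVIDER_KEYS = (
    ("nvidia", "NVIDIA_API_KEY"),
    ("groq", "GROQ_API_KEY"),
    ("mistral", "MISTRAL_API_KEY"),
)

def pick_provider(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the first provider with a non-empty API key, or None."""
    env = os.environ if env is None else env
    for name, var in PROVIDER_KEYS:
        if env.get(var, "").strip():
            return name
    return None
```

Calling `pick_provider()` with no arguments reads the process environment, so it reflects `mp1/.env` only if those values have already been loaded, for example with `python-dotenv`.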

## Useful Endpoints

- `POST /api/run`
- `GET /api/stream`
- `POST /api/upload`
- `GET /api/corpus`
- `GET /api/doc-status/{doc_id}`
- `POST /api/compare`
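
For clients other than the bundled dashboard, the progress stream from `GET /api/stream` can be read with a small Server-Sent Events parser. The sketch below covers only the `event` and `data` fields of the standard SSE wire format. The JSON payload shape in `sample` (`stage`, `status`) is an assumption for illustration and is not taken from Pluto's server code.

```python
import json
from typing import Iterable, Iterator

def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per Server-Sent Event from an iterable of text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line ends the current event.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith(":"):
            # Comment line, often used as a keep-alive.
            continue
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)

# Hypothetical stream content; the real payload fields may differ.
sample = [
    'data: {"stage": "S0 Route", "status": "running"}\n',
    "\n",
    ": keep-alive\n",
    'data: {"stage": "S1 Extract", "status": "running"}\n',
    "\n",
]
events = [json.loads(e["data"]) for e in iter_sse_events(sample)]
```

Against a running server, the same function can consume decoded lines from any streaming HTTP response. Whether `/api/stream` expects query parameters is not specified here.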

## Tests

A focused local suite used during development:

```bash
pytest mp1/test_server.py mp1/test_route.py mp1/test_merge.py mp1/test_verify.py mp1/test_doc_index.py -q
```

## Notes

- generated runtime artifacts, logs, temp folders, local caches, and secret files are intentionally excluded through `.gitignore`
- `mp1/output/` is treated as generated output, not source code
- corpus metadata such as `mp1/corpus/.doc_index.json` and `mp1/corpus/.extraction_cache.json` is runtime state