# πŸ“ document-qa-engine documentation

>  **License**: Apache 2.0 Β· **PyPI**: `pip install document-qa-engine`

A Python library and Streamlit application for **Question/Answering on scientific PDF documents** using Retrieval-Augmented Generation (RAG). It uses [GROBID](https://github.com/kermitt2/grobid) for structured text extraction, [ChromaDB](https://www.trychroma.com/) for vector storage, and any OpenAI-compatible LLM for answering.


## Overview

Most PDF Q/A tools feed raw extracted text to an LLM, which is noisy and loses document structure. **document-qa-engine** takes a different approach:

1. **Structured extraction**: Sends the PDF to a GROBID server, which returns TEI-XML with separate sections (title, abstract, body paragraphs, figures, back matter) and precise bounding-box coordinates for every paragraph.
2. **Smart chunking**: Paragraphs can be kept as-is or merged into larger chunks using token-aware merging, while preserving coordinate metadata.
3. **Vector embeddings**: Each chunk is embedded (via a remote API or local model) and stored in an in-memory ChromaDB collection.
4. **Retrieval + LLM answering**: User questions are embedded, the most similar chunks are retrieved, and an LLM generates an answer from that context.
5. **PDF highlighting**: The Streamlit frontend highlights the exact PDF regions the LLM used, with a color gradient (orange = most relevant, blue = least relevant).
6. **NER post-processing** *(optional)*: LLM responses are scanned for physical quantities (via grobid-quantities) and materials mentions (via grobid-superconductors), then annotated inline.
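
The token-aware merging in step 2 can be sketched roughly as follows. This is illustrative only: `count_tokens` here is a naive whitespace word count, not the tokenizer the library actually uses, and `merge_paragraphs` is a hypothetical helper, not the library's API.

```python
def count_tokens(text: str) -> int:
    # Naive stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())

def merge_paragraphs(paragraphs: list[str], chunk_size: int) -> list[str]:
    if chunk_size == -1:  # keep GROBID paragraphs as-is
        return list(paragraphs)
    chunks, current, used = [], [], 0
    for p in paragraphs:
        n = count_tokens(p)
        # Flush the current chunk when adding this paragraph would
        # exceed the token budget.
        if current and used + n > chunk_size:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(p)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Note that merging only ever joins whole paragraphs, so each chunk's bounding-box metadata can remain the union of its paragraphs' coordinates.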


## Installation

### Option 1: PyPI (library only)

```bash
pip install document-qa-engine
```

### Option 2: From source (full app)

```bash
git clone https://github.com/lfoppiano/document-qa.git
cd document-qa
pip install -r requirements.txt
```

### Option 3: Docker

```bash
# Latest stable release
docker run -p 8501:8501 lfoppiano/document-insights-qa:latest

# Latest development build
docker run -p 8501:8501 lfoppiano/document-insights-qa:latest-develop
```

### Prerequisites

You need access to:

| Service | Required? | Purpose |
|---------|-----------|---------|
| **GROBID server** | βœ… Yes | Parses PDFs into structured text |
| **Embedding API** | βœ… Yes | Converts text to vectors |
| **LLM API** (OpenAI-compatible) | βœ… Yes | Answers questions |
| **grobid-quantities** | ❌ Optional | NER for measurements |
| **grobid-superconductors** | ❌ Optional | NER for materials |



## Configuration

All configuration is through environment variables. Create a `.env` file in the project root:

```env
# ── LLM Endpoints ────────────────────────────────────────
# Each key in API_MODELS maps a model name to its base URL.
PHI_URL=http://localhost:1234/v1          # Phi-4-mini-instruct endpoint
QWEN_URL=http://localhost:1234/v1         # Qwen3-0.6B endpoint
API_KEY=your-llm-api-key                  # Auth key for LLM APIs

# ── Embedding Endpoint ───────────────────────────────────
EMBEDS_URL=http://127.0.0.1:1234/v1      # Embedding service URL
EMBEDS_API_KEY=your-embedding-api-key     # Auth key for embedding API

# ── Defaults ─────────────────────────────────────────────
DEFAULT_MODEL=microsoft/Phi-4-mini-instruct
DEFAULT_EMBEDDING=intfloat/multilingual-e5-large-instruct-modal

# ── GROBID Services ──────────────────────────────────────
GROBID_URL=https://your-grobid-url
GROBID_QUANTITIES_URL=https://your-grobid-quantities-url/
GROBID_MATERIALS_URL=https://your-grobid-superconductors-url/
```

### Variable Reference

| Variable | Description |
|----------|-------------|
| `PHI_URL` | Base URL for the Phi-4-mini-instruct vLLM server (OpenAI-compatible) |
| `QWEN_URL` | Base URL for the Qwen3-0.6B vLLM server (OpenAI-compatible) |
| `API_KEY` | Bearer token for authenticating with the LLM endpoints |
| `EMBEDS_URL` | Base URL for the embedding service (must expose `/embeddings` endpoint) |
| `EMBEDS_API_KEY` | Bearer token for authenticating with the embedding service |
| `DEFAULT_MODEL` | Model name pre-selected in the UI dropdown |
| `DEFAULT_EMBEDDING` | Embedding name pre-selected in the UI dropdown |
| `GROBID_URL` | Full URL to a running GROBID server |
| `GROBID_QUANTITIES_URL` | URL to a grobid-quantities server (for measurement NER) |
| `GROBID_MATERIALS_URL` | URL to a grobid-superconductors server (for materials NER) |

---

## Quick Start β€” Streamlit App

```bash
# 1. Set up environment
cp .env.example .env   # Edit with your endpoints

# 2. Run the app
streamlit run streamlit_app.py
```

Then open `http://localhost:8501`, upload a PDF, and ask questions.

---

## Quick Start β€” As a Python Library

```python
from langchain_openai import ChatOpenAI
from document_qa.custom_embeddings import ModalEmbeddings
from document_qa.document_qa_engine import DocumentQAEngine, DataStorage

# 1. Set up the LLM
llm = ChatOpenAI(
    model="microsoft/Phi-4-mini-instruct",
    temperature=0.0,
    base_url="http://localhost:1234/v1",
    api_key="your-api-key"
)

# 2. Set up embeddings
embeddings = ModalEmbeddings(
    url="http://localhost:1234/v1",
    model_name="intfloat/multilingual-e5-large-instruct",
    api_key="your-embedding-key"
)

# 3. Create the storage and engine
storage = DataStorage(embeddings)
engine = DocumentQAEngine(
    llm=llm,
    data_storage=storage,
    grobid_url="https://lfoppiano-grobid.hf.space/"
)

# 4. Load a PDF (creates in-memory embeddings)
doc_id = engine.create_memory_embeddings(
    pdf_path="path/to/paper.pdf",
    chunk_size=500       # tokens per chunk (-1 = keep paragraphs)
)

# 5. Ask a question
_, answer, coordinates = engine.query_document(
    query="What is the main contribution of this paper?",
    doc_id=doc_id,
    context_size=10      # number of chunks to use as context
)
print(answer)

# 6. Or just retrieve relevant passages (no LLM)
passages, coordinates = engine.query_storage(
    query="What materials were studied?",
    doc_id=doc_id,
    context_size=5
)
for p in passages:
    print(p)
```


## Streamlit App Features

### Query Modes

| Mode | What It Does | When to Use |
|------|-------------|-------------|
| **LLM Q/A** | Retrieves context β†’ sends to LLM β†’ returns a natural language answer | Default β€” for asking questions |
| **Embeddings** | Returns the raw text passages most similar to your question | Debugging β€” to see what context the LLM would receive |
| **Question Coefficient** | Computes `min_similarity - mean_similarity` as a quality estimate | Experimental β€” to predict answer reliability |
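
The coefficient can be sketched over a list of retrieved-chunk distances (illustrative only; the app's exact similarity scale and retrieval internals are not shown here):

```python
# Illustrative sketch: the coefficient is the gap between the best
# (minimum) distance and the mean distance of the retrieved chunks.
def question_coefficient(similarities: list[float]) -> float:
    return min(similarities) - sum(similarities) / len(similarities)
```

A value near zero means all retrieved chunks are about equally close to the question; a strongly negative value means one chunk stands out from the rest.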

### Settings

| Setting | Default | Description |
|---------|---------|-------------|
| Chunk size | `-1` (paragraphs) | Token count per text chunk. `-1` keeps GROBID paragraphs intact. |
| Context size | `10` (paragraphs) / `4` (chunks) | Number of chunks sent to the LLM as context |
| Scroll to context | Off | Auto-scroll the PDF viewer to the most relevant passage |
| NER processing | Off | Run grobid-quantities + grobid-superconductors on LLM responses |

### PDF Annotations

After each query, the PDF viewer highlights the passages used as context:
- **Orange** (warm) = most relevant passage
- **Blue** (cold) = least relevant passage
- **Dotted border** = the single most relevant passage



## Troubleshooting

### SQLite version error

```
streamlit: Your system has an unsupported version of sqlite3.
Chroma requires sqlite3 >= 3.35.0.
```

- **Linux fix**: See [this StackOverflow answer](https://stackoverflow.com/questions/76958817/streamlit-your-system-has-an-unsupported-version-of-sqlite3-chroma-requires-sq).
- **More info**: [Chroma troubleshooting docs](https://docs.trychroma.com/troubleshooting#sqlite).

### "The information is not provided in the given context"

The LLM couldn't find the answer in the retrieved passages. Try:
1. **Increase context size** β€” use the sidebar slider to retrieve more passages
2. **Decrease chunk size** β€” smaller chunks may match more precisely
3. **Use Embeddings mode** β€” switch to "Embeddings" query mode to see what passages are being retrieved and verify they contain the answer

### MissingSchema error on embeddings

```
requests.exceptions.MissingSchema: Invalid URL
```

Ensure `EMBEDS_URL` in your `.env` starts with `https://` or `http://`. Example:
```env
EMBEDS_URL=https://your-modal-endpoint.modal.run/v1
```
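
A quick way to catch this before any request is made (an illustrative check using the standard library; `requests` raises `MissingSchema` exactly when the scheme is absent):

```python
from urllib.parse import urlparse

def has_valid_scheme(url: str) -> bool:
    # requests requires an explicit http:// or https:// prefix.
    return urlparse(url).scheme in ("http", "https")
```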

### GROBID connection errors

Make sure your GROBID server is running and accessible:
```bash
curl https://grobid.hf.space/api/isalive
```

If using a local GROBID instance:
```bash
docker run --rm -p 8070:8070 lfoppiano/grobid:0.8.0
# Then set GROBID_URL=http://localhost:8070
```

### Embedding API returning empty results

- Verify the API is running: `curl {EMBEDS_URL}/embeddings`
- Check that `EMBEDS_API_KEY` matches the server's expected key
- Ensure the URL does **not** have a trailing `/embeddings` (the client appends it automatically)
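
That last point can be enforced defensively with a small normalizer (a hypothetical helper, not part of the library's API):

```python
def normalize_embeds_url(url: str) -> str:
    # Strip trailing slashes and a trailing "/embeddings" segment,
    # since the client appends the path automatically.
    url = url.rstrip("/")
    if url.endswith("/embeddings"):
        url = url[: -len("/embeddings")]
    return url
```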

---