
πŸ“ document-qa-engine documentation

License: Apache 2.0 · PyPI: pip install document-qa-engine

A Python library and Streamlit application for question answering on scientific PDF documents using Retrieval-Augmented Generation (RAG). It uses GROBID for structured text extraction, ChromaDB for vector storage, and any OpenAI-compatible LLM for answering.

Overview

Most PDF Q/A tools feed raw extracted text to an LLM, which is noisy and loses document structure. document-qa-engine takes a different approach:

  1. Structured extraction: sends the PDF to a GROBID server, which returns TEI-XML with separate sections (title, abstract, body paragraphs, figures, back matter) and precise bounding-box coordinates for every paragraph.
  2. Smart chunking: paragraphs are kept as-is or merged into larger chunks using token-aware merging, while preserving coordinate metadata.
  3. Vector embeddings: each chunk is embedded (via a remote API or local model) and stored in an in-memory ChromaDB collection.
  4. Retrieval + LLM answering: user questions are embedded, the most similar chunks are retrieved, and an LLM generates an answer from that context.
  5. PDF highlighting: the Streamlit frontend highlights the exact PDF regions the LLM used, with a color gradient (orange = most relevant, blue = least relevant).
  6. NER post-processing (optional): LLM responses are scanned for physical quantities (via grobid-quantities) and materials mentions (via grobid-superconductors), then annotated inline.
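The token-aware merging in step 2 can be sketched as follows. This is a simplified illustration, not the library's actual implementation: whitespace splitting stands in for real tokenization, and the function name is hypothetical.

```python
def merge_paragraphs(paragraphs, chunk_size):
    """Greedily merge consecutive paragraphs until a chunk would exceed
    chunk_size tokens. chunk_size == -1 keeps paragraphs as-is.
    Tokens are approximated here by whitespace splitting."""
    if chunk_size == -1:
        return list(paragraphs)
    chunks, current, current_tokens = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        if current and current_tokens + n > chunk_size:
            # Flush the current chunk before it would overflow.
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Passing -1 mirrors the app's default of keeping GROBID paragraphs intact; in the real pipeline, the coordinate metadata of merged paragraphs would be carried along with each chunk.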

Installation

Option 1: PyPI (library only)

pip install document-qa-engine

Option 2: From source (full app)

git clone https://github.com/lfoppiano/document-qa.git
cd document-qa
pip install -r requirements.txt

Option 3: Docker

# Latest stable release
docker run -p 8501:8501 lfoppiano/document-insights-qa:latest

# Latest development build
docker run -p 8501:8501 lfoppiano/document-insights-qa:latest-develop

Prerequisites

You need access to:

| Service | Required? | Purpose |
|---|---|---|
| GROBID server | ✅ Yes | Parses PDFs into structured text |
| Embedding API | ✅ Yes | Converts text to vectors |
| LLM API (OpenAI-compatible) | ✅ Yes | Answers questions |
| grobid-quantities | ❌ Optional | NER for measurements |
| grobid-superconductors | ❌ Optional | NER for materials |
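Before configuring anything, it can help to confirm the mandatory GROBID service is reachable. A minimal health check, using GROBID's standard api/isalive endpoint (the helper names below are hypothetical, not part of the library):

```python
import urllib.request


def isalive_url(grobid_url):
    """Build the GROBID health-check URL from a base server URL."""
    return grobid_url.rstrip("/") + "/api/isalive"


def check_grobid(grobid_url, timeout=5):
    """Return True if the GROBID server answers its isalive endpoint."""
    try:
        with urllib.request.urlopen(isalive_url(grobid_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False
```

The same pattern works for grobid-quantities and grobid-superconductors, which expose the same isalive convention.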

Configuration

All configuration is through environment variables. Create a .env file in the project root:

# ── LLM Endpoints ────────────────────────────────────────
# Each key in API_MODELS maps a model name to its base URL.
PHI_URL=http://localhost:1234/v1          # Phi-4-mini-instruct endpoint
QWEN_URL=http://localhost:1234/v1         # Qwen3-0.6B endpoint
API_KEY=your-llm-api-key                  # Auth key for LLM APIs

# ── Embedding Endpoint ───────────────────────────────────
EMBEDS_URL=http://127.0.0.1:1234/v1      # Embedding service URL
EMBEDS_API_KEY=your-embedding-api-key     # Auth key for embedding API

# ── Defaults ─────────────────────────────────────────────
DEFAULT_MODEL=microsoft/Phi-4-mini-instruct
DEFAULT_EMBEDDING=intfloat/multilingual-e5-large-instruct-modal

# ── GROBID Services ──────────────────────────────────────
GROBID_URL=https://your-grobid-url
GROBID_QUANTITIES_URL=https://your-grobid-quantities-url/
GROBID_MATERIALS_URL=https://your-grobid-superconductors-url/

Variable Reference

| Variable | Description |
|---|---|
| PHI_URL | Base URL for the Phi-4-mini-instruct vLLM server (OpenAI-compatible) |
| QWEN_URL | Base URL for the Qwen3-0.6B vLLM server (OpenAI-compatible) |
| API_KEY | Bearer token for authenticating with the LLM endpoints |
| EMBEDS_URL | Base URL for the embedding service (must expose /embeddings endpoint) |
| EMBEDS_API_KEY | Bearer token for authenticating with the embedding service |
| DEFAULT_MODEL | Model name pre-selected in the UI dropdown |
| DEFAULT_EMBEDDING | Embedding name pre-selected in the UI dropdown |
| GROBID_URL | Full URL to a running GROBID server |
| GROBID_QUANTITIES_URL | URL to a grobid-quantities server (for measurement NER) |
| GROBID_MATERIALS_URL | URL to a grobid-superconductors server (for materials NER) |
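As a sketch of how these variables might be consumed in code (the app's actual loading logic may differ; the helper name and the Qwen model identifier below are assumptions):

```python
import os


def load_llm_config():
    """Read the LLM-related environment variables, falling back to the
    documented defaults. api_models maps a model name to its base URL,
    in the spirit of the API_MODELS mapping described above."""
    api_models = {
        "microsoft/Phi-4-mini-instruct": os.environ.get(
            "PHI_URL", "http://localhost:1234/v1"
        ),
        # Model identifier assumed for illustration.
        "Qwen/Qwen3-0.6B": os.environ.get("QWEN_URL", "http://localhost:1234/v1"),
    }
    return {
        "api_models": api_models,
        "api_key": os.environ.get("API_KEY", ""),
        "default_model": os.environ.get(
            "DEFAULT_MODEL", "microsoft/Phi-4-mini-instruct"
        ),
    }
```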

Quick Start: Streamlit App

# 1. Set up environment
cp .env.example .env   # Edit with your endpoints

# 2. Run the app
streamlit run streamlit_app.py

Then open http://localhost:8501, upload a PDF, and ask questions.


Quick Start: As a Python Library

from langchain_openai import ChatOpenAI
from document_qa.custom_embeddings import ModalEmbeddings
from document_qa.document_qa_engine import DocumentQAEngine, DataStorage

# 1. Set up the LLM
llm = ChatOpenAI(
    model="microsoft/Phi-4-mini-instruct",
    temperature=0.0,
    base_url="http://localhost:1234/v1",
    api_key="your-api-key"
)

# 2. Set up embeddings
embeddings = ModalEmbeddings(
    url="http://localhost:1234/v1",
    model_name="intfloat/multilingual-e5-large-instruct",
    api_key="your-embedding-key"
)

# 3. Create the storage and engine
storage = DataStorage(embeddings)
engine = DocumentQAEngine(
    llm=llm,
    data_storage=storage,
    grobid_url="https://lfoppiano-grobid.hf.space/"
)

# 4. Load a PDF (creates in-memory embeddings)
doc_id = engine.create_memory_embeddings(
    pdf_path="path/to/paper.pdf",
    chunk_size=500       # tokens per chunk (-1 = keep paragraphs)
)

# 5. Ask a question
_, answer, coordinates = engine.query_document(
    query="What is the main contribution of this paper?",
    doc_id=doc_id,
    context_size=10      # number of chunks to use as context
)
print(answer)

# 6. Or just retrieve relevant passages (no LLM)
passages, coordinates = engine.query_storage(
    query="What materials were studied?",
    doc_id=doc_id,
    context_size=5
)
for p in passages:
    print(p)

Streamlit App Features

Query Modes

| Mode | What It Does | When to Use |
|---|---|---|
| LLM Q/A | Retrieves context → sends it to the LLM → returns a natural-language answer | Default: for asking questions |
| Embeddings | Returns the raw text passages most similar to your question | Debugging: to see what context the LLM would receive |
| Question Coefficient | Computes min_similarity - mean_similarity as a quality estimate | Experimental: to predict answer reliability |

Settings

| Setting | Default | Description |
|---|---|---|
| Chunk size | -1 (paragraphs) | Token count per text chunk; -1 keeps GROBID paragraphs intact |
| Context size | 10 (paragraphs) / 4 (chunks) | Number of chunks sent to the LLM as context |
| Scroll to context | Off | Auto-scroll the PDF viewer to the most relevant passage |
| NER processing | Off | Run grobid-quantities + grobid-superconductors on LLM responses |

PDF Annotations

After each query, the PDF viewer highlights the passages used as context:

  • Orange (warm) = most relevant passage
  • Blue (cold) = least relevant passage
  • Dotted border = the single most relevant passage
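The gradient can be pictured as a linear interpolation by retrieval rank. This is a sketch only: the app's exact colors are not documented here, so the hex endpoints and the helper name are assumptions.

```python
def rank_to_color(rank, total):
    """Interpolate from orange (rank 0, most relevant) to blue
    (last rank, least relevant); returns a hex color string."""
    t = rank / max(total - 1, 1)
    orange, blue = (255, 165, 0), (0, 0, 255)
    r, g, b = (round(o + (c - o) * t) for o, c in zip(orange, blue))
    return f"#{r:02X}{g:02X}{b:02X}"
```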

Troubleshooting

SQLite version error

streamlit: Your system has an unsupported version of sqlite3.
Chroma requires sqlite3 >= 3.35.0.

Linux fix: See this StackOverflow answer. More info: Chroma troubleshooting docs.

"The information is not provided in the given context"

The LLM couldn't find the answer in the retrieved passages. Try:

  1. Increase context size: use the sidebar slider to retrieve more passages
  2. Decrease chunk size: smaller chunks may match more precisely
  3. Use Embeddings mode: switch to "Embeddings" query mode to see what passages are being retrieved and verify they contain the answer

MissingSchema error on embeddings

requests.exceptions.MissingSchema: Invalid URL

Ensure EMBEDS_URL in your .env starts with https:// or http://. Example:

EMBEDS_URL=https://your-modal-endpoint.modal.run/v1

GROBID connection errors

Make sure your GROBID server is running and accessible:

curl https://grobid.hf.space/api/isalive

If using a local GROBID instance:

docker run --rm -p 8070:8070 lfoppiano/grobid:0.8.0
# Then set GROBID_URL=http://localhost:8070

Embedding API returning empty results

  • Verify the API is running: curl {EMBEDS_URL}/embeddings
  • Check that EMBEDS_API_KEY matches the server's expected key
  • Ensure the URL does not have a trailing /embeddings (the client appends it automatically)
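The URL-related checks above can be folded into a small sanity-check helper. This is a hypothetical sketch, not part of the library:

```python
from urllib.parse import urlparse


def check_embeds_url(url):
    """Return a list of problems with an EMBEDS_URL value, mirroring
    the troubleshooting rules above. Empty list means the URL looks OK."""
    problems = []
    if urlparse(url).scheme not in ("http", "https"):
        problems.append("URL must start with http:// or https://")
    if url.rstrip("/").endswith("/embeddings"):
        problems.append("drop the trailing /embeddings; the client appends it")
    return problems
```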