# document-qa-engine documentation

> **License**: Apache 2.0 · **PyPI**: `pip install document-qa-engine`

A Python library and Streamlit application for **Question/Answering on scientific PDF documents** using Retrieval-Augmented Generation (RAG). It uses [GROBID](https://github.com/kermitt2/grobid) for structured text extraction, [ChromaDB](https://www.trychroma.com/) for vector storage, and any OpenAI-compatible LLM for answering.

## Overview

Most PDF Q/A tools feed raw extracted text to an LLM, which is noisy and loses document structure. **document-qa-engine** takes a different approach:

1. **Structured extraction**: sends the PDF to a GROBID server, which returns TEI-XML with separate sections (title, abstract, body paragraphs, figures, back matter) and precise bounding-box coordinates for every paragraph.
2. **Smart chunking**: paragraphs can be kept as-is or merged into larger chunks using token-aware merging, while preserving coordinate metadata.
3. **Vector embeddings**: each chunk is embedded (via a remote API or local model) and stored in an in-memory ChromaDB collection.
4. **Retrieval + LLM answering**: user questions are embedded, the most similar chunks are retrieved, and an LLM generates an answer from that context.
5. **PDF highlighting**: the Streamlit frontend highlights the exact PDF regions the LLM used, with a color gradient (orange = most relevant, blue = least relevant).
6. **NER post-processing** *(optional)*: LLM responses are scanned for physical quantities (via grobid-quantities) and materials mentions (via grobid-superconductors), then annotated inline.
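The token-aware merging in step 2 can be sketched roughly as follows. This is an illustration only, not the library's implementation: the `merge_paragraphs` name, the dict layout, and the crude whitespace token count are all assumptions.

```python
def merge_paragraphs(paragraphs, chunk_size):
    """Merge consecutive paragraphs into chunks of at most `chunk_size`
    tokens, carrying each paragraph's coordinates along with its text.
    A chunk_size of -1 keeps paragraphs as-is (one chunk per paragraph)."""
    if chunk_size == -1:
        return [(p["text"], [p["coords"]]) for p in paragraphs]
    chunks, texts, coords, tokens = [], [], [], 0
    for p in paragraphs:
        n = len(p["text"].split())  # crude whitespace token count
        if texts and tokens + n > chunk_size:
            chunks.append((" ".join(texts), coords))  # flush current chunk
            texts, coords, tokens = [], [], 0
        texts.append(p["text"])
        coords.append(p["coords"])
        tokens += n
    if texts:
        chunks.append((" ".join(texts), coords))
    return chunks

# Coordinates here are GROBID-style "page,x,y,w,h" strings (illustrative values).
paragraphs = [
    {"text": "First paragraph of the paper.", "coords": "1,50,100,200,20"},
    {"text": "Second paragraph with more detail.", "coords": "1,50,130,200,20"},
    {"text": "Third paragraph.", "coords": "1,50,160,200,20"},
]
chunks = merge_paragraphs(paragraphs, chunk_size=8)
```

The key property is that each merged chunk keeps the coordinate list of every paragraph it absorbed, which is what makes the PDF highlighting in step 5 possible.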
## Installation

### Option 1: PyPI (library only)

```bash
pip install document-qa-engine
```

### Option 2: From source (full app)

```bash
git clone https://github.com/lfoppiano/document-qa.git
cd document-qa
pip install -r requirements.txt
```

### Option 3: Docker

```bash
# Latest stable release
docker run -p 8501:8501 lfoppiano/document-insights-qa:latest

# Latest development build
docker run -p 8501:8501 lfoppiano/document-insights-qa:latest-develop
```

### Prerequisites

You need access to:

| Service | Required? | Purpose |
|---------|-----------|---------|
| **GROBID server** | Yes | Parses PDFs into structured text |
| **Embedding API** | Yes | Converts text to vectors |
| **LLM API** (OpenAI-compatible) | Yes | Answers questions |
| **grobid-quantities** | Optional | NER for measurements |
| **grobid-superconductors** | Optional | NER for materials |
## Configuration

All configuration is through environment variables. Create a `.env` file in the project root:

```env
# ── LLM Endpoints ────────────────────────────────────────
# Each key in API_MODELS maps a model name to its base URL.
PHI_URL=http://localhost:1234/v1        # Phi-4-mini-instruct endpoint
QWEN_URL=http://localhost:1234/v1       # Qwen3-0.6B endpoint
API_KEY=your-llm-api-key                # Auth key for LLM APIs

# ── Embedding Endpoint ───────────────────────────────────
EMBEDS_URL=http://127.0.0.1:1234/v1     # Embedding service URL
EMBEDS_API_KEY=your-embedding-api-key   # Auth key for embedding API

# ── Defaults ─────────────────────────────────────────────
DEFAULT_MODEL=microsoft/Phi-4-mini-instruct
DEFAULT_EMBEDDING=intfloat/multilingual-e5-large-instruct-modal

# ── GROBID Services ──────────────────────────────────────
GROBID_URL=https://your-grobid-url
GROBID_QUANTITIES_URL=https://your-grobid-quantities-url/
GROBID_MATERIALS_URL=https://your-grobid-superconductors-url/
```
### Variable Reference

| Variable | Description |
|----------|-------------|
| `PHI_URL` | Base URL for the Phi-4-mini-instruct vLLM server (OpenAI-compatible) |
| `QWEN_URL` | Base URL for the Qwen3-0.6B vLLM server (OpenAI-compatible) |
| `API_KEY` | Bearer token for authenticating with the LLM endpoints |
| `EMBEDS_URL` | Base URL for the embedding service (must expose `/embeddings` endpoint) |
| `EMBEDS_API_KEY` | Bearer token for authenticating with the embedding service |
| `DEFAULT_MODEL` | Model name pre-selected in the UI dropdown |
| `DEFAULT_EMBEDDING` | Embedding name pre-selected in the UI dropdown |
| `GROBID_URL` | Full URL to a running GROBID server |
| `GROBID_QUANTITIES_URL` | URL to a grobid-quantities server (for measurement NER) |
| `GROBID_MATERIALS_URL` | URL to a grobid-superconductors server (for materials NER) |
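The `.env` comments above mention an `API_MODELS` mapping from model names to base URLs. A minimal sketch of how such a mapping could be assembled from these variables is shown below; the variable names come from the configuration, but the structure is an assumption, not the app's actual internals.

```python
import os

# Fall back to the documented example values when the variables are unset.
PHI_URL = os.environ.get("PHI_URL", "http://localhost:1234/v1")
QWEN_URL = os.environ.get("QWEN_URL", "http://localhost:1234/v1")

# Hypothetical model-name -> base-URL mapping.
API_MODELS = {
    "microsoft/Phi-4-mini-instruct": PHI_URL,
    "Qwen/Qwen3-0.6B": QWEN_URL,
}

# Pre-select the default model and resolve its endpoint.
default = os.environ.get("DEFAULT_MODEL", "microsoft/Phi-4-mini-instruct")
base_url = API_MODELS.get(default, PHI_URL)
print(f"{default} -> {base_url}")
```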
---

## Quick Start: Streamlit App

```bash
# 1. Set up environment
cp .env.example .env   # Edit with your endpoints

# 2. Run the app
streamlit run streamlit_app.py
```

Then open `http://localhost:8501`, upload a PDF, and ask questions.

---
## Quick Start: As a Python Library

```python
from langchain_openai import ChatOpenAI

from document_qa.custom_embeddings import ModalEmbeddings
from document_qa.document_qa_engine import DocumentQAEngine, DataStorage

# 1. Set up the LLM
llm = ChatOpenAI(
    model="microsoft/Phi-4-mini-instruct",
    temperature=0.0,
    base_url="http://localhost:1234/v1",
    api_key="your-api-key"
)

# 2. Set up embeddings
embeddings = ModalEmbeddings(
    url="http://localhost:1234/v1",
    model_name="intfloat/multilingual-e5-large-instruct",
    api_key="your-embedding-key"
)

# 3. Create the storage and engine
storage = DataStorage(embeddings)
engine = DocumentQAEngine(
    llm=llm,
    data_storage=storage,
    grobid_url="https://lfoppiano-grobid.hf.space/"
)

# 4. Load a PDF (creates in-memory embeddings)
doc_id = engine.create_memory_embeddings(
    pdf_path="path/to/paper.pdf",
    chunk_size=500   # tokens per chunk (-1 = keep paragraphs)
)

# 5. Ask a question
_, answer, coordinates = engine.query_document(
    query="What is the main contribution of this paper?",
    doc_id=doc_id,
    context_size=10   # number of chunks to use as context
)
print(answer)

# 6. Or just retrieve relevant passages (no LLM)
passages, coordinates = engine.query_storage(
    query="What materials were studied?",
    doc_id=doc_id,
    context_size=5
)
for p in passages:
    print(p)
```
## Streamlit App Features

### Query Modes

| Mode | What It Does | When to Use |
|------|-------------|-------------|
| **LLM Q/A** | Retrieves context → sends to LLM → returns a natural-language answer | Default: for asking questions |
| **Embeddings** | Returns the raw text passages most similar to your question | Debugging: to see what context the LLM would receive |
| **Question Coefficient** | Computes `min_similarity - mean_similarity` as a quality estimate | Experimental: to predict answer reliability |
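The question coefficient from the table can be computed as below. This is only an illustration of the formula; the actual scale of the scores depends on the embedding model and distance metric used.

```python
def question_coefficient(similarities):
    """min(similarity) - mean(similarity) over the retrieved chunks,
    per the Query Modes table. The result is always <= 0: values near 0
    mean the retrieved chunks scored uniformly, while strongly negative
    values mean the scores were spread out."""
    return min(similarities) - sum(similarities) / len(similarities)

uniform = question_coefficient([0.8, 0.8, 0.8])  # approximately 0.0
spread = question_coefficient([0.9, 0.8, 0.4])   # approximately -0.3
```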
### Settings

| Setting | Default | Description |
|---------|---------|-------------|
| Chunk size | `-1` (paragraphs) | Token count per text chunk. `-1` keeps GROBID paragraphs intact. |
| Context size | `10` (paragraphs) / `4` (chunks) | Number of chunks sent to the LLM as context |
| Scroll to context | Off | Auto-scroll the PDF viewer to the most relevant passage |
| NER processing | Off | Run grobid-quantities + grobid-superconductors on LLM responses |

### PDF Annotations

After each query, the PDF viewer highlights the passages used as context:

- **Orange** (warm) = most relevant passage
- **Blue** (cold) = least relevant passage
- **Dotted border** = the single most relevant passage
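One way to produce such a warm-to-cold gradient is linear interpolation between two endpoint colors by relevance rank. This is a sketch of the idea only; the function name and the exact RGB endpoints are assumptions, not the app's actual code.

```python
def relevance_color(rank, total):
    """Map a passage's relevance rank (0 = most relevant) onto an
    orange-to-blue gradient by linearly interpolating each RGB channel."""
    t = rank / max(total - 1, 1)              # 0.0 (most) .. 1.0 (least)
    orange, blue = (255, 165, 0), (0, 0, 255)
    return tuple(round(a + (b - a) * t) for a, b in zip(orange, blue))

# Colors for 5 retrieved passages, from most to least relevant.
colors = [relevance_color(r, 5) for r in range(5)]
```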
## Troubleshooting

### SQLite version error

```
streamlit: Your system has an unsupported version of sqlite3.
Chroma requires sqlite3 >= 3.35.0.
```

**Linux fix**: see [this StackOverflow answer](https://stackoverflow.com/questions/76958817/streamlit-your-system-has-an-unsupported-version-of-sqlite3-chroma-requires-sq).

**More info**: [Chroma troubleshooting docs](https://docs.trychroma.com/troubleshooting#sqlite).

### "The information is not provided in the given context"

The LLM couldn't find the answer in the retrieved passages. Try:

1. **Increase context size**: use the sidebar slider to retrieve more passages.
2. **Decrease chunk size**: smaller chunks may match more precisely.
3. **Use Embeddings mode**: switch to the "Embeddings" query mode to see which passages are being retrieved and verify that they contain the answer.

### MissingSchema error on embeddings

```
requests.exceptions.MissingSchema: Invalid URL
```

Ensure `EMBEDS_URL` in your `.env` starts with `https://` or `http://`. Example:

```env
EMBEDS_URL=https://your-modal-endpoint.modal.run/v1
```

### GROBID connection errors

Make sure your GROBID server is running and accessible:

```bash
curl https://grobid.hf.space/api/isalive
```

If using a local GROBID instance:

```bash
docker run --rm -p 8070:8070 lfoppiano/grobid:0.8.0
# Then set GROBID_URL=http://localhost:8070
```

### Embedding API returning empty results

- Verify the API is running: `curl {EMBEDS_URL}/embeddings`
- Check that `EMBEDS_API_KEY` matches the server's expected key
- Ensure the URL does **not** have a trailing `/embeddings` (the client appends it automatically)
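The two URL misconfigurations above (missing scheme and trailing `/embeddings`) can be caught with a small sanity check before starting the app. `check_embeds_url` is a hypothetical helper for illustration, not part of the library.

```python
def check_embeds_url(url):
    """Flag the two EMBEDS_URL misconfigurations covered in this section:
    a missing http(s) scheme and a trailing /embeddings path segment."""
    problems = []
    if not url.startswith(("http://", "https://")):
        problems.append("missing http:// or https:// scheme")
    if url.rstrip("/").endswith("/embeddings"):
        problems.append("remove the trailing /embeddings (the client appends it)")
    return problems

ok = check_embeds_url("https://your-modal-endpoint.modal.run/v1")   # no problems
bad = check_embeds_url("localhost:1234/v1/embeddings")              # both problems
```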
---