Spaces:

sciencialab
/

document-qa-dev

Build error

App Files Files Community

document-qa-dev / document_qa /deployment /README.md

lfoppiano

Upload folder using huggingface_hub

21fbed0 verified 23 days ago

preview code

Raw

History Blame Contribute Delete

4.48 kB

A newer version of the Streamlit SDK is available: 1.58.0

Upgrade

Modal deployment scripts

This folder contains the Modal apps that serve the LLM and embedding endpoints used by document-qa. Each script is an independent Modal app: deploy the ones you need, then point the matching .env variables at the URLs Modal prints.

Script	Modal app	Serves	Maps to `.env`
`modal_inference_phi.py`	`phi-4-mini-instruct-qa-vllm`	`microsoft/Phi-4-mini-instruct` (vLLM, OpenAI-compatible)	`PHI_URL`
`modal_inference_qwen.py`	`qwen-0.6b-qa-vllm`	`Qwen/Qwen3-0.6B` (vLLM, reasoning)	`QWEN_URL`
`modal_embeddings_multilang.py`	`intfloat-multilingual-e5-large-instruct-embeddings`	`intfloat/multilingual-e5-large-instruct`	`EMBEDS_URL`
`modal_embeddings_en.py`	`intfloat-e5-large-v2-embeddings`	`intfloat/e5-large-v2` (English-only)	`EMBEDS_URL`

Both embedding scripts define a tiny global EmbeddingModel class that delegates to the shared helpers in _embeddings_app.py (cls_kwargs, load_embedding_model, run_embed). The shared module holds the container image and the embedding logic; the model is loaded once per container via @modal.enter(). To add another embedding model, copy one wrapper and change MODEL_NAME / MODEL_REVISION / the app name.

Prerequisites

pip install modal
modal token new      # one-time browser auth

Secrets

The scripts read an API_KEY from a Modal Secret. Create the two secrets once (the value is the bearer token clients must send):

# Used by the inference scripts (phi, qwen)
modal secret create document-qa-api-key API_KEY=<your-llm-token>

# Used by the embedding scripts
modal secret create document-qa-embedding-key API_KEY=<your-embedding-token>

Secret	Used by	Provides
`document-qa-api-key`	`modal_inference_phi.py`, `modal_inference_qwen.py`	`API_KEY` for the vLLM `--api-key` flag
`document-qa-embedding-key`	`modal_embeddings_*.py`	`API_KEY` checked against the `x-api-key` header

Deploy

modal deploy document_qa/deployment/modal_inference_phi.py
modal deploy document_qa/deployment/modal_inference_qwen.py
modal deploy document_qa/deployment/modal_embeddings_multilang.py
# modal deploy document_qa/deployment/modal_embeddings_en.py   # optional English-only

Each deploy prints a public https://<...>.modal.run URL. Copy it into .env:

PHI_URL=https://<account>--phi-4-mini-instruct-qa-vllm-serve.modal.run/v1
QWEN_URL=https://<account>--qwen-0-6b-qa-vllm-serve.modal.run/v1
EMBEDS_URL=https://<account>--embeddings-multilang.modal.run   # English-only: --embeddings-en
API_KEY=<your-llm-token>            # matches document-qa-api-key
EMBEDS_API_KEY=<your-embedding-token>  # matches document-qa-embedding-key

Inference endpoints are OpenAI-compatible vLLM servers, so their URLs end in /v1. Embedding endpoints are a custom form endpoint (see below), so their URL has no /v1 suffix.

Endpoint contracts

Inference (vLLM)

Standard OpenAI Chat Completions API at <PHI_URL|QWEN_URL>, authenticated with the Authorization: Bearer <API_KEY> header. Used by langchain_openai.ChatOpenAI in streamlit_app.py.

Embeddings

A custom POST endpoint consumed by ModalEmbeddings:

Auth: x-api-key: <EMBEDS_API_KEY> header.
Body: form field text with newline-separated strings.
Response: JSON list of L2-normalised vectors, one per input line.

Smoke test:

curl -X POST "$EMBEDS_URL" \
  -H "x-api-key: $EMBEDS_API_KEY" \
  -F $'text=first sentence\nsecond sentence'

Tuning

These knobs live near the top of each script (or in _embeddings_app.py):

Setting	Where	Notes
`gpu`	`@app.function` / `@app.cls`	`A10G` is cheaper; `L40S` is faster. Embeddings default to `L40S`, inference to `A10G`.
`scaledown_window`	decorator	Idle time before a replica is stopped (cost vs. cold starts).
`max_inputs`	`@modal.concurrent`	Concurrent requests per replica — tune to GPU memory.
`LABEL`	`modal_embeddings_*.py`	Pins the public URL (`--<label>.modal.run`). Without it Modal truncates the long auto-name and appends a random hash.
`FAST_BOOT`	`modal_inference_phi.py`	`--enforce-eager` for faster cold starts vs. peak throughput.