Spaces:

sciencialab
/

document-qa-dev

Build error

App Files Files Community

document-qa-dev / document_qa /deployment /README.md

lfoppiano

Upload folder using huggingface_hub

21fbed0 verified 24 days ago

preview code

Raw

History Blame Contribute Delete

4.48 kB

	# Modal deployment scripts

	This folder contains the [Modal](https://modal.com) apps that serve the LLM and
	embedding endpoints used by document-qa. Each script is an independent Modal app:
	deploy the ones you need, then point the matching `.env` variables at the URLs
	Modal prints.

	\| Script \| Modal app \| Serves \| Maps to `.env` \|
	\|--------\|-----------\|--------\|----------------\|
	\| `modal_inference_phi.py` \| `phi-4-mini-instruct-qa-vllm` \| `microsoft/Phi-4-mini-instruct` (vLLM, OpenAI-compatible) \| `PHI_URL` \|
	\| `modal_inference_qwen.py` \| `qwen-0.6b-qa-vllm` \| `Qwen/Qwen3-0.6B` (vLLM, reasoning) \| `QWEN_URL` \|
	\| `modal_embeddings_multilang.py` \| `intfloat-multilingual-e5-large-instruct-embeddings` \| `intfloat/multilingual-e5-large-instruct` \| `EMBEDS_URL` \|
	\| `modal_embeddings_en.py` \| `intfloat-e5-large-v2-embeddings` \| `intfloat/e5-large-v2` (English-only) \| `EMBEDS_URL` \|

	> Both embedding scripts define a tiny global `EmbeddingModel` class that delegates
	> to the shared helpers in `_embeddings_app.py` (`cls_kwargs`, `load_embedding_model`,
	> `run_embed`). The shared module holds the container image and the embedding logic;
	> the model is loaded once per container via `@modal.enter()`. To add another
	> embedding model, copy one wrapper and change `MODEL_NAME` / `MODEL_REVISION` / the
	> app name.

	## Prerequisites

	```bash
	pip install modal
	modal token new # one-time browser auth
	```

	## Secrets

	The scripts read an `API_KEY` from a Modal [Secret](https://modal.com/docs/guide/secrets).
	Create the two secrets once (the value is the bearer token clients must send):

	```bash
	# Used by the inference scripts (phi, qwen)
	modal secret create document-qa-api-key API_KEY=<your-llm-token>

	# Used by the embedding scripts
	modal secret create document-qa-embedding-key API_KEY=<your-embedding-token>
	```

	\| Secret \| Used by \| Provides \|
	\|--------\|---------\|----------\|
	\| `document-qa-api-key` \| `modal_inference_phi.py`, `modal_inference_qwen.py` \| `API_KEY` for the vLLM `--api-key` flag \|
	\| `document-qa-embedding-key` \| `modal_embeddings_*.py` \| `API_KEY` checked against the `x-api-key` header \|

	## Deploy

	```bash
	modal deploy document_qa/deployment/modal_inference_phi.py
	modal deploy document_qa/deployment/modal_inference_qwen.py
	modal deploy document_qa/deployment/modal_embeddings_multilang.py
	# modal deploy document_qa/deployment/modal_embeddings_en.py # optional English-only
	```

	Each deploy prints a public `https://<...>.modal.run` URL. Copy it into `.env`:

	```env
	PHI_URL=https://<account>--phi-4-mini-instruct-qa-vllm-serve.modal.run/v1
	QWEN_URL=https://<account>--qwen-0-6b-qa-vllm-serve.modal.run/v1
	EMBEDS_URL=https://<account>--embeddings-multilang.modal.run # English-only: --embeddings-en
	API_KEY=<your-llm-token> # matches document-qa-api-key
	EMBEDS_API_KEY=<your-embedding-token> # matches document-qa-embedding-key
	```

	> Inference endpoints are OpenAI-compatible vLLM servers, so their URLs end in
	> `/v1`. Embedding endpoints are a custom form endpoint (see below), so their
	> URL has no `/v1` suffix.

	## Endpoint contracts

	### Inference (vLLM)

	Standard OpenAI Chat Completions API at `<PHI_URL\|QWEN_URL>`, authenticated with the
	`Authorization: Bearer <API_KEY>` header. Used by `langchain_openai.ChatOpenAI` in
	`streamlit_app.py`.

	### Embeddings

	A custom `POST` endpoint consumed by
	[`ModalEmbeddings`](../custom_embeddings.py):

	- Auth: `x-api-key: <EMBEDS_API_KEY>` header.
	- Body: form field `text` with newline-separated strings.
	- Response: JSON list of L2-normalised vectors, one per input line.

	Smoke test:

	```bash
	curl -X POST "$EMBEDS_URL" \
	-H "x-api-key: $EMBEDS_API_KEY" \
	-F $'text=first sentence\nsecond sentence'
	```

	## Tuning

	These knobs live near the top of each script (or in `_embeddings_app.py`):

	\| Setting \| Where \| Notes \|
	\|---------\|-------\|-------\|
	\| `gpu` \| `@app.function` / `@app.cls` \| `A10G` is cheaper; `L40S` is faster. Embeddings default to `L40S`, inference to `A10G`. \|
	\| `scaledown_window` \| decorator \| Idle time before a replica is stopped (cost vs. cold starts). \|
	\| `max_inputs` \| `@modal.concurrent` \| Concurrent requests per replica — tune to GPU memory. \|
	\| `LABEL` \| `modal_embeddings_*.py` \| Pins the public URL (`--<label>.modal.run`). Without it Modal truncates the long auto-name and appends a random hash. \|
	\| `FAST_BOOT` \| `modal_inference_phi.py` \| `--enforce-eager` for faster cold starts vs. peak throughput. \|