Spaces:
Build error
A newer version of the Streamlit SDK is available: 1.58.0
Modal deployment scripts
This folder contains the Modal apps that serve the LLM and
embedding endpoints used by document-qa. Each script is an independent Modal app:
deploy the ones you need, then point the matching .env variables at the URLs
Modal prints.
| Script | Modal app | Serves | Maps to .env |
|---|---|---|---|
modal_inference_phi.py |
phi-4-mini-instruct-qa-vllm |
microsoft/Phi-4-mini-instruct (vLLM, OpenAI-compatible) |
PHI_URL |
modal_inference_qwen.py |
qwen-0.6b-qa-vllm |
Qwen/Qwen3-0.6B (vLLM, reasoning) |
QWEN_URL |
modal_embeddings_multilang.py |
intfloat-multilingual-e5-large-instruct-embeddings |
intfloat/multilingual-e5-large-instruct |
EMBEDS_URL |
modal_embeddings_en.py |
intfloat-e5-large-v2-embeddings |
intfloat/e5-large-v2 (English-only) |
EMBEDS_URL |
Both embedding scripts define a tiny global
EmbeddingModelclass that delegates to the shared helpers in_embeddings_app.py(cls_kwargs,load_embedding_model,run_embed). The shared module holds the container image and the embedding logic; the model is loaded once per container via@modal.enter(). To add another embedding model, copy one wrapper and changeMODEL_NAME/MODEL_REVISION/ the app name.
Prerequisites
pip install modal
modal token new # one-time browser auth
Secrets
The scripts read an API_KEY from a Modal Secret.
Create the two secrets once (the value is the bearer token clients must send):
# Used by the inference scripts (phi, qwen)
modal secret create document-qa-api-key API_KEY=<your-llm-token>
# Used by the embedding scripts
modal secret create document-qa-embedding-key API_KEY=<your-embedding-token>
| Secret | Used by | Provides |
|---|---|---|
document-qa-api-key |
modal_inference_phi.py, modal_inference_qwen.py |
API_KEY for the vLLM --api-key flag |
document-qa-embedding-key |
modal_embeddings_*.py |
API_KEY checked against the x-api-key header |
Deploy
modal deploy document_qa/deployment/modal_inference_phi.py
modal deploy document_qa/deployment/modal_inference_qwen.py
modal deploy document_qa/deployment/modal_embeddings_multilang.py
# modal deploy document_qa/deployment/modal_embeddings_en.py # optional English-only
Each deploy prints a public https://<...>.modal.run URL. Copy it into .env:
PHI_URL=https://<account>--phi-4-mini-instruct-qa-vllm-serve.modal.run/v1
QWEN_URL=https://<account>--qwen-0-6b-qa-vllm-serve.modal.run/v1
EMBEDS_URL=https://<account>--embeddings-multilang.modal.run # English-only: --embeddings-en
API_KEY=<your-llm-token> # matches document-qa-api-key
EMBEDS_API_KEY=<your-embedding-token> # matches document-qa-embedding-key
Inference endpoints are OpenAI-compatible vLLM servers, so their URLs end in
/v1. Embedding endpoints are a custom form endpoint (see below), so their URL has no/v1suffix.
Endpoint contracts
Inference (vLLM)
Standard OpenAI Chat Completions API at <PHI_URL|QWEN_URL>, authenticated with the
Authorization: Bearer <API_KEY> header. Used by langchain_openai.ChatOpenAI in
streamlit_app.py.
Embeddings
A custom POST endpoint consumed by
ModalEmbeddings:
- Auth:
x-api-key: <EMBEDS_API_KEY>header. - Body: form field
textwith newline-separated strings. - Response: JSON list of L2-normalised vectors, one per input line.
Smoke test:
curl -X POST "$EMBEDS_URL" \
-H "x-api-key: $EMBEDS_API_KEY" \
-F $'text=first sentence\nsecond sentence'
Tuning
These knobs live near the top of each script (or in _embeddings_app.py):
| Setting | Where | Notes |
|---|---|---|
gpu |
@app.function / @app.cls |
A10G is cheaper; L40S is faster. Embeddings default to L40S, inference to A10G. |
scaledown_window |
decorator | Idle time before a replica is stopped (cost vs. cold starts). |
max_inputs |
@modal.concurrent |
Concurrent requests per replica — tune to GPU memory. |
LABEL |
modal_embeddings_*.py |
Pins the public URL (--<label>.modal.run). Without it Modal truncates the long auto-name and appends a random hash. |
FAST_BOOT |
modal_inference_phi.py |
--enforce-eager for faster cold starts vs. peak throughput. |