Spaces:
Build error
Build error
| # Modal deployment scripts | |
| This folder contains the [Modal](https://modal.com) apps that serve the LLM and | |
| embedding endpoints used by document-qa. Each script is an independent Modal app: | |
| deploy the ones you need, then point the matching `.env` variables at the URLs | |
| Modal prints. | |
| | Script | Modal app | Serves | Maps to `.env` | | |
| |--------|-----------|--------|----------------| | |
| | `modal_inference_phi.py` | `phi-4-mini-instruct-qa-vllm` | `microsoft/Phi-4-mini-instruct` (vLLM, OpenAI-compatible) | `PHI_URL` | | |
| | `modal_inference_qwen.py` | `qwen-0.6b-qa-vllm` | `Qwen/Qwen3-0.6B` (vLLM, reasoning) | `QWEN_URL` | | |
| | `modal_embeddings_multilang.py` | `intfloat-multilingual-e5-large-instruct-embeddings` | `intfloat/multilingual-e5-large-instruct` | `EMBEDS_URL` | | |
| | `modal_embeddings_en.py` | `intfloat-e5-large-v2-embeddings` | `intfloat/e5-large-v2` (English-only) | `EMBEDS_URL` | | |
| > Both embedding scripts define a tiny global `EmbeddingModel` class that delegates | |
| > to the shared helpers in `_embeddings_app.py` (`cls_kwargs`, `load_embedding_model`, | |
| > `run_embed`). The shared module holds the container image and the embedding logic; | |
| > the model is loaded **once per container** via `@modal.enter()`. To add another | |
| > embedding model, copy one wrapper and change `MODEL_NAME` / `MODEL_REVISION` / the | |
| > app name. | |
| ## Prerequisites | |
| ```bash | |
| pip install modal | |
| modal token new # one-time browser auth | |
| ``` | |
| ## Secrets | |
| The scripts read an `API_KEY` from a Modal [Secret](https://modal.com/docs/guide/secrets). | |
| Create the two secrets once (the value is the bearer token clients must send): | |
| ```bash | |
| # Used by the inference scripts (phi, qwen) | |
| modal secret create document-qa-api-key API_KEY=<your-llm-token> | |
| # Used by the embedding scripts | |
| modal secret create document-qa-embedding-key API_KEY=<your-embedding-token> | |
| ``` | |
| | Secret | Used by | Provides | | |
| |--------|---------|----------| | |
| | `document-qa-api-key` | `modal_inference_phi.py`, `modal_inference_qwen.py` | `API_KEY` for the vLLM `--api-key` flag | | |
| | `document-qa-embedding-key` | `modal_embeddings_*.py` | `API_KEY` checked against the `x-api-key` header | | |
| ## Deploy | |
| ```bash | |
| modal deploy document_qa/deployment/modal_inference_phi.py | |
| modal deploy document_qa/deployment/modal_inference_qwen.py | |
| modal deploy document_qa/deployment/modal_embeddings_multilang.py | |
| # modal deploy document_qa/deployment/modal_embeddings_en.py # optional English-only | |
| ``` | |
| Each deploy prints a public `https://<...>.modal.run` URL. Copy it into `.env`: | |
| ```env | |
| PHI_URL=https://<account>--phi-4-mini-instruct-qa-vllm-serve.modal.run/v1 | |
| QWEN_URL=https://<account>--qwen-0-6b-qa-vllm-serve.modal.run/v1 | |
| EMBEDS_URL=https://<account>--embeddings-multilang.modal.run # English-only: --embeddings-en | |
| API_KEY=<your-llm-token> # matches document-qa-api-key | |
| EMBEDS_API_KEY=<your-embedding-token> # matches document-qa-embedding-key | |
| ``` | |
| > **Inference endpoints** are OpenAI-compatible vLLM servers, so their URLs end in | |
| > `/v1`. **Embedding endpoints** are a custom form endpoint (see below), so their | |
| > URL has no `/v1` suffix. | |
| ## Endpoint contracts | |
| ### Inference (vLLM) | |
| Standard OpenAI Chat Completions API at `<PHI_URL|QWEN_URL>`, authenticated with the | |
| `Authorization: Bearer <API_KEY>` header. Used by `langchain_openai.ChatOpenAI` in | |
| `streamlit_app.py`. | |
| ### Embeddings | |
| A custom `POST` endpoint consumed by | |
| [`ModalEmbeddings`](../custom_embeddings.py): | |
| - **Auth**: `x-api-key: <EMBEDS_API_KEY>` header. | |
| - **Body**: form field `text` with newline-separated strings. | |
| - **Response**: JSON list of L2-normalised vectors, one per input line. | |
| Smoke test: | |
| ```bash | |
| curl -X POST "$EMBEDS_URL" \ | |
| -H "x-api-key: $EMBEDS_API_KEY" \ | |
| -F $'text=first sentence\nsecond sentence' | |
| ``` | |
| ## Tuning | |
| These knobs live near the top of each script (or in `_embeddings_app.py`): | |
| | Setting | Where | Notes | | |
| |---------|-------|-------| | |
| | `gpu` | `@app.function` / `@app.cls` | `A10G` is cheaper; `L40S` is faster. Embeddings default to `L40S`, inference to `A10G`. | | |
| | `scaledown_window` | decorator | Idle time before a replica is stopped (cost vs. cold starts). | | |
| | `max_inputs` | `@modal.concurrent` | Concurrent requests per replica — tune to GPU memory. | | |
| | `LABEL` | `modal_embeddings_*.py` | Pins the public URL (`--<label>.modal.run`). Without it Modal truncates the long auto-name and appends a random hash. | | |
| | `FAST_BOOT` | `modal_inference_phi.py` | `--enforce-eager` for faster cold starts vs. peak throughput. | | |