| # PregoPal Modal Deploy Guide | |
| > Final update: 2026-06-12 | |
| > Covers both deploy.py (pre-built wheel) and deploy_omni.py (llama.cpp-omni full-duplex) | |
| --- | |
| ## 1. Deploy Solutions Comparison | |
| | Aspect | Solution A (deploy.py) | Solution B (deploy_omni.py) | | |
| |--------|----------------------|------------------------------| | |
| | Engine | llama-cpp-python (mainline) | llama.cpp-omni (OpenBMB fork) | | |
| | Capabilities | Text + Vision | Text + Vision + Audio + TTS + token2wav | | |
| | Build method | Pre-built wheel (5s) | Source compile (~77min) | | |
| | Image build time | 45s | ~77min | | |
| | GPU | T4 | T4 / L4 | | |
| | Deploy mode | @asgi_app() | @asgi_app() + llama-server subprocess | | |
| | Status | Deployed (stable) | Compiled + inference verified on T4 | | |
| | App name | prego-pal-minicpm | prego-pal-minicpm-omni | | |
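| **Core difference in Solution B**: It runs the llama-server binary from the OpenBMB branch, which adds audio input and full-duplex TTS/STT on top of vision. The vision projector is passed via `--mmproj`; the audio encoder is loaded at runtime (see Section 3). | |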
| --- | |
| ## 2. Solution A: Pre-built wheel (deploy.py) | |
| ### Image Definition | |
| ```python | |
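| from modal import Image | |
| # Pre-built CUDA 12.1 wheel from the extra index: nothing is compiled at image build time. | |
| _image = ( | |
|     Image.debian_slim(python_version="3.11") | |
|     .pip_install("fastapi", "uvicorn[standard]", "httpx", "numpy", "Pillow") | |
|     .pip_install( | |
|         "llama-cpp-python", | |
|         extra_index_url="https://ggml-org.github.io/llama-cpp-python/whl/cu121", | |
|     ) | |
|     # Smoke test: fail the image build early if the wheel cannot be imported. | |
|     .run_commands("python -c \"import llama_cpp; print('OK')\"") | |
| ) | |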
| ``` | |
| ### Key Parameters | |
| - Model: MiniCPM-o-4_5-Q4_K_M.gguf + vision mmproj | |
| - LLM: n_gpu_layers=-1, n_ctx=8192 | |
| - GPU: T4 (16GB VRAM) | |
| - Concurrency: 10 | |
| - Idle timeout: 300s | |
| ### Deploy & Test | |
| ```bash | |
| modal deploy -m modal_deploy.deploy | |
| modal run -m modal_deploy.deploy::test_inference | |
| curl https://andrew-jiabin--prego-pal-minicpm-serve.modal.run/health | |
| ``` | |
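The health check above can also be scripted. The sketch below assumes the endpoint URL follows the `https://<workspace>--<app>-<function>.modal.run` pattern, which is inferred from the two URLs in this guide rather than taken from Modal's documentation; `modal_endpoint_url` and `check_health` are hypothetical helper names.

```python
from urllib.request import urlopen


def modal_endpoint_url(workspace: str, app_name: str, function_name: str) -> str:
    """Build an endpoint URL using the pattern seen in this guide (an assumption)."""
    return f"https://{workspace}--{app_name}-{function_name}.modal.run"


def check_health(base_url: str, timeout_s: float = 30.0) -> dict:
    """GET <base_url>/health and return the HTTP status plus the raw body."""
    with urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
        body = resp.read().decode("utf-8", errors="replace")
        return {"status": resp.status, "body": body}


if __name__ == "__main__":
    # Only prints the URL; call check_health(url) to actually hit the service.
    url = modal_endpoint_url("andrew-jiabin", "prego-pal-minicpm", "serve")
    print(url)
```

A generous timeout is used because the first request after an idle period may have to wait for a cold start.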
| --- | |
| ## 3. Solution B: llama.cpp-omni source compile (deploy_omni.py) | |
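| > **⚠️ This is the production version that compiled successfully. Do not casually modify or delete deploy_omni.py or its accompanying llamacpp_omni/ source directory.** | |
| > Any change must pass full build verification (`modal run --detach -m modal_deploy.deploy_omni::diagnose_volume` builds successfully and the model loads normally) and be confirmed correct before it is merged. | |
| > Current known-good Image ID: `im-AxWdR31ZWeEDIfXIPcxAhZ` (built 2026-06-11) | |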
| ### Architecture | |
| ``` | |
| FastAPI (ASGI) <-> llama-server (subprocess, 127.0.0.1:8081) | |
| | | |
| Modal Volume: GGUF models (vision + audio + TTS + token2wav) | |
| ``` | |
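Because the FastAPI layer proxies to a llama-server subprocess, it has to wait until that subprocess is listening before forwarding requests. The following is a minimal sketch of such a readiness wait, not the logic in deploy_omni.py; `wait_until_ready` and `port_is_open` are hypothetical helpers, and the default timeout is an arbitrary choice.

```python
import socket
import time
from typing import Callable


def port_is_open(host: str = "127.0.0.1", port: int = 8081, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 120.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A caller would start llama-server with `subprocess.Popen(...)` and then call `wait_until_ready(port_is_open)` before accepting traffic. An open port only shows that the server is listening; polling its `/health` endpoint would be a stricter probe.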
| ### Source | |
| The original OpenBMB llama.cpp-omni fork is no longer public, so this build uses the tc-mb/llama.cpp-omni fork instead. | |
| Stored in: modal_deploy/llamacpp_omni/ (3411 files, 156MB). | |
| ### Build Dependencies (Debian 12 + CUDA 12.4) | |
| ```bash | |
| # NVIDIA apt repo setup | |
| curl -L -o /tmp/cuda-keyring.deb https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb | |
| dpkg -i /tmp/cuda-keyring.deb | |
| apt-get update | |
| apt-get install -y cuda-toolkit-12-4 cuda-compiler-12-4 cuda-driver-dev-12-4 | |
| ``` | |
| ### CMake Flags | |
| | Flag | Purpose | | |
| |------|---------| | |
| | `-DGGML_CUDA=ON` | CUDA backend | | |
| | `-DLLAMA_BUILD_SERVER=ON` | Build llama-server | | |
| | `-DLLAMA_CUDA_FORCE_MMQ=ON` | Simplify matrix multiply templates (faster compile) | | |
| | `-DGGML_CUDA_NO_VMM=ON` | Skip libcuda.so.1 linking at build time | | |
| | `-DCMAKE_CUDA_ARCHITECTURES='75;89'` | T4(sm_75) + L4(sm_89) | | |
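The flags in the table can be assembled into a CMake configure command programmatically. This is a sketch only: the source and build directories are placeholders, and the real build steps in deploy_omni.py may differ. Passing the arguments as a list avoids shell quoting entirely, while `shlex.join` shows the quoted form needed when the command goes through a shell.

```python
import shlex

# Flags taken from the table above.
CMAKE_FLAGS = {
    "GGML_CUDA": "ON",
    "LLAMA_BUILD_SERVER": "ON",
    "LLAMA_CUDA_FORCE_MMQ": "ON",
    "GGML_CUDA_NO_VMM": "ON",
    "CMAKE_CUDA_ARCHITECTURES": "75;89",  # T4 (sm_75) + L4 (sm_89)
}


def cmake_configure_argv(source_dir: str, build_dir: str, flags: dict = CMAKE_FLAGS) -> list:
    """Return the configure command as an argument list (no shell involved)."""
    argv = ["cmake", "-S", source_dir, "-B", build_dir]
    argv += [f"-D{name}={value}" for name, value in flags.items()]
    return argv


def as_shell_command(argv: list) -> str:
    """Render the argument list as one shell-safe string."""
    return shlex.join(argv)


if __name__ == "__main__":
    # Placeholder paths, not the ones used by the real build.
    print(as_shell_command(cmake_configure_argv("/src/llamacpp_omni", "/src/build")))
```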
| ### llama-server Launch Args | |
| ```bash | |
| # Note: --mmproj accepts only ONE file (vision). The audio encoder is | |
| # loaded at runtime via the /v1/stream/omni_prefill API, NOT via --mmproj. | |
| # (Comments must stay outside the command: a comment after a trailing | |
| # backslash would cut off the remaining arguments.) | |
| llama-server \ | |
|   -m /models/MiniCPM-o-4_5-gguf/MiniCPM-o-4_5-Q4_K_M.gguf \ | |
|   --mmproj /models/MiniCPM-o-4_5-gguf/vision/MiniCPM-o-4_5-vision-F16.gguf \ | |
|   --voxcpm2-base-lm /models/MiniCPM-o-4_5-gguf/tts/MiniCPM-o-4_5-tts-F16.gguf \ | |
|   --voxcpm2-acoustic /models/MiniCPM-o-4_5-gguf/tts/MiniCPM-o-4_5-projector-F16.gguf \ | |
|   --host 127.0.0.1 --port 8081 \ | |
|   -ngl 99 -c 8192 --no-mmap --jinja | |
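When the server is started from Python, the same arguments can be built as a list and handed to `subprocess.Popen`. The sketch below only mirrors the flags shown above; `llama_server_argv` is a hypothetical helper and may not match how deploy_omni.py constructs its command.

```python
from pathlib import PurePosixPath


def llama_server_argv(
    model_root: str = "/models/MiniCPM-o-4_5-gguf",
    host: str = "127.0.0.1",
    port: int = 8081,
    n_ctx: int = 8192,
) -> list:
    """Build the llama-server argument list from the launch args in this guide."""
    root = PurePosixPath(model_root)
    return [
        "llama-server",
        "-m", str(root / "MiniCPM-o-4_5-Q4_K_M.gguf"),
        # Only the vision projector goes through --mmproj.
        "--mmproj", str(root / "vision" / "MiniCPM-o-4_5-vision-F16.gguf"),
        "--voxcpm2-base-lm", str(root / "tts" / "MiniCPM-o-4_5-tts-F16.gguf"),
        "--voxcpm2-acoustic", str(root / "tts" / "MiniCPM-o-4_5-projector-F16.gguf"),
        "--host", host,
        "--port", str(port),
        "-ngl", "99",
        "-c", str(n_ctx),
        "--no-mmap",
        "--jinja",
    ]
```

The list can be passed directly as `subprocess.Popen(llama_server_argv())`, which sidesteps the line-continuation and quoting issues of a shell script.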
| ### Deploy & Test | |
| ```bash | |
| modal deploy -m modal_deploy.deploy_omni | |
| modal run -m modal_deploy.deploy_omni::diagnose_volume | |
| modal run --detach -m modal_deploy.deploy_omni::test_inference | |
| ``` | |
| ### Test Results (2026-06-11, T4) | |
| **Status**: ✅ Deployed to production on T4. | |
| **Production URL**: `https://andrew-jiabin--prego-pal-minicpm-omni-serve.modal.run` | |
| | Metric | Value | | |
| |--------|-------| | |
| | Cold start (model load) | ~9.5s | | |
| | Chinese inference | 1.1s, 34 tokens, correct content | | |
| | English inference | 1.3s, 50 tokens (truncated by max_tokens) | | |
| | GPU memory (model) | ~11.6 GB | | |
| | GPU memory (mmproj) | ~1.1 GB | | |
| | VRAM left | ~2-3 GB | | |
| **Example outputs**: | |
| - **Chinese**: "你好,我是MiniCPM系列模型,由面壁智能和OpenBMB开源社区开发。..." | |
| - **English**: Empty (finish_reason=length, 50 tokens of special tokens) | |
| **Notes**: | |
| - English inference returns empty content when max_tokens is low, because the model emits special tokens first (e.g. `<|im_start|>assistant`). Raise max_tokens or adjust the prompt to get valid English output. | |
| - `--jinja` flag is required for correct chat template formatting. | |
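The truncation case in the first note can be detected on the client side. This sketch assumes the response follows the OpenAI chat-completions shape (`choices[0].message.content` and `choices[0].finish_reason`); the helper names, growth factor, and ceiling are illustrative choices, not values from the deployment.

```python
def is_truncated_empty(response: dict) -> bool:
    """True when generation stopped at max_tokens and no visible text came back."""
    choice = response["choices"][0]
    content = (choice.get("message", {}).get("content") or "").strip()
    return choice.get("finish_reason") == "length" and not content


def next_max_tokens(current: int, factor: int = 4, ceiling: int = 1024) -> int:
    """Pick a larger max_tokens for a retry, capped at ceiling (arbitrary defaults)."""
    return min(current * factor, ceiling)
```

A client could retry once with `next_max_tokens(50)` when `is_truncated_empty` returns True, rather than surfacing an empty reply.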
| ### API Endpoints | |
| | Endpoint | Method | Description | | |
| |----------|--------|-------------| | |
| | /v1/chat/completions | POST | OpenAI compatible (chat, multimodal, streaming) | | |
| | /v1/audio/speech | POST | TTS: text -> speech WAV | | |
| | /v1/audio/speech/stream | POST | Streaming TTS | | |
| | /v1/audio/transcriptions | POST | STT: speech -> text | | |
| | /v1/embeddings | POST | Embeddings | | |
| | /health | GET | Health check (includes audio/vision/TTS status) | | |
| | /v1/models | GET | Model list | | |
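As an illustration of the OpenAI-compatible chat endpoint, the sketch below builds a POST request with the standard library only. The `model` value is a placeholder (the server may ignore or require a different name), and `build_chat_request` is a hypothetical helper, not part of modal_deploy/client.py.

```python
import json
from urllib.request import Request


def build_chat_request(
    base_url: str,
    prompt: str,
    max_tokens: int = 256,
    model: str = "MiniCPM-o-4_5",  # placeholder model name (assumption)
) -> Request:
    """Build a POST request for /v1/chat/completions without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }
    return Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the returned object to `urllib.request.urlopen(req, timeout=...)` and decode the JSON body of the response.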
| --- | |
| ## 4. Volume Management | |
| ### Volume Info | |
| - Name: `minicpm-o-4_5-models` | |
| - Mount: `/models` | |
| - Structure: `/MiniCPM-o-4_5-gguf/` | |
| ### File Inventory (10 files, ~8.3 GB total) | |
| | File | Size | Description | | |
| |------|------|-------------| | |
| | MiniCPM-o-4_5-Q4_K_M.gguf | 4.68 GB | Main model (Q4_K_M quantized) | | |
| | vision/MiniCPM-o-4_5-vision-F16.gguf | 1.02 GB | Vision projector | | |
| | audio/MiniCPM-o-4_5-audio-F16.gguf | 0.61 GB | Audio encoder | | |
| | tts/MiniCPM-o-4_5-tts-F16.gguf | 1.08 GB | TTS BaseLM | | |
| | tts/MiniCPM-o-4_5-projector-F16.gguf | 0.01 GB | TTS projector | | |
| | token2wav-gguf/encoder.gguf | 0.14 GB | Token2wav encoder | | |
| | token2wav-gguf/flow_extra.gguf | 0.01 GB | Flow extra | | |
| | token2wav-gguf/flow_matching.gguf | 0.43 GB | Flow matching | | |
| | token2wav-gguf/hifigan2.gguf | 0.08 GB | HiFiGAN decoder | | |
| | token2wav-gguf/prompt_cache.gguf | 0.20 GB | Prompt cache | | |
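A quick way to confirm that an upload is complete is to compare the mounted directory against this inventory. The sketch below is a standalone check under the assumption that the files sit under `/models/MiniCPM-o-4_5-gguf`; it is not the implementation of `diagnose_volume`.

```python
from pathlib import Path

# Relative paths copied from the inventory table above.
EXPECTED_FILES = [
    "MiniCPM-o-4_5-Q4_K_M.gguf",
    "vision/MiniCPM-o-4_5-vision-F16.gguf",
    "audio/MiniCPM-o-4_5-audio-F16.gguf",
    "tts/MiniCPM-o-4_5-tts-F16.gguf",
    "tts/MiniCPM-o-4_5-projector-F16.gguf",
    "token2wav-gguf/encoder.gguf",
    "token2wav-gguf/flow_extra.gguf",
    "token2wav-gguf/flow_matching.gguf",
    "token2wav-gguf/hifigan2.gguf",
    "token2wav-gguf/prompt_cache.gguf",
]


def missing_files(model_dir: str, expected: list = EXPECTED_FILES) -> list:
    """Return the expected relative paths that are not regular files under model_dir."""
    root = Path(model_dir)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files("/models/MiniCPM-o-4_5-gguf")
    print("all files present" if not missing else f"missing: {missing}")
```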
| ### Commands | |
| ```bash | |
| modal volume put minicpm-o-4_5-models ./models/MiniCPM-o-4_5-gguf / | |
| modal volume ls minicpm-o-4_5-models | |
| ``` | |
| --- | |
| ## 5. CLI Quick Reference | |
| ```bash | |
| # Deploy | |
| modal deploy -m modal_deploy.deploy # Solution A | |
| modal deploy -m modal_deploy.deploy_omni # Solution B | |
| # Dev (hot-reload, Solution A only) | |
| modal serve -m modal_deploy.deploy | |
| # Volume diagnostics | |
| modal run -m modal_deploy.deploy_omni::diagnose_volume | |
| # Inference testing | |
| modal run -m modal_deploy.deploy_omni::test_inference | |
| # Detach mode (avoids disconnect killing build) | |
| modal run --detach -m modal_deploy.deploy_omni::diagnose_volume | |
| # Logs | |
| modal app logs prego-pal-minicpm | |
| modal app logs prego-pal-minicpm-omni | |
| # List apps (to spot stopped apps that can be cleaned up) | |
| modal app list | |
| ``` | |
| --- | |
| ## 6. Pitfalls Log | |
| ### 6.1 GitHub access blocked in Modal build containers | |
| **Problem**: `git clone` from Modal build container fails (no auth). | |
| **Fix**: Use `Image.add_local_dir()` with `copy=True` to ship source from local machine. | |
| ### 6.2 Debian-slim has no curl/wget | |
| **Problem**: Minimal base image lacks download tools. | |
| **Fix**: `apt_install("curl")` before `run_commands` that need downloads. | |
| ### 6.3 libcuda.so.1 linking failure at build time | |
| **Problem**: Build container has no NVIDIA driver, linker cannot find libcuda.so.1. | |
| **Root cause**: ggml-cuda.so links CUDA::cuda_driver (libcuda.so.1) for VMM support. | |
| **Fix**: `-DGGML_CUDA_NO_VMM=ON` disables CUDA driver linking at build time. At runtime, the Modal GPU container provides the real libcuda.so.1. | |
| ### 6.4 CMAKE_CUDA_ARCHITECTURES syntax | |
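| **Problem**: `75,89` (comma-separated) causes a CMake error; the architecture list must be semicolon-separated. | |
| **Fix**: `-DCMAKE_CUDA_ARCHITECTURES='75;89'`, with semicolons inside single quotes so the shell does not treat `;` as a command separator. | |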
| ### 6.5 add_local_dir + run_commands ordering | |
| **Problem**: Modal errors if `run_commands` comes after `add_local_dir` without `copy=True`. | |
| **Fix**: `.add_local_dir(..., copy=True)` then `.run_commands(...)`. | |
| ### 6.6 Network disconnect kills running build | |
| **Problem**: Local terminal disconnect terminates remote build. | |
| **Workaround**: `modal run --detach` submits to cloud and returns immediately. | |
| Monitor progress on Modal web UI. | |
| --- | |
| ## 7. Cost Estimation | |
| | GPU | Price/hour | Per request (2s) | Monthly 1000 req | | |
| |-----|-----------|-----------------|-----------------| | |
| | T4 (Solution A+B) | $0.50/hr | $0.00028 | $0.28 | | |
| | L4 (optional) | $0.42/hr | $0.00023 | $0.23 | | |
| | A100 (baseline) | $1.50/hr | $0.00083 | $0.83 | | |
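The per-request and monthly figures follow from billing GPU time by the second. The sketch below reproduces that arithmetic from the hourly prices in the table; it counts only the 2 s of active inference per request and, as an assumption, ignores cold starts and any idle time billed before the container scales down.

```python
def cost_per_request(price_per_hour: float, seconds_per_request: float = 2.0) -> float:
    """GPU cost of one request, in dollars, at per-second billing."""
    return price_per_hour / 3600.0 * seconds_per_request


def monthly_cost(
    price_per_hour: float,
    requests_per_month: int = 1000,
    seconds_per_request: float = 2.0,
) -> float:
    """GPU cost of a month's worth of requests, in dollars."""
    return cost_per_request(price_per_hour, seconds_per_request) * requests_per_month


if __name__ == "__main__":
    for gpu, price in [("T4", 0.50), ("L4", 0.42), ("A100", 1.50)]:
        print(f"{gpu}: ${cost_per_request(price):.5f}/request, ${monthly_cost(price):.2f}/month")
```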
| --- | |
| ## 8. 🔴 Production Script Protection | |
| The following files are **PRODUCTION ASSETS** — do not modify or delete without full rebuild verification: | |
| | File | Reason | | |
| |------|--------| | |
| | `modal_deploy/deploy_omni.py` | Main deploy script for full-duplex Solution B. Build proven working. | | |
| | `modal_deploy/llamacpp_omni/` | llama.cpp-omni source (3411 files). Exact checkout used for successful build. | | |
| | `modal_deploy/build_llama_server.sh` | Scratchpad local cross-compile attempt (unused, but committed). | | |
| **Verification required before merge**: `modal run --detach -m modal_deploy.deploy_omni::diagnose_volume` must pass. | |
| --- | |
| ## 9. File Map | |
| | File | Description | | |
| |------|-------------| | |
| | modal_deploy/deploy.py | Solution A: FastAPI + llama-cpp-python (deployed) | | |
| | modal_deploy/deploy_omni.py | Solution B: FastAPI + llama-server subprocess | | |
| | modal_deploy/deploy_backup.py | Solution A backup | | |
| | modal_deploy/llamacpp_omni/ | llama.cpp-omni source (3411 files, 156MB) | | |
| | modal_deploy/client.py | Python API client (OpenAI compatible) | | |
| | modal_deploy/README.md | Deploy instructions | | |
| | modal_deploy/build_llama_server.sh | Local cross-compile script (unused) | | |
| | docs/README_modal_deploy.md | This file - comprehensive guide | | |