PregoPal Modal Deploy Guide
Final update: 2026-06-12. Covers both deploy.py (pre-built wheel) and deploy_omni.py (llama.cpp-omni full-duplex).
1. Deploy Solutions Comparison
| Aspect | Solution A (deploy.py) | Solution B (deploy_omni.py) |
|---|---|---|
| Engine | llama-cpp-python (mainline) | llama.cpp-omni (OpenBMB fork) |
| Capabilities | Text + Vision | Text + Vision + Audio + TTS + token2wav |
| Build method | Pre-built wheel (5s) | Source compile (~77min) |
| Image build time | 45s | ~77min |
| GPU | T4 | T4 / L4 |
| Deploy mode | @asgi_app() | @asgi_app() + llama-server subprocess |
| Status | Deployed (stable) | Compiled + inference verified on T4 |
| App name | prego-pal-minicpm | prego-pal-minicpm-omni |
Solution B core difference: it uses the llama-server binary built from the OpenBMB branch, which adds an audio encoder alongside the vision mmproj and supports full-duplex TTS/STT.
2. Solution A: Pre-built wheel (deploy.py)
Image Definition
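from modal import Image

_image = (
    Image.debian_slim(python_version="3.11")
    .pip_install("fastapi", "uvicorn[standard]", "httpx", "numpy", "Pillow")
    # Pre-built CUDA 12.1 wheel, so no source compile is needed.
    .pip_install(
        "llama-cpp-python",
        extra_index_url="https://ggml-org.github.io/llama-cpp-python/whl/cu121",
    )
    # Smoke test: fail the image build early if the wheel cannot be imported.
    .run_commands("python -c \"import llama_cpp; print('OK')\"")
)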
Key Parameters
- Model: MiniCPM-o-4_5-Q4_K_M.gguf + vision mmproj
- LLM: n_gpu_layers=-1, n_ctx=8192
- GPU: T4 (16GB VRAM)
- Concurrency: 10
- Idle timeout: 300s
Deploy & Test
modal deploy -m modal_deploy.deploy
modal run -m modal_deploy.deploy::test_inference
curl https://andrew-jiabin--prego-pal-minicpm-serve.modal.run/health
3. Solution B: llama.cpp-omni source compile (deploy_omni.py)
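⚠️ This is the production version that has compiled successfully. deploy_omni.py and its accompanying llamacpp_omni/ source directory must not be modified or deleted casually. Any change must pass full build verification (modal run --detach -m modal_deploy.deploy_omni::diagnose_volume builds successfully and the model loads normally) and be confirmed correct before it is merged. Current working Image ID: im-AxWdR31ZWeEDIfXIPcxAhZ (built 2026-06-11).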
Architecture
FastAPI (ASGI) <-> llama-server (subprocess, 127.0.0.1:8081)
|
Modal Volume: GGUF models (vision + audio + TTS + token2wav)
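The FastAPI layer can only proxy requests once the llama-server subprocess is listening on 127.0.0.1:8081. The following is a minimal sketch of a readiness poll, not the code in deploy_omni.py. It assumes the subprocess answers GET /health with HTTP 200 once the model is loaded; the function names and the timeout values are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

LLAMA_SERVER_URL = "http://127.0.0.1:8081"  # host/port from the architecture diagram


def http_health_probe(base_url: str = LLAMA_SERVER_URL) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200 (assumed contract)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an error status while the model loads.
        return False


def wait_until_ready(
    probe: Callable[[], bool] = http_health_probe,
    timeout_s: float = 120.0,   # hypothetical budget, not a measured value
    interval_s: float = 0.5,
) -> bool:
    """Poll probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```

Passing the probe as a parameter keeps the loop testable without a running server; in the ASGI startup path one would call wait_until_ready() right after spawning the subprocess and fail fast if it returns False.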
Source
llama.cpp-omni. The original OpenBMB fork is no longer public, so the build uses the tc-mb/llama.cpp-omni fork. The source is stored in modal_deploy/llamacpp_omni/ (3411 files, 156MB).
Build Dependencies (Debian 12 + CUDA 12.4)
# NVIDIA apt repo setup
curl -L -o /tmp/cuda-keyring.deb https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i /tmp/cuda-keyring.deb
apt-get update
apt-get install -y cuda-toolkit-12-4 cuda-compiler-12-4 cuda-driver-dev-12-4
CMake Flags
| Flag | Purpose |
|---|---|
| -DGGML_CUDA=ON | CUDA backend |
| -DLLAMA_BUILD_SERVER=ON | Build llama-server |
| -DLLAMA_CUDA_FORCE_MMQ=ON | Simplify matrix multiply templates (faster compile) |
| -DGGML_CUDA_NO_VMM=ON | Skip libcuda.so.1 linking at build time |
| -DCMAKE_CUDA_ARCHITECTURES='75;89' | T4 (sm_75) + L4 (sm_89) |
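As an illustration of how these flags combine into one configure step, here is a sketch that assembles the cmake invocation in Python. The flag names and values are the ones from the table; the helper names and the source and build directories are hypothetical. Passing the arguments as a list avoids shell quoting entirely, and shlex.join shows why the semicolon-separated architecture list needs quotes when the command is written as a shell string.

```python
import shlex

# Flags copied from the table above.
CMAKE_FLAGS = {
    "GGML_CUDA": "ON",                    # CUDA backend
    "LLAMA_BUILD_SERVER": "ON",           # build llama-server
    "LLAMA_CUDA_FORCE_MMQ": "ON",         # faster compile
    "GGML_CUDA_NO_VMM": "ON",             # no libcuda.so.1 needed at build time
    "CMAKE_CUDA_ARCHITECTURES": "75;89",  # T4 (sm_75) + L4 (sm_89)
}


def cmake_configure_args(source_dir: str, build_dir: str) -> list[str]:
    """Return the configure command as an argv list (no shell quoting needed)."""
    args = ["cmake", "-S", source_dir, "-B", build_dir]
    args += [f"-D{name}={value}" for name, value in CMAKE_FLAGS.items()]
    return args


def cmake_configure_command(source_dir: str, build_dir: str) -> str:
    """Return the same command as a shell-safe string, e.g. for run_commands."""
    return shlex.join(cmake_configure_args(source_dir, build_dir))


if __name__ == "__main__":
    # Hypothetical paths inside the build container.
    print(cmake_configure_command("/src/llamacpp_omni", "/src/llamacpp_omni/build"))
```

In the string form, the architecture flag is emitted wrapped in single quotes because an unquoted semicolon would be read by the shell as a command separator.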
llama-server Launch Args
# Note: --mmproj only accepts ONE file (vision). The audio encoder is
# loaded via the /v1/stream/omni_prefill API at runtime, NOT via --mmproj.
llama-server \
  -m /models/MiniCPM-o-4_5-gguf/MiniCPM-o-4_5-Q4_K_M.gguf \
  --mmproj /models/MiniCPM-o-4_5-gguf/vision/MiniCPM-o-4_5-vision-F16.gguf \
  --voxcpm2-base-lm /models/MiniCPM-o-4_5-gguf/tts/MiniCPM-o-4_5-tts-F16.gguf \
  --voxcpm2-acoustic /models/MiniCPM-o-4_5-gguf/tts/MiniCPM-o-4_5-projector-F16.gguf \
  --host 127.0.0.1 --port 8081 \
  -ngl 99 -c 8192 --no-mmap --jinja
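Since Solution B starts llama-server as a subprocess, the same arguments can be expressed as an argv list. This is a sketch of one way to do that, not the actual code in deploy_omni.py; the function names are hypothetical, and the flags and paths are copied from the launch command above.

```python
import subprocess
from pathlib import PurePosixPath

MODEL_ROOT = PurePosixPath("/models/MiniCPM-o-4_5-gguf")  # Volume mount + top-level dir


def llama_server_argv(host: str = "127.0.0.1", port: int = 8081) -> list[str]:
    """Build the llama-server argument list used in the launch command."""
    return [
        "llama-server",
        "-m", str(MODEL_ROOT / "MiniCPM-o-4_5-Q4_K_M.gguf"),
        # Only one --mmproj (vision); the audio encoder is loaded at runtime.
        "--mmproj", str(MODEL_ROOT / "vision" / "MiniCPM-o-4_5-vision-F16.gguf"),
        "--voxcpm2-base-lm", str(MODEL_ROOT / "tts" / "MiniCPM-o-4_5-tts-F16.gguf"),
        "--voxcpm2-acoustic", str(MODEL_ROOT / "tts" / "MiniCPM-o-4_5-projector-F16.gguf"),
        "--host", host, "--port", str(port),
        "-ngl", "99", "-c", "8192", "--no-mmap", "--jinja",
    ]


def start_llama_server() -> subprocess.Popen:
    """Spawn llama-server; assumes the binary is on PATH inside the container."""
    return subprocess.Popen(llama_server_argv())
```

Binding to 127.0.0.1 keeps the subprocess reachable only from the FastAPI process in the same container.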
Deploy & Test
modal deploy -m modal_deploy.deploy_omni
modal run -m modal_deploy.deploy_omni::diagnose_volume
modal run --detach -m modal_deploy.deploy_omni::test_inference
Test Results (2026-06-11, T4)
Status: ✅ Deployed to production on T4.
Production URL: https://andrew-jiabin--prego-pal-minicpm-omni-serve.modal.run
| Metric | Value |
|---|---|
| Cold start (model load) | ~9.5s |
| Chinese inference | 1.1s, 34 tokens, correct content |
| English inference | 1.3s, 50 tokens (truncated by max_tokens) |
| GPU memory (model) | ~11.6 GB |
| GPU memory (mmproj) | ~1.1 GB |
| VRAM left | ~2-3 GB |
Example outputs:
- Chinese: "你好,我是MiniCPM系列模型,由面壁智能和OpenBMB开源社区开发。..."
- English: Empty (finish_reason=length, 50 tokens of special tokens)
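The empty English result can be caught on the client side instead of being passed on as a valid answer. The helper below is a sketch that assumes the standard OpenAI-compatible response shape (choices[0].message.content and choices[0].finish_reason); the function name is hypothetical.

```python
from typing import Any


def extract_reply(response: dict[str, Any]) -> str:
    """Return the assistant text, or raise if the reply was cut off while still empty."""
    choice = response["choices"][0]
    content = (choice.get("message", {}).get("content") or "").strip()
    if not content and choice.get("finish_reason") == "length":
        # The token budget was spent before any visible text was produced.
        raise ValueError("empty reply truncated by max_tokens; retry with a higher max_tokens")
    return content
```

A caller can catch the ValueError and resend the request with a larger max_tokens value.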
Notes:
- English inference returns empty content when max_tokens is low; the model generates special tokens first (e.g. <|im_start|>assistant). Use a higher max_tokens or a better prompt to get valid English output.
- The --jinja flag is required for correct chat template formatting.
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | OpenAI compatible (chat, multimodal, streaming) |
| /v1/audio/speech | POST | TTS: text -> speech WAV |
| /v1/audio/speech/stream | POST | Streaming TTS |
| /v1/audio/transcriptions | POST | STT: speech -> text |
| /v1/embeddings | POST | Embeddings |
| /health | GET | Health check (includes audio/vision/TTS status) |
| /v1/models | GET | Model list |
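A minimal standard-library sketch of calling the chat endpoint follows. It assumes the endpoint accepts an OpenAI-style JSON body without authentication and without a model field; the default max_tokens of 256 and the function names are hypothetical choices, not values from the deployment.

```python
import json
import urllib.request

BASE_URL = "https://andrew-jiabin--prego-pal-minicpm-omni-serve.modal.run"


def build_chat_request(
    prompt: str,
    max_tokens: int = 256,  # hypothetical default, above the 50 used in the test run
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build a POST request for /v1/chat/completions without sending it."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=60) as resp:
        return json.load(resp)
```

Splitting request construction from sending makes the payload easy to inspect or log before any network call is made.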
4. Volume Management
Volume Info
- Name: minicpm-o-4_5-models
- Mount: /models
- Structure: /MiniCPM-o-4_5-gguf/
File Inventory (10 files, ~8.3 GB total)
| File | Size | Description |
|---|---|---|
| MiniCPM-o-4_5-Q4_K_M.gguf | 4.68 GB | Main model (Q4_K_M quantized) |
| vision/MiniCPM-o-4_5-vision-F16.gguf | 1.02 GB | Vision projector |
| audio/MiniCPM-o-4_5-audio-F16.gguf | 0.61 GB | Audio encoder |
| tts/MiniCPM-o-4_5-tts-F16.gguf | 1.08 GB | TTS BaseLM |
| tts/MiniCPM-o-4_5-projector-F16.gguf | 0.01 GB | TTS projector |
| token2wav-gguf/encoder.gguf | 0.14 GB | Token2wav encoder |
| token2wav-gguf/flow_extra.gguf | 0.01 GB | Flow extra |
| token2wav-gguf/flow_matching.gguf | 0.43 GB | Flow matching |
| token2wav-gguf/hifigan2.gguf | 0.08 GB | HiFiGAN decoder |
| token2wav-gguf/prompt_cache.gguf | 0.20 GB | Prompt cache |
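A check against this inventory can catch an incomplete upload before the server tries to load a missing file. The sketch below is illustrative only and is not the diagnose_volume function from deploy_omni.py; the helper name is hypothetical, and the relative paths are taken from the table.

```python
from pathlib import Path

# Relative paths from the file inventory table.
EXPECTED_FILES = [
    "MiniCPM-o-4_5-Q4_K_M.gguf",
    "vision/MiniCPM-o-4_5-vision-F16.gguf",
    "audio/MiniCPM-o-4_5-audio-F16.gguf",
    "tts/MiniCPM-o-4_5-tts-F16.gguf",
    "tts/MiniCPM-o-4_5-projector-F16.gguf",
    "token2wav-gguf/encoder.gguf",
    "token2wav-gguf/flow_extra.gguf",
    "token2wav-gguf/flow_matching.gguf",
    "token2wav-gguf/hifigan2.gguf",
    "token2wav-gguf/prompt_cache.gguf",
]


def missing_files(root: Path) -> list[str]:
    """Return the inventory entries that are not regular files under root."""
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Inside the container the Volume is mounted at /models.
    missing = missing_files(Path("/models/MiniCPM-o-4_5-gguf"))
    print("all files present" if not missing else f"missing: {missing}")
```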
Commands
modal volume put minicpm-o-4_5-models ./models/MiniCPM-o-4_5-gguf /
modal volume ls minicpm-o-4_5-models
5. CLI Quick Reference
# Deploy
modal deploy -m modal_deploy.deploy # Solution A
modal deploy -m modal_deploy.deploy_omni # Solution B
# Dev (hot-reload, solution A only)
modal serve -m modal_deploy.deploy
# Volume diagnostics
modal run -m modal_deploy.deploy_omni::diagnose_volume
# Inference testing
modal run -m modal_deploy.deploy_omni::test_inference
# Detach mode (avoids disconnect killing build)
modal run --detach -m modal_deploy.deploy_omni::diagnose_volume
# Logs
modal app logs prego-pal-minicpm
modal app logs prego-pal-minicpm-omni
# List apps & clean up stopped apps
modal app list
6. Pitfalls Log
6.1 GitHub access blocked in Modal build containers
Problem: git clone from Modal build container fails (no auth).
Fix: Use Image.add_local_dir() with copy=True to ship source from local machine.
6.2 Debian-slim has no curl/wget
Problem: Minimal base image lacks download tools.
Fix: apt_install("curl") before run_commands that need downloads.
6.3 libcuda.so.1 linking failure at build time
Problem: Build container has no NVIDIA driver, linker cannot find libcuda.so.1.
Root cause: ggml-cuda.so links CUDA::cuda_driver (libcuda.so.1) for VMM support.
Fix: -DGGML_CUDA_NO_VMM=ON disables CUDA driver linking. Runtime Modal GPU container provides the real libcuda.so.1.
6.4 CMAKE_CUDA_ARCHITECTURES syntax
Problem: 75,89 (comma-separated) causes cmake error.
Fix: -DCMAKE_CUDA_ARCHITECTURES='75;89' with semicolons and single quotes.
6.5 add_local_dir + run_commands ordering
Problem: Modal errors if run_commands comes after add_local_dir without copy=True.
Fix: .add_local_dir(..., copy=True) then .run_commands(...).
6.6 Network disconnect kills running build
Problem: Local terminal disconnect terminates remote build.
Workaround: modal run --detach submits to cloud and returns immediately.
Monitor progress on Modal web UI.
7. Cost Estimation
| GPU | Price/hour | Per request (2s) | Monthly 1000 req |
|---|---|---|---|
| T4 (Solution A+B) | $0.50/hr | $0.00028 | $0.28 |
| L4 (optional) | $0.42/hr | $0.00023 | $0.23 |
| A100 (baseline) | $1.50/hr | $0.00083 | $0.83 |
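The per-request and monthly columns follow directly from the hourly price: cost = price per hour × seconds / 3600. A short sketch that reproduces the table, using the hourly prices listed above (the function names are illustrative):

```python
# Hourly prices in USD, as listed in the table above.
HOURLY_PRICE_USD = {"T4": 0.50, "L4": 0.42, "A100": 1.50}


def cost_per_request(gpu: str, seconds: float = 2.0) -> float:
    """GPU cost of one request that occupies the GPU for `seconds`."""
    return HOURLY_PRICE_USD[gpu] * seconds / 3600.0


def monthly_cost(gpu: str, requests: int = 1000, seconds: float = 2.0) -> float:
    """GPU cost of `requests` requests of `seconds` each."""
    return cost_per_request(gpu, seconds) * requests


if __name__ == "__main__":
    for gpu in HOURLY_PRICE_USD:
        print(f"{gpu}: ${cost_per_request(gpu):.5f}/request, ${monthly_cost(gpu):.2f}/1000 requests")
```

The estimate counts only the 2 s of compute per request; it does not model cold starts or time spent idle before the container scales down.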
8. 🔴 Production Script Protection
The following files are PRODUCTION ASSETS — do not modify or delete without full rebuild verification:
| File | Reason |
|---|---|
| modal_deploy/deploy_omni.py | Main deploy script for full-duplex Solution B. Build proven working. |
| modal_deploy/llamacpp_omni/ | llama.cpp-omni source (3411 files). Exact checkout used for successful build. |
| modal_deploy/build_llama_server.sh | Scratchpad local cross-compile attempt (unused, but committed). |
Verification required before merge: modal run --detach -m modal_deploy.deploy_omni::diagnose_volume must pass.
9. File Map
| File | Description |
|---|---|
| modal_deploy/deploy.py | Solution A: FastAPI + llama-cpp-python (deployed) |
| modal_deploy/deploy_omni.py | Solution B: FastAPI + llama-server subprocess |
| modal_deploy/deploy_backup.py | Solution A backup |
| modal_deploy/llamacpp_omni/ | llama.cpp-omni source (3411 files, 156MB) |
| modal_deploy/client.py | Python API client (OpenAI compatible) |
| modal_deploy/README.md | Deploy instructions |
| modal_deploy/build_llama_server.sh | Local cross-compile script (unused) |
| docs/README_modal_deploy.md | This file - comprehensive guide |