Spaces:

build-small-hackathon
/

PregoPal

Runtime error

App Files Files Community

PregoPal / docs /README_modal_deploy.md

J.B-Lin

refactor: deprecate llama-omni-server, use llama-server built-in omni endpoints

6451051 23 days ago

preview code

Raw

History Blame Contribute Delete

9.96 kB

	# PregoPal Modal Deploy Guide

	> Final update: 2026-06-12
	> Covers both deploy.py (pre-built wheel) and deploy_omni.py (llama.cpp-omni full-duplex)

	---

	## 1. Deploy Solutions Comparison

	\| Aspect \| Solution A (deploy.py) \| Solution B (deploy_omni.py) \|
	\|--------\|----------------------\|------------------------------\|
	\| Engine \| llama-cpp-python (mainline) \| llama.cpp-omni (OpenBMB fork) \|
	\| Capabilities \| Text + Vision \| Text + Vision + Audio + TTS + token2wav \|
	\| Build method \| Pre-built wheel (5s) \| Source compile (~77min) \|
	\| Image build time \| 45s \| ~77min \|
	\| GPU \| T4 \| T4 / L4 \|
	\| Deploy mode \| @asgi_app() \| @asgi_app() + llama-server subprocess \|
	\| Status \| Deployed (stable) \| Compiled + inference verified on T4 \|
	\| App name \| prego-pal-minicpm \| prego-pal-minicpm-omni \|

	Solution B core difference: Uses OpenBMB branch llama-server binary, supports multiple mmproj (vision + audio) and full-duplex TTS/STT.

	---

	## 2. Solution A: Pre-built wheel (deploy.py)

	### Image Definition
	```python
	_image = (
	Image.debian_slim(python_version="3.11")
	.pip_install("fastapi", "uvicorn[standard]", "httpx", "numpy", "Pillow")
	.pip_install(
	"llama-cpp-python",
	extra_index_url="https://ggml-org.github.io/llama-cpp-python/whl/cu121",
	)
	.run_commands('python -c "import llama_cpp; print('OK')"')
	)
	```

	### Key Parameters
	- Model: MiniCPM-o-4_5-Q4_K_M.gguf + vision mmproj
	- LLM: n_gpu_layers=-1, n_ctx=8192
	- GPU: T4 (16GB VRAM)
	- Concurrency: 10
	- Idle timeout: 300s

	### Deploy & Test
	```bash
	modal deploy modal_deploy.deploy
	modal run modal_deploy.deploy::test_inference
	curl https://andrew-jiabin--prego-pal-minicpm-serve.modal.run/health
	```

	---

	## 3. Solution B: llama.cpp-omni source compile (deploy_omni.py)

	> ⚠️ 这是已编译成功的 Production 版本。deploy_omni.py 及其配套的 llamacpp_omni/ 源码目录严禁随意修改或删除。
	> 任何修改必须经过完整编译验证（`modal run --detach -m modal_deploy.deploy_omni::diagnose_volume` 编译通过 + 模型加载正常），确认无误后方可合并。
	> 当前成功的 Image ID: `im-AxWdR31ZWeEDIfXIPcxAhZ`（2026-06-11 编译）

	### Architecture
	```
	FastAPI (ASGI) <-> llama-server (subprocess, 127.0.0.1:8081)
	\|
	Modal Volume: GGUF models (vision + audio + TTS + token2wav)
	```

	### Source
	llama.cpp-omni (OpenBMB fork, no longer public). Using tc-mb/llama.cpp-omni fork.
	Stored in: modal_deploy/llamacpp_omni/ (3411 files, 156MB).

	### Build Dependencies (Debian 12 + CUDA 12.4)
	```bash
	# NVIDIA apt repo setup
	curl -L -o /tmp/cuda-keyring.deb https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
	dpkg -i /tmp/cuda-keyring.deb
	apt-get update
	apt-get install -y cuda-toolkit-12-4 cuda-compiler-12-4 cuda-driver-dev-12-4
	```

	### CMake Flags
	\| Flag \| Purpose \|
	\|------\|---------\|
	\| `-DGGML_CUDA=ON` \| CUDA backend \|
	\| `-DLLAMA_BUILD_SERVER=ON` \| Build llama-server \|
	\| `-DLLAMA_CUDA_FORCE_MMQ=ON` \| Simplify matrix multiply templates (faster compile) \|
	\| `-DGGML_CUDA_NO_VMM=ON` \| Skip libcuda.so.1 linking at build time \|
	\| `-DCMAKE_CUDA_ARCHITECTURES='75;89'` \| T4(sm_75) + L4(sm_89) \|

	### llama-server Launch Args
	```bash
	llama-server \
	-m /models/MiniCPM-o-4_5-gguf/MiniCPM-o-4_5-Q4_K_M.gguf \
	--mmproj /models/MiniCPM-o-4_5-gguf/vision/MiniCPM-o-4_5-vision-F16.gguf \
	# Note: --mmproj only accepts ONE file (vision). Audio encoder is
	# loaded via /v1/stream/omni_prefill API at runtime, NOT via --mmproj.
	--voxcpm2-base-lm /models/MiniCPM-o-4_5-gguf/tts/MiniCPM-o-4_5-tts-F16.gguf \
	--voxcpm2-acoustic /models/MiniCPM-o-4_5-gguf/tts/MiniCPM-o-4_5-projector-F16.gguf \
	--host 127.0.0.1 --port 8081 \
	-ngl 99 -c 8192 --no-mmap --jinja
	```

	### Deploy & Test
	```bash
	modal deploy modal_deploy.deploy_omni
	modal run -m modal_deploy.deploy_omni::diagnose_volume
	modal run --detach -m modal_deploy.deploy_omni::test_inference
	```

	### Test Results (2026-06-11, T4)

	Status: ✅ Deployed to production on T4.

	Production URL: `https://andrew-jiabin--prego-pal-minicpm-omni-serve.modal.run`

	\| Metric \| Value \|
	\|--------\|-------\|
	\| Cold start (model load) \| ~9.5s \|
	\| Chinese inference \| 1.1s, 34 tokens, correct content \|
	\| English inference \| 1.3s, 50 tokens (truncated by max_tokens) \|
	\| GPU memory (model) \| ~11.6 GB \|
	\| GPU memory (mmproj) \| ~1.1 GB \|
	\| VRAM left \| ~2-3 GB \|

	Example outputs:
	- Chinese: "你好，我是MiniCPM系列模型，由面壁智能和OpenBMB开源社区开发。..."
	- English: Empty (finish_reason=length, 50 tokens of special tokens)

	Notes:
	- English inference returns empty content when max_tokens is low; the model generates special tokens first (e.g. `<\|im_start\|>assistant`). Use higher max_tokens or better prompt to get valid English output.
	- `--jinja` flag is required for correct chat template formatting.

	### API Endpoints
	\| Endpoint \| Method \| Description \|
	\|----------\|--------\|-------------\|
	\| /v1/chat/completions \| POST \| OpenAI compatible (chat, multimodal, streaming) \|
	\| /v1/audio/speech \| POST \| TTS: text -> speech WAV \|
	\| /v1/audio/speech/stream \| POST \| Streaming TTS \|
	\| /v1/audio/transcriptions \| POST \| STT: speech -> text \|
	\| /v1/embeddings \| POST \| Embeddings \|
	\| /health \| GET \| Health check (includes audio/vision/TTS status) \|
	\| /v1/models \| GET \| Model list \|

	---

	## 4. Volume Management

	### Volume Info
	- Name: `minicpm-o-4_5-models`
	- Mount: `/models`
	- Structure: `/MiniCPM-o-4_5-gguf/`

	### File Inventory (10 files, ~7.8 GB total)

	\| File \| Size \| Description \|
	\|------\|------\|-------------\|
	\| MiniCPM-o-4_5-Q4_K_M.gguf \| 4.68 GB \| Main model (Q4_K_M quantized) \|
	\| vision/MiniCPM-o-4_5-vision-F16.gguf \| 1.02 GB \| Vision projector \|
	\| audio/MiniCPM-o-4_5-audio-F16.gguf \| 0.61 GB \| Audio encoder \|
	\| tts/MiniCPM-o-4_5-tts-F16.gguf \| 1.08 GB \| TTS BaseLM \|
	\| tts/MiniCPM-o-4_5-projector-F16.gguf \| 0.01 GB \| TTS projector \|
	\| token2wav-gguf/encoder.gguf \| 0.14 GB \| Token2wav encoder \|
	\| token2wav-gguf/flow_extra.gguf \| 0.01 GB \| Flow extra \|
	\| token2wav-gguf/flow_matching.gguf \| 0.43 GB \| Flow matching \|
	\| token2wav-gguf/hifigan2.gguf \| 0.08 GB \| HiFiGAN decoder \|
	\| token2wav-gguf/prompt_cache.gguf \| 0.20 GB \| Prompt cache \|

	### Commands
	```bash
	modal volume put minicpm-o-4_5-models ./models/MiniCPM-o-4_5-gguf /
	modal volume ls minicpm-o-4_5-models
	```

	---

	## 5. CLI Quick Reference

	```bash
	# Deploy
	modal deploy modal_deploy.deploy # Solution A
	modal deploy modal_deploy.deploy_omni # Solution B

	# Dev (hot-reload, solution A only)
	modal serve modal_deploy.deploy

	# Volume diagnostics
	modal run -m modal_deploy.deploy_omni::diagnose_volume

	# Inference testing
	modal run -m modal_deploy.deploy_omni::test_inference

	# Detach mode (avoids disconnect killing build)
	modal run --detach -m modal_deploy.deploy_omni::diagnose_volume

	# Logs
	modal app logs prego-pal-minicpm
	modal app logs prego-pal-minicpm-omni

	# List apps & clean up stopped apps
	modal app list
	```

	---

	## 6. Pitfalls Log

	### 6.1 GitHub access blocked in Modal build containers
	Problem: `git clone` from Modal build container fails (no auth).
	Fix: Use `Image.add_local_dir()` with `copy=True` to ship source from local machine.

	### 6.2 Debian-slim has no curl/wget
	Problem: Minimal base image lacks download tools.
	Fix: `apt_install("curl")` before `run_commands` that need downloads.

	### 6.3 libcuda.so.1 linking failure at build time
	Problem: Build container has no NVIDIA driver, linker cannot find libcuda.so.1.
	Root cause: ggml-cuda.so links CUDA::cuda_driver (libcuda.so.1) for VMM support.
	Fix: `-DGGML_CUDA_NO_VMM=ON` disables CUDA driver linking. Runtime Modal GPU container provides the real libcuda.so.1.

	### 6.4 CMAKE_CUDA_ARCHITECTURES syntax
	Problem: `75,89` (comma-separated) causes cmake error.
	Fix: `-DCMAKE_CUDA_ARCHITECTURES='75;89'` with semicolons and single quotes.

	### 6.5 add_local_dir + run_commands ordering
	Problem: Modal errors if `run_commands` comes after `add_local_dir` without `copy=True`.
	Fix: `.add_local_dir(..., copy=True)` then `.run_commands(...)`.

	### 6.6 Network disconnect kills running build
	Problem: Local terminal disconnect terminates remote build.
	Workaround: `modal run --detach` submits to cloud and returns immediately.
	Monitor progress on Modal web UI.

	---

	## 7. Cost Estimation

	\| GPU \| Price/hour \| Per request (2s) \| Monthly 1000 req \|
	\|-----\|-----------\|-----------------\|-----------------\|
	\| T4 (Solution A+B) \| $0.50/hr \| $0.00028 \| $0.28 \|
	\| L4 (optional) \| $0.42/hr \| $0.00023 \| $0.23 \|
	\| A100 (baseline) \| $1.50/hr \| $0.00083 \| $0.83 \|

	---

	---

	## 8. 🔴 Production Script Protect

	The following files are PRODUCTION ASSETS — do not modify or delete without full rebuild verification:

	\| File \| Reason \|
	\|------\|--------\|
	\| `modal_deploy/deploy_omni.py` \| Main deploy script for full-duplex Solution B. Build proven working. \|
	\| `modal_deploy/llamacpp_omni/` \| llama.cpp-omni source (3411 files). Exact checkout used for successful build. \|
	\| `modal_deploy/build_llama_server.sh` \| Scratchpad local cross-compile attempt (unused, but committed). \|

	Verification required before merge: `modal run --detach -m modal_deploy.deploy_omni::diagnose_volume` must pass.

	---

	## 9. File Map

	\| File \| Description \|
	\|------\|-------------\|
	\| modal_deploy/deploy.py \| Solution A: FastAPI + llama-cpp-python (deployed) \|
	\| modal_deploy/deploy_omni.py \| Solution B: FastAPI + llama-server subprocess \|
	\| modal_deploy/deploy_backup.py \| Solution A backup \|
	\| modal_deploy/llamacpp_omni/ \| llama.cpp-omni source (3411 files, 156MB) \|
	\| modal_deploy/client.py \| Python API client (OpenAI compatible) \|
	\| modal_deploy/README.md \| Deploy instructions \|
	\| modal_deploy/build_llama_server.sh \| Local cross-compile script (unused) \|
	\| docs/README_modal_deploy.md \| This file - comprehensive guide \|