PregoPal Modal Deploy Guide

Last updated: 2026-06-12. Covers both deploy.py (pre-built wheel) and deploy_omni.py (llama.cpp-omni full-duplex).

1. Deploy Solutions Comparison

Aspect	Solution A (deploy.py)	Solution B (deploy_omni.py)
Engine	llama-cpp-python (mainline)	llama.cpp-omni (OpenBMB fork)
Capabilities	Text + Vision	Text + Vision + Audio + TTS + token2wav
Build method	Pre-built wheel (5s)	Source compile (~77min)
Image build time	45s	~77min
GPU	T4	T4 / L4
Deploy mode	@asgi_app()	@asgi_app() + llama-server subprocess
Status	Deployed (stable)	Compiled + inference verified on T4
App name	prego-pal-minicpm	prego-pal-minicpm-omni

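Solution B core difference: it runs the llama-server binary built from the OpenBMB llama.cpp-omni fork, which adds audio input and full-duplex TTS/STT on top of text and vision. The vision projector is passed via --mmproj; the audio encoder is loaded at runtime through the API.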

2. Solution A: Pre-built wheel (deploy.py)

Image Definition

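from modal import Image

_image = (
    Image.debian_slim(python_version="3.11")
    .pip_install("fastapi", "uvicorn[standard]", "httpx", "numpy", "Pillow")
    # Pre-built CUDA 12.1 wheel, so nothing is compiled during the image build.
    .pip_install(
        "llama-cpp-python",
        extra_index_url="https://ggml-org.github.io/llama-cpp-python/whl/cu121",
    )
    # Smoke test: fail the image build early if the wheel cannot be imported.
    # The outer quotes are double so the inner print('OK') does not end the string.
    .run_commands("python -c \"import llama_cpp; print('OK')\"")
)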

Key Parameters

Model: MiniCPM-o-4_5-Q4_K_M.gguf + vision mmproj
LLM: n_gpu_layers=-1, n_ctx=8192
GPU: T4 (16GB VRAM)
Concurrency: 10
Idle timeout: 300s

Deploy & Test

modal deploy -m modal_deploy.deploy
modal run -m modal_deploy.deploy::test_inference
curl https://andrew-jiabin--prego-pal-minicpm-serve.modal.run/health

3. Solution B: llama.cpp-omni source compile (deploy_omni.py)

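⚠️ This is the Production version that has compiled successfully. deploy_omni.py and its companion llamacpp_omni/ source directory must not be modified or deleted casually. Any change must pass full build verification (modal run --detach -m modal_deploy.deploy_omni::diagnose_volume builds successfully and the model loads normally) and be confirmed correct before it is merged. Current working Image ID: im-AxWdR31ZWeEDIfXIPcxAhZ (built 2026-06-11)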

Architecture

FastAPI (ASGI) <-> llama-server (subprocess, 127.0.0.1:8081)
                   |
            Modal Volume: GGUF models (vision + audio + TTS + token2wav)
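
In this layout the FastAPI wrapper can only forward requests once the llama-server subprocess is ready. Below is a minimal sketch of such a readiness gate, assuming the wrapper polls the /health endpoint on 127.0.0.1:8081. The function names, timeout values and the injectable probe are illustrative assumptions and are not taken from deploy_omni.py.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

# Host and port as shown in the architecture diagram.
LLAMA_SERVER_URL = "http://127.0.0.1:8081"


def http_probe(url: str) -> bool:
    """Return True if GET <url> answers with HTTP 200, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    timeout_s: float = 60.0,   # hypothetical default, tune to the observed load time
    interval_s: float = 0.5,
) -> bool:
    """Poll `probe` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```

A caller would typically run `wait_until_healthy(lambda: http_probe(LLAMA_SERVER_URL + "/health"))` right after starting the subprocess and refuse traffic if it returns False. Passing the probe as a callable keeps the polling loop testable without a running server.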

Source

llama.cpp-omni: the original OpenBMB fork is no longer public, so this deploy uses the tc-mb/llama.cpp-omni fork. The source is stored in modal_deploy/llamacpp_omni/ (3411 files, 156MB).

Build Dependencies (Debian 12 + CUDA 12.4)

# NVIDIA apt repo setup
curl -L -o /tmp/cuda-keyring.deb https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i /tmp/cuda-keyring.deb
apt-get update
apt-get install -y cuda-toolkit-12-4 cuda-compiler-12-4 cuda-driver-dev-12-4

CMake Flags

Flag	Purpose
`-DGGML_CUDA=ON`	CUDA backend
`-DLLAMA_BUILD_SERVER=ON`	Build llama-server
`-DLLAMA_CUDA_FORCE_MMQ=ON`	Simplify matrix multiply templates (faster compile)
`-DGGML_CUDA_NO_VMM=ON`	Skip libcuda.so.1 linking at build time
`-DCMAKE_CUDA_ARCHITECTURES='75;89'`	T4(sm_75) + L4(sm_89)
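
The flags in the table can be assembled into a configure command as an argv list. This is a sketch only: the helper name and the source/build directories are hypothetical, and deploy_omni.py may invoke cmake differently (for example through a shell string in run_commands).

```python
from typing import List, Sequence


def cmake_configure_args(
    source_dir: str,
    build_dir: str,
    cuda_archs: Sequence[str] = ("75", "89"),  # T4 (sm_75) + L4 (sm_89)
) -> List[str]:
    """Build the cmake configure command as an argv list (no shell involved)."""
    return [
        "cmake",
        "-S", source_dir,
        "-B", build_dir,
        "-DGGML_CUDA=ON",
        "-DLLAMA_BUILD_SERVER=ON",
        "-DLLAMA_CUDA_FORCE_MMQ=ON",
        "-DGGML_CUDA_NO_VMM=ON",
        # CMake lists are semicolon-separated. Inside an argv list the value
        # is a single element, so no shell quoting is required here.
        "-DCMAKE_CUDA_ARCHITECTURES=" + ";".join(cuda_archs),
    ]
```

When the same command is written as a shell string instead, the architecture list has to be quoted ('75;89') so the shell does not treat the semicolon as a command separator.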

llama-server Launch Args

# Note: --mmproj accepts only ONE file (the vision projector). The audio encoder
# is loaded at runtime via the /v1/stream/omni_prefill API, NOT via --mmproj.
# (Comments are kept above the command: a comment between backslash-continued
# lines would cut the command short.)
llama-server \
  -m /models/MiniCPM-o-4_5-gguf/MiniCPM-o-4_5-Q4_K_M.gguf \
  --mmproj /models/MiniCPM-o-4_5-gguf/vision/MiniCPM-o-4_5-vision-F16.gguf \
  --voxcpm2-base-lm /models/MiniCPM-o-4_5-gguf/tts/MiniCPM-o-4_5-tts-F16.gguf \
  --voxcpm2-acoustic /models/MiniCPM-o-4_5-gguf/tts/MiniCPM-o-4_5-projector-F16.gguf \
  --host 127.0.0.1 --port 8081 \
  -ngl 99 -c 8192 --no-mmap --jinja
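
Since Solution B starts llama-server as a subprocess, the same arguments can be expressed as a Python argv list. The sketch below mirrors the flags and paths shown above; the helper name and its parameters are assumptions for illustration and do not reproduce the actual code in deploy_omni.py.

```python
from pathlib import PurePosixPath
from typing import List

MODEL_DIR = PurePosixPath("/models/MiniCPM-o-4_5-gguf")


def llama_server_args(
    model_dir: PurePosixPath = MODEL_DIR,
    host: str = "127.0.0.1",
    port: int = 8081,
) -> List[str]:
    """Return the llama-server command line as a list suitable for subprocess."""
    tts_dir = model_dir / "tts"
    return [
        "llama-server",
        "-m", str(model_dir / "MiniCPM-o-4_5-Q4_K_M.gguf"),
        # Only the vision projector goes through --mmproj.
        "--mmproj", str(model_dir / "vision" / "MiniCPM-o-4_5-vision-F16.gguf"),
        "--voxcpm2-base-lm", str(tts_dir / "MiniCPM-o-4_5-tts-F16.gguf"),
        "--voxcpm2-acoustic", str(tts_dir / "MiniCPM-o-4_5-projector-F16.gguf"),
        "--host", host,
        "--port", str(port),
        "-ngl", "99",
        "-c", "8192",
        "--no-mmap",
        "--jinja",
    ]
```

The list can be handed to `subprocess.Popen(llama_server_args())` directly, which avoids shell quoting and line-continuation pitfalls.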

Deploy & Test

modal deploy -m modal_deploy.deploy_omni
modal run -m modal_deploy.deploy_omni::diagnose_volume
modal run --detach -m modal_deploy.deploy_omni::test_inference

Test Results (2026-06-11, T4)

Status: ✅ Deployed to production on T4.

Production URL: https://andrew-jiabin--prego-pal-minicpm-omni-serve.modal.run

Metric	Value
Cold start (model load)	~9.5s
Chinese inference	1.1s, 34 tokens, correct content
English inference	1.3s, 50 tokens (truncated by max_tokens)
GPU memory (model)	~11.6 GB
GPU memory (mmproj)	~1.1 GB
VRAM left	~2-3 GB

Example outputs:

Chinese: "你好，我是MiniCPM系列模型，由面壁智能和OpenBMB开源社区开发。..."
English: Empty (finish_reason=length, 50 tokens of special tokens)

Notes:

With a low max_tokens, English inference returns empty content: the model emits special tokens first (e.g. <|im_start|>assistant), uses up the token budget, and stops with finish_reason=length. Raise max_tokens or adjust the prompt to get valid English output.
--jinja flag is required for correct chat template formatting.
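
The empty-English case described in the notes can be detected on the client side. This is a minimal sketch, assuming the response follows the OpenAI chat completion shape (choices[0].message.content and choices[0].finish_reason); the helper name is hypothetical.

```python
from typing import Any, Dict


def is_empty_truncation(response: Dict[str, Any]) -> bool:
    """True if the first choice hit the token limit without any visible text."""
    choice = response["choices"][0]
    content = (choice.get("message") or {}).get("content") or ""
    return choice.get("finish_reason") == "length" and not content.strip()
```

A caller could retry with a larger max_tokens whenever this returns True, rather than showing an empty answer.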

API Endpoints

Endpoint	Method	Description
/v1/chat/completions	POST	OpenAI compatible (chat, multimodal, streaming)
/v1/audio/speech	POST	TTS: text -> speech WAV
/v1/audio/speech/stream	POST	Streaming TTS
/v1/audio/transcriptions	POST	STT: speech -> text
/v1/embeddings	POST	Embeddings
/health	GET	Health check (includes audio/vision/TTS status)
/v1/models	GET	Model list
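
As an illustration of the OpenAI-compatible endpoint, the sketch below builds a POST request for /v1/chat/completions using only the standard library. It assumes the endpoint needs no authentication header and accepts a request without a "model" field; neither is confirmed by this guide, and the helper name and default max_tokens are hypothetical.

```python
import json
import urllib.request

# Production URL from the Test Results section.
BASE_URL = "https://andrew-jiabin--prego-pal-minicpm-omni-serve.modal.run"


def chat_request(
    prompt: str,
    max_tokens: int = 256,  # illustrative default, higher than the 50 used in the test
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build (but do not send) a chat completion request."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the request to `urllib.request.urlopen(chat_request("Hello"), timeout=60)` and decode the body with `json.load`.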

4. Volume Management

Volume Info

Name: minicpm-o-4_5-models
Mount: /models
Structure: /MiniCPM-o-4_5-gguf/

File Inventory (10 files, ~8.3 GB total)

File	Size	Description
MiniCPM-o-4_5-Q4_K_M.gguf	4.68 GB	Main model (Q4_K_M quantized)
vision/MiniCPM-o-4_5-vision-F16.gguf	1.02 GB	Vision projector
audio/MiniCPM-o-4_5-audio-F16.gguf	0.61 GB	Audio encoder
tts/MiniCPM-o-4_5-tts-F16.gguf	1.08 GB	TTS BaseLM
tts/MiniCPM-o-4_5-projector-F16.gguf	0.01 GB	TTS projector
token2wav-gguf/encoder.gguf	0.14 GB	Token2wav encoder
token2wav-gguf/flow_extra.gguf	0.01 GB	Flow extra
token2wav-gguf/flow_matching.gguf	0.43 GB	Flow matching
token2wav-gguf/hifigan2.gguf	0.08 GB	HiFiGAN decoder
token2wav-gguf/prompt_cache.gguf	0.20 GB	Prompt cache
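
The inventory above can be turned into a simple presence check for the mounted volume. This is a sketch of the idea only: the helper is hypothetical and is not the diagnose_volume function from deploy_omni.py, whose implementation is not shown in this guide.

```python
from pathlib import Path
from typing import Iterable, List

# Relative paths copied from the file inventory table.
EXPECTED_FILES = (
    "MiniCPM-o-4_5-Q4_K_M.gguf",
    "vision/MiniCPM-o-4_5-vision-F16.gguf",
    "audio/MiniCPM-o-4_5-audio-F16.gguf",
    "tts/MiniCPM-o-4_5-tts-F16.gguf",
    "tts/MiniCPM-o-4_5-projector-F16.gguf",
    "token2wav-gguf/encoder.gguf",
    "token2wav-gguf/flow_extra.gguf",
    "token2wav-gguf/flow_matching.gguf",
    "token2wav-gguf/hifigan2.gguf",
    "token2wav-gguf/prompt_cache.gguf",
)


def missing_files(root: Path, expected: Iterable[str] = EXPECTED_FILES) -> List[str]:
    """Return the expected relative paths that are not regular files under root."""
    return [rel for rel in expected if not (root / rel).is_file()]
```

Inside the container this would be called as `missing_files(Path("/models/MiniCPM-o-4_5-gguf"))`; an empty list means every file from the table is present.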

Commands

modal volume put minicpm-o-4_5-models ./models/MiniCPM-o-4_5-gguf /
modal volume ls minicpm-o-4_5-models

5. CLI Quick Reference

# Deploy
modal deploy -m modal_deploy.deploy             # Solution A
modal deploy -m modal_deploy.deploy_omni        # Solution B

# Dev (hot-reload, solution A only)
modal serve -m modal_deploy.deploy

# Volume diagnostics
modal run -m modal_deploy.deploy_omni::diagnose_volume

# Inference testing
modal run -m modal_deploy.deploy_omni::test_inference

# Detach mode (avoids disconnect killing build)
modal run --detach -m modal_deploy.deploy_omni::diagnose_volume

# Logs
modal app logs prego-pal-minicpm
modal app logs prego-pal-minicpm-omni

# List apps
modal app list

6. Pitfalls Log

6.1 GitHub access blocked in Modal build containers

Problem: git clone from inside the Modal build container fails (no auth).
Fix: Use Image.add_local_dir() with copy=True to ship the source from the local machine.

6.2 Debian-slim has no curl/wget

Problem: The minimal base image lacks download tools.
Fix: Call apt_install("curl") before any run_commands step that needs to download.

6.3 libcuda.so.1 linking failure at build time

Problem: The build container has no NVIDIA driver, so the linker cannot find libcuda.so.1.
Root cause: ggml-cuda.so links CUDA::cuda_driver (libcuda.so.1) for VMM support.
Fix: -DGGML_CUDA_NO_VMM=ON disables CUDA driver linking. At runtime, the Modal GPU container provides the real libcuda.so.1.

6.4 CMAKE_CUDA_ARCHITECTURES syntax

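Problem: A comma-separated value (75,89) causes a cmake error.
Fix: Use -DCMAKE_CUDA_ARCHITECTURES='75;89'. CMake lists are semicolon-separated, and the single quotes keep the shell from treating the semicolon as a command separator.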

6.5 add_local_dir + run_commands ordering

Problem: Modal errors if run_commands comes after add_local_dir without copy=True.
Fix: Call .add_local_dir(..., copy=True) first, then .run_commands(...).

6.6 Network disconnect kills running build

Problem: A local terminal disconnect terminates the remote build.
Workaround: modal run --detach keeps the remote run alive after the local client disconnects. Monitor progress in the Modal web UI.

7. Cost Estimation

GPU	Price/hour	Per request (2s)	Monthly 1000 req
T4 (Solution A+B)	$0.50/hr	$0.00028	$0.28
L4 (optional)	$0.42/hr	$0.00023	$0.23
A100 (baseline)	$1.50/hr	$0.00083	$0.83
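
The per-request and monthly columns follow from the hourly price with simple arithmetic. The sketch below reproduces it, assuming billing covers only the 2 s of compute per request; idle time before the container scales down and cold starts are not included, so real bills may be higher. The function names are illustrative.

```python
def cost_per_request(price_per_hour: float, seconds: float = 2.0) -> float:
    """Cost of one request that occupies the GPU for `seconds`."""
    return price_per_hour / 3600.0 * seconds


def monthly_cost(price_per_hour: float, requests: int = 1000, seconds: float = 2.0) -> float:
    """Cost of `requests` requests per month at the same per-request duration."""
    return cost_per_request(price_per_hour, seconds) * requests


# Hourly prices as listed in the table above.
for name, price in [("T4", 0.50), ("L4", 0.42), ("A100", 1.50)]:
    print(f"{name}: ${cost_per_request(price):.5f}/req, ${monthly_cost(price):.2f}/month")
```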

8. 🔴 Production Script Protection

The following files are PRODUCTION ASSETS — do not modify or delete without full rebuild verification:

File	Reason
`modal_deploy/deploy_omni.py`	Main deploy script for full-duplex Solution B. Build proven working.
`modal_deploy/llamacpp_omni/`	llama.cpp-omni source (3411 files). Exact checkout used for successful build.
`modal_deploy/build_llama_server.sh`	Scratchpad local cross-compile attempt (unused, but committed).

Verification required before merge: modal run --detach -m modal_deploy.deploy_omni::diagnose_volume must pass.

9. File Map

File	Description
modal_deploy/deploy.py	Solution A: FastAPI + llama-cpp-python (deployed)
modal_deploy/deploy_omni.py	Solution B: FastAPI + llama-server subprocess
modal_deploy/deploy_backup.py	Solution A backup
modal_deploy/llamacpp_omni/	llama.cpp-omni source (3411 files, 156MB)
modal_deploy/client.py	Python API client (OpenAI compatible)
modal_deploy/README.md	Deploy instructions
modal_deploy/build_llama_server.sh	Local cross-compile script (unused)
docs/README_modal_deploy.md	This file - comprehensive guide