LegalContractAnalyzer / model_serving /SERVE_YOUR_OWN_MODEL.md

1. Installation

1.1 Install FastChat

FastChat is the backend server that can run multiple model workers and serve them through an OpenAI-compatible API.

# Create and activate virtual environment (optional but recommended)
conda create -n fastchat python=3.10 -y
conda activate fastchat

# Install FastChat
pip install fschat

Tip: If you want GPU acceleration, make sure PyTorch with CUDA is installed before installing FastChat:

pip install torch --index-url https://download.pytorch.org/whl/cu121

1.2 Install ngrok

ngrok lets you expose your local FastChat API to the internet.

curl -sSL https://ngrok-agent.s3.amazonaws.com/ngrok.asc \
  | sudo tee /etc/apt/trusted.gpg.d/ngrok.asc >/dev/null \
  && echo "deb https://ngrok-agent.s3.amazonaws.com bookworm main" \
  | sudo tee /etc/apt/sources.list.d/ngrok.list \
  && sudo apt update \
  && sudo apt install ngrok

If you have trouble installing ngrok this way, see the official downloads page: https://ngrok.com/downloads/

Log in to your ngrok account and register your auth token:

ngrok config add-authtoken <YOUR_AUTH_TOKEN>
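Once the API server is up (step 3 below serves it on port 8000 by default), you can also open a tunnel manually. The fixed-domain form assumes a recent ngrok v3 agent and uses the reserved-domain example from section 3:

```shell
# Tunnel the local API port (8000 is the default used later in this guide).
ngrok http 8000

# With a reserved domain (must be claimed on your ngrok dashboard first):
ngrok http 8000 --url=https://mycustomtunnel.ngrok-free.app
```

Without a reserved domain, ngrok prints a random public URL each time the tunnel starts.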

2. 🖥️ Configurable FastChat Run Script

In the model_serving/ folder, open the file serve_models.sh and make it executable:

chmod +x serve_models.sh
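If you want to adapt the script, FastChat's usual layout is one controller, one worker per model, and an OpenAI-compatible API server in front. The sketch below is a hypothetical reconstruction of that layout based on the usage examples in the next section, not the actual contents of serve_models.sh; the default ports (21001–21003) are assumptions:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of serve_models.sh; the real script may differ.
# Positional args mirror the usage examples in section 3.
CHAT_PATH=${1:-Qwen/Qwen3-0.6B};  CHAT_NAME=${2:-Qwen3-0.6B};  CHAT_PORT=${3:-21002}
EMB_PATH=${4:-Qwen/Qwen3-Embedding-0.6B}; EMB_NAME=${5:-Qwen3-Embedding-0.6B}; EMB_PORT=${6:-21003}
API_PORT=${7:-8000}

# 1) Controller: tracks worker registrations.
python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001 &

# 2) One worker per model, each on its own port;
#    --worker-address must match the worker's host:port (see Notes).
python3 -m fastchat.serve.model_worker --model-path "$CHAT_PATH" \
  --model-names "$CHAT_NAME" --port "$CHAT_PORT" \
  --worker-address "http://localhost:$CHAT_PORT" &
python3 -m fastchat.serve.model_worker --model-path "$EMB_PATH" \
  --model-names "$EMB_NAME" --port "$EMB_PORT" \
  --worker-address "http://localhost:$EMB_PORT" &

# 3) OpenAI-compatible API server in front of the controller.
python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port "$API_PORT"
```

The actual script additionally takes an ngrok URL as its final argument and starts the tunnel for you.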

3. Usage Examples

Run with defaults (Qwen3-0.6B + Qwen3-Embedding-0.6B)

./model_serving/serve_models.sh

Run with custom models, ports, and ngrok URL

./model_serving/serve_models.sh Qwen/Qwen2-7B Qwen2-7B 21010 \
                  Qwen/Qwen2-Embedding Qwen2-Embedding 21011 \
                  8000 https://mycustomtunnel.ngrok-free.app

This will:

  • Run Qwen2-7B chat model on port 21010.
  • Run Qwen2-Embedding embedding model on port 21011.
  • Serve API on port 8000.
  • Tunnel via the given ngrok URL.

4. 🔍 Testing the API

List all models:

curl https://YOUR_NGROK_URL/v1/models

You can also open it in a browser, for example: https://glowing-workable-arachnid.ngrok-free.app/v1/models

Get embeddings:

curl https://YOUR_NGROK_URL/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-Embedding-0.6B",
    "input": "FastChat is running two models now!"
  }'
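The response follows the OpenAI embeddings shape, with the vector at .data[0].embedding. A quick sanity check on the dimensionality, shown here against a trimmed sample response rather than real server output:

```shell
# Trimmed sample of an OpenAI-style embeddings response.
response='{"object":"list","data":[{"index":0,"embedding":[0.1,-0.2,0.3]}]}'

# Vector dimensionality (3 in this toy sample; the real model returns more).
echo "$response" | jq '.data[0].embedding | length'
```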

Chat completion:

curl https://YOUR_NGROK_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello from FastChat!"}]
  }'
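In the OpenAI chat response shape, the reply text lives at .choices[0].message.content, so piping the response through jq extracts it. The JSON here is a trimmed sample, not real server output:

```shell
# Trimmed sample of an OpenAI-style chat completion response.
response='{"choices":[{"index":0,"message":{"role":"assistant","content":"Hello!"}}]}'

# Pull out just the assistant's reply.
echo "$response" | jq -r '.choices[0].message.content'
```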

5. Notes

  • Always set different ports for each worker.
  • --worker-address must match the worker’s host:port so FastChat doesn’t overwrite registrations.
  • The ngrok free plan requires reserving a subdomain before you can pass a fixed --url. Claim your free subdomain on the ngrok website; otherwise, every tunnel you start gets a random public URL.
  • Feel free to contact me if you need help ;) I'll be glad to.