LegalContractAnalyzer / model_serving /SERVE_YOUR_OWN_MODEL.md

1. Installation

1.1 Install FastChat

FastChat is the backend server that can run multiple model workers and serve them through an OpenAI-compatible API.

# Create and activate virtual environment (optional but recommended)
conda create -n fastchat python=3.10 -y
conda activate fastchat

# Install FastChat
pip install fschat

Tip: If you want GPU acceleration, make sure PyTorch with CUDA is installed before installing FastChat:

pip install torch --index-url https://download.pytorch.org/whl/cu121

1.2 Install ngrok

ngrok lets you expose your local FastChat API to the internet.

curl -sSL https://ngrok-agent.s3.amazonaws.com/ngrok.asc \
  | sudo tee /etc/apt/trusted.gpg.d/ngrok.asc >/dev/null \
  && echo "deb https://ngrok-agent.s3.amazonaws.com bookworm main" \
  | sudo tee /etc/apt/sources.list.d/ngrok.list \
  && sudo apt update \
  && sudo apt install ngrok

If you have trouble installing ngrok this way, see the official downloads page: https://ngrok.com/downloads/

Log in to your ngrok account and register your auth token:

ngrok config add-authtoken <YOUR_AUTH_TOKEN>
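Once the API server is up (step 3 below serves it on port 8000 by default), you can also open a tunnel manually. The fixed-domain form assumes a recent ngrok v3 agent and uses the reserved-domain example from section 3:

```shell
# Tunnel the local API port (8000 is the default used later in this guide).
ngrok http 8000

# With a reserved domain (must be claimed on your ngrok dashboard first):
ngrok http 8000 --url=https://mycustomtunnel.ngrok-free.app
```

Without a reserved domain, ngrok prints a random public URL each time the tunnel starts.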

2. 🖥️ Configurable FastChat Run Script

In the model_serving/ folder, open the file serve_models.sh and make it executable:

chmod +x serve_models.sh
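If you want to adapt the script, FastChat's usual layout is one controller, one worker per model, and an OpenAI-compatible API server in front. The sketch below is a hypothetical reconstruction of that layout based on the usage examples in the next section, not the actual contents of serve_models.sh; the default ports (21001–21003) are assumptions:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of serve_models.sh; the real script may differ.
# Positional args mirror the usage examples in section 3.
CHAT_PATH=${1:-Qwen/Qwen3-0.6B};  CHAT_NAME=${2:-Qwen3-0.6B};  CHAT_PORT=${3:-21002}
EMB_PATH=${4:-Qwen/Qwen3-Embedding-0.6B}; EMB_NAME=${5:-Qwen3-Embedding-0.6B}; EMB_PORT=${6:-21003}
API_PORT=${7:-8000}

# 1) Controller: tracks worker registrations.
python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001 &

# 2) One worker per model, each on its own port;
#    --worker-address must match the worker's host:port (see Notes).
python3 -m fastchat.serve.model_worker --model-path "$CHAT_PATH" \
  --model-names "$CHAT_NAME" --port "$CHAT_PORT" \
  --worker-address "http://localhost:$CHAT_PORT" &
python3 -m fastchat.serve.model_worker --model-path "$EMB_PATH" \
  --model-names "$EMB_NAME" --port "$EMB_PORT" \
  --worker-address "http://localhost:$EMB_PORT" &

# 3) OpenAI-compatible API server in front of the controller.
python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port "$API_PORT"
```

The actual script additionally takes an ngrok URL as its final argument and starts the tunnel for you.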

3. Usage Examples

Run with defaults (Qwen3-0.6B + Qwen3-Embedding-0.6B)

./model_serving/serve_models.sh

Run with custom models, ports, and ngrok URL

./model_serving/serve_models.sh Qwen/Qwen2-7B Qwen2-7B 21010 \
                  Qwen/Qwen2-Embedding Qwen2-Embedding 21011 \
                  8000 https://mycustomtunnel.ngrok-free.app

This will:

  • Run Qwen2-7B chat model on port 21010.
  • Run Qwen2-Embedding embedding model on port 21011.
  • Serve API on port 8000.
  • Tunnel via the given ngrok URL.

4. 🔍 Testing the API

List all models:

curl https://YOUR_NGROK_URL/v1/models

You can also open it in a browser, for example: https://glowing-workable-arachnid.ngrok-free.app/v1/models

Get embeddings:

curl https://YOUR_NGROK_URL/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-Embedding-0.6B",
    "input": "FastChat is running two models now!"
  }'
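The response follows the OpenAI embeddings shape, with the vector at .data[0].embedding. A quick sanity check on the dimensionality, shown here against a trimmed sample response rather than real server output:

```shell
# Trimmed sample of an OpenAI-style embeddings response.
response='{"object":"list","data":[{"index":0,"embedding":[0.1,-0.2,0.3]}]}'

# Vector dimensionality (3 in this toy sample; the real model returns more).
echo "$response" | jq '.data[0].embedding | length'
```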

Chat completion:

curl https://YOUR_NGROK_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello from FastChat!"}]
  }'
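In the OpenAI chat response shape, the reply text lives at .choices[0].message.content, so piping the response through jq extracts it. The JSON here is a trimmed sample, not real server output:

```shell
# Trimmed sample of an OpenAI-style chat completion response.
response='{"choices":[{"index":0,"message":{"role":"assistant","content":"Hello!"}}]}'

# Pull out just the assistant's reply.
echo "$response" | jq -r '.choices[0].message.content'
```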

5. Notes

  • Always set different ports for each worker.
  • --worker-address must match the worker’s host:port so FastChat doesn’t overwrite registrations.
  • The ngrok free plan requires reserving a subdomain before you can pass a fixed --url. Claim your free subdomain on the ngrok website; otherwise, every tunnel you start gets a random public URL.
  • Feel free to contact me if you need help ;) I'll be glad to.