Instructions for using OpenMOSS-Team/moss-moon-003-sft-plugin-int4 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use OpenMOSS-Team/moss-moon-003-sft-plugin-int4 with Transformers:

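```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="OpenMOSS-Team/moss-moon-003-sft-plugin-int4", trust_remote_code=True)

# Load the tokenizer and model directly (both ship custom code, hence trust_remote_code=True)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("OpenMOSS-Team/moss-moon-003-sft-plugin-int4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("OpenMOSS-Team/moss-moon-003-sft-plugin-int4", trust_remote_code=True, device_map="auto")
```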

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use OpenMOSS-Team/moss-moon-003-sft-plugin-int4 with vLLM:

Install from pip and serve model

```shell
# Install vLLM from pip:
pip install vllm
# Start the vLLM server (the model ships custom code, hence --trust-remote-code):
vllm serve "OpenMOSS-Team/moss-moon-003-sft-plugin-int4" --trust-remote-code
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OpenMOSS-Team/moss-moon-003-sft-plugin-int4",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
```
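The same OpenAI-compatible request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the JSON body of the curl call. The helper names `build_completion_request` and `send` are hypothetical, and the `choices[0]["text"]` response field assumes the server follows the OpenAI completions response format.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=512, temperature=0.5):
    """Build a POST request for an OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON reply (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the vLLM server above running, `send(build_completion_request("http://localhost:8000", "OpenMOSS-Team/moss-moon-003-sft-plugin-int4", "Once upon a time,"))["choices"][0]["text"]` would return the generated text. Passing `"http://localhost:30000"` targets the SGLang server below instead.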

Use Docker

```shell
docker model run hf.co/OpenMOSS-Team/moss-moon-003-sft-plugin-int4
```

SGLang

How to use OpenMOSS-Team/moss-moon-003-sft-plugin-int4 with SGLang:

Install from pip and serve model

```shell
# Install SGLang from pip:
pip install sglang
# Start the SGLang server (the model ships custom code, hence --trust-remote-code):
python3 -m sglang.launch_server \
    --model-path "OpenMOSS-Team/moss-moon-003-sft-plugin-int4" \
    --host 0.0.0.0 \
    --port 30000 \
    --trust-remote-code
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OpenMOSS-Team/moss-moon-003-sft-plugin-int4",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
```

Use Docker images

```shell
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "OpenMOSS-Team/moss-moon-003-sft-plugin-int4" \
        --host 0.0.0.0 \
        --port 30000 \
        --trust-remote-code
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "OpenMOSS-Team/moss-moon-003-sft-plugin-int4",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
```

Docker Model Runner
How to use OpenMOSS-Team/moss-moon-003-sft-plugin-int4 with Docker Model Runner:
```
docker model run hf.co/OpenMOSS-Team/moss-moon-003-sft-plugin-int4
```

Error: '<' not supported between instances of 'tuple' and 'float'

by hua0424 - opened Apr 26, 2023

Discussion

hua0424

Apr 26, 2023

I ran this project with Streamlit. The page loads, but I get an error when I press the "send" button.

PS: I followed the guide in the README.
Here is the stack trace printed on both the console and the web page:

```
TypeError: '<' not supported between instances of 'tuple' and 'float'
Traceback:
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 561, in _run_script
    self._session_state.on_script_will_rerun(rerun_data.widget_states)
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/streamlit/runtime/state/safe_session_state.py", line 68, in on_script_will_rerun
    self._state.on_script_will_rerun(latest_widget_states)
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/streamlit/runtime/state/session_state.py", line 476, in on_script_will_rerun
    self._call_callbacks()
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/streamlit/runtime/state/session_state.py", line 489, in _call_callbacks
    self._new_widget_state.call_callback(wid)
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/streamlit/runtime/state/session_state.py", line 244, in call_callback
    callback(*args, **kwargs)
File "moss_web_demo_streamlit.py", line 69, in generate_answer
    generated_ids = model.generate(
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/transformers/generation/utils.py", line 1518, in generate
    return self.greedy_search(
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/transformers/generation/utils.py", line 2285, in greedy_search
    outputs = self(
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/local/modeling_moss.py", line 678, in forward
    transformer_outputs = self.transformer(
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/local/modeling_moss.py", line 545, in forward
    outputs = block(
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/local/modeling_moss.py", line 270, in forward
    attn_outputs = self.attn(
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/local/modeling_moss.py", line 164, in forward
    qkv = self.qkv_proj(hidden_states)
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/local/quantization.py", line 367, in forward
    out = QuantLinearFunction.apply(x.reshape(-1, x.shape[-1]), self.qweight, self.scales,
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 94, in decorate_fwd
    return fwd(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/local/quantization.py", line 279, in forward
    output = matmul248(input, qweight, scales, qzeros, g_idx, bits, maxq)
File "/root/.cache/huggingface/modules/transformers_modules/local/quantization.py", line 250, in matmul248
    matmul_248_kernel[grid](input, qweight, output,
File "/usr/local/app/jupyterlab/moss/MOSS/models/custom_autotune.py", line 93, in run
    self.cache[key] = builtins.min(timings, key=timings.get)
```
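The last frame matters most. `builtins.min(timings, key=timings.get)` picks the fastest kernel config by comparing the values stored in the `timings` dict. The error therefore means that dict holds a mix of tuples and floats. The sketch below reproduces the failure with made-up timings and shows one possible workaround: collapse every result to a single float before comparing.

Which code path produces the float and which produces the tuple is an assumption. One possibility is a failed benchmark recorded as a bare `float('inf')`. Another is a Triton version whose benchmarking helper returns a different type. The traceback alone does not say. The config names and the helpers `pick_best`, `as_float`, and `pick_best_normalized` are hypothetical and not part of MOSS.

```python
import builtins

# Hypothetical autotuner results keyed by config name. Assumption: one config's
# result is a bare float and another's is a tuple of timing percentiles.
timings = {
    "config_a": float("inf"),
    "config_b": (0.12, 0.11, 0.14),
}


def pick_best(timings):
    # Same pattern as custom_autotune.py line 93. min() evaluates
    # `new_value < current_min`, so a tuple compared against a float raises
    # TypeError: '<' not supported between instances of 'tuple' and 'float'.
    return builtins.min(timings, key=timings.get)


def as_float(result):
    # Collapse a tuple of percentiles to its first entry; pass floats through.
    return float(result[0]) if isinstance(result, tuple) else float(result)


def pick_best_normalized(timings):
    # Compare every config on a single float so the value types never mix.
    return builtins.min(timings, key=lambda cfg: as_float(timings[cfg]))


try:
    pick_best(timings)
except TypeError as err:
    print(f"TypeError: {err}")

print(pick_best_normalized(timings))
```

Whether the first tuple entry is the right statistic to compare depends on what the benchmark actually returns. Here it is assumed to be the median.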
