Error: '<' not supported between instances of 'tuple' and 'float'
I am running this project with Streamlit. The page loads, but I get an error when I press the "send" button.
PS: I followed the guide in the README.
This is the error stack printed on the console and on the web page:
TypeError: '<' not supported between instances of 'tuple' and 'float'
Traceback:
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 561, in _run_script
self._session_state.on_script_will_rerun(rerun_data.widget_states)
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/streamlit/runtime/state/safe_session_state.py", line 68, in on_script_will_rerun
self._state.on_script_will_rerun(latest_widget_states)
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/streamlit/runtime/state/session_state.py", line 476, in on_script_will_rerun
self._call_callbacks()
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/streamlit/runtime/state/session_state.py", line 489, in _call_callbacks
self._new_widget_state.call_callback(wid)
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/streamlit/runtime/state/session_state.py", line 244, in call_callback
callback(*args, **kwargs)
File "moss_web_demo_streamlit.py", line 69, in generate_answer
generated_ids = model.generate(
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/transformers/generation/utils.py", line 1518, in generate
return self.greedy_search(
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/transformers/generation/utils.py", line 2285, in greedy_search
outputs = self(
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/local/modeling_moss.py", line 678, in forward
transformer_outputs = self.transformer(
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/local/modeling_moss.py", line 545, in forward
outputs = block(
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/local/modeling_moss.py", line 270, in forward
attn_outputs = self.attn(
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/local/modeling_moss.py", line 164, in forward
qkv = self.qkv_proj(hidden_states)
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/local/quantization.py", line 367, in forward
out = QuantLinearFunction.apply(x.reshape(-1, x.shape[-1]), self.qweight, self.scales,
File "/usr/local/lib/miniconda3/envs/moss/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 94, in decorate_fwd
return fwd(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/local/quantization.py", line 279, in forward
output = matmul248(input, qweight, scales, qzeros, g_idx, bits, maxq)
File "/root/.cache/huggingface/modules/transformers_modules/local/quantization.py", line 250, in matmul248
matmul_248_kernel[grid](input, qweight, output,
File "/usr/local/app/jupyterlab/moss/MOSS/models/custom_autotune.py", line 93, in run
self.cache[key] = builtins.min(timings, key=timings.get)
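
The last frame suggests that `timings` holds values of different types, so `min` cannot order them. Below is a minimal sketch that reproduces the same kind of TypeError outside the model. The dictionary contents, the config names, and the `normalize` helper are hypothetical: the traceback does not show what `custom_autotune.py` actually stores in `timings`, so treat the tuple/float mix as an assumption rather than a confirmed cause.

```python
import builtins

# Hypothetical timings table. The real contents are not visible in the
# traceback; the mix of a bare float and a tuple is an assumption.
timings = {
    "config_b": float("inf"),        # assumed: a bare float
    "config_a": (0.12, 0.10, 0.15),  # assumed: a tuple of measurements
}


def pick_best(table):
    """Same selection expression as the failing line in the traceback."""
    return builtins.min(table, key=table.get)


def normalize(value):
    """Hypothetical helper: turn a bare float into a 3-tuple so that
    every value in the table has the same type."""
    return value if isinstance(value, tuple) else (value, value, value)


try:
    pick_best(timings)
    error_message = None
except TypeError as exc:
    # Comparing a tuple with a float is not defined in Python 3.
    error_message = str(exc)

# With uniform types the selection works again.
uniform = {name: normalize(value) for name, value in timings.items()}
best = pick_best(uniform)

print(error_message)
print(best)
```

If the real `timings` does mix the two types, making every entry the same type (as `normalize` does here) is one possible direction to investigate. Whether that is the right change for `custom_autotune.py` depends on code that is not shown in this post.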