Instructions to use Motif-Technologies/Motif-2-12.7B-Reasoning with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Motif-Technologies/Motif-2-12.7B-Reasoning with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Motif-Technologies/Motif-2-12.7B-Reasoning", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Motif-Technologies/Motif-2-12.7B-Reasoning", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use Motif-Technologies/Motif-2-12.7B-Reasoning with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Motif-Technologies/Motif-2-12.7B-Reasoning"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Motif-Technologies/Motif-2-12.7B-Reasoning",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
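
The same request can also be sent from Python. The sketch below uses only the standard library and assumes a server from the previous step is listening on localhost:8000. `build_chat_request` and `send` are helper names made up for this illustration; they are not part of vLLM.

```python
import json
import urllib.request

MODEL = "Motif-Technologies/Motif-2-12.7B-Reasoning"


def build_chat_request(user_text, base_url="http://localhost:8000", model=MODEL):
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON body (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `send(build_chat_request("What is the capital of France?"))` should return a chat-completion object; in the OpenAI response format the reply text sits under `["choices"][0]["message"]["content"]`.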

Use Docker

docker model run hf.co/Motif-Technologies/Motif-2-12.7B-Reasoning

SGLang

How to use Motif-Technologies/Motif-2-12.7B-Reasoning with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Motif-Technologies/Motif-2-12.7B-Reasoning" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Motif-Technologies/Motif-2-12.7B-Reasoning",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Motif-Technologies/Motif-2-12.7B-Reasoning" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Motif-Technologies/Motif-2-12.7B-Reasoning",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use Motif-Technologies/Motif-2-12.7B-Reasoning with Docker Model Runner:
```
docker model run hf.co/Motif-Technologies/Motif-2-12.7B-Reasoning
```

Is vLLM not supported?

by 2c6829 - opened Dec 10, 2025

Discussion

2c6829

Dec 10, 2025

I heard this model was made by a Korean company, so I wanted to try it out.
I rented an L4 on Google Colab Pro to run it, but it isn't working.

!pip uninstall -y vllm
!pip install vllm==0.10.2

from vllm import LLM, SamplingParams
import torch

# The L4 GPU works best with bfloat16.
llm = LLM(
    model="Motif-Technologies/Motif-2-12.7B-Reasoning",
    trust_remote_code=True,
    dtype="bfloat16",            # optimize for L4 performance (bfloat16)
    gpu_memory_utilization=0.9,  # use 90% of GPU memory
    kv_cache_dtype="fp8",        # key point: store the KV cache in FP8 to save memory and speed things up
    enable_chunked_prefill=False
)

# Set up the chat messages (use the tokenizer if a chat template needs to be applied)
prompts = [
    "Hi, what is your name?"
]

# Generation settings
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)

# Run
outputs = llm.generate(prompts, sampling_params)

# Print the results
for output in outputs:
    generated_text = output.outputs[0].text
    print(f"Question: {output.prompt}")
    print(f"Answer: {generated_text}")

The code above fails with an error. Could you check whether I ran it incorrectly, or whether the model simply isn't supported in the first place?

[Error details]

AssertionError                            Traceback (most recent call last)
/tmp/ipython-input-346849919.py in <cell line: 0>()
      3
      4 # The L4 GPU works best with bfloat16.
----> 5 llm = LLM(
      6     model="Motif-Technologies/Motif-2-12.7B-Reasoning",
      7     trust_remote_code=True,

22 frames
/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py in __init__(self, num_heads, head_size, scale, num_kv_heads, alibi_slopes, cache_config, quant_config, logits_soft_cap, per_layer_sliding_window, use_mla, prefix, attn_type, kv_sharing_target_layer_name, attn_backend, **extra_impl_args)
    112         if num_kv_heads is None:
    113             num_kv_heads = num_heads
--> 114         assert num_heads % num_kv_heads == 0, \
    115             f"num_heads ({num_heads}) is not " \
    116             f"divisible by num_kv_heads ({num_kv_heads})"

AssertionError: num_heads (40) is not divisible by num_kv_heads (16)
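
For context on the assertion: when key/value heads are shared across query heads, each KV head serves a fixed number of query heads, so the query-head count has to be a whole multiple of the KV-head count. The following is a minimal sketch of that check, not vLLM's actual code. `query_heads_per_kv_head` is a hypothetical helper; the 40/16 values come from the error message, and 40/8 is only an illustrative passing case.

```python
def query_heads_per_kv_head(num_heads, num_kv_heads=None):
    """Return how many query heads share each KV head, or raise if uneven."""
    if num_kv_heads is None:
        # No separate KV-head count means one KV head per query head.
        num_kv_heads = num_heads
    if num_heads % num_kv_heads != 0:
        raise ValueError(
            f"num_heads ({num_heads}) is not divisible by "
            f"num_kv_heads ({num_kv_heads})"
        )
    return num_heads // num_kv_heads


# Illustrative passing case: 40 query heads spread over 8 KV heads.
print(query_heads_per_kv_head(40, 8))  # 5

# Values from the traceback: 40 % 16 == 8, so the check fails.
try:
    query_heads_per_kv_head(40, 16)
except ValueError as error:
    print(error)
```

Changing `dtype` or `kv_cache_dtype` would not be expected to affect this check, since it depends only on the two head counts.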

wyldecat

Motif Technologies org Dec 10, 2025

@2c6829
Hello. The model is not yet officially supported in vLLM.
If your environment can run Docker images, you can pull our image as shown below and use that.
docker pull ghcr.io/motiftechnologies/vllm:cuda-latest

Since using a Docker image may be difficult in your setup,
we will separately upload a whl file later today so that you can run it on an L4.

2c6829

Dec 10, 2025

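Oh, actually, what I want is to run it on an RTX 4080 with FP8 or 4-bit quantization. (That is the one computer I have at home. I'm trying to build a simple service with AI, and I enjoyed reading the paper on the base model you released today.) That's why I want to try it.
Since you said you'll upload something that works on the L4 (Colab), I'll wait for that first.
Thank you so much for the reply!!!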

wyldecat

Motif Technologies org Dec 10, 2025

@2c6829

We have uploaded a whl file whose supported GPU architectures include those of the L4 and the 4080.
(You can install it with pip install https://github.com/MotifTechnologies/vllm/releases/download/v0.10.1/vllm-0.10.1rc2.dev2507+g0e2a3d8ec.cu130-cp38-abi3-linux_x86_64.whl)
We tested that it works on an L4 GPU using an AWS EC2 G6 instance, but we have not been able to test it on a Colab VM.

There are currently three ways to use it.

Use the Docker image

The tested CUDA runtime and the other Python packages are all preinstalled, so this is the most recommended approach.
docker pull ghcr.io/motiftechnologies/vllm:cuda-latest

Use the whl file

pip install https://github.com/MotifTechnologies/vllm/releases/download/v0.10.1/vllm-0.10.1rc2.dev2507+g0e2a3d8ec.cu130-cp38-abi3-linux_x86_64.whl
It was built in a CUDA 13.0, torch 2.9.1 environment.
If you run into dependency errors, first install vllm with the --no-deps option (pip install --no-deps ...),
then install the requirements separately.

Build from source

git clone https://github.com/MotifTechnologies/vllm.git
If the CUDA runtime you want to use is not 13.0, or you need a particular dependency,
you will need to build from source yourself.
We recommend building in a uv environment.
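
Before installing the wheel, it can help to read the compatibility tags in its file name. The sketch below splits the name according to the standard wheel naming scheme (name, version, Python tag, ABI tag, platform tag) using only the standard library. `split_wheel_name` and `python_is_compatible` are hypothetical helper names; the sketch assumes a file name without the optional build tag and only understands `cpXY`-style Python tags.

```python
import sys

WHEEL = "vllm-0.10.1rc2.dev2507+g0e2a3d8ec.cu130-cp38-abi3-linux_x86_64.whl"


def split_wheel_name(filename):
    """Split a wheel file name (without a build tag) into its five fields."""
    stem = filename[: -len(".whl")] if filename.endswith(".whl") else filename
    parts = stem.split("-")
    if len(parts) != 5:
        raise ValueError(f"expected 5 dash-separated fields, got {len(parts)}")
    keys = ("name", "version", "python_tag", "abi_tag", "platform_tag")
    return dict(zip(keys, parts))


def python_is_compatible(python_tag, abi_tag, version_info=sys.version_info):
    """Check a CPython tag such as 'cp38' against an interpreter version."""
    if not python_tag.startswith("cp"):
        raise ValueError(f"unsupported python tag: {python_tag}")
    required = (int(python_tag[2]), int(python_tag[3:]))
    current = (version_info[0], version_info[1])
    if abi_tag == "abi3":
        # Stable ABI: built against `required`, usable on that version or newer.
        return current >= required
    # Otherwise the wheel targets exactly one CPython minor version.
    return current == required


fields = split_wheel_name(WHEEL)
print(fields["python_tag"], fields["abi_tag"], fields["platform_tag"])
print(python_is_compatible(fields["python_tag"], fields["abi_tag"]))
```

For this wheel the tags are `cp38`, `abi3`, and `linux_x86_64`, so it targets CPython 3.8 or newer on 64-bit Linux. The `cu130` suffix in the version string matches the CUDA 13.0 build environment mentioned above, but pip does not check the CUDA runtime, so that part still has to be verified by hand.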

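The PR to official vLLM is currently under review, so although it is somewhat inconvenient, we would appreciate it if you use one of the methods above for now.
If you need anything else, feel free to let us know 🤗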
