Instructions for using WeiboAI/VibeThinker-1.5B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use WeiboAI/VibeThinker-1.5B with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="WeiboAI/VibeThinker-1.5B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("WeiboAI/VibeThinker-1.5B")
model = AutoModelForCausalLM.from_pretrained("WeiboAI/VibeThinker-1.5B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
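Note that the 40-token cap above will cut a reasoning model off mid-thought. Continuing from the "load model directly" snippet, here is a minimal sketch of sampled generation with a larger budget; the parameter values are illustrative assumptions, not official settings, so check the model card for the recommended ones:

# Sampled generation with a larger token budget (illustrative values)
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    max_new_tokens=4096,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))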
- Inference
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use WeiboAI/VibeThinker-1.5B with vLLM:
Install from pip and serve the model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "WeiboAI/VibeThinker-1.5B"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "WeiboAI/VibeThinker-1.5B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
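Because the server speaks the OpenAI API, any OpenAI client works as well. A minimal sketch using the openai Python package; the base_url and the placeholder api_key are the usual conventions for a local server, not vLLM-specific requirements:

# Query the local vLLM server through its OpenAI-compatible API
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="WeiboAI/VibeThinker-1.5B",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)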
Use Docker
docker model run hf.co/WeiboAI/VibeThinker-1.5B
- SGLang
How to use WeiboAI/VibeThinker-1.5B with SGLang:
Install from pip and serve the model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "WeiboAI/VibeThinker-1.5B" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "WeiboAI/VibeThinker-1.5B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
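SGLang can also run the model in-process, without a server. A sketch based on SGLang's offline engine API; the sampling values here are illustrative assumptions, not the model authors' recommendation:

# Offline generation with SGLang's Python engine (no server needed)
import sglang as sgl

llm = sgl.Engine(model_path="WeiboAI/VibeThinker-1.5B")
prompts = ["What is the capital of France?"]
sampling_params = {"temperature": 0.6, "top_p": 0.95, "max_new_tokens": 256}
for prompt, out in zip(prompts, llm.generate(prompts, sampling_params)):
    print(prompt, "->", out["text"])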
Use Docker images
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
    --model-path "WeiboAI/VibeThinker-1.5B" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "WeiboAI/VibeThinker-1.5B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
- Docker Model Runner
How to use WeiboAI/VibeThinker-1.5B with Docker Model Runner:
docker model run hf.co/WeiboAI/VibeThinker-1.5B
Running VibeThinker-1.5B on an Android Samsung tablet — Edge AI in Action
Model: VibeThinker-1.5B (a Qwen2.5-Math finetune)
Quantization: 4-bit GGUF
Inference engine: llama-server under Termux
Temperature: 0.2
System prompt:
“You are a concise solver. Always stop after giving a single line beginning with ‘Final Answer:’. Never explain or continue reasoning.”
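For anyone reproducing this setup: llama-server exposes an OpenAI-compatible chat endpoint, so it can be queried from Python on the same device. A minimal sketch; the port assumes llama-server's default of 8080, and the prompt and temperature mirror the settings above:

# Query the local llama-server (OpenAI-compatible /v1/chat/completions),
# using only the Python standard library so it runs under Termux as-is
import json, urllib.request

payload = {
    "temperature": 0.2,
    "messages": [
        {"role": "system", "content": "You are a concise solver. Always stop after giving a single line beginning with 'Final Answer:'. Never explain or continue reasoning."},
        {"role": "user", "content": "Solve y'' - y = e^x, y(0)=0, y'(0)=1."},
    ],
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])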
With this setup, the model correctly solved the initial value problem
y'' - y = e^x,\quad y(0)=0,\quad y'(0)=1,
arriving at
y(x)=\tfrac14 e^{x}-\tfrac14 e^{-x}+\tfrac12 x e^{x}.
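For reference, the check by undetermined coefficients: the homogeneous equation y'' - y = 0 gives y_h = C_1 e^{x} + C_2 e^{-x}. Since e^{x} already solves the homogeneous equation, the resonant ansatz y_p = A x e^{x} yields y_p'' - y_p = 2A e^{x}, so A = \tfrac12. The initial conditions force C_1 + C_2 = 0 and C_1 - C_2 + \tfrac12 = 1, hence C_1 = \tfrac14 and C_2 = -\tfrac14, exactly the answer above.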
At a generation rate of roughly 3 tokens per second, VibeThinker-1.5B handled both the mathematical reasoning and the logical structure smoothly. For a model of only 1.5 billion parameters, this performance is remarkable. It demonstrates that, with improved quantization and refined prompting, Edge AI on mobile devices has become a practical reality, bringing private, on-device reasoning to everyday hardware.
Hi, what chat UI do you use here?
The official llama.cpp web UI. Here is the Dockerfile for it:
FROM archlinux:latest
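# DEBIAN_FRONTEND is Debian-specific; it is a harmless no-op on this Arch base image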
ENV DEBIAN_FRONTEND=noninteractive
# passed from space environment
ARG MODEL_ID="unsloth/gemma-3-270m-it-GGUF"
ARG QUANT="Q8_0"
ARG SERVED_NAME="Gemma 270m"
ARG PARALLEL=4
ARG CTX_SIZE="4096"
ARG EMBEDDING_ONLY=0
ARG RERANK_ONLY=0
# llama.cpp env configs
ENV LLAMA_ARG_HF_REPO="${MODEL_ID}"
ENV LLAMA_ARG_CTX_SIZE="${CTX_SIZE}"
ENV LLAMA_ARG_BATCH=512
ENV LLAMA_ARG_N_PARALLEL="${PARALLEL}"
ENV LLAMA_ARG_FLASH_ATTN=on
# ENV LLAMA_ARG_CACHE_TYPE_K="q8_0"
# ENV LLAMA_ARG_CACHE_TYPE_V="q4_1"
ENV LLAMA_ARG_MLOCK=1
ENV LLAMA_ARG_N_GPU_LAYERS=0
ENV LLAMA_ARG_HOST="0.0.0.0"
ENV LLAMA_ARG_PORT=7860
ENV LLAMA_ARG_ALIAS="${SERVED_NAME}"
ENV LLAMA_ARG_EMBEDDINGS=${EMBEDDING_ONLY}
ENV LLAMA_ARG_RERANKING=${RERANK_ONLY}
ENV LLAMA_ARG_ENDPOINT_METRICS=1
RUN pacman -Syu --noconfirm --overwrite '*'
RUN pacman -S base-devel git git-lfs cmake curl openblas openblas64 blas64-openblas python gcc-libs glibc --noconfirm --overwrite '*'
RUN mkdir -p /app && mkdir -p /.cache
# cache dir for llama.cpp to download models
RUN chmod -R 777 /.cache
WORKDIR /app
RUN git clone --depth 1 --single-branch --branch master https://github.com/ggml-org/llama.cpp.git
# RUN git clone https://github.com/ikawrakow/ik_llama.cpp.git llama.cpp
WORKDIR /app/llama.cpp
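# Configure a CPU-only llama-server build: LTO, OpenBLAS acceleration, and
# native CPU optimizations (no GPU backend is enabled)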
RUN cmake -B build \
-DGGML_LTO=ON \
-DLLAMA_CURL=ON \
-DLLAMA_BUILD_SERVER=ON \
-DLLAMA_BUILD_EXAMPLES=ON \
-DGGML_ALL_WARNINGS=OFF \
-DGGML_ALL_WARNINGS_3RD_PARTY=OFF \
-DGGML_BLAS=ON \
-DGGML_BLAS_VENDOR=OpenBLAS \
-DGGML_NATIVE=ON \
-DGGML_LLAMAFILE=ON \
-Wno-dev \
-DCMAKE_BUILD_TYPE=Release
RUN cmake --build build --config Release --target llama-server -j $(nproc)
WORKDIR /app
EXPOSE 7860
CMD ["/app/llama.cpp/build/bin/llama-server", "--verbose-prompt", "--prio", "3"]
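To run the same image outside a Space, build it with docker build (overriding MODEL_ID, QUANT, SERVED_NAME, and the other ARGs via --build-arg as needed) and publish port 7860 when starting the container.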
Really impressive work getting VibeThinker-1.5B running so smoothly on a Samsung tablet. Solving a differential equation correctly at ~3 tokens/sec with a 4-bit GGUF shows how far edge AI has come. This is a great example of practical, private on-device reasoning; excited to see where mobile-first inference goes next.


