Instructions for using Menlo/Jan-nano with libraries and local apps.

Libraries

How to use Menlo/Jan-nano with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Menlo/Jan-nano")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Print the result; a bare call discards it when run as a script.
print(pipe(messages))

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Menlo/Jan-nano")
model = AutoModelForCausalLM.from_pretrained("Menlo/Jan-nano")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template and tokenize it in one step.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Menlo/Jan-nano with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Menlo/Jan-nano"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Menlo/Jan-nano",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
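
The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload` and `chat` are illustrative rather than part of vLLM, and the default base URL assumes the port used by the `vllm serve` command above.

```python
import json
import urllib.request


def build_chat_payload(prompt, model="Menlo/Jan-nano"):
    """Build the JSON body for an OpenAI-compatible chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(prompt, base_url="http://localhost:8000/v1"):
    """POST the prompt to a running server and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-compatible servers return the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of France?")` returns the model's reply as a string. For a server listening on another port, pass a different `base_url`, for example `"http://localhost:30000/v1"`.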

Use Docker

docker model run hf.co/Menlo/Jan-nano

SGLang

How to use Menlo/Jan-nano with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Menlo/Jan-nano" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Menlo/Jan-nano",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Menlo/Jan-nano" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Menlo/Jan-nano",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use Menlo/Jan-nano with Docker Model Runner:
```
docker model run hf.co/Menlo/Jan-nano
```

Jan-nano Local Deployment Issues - Lack of Reasoning and Poor MCP Performance

#12

by Yuuutong - opened Jun 30, 2025

Yuuutong

Jun 30, 2025

Hello everyone! I recently deployed the Jan-nano model locally, but I’ve encountered some issues during testing. I’d greatly appreciate your insights and guidance. Below are the specific problems I’m facing, along with my observations and questions.

Problem Description

Discrepancy Between Online and Local Inference
- When using the online API, the model behaves as expected, showing reasoning steps (e.g., step-by-step analysis, logical deduction), which aligns with the expected output.
- However, when deploying Jan-nano locally, the model does not perform reasoning and directly generates responses, leading to suboptimal performance on tasks requiring logical inference.
- Question: Is there a missing configuration or parameter in the local deployment? Do I need to explicitly enable a "reasoning mode" or adjust the inference pipeline?

Poor MCP Performance
- The MCP (possibly a plugin or inference mode) performs significantly worse in the local deployment compared to Qwen3-8b when using the "reasoning mode."
- Question: Could this be due to model architecture differences, training data, or parameter settings? Are there specific adjustments I can make to the MCP configuration?

Steps I’ve Already Taken

- Verified that the local deployment version of Jan-nano matches the online API version.
- Checked the model’s configuration files and found no obvious discrepancies.
- Experimented with inference parameters (e.g., temperature, top_p) but saw no significant improvement.
- Local deployment environment: Python 3.10 + CUDA 11.8, with hardware matching the online service.
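
A parameter experiment of the kind described can be made systematic by generating one request body per setting. The sketch below is only an illustration: `sampling_grid` is a hypothetical helper, the values are placeholders rather than the settings used in this post, and it assumes an OpenAI-compatible endpoint that accepts the standard `temperature` and `top_p` request fields.

```python
from itertools import product


def sampling_grid(prompt, temperatures, top_ps, model="Menlo/Jan-nano"):
    """Yield one chat-completion request body per (temperature, top_p) pair."""
    for temperature, top_p in product(temperatures, top_ps):
        yield {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": temperature,
            "top_p": top_p,
        }


# Hypothetical values, chosen only to illustrate the sweep.
payloads = list(sampling_grid("Who are you?", [0.2, 0.7], [0.8, 0.95]))
for payload in payloads:
    print(payload["temperature"], payload["top_p"])
```

Each body can then be sent to the local server and the responses compared side by side, which makes it easier to tell whether sampling settings affect the behavior at all.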

What I’m Looking For

- Insights from others who have deployed Jan-nano locally and encountered similar issues.
- Guidance on enabling "reasoning mode" or adjusting inference parameters.
- Analysis of potential causes for the MCP performance gap and strategies to address it.

Thank you for your time and expertise!
If you have examples of configurations, parameter explanations, or relevant documentation, I’d be incredibly grateful. Looking forward to your responses! 😊

alandao

Menlo Research org, Jul 1, 2025 (edited Jul 1, 2025)

Hi, Jan-nano is a 4B (not 8B) non-reasoning model, so the offline behavior is correct.

I think the online API supports both, but at the end of the day we trained the model not to think.
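
To check which behavior a given deployment exhibits, one option is to look for a reasoning block in the raw output. The sketch below assumes the `<think>...</think>` delimiters used by Qwen3-style reasoning models; whether a particular Jan-nano serving stack emits them is an assumption, and `split_reasoning` is an illustrative helper, not part of any library.

```python
import re

# Assumption: reasoning, when present, is wrapped in <think>...</think>.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer); reasoning is "" when no non-empty block exists."""
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the think block is treated as the final answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer
```

An empty first element means the model answered directly, which is the behavior described above for a non-reasoning model.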

Yuuutong

Jul 2, 2025

Hi @alandao,
Thank you so much for your clear reply! That definitely clears up why I was seeing different behaviors between the online and local versions.
Just to clarify, my mention of an "8b model" in the original post was referring to Qwen3-8b, which I was using as a benchmark for comparison.
I understand now that Jan-nano is a 4b non-reasoning model and its behavior in my local deployment is correct. What I'm still trying to understand is the extent of the performance difference on our MCP task. The drop in accuracy compared to a reasoning model like Qwen3-8b was larger than I had anticipated.
Is such a significant performance gap expected when a non-reasoning model is applied to tasks that might implicitly benefit from the underlying capabilities of a reasoning model?
Thanks again for your help.
