Jan-nano Local Deployment Issues - Lack of Reasoning and Poor MCP Performance
Hello everyone! I recently deployed the Jan-nano model locally, but I’ve encountered some issues during testing. I’d greatly appreciate your insights and guidance. Below are the specific problems I’m facing, along with my observations and questions.
Problem Description
Discrepancy Between Online and Local Inference
- When using the online API, the model behaves as expected, showing reasoning steps (e.g., step-by-step analysis, logical deduction), which aligns with the expected output.
- However, when deploying Jan-nano locally, the model does not perform reasoning and directly generates responses, leading to suboptimal performance on tasks requiring logical inference.
- Question: Is there a missing configuration or parameter in the local deployment? Do I need to explicitly enable a "reasoning mode" or adjust the inference pipeline?
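One way to confirm the discrepancy described above is to check each raw response for an explicit reasoning trace. The sketch below assumes the trace is wrapped in `<think>...</think>` tags, as Qwen3-family chat templates do; the helper name `split_reasoning` and the sample strings are hypothetical, not actual model output.

```python
import re

# Assumption: reasoning, when present, is emitted inside <think>...</think> tags.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) for a raw model response."""
    match = THINK_RE.search(text)
    if match is None:
        # No trace found: the whole response is the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Remove the trace and keep whatever surrounds it as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


# Hypothetical sample responses, for illustration only.
online = "<think>Paris is the capital.</think>\nParis."
local = "Paris."

for name, raw in [("online", online), ("local", local)]:
    reasoning, answer = split_reasoning(raw)
    print(name, "has reasoning:", bool(reasoning), "| answer:", answer)
```

Running the same prompts through both endpoints and comparing the `reasoning` field shows whether the difference is in the model output itself or only in how a client displays it.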
Poor MCP Performance
- On MCP (Model Context Protocol) tool-use tasks, the locally deployed Jan-nano performs significantly worse than Qwen3-8b running in its "reasoning mode."
- Question: Could this be due to model architecture differences, training data, or parameter settings? Are there specific adjustments I can make to the MCP configuration?
Steps I’ve Already Taken
- Verified that the local deployment version of Jan-nano matches the online API version.
- Checked the model’s configuration files and found no obvious discrepancies.
- Experimented with inference parameters (e.g., temperature, top_p) but saw no significant improvement.
- Local deployment environment: Python 3.10 + CUDA 11.8, with hardware matching the online service.
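The parameter experiments listed above can be made repeatable by generating one request body per setting. This is a minimal sketch, assuming a local server that accepts the OpenAI-compatible chat completions schema; the helper `build_payloads` and the grid values are illustrative and not recommended settings.

```python
import itertools
import json


def build_payloads(model, prompt, temperatures, top_ps, max_tokens=256):
    """Build one chat-completions request body per (temperature, top_p) pair."""
    payloads = []
    for temperature, top_p in itertools.product(temperatures, top_ps):
        payloads.append({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": temperature,
            "top_p": top_p,
            "max_tokens": max_tokens,
        })
    return payloads


# Hypothetical grid: 2 temperatures x 2 top_p values = 4 request bodies.
payloads = build_payloads(
    "Menlo/Jan-nano",
    "What is the capital of France?",
    temperatures=[0.2, 0.7],
    top_ps=[0.8, 0.95],
)
print(len(payloads))
print(json.dumps(payloads[0], indent=2))
```

Each body can then be sent to the local endpoint with any HTTP client, keeping the prompt fixed so that only the sampling parameters vary between runs.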
What I’m Looking For
- Insights from others who have deployed Jan-nano locally and encountered similar issues.
- Guidance on enabling "reasoning mode" or adjusting inference parameters.
- Analysis of potential causes for the MCP performance gap and strategies to address it.
Thank you for your time and expertise!
If you have examples of configurations, parameter explanations, or relevant documentation, I’d be incredibly grateful. Looking forward to your responses! 😊
Hi, Jan-nano is a 4B (not 8B) non-reasoning model, so the offline behavior is correct.
I think the online API supports both modes, but at the end of the day we trained the model not to think.
Hi @alandao,
Thank you so much for your clear reply! That definitely clears up why I was seeing different behaviors between the online and local versions.
Just to clarify, my mention of an "8b model" in the original post was referring to Qwen3-8b, which I was using as a benchmark for comparison.
I understand now that Jan-nano is a 4b non-reasoning model and its behavior in my local deployment is correct. What I'm still trying to understand is the extent of the performance difference on our MCP task. The drop in accuracy compared to a reasoning model like Qwen3-8b was larger than I had anticipated.
Is such a significant performance gap expected when a non-reasoning model is applied to tasks that might implicitly benefit from the underlying capabilities of a reasoning model?
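To put a number on the gap, it can help to score both models on the same task set and look at where they disagree. The following is a sketch with made-up pass/fail results; `compare_models` and the task IDs are hypothetical, and the values are not measurements from either model.

```python
def compare_models(results_a, results_b):
    """Compare per-task pass/fail results keyed by task ID."""
    shared = sorted(set(results_a) & set(results_b))
    if not shared:
        raise ValueError("no shared tasks to compare")
    acc_a = sum(results_a[t] for t in shared) / len(shared)
    acc_b = sum(results_b[t] for t in shared) / len(shared)
    # Tasks solved by exactly one model show where the gap comes from.
    only_a = [t for t in shared if results_a[t] and not results_b[t]]
    only_b = [t for t in shared if results_b[t] and not results_a[t]]
    return {"acc_a": acc_a, "acc_b": acc_b, "only_a": only_a, "only_b": only_b}


# Made-up results for illustration only.
jan_nano = {"t1": True, "t2": False, "t3": False, "t4": True}
qwen3_8b = {"t1": True, "t2": True, "t3": False, "t4": True}

summary = compare_models(jan_nano, qwen3_8b)
print(summary)
```

Inspecting the tasks in `only_b` by hand is usually more informative than the accuracy difference alone, since it shows whether the failures cluster around multi-step tool calls.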
Thanks again for your help.