Curious - Yarn setting different from Qwen3 repo for 128k?
Note that the YaRN setting for Qwen3 4B, as per Qwen's repo, is:
"rope_scaling": {
"rope_type": "yarn",
"factor": 4.0,
"original_max_position_embeddings": 32768
}
I noticed yours is different:
"rope_scaling": {
"factor": 3.2,
"original_max_position_embeddings": 40960,
"rope_type": "yarn"
},
Does this impact performance?
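For reference, here is a quick arithmetic check of the two configs quoted above. It is only a sketch and assumes the target context length is `factor * original_max_position_embeddings`, which is the usual reading of these fields; the helper name `extended_context` is made up for illustration and is not part of any library.

```python
# The two rope_scaling blocks quoted in this thread.
qwen3_repo = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}
jan_nano = {
    "rope_type": "yarn",
    "factor": 3.2,
    "original_max_position_embeddings": 40960,
}


def extended_context(rope_scaling: dict) -> int:
    """Hypothetical helper: assumed target context in tokens.

    Assumes target = factor * original_max_position_embeddings.
    """
    return round(
        rope_scaling["factor"] * rope_scaling["original_max_position_embeddings"]
    )


for name, cfg in (("Qwen3 repo", qwen3_repo), ("Jan-nano-128k", jan_nano)):
    print(f"{name}: {extended_context(cfg)} tokens")
# Qwen3 repo: 131072 tokens
# Jan-nano-128k: 131072 tokens
```

Under that assumption both configs target the same 131072-token (128k) window. They differ in the base length and scaling factor, so this check alone does not say whether quality differs between them.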
Hi, our test results come from this config:
"rope_scaling": {
"factor": 3.2,
"original_max_position_embeddings": 40960,
"rope_type": "yarn"
},
There should be no issue with the current config, so don't worry.
We benchmarked everything using this config, so you should get the same results with it.
We're re-benchmarking with the config from the Qwen team. We're a bit confused atm, but it should not affect performance.
We will update with the results soon! If they are better, we will switch to the new config; otherwise the current one should be fine.
Thank you for the quick update.
I have used Yarn to extend the Qwen3s to 320k ... but if your method works better - all the better!