Instructions to use Menlo/Jan-nano-128k with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Menlo/Jan-nano-128k with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Menlo/Jan-nano-128k")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
# Jan-nano-128k is a text-generation model, so load it with the causal-LM auto class
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Menlo/Jan-nano-128k")
model = AutoModelForCausalLM.from_pretrained("Menlo/Jan-nano-128k")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use Menlo/Jan-nano-128k with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Menlo/Jan-nano-128k"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Menlo/Jan-nano-128k",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
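
The same request can be issued from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server from the previous step is listening on `localhost:8000`; the helper names `build_chat_request` and `send_chat_request` are illustrative and not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8000",
                       model: str = "Menlo/Jan-nano-128k") -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-compatible responses put the reply under choices[0].message.content
    return body["choices"][0]["message"]["content"]


request = build_chat_request("What is the capital of France?")
# With the server running: print(send_chat_request(request))
```

For the SGLang server, pass `base_url="http://localhost:30000"` instead.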

Use Docker

docker model run hf.co/Menlo/Jan-nano-128k

SGLang

How to use Menlo/Jan-nano-128k with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Menlo/Jan-nano-128k" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Menlo/Jan-nano-128k",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Menlo/Jan-nano-128k" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Menlo/Jan-nano-128k",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use Menlo/Jan-nano-128k with Docker Model Runner:
```
docker model run hf.co/Menlo/Jan-nano-128k
```

Curious - Yarn setting different from Qwen3 repo for 128k?

by DavidAU - opened Jun 25, 2025

Discussion

DavidAU

Jun 25, 2025

Note that the YaRN setting for Qwen3 4B, as per Qwen's repo, is:

"rope_scaling": {
"rope_type": "yarn",
"factor": 4.0,
"original_max_position_embeddings": 32768
}

I noticed yours is different:

"rope_scaling": {
"factor": 3.2,
"original_max_position_embeddings": 40960,
"rope_type": "yarn"
},

Does this impact performance?
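
For reference, a quick arithmetic check of the two configs above, assuming the usual reading of YaRN where the target context length is `factor × original_max_position_embeddings`. The helper name `yarn_target_context` is illustrative, not a library function.

```python
def yarn_target_context(rope_scaling: dict) -> int:
    """Target context length implied by a YaRN rope_scaling block (assumed: factor * original length)."""
    return round(rope_scaling["factor"] * rope_scaling["original_max_position_embeddings"])


# Values copied from the two configs quoted in this thread
qwen3_4b = {"rope_type": "yarn", "factor": 4.0, "original_max_position_embeddings": 32768}
jan_nano_128k = {"rope_type": "yarn", "factor": 3.2, "original_max_position_embeddings": 40960}

for name, cfg in [("Qwen3 4B", qwen3_4b), ("Jan-nano-128k", jan_nano_128k)]:
    print(name, yarn_target_context(cfg))
# Qwen3 4B 131072
# Jan-nano-128k 131072
```

Both configs target the same 131072-token window. This does not show that they behave identically: the factor and the original length also enter YaRN's per-frequency interpolation, so only a benchmark can answer the performance question.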

alandao

Menlo Research org Jun 25, 2025

Hi, our test results come from this config:

"rope_scaling": {
"factor": 3.2,
"original_max_position_embeddings": 40960,
"rope_type": "yarn"
},

alandao

Menlo Research org Jun 25, 2025

edited Jun 25, 2025

There should be no issue with the current config, don't worry.

We benchmarked everything using this config, so you should get the same results with it.

We're re-benchmarking with the config from the Qwen team. We're a bit confused at the moment, but it should not affect performance.

We'll update with the results soon! If the Qwen config scores better, we'll switch to it; otherwise the current one should be fine.

DavidAU

Jun 25, 2025

Thank you for the quick update.

I have used YaRN to extend the Qwen3 models to 320k, but if your method works better, all the better!
