vllm

Recommended docker run command runs Qwen serving + after fixes fails with "not a valid safetensors repo"

#7
by vadimkantorov - opened

How do I force it to serve Leanstral? (Also, the example in the README should probably include --gpus all, or at least mention it.)

Should the run example in the README be changed to docker run -it --gpus all mistralllm/vllm-ms4:latest --model mistralai/Leanstral-2603 --max-model-len 262144?

sudo docker run --gpus all -it mistralllm/vllm-ms4:latest

(APIServer pid=1) INFO 04-03 15:12:04 [utils.py:297]
(APIServer pid=1) INFO 04-03 15:12:04 [utils.py:297]        █     █     █▄   ▄█
(APIServer pid=1) INFO 04-03 15:12:04 [utils.py:297]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.0.0
(APIServer pid=1) INFO 04-03 15:12:04 [utils.py:297]   █▄█▀ █     █     █     █  model   Qwen/Qwen3-0.6B
(APIServer pid=1) INFO 04-03 15:12:04 [utils.py:297]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 04-03 15:12:04 [utils.py:297]
(APIServer pid=1) INFO 04-03 15:12:04 [utils.py:233] non-default args: {}
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 726/726 [00:00<00:00, 5.70MB/s]
(APIServer pid=1) Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(APIServer pid=1) [2026-04-03 15:12:05] WARNING _http.py:916: Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
(APIServer pid=1) INFO 04-03 15:12:11 [model.py:533] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=1) INFO 04-03 15:12:11 [model.py:1582] Using max model len 40960
(APIServer pid=1) INFO 04-03 15:12:11 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1) INFO 04-03 15:12:11 [vllm.py:754] Asynchronous scheduling is enabled.
tokenizer_config.json: 9.73kB [00:00, 40.3MB/s]
vocab.json: 2.78MB [00:00, 146MB/s]
merges.txt: 1.67MB [00:00, 161MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:00<00:00, 18.6MB/s]
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 239/239 [00:00<00:00, 1.99MB/s]
(EngineCore pid=438) INFO 04-03 15:12:20 [core.py:103] Initializing a V1 LLM engine (v0.0.0) with config: model='Qwen/Qwen3-0.6B', speculative_config=None, tokenizer='Qwen/Qwen3-0.6B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_paralle

Changing the run command to the following (on an 8xH100 node):

docker run -it --gpus all mistralllm/vllm-ms4:latest --model mistralai/Leanstral-2603 \
  --max-model-len 200000 \
  --tensor-parallel-size 4 \
  --attention-backend FLASH_ATTN_MLA \
  --tool-call-parser mistral \
  --enable-auto-tool-choice \
  --reasoning-parser mistral

produces the following worrying error messages from vLLM:

(APIServer pid=1) INFO 04-03 15:33:35 [model.py:533] Resolved architecture: PixtralForConditionalGeneration
(APIServer pid=1) ERROR 04-03 15:33:35 [repo_utils.py:47] Error retrieving safetensors: 'mistralai/Leanstral-2603' is not a safetensors repo. Couldn't find 'model.safetensors.index.json' or 'model.safetensors' files., retrying 1 of 2
(APIServer pid=1) ERROR 04-03 15:33:38 [repo_utils.py:45] Error retrieving safetensors: 'mistralai/Leanstral-2603' is not a safetensors repo. Couldn't find 'model.safetensors.index.json' or 'model.safetensors' files.

But then it still proceeds.

The repo indeed does not have model.safetensors.index.json, but it does have consolidated.safetensors.index.json. Should that file be renamed in the repo to model.safetensors.index.json?
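To make the file-name mismatch concrete, here is a minimal sketch that classifies a repo's weight layout from its file listing and suggests the extra flags mentioned in this thread. The function names `detect_weight_layout` and `suggested_flags` are hypothetical helpers, not part of vLLM or huggingface_hub, and the marker sets only contain the file names quoted in this discussion.

```python
# File names taken from the error message and the repo listing in this thread.
HF_MARKERS = {"model.safetensors.index.json", "model.safetensors"}
MISTRAL_MARKERS = {"consolidated.safetensors.index.json"}


def detect_weight_layout(filenames):
    """Classify a repo as 'hf', 'mistral', 'both' or 'unknown' (hypothetical helper)."""
    names = set(filenames)
    has_hf = bool(names & HF_MARKERS)
    has_mistral = bool(names & MISTRAL_MARKERS)
    if has_hf and has_mistral:
        return "both"
    if has_hf:
        return "hf"
    if has_mistral:
        return "mistral"
    return "unknown"


def suggested_flags(layout):
    """Return the flags that worked in this thread for a Mistral-format repo."""
    if layout == "mistral":
        return ["--config-format", "mistral", "--load-format", "mistral"]
    return []


if __name__ == "__main__":
    # Illustrative listing only; the real one could come from
    # huggingface_hub.list_repo_files("mistralai/Leanstral-2603").
    listing = ["params.json", "consolidated.safetensors.index.json"]
    layout = detect_weight_layout(listing)
    print(layout, " ".join(suggested_flags(layout)))
```

The listing in the usage part is an assumption for illustration; only params.json and consolidated.safetensors.index.json are confirmed by this thread to exist in the repo.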

I had to add --config-format mistral --load-format mistral to the docker run command, but then it complains with:

[2026-04-03 16:11:53] WARNING _http.py:916: Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
params.json: 1.51kB [00:00, 7.08MB/s]
Unrecognized keys in `rope_parameters` for 'rope_type'='yarn': {'apply_yarn_scaling'}
`rope_parameters`'s factor field must be a float >= 1, got 128
`rope_parameters`'s beta_fast field must be a float, got 32
`rope_parameters`'s beta_slow field must be a float, got 1
Unrecognized keys in `rope_parameters` for 'rope_type'='yarn': {'apply_yarn_scaling'}
`rope_parameters`'s factor field must be a float >= 1, got 128
`rope_parameters`'s beta_fast field must be a float, got 32
`rope_parameters`'s beta_slow field must be a float, got 1

which is also quite suspicious.

To make use of all 8 GPUs, should I run two copies of the container (each using 4 GPUs)? Or do you also provide some sort of proxy that multiplexes requests across several vLLM instances running inside a single container?
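For reference, the two-copies option could look roughly like the sketch below, which only builds and prints the two docker run command lines. It is not a recommendation from the maintainers. It assumes the server inside the image listens on port 8000 (vLLM's default, but not confirmed for this image), uses Docker's `--gpus '"device=..."'` syntax to pin GPUs, and the host ports 8000 and 8001 and the helper name `replica_command` are arbitrary choices.

```python
import shlex

IMAGE = "mistralllm/vllm-ms4:latest"
MODEL = "mistralai/Leanstral-2603"


def replica_command(gpu_ids, host_port, container_port=8000):
    """Build a docker run argv for one replica pinned to the given GPUs.

    container_port=8000 is an assumption (vLLM's default API port).
    """
    # Docker expects the device list wrapped in literal double quotes
    # because the value itself contains commas.
    device_spec = '"device=' + ",".join(str(g) for g in gpu_ids) + '"'
    return [
        "docker", "run", "-it",
        "--gpus", device_spec,
        "-p", f"{host_port}:{container_port}",
        IMAGE,
        "--model", MODEL,
        "--tensor-parallel-size", str(len(gpu_ids)),
        "--config-format", "mistral",
        "--load-format", "mistral",
    ]


if __name__ == "__main__":
    # Two replicas, four GPUs each, on different host ports.
    for gpus, port in (([0, 1, 2, 3], 8000), ([4, 5, 6, 7], 8001)):
        print(shlex.join(replica_command(gpus, port)))
```

Each printed line can be pasted into its own terminal. Requests would then have to be spread over the two host ports by the client or by an external load balancer, since nothing in this thread confirms a built-in proxy.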

vadimkantorov changed discussion title from The docker run mistralllm/vllm-ms4:latest runs Qwen model serving for some reason to Recommended docker run command runs Qwen serving + after fixes fails with "not a valid safetensors repo"

I'm also a bit worried about Unrecognized keys in rope_parameters for 'rope_type'='yarn': {'apply_yarn_scaling'}

apply_yarn_scaling does not actually appear in params.json, so why would the loader complain about it?

Probably related: https://discuss.huggingface.co/t/hf-transformers-config-issue-for-mistral-large-3-models/171369/4

Maybe some sort of yarn/transformers version mismatch in Docker after key remapping?

Mistral AI_ org

Hey, sorry for the late response. Basically, there is nothing to worry about, which may sound counterintuitive. These messages appear because our config is mapped to a Transformers config inside vLLM, and some of the mapped values don't have the type Transformers expects. They are not detrimental to model usage: the model behaves properly, so they are more like warnings than errors. In releases more recent than the one in the Docker image they are mostly gone, except for this one:

Unrecognized keys in rope_parameters for 'rope_type'='yarn': {'apply_yarn_scaling'}

but it will be gone after a PR is merged.

Mistral AI_ org

Oh, the fix for the last one was actually merged an hour ago, thanks to Harry from 🤗:
https://github.com/vllm-project/vllm/commit/edcc37a8cee26813fe868b9fc267c3cba5818ff7

@juliendenize Could the container also be updated to the most recent versions?

HF Transformers is also fixing autocasting rope ints to floats: https://github.com/huggingface/transformers/pull/45289
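The type messages in the log earlier in this thread (factor, beta_fast and beta_slow reported as 128, 32 and 1) are consistent with a strict float check rejecting integer values. The sketch below reproduces that behaviour and shows how casting ints to floats silences it. It is an illustration under that assumption, not the actual Transformers validation code or the content of the linked PR; `validate_yarn_types` and `cast_rope_ints_to_floats` are hypothetical names.

```python
def validate_yarn_types(rope_parameters):
    """Return messages in the style of the log above (hypothetical validator)."""
    messages = []
    factor = rope_parameters.get("factor")
    # A strict isinstance check: the int 128 is rejected even though 128 >= 1.
    if not isinstance(factor, float) or factor < 1:
        messages.append(
            f"`rope_parameters`'s factor field must be a float >= 1, got {factor}"
        )
    for key in ("beta_fast", "beta_slow"):
        value = rope_parameters.get(key)
        if not isinstance(value, float):
            messages.append(
                f"`rope_parameters`'s {key} field must be a float, got {value}"
            )
    return messages


def cast_rope_ints_to_floats(rope_parameters, keys=("factor", "beta_fast", "beta_slow")):
    """Return a copy with plain ints under the given keys converted to floats."""
    out = dict(rope_parameters)
    for key in keys:
        value = out.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, int) and not isinstance(value, bool):
            out[key] = float(value)
    return out


if __name__ == "__main__":
    # Values taken from the warnings quoted in this thread.
    raw = {"rope_type": "yarn", "factor": 128, "beta_fast": 32, "beta_slow": 1}
    for line in validate_yarn_types(raw):
        print(line)
    print(validate_yarn_types(cast_rope_ints_to_floats(raw)))
```

Since 128 and 128.0 are numerically identical, this kind of warning would not change the computed rope scaling, which matches the maintainer's statement that the model behaves properly.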

Looking forward to having the Mistral-4/Leanstral fixes in HF/vllm released versions

Mistral AI_ org

We're actually not planning to update the container; instead we're fixing everything upstream. The last PR, which ensures guidance is in, will be submitted very shortly and merged soon!

@juliendenize The vLLM PRs are merged. Are there any known-good vLLM / Transformers mainline commit hashes? Thanks!

Mistral AI_ org

@vadimkantorov updated the model card, thanks for hanging in there :)

juliendenize changed discussion status to closed
