
Model response is not generating opening <think> tag

#3
by bmiles - opened

Hi, I'm having an issue where the model response is not generating the opening <think> tag. I am running the model with vLLM and Docker; see below.

docker run --runtime nvidia --gpus all \
           -v ~/.cache/huggingface:/root/.cache/huggingface \
           --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
           -p 8000:8000 \
           --ipc=host \
           vllm/vllm-openai:latest \
           --model nvidia/NVIDIA-Nemotron-Nano-9B-v2-NVFP4 \
           --tensor-parallel-size 1 \
           --max-num-seqs 64 \
           --max-model-len 131072 \
           --trust-remote-code \
           --mamba_ssm_cache_dtype float32

Example request and response:

curl -s http://xxx:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-Nano-9B-v2-NVFP4",
    "messages": [
      {"role": "system", "content": "/think"},
      {"role": "user", "content": "Hello"}
    ],
    "add_generation_prompt": true
  }' | jq -r '.choices[0].message.content'
Okay, the user just said "Hello". I should respond politely. Let me greet them back and ask how I can assist. Keep it friendly and open-ended.
</think>

Hello! How can I assist you today? 😊
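One common cause of this symptom is that the chat template emits the opening <think> tag as part of the generation prompt, so the model never produces it in the completion even though it does produce the closing tag. Assuming that is what is happening here, a minimal client-side workaround is to re-prepend the tag before further processing (`restore_think_tag` is a hypothetical helper, not part of vLLM or the model repo):

```python
def restore_think_tag(content: str,
                      open_tag: str = "<think>",
                      close_tag: str = "</think>") -> str:
    """Prepend the opening think tag when the completion omits it.

    If the chat template appended the opening tag to the prompt, the
    returned text contains only the closing tag; re-add the opening tag
    so downstream parsing of the reasoning block works as expected.
    """
    if close_tag in content and open_tag not in content:
        return open_tag + "\n" + content
    return content


# Applied to the truncated response above:
raw = 'Okay, the user just said "Hello". ...\n</think>\n\nHello! How can I assist you today?'
fixed = restore_think_tag(raw)
print(fixed.startswith("<think>"))  # True
```

This only papers over the behavior on the client side; whether the opening tag belongs in the prompt or the completion is determined by the model's chat template.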

I also see the following in the server logs:

(APIServer pid=1) INFO 11-23 09:10:02 [chat_utils.py:560] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
