How can I use vLLM to serve google/translategemma-4b-it, and how do I call it?
I tried to deploy google/translategemma-4b-it using vLLM, but encountered the following error:
(APIServer pid=311829) pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
(APIServer pid=311829) Value error, rope_parameters should have a 'rope_type' key [type=value_error, input_value=ArgsKwargs((), {'model': ...rocessor_plugin': None}), input_type=ArgsKwargs]
(APIServer pid=311829) For further information visit https://errors.pydantic.dev/2.12/v/value_error
After checking GitHub, I found that other users have reported similar issues. I later managed to deploy the model with SGLang, but the inference format of translategemma-4b-it is incompatible with the OpenAI API format. How has the community resolved this incompatibility?
OpenAI API format:
messages = [{"role": "user", "content": f"Translate from English to Chinese:{content}"}]
translategemma-4b-it format:
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "source_lang_code": "cs",
                "target_lang_code": "de-DE",
                "text": "V nejhorším případě i k prasknutí čočky.",
            }
        ],
    }
]
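To show the mismatch concretely, here is a minimal sketch that wraps the TranslateGemma content schema into a standard chat-completions payload. Whether a given server accepts this schema is an assumption, and the model name and endpoint are placeholders — adjust them for your deployment:

```python
def build_translate_request(text: str, source: str, target: str) -> dict:
    """Build a chat-completions payload carrying TranslateGemma's
    content schema (language codes inside the content list)."""
    return {
        "model": "google/translategemma-4b-it",  # placeholder model name
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "source_lang_code": source,
                        "target_lang_code": target,
                        "text": text,
                    }
                ],
            }
        ],
    }

# Usage sketch (hypothetical local endpoint):
# import requests
# payload = build_translate_request(
#     "V nejhorším případě i k prasknutí čočky.", "cs", "de-DE")
# r = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
```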
Thank you! It seems the issue with TranslateGemma in vLLM is still being worked on. In the meantime, I found another version on Hugging Face that works well with vLLM + OpenAI API:
https://huggingface.co/Infomaniak-AI/vllm-translategemma-27b-it
When starting, don't run `vllm serve google/translategemma-4b-it` directly; add some parameters to it. In my case it ran after I added a few parameters:
vllm serve google/translategemma-4b-it --dtype bfloat16 --max-model-len 512 --gpu-memory-utilization 0.8 --optimization-level 0
If you want the 4b version, we uploaded the vLLM-compatible one: https://huggingface.co/Infomaniak-AI/vllm-translategemma-4b-it
Hi @ggmarks ,
TranslateGemma uses a Gemma-3-style configuration, and its message format is different, as you have already pointed out.
You will need to wait for a vLLM release that adds support for it.
For now, you can download the model locally and flatten the 'rope_parameters' entry, as it is currently nested.
Or you can use the community-tuned model you already mentioned, which does the same flattening and also handles the chat template.
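A minimal sketch of that flattening step. The nested layout (sub-dicts keyed by attention type, each carrying its own 'rope_type') is an assumption about the config — inspect the actual config.json before patching it:

```python
import json
from pathlib import Path

def flatten_rope_parameters(config: dict) -> dict:
    """If rope_parameters lacks a top-level 'rope_type' key (the condition
    vLLM's validation complains about), hoist the first nested entry that
    has one up to the top level."""
    rope = config.get("rope_parameters")
    if isinstance(rope, dict) and "rope_type" not in rope:
        for sub in rope.values():
            if isinstance(sub, dict) and "rope_type" in sub:
                config["rope_parameters"] = sub
                break
    return config

# Usage sketch: patch a locally downloaded config.json in place
# (the path is a placeholder for wherever you saved the model):
# path = Path("translategemma-4b-it/config.json")
# cfg = flatten_rope_parameters(json.loads(path.read_text()))
# path.write_text(json.dumps(cfg, indent=2))
```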
Thank you all for your valuable suggestions.
What are the vLLM, flash-attn, and torch versions? Can anyone share a working stack?
I tested with vLLM v0.14.0 and CUDA 13.0, and it works.