fp8 quant for qwen3-vl-embedding models, nearly half memory decrease, speedup 30%, vllm serve can run