Why is GLM3 better than GLM4 on the LVEval benchmark?

#48

by AnaRhisT - opened Jun 20, 2024

Jun 20, 2024

Hi,

I'm testing both chatglm3-6b (32K context length) and glm-4-9b-chat (128K context length) on LVEval (I limit glm4 to a 32K context length as well),
and ChatGLM3 scores much better than GLM4.

Any idea why this is happening?

davidlvxin

Z.ai org Jun 20, 2024

We haven't tested on this dataset, so could you first verify whether GLM-4-9b-chat performs well for everyday long-text usage (e.g., document Q&A)? You can try the demo here (https://github.com/THUDM/GLM-4/blob/main/composite_demo/README_en.md). If incorrect usage has been ruled out as the cause, read on.

Unlike GLM3, GLM-4-9b-chat has been deeply optimized for user scenarios such as daily document Q&A. This might impact its performance on some benchmarks. However, we believe that optimizations closer to actual user scenarios should result in a better overall experience. We look forward to hearing if GLM-4-9b-chat enhances your experience.
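On ruling out incorrect usage: one thing worth checking in a comparison like this is that both models receive prompts capped to the same token budget in the same way. The sketch below is only an illustration, not LVEval's own code. It assumes the harness works on token ID lists and trims over-long prompts by keeping the head and tail. The names `prompt_budget` and `truncate_middle` are hypothetical helpers introduced here.

```python
from typing import List, Sequence


def prompt_budget(context_window: int, max_new_tokens: int,
                  template_overhead: int = 0) -> int:
    """Tokens left for the prompt after reserving room for the answer
    and for any chat-template tokens (template_overhead is an assumed estimate)."""
    budget = context_window - max_new_tokens - template_overhead
    if budget <= 0:
        raise ValueError("context window too small for the requested generation")
    return budget


def truncate_middle(token_ids: Sequence[int], max_tokens: int) -> List[int]:
    """Keep the first and last parts of a prompt and drop the middle
    so that at most max_tokens token IDs remain."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    ids = list(token_ids)
    if len(ids) <= max_tokens:
        return ids  # already fits, nothing to trim
    head = max_tokens // 2       # tokens kept from the start
    tail = max_tokens - head     # tokens kept from the end (always >= 1)
    return ids[:head] + ids[-tail:]
```

With a Hugging Face tokenizer, the IDs would typically come from `tokenizer.encode(text, add_special_tokens=False)` and be turned back into text with `tokenizer.decode(...)` before the chat template is applied. Because the two models use different tokenizers, the same document can have a different token count for each, so the budget should be computed per model rather than reused.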

THUDM-Space changed discussion status to closed Jan 2, 2025
