Why are all the benchmarks against last gen models?
Benchmarking against old models is lazy at best, and at worst an attempt to hide the model's real capabilities.
Hi @Bits360 ,
Thanks for the feedback. We would like to clarify that the primary goal of this work is not to claim a new SOTA (State-of-the-Art) result, but to explore a fundamental architectural question in native multimodality: how language-style discrete autoregressive modeling can naturally extend to vision and audio within a shared token interface.
That being said, we have evaluated LongCat-Next on several challenging and specialized benchmarks. For instance, our model shows very competitive performance on OmniDocBench and CharXivRQ. These tasks specifically focus on complex document understanding and reasoning—areas where unified multimodal modeling is particularly demanding.
We encourage you to test the model yourself using our demo site or the provided inference scripts to see its capabilities firsthand beyond the static benchmark numbers.
As mentioned in our conclusion, our main contribution lies in the methodology and the feasibility of this unified learning paradigm. We hope this work offers a different perspective on multimodal modeling and provides insights for the community toward building truly unified foundation models.
I do understand that. The model seems impressive: it's one of the few open models that handles audio, and one of the only sparse image models. But releasing a non-SOTA model means you should benchmark against budget models, not models that are irrelevant and often more expensive.
Yeah, you can make your benchmarks look better by benchmarking against older models, but people aren't using bagel, gemini 2.5 flash image, or flux 1 dev anymore; they're using z-image turbo, gemini 3 flash image, and flux 2 klein 4/9b.
Same with the vision models. I understand not benchmarking against gpt5.4 mini because it practically just came out, but it doesn't take four months of prep time to run benchmarks, which makes gpt5.1-5.3 more appropriate. Alternatively, to benchmark against a closer model class AND make the model look better, there's GPT5 mini, which never got a 5.1-5.3 refresh. It's also odd not to include qwen3.5, which was released a month ago. Etc.