Eval requests

#512
by zelk12 - opened

jsuheb/HeresyAgent-8B
yasserrmd/Neuro-Orchestrator-8B

nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16

nvidia/NVIDIA-Nemotron-Nano-12B-v2-Base
nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-BF16 - I can't quantize via gguf-my-repo, it gives an error.

nvidia/NVIDIA-Nemotron-Nano-9B-v2
nvidia/NVIDIA-Nemotron-Nano-9B-v2-Base

nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1 - I can't quantize via gguf-my-repo, it gives an error.

nvidia/Nemotron-Cascade-14B-Thinking

nvidia/Nemotron-Cascade-8B-Thinking
nvidia/Nemotron-Cascade-8B

nvidia/Nemotron-Elastic-12B

nvidia/Cosmos-Reason2-8B

nvidia/Cosmos-Reason2-2B

nvidia/Cosmos-Reason1-7B

nvidia/Qwen2.5-VL-7B-Surg-CholecT50

nvidia/Nemotron-Research-Reasoning-Qwen-1.5B

nvidia/Nemotron-Flash-3B-Instruct - I can't quantize via gguf-my-repo, it gives an error.
nvidia/Nemotron-Flash-3B - I can't quantize via gguf-my-repo, it gives an error.

nvidia/Nemotron-Flash-1B - I can't quantize via gguf-my-repo, it gives an error.



Some models may not be suitable for testing.
Some models do not have quants yet.
For some, it may not be clear how to run them.

Also, while I was looking through the models, I came across:
nvidia/Llama-3.3-Nemotron-70B-Reward-Principle

In theory, it could be adapted for testing.


@DontPlanToEnd
The question is, do you test models in a fully deterministic mode or not?
And what do you think about such a check?

In theory, this would ensure repeatability of test results and also demonstrate the capabilities of the model itself, independent of samplers.


The question is, do you test models in a fully deterministic mode or not?

I try to test models with mostly deterministic settings, but can't do fully deterministic since that can cause repetition issues. Thinking models kind of inherently need randomness in order to come up with multiple solutions and think through which is the best. And the creative writing section also requires randomness.

Also, I'm currently using vllm batching which is pretty hard if not impossible to make fully deterministic.
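
For context, a minimal sketch of what "mostly deterministic" batched generation looks like with vllm's offline API (the model name, prompts, and exact values are placeholders, not the actual benchmark setup):

```python
# Minimal sketch of "mostly deterministic" batched generation in vllm.
# Everything here is a placeholder, not the actual benchmark configuration.
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/NVIDIA-Nemotron-Nano-9B-v2")  # placeholder model

# Low temperature plus a fixed seed keeps outputs nearly deterministic
# while leaving enough randomness to avoid repetition loops.
params = SamplingParams(
    temperature=0.3,
    top_p=0.9,
    min_p=0.05,
    seed=42,
    max_tokens=1024,
)

prompts = ["Question 1 ...", "Question 2 ...", "Question 3 ..."]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```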

And another question: what samplers and parameters are available for vllm batch processing?

Also, about deterministic generation:
In theory, couldn't it also be used as a separate test?
For example, I often notice that some models make mistakes in words even with deterministic generation, and without it, even more so.
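
Something like this could serve as such a test, assuming greedy decoding counts as "fully deterministic" (model name and prompt below are placeholders):

```python
# Hypothetical repeatability check: greedy-decode the same prompt twice and
# compare. With do_sample=False the two runs should match token for token;
# any divergence or recurring misspellings would show up here.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "nvidia/Nemotron-Research-Reasoning-Qwen-1.5B"  # placeholder

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

inputs = tok("Write one sentence using the word 'accommodate'.", return_tensors="pt")
runs = [
    tok.decode(
        model.generate(**inputs, do_sample=False, max_new_tokens=64)[0],
        skip_special_tokens=True,
    )
    for _ in range(2)
]
print("identical runs:", runs[0] == runs[1])
print(runs[0])
```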

And a question about thinking models: we can determine exactly what the model is doing at any moment, generating the main message or thinking, so why hasn't anyone made two separate samplers for that? Otherwise the model may write one thing while thinking, but then, because of random sampling, go and write something completely different in the final answer. Or is that too rare?


Yeah, a model doing something like having high temperature during thinking and then low temperature when deciding on its final answer would be a good idea. I guess it would need to detect when the tag is output and then alter the sampler settings. I think there are some models that dynamically adjust temperature; glm and o1/o3, I believe.
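
As far as I know there's no stock sampler that does this, but a hand-rolled decode loop can approximate it. A rough sketch (model name, tag string, and temperature values are all assumptions):

```python
# Rough sketch of per-phase sampling: high temperature while the model is
# inside its reasoning block, low temperature once "</think>" has been
# emitted. Not an existing sampler; purely illustrative, and it recomputes
# the full forward pass each step (no KV cache) for brevity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-0.6B"           # placeholder thinking model
THINK_TEMP, ANSWER_TEMP = 1.0, 0.2  # illustrative values

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

ids = tok.apply_chat_template(
    [{"role": "user", "content": "What is 17 * 23?"}],
    add_generation_prompt=True,
    return_tensors="pt",
)

temperature = THINK_TEMP
with torch.no_grad():
    for _ in range(1024):
        logits = model(ids).logits[0, -1]
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, 1)
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
        # Switch to the answer-phase sampler once the closing tag appears.
        if temperature != ANSWER_TEMP and "</think>" in tok.decode(ids[0, -8:]):
            temperature = ANSWER_TEMP

print(tok.decode(ids[0], skip_special_tokens=True))
```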

vllm has a bunch of standard parameters like temperature, top_p, top_k, min_p, and seed. Since batching generates responses to many prompts at once, it doesn't have a deterministic focus on one output; it kind of throws all the prompts into a pot, which causes floating-point operations to be calculated in different orders, introducing microscopic rounding errors that prevent perfect reproducibility. There are settings you can change to minimize this, but as long as you're using batching to massively speed up processing time, I don't think you can fully get rid of the slight nondeterminism.
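
If anyone wants to see this for themselves, running the same batch twice with identical seeded settings and diffing the outputs should expose it (same placeholder setup as the sketch above):

```python
# Sketch: generate the same batch twice with identical, seeded settings and
# check which outputs still diverge due to batching-order FP effects.
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/NVIDIA-Nemotron-Nano-9B-v2")  # placeholder model
params = SamplingParams(temperature=0.3, top_p=0.9, seed=42, max_tokens=512)
prompts = ["Question 1 ...", "Question 2 ...", "Question 3 ..."]

run_a = [o.outputs[0].text for o in llm.generate(prompts, params)]
run_b = [o.outputs[0].text for o in llm.generate(prompts, params)]

for i, (a, b) in enumerate(zip(run_a, run_b)):
    print(f"prompt {i}: {'identical' if a == b else 'differs'}")
```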

And using min_p=1, top_k=1, and a constant seed is not effective, as I understand it?

