This model is extremely good

#4
by JohnMolotov - opened

I don't know if this is the right place to put this, but I'd like to give my own feedback on the model based on my local testing. Benchmarks can be overfitted; private, subjective testing cannot, though of course it is less precise.

All models were run at 4-bit quantization. I made some modifications to yxing-bj/vllm to get it running, as well as to fix bugs affecting the Turing architecture. The most notable changes were the use of fp16 instead of bf16 and Triton attention instead of FlashAttention-2, as my hardware required. The variant I tested was plt_num_loops=2. All other models were run with stock ollama (Q4_K_M).

I tested models in two major size categories, ~8b and ~30b, plus one 80b model. To be specific: gemma4:e4b, Qwen2.5-Coder:7B, granite-code:8b, ministral-3:8b, Yi-Coder:9b, qwen3.5:9b, devstral-small-2:24b, gemma4:26b, glm-4.7-flash:30b, qwen3-coder:30b, nemotron-3-nano:30b, qwen3.6:35b, Qwen3-Coder-Next:80b. Models were tasked with solving 10 problems from my codebases (covering bugfixing, refactoring, writing greenfield projects/algorithms, rewriting code in other languages, writing tests, and planning) across 3 languages. Models were first tested one-shot, then a subset was tested in aider, and a smaller subset in opencode. The results were first checked against automated tests and then ranked by me in blind pairwise comparisons.
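For anyone who wants to try a similar blind pairwise ranking, here is a minimal sketch of one way to do it. It is not my actual script; the function names (`blind_pairs`, `rank_by_wins`) and the win-rate aggregation are assumptions made for illustration, and the model names in the usage note are placeholders.

```python
import random
from collections import defaultdict
from itertools import combinations


def blind_pairs(models, seed=0):
    """Return every unordered pair of models exactly once.

    Both the order of the pairs and the left/right position inside each
    pair are shuffled, so the judge cannot infer which model is which
    from where an output appears. (Hypothetical helper.)
    """
    rng = random.Random(seed)
    pairs = [list(p) for p in combinations(sorted(models), 2)]
    rng.shuffle(pairs)
    for pair in pairs:
        rng.shuffle(pair)
    return [tuple(p) for p in pairs]


def rank_by_wins(judgements):
    """Rank models by win rate over (winner, loser) judgements.

    Ties in win rate are broken alphabetically. This is a plain
    win-rate ranking, chosen for simplicity; it is an assumption,
    not necessarily how any real ranking was aggregated.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in judgements:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return sorted(games, key=lambda m: (-wins[m] / games[m], m))
```

Usage would look like: call `blind_pairs(["model_a", "model_b", "model_c"])`, show the judge the two outputs for each pair without names, record `(winner, loser)` for each, then pass the list to `rank_by_wins`. With more models or problems, something like Elo or Bradley-Terry would probably be more robust than raw win rate.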

Results:
One-shot, LoopCoder-V2 was ahead of every ~8b model (I'd initially ranked ministral above it on apparent code quality, but ministral's code often didn't actually compile), but beneath every larger model except glm-4.7 and nemotron. That is very good given that some of these are leading models or significantly larger than it, but it's not yet something to write home about. The agentic performance, however, is truly mind-blowing. In opencode it was beaten only by Qwen3-Coder-Next:80b, while in aider it beat every other model (using either harness). I am truly staggered that it beat an 80b model in a blind ranking, even if the ranking is somewhat subjective.

However, I will note some specific areas where it was notably lacking. The biggest is that without an agentic harness and test cases it completely failed at bugfixing and performed poorly at refactoring, though it was good at both when run agentically. Other than that, its weakest areas were planning (which is somewhat inherently one-shot) and test generation, where it wasn't bad but wasn't notably good either. I also ran separate tests on just codebase exploration, and found it was middle of the pack among the ~8b models there. Its strength seems to be precisely in agentic code generation. It isn't a general model, and as long as it's approached with that in mind, it's absolutely fantastic.
