Fantastic release!
I want to thank Google for the release of these magnificent models. I especially like the MoE because it runs fast on my laptop.
Of course everyone is comparing these to Qwen 3.5, which is the latest release by Alibaba. Those are also great models, but Gemma has a lot of strengths. I've found coding and agentic tasks to be roughly on par (sometimes a bit better, sometimes a bit worse), which is no easy feat considering it feels like Qwen updates their models twice a week. However, multilingual performance, creative writing, and world knowledge are noticeably better with Gemma. The 31B model and this MoE are certainly the new darlings in roleplay and creative writing focused communities! So good job on that.
A few points that can be improved:
Native image support is also nice, although we already had that with Gemma 3. I wonder why audio is not natively supported here and why audio support is limited to the E2B & E4B models. For voice assistants it would be great to have native multimodal models that accept audio as input and can even generate audio (not just speech), as well as native video input.
I have also noticed the architecture needs quite a bit more memory than Qwen 3.5. Considering the latter is an RNN hybrid, that makes sense. But it is still efficient, provided you are using SWA, which also disables context shifting. Unlike with Qwen 3.5, you get a choice: keep context shifting but compromise on context length, or give up context shifting and fit a lot more context in memory, which is great. In the future it would be nice to have a brand-new architecture that is both memory efficient and supports things like context shifting at the same time, though. Ideally, of course, we would strive towards real-time learning and much better memory than current-day context, but I feel that is still quite a long way off (e.g. the Titans architecture).
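To make the memory trade-off concrete, here is a rough back-of-the-envelope sketch of KV-cache size with and without SWA. All hyperparameters below are made-up placeholders, not Gemma 4's (or Qwen 3.5's) actual config, and the 5-of-6 SWA layer ratio is just an illustrative assumption:

```python
# Rough KV-cache size estimate: full attention vs. sliding-window attention (SWA).
# All hyperparameters are illustrative placeholders, NOT any model's real config.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    # factor of 2 for keys + values; fp16 = 2 bytes per element
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

n_layers, n_kv_heads, head_dim = 48, 8, 128
ctx_len, window = 32768, 4096

full = kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len)

# With SWA, suppose 5 of every 6 layers only keep a sliding window of the context
# while the rest attend globally.
swa_layers = n_layers * 5 // 6
global_layers = n_layers - swa_layers
swa = (kv_cache_bytes(swa_layers, n_kv_heads, head_dim, min(window, ctx_len))
       + kv_cache_bytes(global_layers, n_kv_heads, head_dim, ctx_len))

print(f"full attention: {full / 2**30:.2f} GiB, SWA: {swa / 2**30:.2f} GiB")
```

The point of the sketch: windowed layers stop scaling with total context length, which is where the memory savings come from, but it also explains why context shifting gets harder once most layers have already forgotten the oldest tokens.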
I would also like to see QAT models again; back in Gemma 3 times, QAT significantly improved text performance for local inference, so I wonder why Gemma 4 has not been trained in a quantization-aware way from the start. Or perhaps it has and you just didn't make a fuss about it. After llama.cpp implemented some fixes, it's pretty good now.
Thank you again for the models. I will continue to test them!
Hi @Dampfinchen ,
Thank you so much for your detailed feedback! We really appreciate you sharing your experience with Gemma 4, especially your insights on MoE performance, multilingual and creative tasks, and comparisons with Qwen 3.5. We’re glad to hear you’re enjoying the models and look forward to seeing your continued testing and exploration!
Thank you! It's nice to see you interacting with the community. You knocked it out of the park with this release and the high download counts on both your HF page and others like Bartowski and Unsloth speak volumes.
As promised, I have spent a lot of time with the model and I'm loving it. Of course nothing is perfect, so I have noticed two quirks with the model after long-term testing:
For creative writing purposes: creative writing is really, really good with this model. However, I've noticed generations are too similar to each other at the default sampler settings Google recommends, and also with my own settings, which I have to set very high to get even slight variation between generations. I assume this is a side effect of instruct tuning and probably overfitting. Another problem is repetitive AI-slop phrases. For example, it really likes the pattern "it is not X but Y" / "not just X but Y" and the like, and it is generally too agreeable with the user. Gemini suffers from these issues as well.
For agentic tasks, it seems the model could use more initiative. Qwen 3.5 handles things on its own very nicely, but with Gemma 4 you kinda have to babysit it. The model sometimes has trouble following through (actually calling tools) after it has told the user it plans to do so.
Tool call -> Tool call -> Tool call -> Response to the user: that works without issues. However...
Tool call -> Tool call -> Tool call -> Tool call -> Response to the user -> Tool call: this is what the model has trouble with, especially at higher context sizes.
It often tells the user it is going to do something now, but then stops the generation instead of calling a tool. I have noticed this using different frontends and backends and of course with the latest chat template.
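To make the failure mode easier to spot in logs, here is a small sketch that scans a transcript for assistant turns that announce an action but emit no tool call. The message format is assumed to be OpenAI-style chat dicts with a `tool_calls` field (adjust for your frontend), and the intent regex is just a crude heuristic:

```python
import re

# Crude heuristic: the assistant says it will search/call/check something...
INTENT = re.compile(
    r"\b(let me|let's|i('| wi)ll|going to)\b.*\b(search|call|check|verify|look)\b",
    re.IGNORECASE,
)

def find_broken_turns(messages):
    """Return indices of assistant turns that state an intent but make no tool call.

    Assumes OpenAI-style message dicts: {"role", "content", "tool_calls"}.
    """
    broken = []
    for i, msg in enumerate(messages):
        if msg.get("role") != "assistant":
            continue
        announces = bool(INTENT.search(msg.get("content") or ""))
        calls_tool = bool(msg.get("tool_calls"))
        if announces and not calls_tool:
            broken.append(i)
    return broken

transcript = [
    {"role": "user", "content": "Find 5 coasters at the park."},
    {"role": "assistant",
     "content": "Let's do one more search to get a definitive list of 5."},
]
print(find_broken_turns(transcript))  # → [1]: announces a search, never calls a tool
```

Something like this could at least quantify how often the "announces but stops" pattern happens across a batch of transcripts.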
I hope this information is useful to you.
Thanks for the feedback. Could you share full repro steps, along with any relevant code, prompts, tool schemas, sampler settings, chat templates, and environment details? That would make it much easier for us to investigate these issues.
For sure. My environment is Windows 11, 32 GB RAM, RTX 2060.
I have made a reproducible test environment for you. First, please download https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/blob/main/google_gemma-4-26B-A4B-it-Q5_K_S.gguf . It has the latest fixes.
The test package includes the latest llama.cpp pre-built with CUDA support, a CLI prompt with the settings, an MCP server for web search, and a conversation.json you need to import into llama.cpp's webUI. Please download it here: https://www.file-upload.net/en/download-15593431/Reproduce-Gemma4Issue.zip.html
Step 1: Let's set up the MCP server for the web search first. In the MCP folder, run this command: `uvx mcp-proxy --named-server-config config.json --allow-origin "*" --port 8001 --stateless`
Step 2: Start llama.cpp using the command in the configuration text file (and give it the path to the Gemma 4 model you have just downloaded!). I used deterministic settings (temp 0) to make it easier to reproduce; the issue also happens with the recommended settings. Now, open the webUI in your browser at http://localhost:5001
Step 3: Now we will configure the DuckDuckGo web search MCP server. (The other MCPs included in the config are not relevant.) Click the gear in the top right corner -> MCP -> Add Server -> Server URL: "http://127.0.0.1:8001/servers/ddg-search/mcp" -> click Add, then Save.
Step 4: Now import the conversation.json. Click the gear again -> Import/Export -> Import Conversation -> import Reproduce-Conversation.json -> import "Hello Gemma", then save. You should see the conversation on the left after clicking the little icon in the top left corner. Click it and, this step is important, make sure the DDG search server is enabled by clicking the + to the left of the chat input box -> MCP -> enable it.
As you can see from the provided conversation, the first thinking -> web search trace was successful. The task was to find a list of 5 coasters at a theme park. However, open the last reasoning trace and look at its end: the model tells itself it will do another search to verify the results ("Let's do one more search to get a definitive list of 5."). But instead of calling the search tool again, it just generates the response to the user. In other cases, like my Hermes Agent example earlier, it tells the user it has to do X now but ends the generation instead of calling a tool. Both issues are very similar. I was using the recommended sampler settings (temp 1, top_k 64, top_p 0.95, min_p 0, rep pen 1).
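For anyone wanting to script this instead of using the webUI, here is a sketch of those recommended sampler settings as a request body for llama.cpp's OpenAI-compatible /v1/chat/completions endpoint. llama-server accepts extended sampler fields (top_k, min_p, repeat_penalty) alongside the standard OpenAI ones, but double-check the field names against your llama.cpp version; the model name and prompt are placeholders:

```python
import json

# Recommended Gemma sampler settings as a llama-server request payload.
# Field names follow llama.cpp's OpenAI-compatible server; verify against
# your build, as extended sampler fields have changed over time.
payload = {
    "model": "google_gemma-4-26B-A4B-it-Q5_K_S",  # placeholder model name
    "messages": [
        {"role": "user", "content": "Find a list of 5 coasters at the park."},
    ],
    "temperature": 1.0,
    "top_k": 64,
    "top_p": 0.95,
    "min_p": 0.0,
    "repeat_penalty": 1.0,  # i.e. repetition penalty effectively disabled
}
print(json.dumps(payload, indent=2))

# Then POST it, e.g.:
# requests.post("http://localhost:5001/v1/chat/completions", json=payload)
```

Running it through the API with `--verbose` on the server side makes it easy to diff the exact token stream between a run that calls the tool and one that stops early.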
Click regenerate and you will hopefully see the model making the same mistake on your end. If not, hit regenerate until it does: even though I used deterministic settings, the output is not 100% deterministic and sometimes skips the verification step that's needed to trigger the issue. For deeper analysis, you can run llama.cpp with the --verbose flag to see the exact token output.
This issue is really strange, because a lot of the time it works as expected, but in other cases it just doesn't. Maybe it has something to do with the complexity of the task.
This is also not a llama.cpp limitation, as tool calling works reliably with other models.
That covers the "wants to do X but doesn't" bug. Unfortunately I cannot reliably reproduce the looping issue, as it has only happened once, while the model was working on a big coding project with me. It kept repeating the same lines of code blocks endlessly.