nex-agi/Nex-N2-Pro · Create GGUFs if possible?

Jun 5

Hello,

Thank you guys for the work that you do! Would it be possible to release GGUF quants in various sizes for people to run under llama.cpp? It would be a great way to test these models.

Thank you.

Jun 5

I tried a quant I made myself for llama.cpp, but I received this error at the start of model loading. Would you guys know of a solution?

llama_model_load: error loading model: missing tensor 'blk.60.attn_norm.weight'

While converting, I noticed that the layers ran from 0 to 59, but llama.cpp is oddly expecting an extra layer 60.

hendrik289

Jun 5

I had the same issue with the nex-agi/Nex-N2-mini and vibecoded it. So I can't tell you what my agent exactly did but you can try it too to get a working gguf.

Jun 5

I had the same issue with the nex-agi/Nex-N2-mini and vibecoded it. So I can't tell you what my agent exactly did but you can try it too to get a working gguf.

Did you have to regenerate the GGUF, or just patch the llama.cpp project to get your current GGUF to work? If it's the patch, do you think you could upload your version of llama.cpp to GitHub for me to try as well?

Thank you!

Jun 7

edited Jun 7

config.json advertises mtp_num_hidden_layers: 1, but this uploaded model does not ship the corresponding MTP (Multi-Token Prediction) tensors. Try calling convert_hf_to_gguf.py with the --no-mtp parameter.
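As a rough way to confirm this diagnosis before converting, the sketch below compares what config.json advertises with the tensor names listed in the checkpoint index. It assumes a sharded checkpoint with a model.safetensors.index.json file, that num_hidden_layers and mtp_num_hidden_layers sit at the top level of config.json, and that MTP tensors either contain "mtp" in their name or use a layer index at or beyond num_hidden_layers. The function name check_mtp_tensors is hypothetical, and none of these naming assumptions have been verified against this repository.

```python
import json
import re
from pathlib import Path


def check_mtp_tensors(model_dir):
    """Report whether config.json advertises MTP layers that the checkpoint lacks."""
    model_dir = Path(model_dir)
    config = json.loads((model_dir / "config.json").read_text())
    # Assumption: a sharded checkpoint whose index maps tensor names to shard files.
    index = json.loads((model_dir / "model.safetensors.index.json").read_text())

    n_main = config["num_hidden_layers"]
    n_mtp = config.get("mtp_num_hidden_layers", 0)

    # Assumption: decoder tensors are named like "<prefix>.layers.<n>.<rest>".
    layer_re = re.compile(r"(?:^|\.)layers\.(\d+)\.")

    shipped = False
    for name in index["weight_map"]:
        match = layer_re.search(name)
        beyond_main = match is not None and int(match.group(1)) >= n_main
        # Assumption: MTP tensors carry "mtp" in the name or sit past the main stack.
        if "mtp" in name or beyond_main:
            shipped = True
            break

    return {
        "advertised_mtp_layers": n_mtp,
        "mtp_tensors_shipped": shipped,
        "needs_no_mtp": n_mtp > 0 and not shipped,
    }
```

If needs_no_mtp comes back True, the situation matches the one described above: the config promises an extra layer that the uploaded weights do not contain.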

Arkovski

Jun 7

Guys, we need IQ3_XXS and IQ4_XS <3

Jun 7

edited Jun 7

There are more issues: The chat_template.jinja does not define the tags in the form expected by llama.cpp. This file needs to be patched.

Diff:
103c103
< {{- '<|im_start|>' + message.role + '\n' + content }}
---
> {{- '<|im_start|>' + message.role + '\n<think></think>' + content }}
150c150
< {{- '<think>\n\n</think>\n\n' }}
---
> {{- '<think></think>' }}
152c152
< {{- '<think>' }}
---
> {{- '<think>\n' }}

Jun 7

ok, progress update, I have done what J8son91 mentioned about using --no-mtp to get rid of the extra layer error and it worked, the model loaded. I have also made the diff changes to the chat template suggested by the same user as well. However, when speaking to the model, gibberish is all that comes out. I am not sure what is missing.

hendrik289

Jun 7

Try this chat template https://pastebin.com/ay9hkyNc that worked for me with the Mini version

Jun 8

edited Jun 8

ok, progress update, I have done what J8son91 mentioned about using --no-mtp to get rid of the extra layer error and it worked, the model loaded. I have also made the diff changes to the chat template suggested by the same user as well. However, when speaking to the model, gibberish is all that comes out. I am not sure what is missing.

Convert to GGUF:

python3 convert_hf_to_gguf.py /storage/models/nex-agi/Nex-N2-Pro/ \
    --outtype f16 \
    --outfile /storage/models/nex-agi/Nex-N2-Pro_f16.gguf \
    --no-mtp

I have only tried the Q8_0 quantization because quality is more important than speed for me (batch processing):

./llama-quantize /storage/models/nex-agi/Nex-N2-Pro_f16.gguf /storage/models/nex-agi/Nex-N2-Pro_Q8_0.gguf Q8_0

Server:

./llama-server \
     --model /storage/models/nex-agi/Nex-N2-Pro_Q8_0.gguf \
     --port 39507 \
     --host 127.0.0.1 \
     --no-webui \
     --offline \
     -c 262144 \
     -ctk q8_0 -ctv q8_0 \
     --reasoning on \
     --reasoning-format deepseek \
     --chat-template-file /storage/models/nex-agi/Nex-N2-Pro/chat_template.llamacpp-thinking.jinja \
     --parallel 1 \
     --threads 64 \
     --threads-batch 64 \
     --log-verbosity 4 \
     --flash-attn on \
     -ub 4096 -b 4096 \
     --no-mmap \
     -ngl 6
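To check quickly whether the server returns coherent text, a minimal client sketch follows. It assumes the server started by the command above is reachable on 127.0.0.1:39507 and exposes the OpenAI-style /v1/chat/completions route, and that with --reasoning-format deepseek the thinking text arrives in a separate reasoning_content field of the message. The helper names and the "llama-server" model id are placeholders, not part of llama.cpp.

```python
import json
import urllib.request

# Host and port taken from the llama-server command above.
BASE_URL = "http://127.0.0.1:39507/v1"


def build_request(prompt, max_tokens=512):
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {
        "model": "llama-server",  # placeholder id, assumed to be ignored by a single-model server
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def split_reply(response):
    """Return (reasoning, answer) from a decoded chat completion response."""
    message = response["choices"][0]["message"]
    # Assumption: reasoning is returned separately as "reasoning_content".
    return message.get("reasoning_content") or "", message.get("content") or ""


def ask(prompt):
    """Send one user message and return (reasoning, answer)."""
    with urllib.request.urlopen(build_request(prompt), timeout=600) as resp:
        return split_reply(json.load(resp))
```

Calling ask("Say hello in one sentence.") against a running server and printing both parts makes it easy to see whether thinking text leaks into the answer or whether the output is gibberish.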

GitHub Copilot Chat:
~/.config/Code/User/chatLanguageModels.json

[
    {
        "name": "Custom Endpoint",
        "vendor": "customendpoint",
        "apiType": "chat-completions",
        "models": [
            {
                "id": "llama-server",
                "name": "llama-server",
                "url": "http://localhost:39507/v1",
                "toolCalling": true,
                "vision": true,
                "maxInputTokens": 128000,
                "maxOutputTokens": 16000
            }
        ]
    }
]

Jun 8

Here is the gguf version: https://hugston.com/models/hugston-nex-agi-nex-n2-proq4-k-m

Try it and let me know. If it is any good, I will upload it to HF.

Jun 8

Here is the gguf version: https://hugston.com/models/hugston-nex-agi-nex-n2-proq4-k-m

Try it and let me know. If it is any good, I will upload it to HF.

Is it possible for you to make a 2-bit quant? With this quant, on my system, I can barely fit any usable context, but within 2-bit range, I can fit full context to use in agents. That's what I used for the original Qwen 397B as well.

Thank you for your help by the way!

Jun 8

edited Jun 8

Is it possible for you to make a 2-bit quant?

A 2-bit quant would need an imatrix, which degrades model quality (according to our research), so we skip that. As soon as time allows, we can try an IQ3_XXS or similar quant that does not need an imatrix. It would be great to get some feedback on the current one first.

Jun 8

I have found another bug and these bugs need to be fixed before we can create useful quantizations of this model.

In chat_template.jinja, every message ends with <|im_end|> and tokenizer_config.json declares:

"eos_token": "<|im_end|>"

Now according to tokenizer.json, <|im_end|> = 248046 but when you look into config.json you find

"eos_token_id": 248044,

which is a contradiction (BUG!). This means that even if the model signals that it is done talking (with the <|im_end|> token), the inference server will ignore this and continue to generate more tokens, producing weird nonsense.
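The comparison described above can be sketched as a small script. It assumes the usual Hugging Face file layout: eos_token in tokenizer_config.json (either a plain string or an object with a "content" field), token ids under added_tokens or model.vocab in tokenizer.json, and eos_token_id in config.json as a single integer or a list. The function name check_eos is hypothetical, and the sketch only detects the mismatch; it makes no claim about which of the files a given converter actually reads.

```python
import json
from pathlib import Path


def check_eos(model_dir):
    """Compare the id of the declared EOS token string with config.json's eos_token_id."""
    model_dir = Path(model_dir)
    config = json.loads((model_dir / "config.json").read_text())
    tok_cfg = json.loads((model_dir / "tokenizer_config.json").read_text())
    tokenizer = json.loads((model_dir / "tokenizer.json").read_text())

    # eos_token may be a plain string or an object with a "content" field.
    eos = tok_cfg["eos_token"]
    if isinstance(eos, dict):
        eos = eos["content"]

    # Look the string up among the added (special) tokens first, then the base vocab.
    token_id = None
    for entry in tokenizer.get("added_tokens", []):
        if entry["content"] == eos:
            token_id = entry["id"]
            break
    if token_id is None:
        token_id = tokenizer.get("model", {}).get("vocab", {}).get(eos)

    # eos_token_id may be a single integer or a list of integers.
    declared = config.get("eos_token_id")
    declared_ids = declared if isinstance(declared, list) else [declared]

    return {
        "eos_token": eos,
        "tokenizer_id": token_id,
        "config_ids": declared_ids,
        "consistent": token_id in declared_ids,
    }
```

With the values quoted in this post, the result would show tokenizer_id 248046 against config_ids [248044] and consistent set to False.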

gopi87

Jun 9

Here is the gguf version: https://hugston.com/models/hugston-nex-agi-nex-n2-proq4-k-m

Try it and let me know. If it is any good, I will upload it to HF.

The download really sucks there. I hope you will upload it to HF.

Jun 9

edited Jun 9

@gopi87

The download really sucks there. I hope you will upload it to HF.

Thank you for the feedback. Can you be more specific about the issues you encountered? Was it low speed, a broken link, timeouts, or limits?
Going by other users' reports, it is supposed to run at over 100 MB/s, so the full 225 GB file should download in about 25 minutes.
Of course we can try to upload it here (today), but since it is a single file we get timeouts and cutoffs many times, so we thought users could test it first and, if it is good enough, we would then upload it to HF.

Please let us know what the real issue was and thanks again for your feedback.
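As a sanity check on the download-time estimate, the arithmetic can be written out. The sketch assumes "mb/s" means decimal megabytes per second (not megabits) and that the file is 225 decimal gigabytes; download_minutes is a hypothetical helper name.

```python
def download_minutes(size_gb, rate_mb_per_s):
    """Minutes needed to transfer size_gb (decimal GB) at rate_mb_per_s (decimal MB/s)."""
    seconds = size_gb * 1000 / rate_mb_per_s  # 1 GB = 1000 MB
    return seconds / 60


print(download_minutes(225, 100))  # 37.5
print(download_minutes(225, 150))  # 25.0
```

Under these assumptions, exactly 100 MB/s gives about 37.5 minutes, and the 25-minute figure corresponds to a sustained rate of about 150 MB/s.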

gopi87

Jun 9

@gopi87

The download really sucks there. I hope you will upload it to HF.

Thank you for the feedback. Can you be more specific about the issues you encountered? Was it low speed, a broken link, timeouts, or limits?
Going by other users' reports, it is supposed to run at over 100 MB/s, so the full 225 GB file should download in about 25 minutes.
Of course we can try to upload it here (today), but since it is a single file we get timeouts and cutoffs many times, so we thought users could test it first and, if it is good enough, we would then upload it to HF.

Please let us know what the real issue was and thanks again for your feedback.

Thanks for that. The reason HF is better is that I can download at around 70 MB/s, and a resume function is also available.

Jun 9

It is very painful and time-consuming to upload large files to HF (as it is everywhere else). As you can see, ~3 hours later we are at ~35%, hoping that it will not break or time out.
We will do everything to support Hugging Face because it deserves it and is a place of great value. Still, we would like to skip this pain if possible; we simply do not have the time and resources to afford it.
So unless we learn of a new solution, after uploading this model we will avoid uploading models over 50 GB.

gopi87

Jun 9

edited Jun 9

It is very painful and time-consuming to upload large files to HF (as it is everywhere else). As you can see, ~3 hours later we are at ~35%, hoping that it will not break or time out.
We will do everything to support Hugging Face because it deserves it and is a place of great value. Still, we would like to skip this pain if possible; we simply do not have the time and resources to afford it.
So unless we learn of a new solution, after uploading this model we will avoid uploading models over 50 GB.

Yep, I think 50 GB files will be perfect. I hope the long one doesn't break.

xldistance

Jun 9

@mradermacher Can you create a GGUF quantization for this model?

Jun 9

edited Jun 11

00index

Nex AGI org Jun 12

Hi everyone. Thanks for all the effort digging into GGUF support, and especially to @J8son91 and @InfernalDread for the detailed repros. A few clarifications so people don't go down the wrong path with template edits:

Please don't modify chat_template.jinja. The model was trained strictly on the current template, so editing the <think> tags (e.g. adding \n after <think>, or inserting <think></think>) deviates from the training-time format and can degrade output quality. The "thinking content blends into normal text" symptom that motivated these edits is actually a bug in llama.cpp's reasoning parser, not the template — we investigated this in PR #7 and confirmed the root cause there.

The correct fix is to use our patched llama.cpp, which works with the unmodified GGUF and unmodified template:

Binaries: https://github.com/nex-agi/llama.cpp/releases/tag/nex-b9596-fix-b9599-9cd1771
Docker: docker pull ghcr.io/nex-agi/llama.cpp:server-cuda-nex-b9596-fix-b9598-8c0d5c9 (more variants under https://github.com/orgs/nex-agi/packages)
We're submitting the patch upstream to llama.cpp shortly; once merged, stock builds will work out of the box and we'll update this thread with the PR link.