Instructions to use avar6/Nemotron3.3 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use avar6/Nemotron3.3 with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("text-generation", model="avar6/Nemotron3.3")

    # Load model directly
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("avar6/Nemotron3.3")
    model = AutoModelForCausalLM.from_pretrained("avar6/Nemotron3.3")
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use avar6/Nemotron3.3 with vLLM:
Install from pip and serve model
    # Install vLLM from pip:
    pip install vllm

    # Start the vLLM server:
    vllm serve "avar6/Nemotron3.3"

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:8000/v1/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "avar6/Nemotron3.3",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
      }'
Use Docker
docker model run hf.co/avar6/Nemotron3.3
- SGLang
How to use avar6/Nemotron3.3 with SGLang:
Install from pip and serve model
    # Install SGLang from pip:
    pip install sglang

    # Start the SGLang server:
    python3 -m sglang.launch_server \
      --model-path "avar6/Nemotron3.3" \
      --host 0.0.0.0 \
      --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "avar6/Nemotron3.3",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
      }'
Use Docker images
    docker run --gpus all \
      --shm-size 32g \
      -p 30000:30000 \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      --env "HF_TOKEN=<secret>" \
      --ipc=host \
      lmsysorg/sglang:latest \
      python3 -m sglang.launch_server \
        --model-path "avar6/Nemotron3.3" \
        --host 0.0.0.0 \
        --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "avar6/Nemotron3.3",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
      }'
- Docker Model Runner
How to use avar6/Nemotron3.3 with Docker Model Runner:
docker model run hf.co/avar6/Nemotron3.3
Does this actually work?
So Nemotron was supposed to give "more helpful" responses compared to Llama 3.1, and Llama 3.3 is supposed to be smarter than Llama 3.1.
I tried a 2-bit quantization of Nemotron (3.1) and was quite impressed.
Having tried it, does your Nemotron 3.3 actually seem to be smarter than Nemotron 3.1 and more helpful than Llama 3.3?
Do you think NVIDIA will release their own "Nemotron 3.3", applying their original technique to Llama 3.3?
Have you tried using the non-Instruct version of Llama 3.3 instead of the Instruct version, as one of the inputs? It's just that from what I can tell, it seems you're effectively applying the "Instruct" vector twice, once as part of the Nemotron input, and once as part of the Llama 3.3 input, if that makes sense.
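To make the "Instruct vector applied twice" worry concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: each "model" is a single made-up number, and the names `base`, `instruct_delta`, and `nemotron_delta` are hypothetical. It says nothing about how this particular merge was actually configured; it only shows how the choice of reference model decides whether the instruction-tuning change is counted once or twice.

```python
# Toy illustration with made-up numbers; each "model" is one scalar weight.
base = 1.0             # hypothetical shared pretrained base
instruct_delta = 0.25  # hypothetical change introduced by instruction tuning
nemotron_delta = 0.5   # hypothetical change introduced by Nemotron's extra training

llama_instruct = base + instruct_delta
nemotron = llama_instruct + nemotron_delta  # assumed to be built on the Instruct model


def task_vector(model, reference):
    """Difference between a fine-tuned model and the model it is measured against."""
    return model - reference


# Both task vectors measured against the base: instruct_delta is counted twice.
merged_vs_base = base + task_vector(nemotron, base) + task_vector(llama_instruct, base)

# Nemotron's vector measured against the Instruct model: instruct_delta is counted once.
merged_vs_instruct = llama_instruct + task_vector(nemotron, llama_instruct)

print(merged_vs_base - base)      # 2 * 0.25 + 0.5
print(merged_vs_instruct - base)  # 0.25 + 0.5
```

In this toy setup the double counting only shows up when both inputs are measured against the non-Instruct base, which is why the choice of reference model matters.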
Oh, I wasn't expecting anyone to pay attention to this. I've never done this before; I was just experimenting based on guidance from the drummer's Discord. I wasn't able to quant it into a GGUF because it said I was missing a file that I could not find, and Hugging Face kept erroring out, so I gave up.
The drummer's new versions of Nautilus on the beaverAI page are made with Llama 3.3. I'd suggest trying those instead.
Oh, you didn't manage to quant it?
Then are you aware of these quants?
https://huggingface.co/mradermacher/Nemotron3.3-GGUF
https://huggingface.co/mradermacher/Nemotron3.3-i1-GGUF
That's your merge. mradermacher quantized it.
Oh, no, I wasn't aware. I didn't think to ask anyone to quant it. But now that it's been quantized, I will try it.
Though for the record, people were arguing about whether this merge method would work. I was just copying someone's instructions. It was supposed to "subtract" Llama 3.1 and add 3.3. Per the card, though, it appears I merged 3.1, 3.3, and Nemotron.
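For anyone following along, the difference between "subtract 3.1 and add 3.3" and a plain three-way merge can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weights are tiny made-up lists, `swap_base` and `average` are hypothetical helper names, and the equal-weight average is just one possible reading of "merged 3.1, 3.3, and Nemotron", not the recipe that was actually used.

```python
def swap_base(finetuned, old_base, new_base):
    """Return finetuned - old_base + new_base, parameter by parameter."""
    return {
        name: [f - o + n
               for f, o, n in zip(finetuned[name], old_base[name], new_base[name])]
        for name in finetuned
    }


def average(*models):
    """Equal-weight average of several models with identical parameter names."""
    return {
        name: [sum(vals) / len(vals) for vals in zip(*(m[name] for m in models))]
        for name in models[0]
    }


# Made-up two-element "weights" standing in for full checkpoints.
llama31 = {"w": [1.0, 2.0]}
llama33 = {"w": [1.5, 2.5]}
nemotron31 = {"w": [1.25, 2.25]}  # hypothetical fine-tune of llama31

swapped = swap_base(nemotron31, llama31, llama33)  # keeps Nemotron's change, moves it onto 3.3
blended = average(llama31, llama33, nemotron31)    # dilutes all three into one another

print(swapped)
print(blended)
```

The two results differ: the swap carries the full Nemotron-minus-3.1 difference over to 3.3, while the average shrinks it and also pulls the result back toward 3.1.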