How to use from
llama.cpp
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf ReallyFloppyPenguin/OpenCodeReasoning-Nemotron-14B-GGUF:
# Run inference directly in the terminal:
llama-cli -hf ReallyFloppyPenguin/OpenCodeReasoning-Nemotron-14B-GGUF:
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf ReallyFloppyPenguin/OpenCodeReasoning-Nemotron-14B-GGUF:
# Run inference directly in the terminal:
llama-cli -hf ReallyFloppyPenguin/OpenCodeReasoning-Nemotron-14B-GGUF:
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf ReallyFloppyPenguin/OpenCodeReasoning-Nemotron-14B-GGUF:
# Run inference directly in the terminal:
./llama-cli -hf ReallyFloppyPenguin/OpenCodeReasoning-Nemotron-14B-GGUF:
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf ReallyFloppyPenguin/OpenCodeReasoning-Nemotron-14B-GGUF:
# Run inference directly in the terminal:
./build/bin/llama-cli -hf ReallyFloppyPenguin/OpenCodeReasoning-Nemotron-14B-GGUF:
Use Docker
docker model run hf.co/ReallyFloppyPenguin/OpenCodeReasoning-Nemotron-14B-GGUF:
Quick Links

nvidia/OpenCodeReasoning-Nemotron-14B - GGUF

This repository contains GGUF quantizations of nvidia/OpenCodeReasoning-Nemotron-14B.

About GGUF

GGUF is a quantization method that allows you to run large language models on consumer hardware by reducing the precision of the model weights.

Files

Filename Quant type File Size Description
model-f16.gguf f16 Large Original precision
model-q4_0.gguf Q4_0 Small 4-bit quantization
model-q4_1.gguf Q4_1 Small 4-bit quantization (higher quality)
model-q5_0.gguf Q5_0 Medium 5-bit quantization
model-q5_1.gguf Q5_1 Medium 5-bit quantization (higher quality)
model-q8_0.gguf Q8_0 Large 8-bit quantization

Usage

You can use these models with llama.cpp or any other GGUF-compatible inference engine.

llama.cpp

./llama-cli -m model-q4_0.gguf -p "Your prompt here"

Python (using llama-cpp-python)

from llama_cpp import Llama

llm = Llama(model_path="model-q4_0.gguf")
output = llm("Your prompt here", max_tokens=512)
print(output['choices'][0]['text'])

Original Model

This is a quantized version of nvidia/OpenCodeReasoning-Nemotron-14B. Please refer to the original model card for more information about the model's capabilities, training data, and usage guidelines.

Conversion Details

  • Converted using llama.cpp
  • Original model downloaded from Hugging Face
  • Multiple quantization levels provided for different use cases

License

This model inherits the license from the original model. Please check the original model's license for usage terms.

Downloads last month
54
GGUF
Model size
15B params
Architecture
qwen2
Hardware compatibility
Log In to add your hardware

4-bit

5-bit

8-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ReallyFloppyPenguin/OpenCodeReasoning-Nemotron-14B-GGUF

Base model

Qwen/Qwen2.5-14B
Quantized
(7)
this model

Collection including ReallyFloppyPenguin/OpenCodeReasoning-Nemotron-14B-GGUF