Instructions to use AmpereComputing/granite-3.3-2b-instruct-gguf with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use AmpereComputing/granite-3.3-2b-instruct-gguf with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="AmpereComputing/granite-3.3-2b-instruct-gguf",
    filename="granite-3.3-2b-instruct-Q8R16.gguf",
)
# The model card defines no example input; messages must be a list of chat turns
# (the prompt below is illustrative only):
llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
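create_chat_completion returns an OpenAI-style dictionary, so the generated text can be read from the first choice; a minimal sketch under the same assumptions as above (illustrative prompt, Q8R16 file already downloaded):

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize IBM Granite 3.3 2B in one sentence."}]
)
# llama-cpp-python mirrors the OpenAI chat-completion schema:
print(response["choices"][0]["message"]["content"])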
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use AmpereComputing/granite-3.3-2b-instruct-gguf with llama.cpp:
Install from brew
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf AmpereComputing/granite-3.3-2b-instruct-gguf

# Run inference directly in the terminal:
llama-cli -hf AmpereComputing/granite-3.3-2b-instruct-gguf
Install from WinGet (Windows)
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf AmpereComputing/granite-3.3-2b-instruct-gguf

# Run inference directly in the terminal:
llama-cli -hf AmpereComputing/granite-3.3-2b-instruct-gguf
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf AmpereComputing/granite-3.3-2b-instruct-gguf

# Run inference directly in the terminal:
./llama-cli -hf AmpereComputing/granite-3.3-2b-instruct-gguf
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf AmpereComputing/granite-3.3-2b-instruct-gguf

# Run inference directly in the terminal:
./build/bin/llama-cli -hf AmpereComputing/granite-3.3-2b-instruct-gguf
Use Docker
docker model run hf.co/AmpereComputing/granite-3.3-2b-instruct-gguf
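However it is installed, llama-server exposes an OpenAI-compatible API (http://localhost:8080 by default), so any OpenAI client can drive the model. A minimal Python sketch using the openai package; the package install, host, and port are assumptions, not part of the model card:

# pip install openai
from openai import OpenAI

# llama-server speaks the OpenAI chat-completions protocol; the API key is unused but required by the client
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

response = client.chat.completions.create(
    model="AmpereComputing/granite-3.3-2b-instruct-gguf",  # informational for a single-model server
    messages=[{"role": "user", "content": "Hello, Granite!"}],
)
print(response.choices[0].message.content)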
- LM Studio
- Jan
- Ollama
How to use AmpereComputing/granite-3.3-2b-instruct-gguf with Ollama:
ollama run hf.co/AmpereComputing/granite-3.3-2b-instruct-gguf
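Beyond the interactive CLI, the same model can be called from Python through Ollama's API; a minimal sketch using the ollama client package (the package install and a running Ollama daemon are assumptions):

# pip install ollama
import ollama

response = ollama.chat(
    model="hf.co/AmpereComputing/granite-3.3-2b-instruct-gguf",
    messages=[{"role": "user", "content": "Give one fact about Ampere CPUs."}],
)
# The chat response carries the reply under message.content
print(response["message"]["content"])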
- Unsloth Studio
How to use AmpereComputing/granite-3.3-2b-instruct-gguf with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh

# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for AmpereComputing/granite-3.3-2b-instruct-gguf to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex

# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for AmpereComputing/granite-3.3-2b-instruct-gguf to start chatting
Use Hugging Face Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for AmpereComputing/granite-3.3-2b-instruct-gguf to start chatting
- Pi
How to use AmpereComputing/granite-3.3-2b-instruct-gguf with Pi:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf AmpereComputing/granite-3.3-2b-instruct-gguf
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent

# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "AmpereComputing/granite-3.3-2b-instruct-gguf" }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
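Before starting Pi, it can help to confirm that the baseUrl configured in ~/.pi/agent/models.json is actually serving; a minimal check with requests (the default llama-server address of http://localhost:8080 is an assumption):

# pip install requests
import requests

# llama-server exposes an OpenAI-compatible /v1/models endpoint; a 200 response
# means the endpoint Pi will call is live
resp = requests.get("http://localhost:8080/v1/models", timeout=5)
resp.raise_for_status()
print(resp.json())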
- Hermes Agent
How to use AmpereComputing/granite-3.3-2b-instruct-gguf with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf AmpereComputing/granite-3.3-2b-instruct-gguf
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default AmpereComputing/granite-3.3-2b-instruct-gguf
Run Hermes
hermes
- Docker Model Runner
How to use AmpereComputing/granite-3.3-2b-instruct-gguf with Docker Model Runner:
docker model run hf.co/AmpereComputing/granite-3.3-2b-instruct-gguf
- Lemonade
How to use AmpereComputing/granite-3.3-2b-instruct-gguf with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull AmpereComputing/granite-3.3-2b-instruct-gguf
Run and chat with the model
lemonade run user.granite-3.3-2b-instruct-gguf-{{QUANT_TAG}}
List all available models
lemonade list
Ampere® optimized llama.cpp
An Ampere® optimized build of llama.cpp with full support for the rich collection of GGUF models available on Hugging Face: GGUF models
For best results we recommend using models in our custom quantization formats, available here: AmpereComputing HF
This Docker image can be run on bare metal Ampere® CPUs and Ampere® based VMs available in the cloud.
Release notes and binary executables are available on our GitHub
Starting container
The default entrypoint runs the llama.cpp server binary, mimicking the behavior of the original llama.cpp server image: docker image
To launch a shell instead, run:
sudo docker run --privileged=true --name llama --entrypoint /bin/bash -it amperecomputingai/llama.cpp:latest
A quick start example is presented at Docker container launch.
Make sure to visit us at Ampere Solutions Portal!
Quantization
The Ampere® optimized build of llama.cpp supports two additional quantization methods, Q4_K_4 and Q8R16, which offer model size and perplexity similar to Q4_K and Q8_0, respectively, while running up to 1.5-2x faster at inference.
First, you'll need to convert the model to the GGUF format using this script:
python3 convert-hf-to-gguf.py [path to the original model] --outtype [f32, f16, bf16 or q8_0] --outfile [output path]
For example:
python3 convert-hf-to-gguf.py path/to/llama2 --outtype f16 --outfile llama-2-7b-f16.gguf
Next, you can quantize the model using the following command:
./llama-quantize [input file] [output file] [quantization method]
For example:
./llama-quantize llama-2-7b-f16.gguf llama-2-7b-Q8R16.gguf Q8R16
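As a quick sanity check, the quantized file can be loaded locally with llama-cpp-python; a minimal sketch, assuming the Ampere-optimized build is installed (needed for Q8R16 support) and reusing the hypothetical output file name from the example above:

# !pip install llama-cpp-python  (Ampere-optimized build, for Q8R16 support)
from llama_cpp import Llama

# Load the freshly quantized model from a local path instead of the Hugging Face Hub
llm = Llama(model_path="llama-2-7b-Q8R16.gguf")

output = llm("Q: Name the two Ampere quantization formats. A:", max_tokens=32)
print(output["choices"][0]["text"])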
Support
Please contact us at ai-support@amperecomputing.com
LEGAL NOTICE
By accessing, downloading or using this software and any required dependent software (the “Ampere AI Software”), you agree to the terms and conditions of the software license agreements for the Ampere AI Software, which may also include notices, disclaimers, or license terms for third party software included with the Ampere AI Software. Please refer to the Ampere AI Software EULA v1.6 or other similarly-named text file for additional details.
Model tree for AmpereComputing/granite-3.3-2b-instruct-gguf
Base model: ibm-granite/granite-3.3-2b-base
