Instructions to use QuantFactory/NexusRaven-V2-13B-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use QuantFactory/NexusRaven-V2-13B-GGUF with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="QuantFactory/NexusRaven-V2-13B-GGUF",
    filename="NexusRaven-V2-13B.Q2_K.gguf",
)
output = llm("Once upon a time,", max_tokens=512, echo=True)
print(output)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use QuantFactory/NexusRaven-V2-13B-GGUF with llama.cpp:
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf QuantFactory/NexusRaven-V2-13B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf QuantFactory/NexusRaven-V2-13B-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf QuantFactory/NexusRaven-V2-13B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf QuantFactory/NexusRaven-V2-13B-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf QuantFactory/NexusRaven-V2-13B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf QuantFactory/NexusRaven-V2-13B-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf QuantFactory/NexusRaven-V2-13B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf QuantFactory/NexusRaven-V2-13B-GGUF:Q4_K_M
Use Docker
docker model run hf.co/QuantFactory/NexusRaven-V2-13B-GGUF:Q4_K_M
- LM Studio
- Jan
- vLLM
How to use QuantFactory/NexusRaven-V2-13B-GGUF with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "QuantFactory/NexusRaven-V2-13B-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "QuantFactory/NexusRaven-V2-13B-GGUF",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
Use Docker
docker model run hf.co/QuantFactory/NexusRaven-V2-13B-GGUF:Q4_K_M
- Ollama
How to use QuantFactory/NexusRaven-V2-13B-GGUF with Ollama:
ollama run hf.co/QuantFactory/NexusRaven-V2-13B-GGUF:Q4_K_M
- Unsloth Studio
How to use QuantFactory/NexusRaven-V2-13B-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for QuantFactory/NexusRaven-V2-13B-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for QuantFactory/NexusRaven-V2-13B-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for QuantFactory/NexusRaven-V2-13B-GGUF to start chatting
- Docker Model Runner
How to use QuantFactory/NexusRaven-V2-13B-GGUF with Docker Model Runner:
docker model run hf.co/QuantFactory/NexusRaven-V2-13B-GGUF:Q4_K_M
- Lemonade
How to use QuantFactory/NexusRaven-V2-13B-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull QuantFactory/NexusRaven-V2-13B-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.NexusRaven-V2-13B-GGUF-Q4_K_M
List all available models
lemonade list
QuantFactory/NexusRaven-V2-13B-GGUF
This is a quantized version of Nexusflow/NexusRaven-V2-13B, created using llama.cpp.
NexusRaven-13B: Surpassing GPT-4 for Zero-shot Function Calling
Nexusflow HF - Nexusflow Discord - NexusRaven-V2 blog post - Prompting Notebook CoLab - Leaderboard - Real-World Demo - NexusRaven-V2-13B Github
Introducing NexusRaven-V2-13B
NexusRaven is an open-source and commercially viable function calling LLM that surpasses the state-of-the-art in function calling capabilities.
💪 Versatile Function Calling Capability: NexusRaven-V2 is capable of generating single function calls, nested calls, and parallel calls in many challenging cases.
🤓 Fully Explainable: NexusRaven-V2 is capable of generating very detailed explanations for the function calls it generates. This behavior can be turned off to save tokens during inference.
📊 Performance Highlights: NexusRaven-V2 surpasses GPT-4 by 7% in function calling success rates in human-generated use cases involving nested and composite functions.
🔧 Generalization to the Unseen: NexusRaven-V2 has never been trained on the functions used in evaluation.
🔥 Commercially Permissive: The training of NexusRaven-V2 does not involve any data generated by proprietary LLMs such as GPT-4. You have full control of the model when deployed in commercial applications.
Please check out the following links!
NexusRaven-V2 model usage
NexusRaven-V2 accepts a list of Python functions.
These Python functions can do anything (including sending GET/POST requests to external APIs!).
Two things are required to generate a function call: the Python function signature and an appropriate docstring.
NexusRaven-V2 also does best on functions with arguments, so please only provide Raven with functions that take arguments.
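As a minimal sketch of the two requirements above, the prompt can be assembled from ordinary Python function objects with the standard `inspect` module. The helper names `render_function` and `build_prompt` are hypothetical (they are not part of any NexusRaven tooling), and the layout simply mirrors the "Function:" / "User Query:" style used in this card's Quickstart.

```python
import inspect


def render_function(func):
    """Render one function as a 'Function:' block (signature plus docstring)."""
    signature = f"def {func.__name__}{inspect.signature(func)}:"
    doc = inspect.getdoc(func) or ""
    # Indent the docstring by four spaces; keep blank lines empty.
    doc_lines = ["    " + line if line else "" for line in doc.splitlines()]
    body = "\n".join(['    """', *doc_lines, '    """'])
    return f"Function:\n{signature}\n{body}\n"


def build_prompt(functions, query):
    """Join all function blocks and append the user query (hypothetical helper)."""
    blocks = "\n".join(render_function(func) for func in functions)
    return f"\n{blocks}\nUser Query: {query}<human_end>\n"


def get_coordinates_from_city(city_name):
    """
    Fetches the latitude and longitude of a given city name.

    Args:
    city_name (str): The name of the city.
    """
    # Hypothetical stub: only the signature and docstring are used here.
    raise NotImplementedError


prompt = build_prompt([get_coordinates_from_city], "Where is Seattle?")
print(prompt)
```

Only the signature and docstring reach the model, so the function bodies can stay on your side and do whatever the application needs.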
NexusRaven-V2's Capabilities
NexusRaven-V2 is capable of generating deeply nested function calls, parallel function calls, and simple single calls. It can also justify the function calls it generates. If you would like to generate the call only, please set a stop criterion of "<bot_end>". Otherwise, please allow NexusRaven-V2 to run until its stop token (i.e. "</s>").
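If the full output (call plus explanation) has already been generated, the call can still be separated afterwards. The sketch below assumes the output follows the "Call: ...<bot_end>" shape shown in this card's example; `extract_call` is a hypothetical helper name.

```python
def extract_call(generated_text):
    """Return only the function call from a raw NexusRaven-V2 completion."""
    # Everything after the first <bot_end> is the optional explanation.
    call_part = generated_text.split("<bot_end>", 1)[0].strip()
    # Drop the leading "Call:" label if the model emitted it.
    if call_part.startswith("Call:"):
        call_part = call_part[len("Call:"):]
    return call_part.strip()


raw = (
    "Call: get_weather_data(coordinates=get_coordinates_from_city(city_name='Seattle'))<bot_end>\n"
    "Thought: The function call answers the question ..."
)
print(extract_call(raw))
```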
Quick Start Prompting Guide
Please refer to our notebook, How-To-Prompt.ipynb, for more advanced tutorials on using NexusRaven-V2!
- When giving docstrings to Raven, please provide well-indented, detailed, and well-written docstrings as this can help accuracy.
- Raven does better when all functions provided to it have arguments, either required or optional (i.e. func(dummy_arg) is preferred over func()), as this can help accuracy.
- We strongly recommend setting sampling to False when prompting NexusRaven-V2.
- We strongly recommend a very low temperature (~0.001).
- We strongly recommend following the prompting style below.
When handling irrelevant user queries, users have noticed that specifying a "no-op" function with arguments works best. For example, something like this might work:
def no_relevant_function(user_query : str):
    """
    Call this when no other provided function can be called to answer the user query.
    Args:
    user_query: The user_query that cannot be answered by any other function calls.
    """
Please ensure to provide an argument to this function, as Raven works best on functions with arguments.
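On the application side, the no-op call still has to be recognized before anything is executed. One possible way, sketched here with the standard `ast` module, is to read the name of the outermost call; `called_function_name` and `is_no_op` are hypothetical helpers, and the input is assumed to be the bare call text (without the "Call:" label or `<bot_end>`).

```python
import ast


def called_function_name(call_text):
    """Return the name of the outermost function call, or None if there is none."""
    # ast.parse raises SyntaxError if the model produced malformed Python.
    node = ast.parse(call_text.strip(), mode="eval").body
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
        return node.func.id
    return None


def is_no_op(call_text, no_op_name="no_relevant_function"):
    """True when the model chose the no-op function instead of a real one."""
    return called_function_name(call_text) == no_op_name


print(is_no_op("no_relevant_function(user_query='Tell me a joke')"))
print(is_no_op("get_weather_data(coordinates=(47.6, -122.3))"))
```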
For parallel calls, due to the model being targeted for industry use, you can "enable" parallel calls by adding this into the prompt:
"Setting: Allowed to issue multiple calls with semicolon\n"
This can be added above the User Query to "allow" the model to use parallel calls. Otherwise, the model will focus primarily on nested and single calls.
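A rough sketch of both halves of this, under two assumptions: the prompt contains a "User Query:" line, and parallel calls come back as Python calls separated by semicolons. The helper names are hypothetical. Splitting with `ast` rather than `str.split(";")` keeps semicolons inside string arguments intact.

```python
import ast

PARALLEL_SETTING = "Setting: Allowed to issue multiple calls with semicolon\n"


def add_parallel_setting(prompt):
    """Insert the setting line directly above the last 'User Query:' line."""
    index = prompt.rfind("User Query:")
    if index == -1:
        raise ValueError("prompt has no 'User Query:' line")
    return prompt[:index] + PARALLEL_SETTING + prompt[index:]


def split_parallel_calls(call_text):
    """Split 'a(...); b(...)' into the source text of each individual call."""
    source = call_text.strip()
    statements = ast.parse(source).body
    return [ast.get_source_segment(source, statement) for statement in statements]


calls = split_parallel_calls(
    "get_weather_data(coordinates=(1.0, 2.0)); "
    "get_coordinates_from_city(city_name='Paris; France')"
)
print(calls)
```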
Quickstart
You can run the model on a GPU using the following code.
# Please `pip install transformers accelerate`
from transformers import pipeline

pipeline = pipeline(
    "text-generation",
    model="Nexusflow/NexusRaven-V2-13B",
    torch_dtype="auto",
    device_map="auto",
)

prompt_template = \
'''
Function:
def get_weather_data(coordinates):
    """
    Fetches weather data from the Open-Meteo API for the given latitude and longitude.
    Args:
    coordinates (tuple): The latitude of the location.
    Returns:
    float: The current temperature in the coordinates you've asked for
    """
Function:
def get_coordinates_from_city(city_name):
    """
    Fetches the latitude and longitude of a given city name using the Maps.co Geocoding API.
    Args:
    city_name (str): The name of the city.
    Returns:
    tuple: The latitude and longitude of the city.
    """
User Query: {query}<human_end>
'''

prompt = prompt_template.format(query="What's the weather like in Seattle right now?")
result = pipeline(prompt, max_new_tokens=2048, return_full_text=False, do_sample=False, temperature=0.001)[0]["generated_text"]
print(result)
This should generate the following:
Call: get_weather_data(coordinates=get_coordinates_from_city(city_name='Seattle'))<bot_end>
Thought: The function call `get_weather_data(coordinates=get_coordinates_from_city(city_name='Seattle'))` answers the question "What's the weather like in Seattle right now?" by following these steps:
1. `get_coordinates_from_city(city_name='Seattle')`: This function call fetches the latitude and longitude of the city "Seattle" using the Maps.co Geocoding API.
2. `get_weather_data(coordinates=...)`: This function call fetches the current weather data for the coordinates returned by the previous function call.
Therefore, the function call `get_weather_data(coordinates=get_coordinates_from_city(city_name='Seattle'))` answers the question "What's the weather like in Seattle right now?" by first fetching the coordinates of the city "Seattle" and then fetching the current weather data for those coordinates.
If you would like to prevent the generation of the explanation of the function call (for example, to save on inference tokens), please set a stopping criterion of <bot_end>.
Please follow this prompting template to maximize the performance of RavenV2.
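When the model is served behind an OpenAI-compatible completions endpoint (such as the local servers shown in the usage instructions above), the `<bot_end>` stopping criterion can be passed as a `stop` sequence in the request body. This is only a sketch: the URL and model name are taken from the curl example in this card, `build_completion_request` is a hypothetical helper, and whether your server honors `stop` should be checked against its own documentation.

```python
import json
import urllib.request


def build_completion_request(
    prompt,
    base_url="http://localhost:8000",  # assumed local server address
    model="QuantFactory/NexusRaven-V2-13B-GGUF",
):
    """Build a POST request that stops generation at <bot_end>."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": 512,
        "temperature": 0.001,  # very low temperature, as recommended above
        "stop": ["<bot_end>"],  # cut the completion before the explanation
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_completion_request("User Query: What's the weather like in Seattle?<human_end>")
# To actually send it (requires a running server):
# with urllib.request.urlopen(request) as response:
#     text = json.load(response)["choices"][0]["text"]
```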
Using with OpenAI FC Schematics
Using With LangChain
We've also included a small demo for using Raven with LangChain!
Evaluation
For a deeper dive into the results, please see our Github README.
Limitations
- The model works best when it is connected with a retriever when there are a multitude of functions, as a large number of functions will saturate the context window of this model.
- The model can be prone to generating incorrect calls. Please ensure proper guardrails are in place to capture errant behavior.
- The explanations generated by NexusRaven-V2 might be incorrect. Please ensure proper guardrails are present to capture errant behavior.
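One possible guardrail, sketched below, is to avoid `eval` on model output entirely and instead walk the parsed call with `ast`, allowing only functions from an explicit registry and plain literal arguments. This is an illustrative approach rather than a method prescribed by the NexusRaven authors; `evaluate_call` is a hypothetical helper, and the two registry functions are stand-in stubs with made-up return values.

```python
import ast


def evaluate_call(call_text, registry):
    """Evaluate a (possibly nested) call using only functions in `registry`."""

    def evaluate(node):
        if isinstance(node, ast.Call):
            # Only bare names from the registry may be called (no attributes).
            if not isinstance(node.func, ast.Name) or node.func.id not in registry:
                raise ValueError("call to unknown or disallowed function")
            if any(keyword.arg is None for keyword in node.keywords):
                raise ValueError("**kwargs unpacking is not allowed")
            args = [evaluate(arg) for arg in node.args]
            kwargs = {keyword.arg: evaluate(keyword.value) for keyword in node.keywords}
            return registry[node.func.id](*args, **kwargs)
        # Anything else must be a plain literal; literal_eval raises ValueError otherwise.
        return ast.literal_eval(node)

    return evaluate(ast.parse(call_text.strip(), mode="eval").body)


# Stand-in stubs with made-up values, for illustration only.
def get_coordinates_from_city(city_name):
    return {"Seattle": (47.6, -122.3)}[city_name]


def get_weather_data(coordinates):
    return 12.5 if coordinates == (47.6, -122.3) else 0.0


REGISTRY = {
    "get_coordinates_from_city": get_coordinates_from_city,
    "get_weather_data": get_weather_data,
}

result = evaluate_call(
    "get_weather_data(coordinates=get_coordinates_from_city(city_name='Seattle'))",
    REGISTRY,
)
print(result)
```

The sketch is deliberately conservative: a call nested inside a tuple or list argument is rejected, since only direct arguments are evaluated recursively.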
License
This model was trained on commercially viable data and is licensed under the Nexusflow community license.
Model References
We thank the CodeLlama team for their amazing models!
@misc{rozière2023code,
title={Code Llama: Open Foundation Models for Code},
author={Baptiste Rozière and Jonas Gehring and Fabian Gloeckle and Sten Sootla and Itai Gat and Xiaoqing Ellen Tan and Yossi Adi and Jingyu Liu and Tal Remez and Jérémy Rapin and Artyom Kozhevnikov and Ivan Evtimov and Joanna Bitton and Manish Bhatt and Cristian Canton Ferrer and Aaron Grattafiori and Wenhan Xiong and Alexandre Défossez and Jade Copet and Faisal Azhar and Hugo Touvron and Louis Martin and Nicolas Usunier and Thomas Scialom and Gabriel Synnaeve},
year={2023},
eprint={2308.12950},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Model Citation
@misc{nexusraven,
title={NexusRaven-V2: Surpassing GPT-4 for Zero-shot Function Calling},
author={Nexusflow.ai team},
year={2023},
url={https://nexusflow.ai/blogs/ravenv2}
}
Model Contact
Please join our Discord Channel to reach out for any issues and comments!
Model tree for QuantFactory/NexusRaven-V2-13B-GGUF
Base model
codellama/CodeLlama-13b-Instruct-hf

