Orbi-1B GGUF
Quantized GGUF version of Orbi-1B, a fine-tuned TinyLlama-1.1B-Chat specialized for function calling and robotic assistant interactions. This model generates structured tool calls in response to natural language commands and is optimized for CPU inference with llama.cpp.
Model Description
- Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
- Model Size: 1.1B parameters
- Format: GGUF (llama.cpp compatible)
- Quantization: Q4_K_M (4-bit quantization)
- Optimized for: CPU inference, low memory usage
- License: Apache 2.0
Why GGUF?
GGUF (GPT-Generated Unified Format) offers several advantages:
- Faster CPU Inference: Optimized for running on CPU without GPU
- Lower Memory Usage: 4-bit quantization reduces model size by ~70% (from ~2.2GB to ~650MB)
- Cross-Platform: Works on Windows, Linux, macOS (including Apple Silicon)
- No GPU Required: Perfect for edge devices and embedded systems
- Efficient: Powered by llama.cpp's optimized C++ inference engine
File Information
| File | Quant | Size | Use Case |
|---|---|---|---|
| orbi-1b-q4.gguf | Q4_K_M | ~650MB | Recommended - Best balance of speed and quality |
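After downloading the file, a quick sanity check can catch a truncated or mislabeled download before loading it. The sketch below relies only on the fact that a GGUF file starts with the 4-byte magic `GGUF` followed by a little-endian 32-bit format version; the helper name `read_gguf_header` is hypothetical and not part of llama.cpp or llama-cpp-python.

```python
import struct

def read_gguf_header(path):
    """Return the GGUF format version, or raise ValueError if the magic is wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # The 4-byte magic is followed by a little-endian uint32 format version.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

For example, `read_gguf_header("orbi-1b-q4.gguf")` should return a small integer for a valid file and raise `ValueError` for an HTML error page saved under the wrong name.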
Installation
Requirements
pip install llama-cpp-python
For GPU acceleration (optional):
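# CUDA (recent llama-cpp-python releases; older releases used -DLLAMA_CUBLAS=on)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
# Metal (macOS; older releases used -DLLAMA_METAL=on)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir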
Usage
Basic Inference
import json
import re
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="orbi-1b-q4.gguf",
    n_ctx=4096,
    n_threads=8,
    use_mlock=True,
    verbose=False
)

# System prompt
system_prompt = """You are Orbi's brain.
Respond with one or more <tool_call> JSON blocks, in the exact order the user requests actions.
that calls the best tool for the user's request. Do not write stories yourself.
Do not summarize news yourself. Map synonyms to the tool argument enums.
If parameters are missing, pick sensible defaults. Keep outputs terse.
Available tools and enums:
- smile() -> {}
- cry() -> {}
- move_hands(direction ∈ {left,right,up,down,wave}, speed ∈ {slow,normal,fast})
- dance(style ∈ {hiphop,ballet,robot,random}, duration_sec ∈ [10..120])
- tell_news(topic: string)
- tell_story(topic: string, tone ∈ {wholesome,funny,dramatic,spooky,random}, length ∈ {short,medium,long})
"""

# Build prompt
user_input = "Wave your hands quickly and smile"
prompt = f"<|system|>\n{system_prompt}\n<|user|>\n{user_input}\n<|assistant|>\n"

# Generate
output = llm(
    prompt,
    max_tokens=256,
    temperature=0.0,
)
response = output["choices"][0]["text"]
print(response)
Interactive Controller
Save this as orbi_controller.py:
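import json
import re
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="orbi-1b-q4.gguf",
    n_ctx=4096,
    n_threads=8,
)

# System prompt (same text as in Basic Inference)
system_prompt = """You are Orbi's brain.
Respond with one or more <tool_call> JSON blocks, in the exact order the user requests actions.
that calls the best tool for the user's request. Do not write stories yourself.
Do not summarize news yourself. Map synonyms to the tool argument enums.
If parameters are missing, pick sensible defaults. Keep outputs terse.
Available tools and enums:
- smile() -> {}
- cry() -> {}
- move_hands(direction ∈ {left,right,up,down,wave}, speed ∈ {slow,normal,fast})
- dance(style ∈ {hiphop,ballet,robot,random}, duration_sec ∈ [10..120])
- tell_news(topic: string)
- tell_story(topic: string, tone ∈ {wholesome,funny,dramatic,spooky,random}, length ∈ {short,medium,long})
"""

def parse_tool_calls(text):
    """Return every <tool_call> JSON block found in the model output, in order."""
    pattern = r"<tool_call>\s*(\{.*?\})\s*</tool_call>"
    matches = re.findall(pattern, text, re.DOTALL)
    tools = []
    for m in matches:
        try:
            tools.append(json.loads(m))
        except json.JSONDecodeError:
            # Skip blocks that are not valid JSON
            continue
    return tools

# Interactive loop
print("🤖 Orbi is ready! Type 'exit' to quit.\n")
while True:
    user_input = input("You: ").strip()
    if user_input.lower() in {"exit", "quit"}:
        break
    prompt = f"<|system|>\n{system_prompt}\n<|user|>\n{user_input}\n<|assistant|>\n"
    output = llm(prompt, max_tokens=256, temperature=0.0)
    response = output["choices"][0]["text"]
    tools = parse_tool_calls(response)
    print(f"Orbi: {json.dumps(tools, indent=2)}\n")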
Expected Output Format
<tool_call>
{"name": "move_hands", "arguments": {"direction": "wave", "speed": "fast"}}
</tool_call>
<tool_call>
{"name": "smile", "arguments": {}}
</tool_call>
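Once the blocks are parsed, each call still has to be routed to code that drives the robot. A minimal dispatch sketch follows; the handler bodies and the `HANDLERS` table are hypothetical placeholders, not part of this model or of llama-cpp-python, and real handlers would talk to your hardware.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

# Hypothetical handlers: replace the bodies with real robot-control code.
def smile():
    return "smiling"

def move_hands(direction="wave", speed="normal"):
    return f"moving hands {direction} at {speed} speed"

HANDLERS = {"smile": smile, "move_hands": move_hands}

def dispatch(response_text):
    """Run each parsed tool call in order and collect the handler results."""
    results = []
    for raw in TOOL_CALL_RE.findall(response_text):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks
        handler = HANDLERS.get(call.get("name"))
        if handler is None:
            results.append(f"unknown tool: {call.get('name')}")
            continue
        results.append(handler(**call.get("arguments", {})))
    return results
```

Calls are executed in the order the model emitted them, which matches the system prompt's instruction to follow the order of the user's request. A handler that receives an unexpected argument name will raise `TypeError`, so production code would likely validate arguments first.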
Performance Benchmarks
Approximate inference speeds on different hardware:
| Hardware | Tokens/sec | Memory Usage |
|---|---|---|
| M1 MacBook Pro | ~45 t/s | 800MB |
| Intel i7-12700K | ~35 t/s | 750MB |
| Raspberry Pi 5 | ~8 t/s | 700MB |
| AMD Ryzen 7 5800X | ~40 t/s | 750MB |
Note: Actual performance may vary based on context length and system configuration.
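To compare your own hardware against the table, one rough approach is to time a single completion and divide by the number of generated tokens. The sketch below assumes the completion result carries a `usage` dictionary with a `completion_tokens` count, as llama-cpp-python's OpenAI-style output does; `tokens_per_second` is a hypothetical helper name. The measurement includes prompt processing time, so it slightly understates pure generation speed.

```python
import time

def tokens_per_second(generate, prompt, **kwargs):
    """Time one completion and return generated tokens per second.

    `generate` is any callable with the llama-cpp-python completion
    interface, for example a loaded `Llama` instance.
    """
    start = time.perf_counter()
    output = generate(prompt, **kwargs)
    elapsed = time.perf_counter() - start
    n_tokens = output["usage"]["completion_tokens"]
    return n_tokens / elapsed if elapsed > 0 else float("inf")
```

With a loaded model this would be called as `tokens_per_second(llm, prompt, max_tokens=256, temperature=0.0)`.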
Supported Tools
- Physical Actions: smile(), cry(), move_hands(), dance()
- Content Generation: tell_news(), tell_story()
- Information: whats_your_name(), who_am_i()
- Utilities: answer_arithmetic(), english_learning()
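Because a small model can occasionally emit an argument outside the allowed values, it may be worth validating each call before acting on it. The sketch below covers only the tools whose argument enums appear in the system prompt; the `ENUMS` table and `validate_call` helper are hypothetical names, and tools whose signatures are not listed in this card pass through unchecked.

```python
# Allowed values copied from the system prompt in this card.
ENUMS = {
    "move_hands": {
        "direction": ("left", "right", "up", "down", "wave"),
        "speed": ("slow", "normal", "fast"),
    },
    "dance": {"style": ("hiphop", "ballet", "robot", "random")},
    "tell_story": {
        "tone": ("wholesome", "funny", "dramatic", "spooky", "random"),
        "length": ("short", "medium", "long"),
    },
}

def validate_call(call):
    """Return a list of problems found in one parsed tool call (empty if none)."""
    name = call.get("name")
    args = call.get("arguments", {})
    problems = []
    for key, allowed in ENUMS.get(name, {}).items():
        if key in args and args[key] not in allowed:
            problems.append(f"{name}.{key}={args[key]!r} is not one of {list(allowed)}")
    # dance.duration_sec is a numeric range rather than an enum.
    if name == "dance" and "duration_sec" in args:
        duration = args["duration_sec"]
        if not isinstance(duration, int) or not 10 <= duration <= 120:
            problems.append(f"dance.duration_sec={duration!r} is outside 10..120")
    return problems
```

An empty list means the call is safe to execute; otherwise the caller can fall back to a default or ask the user to rephrase.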
Configuration Options
llama-cpp-python Parameters
llm = Llama(
    model_path="orbi-1b-q4.gguf",
    n_ctx=4096,        # Context window size
    n_threads=8,       # CPU threads (adjust based on your CPU)
    n_gpu_layers=0,    # Set > 0 for GPU offloading
    use_mlock=True,    # Lock model in RAM (prevents swapping)
    verbose=False,     # Disable verbose logging
    seed=42,           # Set seed for reproducibility
)
Generation Parameters
output = llm(
    prompt,
    max_tokens=256,         # Maximum tokens to generate
    temperature=0.0,        # 0.0 = greedy (recommended for tool calling)
    top_p=0.95,             # Nucleus sampling
    repeat_penalty=1.1,     # Penalize repetition
    stop=["</tool_call>"],  # Stop sequences (ends output after the first tool call; the stop text itself is not returned)
)
Use Cases
- Robotics: Control physical robots with natural language
- IoT Devices: Run on Raspberry Pi or similar edge devices
- Embedded Systems: Low-memory environments
- Offline Applications: No internet connection required
- Desktop Assistants: CPU-only machines without GPU
Limitations
- Quantization may result in slight quality degradation compared to full precision
- Best performance with greedy decoding (temperature=0.0)
- Limited to the predefined set of tools
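- The examples set n_ctx=4096, but the TinyLlama base model was trained with a 2048-token context, so output quality may degrade on prompts longer than 2048 tokens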
Model Details
Training
- Method: LoRA fine-tuning on TinyLlama-1.1B-Chat
- Dataset: Custom conversational dataset with tool calling examples
- Framework: Transformers + PEFT + TRL
Quantization
- Method: Q4_K_M quantization via llama.cpp
- Benefits: ~70% size reduction with minimal quality loss
- Original Size: ~2.2GB → GGUF Size: ~650MB
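The size figures can be checked with quick arithmetic. The sketch below uses only the approximate numbers quoted in this card and assumes decimal units (1 GB = 10^9 bytes); the variable names are illustrative.

```python
ORIGINAL_BYTES = 2.2e9   # ~2.2GB full-precision checkpoint (from this card)
GGUF_BYTES = 650e6       # ~650MB Q4_K_M file (from this card)
N_PARAMS = 1.1e9         # 1.1B parameters (from this card)

size_reduction = 1 - GGUF_BYTES / ORIGINAL_BYTES
bits_per_weight_original = ORIGINAL_BYTES * 8 / N_PARAMS
bits_per_weight_gguf = GGUF_BYTES * 8 / N_PARAMS

print(f"size reduction: {size_reduction:.0%}")
# size reduction: 70%
print(f"bits per weight: {bits_per_weight_original:.1f} -> {bits_per_weight_gguf:.1f}")
# bits per weight: 16.0 -> 4.7
```

Going from 16-bit to exactly 4-bit weights would give a 75% reduction; the measured figure is a little lower because Q4_K_M stores somewhat more than 4 bits per weight on average.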
Troubleshooting
Model loads slowly
- Enable use_mlock=True to keep the model in RAM
- Increase n_threads based on your CPU cores
Out of memory
- Reduce n_ctx (context window size)
- Close other applications
- Use a lower-bit quantization (Q2_K or Q3_K_M); this repository ships only the Q4_K_M file, so these would have to be produced separately
Slow inference
- Increase n_threads to match your CPU cores
- Enable GPU offloading with n_gpu_layers
- Reduce max_tokens if generating long responses
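Rather than hard-coding n_threads=8, the thread count can be derived from the machine at startup. The sketch below is one possible heuristic, not a llama.cpp recommendation: `suggest_n_threads` is a hypothetical helper that takes the logical CPU count and leaves one core free for the rest of the system.

```python
import os

def suggest_n_threads(reserve=1):
    """Suggest a thread count: logical CPUs minus a small reserve, at least 1."""
    logical = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, logical - reserve)
```

The result can be passed as `Llama(..., n_threads=suggest_n_threads())`. Since `os.cpu_count()` reports logical rather than physical cores, treat the value as a starting point and benchmark a few settings on your own hardware.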
License
Apache 2.0 (inherited from TinyLlama base model)
Citation
@misc{orbi-1b-gguf,
title={Orbi-1B GGUF: Quantized Function Calling Model},
author={Arojit Ghosh},
year={2025},
howpublished={\url{https://huggingface.co/Arojit/orbi-1b-gguf}}
}
Related Models
- Full Precision: Arojit/orbi-1b
- Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
Acknowledgments
- Built on TinyLlama
- Quantized using llama.cpp
- Powered by llama-cpp-python