Instructions for using Arojit/orbi-1b-gguf with libraries and local apps.

Libraries

How to use Arojit/orbi-1b-gguf with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Arojit/orbi-1b-gguf",
	filename="orbi-1b-q4.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
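The call above returns an OpenAI-style completion dictionary rather than a plain string. A minimal sketch of pulling the reply text out of it, assuming the usual `choices[0]["message"]["content"]` layout; the helper name `reply_text` and the sample dictionary are illustrative and not part of the model card:

```python
def reply_text(completion: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = completion.get("choices") or []
    if not choices:
        return ""
    message = choices[0].get("message") or {}
    return message.get("content") or ""


# Hand-written stand-in for the value returned by llm.create_chat_completion(...).
sample = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Paris."}}
    ]
}
print(reply_text(sample))
```

Returning an empty string for a missing field keeps the caller free of `KeyError` handling; raise instead if a missing reply should be treated as a failure.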

Local Apps

llama.cpp

How to use Arojit/orbi-1b-gguf with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Arojit/orbi-1b-gguf
# Run inference directly in the terminal:
llama-cli -hf Arojit/orbi-1b-gguf

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Arojit/orbi-1b-gguf
# Run inference directly in the terminal:
llama-cli -hf Arojit/orbi-1b-gguf

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Arojit/orbi-1b-gguf
# Run inference directly in the terminal:
./llama-cli -hf Arojit/orbi-1b-gguf

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Arojit/orbi-1b-gguf
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Arojit/orbi-1b-gguf

Use Docker

docker model run hf.co/Arojit/orbi-1b-gguf

vLLM

How to use Arojit/orbi-1b-gguf with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Arojit/orbi-1b-gguf"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Arojit/orbi-1b-gguf",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
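The same request can be built from Python with only the standard library. This is a sketch under the assumption that the server from the previous step is listening on `localhost:8000`; the function name `build_chat_request` is hypothetical, and the network call is left commented out so the snippet runs without a server:

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    base_url: str = "http://localhost:8000",
    model: str = "Arojit/orbi-1b-gguf",
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("What is the capital of France?")
print(req.get_method(), req.full_url)

# To send it (requires the server to be running):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```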

Ollama
How to use Arojit/orbi-1b-gguf with Ollama:
```
ollama run hf.co/Arojit/orbi-1b-gguf
```

Unsloth Studio

How to use Arojit/orbi-1b-gguf with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Arojit/orbi-1b-gguf to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Arojit/orbi-1b-gguf to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Arojit/orbi-1b-gguf to start chatting

Docker Model Runner
How to use Arojit/orbi-1b-gguf with Docker Model Runner:
```
docker model run hf.co/Arojit/orbi-1b-gguf
```

Lemonade

How to use Arojit/orbi-1b-gguf with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Arojit/orbi-1b-gguf

Run and chat with the model

lemonade run user.orbi-1b-gguf-{{QUANT_TAG}}

List all available models

lemonade list

Orbi-1B GGUF

Quantized GGUF version of Orbi-1B, a fine-tuned TinyLlama-1.1B-Chat specialized for function calling and robotic assistant interactions. This model generates structured tool calls in response to natural language commands and is optimized for CPU inference with llama.cpp.

Model Description

Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
Model Size: 1.1B parameters
Format: GGUF (llama.cpp compatible)
Quantization: Q4_K_M (4-bit quantization)
Optimized for: CPU inference, low memory usage
License: Apache 2.0

Why GGUF?

GGUF (GPT-Generated Unified Format) offers several advantages:

Faster CPU Inference: Optimized for running on CPU without GPU
Lower Memory Usage: 4-bit quantization reduces model size by ~70% relative to FP16 (~2.2GB → ~650MB)
Cross-Platform: Works on Windows, Linux, macOS (including Apple Silicon)
No GPU Required: Perfect for edge devices and embedded systems
Efficient: Powered by llama.cpp's optimized C++ inference engine

File Information

| File | Quantization | Size | Use Case |
| --- | --- | --- | --- |
| orbi-1b-q4.gguf | Q4_K_M | ~650MB | Recommended: best balance of speed and quality |

Installation

Requirements

pip install llama-cpp-python

For GPU acceleration (optional):

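# CUDA (recent llama-cpp-python releases; older releases used -DLLAMA_CUBLAS=on)
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

# Metal (macOS; older releases used -DLLAMA_METAL=on)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir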

Usage

Basic Inference

from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="orbi-1b-q4.gguf",
    n_ctx=4096,
    n_threads=8,
    use_mlock=True,
    verbose=False
)

# System prompt
system_prompt = """You are Orbi's brain.
Respond with one or more <tool_call> JSON blocks, in the exact order the user requests actions.
that calls the best tool for the user's request. Do not write stories yourself.
Do not summarize news yourself. Map synonyms to the tool argument enums.
If parameters are missing, pick sensible defaults. Keep outputs terse.

Available tools and enums:
- smile() -> {}
- cry() -> {}
- move_hands(direction ∈ {left,right,up,down,wave}, speed ∈ {slow,normal,fast})
- dance(style ∈ {hiphop,ballet,robot,random}, duration_sec ∈ [10..120])
- tell_news(topic: string)
- tell_story(topic: string, tone ∈ {wholesome,funny,dramatic,spooky,random}, length ∈ {short,medium,long})
"""

# Build prompt
user_input = "Wave your hands quickly and smile"
prompt = f"<|system|>\n{system_prompt}\n<|user|>\n{user_input}\n<|assistant|>\n"

# Generate
output = llm(
    prompt,
    max_tokens=256,
    temperature=0.0,
)

response = output["choices"][0]["text"]
print(response)
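Because the same `<|system|>` / `<|user|>` / `<|assistant|>` template is needed wherever the model is called, it can be factored into a small helper. A minimal sketch that reproduces the f-string above exactly; the name `build_prompt` is an illustrative choice, not something the model card defines:

```python
def build_prompt(system_prompt: str, user_input: str) -> str:
    """Format one system + user turn in the layout used above."""
    return (
        f"<|system|>\n{system_prompt}\n"
        f"<|user|>\n{user_input}\n"
        f"<|assistant|>\n"
    )


prompt = build_prompt("You are Orbi's brain.", "Wave your hands quickly and smile")
print(prompt)
```

Keeping the template in one place avoids small whitespace differences between call sites, which can matter for a model fine-tuned on a fixed prompt layout.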

Interactive Controller

Save this as orbi_controller.py:

import json
import re
from llama_cpp import Llama

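# Load model
llm = Llama(
    model_path="orbi-1b-q4.gguf",
    n_ctx=4096,
    n_threads=8,
)

# System prompt (same text as in Basic Inference; it must be defined in this file too)
system_prompt = """You are Orbi's brain.
Respond with one or more <tool_call> JSON blocks, in the exact order the user requests actions.
that calls the best tool for the user's request. Do not write stories yourself.
Do not summarize news yourself. Map synonyms to the tool argument enums.
If parameters are missing, pick sensible defaults. Keep outputs terse.

Available tools and enums:
- smile() -> {}
- cry() -> {}
- move_hands(direction ∈ {left,right,up,down,wave}, speed ∈ {slow,normal,fast})
- dance(style ∈ {hiphop,ballet,robot,random}, duration_sec ∈ [10..120])
- tell_news(topic: string)
- tell_story(topic: string, tone ∈ {wholesome,funny,dramatic,spooky,random}, length ∈ {short,medium,long})
"""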

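def parse_tool_calls(text):
    """Extract every <tool_call>{...}</tool_call> block as a parsed dict."""
    pattern = r"<tool_call>\s*(\{.*?\})\s*</tool_call>"
    matches = re.findall(pattern, text, re.DOTALL)
    tools = []
    for m in matches:
        try:
            tools.append(json.loads(m))
        except json.JSONDecodeError:
            # Skip blocks that are not valid JSON
            continue
    return tools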

# Interactive loop
print("🤖 Orbi is ready! Type 'exit' to quit.\n")
while True:
    user_input = input("You: ").strip()
    if user_input.lower() in {"exit", "quit"}:
        break
    
    prompt = f"<|system|>\n{system_prompt}\n<|user|>\n{user_input}\n<|assistant|>\n"
    output = llm(prompt, max_tokens=256, temperature=0.0)
    response = output["choices"][0]["text"]
    
    tools = parse_tool_calls(response)
    print(f"Orbi: {json.dumps(tools, indent=2)}\n")
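The controller above only prints the parsed calls. One way to act on them is a name-to-function registry. The sketch below is an assumption about how a robot backend might be wired up: the handler bodies, `HANDLERS`, and `dispatch` are hypothetical stand-ins, and only two of the tools are shown:

```python
from typing import Callable


def smile() -> str:
    # Placeholder: a real handler would drive the robot's face.
    return "smiling"


def move_hands(direction: str = "wave", speed: str = "normal") -> str:
    # Placeholder: a real handler would drive the arm motors.
    return f"moving hands {direction} at {speed} speed"


# Hypothetical registry mapping tool names to Python callables.
HANDLERS: dict[str, Callable[..., str]] = {
    "smile": smile,
    "move_hands": move_hands,
}


def dispatch(tools: list[dict]) -> list[str]:
    """Run each parsed tool call in order and collect the results."""
    results = []
    for call in tools:
        name = call.get("name")
        handler = HANDLERS.get(name)
        if handler is None:
            results.append(f"unknown tool: {name}")
            continue
        try:
            results.append(handler(**call.get("arguments", {})))
        except TypeError as exc:
            # Raised when the model emits an argument the handler does not accept.
            results.append(f"bad arguments for {name}: {exc}")
    return results


calls = [
    {"name": "move_hands", "arguments": {"direction": "wave", "speed": "fast"}},
    {"name": "smile", "arguments": {}},
]
for line in dispatch(calls):
    print(line)
```

Calls are executed in list order, which matches the system prompt's requirement that tool calls follow the order of the user's request.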

Expected Output Format

<tool_call>
{"name": "move_hands", "arguments": {"direction": "wave", "speed": "fast"}}
</tool_call>
<tool_call>
{"name": "smile", "arguments": {}}
</tool_call>

Performance Benchmarks

Approximate inference speeds on different hardware:

| Hardware | Speed (tokens/sec) | Memory Usage (MB) |
| --- | --- | --- |
| M1 MacBook Pro | ~45 | 800 |
| Intel i7-12700K | ~35 | 750 |
| Raspberry Pi 5 | ~8 | 700 |
| AMD Ryzen 7 5800X | ~40 | 750 |

Note: Actual performance may vary based on context length and system configuration.
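These throughput figures translate into a rough upper bound on response time when generation runs to the `max_tokens=256` limit used in the examples. The sketch below is plain arithmetic on the approximate numbers from the table; it ignores prompt processing time, and the function name `worst_case_seconds` is illustrative:

```python
# Approximate tokens/sec values copied from the benchmark table.
THROUGHPUT_TPS = {
    "M1 MacBook Pro": 45,
    "Intel i7-12700K": 35,
    "Raspberry Pi 5": 8,
    "AMD Ryzen 7 5800X": 40,
}


def worst_case_seconds(tokens_per_sec: float, max_tokens: int = 256) -> float:
    """Time to generate max_tokens at a steady tokens_per_sec rate."""
    return max_tokens / tokens_per_sec


for hardware, tps in THROUGHPUT_TPS.items():
    print(f"{hardware}: up to {worst_case_seconds(tps):.1f} s for 256 tokens")
```

A typical two-call response is far shorter than 256 tokens, so real latency should usually sit well below these bounds.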

Supported Tools

Physical Actions: smile(), cry(), move_hands(), dance()
Content Generation: tell_news(), tell_story()
Information: whats_your_name(), who_am_i()
Utilities: answer_arithmetic(), english_learning()
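Before executing a call, its arguments can be checked against the enums and the duration range listed in the system prompt. A minimal sketch: the `ENUMS` table is transcribed from that prompt, while `validate_call` and the error message format are illustrative assumptions:

```python
# Allowed values, transcribed from the "Available tools and enums" prompt section.
ENUMS = {
    "move_hands": {
        "direction": {"left", "right", "up", "down", "wave"},
        "speed": {"slow", "normal", "fast"},
    },
    "dance": {"style": {"hiphop", "ballet", "robot", "random"}},
    "tell_story": {
        "tone": {"wholesome", "funny", "dramatic", "spooky", "random"},
        "length": {"short", "medium", "long"},
    },
}
DURATION_RANGE = (10, 120)  # dance(duration_sec ∈ [10..120])


def validate_call(call: dict) -> list[str]:
    """Return a list of problems found in one parsed tool call (empty if none)."""
    errors = []
    name = call.get("name")
    args = call.get("arguments", {})
    for param, allowed in ENUMS.get(name, {}).items():
        if param in args and args[param] not in allowed:
            errors.append(f"{name}.{param}: {args[param]!r} is not one of {sorted(allowed)}")
    if name == "dance" and "duration_sec" in args:
        low, high = DURATION_RANGE
        value = args["duration_sec"]
        if not isinstance(value, int) or not low <= value <= high:
            errors.append(f"dance.duration_sec: {value!r} is outside [{low}..{high}]")
    return errors


good = {"name": "move_hands", "arguments": {"direction": "wave", "speed": "fast"}}
bad = {"name": "dance", "arguments": {"style": "tango", "duration_sec": 5}}
print(validate_call(good))
print(validate_call(bad))
```

Tools without entries in `ENUMS`, such as `smile` or `tell_news`, pass through unchecked in this sketch.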

Configuration Options

llama-cpp-python Parameters

llm = Llama(
    model_path="orbi-1b-q4.gguf",
    n_ctx=4096,              # Context window size
    n_threads=8,             # CPU threads (adjust based on your CPU)
    n_gpu_layers=0,          # Set > 0 for GPU offloading
    use_mlock=True,          # Lock model in RAM (prevents swapping)
    verbose=False,           # Disable verbose logging
    seed=42,                 # Set seed for reproducibility
)
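Rather than hard-coding `n_threads=8`, the value can be derived from the machine at startup. A small sketch using `os.cpu_count()`, which reports logical CPUs (often twice the physical core count on CPUs with SMT); the helper name `pick_n_threads` and the one-core reserve are assumptions, not recommendations from the model card:

```python
import os


def pick_n_threads(reserve: int = 1) -> int:
    """Use all logical CPUs except `reserve`, but never fewer than one."""
    logical = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, logical - reserve)


n_threads = pick_n_threads()
print(f"n_threads={n_threads}")
# Pass the result to the constructor: Llama(model_path=..., n_threads=n_threads)
```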

Generation Parameters

output = llm(
    prompt,
    max_tokens=256,          # Maximum tokens to generate
    temperature=0.0,         # 0.0 = greedy (recommended for tool calling)
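    top_p=0.95,              # Nucleus sampling (no effect under greedy decoding)
    repeat_penalty=1.1,      # Penalize repetition
    stop=["</tool_call>"],   # Stop sequence: generation ends at the first closing tag, which is not included in the output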
)
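With `stop=["</tool_call>"]`, the returned text ends just before the first closing tag, so a parser that expects a complete `<tool_call>...</tool_call>` pair finds nothing, and only the first call is generated. A sketch of one workaround, assuming the stop sequence is stripped from the output: put the tag back before parsing. The function name `parse_stopped_output` is illustrative:

```python
import json
import re

STOP = "</tool_call>"
PATTERN = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_stopped_output(text: str) -> list[dict]:
    """Parse tool calls from text that may have lost its final closing tag."""
    if "<tool_call>" in text and not text.rstrip().endswith(STOP):
        # Restore the tag that the stop sequence removed.
        text = text.rstrip() + "\n" + STOP
    calls = []
    for raw in PATTERN.findall(text):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            continue
    return calls


truncated = '<tool_call>\n{"name": "smile", "arguments": {}}\n'
print(parse_stopped_output(truncated))
```

For requests that need several tool calls in one response, leaving `stop` unset and relying on the end-of-sequence token avoids the truncation altogether.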

Use Cases

Robotics: Control physical robots with natural language
IoT Devices: Run on Raspberry Pi or similar edge devices
Embedded Systems: Low-memory environments
Offline Applications: No internet connection required
Desktop Assistants: CPU-only machines without GPU

Limitations

Quantization may result in slight quality degradation compared to full precision
Best performance with greedy decoding (temperature=0.0)
Limited to the predefined set of tools
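Examples use a 4096-token context window (n_ctx=4096); the TinyLlama base model was trained with a 2048-token context, so quality beyond 2048 tokens is not guaranteed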

Model Details

Training

Method: LoRA fine-tuning on TinyLlama-1.1B-Chat
Dataset: Custom conversational dataset with tool calling examples
Framework: Transformers + PEFT + TRL

Quantization

Method: Q4_K_M quantization via llama.cpp
Benefits: ~70% size reduction relative to FP16 with minimal quality loss
Original Size: ~2.2GB → GGUF Size: ~650MB
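The size figures can be sanity-checked with a few lines of arithmetic. The sketch below assumes decimal units (1 GB = 10^9 bytes) and takes the parameter count and the two approximate sizes from this card, so the results are estimates only:

```python
PARAMS = 1.1e9        # parameter count stated above
FP16_BYTES = 2.2e9    # ~2.2GB original size (about 2 bytes per parameter)
GGUF_BYTES = 650e6    # ~650MB Q4_K_M file

reduction = 1 - GGUF_BYTES / FP16_BYTES
bits_per_weight = GGUF_BYTES * 8 / PARAMS

print(f"size reduction: {reduction:.0%}")
print(f"effective bits per weight: {bits_per_weight:.1f}")
```

The effective figure comes out above 4 bits per weight because Q4_K_M stores per-block scales alongside the 4-bit values and keeps some tensors at higher precision.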

Troubleshooting

Model loads slowly

Enable use_mlock=True to keep model in RAM
Increase n_threads based on your CPU cores

Out of memory

Reduce n_ctx (context window size)
Close other applications
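Use a lower-bit quantization (Q2_K or Q3_K_M); this repository provides only Q4_K_M, so that means quantizing the full-precision model (Arojit/orbi-1b) yourself with llama.cpp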
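To see how much memory reducing `n_ctx` actually frees, the key/value cache size can be estimated from the model shape. The layer count, KV head count, and head dimension below are assumptions taken from the published TinyLlama-1.1B configuration, not from this card, and the estimate assumes a 16-bit cache:

```python
# Assumed TinyLlama-1.1B shape (not stated in this model card).
N_LAYERS = 22
N_KV_HEADS = 4        # grouped-query attention
HEAD_DIM = 64
BYTES_PER_VALUE = 2   # 16-bit cache entries


def kv_cache_bytes(n_ctx: int) -> int:
    """Bytes needed to hold keys and values for n_ctx tokens across all layers."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # 2 = K and V
    return per_token * n_ctx


for n_ctx in (512, 2048, 4096):
    print(f"n_ctx={n_ctx}: ~{kv_cache_bytes(n_ctx) / 2**20:.0f} MiB")
```

Under these assumptions the cache is small next to the ~650MB of weights, so lowering `n_ctx` helps mainly on very constrained devices.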

Slow inference

Increase n_threads to match your CPU cores
Enable GPU offloading with n_gpu_layers
Reduce max_tokens if generating long responses

License

Apache 2.0 (inherited from TinyLlama base model)

Citation

@misc{orbi-1b-gguf,
  title={Orbi-1B GGUF: Quantized Function Calling Model},
  author={Arojit Ghosh},
  year={2025},
  howpublished={\url{https://huggingface.co/Arojit/orbi-1b-gguf}}
}

Related Models

Full Precision: Arojit/orbi-1b
Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0

Acknowledgments

Built on TinyLlama
Quantized using llama.cpp
Powered by llama-cpp-python
