Instructions for using Madras1/Jade8b-GGUF with libraries, inference servers, and local apps.

Libraries

How to use Madras1/Jade8b-GGUF with Transformers:

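# Load the GGUF checkpoint directly (Transformers dequantizes it on load).
# This needs the gguf package and a Transformers release with Qwen3 GGUF support.
# gguf_file selects one file in the repo; this name matches the llama-cpp-python example below.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

repo_id = "Madras1/Jade8b-GGUF"
gguf_file = "jade8b-q2_k.gguf"

tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file, dtype="auto")

# Use a pipeline as a high-level helper
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)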

llama-cpp-python

How to use Madras1/Jade8b-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Madras1/Jade8b-GGUF",
	filename="jade8b-q2_k.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
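
`create_chat_completion` returns an OpenAI-style response dictionary rather than a plain string. The sketch below shows one way to pull out the reply text. The helper name `reply_text` and the hand-written `sample` dictionary are illustrative and not part of llama-cpp-python.

```python
def reply_text(completion: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = completion.get("choices") or []
    if not choices:
        raise ValueError("completion contains no choices")
    return choices[0]["message"]["content"]


# Hand-written stand-in for the dict that llm.create_chat_completion(...) returns.
sample = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Paris."},
            "finish_reason": "stop",
        }
    ]
}
print(reply_text(sample))
```

With a loaded model, the same helper can be applied to the return value of the `llm.create_chat_completion(...)` call above.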

Local Apps

llama.cpp

How to use Madras1/Jade8b-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf Madras1/Jade8b-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf Madras1/Jade8b-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf Madras1/Jade8b-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf Madras1/Jade8b-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Madras1/Jade8b-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf Madras1/Jade8b-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Madras1/Jade8b-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Madras1/Jade8b-GGUF:Q4_K_M

Use Docker

docker model run hf.co/Madras1/Jade8b-GGUF:Q4_K_M

vLLM

How to use Madras1/Jade8b-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Madras1/Jade8b-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Madras1/Jade8b-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

SGLang

How to use Madras1/Jade8b-GGUF with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Madras1/Jade8b-GGUF" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Madras1/Jade8b-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Madras1/Jade8b-GGUF" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Madras1/Jade8b-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
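
The curl calls above can also be made from Python. The following is a minimal sketch using only the standard library, assuming the server exposes the same OpenAI-compatible `/v1/chat/completions` route shown in the curl examples. The function names `build_chat_request` and `chat` are hypothetical helpers, not part of vLLM or SGLang.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_text: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, user_text: str) -> str:
    """Send the request and return the assistant text (needs a running server)."""
    request = build_chat_request(base_url, model, user_text)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Port 30000 matches the SGLang commands above; the vLLM example uses port 8000.
request = build_chat_request(
    "http://localhost:30000", "Madras1/Jade8b-GGUF", "oi jade, tudo bem?"
)
print(request.full_url)
```

Only `build_chat_request` runs here, so the snippet works without a server. Calling `chat(...)` with the same arguments sends the request once a server is listening.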

Ollama

How to use Madras1/Jade8b-GGUF with Ollama:

```
ollama run hf.co/Madras1/Jade8b-GGUF:Q4_K_M
```

Unsloth Studio

How to use Madras1/Jade8b-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Madras1/Jade8b-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Madras1/Jade8b-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Madras1/Jade8b-GGUF to start chatting

Docker Model Runner

How to use Madras1/Jade8b-GGUF with Docker Model Runner:

```
docker model run hf.co/Madras1/Jade8b-GGUF:Q4_K_M
```

Lemonade

How to use Madras1/Jade8b-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Madras1/Jade8b-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Jade8b-GGUF-Q4_K_M

List all available models

lemonade list

Jade8b

Jade8b is a Brazilian Portuguese conversational finetune of Qwen3-8B built to express a strong, persistent persona. It is designed for PT-BR chat, chatbots, and character-style interaction, and it writes with colloquial language, abbreviations, slang, and a WhatsApp-like tone.

Model Summary

Jade8b is a persona-first model. It was intentionally finetuned to speak like Jade even without a strong system prompt. As a result, it often answers in PT-BR with abbreviations such as vc (short for "você"), slang, and a friendly conversational tone from the very first turn.

Model Details

Developed by: Madras1
Base model: unsloth/qwen3-8b-bnb-4bit
Model type: conversational text-generation finetune
Primary language: Brazilian Portuguese (pt-BR)
License: apache-2.0

Intended Behavior

This model was trained to:

speak naturally in Brazilian Portuguese
maintain a consistent Jade persona
sound informal, friendly, and chat-oriented
work well in casual assistant and conversational use cases

Typical behavior includes:

abbreviations like vc
light slang and colloquial wording
short expressions such as tmj, mano, tlgd
a more human and less robotic tone

If Jade already sounds like a recurring character during inference, that is expected behavior, not an error.

Training Intent

The finetune objective was to make the persona live in the weights, not only in prompting.

High-level training approach:

synthetic PT-BR prompt generation for chat-like situations
persona-driven response distillation
supervised finetuning on conversational data
removal of system persona instructions during SFT so the model directly internalizes the Jade style

This is why the model can already answer with personality, abbreviations, and slang even with a simple user-only prompt.
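
The system-prompt removal step described above could be sketched as follows. This is an illustration only: the card does not show the actual preprocessing code, the data layout (a list of role/content dictionaries) is an assumption, and the persona text in the example is invented.

```python
from typing import Dict, List

Message = Dict[str, str]


def strip_system_messages(conversation: List[Message]) -> List[Message]:
    """Return a copy of the conversation without any system-role turns."""
    return [turn for turn in conversation if turn.get("role") != "system"]


# Hypothetical training example; the system text is invented for illustration.
example = [
    {"role": "system", "content": "Você é a Jade, fala de um jeito informal."},
    {"role": "user", "content": "oi jade, tudo bem?"},
    {"role": "assistant", "content": "oii, tudo sim e vc?"},
]
cleaned = strip_system_messages(example)
print([turn["role"] for turn in cleaned])
```

Training on the cleaned turns means the persona-styled assistant reply is paired only with the user message, which is the effect the list above describes.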

Training Setup

High-level setup used for this finetune:

around 25,000 examples
3 epochs
Unsloth-based SFT pipeline
chat-style data in Portuguese
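
To get a feel for the scale of this setup, the arithmetic below turns the example count and epoch count into a number of optimizer steps. The batch size is an assumption chosen for illustration, because the card does not state one, and `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_examples: int, epochs: int, effective_batch_size: int) -> int:
    """Total optimizer steps when each epoch's last partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / effective_batch_size)
    return steps_per_epoch * epochs


num_examples = 25_000  # "around 25,000 examples" from the list above
epochs = 3             # from the list above
batch_size = 16        # ASSUMPTION: the card does not state a batch size

print(num_examples * epochs)  # example passes across all epochs: 75000
print(optimizer_steps(num_examples, epochs, batch_size))  # 4689
```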

Recommended Use

Best fit:

PT-BR chat assistants
persona bots
WhatsApp-style conversational agents
lightweight entertainment or social AI experiences

Less ideal for:

formal writing
highly neutral assistant behavior
high-stakes legal, medical, or financial contexts

Prompting Tips

For the strongest Jade behavior:

use a simple user message
avoid a formal system prompt that fights the finetune
keep prompts conversational when possible

Example prompts:

oi jade, tudo bem?
jade, me explica isso de um jeito simples
vc acha que vale a pena estudar python hoje?
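
The tips above amount to sending user and assistant turns only. A small sketch of a message builder that follows them is shown here; `build_messages` is a hypothetical helper, not something shipped with the model.

```python
from typing import Dict, List, Optional

Message = Dict[str, str]


def build_messages(user_text: str, history: Optional[List[Message]] = None) -> List[Message]:
    """Append a user turn to prior user/assistant turns, with no system prompt."""
    prior = [turn for turn in (history or []) if turn["role"] in ("user", "assistant")]
    return prior + [{"role": "user", "content": user_text}]


first_turn = build_messages("oi jade, tudo bem?")
print(first_turn)
```

The resulting list can be passed as `messages` to any of the chat interfaces shown elsewhere in this card.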

Example Inference

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Madras1/Jade8b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# A single user turn, with no system prompt
messages = [
    {"role": "user", "content": "oi jade, tudo bem?"}
]

# Render the conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,  # temperature and top_p only apply when sampling is enabled
    temperature=0.7,
    top_p=0.9,
)

# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))

Limitations

Because this is a persona-oriented finetune:

it may sound informal in contexts where a neutral tone would be better
it may over-index on chat style depending on the prompt
it is optimized more for persona consistency than strict formality

GGUF

Model size: 8B params
Architecture: qwen3
Available quantizations: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit
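
As a rough guide to picking a quantization, the nominal weight size is parameters times bits per weight divided by 8. The sketch below applies that to the bit widths listed above, treating "8B" as exactly 8 billion parameters. This is an assumption-laden lower bound: actual GGUF files are usually somewhat larger, since quantized formats also store scaling data and may keep some tensors at higher precision. The helper name `nominal_size_gb` is hypothetical.

```python
def nominal_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Nominal weight size in gigabytes (10**9 bytes): params * bits / 8."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 8e9  # "8B params" as listed above, treated as exactly 8 billion

for bits in (2, 3, 4, 5, 6, 8):
    print(f"{bits}-bit: ~{nominal_size_gb(NUM_PARAMS, bits):.1f} GB")
```

Runtime memory use will be higher than these figures because the context cache and activations also need space.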

Collection including Madras1/Jade8b-GGUF: Jade-v1 (https://github.com/MadrasLe/JadeLLMV-1, 18 items, updated Apr 20)