Qwen3-g023-tiny-v2-FT-Q8_0 - GRPO Finetuned Q8_0 GGUF Export

https://huggingface.co/g023/qwen3-tiny-v2-finetuned/

Q8_0 GGUF export of a GRPO-finetuned Qwen3 model, trained for improved reasoning and reduced repetition. Original source model: https://huggingface.co/g023/qwen3-tiny-v2

THIS IS A WIP (WORK IN PROGRESS)

Files

  • Qwen3-g023-tiny-v2-FT-Q8_0.gguf: Q8_0 GGUF model (~1.81 GB)
  • Modelfile: Ollama template + tested default sampling settings
  • params_best.json: Best sampled parameters from automated sweep
  • sweep_results.json: Full sweep results and per-test outcomes

Tested Best Parameters (Default in Modelfile)

  • temperature: 0.65
  • top_p: 0.9
  • top_k: 20
  • min_p: 0.0
  • repeat_penalty: 1.05
  • presence_penalty: 0.1
  • frequency_penalty: 0.1
  • num_ctx: 40000
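The tested defaults above are what the repo's Modelfile sets. A minimal sketch of how they map to Ollama PARAMETER directives (the actual Modelfile also carries the Qwen3 chat TEMPLATE, omitted here; the GGUF path assumes the file sits next to the Modelfile):

```
FROM ./Qwen3-g023-tiny-v2-FT-Q8_0.gguf

PARAMETER temperature 0.65
PARAMETER top_p 0.9
PARAMETER top_k 20
PARAMETER min_p 0.0
PARAMETER repeat_penalty 1.05
PARAMETER presence_penalty 0.1
PARAMETER frequency_penalty 0.1
PARAMETER num_ctx 40000
```

Build it with the `ollama create` command shown under Usage below.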

Usage (Ollama)

ollama create qwen3-g023-tiny-v2-FT-Q8_0 -f Modelfile
ollama run qwen3-g023-tiny-v2-FT-Q8_0

# thinking on
ollama run qwen3-g023-tiny-v2-FT-Q8_0 --think "Explain why the sky is blue"

# thinking off
ollama run qwen3-g023-tiny-v2-FT-Q8_0 --think=false "Explain why the sky is blue"

Or pull directly from Hugging Face into Ollama:

ollama run hf.co/g023/qwen3-tiny-v2-finetuned:Q8_0
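The same settings can also be passed to a running Ollama server over its HTTP API. A hedged sketch using only the Python standard library: the model name assumes the `ollama create` step above was run, and the top-level `think` field is Ollama's API-side thinking toggle (mirroring `--think` on the CLI):

```python
import json
import urllib.request

# Default local Ollama endpoint.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, think: bool = True) -> dict:
    """Build an /api/generate payload with the tested default sampling settings."""
    return {
        "model": "qwen3-g023-tiny-v2-FT-Q8_0",
        "prompt": prompt,
        "think": think,     # thinking on/off, like --think / --think=false
        "stream": False,    # return one JSON object instead of a stream
        "options": {
            "temperature": 0.65,
            "top_p": 0.9,
            "top_k": 20,
            "min_p": 0.0,
            "repeat_penalty": 1.05,
            "presence_penalty": 0.1,
            "frequency_penalty": 0.1,
            "num_ctx": 40000,
        },
    }


def generate(prompt: str, think: bool = True) -> str:
    """Send the payload to the local Ollama server and return the response text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_request(prompt, think)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


# Example (requires a running Ollama server):
# print(generate("Explain why the sky is blue", think=False))
```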

Notes

  • Template is the Qwen3-compatible template with think/no_think handling.
  • If you want stricter non-thinking behavior, compare the alternative parameter sets in sweep_results.json.
Model Details

  • Architecture: qwen3
  • Size: ~2B parameters
  • Quantization: 8-bit (Q8_0)
  • Finetuned from: Qwen/Qwen3-1.7B