How to use from
llama.cpp
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf g023/qwen3-tiny-v2-finetuned:Q8_0
# Run inference directly in the terminal:
llama-cli -hf g023/qwen3-tiny-v2-finetuned:Q8_0
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf g023/qwen3-tiny-v2-finetuned:Q8_0
# Run inference directly in the terminal:
llama-cli -hf g023/qwen3-tiny-v2-finetuned:Q8_0
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf g023/qwen3-tiny-v2-finetuned:Q8_0
# Run inference directly in the terminal:
./llama-cli -hf g023/qwen3-tiny-v2-finetuned:Q8_0
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf g023/qwen3-tiny-v2-finetuned:Q8_0
# Run inference directly in the terminal:
./build/bin/llama-cli -hf g023/qwen3-tiny-v2-finetuned:Q8_0
Use Docker
docker model run hf.co/g023/qwen3-tiny-v2-finetuned:Q8_0
Quick Links

Qwen3-g023-tiny-v2-FT-Q8_0 - GRPO Finetuned Q8_0 GGUF Export

https://huggingface.co/g023/qwen3-tiny-v2-finetuned/

Q8_0 GGUF export of a GRPO finetuned Qwen3 model to achieve improved reasoning and reduced repetition. Original SRC Model: https://huggingface.co/g023/qwen3-tiny-v2

THIS IS A WIP (WORK IN PROGRESS)

Files

  • Qwen3-g023-tiny-v2-FT-Q8_0.gguf: Q8_0 GGUF model (~1.81 GB)
  • Modelfile: Ollama template + tested default sampling settings
  • params_best.json: Best sampled parameters from automated sweep
  • sweep_results.json: Full sweep results and per-test outcomes

Tested Best Parameters (Default in Modelfile)

  • temperature: 0.65
  • top_p: 0.9
  • top_k: 20
  • min_p: 0.0
  • repeat_penalty: 1.05
  • presence_penalty: 0.1
  • frequency_penalty: 0.1
  • num_ctx: 40000

Usage (Ollama)

ollama create qwen3-g023-tiny-v2-FT-Q8_0 -f Modelfile
ollama run qwen3-g023-tiny-v2-FT-Q8_0

# thinking on
ollama run qwen3-g023-tiny-v2-FT-Q8_0 --think "Explain why the sky is blue"

# thinking off
ollama run qwen3-g023-tiny-v2-FT-Q8_0 --think=false "Explain why the sky is blue"

or pull from huggingface directly to ollama:

ollama run hf.co/g023/qwen3-tiny-v2-finetuned:Q8_0

Notes

  • Template is the Qwen3-compatible template with think/no_think handling.
  • If you want stricter non-thinking behavior, compare alternatives in sweep_results.json.
Downloads last month
7
GGUF
Model size
2B params
Architecture
qwen3
Hardware compatibility
Log In to add your hardware

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for g023/qwen3-tiny-v2-finetuned

Finetuned
Qwen/Qwen3-1.7B
Quantized
(1)
this model