Instructions to use openEuler/witty-tune-model with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use openEuler/witty-tune-model with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="openEuler/witty-tune-model",
	filename="loraplus_model.gguf",
)

llm.create_chat_completion(
	messages=[
		{"role": "user", "content": "Hello"}
	]
)

Local Apps

llama.cpp

How to use openEuler/witty-tune-model with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf openEuler/witty-tune-model:IQ4_NL
# Run inference directly in the terminal:
llama-cli -hf openEuler/witty-tune-model:IQ4_NL

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf openEuler/witty-tune-model:IQ4_NL
# Run inference directly in the terminal:
llama-cli -hf openEuler/witty-tune-model:IQ4_NL

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf openEuler/witty-tune-model:IQ4_NL
# Run inference directly in the terminal:
./llama-cli -hf openEuler/witty-tune-model:IQ4_NL

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf openEuler/witty-tune-model:IQ4_NL
# Run inference directly in the terminal:
./build/bin/llama-cli -hf openEuler/witty-tune-model:IQ4_NL

Use Docker

docker model run hf.co/openEuler/witty-tune-model:IQ4_NL

Ollama
How to use openEuler/witty-tune-model with Ollama:
```
ollama run hf.co/openEuler/witty-tune-model:IQ4_NL
```

Unsloth Studio

How to use openEuler/witty-tune-model with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for openEuler/witty-tune-model to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for openEuler/witty-tune-model to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for openEuler/witty-tune-model to start chatting

Pi

How to use openEuler/witty-tune-model with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf openEuler/witty-tune-model:IQ4_NL

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "openEuler/witty-tune-model:IQ4_NL"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use openEuler/witty-tune-model with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf openEuler/witty-tune-model:IQ4_NL

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default openEuler/witty-tune-model:IQ4_NL

Run Hermes

hermes

Docker Model Runner
How to use openEuler/witty-tune-model with Docker Model Runner:
```
docker model run hf.co/openEuler/witty-tune-model:IQ4_NL
```

Lemonade

How to use openEuler/witty-tune-model with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull openEuler/witty-tune-model:IQ4_NL

Run and chat with the model

lemonade run user.witty-tune-model-IQ4_NL

List all available models

lemonade list

witty-tune-model

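Introduction

Built on openEuler-Intelligence, OS_model is a domain model that supports CPU-only deployment and inference. We built it for intelligent performance tuning, a relatively complex and specialized scenario, and plan to generalize it through further fine-tuning to operating-system Q&A and operating-system operations and maintenance.
witty-tune-model is currently fine-tuned from Qwen3-4b, but it does not depend strongly on that base model, so a different baseline LLM can be used for fine-tuning later. OS_model was fine-tuned on a corpus of historical performance-tuning records from cloud, big data, database, and storage scenarios. We evaluated the domain model, deepseek_v31 (671b), and the original Qwen3-4b on the EulerCopilot tuning agent across big data (spark), databases (pgsql/mysql), distributed storage (ceph), and virtualization (nginx):
1. Compared with out-of-the-box performance, tuning with the domain model improves spark by 15%+, pgsql/mysql by 50%+, nginx by 150%+, and ceph by 50%+.
2. The domain model performs on par with the full-size deepseek, is slightly better on some applications, and leads Qwen3-4b across the board.
3. Quantized to INT4 and deployed on CPU only, the domain model delivers 2x the throughput of FP16, which makes hour-level tuning possible with essentially no loss in tuning quality.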

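Application	deepseek_v31 (671b) average improvement on typical cases (%)	Qwen3-4b average improvement on typical cases (%)	OS domain model average improvement on typical cases (%)	Quantized OS domain model average improvement on typical cases (%)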
spark	7.52	3.39	11.09	17.37
nginx	190.96	76.42	158.67	166.51
ceph	50.43	33.69	48.38	50.57
pgsql	101.66	104.56	119.83	116.24
mysql	49.17	40.01	50.47	51.49

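Inference Deployment

To address the limited resources available for local deployment, we deploy the quantized domain model on CPU (NPU and GPU deployment are also supported). We recommend llama.cpp as the CPU inference framework: it is easy to install, needs no Python dependencies, performs well on CPU-only deployments, and makes full use of multiple CPU cores.
We optimized llama.cpp specifically for Kunpeng 920 and Kunpeng 920B, using techniques such as core binding from the heterogeneous-fusion OS and instruction-set parallelism optimization. Inference throughput improves by 40% on the 920 (16.5 tokens/s -> 23.15 tokens/s) and by 74% on the 920B (62.6 tokens/s -> 108.98 tokens/s).
Test conditions: the domain model (4B parameters, IQ4_NL quantization) runs on 32 cores on Kunpeng 920 and on 64 cores on Kunpeng 920B, with a prefill length of 6144 tokens and a decode length of 2048 tokens.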

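Deployment platform	Prefill throughput (tokens/s)	Decode throughput (tokens/s)	End-to-end inference throughput (tokens/s)	Improvement over baseline (%)
Kunpeng 920	115.73	4.62	16.50	/
Kunpeng 920 (optimized)	81.68	7.35	23.15	40.28
Kunpeng 920B	74.28	42.54	62.60	/
Kunpeng 920B (optimized)	325.23	36.39	108.98	74.08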
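
The end-to-end column in the table above appears to be total tokens divided by total time for one pass at the stated test lengths (6144 prefill tokens, 2048 decode tokens). The source does not state this formula, so the sketch below is an inference from the numbers; the function names are illustrative.

```python
def end_to_end_throughput(prefill_tps, decode_tps, prefill_len=6144, decode_len=2048):
    """Total tokens divided by total time for one prefill + decode pass."""
    total_seconds = prefill_len / prefill_tps + decode_len / decode_tps
    return (prefill_len + decode_len) / total_seconds


def improvement_pct(baseline, optimized):
    """Relative gain of the optimized throughput over the baseline, in percent."""
    return (optimized / baseline - 1.0) * 100.0


# (prefill tokens/s, decode tokens/s) copied from the table above
kunpeng_920 = end_to_end_throughput(115.73, 4.62)
kunpeng_920_opt = end_to_end_throughput(81.68, 7.35)
kunpeng_920b = end_to_end_throughput(74.28, 42.54)
kunpeng_920b_opt = end_to_end_throughput(325.23, 36.39)

print(f"Kunpeng 920:  {kunpeng_920:.2f} -> {kunpeng_920_opt:.2f} tokens/s "
      f"(+{improvement_pct(kunpeng_920, kunpeng_920_opt):.2f}%)")
print(f"Kunpeng 920B: {kunpeng_920b:.2f} -> {kunpeng_920b_opt:.2f} tokens/s "
      f"(+{improvement_pct(kunpeng_920b, kunpeng_920b_opt):.2f}%)")
```

Under this reading, the decode phase dominates total time on the 920, which is why its end-to-end figure rises even though the optimized prefill throughput is lower.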

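Get the Model

We recommend using our prebuilt domain model.
The domain model is currently fine-tuned only for the tuning agent; we will generalize it to other OS applications as soon as possible.

# Make sure git-xet is installed before cloning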
curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/huggingface/xet-core/refs/heads/main/git_xet/install.sh | sh
git xet install
git clone https://huggingface.co/openEuler/witty-tune-model

Install llama.cpp

Build from source for your hardware (alternative)

# Get the source; Qwen3 requires llama.cpp b5092 or later
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Configure exactly one of the following builds:
cmake -B build    # CPU build
cmake -B build -DGGML_CUDA=ON    # CUDA build
cmake -B build -DGGML_CANN=on -DCMAKE_BUILD_TYPE=release    # CANN build
cmake --build build --config Release -j $(nproc)
# Test
./build/bin/llama-cli -m PATH_TO_MODEL -p "Building a website can be done in 10 steps:" -ngl 99

Deploy the Inference Service

llama-server is a simple HTTP server that provides a set of LLM REST APIs and a simple web front end for interacting with large language models through llama.cpp. It is compatible with the OpenAI API.

./build/bin/llama-server -m witty-tune-model/loraplus_model_IQ4_NL.gguf --jinja -ngl 99 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 40960 -np 4 -n 32768 --no-context-shift -t 64 --host 0.0.0.0 --port 8000
# Test
curl 'http://127.0.0.1:8000/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "witty-tune-model",
    "messages": [
      {"role": "user", "content": "你好"}
    ],
    "stream": false
  }'
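
The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server started above is listening on 127.0.0.1:8000; `build_chat_request` and `chat` are illustrative helper names, not part of llama.cpp.

```python
import json
import urllib.request


def build_chat_request(prompt, model="witty-tune-model"):
    """Build the same JSON payload as the curl example."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }


def chat(prompt, base_url="http://127.0.0.1:8000"):
    """POST one user message to the OpenAI-compatible endpoint and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        reply = json.load(response)
    # OpenAI-style responses carry the text in choices[0].message.content
    return reply["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` prints the model's reply.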

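By default the server listens on http://localhost:8080; pass --host and --port to change this, as the command above does with --host 0.0.0.0 --port 8000. Some notes on that command follow.

Model: llama-server can load a model file from a local path, a remote URL, or the Hugging Face Hub. The -m witty-tune-model/loraplus_model_IQ4_NL.gguf argument above points to a local GGUF file.

Speed:
CPU: llama.cpp runs on the CPU by default. Use -t to set the number of threads; for example, -t 64 uses 64 threads.
GPU: if the build includes GPU support, -ngl offloads layers to the GPU for computation. With multiple GPUs, layers are offloaded to all of them. Use -dev to choose the devices and -sm to choose the parallelism type. For example, -ngl 99 -dev cuda0,cuda1 -sm row offloads all layers to GPU 0 and GPU 1 using row splitting. Adding -fa may also speed up generation.

Sampling parameters: llama.cpp supports many sampling methods and has defaults for most of them. Adjust these parameters to your use case, using the values recommended in the Qwen3 model card as a reference. If you see repetition or endless generation, also pass --presence-penalty, with a maximum value of 2.0.

Context management: llama.cpp uses a "rotating" context by default. -c sets the maximum context length (default 4096; 0 loads the value from the model), and -n sets the maximum number of tokens per generation (default -1 generates until the model stops; -2 generates until the context is full). When the context fills up before generation ends, the first --keep tokens of the initial prompt (default 0; -1 keeps all of them) are retained and the first half of the remaining tokens is discarded. The model then continues generating from the new context. Pass --no-context-shift to disable this rotation, in which case generation stops once -c is reached. llama.cpp also supports YaRN, enabled with -c 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768.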
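
When the server is launched from a script, the model, speed, and sampling flags can be assembled in one place. The sketch below uses only flags that appear in the command above; `build_server_command` and its defaults are illustrative, and the binary path assumes the source build described earlier.

```python
import shlex


def build_server_command(model_path, threads=64, gpu_layers=99,
                         temp=0.6, top_k=20, top_p=0.95, min_p=0,
                         host="0.0.0.0", port=8000,
                         binary="./build/bin/llama-server"):
    """Return the llama-server argument list for the given settings."""
    return [
        binary,
        "-m", model_path,            # local GGUF file
        "--jinja",
        "-t", str(threads),          # CPU threads
        "-ngl", str(gpu_layers),     # layers offloaded to GPU, if built with GPU support
        "--temp", str(temp),
        "--top-k", str(top_k),
        "--top-p", str(top_p),
        "--min-p", str(min_p),
        "--host", host,
        "--port", str(port),
    ]


command = build_server_command("witty-tune-model/loraplus_model_IQ4_NL.gguf")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.Popen(command)` to start the server.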
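
The rotation rule described above can be modeled in a few lines. This is a simplified sketch of the described behavior, not llama.cpp's implementation; `shift_context`, `generate`, and the `next_token` callback are hypothetical names used for illustration.

```python
def shift_context(tokens, n_keep=0):
    """Keep the first n_keep tokens and drop the first half of the rest."""
    kept = tokens[:n_keep]
    rest = tokens[n_keep:]
    n_discard = len(rest) // 2
    return kept + rest[n_discard:]


def generate(prompt_tokens, next_token, n_ctx, n_predict, n_keep=0, context_shift=True):
    """Append up to n_predict tokens, rotating or stopping when the context is full."""
    tokens = list(prompt_tokens)
    produced = 0
    while produced < n_predict:
        if len(tokens) >= n_ctx:
            if not context_shift:
                break  # models --no-context-shift: stop once the context limit is reached
            tokens = shift_context(tokens, n_keep)
        tokens.append(next_token(tokens))
        produced += 1
    return tokens


# Ten tokens, keep the first two: the remaining eight lose their first four.
print(shift_context(list(range(10)), n_keep=2))
```

With rotation disabled, `generate` returns as soon as the token list reaches `n_ctx`, which mirrors the behavior the deployment command selects.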

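Integrate with the openEuler-Intelligence Tuning Agent

First deploy the openEuler-Intelligence tuning agent:

openEuler-Intelligence Tuning Agent Installation and Usage Guide

Edit the .env.yaml configuration file (in the project's config directory):

vim config/.env.yaml

Change the following settings:

LLM_KEY: "sk-123456"                     # Required: API key for the model service
LLM_URL: "http://172.168.178.107:8000"   # Required: API endpoint of the LLM service, e.g. "https://api.deepseek.com"
LLM_MODEL_NAME: "witty-tune-model"   # Required: name of the model to call, e.g. deepseek-chat
LLM_MAX_TOKENS: 8192                     # Optional: maximum number of tokens to generate, e.g. 512 or 2048

Then start the main tuning program: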

euler-copilot-tune

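Training and Fine-Tuning Process

Baseline Model

We currently use Qwen3-4B as the baseline model. For our tuning method, however, the baseline model can be swapped according to the usage strategy.

# Download the baseline model
pip install modelscope
modelscope download --model Qwen/Qwen3-4B-Instruct-2507 --local_dir ./Qwen3-4B-Instruct-2507
# Or use Hugging Face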
pip install huggingface_hub
huggingface-cli download Qwen/Qwen3-4B-Instruct-2507 --local-dir ./Qwen3-4B-Instruct-2507

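Data Preparation

Following the inference format currently used by the EulerCopilot tuning agent, and in the manner of knowledge distillation, we collect answers from large models that perform well on tuning tasks (deepseek_v3.1, qwen3-max, qwen3-235B-A22B, and others). Most small-scale end-to-end LLM training in industry builds its data the same way. Tuning data generally contains the following kinds of Q&A pairs: system-state and application-state analysis, bottleneck analysis, tuning-strategy analysis, and parameter inference.

Data cleaning:

Balance positive and negative samples: collect positive samples, meaning tuned parameters with good results (performance improvement above 5%), and keep negative samples, meaning tuned parameters with poor results (no improvement or a regression), at 5% of the total data volume, so that the domain model can reflect on cases where tuning went badly.
Reinforce the standard format: clean the JSON-formatted outputs and add JSON reinforcement Q&A pairs amounting to 15% of the total data volume, to preserve the fine-tuned model's ability to understand JSON.
Raise the quality of domain knowledge: correct errors in the knowledge base and, drawing on historical expert tuning experience, add parameters with a significant performance impact, noting their importance in the descriptions.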
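
The sample mix described above can be sketched as follows. The source gives only the proportions (negatives at 5% and JSON reinforcement pairs at 15% of the total) and the 5% improvement threshold; the `Sample` type, the function names, and the choice to drop samples with gains between 0% and 5% are assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    prompt: str
    answer: str
    improvement_pct: float  # measured gain over out-of-the-box performance


def split_by_outcome(samples, threshold_pct=5.0):
    """Positives gained more than the threshold; negatives gained nothing or regressed."""
    positives = [s for s in samples if s.improvement_pct > threshold_pct]
    negatives = [s for s in samples if s.improvement_pct <= 0.0]
    return positives, negatives


def target_counts(n_positive, negative_pct=5, json_pct=15):
    """Return (negatives, json_pairs, total) so that each share matches its percentage."""
    positive_pct = 100 - negative_pct - json_pct
    total = n_positive * 100 // positive_pct
    return total * negative_pct // 100, total * json_pct // 100, total


def build_dataset(samples, json_samples, seed=0):
    """Combine all positives with sampled negatives and JSON reinforcement pairs."""
    rng = random.Random(seed)
    positives, negatives = split_by_outcome(samples)
    n_neg, n_json, _ = target_counts(len(positives))
    chosen_neg = rng.sample(negatives, min(n_neg, len(negatives)))
    chosen_json = rng.sample(json_samples, min(n_json, len(json_samples)))
    dataset = positives + chosen_neg + chosen_json
    rng.shuffle(dataset)
    return dataset
```

For example, 800 positive samples imply a total of 1000 samples, of which 50 are negatives and 150 are JSON reinforcement pairs.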

LLaMA-Factory

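We use LLaMA-Factory as the fine-tuning framework. It supports LLM fine-tuning across many models, precisions, algorithms, and integration methods, and is simple to deploy and use.

# Download the source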
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory

# Create and activate a conda environment
conda create -y -n llamafactory python=3.11
conda activate llamafactory

# Install from source
# CUDA environment
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126 # Optional: adjust the index URL to your CUDA version, or skip this and let the next command detect and install it automatically
pip install -e ".[torch,metrics]" --no-build-isolation
# Ascend NPU environment
pip install -e ".[torch-npu,metrics]" -i https://pypi.tuna.tsinghua.edu.cn/simple

# Install from a prebuilt image
# CUDA environment
docker run -it --rm --gpus=all --ipc=host hiyouga/llamafactory:latest   # This image is built on Ubuntu 22.04 (x86_64), CUDA 12.4, Python 3.11, PyTorch 2.6.0, and Flash-attn 2.7.4. All images: https://hub.docker.com/r/hiyouga/llamafactory/tags
# Ascend NPU environment (Ascend A3 products are not yet supported)
docker pull quay.io/ascend/llamafactory:latest-npu-a2
docker run -dit --ipc=host --network host --name 'llamafactory' --privileged -v /usr/local/Ascend/driver:/usr/local/Ascend/driver  -v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware  -v /usr/local/sbin/:/usr/local/sbin/ -v /home/:/home/ quay.io/ascend/llamafactory:latest-npu-a2
docker exec -it llamafactory bash

# Verify the installation
llamafactory-cli env

LoRA+

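We use LoRA+ as the training method. LoRA+ is a variant of LoRA that sets different learning rates for the different adapter matrices, which makes learning more efficient and improves both performance (by about 1%-2%) and fine-tuning speed (by about 2x).

Training

LLaMA Factory offers a one-stop workflow, and LLaMA Board provides visual fine-tuning: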
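
The core idea of LoRA+ is that the adapter's B matrices train with a larger learning rate than its A matrices. The sketch below groups parameter names that way; it assumes PEFT-style names containing `lora_A` and `lora_B`, and the base learning rate and the ratio of 16 are illustrative values, not the settings used for this model. LLaMA-Factory appears to expose the ratio as `loraplus_lr_ratio`; check its documentation before relying on that name.

```python
def loraplus_param_groups(param_names, base_lr=1e-4, lr_ratio=16.0):
    """Split LoRA parameters into two groups with different learning rates."""
    lora_a = [name for name in param_names if "lora_A" in name]
    lora_b = [name for name in param_names if "lora_B" in name]
    return [
        {"params": lora_a, "lr": base_lr},             # A matrices: base rate
        {"params": lora_b, "lr": base_lr * lr_ratio},  # B matrices: scaled-up rate
    ]


names = [
    "layers.0.q_proj.lora_A.weight",
    "layers.0.q_proj.lora_B.weight",
    "layers.0.v_proj.lora_A.weight",
    "layers.0.v_proj.lora_B.weight",
]
groups = loraplus_param_groups(names)
for group in groups:
    print(group["lr"], group["params"])
```

In a real training loop the lists would hold the parameter tensors rather than their names, and the groups would be handed to an optimizer that accepts a per-group `lr`, such as `torch.optim.AdamW`.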

llamafactory-cli webui

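Format Conversion

GGUF is a file format for storing the information needed to run a model, including but not limited to the model weights, model hyperparameters, default generation configuration, and tokenizer. It targets llama.cpp inference; vllm and sglang support it only partially.

# Get the source; Qwen3 requires llama.cpp b5092 or later
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# Format conversion requires Python dependencies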
pip install -r requirements/requirements-convert_hf_to_gguf.txt
python convert_hf_to_gguf.py witty-tune-model/loraplus_model --outfile witty-tune-model/loraplus_model.gguf

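Quantization

We recommend IQ4_NL as the quantization method. It is cheap, needs no calibration set, loses less than 1% of overall model capability, showed no accuracy loss in our tuning-scenario tests, and improves the model's understanding of JSON. With IQ4_NL, inference is 150% faster than with the bf16 model. Compared with Q4_K_M, overall inference speed is about the same (prefill is only 13% slower and decode is 25% faster), while capability in the tuning scenario improves markedly. Note that GGUF models quantized with llama.cpp cannot be deployed with vllm or sglang.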

llama-quantize witty-tune-model/loraplus_model.gguf witty-tune-model/loraplus_model_IQ4_NL.gguf IQ4_NL
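
A rough size estimate shows why the 4-bit model suits CPU-only machines. The sketch assumes IQ4_NL stores weights in blocks of 32 four-bit codes plus one 16-bit scale per block, which is our understanding of the ggml layout rather than something the source states, and it uses a nominal 4 billion parameters. Real GGUF files keep some tensors at higher precision and add metadata, so actual sizes will differ.

```python
def bits_per_weight(block_size=32, bits_per_code=4, scale_bits=16):
    """Average storage per weight for a block-quantized format."""
    return (block_size * bits_per_code + scale_bits) / block_size


def weight_gigabytes(n_params, bpw):
    """Approximate weight storage in GB (10^9 bytes)."""
    return n_params * bpw / 8 / 1e9


N_PARAMS = 4e9  # nominal parameter count, an assumption for illustration

iq4_nl_bpw = bits_per_weight()
print(f"IQ4_NL: {iq4_nl_bpw} bits/weight, about {weight_gigabytes(N_PARAMS, iq4_nl_bpw):.2f} GB")
print(f"bf16:   16 bits/weight, about {weight_gigabytes(N_PARAMS, 16):.2f} GB")
```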
