xujing99 committed on Feb 13

Commit

8d2e693

1 Parent(s): 5ebb1c2

add README

Files changed (1)

README.md +186 -0

README.md ADDED

	@@ -0,0 +1,186 @@

+---
+license: apache-2.0
+---
+# witty-tune-model
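+## Introduction
+Built on openEuler-Intelligence, OS_model is a domain model that supports CPU-only deployment and inference. We built the domain model for intelligent tuning, a relatively complex and specialized scenario; we plan to fine-tune and generalize it to modules such as operating system Q&A and operating system operations and maintenance.
+witty-tune-model is currently fine-tuned from the Qwen3-4b model, but it does not depend strongly on the base model, so other baseline LLMs can be fine-tuned in the future.
+OS_model was fine-tuned on a corpus of historical performance tuning data from cloud, big data, database, and storage scenarios. We tested the domain model, deepseek_v31 (671b), and the original Qwen3-4b model with the EulerCopilot tuning agent on big data (spark), database (pgsql/mysql), distributed storage (ceph), and virtualization (nginx) applications.
+1. Compared with out-of-the-box performance, tuning with the domain model improves big data (spark) by **15%+**, databases (pgsql/mysql) by **50%+**, virtualization (nginx) by **150%+**, and distributed storage (ceph) by **50%+**;
+2. The domain model performs on par with the full-size deepseek: it is **slightly better than the full-size deepseek on some applications and leads Qwen3-4b across the board**;
+3. Quantized to INT4 and deployed on CPU only, the domain model **reaches a 2x throughput improvement over FP16 and achieves hour-level tuning**, with essentially no loss in tuning performance.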
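+| Application | deepseek_v31 (671b): average improvement on typical cases (%) | Qwen3-4b: average improvement on typical cases (%) | OS domain model: average improvement on typical cases (%) | Quantized OS domain model: average improvement on typical cases (%) |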
+| :---: | :---: | :---: | :---: | :---: |
+| spark | 7.52 | 3.39 | 11.09 | **17.37** |
+| nginx | **190.96** | 76.42 | 158.67 | 166.51 |
+| ceph | 50.43 | 33.69 | 48.38 | **50.57** |
+| pgsql | 101.66 | 104.56 | **119.83** | 116.24 |
+| mysql | 49.17 | 40.01 | 50.47 | **51.49** |
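+## Inference Deployment
+To address limited resources in local deployments, we deploy the quantized domain model on CPU (NPU/GPU deployment is also supported). We recommend llama.cpp as the CPU inference framework: it is easy to install, needs no Python dependencies, performs well in CPU-only deployments, and makes full use of multiple CPU cores.
+We optimized llama.cpp specifically for Kunpeng 920/Kunpeng 920B, using techniques such as core binding in the heterogeneous-fusion OS and instruction-set parallelism optimizations. Inference throughput improves by 40% on the 920 (16.5 tokens/s -> 23.15 tokens/s) and by 74% on the 920B (62.6 tokens/s -> 108.98 tokens/s).
+***Test conditions: Kunpeng 920 deploys the domain model (4B parameters, IQ4_NL quantization) on 32 cores, Kunpeng 920B deploys it on 64 cores; prefill length 6144, decode length 2048***
+| Platform | Prefill throughput (tokens/s) | Decode throughput (tokens/s) | End-to-end inference throughput (tokens/s) | Improvement over baseline (%) |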
+| :---: | :---: | :---: | :---: | :---: |
+| Kunpeng 920 | 115.73 | 4.62 | **16.50** | / |
+| Kunpeng 920 (optimized) | 81.68 | 7.35 | **$\color{red} {23.15} $** | **$\color{red} {40.28} $** |
+| Kunpeng 920B | 74.28 | 42.54 | **62.60** | / |
+| Kunpeng 920B (optimized) | 325.23 | 36.39 | **$\color{red} {108.98} $** | **$\color{red} {74.08} $** |
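The end-to-end throughput column appears to be consistent with total tokens divided by total time for the stated test lengths (6144 prefill tokens, 2048 decode tokens). That formula is inferred from the numbers and is not stated by the authors; the sketch below, with hypothetical function names, recomputes the last two columns under that assumption.

```python
PREFILL_TOKENS = 6144  # prefill length from the test conditions
DECODE_TOKENS = 2048   # decode length from the test conditions


def overall_throughput(prefill_tps: float, decode_tps: float) -> float:
    """Tokens per second over one full request (assumed formula, not from the README)."""
    total_time = PREFILL_TOKENS / prefill_tps + DECODE_TOKENS / decode_tps
    return (PREFILL_TOKENS + DECODE_TOKENS) / total_time


def improvement_pct(baseline: float, optimized: float) -> float:
    """Relative improvement of optimized over baseline, in percent."""
    return (optimized / baseline - 1.0) * 100.0


# (prefill tokens/s, decode tokens/s) pairs copied from the table above
k920 = overall_throughput(115.73, 4.62)
k920_opt = overall_throughput(81.68, 7.35)
k920b = overall_throughput(74.28, 42.54)
k920b_opt = overall_throughput(325.23, 36.39)

print(f"920:  {k920:.2f} -> {k920_opt:.2f} tokens/s ({improvement_pct(k920, k920_opt):.2f}%)")
print(f"920B: {k920b:.2f} -> {k920b_opt:.2f} tokens/s ({improvement_pct(k920b, k920b_opt):.2f}%)")
```

Under this reading, decoding takes most of the request time on the Kunpeng 920, which would explain why its end-to-end figure improves even though prefill throughput drops after optimization.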
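+### Get the Model
+We recommend using our pre-built domain model.
+***The domain model is currently fine-tuned only for the tuning agent; we will generalize it to other OS applications as soon as possible.***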
+```bash
+# Make sure git-xet is installed before cloning
+curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/huggingface/xet-core/refs/heads/main/git_xet/install.sh | sh
+git xet install
+git clone https://huggingface.co/openEuler/witty-tune-model
+```
+### Install llama.cpp
+#### Build from source for your hardware (alternative)
+```bash
+# Get the source; Qwen3 requires llama.cpp b5092 or later
+git clone https://github.com/ggml-org/llama.cpp
+cd llama.cpp
+# Configure the build: run only ONE of the following three commands
+cmake -B build    # CPU inference build
+cmake -B build -DGGML_CUDA=ON    # CUDA inference build
+cmake -B build -DGGML_CANN=on -DCMAKE_BUILD_TYPE=release    # CANN inference build
+cmake --build build --config Release -j $(nproc)
+# Test
+./build/bin/llama-cli -m PATH_TO_MODEL -p "Building a website can be done in 10 steps:" -ngl 99
+```
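+### Deploy the Inference Service
+llama-server is a simple HTTP server that provides a set of LLM REST APIs and a simple web frontend for interacting with large language models through llama.cpp. Its API is compatible with the OpenAI API.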
+```bash
+./build/bin/llama-server -m witty-tune-model/loraplus_model_IQ4_NL.gguf --jinja -ngl 99 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 40960 -np 4 -n 32768 --no-context-shift -t 64 --host 0.0.0.0 --port 8000
+# Test
+curl 'http://127.0.0.1:8000/v1/chat/completions' \
+--header 'Content-Type: application/json' \
+--data '{
+    "model": "witty-tune-model",
+    "messages": [
+      {"role": "user", "content": "你好"}
+    ],
+    "stream": false
+  }'
+```
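+By default, the server listens on http://localhost:8080; pass --host and --port to change this, as the command above does (0.0.0.0:8000).
+Some notes on the command above:
+Model: llama-server can load model files from a local path, a remote URL, or the Hugging Face Hub.
+The -m witty-tune-model/loraplus_model_IQ4_NL.gguf above means a local GGUF file is used.
+Speed optimization:
+CPU: llama-server uses the CPU by default. Use -t to set the number of threads; for example, -t 64 uses 64 threads.
+GPU: if the build includes GPU support, -ngl offloads some layers to the GPU for computation. With multiple GPUs, layers are offloaded to all of them. Use -dev to select the devices and -sm to select the split mode. For example, -ngl 99 -dev cuda0,cuda1 -sm row offloads all layers to GPU 0 and GPU 1 using row split. Adding -fa may also speed up generation.
+Sampling parameters: llama.cpp supports many sampling methods and has default settings for many of them. Adjust them to your situation; the parameters recommended in the Qwen3 model card are a useful reference. If you encounter repetition or endless generation, additionally pass --presence-penalty (maximum 2.0).
+Context management: by default, llama.cpp manages the context by "rotation". -c sets the maximum context length (default 4096; 0 means load it from the model), and -n sets the maximum length of each generation (default -1 means generate until finished; -2 means generate until the context is full). When the context is full but generation has not ended, the first --keep tokens of the initial prompt (default 0; -1 means all) are kept and the first half of the remainder is discarded. The model then continues generating from the new context. Set --no-context-shift to disable this rotation; generation then stops once -c is reached.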
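As a rough illustration of the rotation rule described above, the following sketch applies one shift to a list of token IDs. It is a simplified model of the described behaviour, not llama.cpp's implementation; `shift_context` and its arguments are hypothetical names, and `n_keep` is assumed to be an already resolved, non-negative count.

```python
def shift_context(tokens: list[int], n_keep: int) -> list[int]:
    """Keep the first n_keep tokens and drop the first half of the remainder."""
    n_keep = min(max(n_keep, 0), len(tokens))
    rest = tokens[n_keep:]
    n_discard = len(rest) // 2  # first half of the remainder is discarded
    return tokens[:n_keep] + rest[n_discard:]


context = list(range(10))  # stand-in for a full context of 10 token IDs
shifted = shift_context(context, n_keep=2)
print(shifted)  # [0, 1, 6, 7, 8, 9]
```

With --no-context-shift, as in the server command above, no such shift happens and generation simply stops at the context limit.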
+llama.cpp supports YaRN, which can be enabled with -c 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768.
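In the example values, the scale factor is the target context divided by the original context (131072 = 4 x 32768). A small helper can derive the flags for another target length under that assumption; `yarn_args` is a hypothetical name, and restricting the scale to whole multiples is a simplification made for this sketch.

```python
def yarn_args(target_ctx: int, orig_ctx: int = 32768) -> list[str]:
    """Build the YaRN flags quoted above for a target context length."""
    if target_ctx <= 0 or target_ctx % orig_ctx != 0:
        raise ValueError("this sketch only handles whole multiples of orig_ctx")
    scale = target_ctx // orig_ctx
    return [
        "-c", str(target_ctx),
        "--rope-scaling", "yarn",
        "--rope-scale", str(scale),
        "--yarn-orig-ctx", str(orig_ctx),
    ]


print(" ".join(yarn_args(131072)))
```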
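+### Connect to the openEuler-Intelligence Tuning Agent
+First deploy the openEuler-Intelligence tuning agent:
+[openEuler-Intelligence Tuning Agent Installation and Usage Guide](https://gitee.com/openeuler/A-Tune/tree/euler-copilot-tune/)
+Edit the .env.yaml configuration file (in the project's config directory):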
+```bash
+vim config/.env.yaml
+```
+Modify the following section:
+```
+LLM_KEY: "sk-123456"                     # Required: API key for the model service
+LLM_URL: "http://172.168.178.107:8000"   # Required: API endpoint of the LLM service, e.g. "https://api.deepseek.com"
+LLM_MODEL_NAME: "witty-tune-model"   # Required: name of the model to call, e.g. deepseek-chat
+LLM_MAX_TOKENS: 8192                     # Optional: maximum number of tokens to generate, e.g. 512 or 2048
+```
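Before starting the agent, it can help to confirm that the configured endpoint answers a chat request. The sketch below mirrors the curl test from the deployment section using only the Python standard library. The helper names are hypothetical, the values are copied from the example configuration, and appending `/v1/chat/completions` to `LLM_URL` is an assumption; how the agent itself builds the request path is not stated here.

```python
import json
import urllib.request

# Values copied from the example .env.yaml; adjust to your deployment.
LLM_URL = "http://172.168.178.107:8000"
LLM_MODEL_NAME = "witty-tune-model"
LLM_KEY = "sk-123456"


def build_request(prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the llama-server endpoint."""
    payload = {
        "model": LLM_MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        LLM_URL.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {LLM_KEY}",
        },
        method="POST",
    )


def ask(prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("Hello")` on a machine that can reach the server should return the model's reply; a connection error at this point usually means the host, port, or firewall settings need attention before the agent is started.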
+Then start the main tuning program:
+```
+euler-copilot-tune
+```
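+## Training and Fine-tuning
+### Baseline Model
+We currently use Qwen3-4B as the baseline model; with our tuning method, the baseline model can in fact be swapped according to the usage strategy.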
+```bash
+# Download the baseline model
+pip install modelscope
+modelscope download --model Qwen/Qwen3-4B-Instruct-2507 --local_dir ./Qwen3-4B-Instruct-2507
+# Or use Hugging Face
+pip install huggingface_hub
+huggingface-cli download Qwen/Qwen3-4B-Instruct-2507 --local-dir ./Qwen3-4B-Instruct-2507
+```
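+### Data Preparation
+Following the current inference format of the EulerCopilot tuning agent and borrowing from knowledge distillation, we collect answers from LLMs that perform well on tuning tasks (deepseek_v3.1, qwen3-max, qwen3-235B-A22B, etc.); most end-to-end training of smaller LLMs in the industry also builds its data this way. Tuning data generally contains the following kinds of Q&A pairs: system and application state analysis, bottleneck analysis, tuning strategy analysis, and parameter inference.
+Data cleaning:
+- Balance positive and negative samples: collect positive samples, i.e. tuning parameters that worked well (performance gain above 5%), and keep negative samples, i.e. tuning parameters that worked poorly (no gain or a regression), amounting to 5% of the total data, so that the domain model can reflect on poor tuning results.
+- Reinforce the standard format: clean the JSON-formatted outputs and add JSON reinforcement Q&A pairs amounting to 15% of the total data, so that the fine-tuned model retains its understanding of JSON.
+- Improve domain knowledge quality: correct erroneous entries in the knowledge base; based on historical expert tuning experience, add parameters with a significant performance impact to the knowledge base and state their importance in the descriptions.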
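The sample-balancing rules above can be sketched as follows. This is an illustration only: the `gain_pct` field and both function names are hypothetical, and it assumes the 5% and 15% shares refer to the final dataset with positive samples making up the remaining 80%, which the text does not state explicitly.

```python
def split_by_gain(records: list[dict], threshold_pct: float = 5.0):
    """Split tuning records into positives (gain above threshold) and negatives (no gain or regression)."""
    positives = [r for r in records if r["gain_pct"] > threshold_pct]
    negatives = [r for r in records if r["gain_pct"] <= 0.0]
    return positives, negatives  # records with a small positive gain are dropped


def mix_targets(n_positive: int, negative_pct: int = 5, json_pct: int = 15) -> dict:
    """Target counts of negative and JSON samples for a given number of positives."""
    positive_pct = 100 - negative_pct - json_pct  # assumed share of positive samples
    total = round(n_positive * 100 / positive_pct)
    return {
        "total": total,
        "negative": round(total * negative_pct / 100),
        "json": round(total * json_pct / 100),
    }


records = [{"gain_pct": 12.0}, {"gain_pct": 3.0}, {"gain_pct": 0.0}, {"gain_pct": -4.0}]
positives, negatives = split_by_gain(records)
print(len(positives), len(negatives))
print(mix_targets(800))
```

For example, 800 positive samples would be paired with 50 negative samples and 150 JSON reinforcement pairs under these assumptions.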
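+### LLaMA-Factory
+We use LLaMA-Factory as the fine-tuning framework. It supports LLM fine-tuning across many models, precisions, algorithms, and integration methods, and is simple to deploy and use.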
+```bash
+# Download the source
+git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
+cd LLaMA-Factory
+# Create and activate a conda environment
+conda create -y -n llamafactory python=3.11
+conda activate llamafactory
+# Install from source
+# CUDA environment
+pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126 # Optional: adjust the index URL to your CUDA version, or skip this and let the next command detect and install automatically
+pip install -e ".[torch,metrics]" --no-build-isolation
+# Ascend NPU environment
+pip install -e ".[torch-npu,metrics]" -i https://pypi.tuna.tsinghua.edu.cn/simple
+# Install from a pre-built image
+# CUDA environment
+docker run -it --rm --gpus=all --ipc=host hiyouga/llamafactory:latest   # This image is built on Ubuntu 22.04 (x86_64), CUDA 12.4, Python 3.11, PyTorch 2.6.0, and Flash-attn 2.7.4. All images: https://hub.docker.com/r/hiyouga/llamafactory/tags
+# Ascend NPU environment (Ascend A3 products are not supported yet)
+docker pull quay.io/ascend/llamafactory:latest-npu-a2
+docker run -dit --ipc=host --network host --name 'llamafactory' --privileged -v /usr/local/Ascend/driver:/usr/local/Ascend/driver  -v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware  -v /usr/local/sbin/:/usr/local/sbin/ -v /home/:/home/ quay.io/ascend/llamafactory:latest-npu-a2
+docker exec -it llamafactory bash
+# Verify the installation
+llamafactory-cli env
+```
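+### LoRA+
+We use LoRA+ as the training method. LoRA+ is a LoRA variant that sets different learning rates for different matrices, which makes learning more efficient and can improve performance (by about 1%-2%) and fine-tuning speed (by about 2x).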
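The core idea can be sketched without any training framework: the LoRA B matrices get a learning rate that is a fixed multiple of the rate used for the A matrices. The function name, the `lora_A`/`lora_B` substring convention (borrowed from common PEFT-style parameter naming), and the ratio of 16 are assumptions for illustration; the ratio actually used for this model is not given here.

```python
def loraplus_param_groups(param_names: list[str], base_lr: float, lr_ratio: float = 16.0) -> list[dict]:
    """Assign base_lr to LoRA A matrices and base_lr * lr_ratio to LoRA B matrices."""
    group_a = [name for name in param_names if "lora_A" in name]
    group_b = [name for name in param_names if "lora_B" in name]
    return [
        {"params": group_a, "lr": base_lr},
        {"params": group_b, "lr": base_lr * lr_ratio},  # B learns faster than A
    ]


names = [
    "layers.0.q_proj.lora_A.weight",
    "layers.0.q_proj.lora_B.weight",
    "layers.0.v_proj.lora_A.weight",
    "layers.0.v_proj.lora_B.weight",
]
groups = loraplus_param_groups(names, base_lr=1e-4)
for group in groups:
    print(group["lr"], group["params"])
```

The returned structure has the same shape as optimizer parameter groups in common deep learning frameworks, so in practice the name lists would be replaced by the actual parameter tensors.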
+### Training
+LLaMA Factory offers a one-stop workflow; LLaMA Board provides visual fine-tuning:
+```bash
+llamafactory-cli webui
+```
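+### Format Conversion
+GGUF is a file format that stores the information needed to run a model, including but not limited to model weights, model hyperparameters, default generation configuration, and the tokenizer. It is suited to llama.cpp inference; vllm and sglang offer limited support.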
+```bash
+# Get the source; Qwen3 requires llama.cpp b5092 or later
+git clone https://github.com/ggml-org/llama.cpp
+cd llama.cpp
+# Format conversion requires the Python dependencies
+pip install -r requirements/requirements-convert_hf_to_gguf.txt
+python convert_hf_to_gguf.py witty-tune-model/loraplus_model --outfile witty-tune-model/loraplus_model.gguf
+```
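+### Quantization
+We recommend IQ4_NL as the quantization method. It is low-cost, needs no calibration set, and loses less than 1% of overall model capability; in the tuning scenario we measured no accuracy loss, and it also improves the model's understanding of JSON.
+With IQ4_NL quantization, inference is 150% faster than with the bf16 model. Compared with Q4_K_M quantization, inference speed is roughly the same (prefill is only 13% slower and decode is 25% faster), while capability in the tuning scenario improves significantly.
+Note that GGUF models quantized with llama.cpp cannot be deployed with vllm or sglang.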
+```bash
+llama-quantize witty-tune-model/loraplus_model.gguf witty-tune-model/loraplus_model_IQ4_NL.gguf IQ4_NL
+```