jedisct1 committed 8 days ago

Commit 84ae63d (verified)

1 Parent(s): 16de59a

Upload folder using huggingface_hub

Files changed (19)

.gitattributes +1 -35
MiMo-V2.5-coder-Q2-00001-of-00016.gguf +3 -0
MiMo-V2.5-coder-Q2-00002-of-00016.gguf +3 -0
MiMo-V2.5-coder-Q2-00003-of-00016.gguf +3 -0
MiMo-V2.5-coder-Q2-00004-of-00016.gguf +3 -0
MiMo-V2.5-coder-Q2-00005-of-00016.gguf +3 -0
MiMo-V2.5-coder-Q2-00006-of-00016.gguf +3 -0
MiMo-V2.5-coder-Q2-00007-of-00016.gguf +3 -0
MiMo-V2.5-coder-Q2-00008-of-00016.gguf +3 -0
MiMo-V2.5-coder-Q2-00009-of-00016.gguf +3 -0
MiMo-V2.5-coder-Q2-00010-of-00016.gguf +3 -0
MiMo-V2.5-coder-Q2-00011-of-00016.gguf +3 -0
MiMo-V2.5-coder-Q2-00012-of-00016.gguf +3 -0
MiMo-V2.5-coder-Q2-00013-of-00016.gguf +3 -0
MiMo-V2.5-coder-Q2-00014-of-00016.gguf +3 -0
MiMo-V2.5-coder-Q2-00015-of-00016.gguf +3 -0
MiMo-V2.5-coder-Q2-00016-of-00016.gguf +3 -0
README.md +214 -0
run-server.sh +62 -0

.gitattributes CHANGED

@@ -1,35 +1 @@
-*.7z filter=lfs diff=lfs merge=lfs -text
-*.arrow filter=lfs diff=lfs merge=lfs -text
-*.bin filter=lfs diff=lfs merge=lfs -text
-*.bz2 filter=lfs diff=lfs merge=lfs -text
-*.ckpt filter=lfs diff=lfs merge=lfs -text
-*.ftz filter=lfs diff=lfs merge=lfs -text
-*.gz filter=lfs diff=lfs merge=lfs -text
-*.h5 filter=lfs diff=lfs merge=lfs -text
-*.joblib filter=lfs diff=lfs merge=lfs -text
-*.lfs.* filter=lfs diff=lfs merge=lfs -text
-*.mlmodel filter=lfs diff=lfs merge=lfs -text
-*.model filter=lfs diff=lfs merge=lfs -text
-*.msgpack filter=lfs diff=lfs merge=lfs -text
-*.npy filter=lfs diff=lfs merge=lfs -text
-*.npz filter=lfs diff=lfs merge=lfs -text
-*.onnx filter=lfs diff=lfs merge=lfs -text
-*.ot filter=lfs diff=lfs merge=lfs -text
-*.parquet filter=lfs diff=lfs merge=lfs -text
-*.pb filter=lfs diff=lfs merge=lfs -text
-*.pickle filter=lfs diff=lfs merge=lfs -text
-*.pkl filter=lfs diff=lfs merge=lfs -text
-*.pt filter=lfs diff=lfs merge=lfs -text
-*.pth filter=lfs diff=lfs merge=lfs -text
-*.rar filter=lfs diff=lfs merge=lfs -text
-*.safetensors filter=lfs diff=lfs merge=lfs -text
-saved_model/**/* filter=lfs diff=lfs merge=lfs -text
-*.tar.* filter=lfs diff=lfs merge=lfs -text
-*.tar filter=lfs diff=lfs merge=lfs -text
-*.tflite filter=lfs diff=lfs merge=lfs -text
-*.tgz filter=lfs diff=lfs merge=lfs -text
-*.wasm filter=lfs diff=lfs merge=lfs -text
-*.xz filter=lfs diff=lfs merge=lfs -text
-*.zip filter=lfs diff=lfs merge=lfs -text
-*.zst filter=lfs diff=lfs merge=lfs -text
-*tfevents* filter=lfs diff=lfs merge=lfs -text


+*.gguf filter=lfs diff=lfs merge=lfs -text

MiMo-V2.5-coder-Q2-00001-of-00016.gguf ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2896b618eeb9d0e27bca57a3cf5ecd8520e19f4df296464c5db2fcf61b1a5adb
+size 8831584256
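
Each pointer records the SHA-256 digest and byte size of the real shard, so a downloaded file can be checked against it. The following is a minimal sketch, not part of this commit: `sha256_of` and `verify_lfs_oid` are hypothetical helper names, and the fallback assumes that either `sha256sum` or `shasum` is installed.

```sh
# Sketch: check a downloaded file against the sha256 oid from its Git LFS pointer.
# sha256_of and verify_lfs_oid are hypothetical helpers, not part of this repository.

sha256_of() {
  # Prefer sha256sum (common on Linux); fall back to shasum (ships with macOS).
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{print $1}'
  else
    shasum -a 256 "$1" | awk '{print $1}'
  fi
}

verify_lfs_oid() {
  # $1 = file path, $2 = expected hex digest (the pointer's oid without "sha256:")
  actual=$(sha256_of "$1")
  if [ "$actual" = "$2" ]; then
    echo "OK $1"
  else
    echo "MISMATCH $1: expected $2, got $actual" >&2
    return 1
  fi
}

# Example call for the first shard, using the oid from the pointer above:
# verify_lfs_oid MiMo-V2.5-coder-Q2-00001-of-00016.gguf \
#   2896b618eeb9d0e27bca57a3cf5ecd8520e19f4df296464c5db2fcf61b1a5adb
```

Hashing a multi-gigabyte shard takes a while, so it may be worth comparing the byte size first and hashing only when the size matches.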

MiMo-V2.5-coder-Q2-00002-of-00016.gguf ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1d609955e1801579c62d6299cb49725696b8e6555f988aee10e9342c5d4c488f
+size 6996099872

MiMo-V2.5-coder-Q2-00003-of-00016.gguf ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a255ea8fce9fe2cde49203b938027927c96d3d0044ba9beb348f2792fc9c9a17
+size 6996099872

MiMo-V2.5-coder-Q2-00004-of-00016.gguf ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:404c6404f2b99d67a8b1a611003032e36e6047d25c1f9f93c73212afb92e638c
+size 6996099872

MiMo-V2.5-coder-Q2-00005-of-00016.gguf ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b1623ed43c88610124cdb0e731356f0ff2f47d26e5d2a6d3793c7d93f0b2fcbb
+size 6996099872

MiMo-V2.5-coder-Q2-00006-of-00016.gguf ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:abc369764edb625ace3ad95ad522e8d2bfa0a5f3a4b6fc4759812be79e543011
+size 6996099872

MiMo-V2.5-coder-Q2-00007-of-00016.gguf ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7678e83adedf0903ae584173bb2b01e36a24e0faa6082246626a7bfdd315d8fb
+size 6996099872

MiMo-V2.5-coder-Q2-00008-of-00016.gguf ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5d85c683831f2b9942b817409b3967805a78e20d70598d43a642d6b12d0cccf2
+size 6996099872

MiMo-V2.5-coder-Q2-00009-of-00016.gguf ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3d8493b982984c35a887f2a69839d5be3454a5574a96579b662f0cb1f94d5f58
+size 6996099872

MiMo-V2.5-coder-Q2-00010-of-00016.gguf ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:89febfedaaaf5c5a2b576b9575594f6f732efb6e16895d025f16c3c3ac3bb877
+size 6996099872

MiMo-V2.5-coder-Q2-00011-of-00016.gguf ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:59fbcc70d59edad135e92f310c70055bd09224b1369a119a084121c47f7426ee
+size 6996099872

MiMo-V2.5-coder-Q2-00012-of-00016.gguf ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a8411d5242700a9acc72fb8eed3c1f7b9efe3d6cdf90091a7df7d6453f61786b
+size 6996099872

MiMo-V2.5-coder-Q2-00013-of-00016.gguf ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d817b70895822965a6f929d9313376f5a46031d3165e88dccb637ec1c25faef8
+size 6996099872

MiMo-V2.5-coder-Q2-00014-of-00016.gguf ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a328343a4e68c901f2fe31d993df408680b38bcabb4d114a327b92c14027c074
+size 6996099840

MiMo-V2.5-coder-Q2-00015-of-00016.gguf ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:394281ac7b020ef0acd253a535b222aac02b4d488f979e864d63e12fdc2d7bee
+size 6996099872

MiMo-V2.5-coder-Q2-00016-of-00016.gguf ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0de886b54240c049d747cad7264425f213b2e94e954cac2be18457808d7a98d1
+size 6996099872
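
The sizes in the 16 pointers above can be summed to cross-check the README's reported size of about 108,496.76 MiB and to estimate what remains on the 128 GB target machine. This sketch rests on one assumption that the commit does not state: that "128 GB" means 128 GiB (131072 MiB) of unified memory.

```sh
# Sizes in bytes, copied from the 16 LFS pointers above, in shard order.
SIZES="8831584256
6996099872 6996099872 6996099872 6996099872 6996099872 6996099872
6996099872 6996099872 6996099872 6996099872 6996099872 6996099872
6996099840 6996099872 6996099872"

TOTAL_BYTES=0
SHARDS=0
for size in $SIZES; do
  TOTAL_BYTES=$((TOTAL_BYTES + size))
  SHARDS=$((SHARDS + 1))
done

TOTAL_MIB=$((TOTAL_BYTES / 1048576))   # integer division, rounds down
# Assumption: a "128 GB" machine exposes 128 GiB = 131072 MiB of unified memory.
HEADROOM_MIB=$((131072 - TOTAL_MIB))

echo "shards=$SHARDS bytes=$TOTAL_BYTES mib=$TOTAL_MIB headroom_mib=$HEADROOM_MIB"
# prints: shards=16 bytes=113773082304 mib=108502 headroom_mib=22570
```

The on-disk total is roughly 6 MiB above the reported figure. A plausible explanation, though only a guess, is that the reported number counts tensor data while each file also carries GGUF metadata. Under the 128 GiB assumption, about 22 GiB would remain for the KV cache, runtime buffers, and the operating system if every weight were resident.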

README.md ADDED

	@@ -0,0 +1,214 @@

+---
+license: mit
+base_model: XiaomiMiMo/MiMo-V2.5
+language:
+- en
+library_name: llama.cpp
+tags:
+- gguf
+- llama.cpp
+- text-generation
+- code
+- coding
+- tool-calling
+- agent
+- mixture-of-experts
+- long-context
+pipeline_tag: text-generation
+---
+# MiMo-V2.5 Coder Q2 v2 GGUF
+This is a text-only GGUF build of [XiaomiMiMo/MiMo-V2.5](https://huggingface.co/XiaomiMiMo/MiMo-V2.5), tuned for coding and OpenAI-compatible tool calling on high-memory local machines.
+The target system for this build is a 128 GB Apple Silicon machine. The default serving profile uses a 100,000-token context and asks llama.cpp to fit as much of the model as possible onto Metal while leaving enough headroom for KV cache and runtime buffers. Smaller-memory machines will likely need a smaller context, more CPU offload, or a smaller quant.
+This is not a multimodal build. MiMo-V2.5 is an omnimodal checkpoint, but this GGUF contains the text model only. The vision and audio encoders are not included. MiMo's multi-token prediction blocks were also omitted because the current llama.cpp MiMo2 generation path does not use those blocks for normal inference.
+## Why this build exists
+Public low-bit quants of very large MoE models can be surprisingly fragile: tool calls may become malformed, code may fail on small API details, and long answers can drift into repeated reasoning loops. This build was made to spend the limited Q2-class quality budget on the workloads where MiMo-V2.5 is most useful locally:
+- coding in common systems and scripting languages
+- web UI/component generation
+- OpenAI-compatible tool calling
+- Swival-style agent loops over real files and commands
+- long English technical prompts
+Chinese-language quality and multimodal behavior were not optimization targets.
+## How it was built
+The source was the original `XiaomiMiMo/MiMo-V2.5` checkpoint, converted to GGUF with llama.cpp's native MiMo2 support. Conversion was text-only and omitted runtime-inactive MTP/NextN blocks, so memory is not spent on tensors that current llama.cpp MiMo2 inference does not execute.
+The final artifact is a split `Q2_K_S` GGUF with an importance matrix built from English coding, debugging, tool-calling, shell, and agent prompts. The calibration mix was designed to make the quantizer preserve behavior that matters for developer workflows rather than generic chat breadth.
+The build was iterative:
+1. Convert the original checkpoint to split BF16 GGUF.
+2. Produce a first low-bit coding/tool-use candidate.
+3. Test that candidate on executable coding tasks and Swival-style tool calls.
+4. Add calibration coverage for the failures that showed up in real tests.
+5. Rebuild the importance matrix from the expanded coding/tool-use prompt mix.
+6. Re-quantize with the final `Q2_K_S` recipe.
+The calibration text is not required to use the model. It was a build-time tool for telling the quantizer which activations mattered most: code generation, code repair, shell-style work, JSON/tool-call formatting, and agent workflows over real files.
+Quantization details:
+- Quant type: `Q2_K_S`
+- Importance matrix: coding and tool-calling focused
+- Embeddings and output tensors kept at higher precision
+- Attention and dense first-FFN tensors protected at higher precision
+- MoE down-expert tensors kept at `Q3_K`
+- Reported size: about 108,496.76 MiB, 2.95 BPW
+- Split files: 16 GGUF shards
+One tokenizer metadata fix is included: the base-vocabulary `</s>` token is marked as a control-looking token so llama.cpp does not warn at load time. MiMo's real EOS token remains `<|im_end|>`.
+## Why this recipe was chosen
+The recipe is a compromise between quality and a hard practical limit: this model has to run locally on a 128 GB unified-memory machine. Higher-bit GGUFs of a model this large can exceed the useful memory envelope once KV cache, batching, Metal buffers, and the operating system are included.
+The first candidate, a plain `Q2_K`-family quant, was small enough but not reliable enough for tool calling. It malformed some Swival-style arguments and missed several conditional tools. The v2 recipe is larger, but it spends the extra space where it helped most:
+- embeddings and output tensors stay higher precision because they are important for token identity and exact syntax
+- attention tensors are protected because tool-call and code prompts are structure-heavy
+- the dense first FFN is protected because early-layer representation quality matters disproportionately after heavy quantization
+- MoE down-expert tensors use `Q3_K`, which was a better quality/memory tradeoff than pushing all expert down-projections lower
+That is why this is still a Q2-class build, but not the smallest possible Q2 build.
+## Why it is good at coding
+This quant was not chosen just because it fits in memory. It was iterated against executable tasks and then rebuilt with a stronger coding/tool-use importance matrix after early failures were identified.
+The first low-bit pass exposed the kinds of issues that matter in practice: malformed tool-call arguments, brittle JavaScript Markdown parsing, incorrect Zig checked-addition APIs, and small C/C++/Go harness problems. Those failures were used to improve the calibration distribution and to validate that the final model can solve the tasks when the problem statement contains the same constraints a developer would normally give.
+The final v2 artifact passed the local coding and web-design harness across:
+- Swift
+- JavaScript
+- TypeScript through Deno
+- Rust
+- C
+- C++
+- Zig
+- Python
+- Perl
+- Go
+- static HTML/CSS
+That harness writes complete model-generated files into isolated directories and validates them with local compilers, runtimes, or test runners. The current v2 run passed 11/11. The checks are intentionally practical rather than benchmark-like: they catch whether the generated code compiles, runs, and handles edge cases from the prompt.
+It was also tested on framework-style frontend tasks. React, Vue, and Solid components were rendered server-side with Deno/npm tooling, including props, filtering behavior, accessible form markup, and summary text checks. The current v2 run passed 3/3.
+The important point is not that these small harnesses prove universal coding ability. They prove that the quantization process did not destroy the details that low-bit models often lose first: exact exported names, balanced parsing logic, checked arithmetic APIs, command/tool argument shapes, and framework-specific rendering conventions.
+## Tool-calling validation
+Tool calling was tested with Swival because Swival exercises OpenAI-style tools in realistic agent loops instead of only checking toy single-call examples.
+Validation included:
+- a broad synthetic selector suite covering the current Swival tool surface
+- real one-shot Swival tasks over files, grep, command execution, fetches, image input, skills, metaskills, snapshots, todos, and subagents
+- a real `/goal` run that required the model to complete work and call `complete_goal`
+The current v2 results were:
+- Swival all-tools selector: 22/22
+- real Swival one-shot suite: 10/10 with zero failed tool calls
+- real Swival goal-mode `complete_goal`: passed with exactly one successful `complete_goal` call
+A separate repetition-loop guard was also run on long coding and web prompts. The current v2 artifact passed 4/4, with no repeated-tail failures.
+These are local validation results, not public benchmark scores. They are included so users know what this quant was optimized for and what kinds of regressions were actively checked.
+Compared with the earlier local candidate, the v2 build fixed the key practical failures: the broad Swival selector went from 18/22 to 22/22, the coding/web suite reached 11/11 after task prompts were aligned with the validators, and the real Swival task suite completed with zero failed tool calls. This is why the package is labeled `v2`.
+## Serving with llama.cpp
+Recent llama.cpp builds should be able to load the repo directly:
+```sh
+llama-server \
+  -hf jedisct1/MiMo-V2.5-coder-Q2-v2 \
+  --host 127.0.0.1 \
+  --port 8080 \
+  --ctx-size 100000 \
+  --parallel 1 \
+  --batch-size 512 \
+  --ubatch-size 128 \
+  --threads 12 \
+  --threads-batch 18 \
+  --prio 0 \
+  --poll 80 \
+  --flash-attn on \
+  --jinja \
+  --fit on \
+  --fit-target 4096 \
+  --fit-ctx 100000 \
+  --gpu-layers auto \
+  --cache-type-k f16 \
+  --cache-type-v f16 \
+  --reasoning off
+```
+If your llama.cpp build does not auto-select the split GGUF set, pass the first shard explicitly:
+```sh
+llama-server \
+  -hf jedisct1/MiMo-V2.5-coder-Q2-v2 \
+  --hf-file MiMo-V2.5-coder-Q2-00001-of-00016.gguf \
+  --ctx-size 100000 \
+  --flash-attn on \
+  --jinja \
+  --reasoning off
+```
+If you cloned or downloaded the repository locally, you can use the helper script:
+```sh
+./run-server.sh
+```
+The helper script loads the first GGUF shard next to it and uses the same default serving profile.
+Default settings:
+```sh
+MIMO_CTX=100000
+MIMO_FIT_CTX=100000
+MIMO_FIT_TARGET=4096
+MIMO_BATCH=512
+MIMO_UBATCH=128
+MIMO_REASONING=off
+MIMO_CPU_MOE=0
+```
+For more memory headroom, use CPU-MoE mode:
+```sh
+MIMO_CPU_MOE=1 MIMO_FIT_TARGET=32768 MIMO_BATCH=128 MIMO_UBATCH=64 ./run-server.sh
+```
+That mode is slower, especially during long prompt prefill, but it leaves more Metal memory available.
+You can point the script at a specific server binary:
+```sh
+LLAMA_SERVER=/path/to/llama-server ./run-server.sh
+```
+## Tool-calling tips
+- Disable reasoning output with `--reasoning off` or `MIMO_REASONING=off`.
+- Send tool schemas from the client rather than enabling llama.cpp built-in tools.
+- Set `parallel_tool_calls` to `false` if your client supports it.
+- Avoid forcing `tool_choice: required`; in testing, that made malformed calls more likely.
+- Use a client that supports OpenAI-compatible tool calls cleanly. Swival was the main validation client for this build.
+## License
+The upstream `XiaomiMiMo/MiMo-V2.5` model card declares the MIT license. This derived GGUF is provided with the same license metadata.
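
The tool-calling tips in the README above can be made concrete with a request against the local server. This is a minimal sketch assuming `llama-server` is listening on the README's default host and port. The `read_file` tool, its schema, the prompt, and the `SEND_REQUEST` switch are hypothetical and exist only for illustration.

```sh
# Sketch of a chat-completions request that follows the README's tool-calling tips.
# The read_file tool and the prompt are hypothetical, for illustration only.
BASE_URL=${BASE_URL:-http://127.0.0.1:8080}

PAYLOAD=$(cat <<'JSON'
{
  "model": "jedisct1/MiMo-V2.5-coder-Q2-v2",
  "messages": [
    {"role": "user", "content": "Show me the first lines of README.md."}
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "read_file",
        "description": "Read a text file and return its contents.",
        "parameters": {
          "type": "object",
          "properties": {
            "path": {"type": "string", "description": "File path to read."}
          },
          "required": ["path"]
        }
      }
    }
  ],
  "parallel_tool_calls": false
}
JSON
)

# The payload deliberately omits any forced tool choice, as the tips advise.
# Nothing is sent unless SEND_REQUEST=1, so the block is safe to run offline.
if [ "${SEND_REQUEST:-0}" = "1" ]; then
  curl -sS "$BASE_URL/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data "$PAYLOAD"
fi
```

The tool schema travels with the request, which matches the tip about sending schemas from the client rather than enabling built-in tools. Run with `SEND_REQUEST=1` once the server is up. A tool call, if the model makes one, would be expected under `choices[0].message.tool_calls` in the response.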

run-server.sh ADDED

	@@ -0,0 +1,62 @@

+#!/usr/bin/env bash
+set -euo pipefail
+SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)
+LLAMA_SERVER=${LLAMA_SERVER:-llama-server}
+if ! command -v "$LLAMA_SERVER" >/dev/null 2>&1; then
+  echo "llama-server was not found. Install llama.cpp or set LLAMA_SERVER=/path/to/llama-server." >&2
+  exit 1
+fi
+if [[ -n "${MIMO_MODEL:-}" ]]; then
+  MODEL=$MIMO_MODEL
+else
+  shopt -s nullglob
+  CANDIDATES=("$SCRIPT_DIR"/MiMo-V2.5-coder-Q2-00001-of-*.gguf)
+  shopt -u nullglob
+  if [[ ${#CANDIDATES[@]} -eq 0 ]]; then
+    echo "No first GGUF shard found next to run-server.sh." >&2
+    exit 1
+  fi
+  MODEL=${CANDIDATES[0]}
+fi
+ARGS=(
+  --model "$MODEL"
+  --host "${MIMO_HOST:-127.0.0.1}"
+  --port "${MIMO_PORT:-8080}"
+  --ctx-size "${MIMO_CTX:-100000}"
+  --parallel "${MIMO_PARALLEL:-1}"
+  --batch-size "${MIMO_BATCH:-512}"
+  --ubatch-size "${MIMO_UBATCH:-128}"
+  --threads "${MIMO_THREADS:-12}"
+  --threads-batch "${MIMO_THREADS_BATCH:-18}"
+  --prio "${MIMO_PRIO:-0}"
+  --poll "${MIMO_POLL:-80}"
+  --flash-attn on
+  --jinja
+  --fit "${MIMO_FIT:-on}"
+  --fit-target "${MIMO_FIT_TARGET:-4096}"
+  --fit-ctx "${MIMO_FIT_CTX:-100000}"
+  --gpu-layers "${MIMO_GPU_LAYERS:-auto}"
+  --cache-type-k "${MIMO_CACHE_K:-f16}"
+  --cache-type-v "${MIMO_CACHE_V:-f16}"
+  --reasoning "${MIMO_REASONING:-off}"
+)
+if [[ "${MIMO_CPU_MOE:-0}" == "1" ]]; then
+  ARGS+=(--cpu-moe)
+fi
+if [[ -n "${MIMO_DEVICE:-}" ]]; then
+  ARGS+=(--device "$MIMO_DEVICE")
+fi
+if [[ -n "${MIMO_TOOLS:-}" ]]; then
+  ARGS+=(--tools "$MIMO_TOOLS")
+fi
+exec "$LLAMA_SERVER" "${ARGS[@]}" "$@"
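
The script above resolves every setting with the `${VAR:-default}` pattern and appends value-less flags only when their switch is set. The dry-run sketch below shows how the overrides resolve without starting a server. `build_args` is a hypothetical helper that covers only a few of the script's settings, and it uses positional parameters instead of the script's bash array so that it also runs under plain `sh`.

```sh
# Dry-run sketch of the override pattern used by run-server.sh.
# build_args is a hypothetical helper covering only a subset of the settings.
build_args() {
  # Each value falls back to the script's default when the variable is unset or empty.
  set -- \
    --ctx-size "${MIMO_CTX:-100000}" \
    --batch-size "${MIMO_BATCH:-512}" \
    --ubatch-size "${MIMO_UBATCH:-128}"
  # Flags without a value are appended only when their switch is set to 1.
  if [ "${MIMO_CPU_MOE:-0}" = "1" ]; then
    set -- "$@" --cpu-moe
  fi
  printf '%s\n' "$*"
}

DEFAULT_ARGS=$(build_args)
CPU_MOE_ARGS=$(MIMO_CPU_MOE=1 MIMO_BATCH=128 MIMO_UBATCH=64 build_args)

echo "default: $DEFAULT_ARGS"
echo "cpu-moe: $CPU_MOE_ARGS"
```

With none of the `MIMO_*` variables set in the calling environment, the first line shows the default profile and the second shows the CPU-MoE overrides from the README with `--cpu-moe` appended.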