Snowflake/snowflake-arctic-embed-m-v2.0 GGUF for embeddings.cpp
This repository contains an optimized GGUF artifact for running
Snowflake/snowflake-arctic-embed-m-v2.0 with
embeddings.cpp.
The GGUF is intended for embedding inference. It is not a llama.cpp text-generation model.
File
| File | Quantization | Size | SHA256 |
|---|---|---|---|
| snowflake-arctic-embed-m-v2.0.q4_k_mlp_q8_attn.gguf | mixed q4_K MLP + q8_0 attention | 186.26 MB | 4fa3b1f7f11d929137cafdd12aac01e6f8d6ee9f6f41853521e43feb7a7f4414 |
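
A downloaded copy can be checked against the SHA256 listed above. The following is a minimal sketch, assuming GNU coreutils `sha256sum` is installed (on macOS, `shasum -a 256` is the usual substitute) and that the file sits in the current directory; `verify_sha256` is a helper name made up for this example.

```shell
# verify_sha256 FILE EXPECTED_HEX
# Succeeds (exit 0) only when FILE's SHA256 digest equals EXPECTED_HEX.
# Hypothetical helper for illustration; not part of embeddings.cpp.
verify_sha256() {
  file="$1"
  expected="$2"
  # sha256sum prints "<digest>  <filename>"; keep only the digest.
  actual="$(sha256sum "$file" | awk '{print $1}')"
  [ "$actual" = "$expected" ]
}

GGUF=snowflake-arctic-embed-m-v2.0.q4_k_mlp_q8_attn.gguf
EXPECTED=4fa3b1f7f11d929137cafdd12aac01e6f8d6ee9f6f41853521e43feb7a7f4414

# Only run the check when the file is actually present.
if [ -f "$GGUF" ]; then
  if verify_sha256 "$GGUF" "$EXPECTED"; then
    echo "checksum OK"
  else
    echo "checksum MISMATCH" >&2
  fi
fi
```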
The mixed quantization policy is:
- mlp.up_gate_proj.weight: q4_K
- mlp.down_proj.weight: q4_K
- attention.qkv_proj.weight: q8_0
- attention.o_proj.weight: q8_0
Recommended embeddings.cpp Build
cmake -S . -B build \
-DCMAKE_BUILD_TYPE=Release \
-DEMBEDDINGS_CPP_ENABLE_PYBIND=ON \
-DGGML_CPU_REPACK=ON \
-DGGML_BLAS=OFF \
-DGGML_OPENMP=OFF \
-DGGML_NATIVE=OFF \
-DGGML_CUDA=OFF \
-DGGML_VULKAN=OFF \
-DGGML_METAL=OFF
cmake --build build -j "$(nproc)"
Recommended CPU Runtime
EMBEDDINGS_CPP_CPU_REPACK=1 \
EMBEDDINGS_CPP_FLASH_ATTN=1 \
python your_script.py
By default, embeddings.cpp uses the detected CPU concurrency for model
inference. Set EMBEDDINGS_CPP_THREADS=N only when pinning a deployment to a
measured value for a specific CPU quota or host.
Do not enable the experimental GGML_REPACK_Q8_AVX2=1 path for this artifact;
it was slower on the tuning host.
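
The runtime settings above can be collected in a small launcher so that thread pinning stays opt-in. This is only a sketch: `run_embed` and `PINNED_THREADS` are names invented for this example, while the `EMBEDDINGS_CPP_*` variables are the ones named in this section.

```shell
# run_embed COMMAND [ARGS...]
# Runs COMMAND with the recommended CPU runtime variables set.
# The body is a subshell "( ... )" so nothing leaks into the calling shell.
# Usage: run_embed python your_script.py
run_embed() (
  # Pin the thread count only when a measured value was supplied through
  # PINNED_THREADS (hypothetical variable); otherwise leave
  # EMBEDDINGS_CPP_THREADS alone so the detected CPU concurrency is used.
  if [ -n "${PINNED_THREADS:-}" ]; then
    export EMBEDDINGS_CPP_THREADS="$PINNED_THREADS"
  fi
  EMBEDDINGS_CPP_CPU_REPACK=1 \
  EMBEDDINGS_CPP_FLASH_ATTN=1 \
  "$@"
)
```

Keeping the pinned value in a separate variable makes the default path (no pinning) the one taken when nothing is configured.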
Reproducing The GGUF
From an embeddings.cpp checkout:
uv pip install -r scripts/requirements.txt
uv run scripts/convert.py \
Snowflake/snowflake-arctic-embed-m-v2.0 \
models/snowflake-arctic-embed-m-v2.0.fp16.gguf \
f16
EMBEDDINGS_CPP_SKIP_QUANT_PATTERNS='attention.qkv_proj.weight,attention.o_proj.weight' \
./build/quantize \
models/snowflake-arctic-embed-m-v2.0.fp16.gguf \
models/snowflake-arctic-embed-m-v2.0.q4_k_mlp_attnf16.gguf \
q4_k
EMBEDDINGS_CPP_SKIP_QUANT_PATTERNS='mlp.up_gate_proj.weight,mlp.down_proj.weight' \
./build/quantize \
models/snowflake-arctic-embed-m-v2.0.q4_k_mlp_attnf16.gguf \
models/snowflake-arctic-embed-m-v2.0.q4_k_mlp_q8_attn.gguf \
q8_0
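
The two quantize passes above combine into the mixed policy because each pass skips the tensors the other pass handles. The sketch below models that logic for the four listed weights only. It assumes each comma-separated entry in EMBEDDINGS_CPP_SKIP_QUANT_PATTERNS is matched as a substring of the tensor name and that these weights are f16 after conversion; neither assumption is confirmed here. `is_skipped`, `final_type`, and the `encoder.layer.0.` name prefix are hypothetical.

```shell
# is_skipped NAME PATTERNS
# Succeeds when NAME contains any comma-separated entry of PATTERNS.
# ASSUMPTION: substring matching; the real quantize tool may differ.
is_skipped() {
  name="$1"
  patterns="$2"
  old_ifs="$IFS"
  IFS=','
  for p in $patterns; do
    case "$name" in
      *"$p"*) IFS="$old_ifs"; return 0 ;;
    esac
  done
  IFS="$old_ifs"
  return 1
}

# final_type NAME
# Prints the type NAME ends up with after both passes.
# Only meaningful for the four weights named in the policy list.
final_type() {
  tensor="$1"
  type="f16"  # assumed starting type after the f16 conversion step
  # Pass 1 (q4_k): attention weights are skipped, so MLP weights become q4_K.
  is_skipped "$tensor" 'attention.qkv_proj.weight,attention.o_proj.weight' || type="q4_K"
  # Pass 2 (q8_0): MLP weights are skipped, so attention weights become q8_0.
  is_skipped "$tensor" 'mlp.up_gate_proj.weight,mlp.down_proj.weight' || type="q8_0"
  echo "$type"
}

# Hypothetical tensor names, for illustration only.
final_type encoder.layer.0.mlp.down_proj.weight
final_type encoder.layer.0.attention.o_proj.weight
```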
Notes
This model is derived from Snowflake/snowflake-arctic-embed-m-v2.0. Use the
upstream model card and license terms when deciding whether this artifact is
appropriate for your use case.