Instructions to use MECHUK/embeddinggemma-rus-32768 with libraries and local apps.

Libraries

How to use MECHUK/embeddinggemma-rus-32768 with llama-cpp-python:

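# !pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

# This is an embedding model, so load it with embedding=True
# instead of using it for text generation.
llm = Llama.from_pretrained(
	repo_id="MECHUK/embeddinggemma-rus-32768",
	filename="embeddinggemma-rus-32768-F32.gguf",
	embedding=True,
)

# Returns an OpenAI-style dict; the vector is in output["data"][0]["embedding"].
output = llm.create_embedding(
	"task: search result | query: тестовый русский запрос"
)
print(len(output["data"][0]["embedding"]))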

Local Apps

llama.cpp

How to use MECHUK/embeddinggemma-rus-32768 with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf MECHUK/embeddinggemma-rus-32768:F32
# Run inference directly in the terminal:
llama cli -hf MECHUK/embeddinggemma-rus-32768:F32

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf MECHUK/embeddinggemma-rus-32768:F32
# Run inference directly in the terminal:
llama cli -hf MECHUK/embeddinggemma-rus-32768:F32

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf MECHUK/embeddinggemma-rus-32768:F32
# Run inference directly in the terminal:
./llama-cli -hf MECHUK/embeddinggemma-rus-32768:F32

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf MECHUK/embeddinggemma-rus-32768:F32
# Run inference directly in the terminal:
./build/bin/llama-cli -hf MECHUK/embeddinggemma-rus-32768:F32

Use Docker

docker model run hf.co/MECHUK/embeddinggemma-rus-32768:F32

Ollama
How to use MECHUK/embeddinggemma-rus-32768 with Ollama:
```
ollama run hf.co/MECHUK/embeddinggemma-rus-32768:F32
```

Unsloth Studio

How to use MECHUK/embeddinggemma-rus-32768 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for MECHUK/embeddinggemma-rus-32768 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for MECHUK/embeddinggemma-rus-32768 to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for MECHUK/embeddinggemma-rus-32768 to start chatting

Docker Model Runner
How to use MECHUK/embeddinggemma-rus-32768 with Docker Model Runner:
```
docker model run hf.co/MECHUK/embeddinggemma-rus-32768:F32
```

Lemonade

How to use MECHUK/embeddinggemma-rus-32768 with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull MECHUK/embeddinggemma-rus-32768:F32

Run and chat with the model

lemonade run user.embeddinggemma-rus-32768-F32

List all available models

lemonade list

embeddinggemma-rus-32768 (GGUF)

GGUF quantizations of alphaedge-ai/embeddinggemma-rus-32768, a version of google/embeddinggemma-300m optimized for Russian. Its vocabulary was reduced with the trimming method, which makes the model 57.27% smaller by parameter count.

Model Statistics

| Metric | Original | Trimmed | Reduction |
| --- | --- | --- | --- |
| Vocabulary size | 262,144 tokens | 32,768 tokens | 87.50% |
| Model size | 307,581,696 params | 131,420,928 params | 57.27% |
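
The reduction percentages can be recomputed from the table. The short check below does that, and also divides the removed parameters by the removed tokens. The result is an exact integer, which is consistent with only embedding rows having been dropped; that reading is an inference from the numbers, not something the card states.

```python
# Recompute the reductions reported in the Model Statistics table.
orig_vocab, trimmed_vocab = 262_144, 32_768
orig_params, trimmed_params = 307_581_696, 131_420_928

vocab_reduction = 100 * (orig_vocab - trimmed_vocab) / orig_vocab
param_reduction = 100 * (orig_params - trimmed_params) / orig_params

# Parameters removed per token removed. An exact integer suggests that only
# embedding rows were dropped (an inference, not stated in the model card).
removed_tokens = orig_vocab - trimmed_vocab
removed_params = orig_params - trimmed_params
params_per_token, remainder = divmod(removed_params, removed_tokens)

print(f"vocabulary reduction: {vocab_reduction:.2f}%")
print(f"parameter reduction:  {param_reduction:.2f}%")
print(f"parameters per removed token: {params_per_token} (remainder {remainder})")
```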

GGUF Quantizations

| File | Type | Size |
| --- | --- | --- |
| `embeddinggemma-rus-32768-Q8_0.gguf` | Q8_0 (8-bit) | 136 MB |
| `embeddinggemma-rus-32768-F32.gguf` | F32 (lossless reference) | 503 MB |

Integrity checksums are in SHA256SUMS. Q8_0 is the recommended default; F32 is provided as a lossless reference equivalent to the source safetensors.
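
A downloaded file can be checked against SHA256SUMS with a few lines of Python. This is a minimal sketch that assumes the file uses the usual `sha256sum` layout (`<hex digest>  <filename>` per line) and sits next to the GGUF files; the function names are illustrative only.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large GGUF files are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sums(sums_file: Path) -> dict[str, bool]:
    """Check each '<hex digest>  <filename>' line; missing files count as failures."""
    results = {}
    for line in sums_file.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*").strip()  # sha256sum marks binary mode with '*'
        target = sums_file.parent / name
        results[name] = target.is_file() and sha256_of(target) == expected.lower()
    return results


# Assumes SHA256SUMS is in the current directory; does nothing if it is absent.
if __name__ == "__main__" and Path("SHA256SUMS").is_file():
    for name, ok in verify_sums(Path("SHA256SUMS")).items():
        print(f"{name}: {'OK' if ok else 'FAILED'}")
```

On systems with GNU coreutils, `sha256sum -c SHA256SUMS` performs the same check.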

Conversion

Converted with llama.cpp (commit c1a1c8ee).

Two non-obvious steps were required for a correct conversion of this trimmed model:

1. Tokenizer registry patch. This trimmed model ships only tokenizer.json (Gemma SPM-style BPE) and no tokenizer.model (SentencePiece). The Gemma3 HF→GGUF converter takes the SentencePiece path only when tokenizer.model exists, so the model's tokenizer chkhsh (b847c511…) was registered as the gemma4 pre-type in get_vocab_base_pre() (conversion/base.py). This maps the SPM-style BPE correctly: the normalizer replaces spaces with ▁, BPE runs over the whole text, and ByteFallback operates on raw UTF-8.
2. --sentence-transformers-dense-modules. EmbeddingGemma has 2_Dense/3_Dense projection layers; without this flag they are silently dropped and the embeddings drift from the SentenceTransformers baseline.

A functional smoke test (llama-embedding, L2-normalized, OpenAI-style JSON output) is included as smoke-embedding.json.
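
One way to sanity-check such an output file is to confirm that every vector has unit L2 norm. The sketch below assumes the OpenAI-style layout, with a top-level `data` list whose items carry an `embedding` array; the helper names and the tolerance are illustrative choices, not part of the repository.

```python
import json
import math
import os


def embedding_norms(payload: dict) -> list[float]:
    """Return the L2 norm of every vector in an OpenAI-style embeddings payload."""
    return [
        math.sqrt(sum(x * x for x in item["embedding"]))
        for item in payload["data"]
    ]


def is_unit_normalized(payload: dict, tol: float = 1e-3) -> bool:
    """True when at least one embedding exists and all norms are within tol of 1.0."""
    norms = embedding_norms(payload)
    return bool(norms) and all(abs(n - 1.0) <= tol for n in norms)


# Assumes smoke-embedding.json is in the current directory; skipped otherwise.
if __name__ == "__main__" and os.path.isfile("smoke-embedding.json"):
    with open("smoke-embedding.json", encoding="utf-8") as fh:
        print("unit-normalized:", is_unit_normalized(json.load(fh)))
```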

Usage

With llama-server (OpenAI-compatible embeddings endpoint):

llama-server \
  -m embeddinggemma-rus-32768-Q8_0.gguf \
  --embeddings --host 0.0.0.0 --port 8080

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "task: search result | query: тестовый русский запрос", "model": "embeddinggemma-rus-32768"}'
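
The same request can be sent from Python using only the standard library. This is a sketch under the assumption that the server from the command above is listening on localhost:8080 and returns the OpenAI-style `data` list; the helper names and the second example query are made up for illustration.

```python
import json
import math
import urllib.request

# Assumed to match the llama-server command above (host, port, and path).
ENDPOINT = "http://localhost:8080/v1/embeddings"


def build_request(texts: list[str], model: str = "embeddinggemma-rus-32768") -> urllib.request.Request:
    """Build a POST request with the same JSON shape as the curl example."""
    body = json.dumps({"input": texts, "model": model}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def embed(texts: list[str]) -> list[list[float]]:
    """Send texts to the server and return one vector per input, in input order."""
    with urllib.request.urlopen(build_request(texts), timeout=60) as resp:
        payload = json.load(resp)
    items = sorted(payload["data"], key=lambda item: item.get("index", 0))
    return [item["embedding"] for item in items]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; works whether or not the vectors are normalized."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    prefix = "task: search result | query: "
    try:
        first, second = embed([prefix + "тестовый русский запрос", prefix + "пример запроса на русском"])
        print(f"dimensions: {len(first)}, cosine similarity: {cosine(first, second):.4f}")
    except (OSError, ValueError, KeyError) as exc:
        print(f"request failed: {exc}")
```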

Or directly with llama-embedding:

llama-embedding \
  -m embeddinggemma-rus-32768-Q8_0.gguf \
  --embd-output-format json --embd-normalize 2 \
  -p "task: search result | query: тестовый русский запрос"

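EmbeddingGemma is instruction-tuned; prefix queries with task: <task> | query: <text> (e.g. task: search result, task: question answering, task: classification) and documents with title: <title or "none"> | text: <text>, following the prompt formats in the upstream google/embeddinggemma-300m model card.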
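
Building the prefixes by hand is error-prone, so a small helper can keep them consistent. In this sketch the query template mirrors the examples in this card, while the document template is taken from the upstream google/embeddinggemma-300m card and should be treated as an assumption here; the function names are hypothetical.

```python
from typing import Optional

# Query template as used in this card's examples.
QUERY_TEMPLATE = "task: {task} | query: {text}"
# Document template from the upstream EmbeddingGemma card (assumption here).
DOCUMENT_TEMPLATE = "title: {title} | text: {text}"


def format_query(text: str, task: str = "search result") -> str:
    """Prefix a query with its task description."""
    return QUERY_TEMPLATE.format(task=task, text=text)


def format_document(text: str, title: Optional[str] = None) -> str:
    """Prefix a document; a missing title is written as the literal 'none'."""
    return DOCUMENT_TEMPLATE.format(title=title or "none", text=text)


print(format_query("тестовый русский запрос"))
print(format_document("Пример текста документа."))
```

The formatted strings can then be passed as the `input` of the embeddings endpoint or as the `-p` argument of llama-embedding.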

Mining Dataset Statistics

Number of texts used for mining: 200,000 texts
Dataset: lbourdois/fineweb-2-trimming

Citations

EmbeddingGemma

@misc{vera2025embeddinggemmapowerfullightweighttext,
      title={EmbeddingGemma: Powerful and Lightweight Text Representations}, 
      author={Henrique Schechter Vera and Sahil Dua and Biao Zhang and Daniel Salz and Ryan Mullins and Sindhu Raghuram Panyam and Sara Smoot and Iftekhar Naim and Joe Zou and Feiyang Chen and Daniel Cer and Alice Lisak and Min Choi and Lucas Gonzalez and Omar Sanseviero and Glenn Cameron and Ian Ballantyne and Kat Black and Kaifeng Chen and Weiyi Wang and Zhe Li and Gus Martins and Jinhyuk Lee and Mark Sherwood and Juyeong Ji and Renjie Wu and Jingxiao Zheng and Jyotinder Singh and Abheesht Sharma and Divyashree Sreepathihalli and Aashi Jain and Adham Elarabawy and AJ Co and Andreas Doumanoglou and Babak Samari and Ben Hora and Brian Potetz and Dahun Kim and Enrique Alfonseca and Fedor Moiseev and Feng Han and Frank Palma Gomez and Gustavo Hernández Ábrego and Hesen Zhang and Hui Hui and Jay Han and Karan Gill and Ke Chen and Koert Chen and Madhuri Shanbhogue and Michael Boratko and Paul Suganthan and Sai Meher Karthik Duddu and Sandeep Mariserla and Setareh Ariafar and Shanfeng Zhang and Shijie Zhang and Simon Baumgartner and Sonam Goenka and Steve Qiu and Tanmaya Dabral and Trevor Walker and Vikram Rao and Waleed Khawaja and Wenlei Zhou and Xiaoqi Ren and Ye Xia and Yichang Chen and Yi-Ting Chen and Zhe Dong and Zhongli Ding and Francesco Visin and Gaël Liu and Jiageng Zhang and Kathleen Kenealy and Michelle Casbon and Ravin Kumar and Thomas Mesnard and Zach Gleicher and Cormac Brick and Olivier Lacombe and Adam Roberts and Qin Yin and Yunhsuan Sung and Raphael Hoffmann and Tris Warkentin and Armand Joulin and Tom Duerig and Mojtaba Seyedhosseini},
      year={2025},
      eprint={2509.20354},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.20354}, 
}

Trimming blog post

@misc{hf_blogpost_trimming,
      title={Introduction to Trimming}, 
      author={Loïck BOURDOIS and Tom AARSEN and Bram VANROY and Christopher AKIKI and Woojun JUNG and Manuel ROMERO and Prithiv SAKTHI},
      year={2026},
      url={https://huggingface.co/blog/lbourdois/introduction-to-trimming}, 
}

License

This model is derived from google/embeddinggemma-300m. Use of this model is governed by the Gemma Terms of Use. By using this model, you agree to the Gemma Terms of Use. This model is not affiliated with or endorsed by Google.
