---
base_model: Octen/Octen-Embedding-8B
tags:
  - gguf
  - embedding
  - text-embeddings-inference
  - sentence-transformers
  - qwen3
language:
  - en
  - zh
  - multilingual
license: apache-2.0
---

Octen-Embedding-8B - GGUF Quantizations

GGUF quantizations of Octen/Octen-Embedding-8B, converted using llama.cpp b8110.

Octen-Embedding-8B is a fine-tune of Qwen/Qwen3-Embedding-8B, ranked #1 on the RTEB Leaderboard.

Quantized by tex8, a platform building AI-native web solutions and cloud services.

Available Quantizations

| File | Quant | Size | Description |
|------|-------|------|-------------|
| Octen-Embedding-8B-Q4_K_M.gguf | Q4_K_M | 4.0 GB | Good balance of size and quality |
| Octen-Embedding-8B-Q6_K.gguf | Q6_K | 6.5 GB | High quality, moderate size |
| Octen-Embedding-8B-Q8_0.gguf | Q8_0 | 8.0 GB | Near-lossless, recommended |
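
As a rough illustration of how the sizes in the table might guide a choice, the sketch below picks the largest listed file that fits a given memory budget. The `pick_quant` helper and its `headroom_gb` parameter are hypothetical and not part of any library. The sizes are the file sizes from the table; actual memory use also depends on context length and backend, so treat the result as a starting point only.

```python
# File sizes in GB, copied from the table above.
QUANT_SIZES_GB = {
    "Octen-Embedding-8B-Q4_K_M.gguf": 4.0,
    "Octen-Embedding-8B-Q6_K.gguf": 6.5,
    "Octen-Embedding-8B-Q8_0.gguf": 8.0,
}


def pick_quant(budget_gb, headroom_gb=1.0):
    """Return the largest listed file whose size plus headroom fits the budget.

    headroom_gb is an assumed allowance for runtime buffers, not a measured
    value. Returns None if no listed file fits.
    """
    fitting = [
        (size, name)
        for name, size in QUANT_SIZES_GB.items()
        if size + headroom_gb <= budget_gb
    ]
    return max(fitting)[1] if fitting else None


print(pick_quant(8.0))
```

With an 8 GB budget and the assumed 1 GB headroom, the Q8_0 file no longer fits and the helper falls back to Q6_K.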

All quantizations were created with --leave-output-tensor and --token-embedding-type F16 to preserve embedding quality.

Usage with llama.cpp

llama-embedding \
  -m Octen-Embedding-8B-Q8_0.gguf \
  --pooling last \
  -p "Your text here"

Usage with llama-cpp-python

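import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="Octen-Embedding-8B-Q8_0.gguf",
    embedding=True,   # run the model in embedding mode
    # Request last-token pooling explicitly, matching --pooling last above
    pooling_type=llama_cpp.LLAMA_POOLING_TYPE_LAST,
    n_gpu_layers=-1,  # offload all layers to the GPU if one is available
    n_ctx=2048,
)

result = llm.create_embedding("Your text here")
embedding = result['data'][0]['embedding']  # 4096-dim vector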
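
Embeddings like the one above are usually compared with cosine similarity. The following is a minimal sketch of that comparison using only the standard library. The `cosine_similarity` helper is illustrative rather than part of llama-cpp-python, and toy vectors stand in for real embeddings so the snippet runs without a model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors stand in for real embeddings.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

To compare two texts with the model, pass the `result['data'][0]['embedding']` list obtained for each text to the helper. Because the function divides by both norms, the vectors do not need to be normalized beforehand.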

Conversion Command

# Step 1: Convert to F16 (the first argument is the path to a local copy of the model)
python convert_hf_to_gguf.py Octen/Octen-Embedding-8B \
  --outfile Octen-Embedding-8B-f16.gguf \
  --outtype f16

# Step 2: Quantize (shown for Q8_0; change the last two arguments for other types)
llama-quantize \
  --leave-output-tensor \
  --token-embedding-type F16 \
  Octen-Embedding-8B-f16.gguf \
  Octen-Embedding-8B-Q8_0.gguf Q8_0
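
The quantize step above is shown for Q8_0 only. A small script along the following lines could generate the same command for each quantization in the table. This is a sketch that reuses the flags and file names from the commands above. The `build_quantize_cmd` helper is hypothetical, and the script only prints the commands rather than running them.

```python
import shlex

F16_GGUF = "Octen-Embedding-8B-f16.gguf"
QUANT_TYPES = ["Q4_K_M", "Q6_K", "Q8_0"]


def build_quantize_cmd(quant_type):
    """Build the llama-quantize argument list for one quantization type."""
    return [
        "llama-quantize",
        "--leave-output-tensor",
        "--token-embedding-type", "F16",
        F16_GGUF,
        f"Octen-Embedding-8B-{quant_type}.gguf",
        quant_type,
    ]


for quant_type in QUANT_TYPES:
    # Print only; pass the list to subprocess.run(..., check=True) to execute.
    print(shlex.join(build_quantize_cmd(quant_type)))
```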