base_model: Octen/Octen-Embedding-8B
tags:
  - gguf
  - embedding
  - text-embeddings-inference
  - sentence-transformers
  - qwen3
language:
  - en
  - zh
  - multilingual
license: apache-2.0

Octen-Embedding-8B - GGUF Quantizations

GGUF quantizations of Octen/Octen-Embedding-8B, converted using llama.cpp b8110.

Octen-Embedding-8B, a fine-tune of Qwen/Qwen3-Embedding-8B, is ranked #1 on the RTEB Leaderboard.

Quantized by tex8, a platform building AI-native web solutions and cloud services.

Available Quantizations

| File | Quant | Size | Description |
|------|-------|------|-------------|
| `Octen-Embedding-8B-Q4_K_M.gguf` | Q4_K_M | 4.0 GB | Good balance of size and quality |
| `Octen-Embedding-8B-Q6_K.gguf` | Q6_K | 6.5 GB | High quality, moderate size |
| `Octen-Embedding-8B-Q8_0.gguf` | Q8_0 | 8.0 GB | Near-lossless, recommended |

All quantizations were created with `--leave-output-tensor` (the output tensor is left unquantized) and `--token-embedding-type F16` (the token embedding tensor is stored in F16) to preserve embedding quality.
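As a rough illustration of choosing among the files listed above, the sketch below picks the largest quantization that fits a given memory budget. The sizes are copied from the table; the function name `pick_quant` and the 1 GB `headroom_gb` default (a placeholder for context buffers and runtime overhead) are assumptions made for illustration, not measured requirements.

```python
from typing import Optional

# File sizes in GB, copied from the table of available quantizations.
QUANT_SIZES_GB = {
    "Octen-Embedding-8B-Q4_K_M.gguf": 4.0,
    "Octen-Embedding-8B-Q6_K.gguf": 6.5,
    "Octen-Embedding-8B-Q8_0.gguf": 8.0,
}


def pick_quant(memory_budget_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the largest file that fits in the budget, or None if none fits.

    headroom_gb is a hypothetical allowance for runtime overhead; tune it
    for your own hardware and context length.
    """
    usable = memory_budget_gb - headroom_gb
    candidates = [
        (size, name) for name, size in QUANT_SIZES_GB.items() if size <= usable
    ]
    if not candidates:
        return None
    # max() on (size, name) tuples selects the largest file that still fits.
    return max(candidates)[1]


if __name__ == "__main__":
    print(pick_quant(8.0))
    print(pick_quant(16.0))
```

Larger files generally preserve more of the original model's quality, so the helper prefers the biggest one that fits rather than the smallest.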

Usage with llama.cpp

llama-embedding \
  -m Octen-Embedding-8B-Q8_0.gguf \
  --pooling last \
  -p "Your text here"
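To consume the CLI output from a script, one option is to ask `llama-embedding` for JSON and parse it. The sketch below is a minimal example under stated assumptions: it relies on the `--embd-output-format json` option producing an OpenAI-style payload with a `data` list on stdout, and it assumes `llama-embedding` is on your `PATH`. The helper names (`build_embedding_command`, `parse_embedding_output`, `embed`) are hypothetical.

```python
import json
import subprocess
from typing import List


def build_embedding_command(
    model_path: str, text: str, binary: str = "llama-embedding"
) -> List[str]:
    """Build the argv for the command shown above, plus JSON output."""
    return [
        binary,
        "-m", model_path,
        "--pooling", "last",
        "--embd-output-format", "json",  # assumed to print OpenAI-style JSON
        "-p", text,
    ]


def parse_embedding_output(stdout: str) -> List[List[float]]:
    """Extract the embedding vectors from the assumed JSON payload."""
    payload = json.loads(stdout)
    return [item["embedding"] for item in payload["data"]]


def embed(model_path: str, text: str) -> List[List[float]]:
    """Run llama-embedding and return one vector per prompt it processed."""
    result = subprocess.run(
        build_embedding_command(model_path, text),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the binary exits non-zero
    )
    return parse_embedding_output(result.stdout)
```

If your build prints anything other than the JSON document on stdout, `parse_embedding_output` will raise a `json.JSONDecodeError`, which is a useful signal that the assumption about the output format does not hold.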

Usage with llama-cpp-python

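from llama_cpp import Llama

# Load the GGUF file in embedding mode.
llm = Llama(
    model_path="Octen-Embedding-8B-Q8_0.gguf",
    embedding=True,   # return embeddings instead of generating text
    n_gpu_layers=-1,  # offload all layers to the GPU; use 0 for CPU-only
    n_ctx=2048,       # context window size in tokens
)

# The response follows the OpenAI embeddings layout.
result = llm.create_embedding("Your text here")
embedding = result['data'][0]['embedding']  # 4096-dim vector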
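Embeddings are usually compared rather than inspected directly. The following is a minimal sketch of ranking documents against a query by cosine similarity, written in plain Python so it runs without the model. The function names `cosine_similarity` and `rank_documents` are illustrative and not part of llama-cpp-python.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, best match first."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Example wiring with the `llm` object created above (not executed here):
#   texts = ["query text", "first document", "second document"]
#   vecs = [d["embedding"] for d in llm.create_embedding(texts)["data"]]
#   ranking = rank_documents(vecs[0], vecs[1:])
```

Cosine similarity ignores vector length, so it gives the same ranking whether or not the embeddings were normalized beforehand.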

Conversion Command

# Step 1: Convert to F16 (the first argument is a local directory holding the downloaded model)
python convert_hf_to_gguf.py Octen/Octen-Embedding-8B \
  --outfile Octen-Embedding-8B-f16.gguf \
  --outtype f16

# Step 2: Quantize
llama-quantize \
  --leave-output-tensor \
  --token-embedding-type F16 \
  Octen-Embedding-8B-f16.gguf \
  Octen-Embedding-8B-Q8_0.gguf Q8_0
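The quantize step above shows only Q8_0. A small script can repeat it for each quantization listed in this README; the sketch below is one way to do that, assuming `llama-quantize` is on your `PATH` and the F16 file from Step 1 is in the current directory. The helper names `build_quantize_command` and `quantize_all` are hypothetical.

```python
import subprocess
from typing import List, Sequence

F16_GGUF = "Octen-Embedding-8B-f16.gguf"
QUANT_TYPES = ["Q4_K_M", "Q6_K", "Q8_0"]


def build_quantize_command(
    quant_type: str, source: str = F16_GGUF, binary: str = "llama-quantize"
) -> List[str]:
    """Build the argv for one llama-quantize run with the flags from Step 2."""
    output = f"Octen-Embedding-8B-{quant_type}.gguf"
    return [
        binary,
        "--leave-output-tensor",
        "--token-embedding-type", "F16",
        source,
        output,
        quant_type,
    ]


def quantize_all(quant_types: Sequence[str] = QUANT_TYPES) -> None:
    """Run llama-quantize once per type; stops at the first failure."""
    for quant_type in quant_types:
        subprocess.run(build_quantize_command(quant_type), check=True)


if __name__ == "__main__":
    quantize_all()
```

Each run reads the same F16 file, so the conversion in Step 1 only needs to happen once.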