Octen-Embedding-8B-INT8

Octen-Embedding-8B-INT8 is a text embedding model developed by Octen for semantic search and retrieval tasks. This model is fine-tuned from Qwen/Qwen3-Embedding-8B and supports multiple languages, providing high-quality embeddings for various applications.

Quantization: This is an INT8-quantized version produced with bitsandbytes. INT8 quantization roughly halves the memory footprint (about 8GB versus about 16GB for BF16), making the model suitable for deployment in resource-constrained environments. Lower memory use does not imply faster inference: on some hardware this version may be slightly slower than the BF16 version.

Key Highlights

🥇 RTEB Leaderboard Champion (as of January 12, 2026)

Octen-Embedding-8B ranks #1 on the RTEB Leaderboard with a Mean (Task) score of 0.8045
Strong performance on both Public (0.7953) and Private (0.8157) datasets
The higher Private score suggests generalization rather than overfitting to public benchmarks

Industry-Oriented Vertical Domain Expertise

Legal: Legal document retrieval
Finance: Financial reports, Q&A, and personal finance content
Healthcare: Medical Q&A, clinical dialogues, and health consultations
Code: Programming problems, code search, and SQL queries

Ultra-Long Context Support

Supports up to 32,768 tokens context length
Suitable for processing long documents in legal, healthcare, and other domains
4096-dimensional embedding space for rich semantic representation

Multilingual Capability

Supports 100+ languages
Includes various programming languages
Strong multilingual, cross-lingual, and code retrieval capabilities

Open Source Model List

Model Type	Model	Size	Max Tokens	Embedding Dimensions	HuggingFace Link
Text Embedding	Octen-Embedding-0.6B	0.6B	32,768	1024	✅ Available
Text Embedding	Octen-Embedding-4B	4.0B	32,768	2560	✅ Available
Text Embedding	Octen-Embedding-8B	7.6B	32,768	4096	✅ Available

Model Family Design:

Octen-Embedding-8B: Best performance, RTEB #1, for high-precision retrieval
Octen-Embedding-4B: Best in 4B category, balanced performance and efficiency
Octen-Embedding-0.6B: Lightweight deployment, suitable for edge devices and resource-constrained environments

For API access, deployment solutions, and technical documentation, visit octen.ai.

Experimental Results

RTEB Leaderboard (Overall Performance)

Model	Embedding Dim	Max Tokens	Mean (Public)	Mean (Private)	Mean (Task)
Octen-Embedding-8B	4096	32768	0.7953	0.8157	0.8045
voyage-3-large	1024	32000	0.7434	0.8277	0.7812
gemini-embedding-001	3072	2048	0.7218	0.8075	0.7602
Octen-Embedding-4B	2560	32768	0.7747	0.7942	0.7834
MoD-Embedding	2560	32768	0.7642	0.7900	0.7758
Qwen3-Embedding-8B	4096	32768	0.7310	0.7838	0.7547
Octen-Embedding-0.6B	1024	32768	0.7241	-	-
voyage-3.5	1024	32000	0.7139	0.8102	0.7571
Cohere-embed-v4.0	1536	128000	0.6534	0.7943	0.7166
jina-embeddings-v4	2048	32768	0.6652	0.7664	0.7105
GritLM-7B	4096	32768	0.6187	0.7385	0.6724
text-embedding-3-large	3072	8191	0.6110	0.7130	0.6567
e5-mistral-7b-instruct	4096	32768	0.5090	0.7091	0.5987
NV-Embed-v2	4096	32768	0.5805	0.6691	0.6203
snowflake-arctic-embed-l-v2.0	1024	8192	0.5395	0.7079	0.6150
multilingual-e5-large-instruct	1024	514	0.5478	0.6859	0.6097
gte-multilingual-base	768	8192	0.5291	0.6697	0.5921
text-embedding-3-small	1536	8191	0.5260	0.6630	0.5874
bge-m3	1024	8194	0.5216	0.6726	0.5893
Qwen3-Embedding-4B	2560	32768	-	0.7711	-
Qwen3-Embedding-0.6B	1024	32768	-	0.7117	-

Model Details

Base Model: Qwen/Qwen3-Embedding-8B
Model Size: 8B parameters (INT8 quantized)
Max Sequence Length: 40,960 tokens
Embedding Dimension: 4096
Languages: Multilingual (100+ languages), including English and Chinese
Training Method: LoRA fine-tuning
Quantization: INT8 (bitsandbytes)
Memory Footprint: ~8GB (vs ~16GB for BF16 version)
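
The ~8GB and ~16GB figures follow roughly from bytes per parameter. The snippet below is a back-of-the-envelope sketch, assuming the 7.6B parameter count listed in the model table and counting weights only; activations, quantization metadata, and any layers kept in higher precision are ignored, so real usage will differ somewhat. The helper name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return num_params * bytes_per_param / 1e9


# Assumption: parameter count taken from the model list table (7.6B).
NUM_PARAMS = 7.6e9

bf16_gb = weight_memory_gb(NUM_PARAMS, 2)  # BF16 stores 2 bytes per parameter
int8_gb = weight_memory_gb(NUM_PARAMS, 1)  # INT8 stores 1 byte per parameter

print(f"BF16: ~{bf16_gb:.1f} GB, INT8: ~{int8_gb:.1f} GB")
```

The 2:1 ratio between the two estimates is where the "~50% smaller" figure comes from.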

Usage

Using Sentence Transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Octen/Octen-Embedding-8B-INT8")

# Encode sentences
sentences = [
    "This is an example sentence",
    "Each sentence is converted to a vector"
]

embeddings = model.encode(sentences)
print(embeddings.shape)
# Output: (2, 4096)

# Compute similarity
from sentence_transformers.util import cos_sim
similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
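
For retrieval, the same cosine similarity is computed between one query embedding and every document embedding, and documents are then sorted by score. The following is a minimal, dependency-free sketch of that ranking step. It uses made-up 3-dimensional vectors in place of real 4096-dimensional embeddings, and `rank_by_cosine` is a hypothetical helper, not part of any library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for model.encode(...) output.
query = [1.0, 0.0, 0.0]
docs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.0, 0.0],  # identical direction
    [1.0, 1.0, 0.0],  # partially aligned
]

ranking = rank_by_cosine(query, docs)
print([index for index, _ in ranking])  # [1, 2, 0]
```

In practice a vectorized library or a vector database would replace the Python loops; the ordering logic stays the same.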

Using Transformers

from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("Octen/Octen-Embedding-8B-INT8", padding_side="left")
model = AutoModel.from_pretrained("Octen/Octen-Embedding-8B-INT8")
model.eval()

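def encode(texts):
    # The tokenizer pads on the left, so the final position is always a real token
    inputs = tokenizer(texts, padding=True, truncation=True,
                      max_length=8192, return_tensors="pt")
    # Move inputs to the model's device (the quantized model may be loaded on GPU)
    inputs = inputs.to(model.device)

    with torch.no_grad():
        outputs = model(**inputs)
        # Use last token embedding
        embeddings = outputs.last_hidden_state[:, -1, :]
        # Normalize embeddings
        embeddings = F.normalize(embeddings, p=2, dim=1)

    return embeddings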

# Example usage
texts = ["Hello world", "你好世界"]
embeddings = encode(texts)
similarity = torch.matmul(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
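
The `[:, -1, :]` slice in the example above only picks a real token because the tokenizer pads on the left. The dependency-free sketch below shows the index logic, with plain lists standing in for attention masks. The helper name `last_token_index` is hypothetical; it illustrates what would be needed under right padding.

```python
def last_token_index(attention_mask):
    """Return the position of the last real (non-padding) token.

    attention_mask is a list of 0/1 flags, where 1 marks a real token.
    """
    for pos in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[pos] == 1:
            return pos
    raise ValueError("sequence contains no real tokens")


left_padded = [0, 0, 1, 1, 1]   # padding first, tokens last
right_padded = [1, 1, 1, 0, 0]  # tokens first, padding last

print(last_token_index(left_padded))   # 4, the same position as index -1
print(last_token_index(right_padded))  # 2, index -1 would land on padding
```

With right padding, slicing at `-1` would pool a padding position for shorter sequences, so the per-sequence index would have to be used instead.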

Recommended Use Cases

Semantic search and information retrieval
Document similarity and clustering
Question answering
Cross-lingual retrieval
Text classification with embeddings
Deployment in GPU-memory-constrained environments

Known Issues

When encoding documents without any instruction prefix, you may encounter unexpected behavior due to an upstream issue in Qwen3-Embedding. To avoid this issue, we recommend adding "- " (dash followed by space) at the beginning of your text when encoding documents:

# Recommended: Add "- " prefix for document encoding
documents = ["- " + doc for doc in documents]
embeddings = model.encode(documents)

This workaround ensures consistent and expected embedding behavior.
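
If documents pass through several preprocessing steps, the prefix can end up applied twice. A small sketch of an idempotent variant follows; `add_doc_prefix` and `DOC_PREFIX` are hypothetical names introduced here, and treating text that already starts with "- " as prefixed is an assumption of this sketch.

```python
DOC_PREFIX = "- "  # dash followed by space, as recommended above


def add_doc_prefix(texts):
    """Prepend the document prefix unless the text already starts with it."""
    return [t if t.startswith(DOC_PREFIX) else DOC_PREFIX + t for t in texts]


documents = ["Contract law basics", "- Already prefixed"]
prefixed = add_doc_prefix(documents)
print(prefixed)  # ['- Contract law basics', '- Already prefixed']

# Applying the helper twice leaves the result unchanged.
print(add_doc_prefix(prefixed) == prefixed)  # True
```

The resulting list can be passed to `model.encode` exactly as in the snippet above.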

Limitations

Performance may vary across different domains and languages
Very long documents (>40K tokens) require truncation
Optimized for retrieval tasks, not for text generation
INT8 quantization may introduce minor accuracy degradation compared to the BF16 version
Inference speed may not improve despite the reduced memory usage
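
One alternative to plain truncation for very long documents is to split the token sequence into overlapping windows and embed each window separately. The sketch below assumes token IDs are already available as a list. The function name, window size, and overlap are illustrative choices, not recommendations from the model authors.

```python
def chunk_token_ids(token_ids, max_len, overlap=0):
    """Split token_ids into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens so that context spanning a
    boundary appears in both neighbouring chunks.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")

    step = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the current window already reaches the end
    return chunks


ids = list(range(10))  # stand-in for real token IDs
chunks = chunk_token_ids(ids, max_len=4, overlap=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

How the per-chunk embeddings are combined (for example, indexing each chunk separately or averaging them) depends on the application and is not specified by this model card.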

License

This model is licensed under the Apache License 2.0.

This model is derived from Qwen/Qwen3-Embedding-8B, which is also licensed under Apache License 2.0.

Paper

For more details, please refer to our blog post: Octen Series: Optimizing Embedding Models to #1 on RTEB Leaderboard

Citation

If you find our work helpful, please consider citing:

@misc{octen2025rteb,
  title={Octen Series: Optimizing Embedding Models to #1 on RTEB Leaderboard},
  author={Octen Team},
  year={2025},
  url={https://octen-team.github.io/octen_blog/posts/octen-rteb-first-place/}
}
