metadata
language:
  - en
  - zh
  - multilingual
license: apache-2.0
library_name: sentence-transformers
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - embedding
  - text-embedding
  - retrieval
  - quantization
  - int8
pipeline_tag: sentence-similarity
base_model: Qwen/Qwen3-Embedding-8B

Octen-Embedding-8B-INT8

Octen-Embedding-8B-INT8 is a text embedding model designed for semantic search and retrieval tasks. This model is fine-tuned from Qwen/Qwen3-Embedding-8B and supports multiple languages, providing high-quality embeddings for various applications.

Quantization: This is an INT8-quantized version produced with bitsandbytes. INT8 quantization roughly halves the memory footprint (~8GB versus ~16GB for BF16), making the model suitable for deployment in resource-constrained environments. The saving is in memory only: inference speed may not improve and can be slightly slower than the BF16 version on some hardware.
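To give a feel for where the accuracy cost of INT8 comes from, here is a minimal sketch of absmax quantization on a handful of made-up weights. It is a simplified illustration, not the bitsandbytes implementation, which works on whole matrices and treats outlier values separately; the helper names and sample values are hypothetical.

```python
def quantize_absmax(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


# Made-up weights, for illustration only
weights = [0.42, -1.30, 0.07, 0.88, -0.55]

q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)
print(f"max round-trip error: {max_error:.5f}")
```

Each weight is stored in one byte instead of two, and the round-trip error per value is bounded by half the scale. That bound is the source of the small accuracy gap relative to BF16.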

Model Details

  • Base Model: Qwen/Qwen3-Embedding-8B
  • Model Size: 8B parameters (INT8 quantized)
  • Max Sequence Length: 40,960 tokens
  • Embedding Dimension: 4096
  • Languages: English, Chinese, and multilingual support
  • Training Method: LoRA fine-tuning
  • Quantization: INT8 (bitsandbytes)
  • Memory Footprint: ~8GB (vs ~16GB for BF16 version)
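The memory figures above follow from simple arithmetic on the parameter count. The sketch below assumes exactly 8 billion parameters and counts weights only; activations, the tokenizer, and any layers kept in higher precision are ignored, so real usage will be somewhat higher.

```python
PARAMS = 8e9  # "8B parameters", taken here as exactly 8 billion (an assumption)

BYTES_PER_PARAM = {"BF16": 2, "INT8": 1}


def weight_memory_gb(num_params, bytes_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name}: ~{weight_memory_gb(PARAMS, nbytes):.0f} GB")
# BF16: ~16 GB
# INT8: ~8 GB
```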

Usage

Using Sentence Transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Octen/Octen-Embedding-8B-INT8")

# Encode sentences
sentences = [
    "This is an example sentence",
    "Each sentence is converted to a vector"
]

embeddings = model.encode(sentences)
print(embeddings.shape)
# Output: (2, 4096)

# Compute similarity
from sentence_transformers.util import cos_sim
similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
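For retrieval, the same cosine similarity is typically computed between one query and many documents, and the best matches are kept. The following is a minimal sketch in plain Python; the 3-dimensional vectors are toy stand-ins for the 4096-dimensional output of model.encode, and cosine and top_k are hypothetical helpers, not part of Sentence Transformers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [(i, s) for s, i in scored[:k]]


# Toy vectors standing in for real embeddings
query = [1.0, 0.0, 0.0]
docs = [
    [0.9, 0.1, 0.0],  # close to the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.5, 0.5, 0.0],  # partially related
]

results = top_k(query, docs, k=2)
for index, score in results:
    print(f"doc {index}: {score:.4f}")
```

With real embeddings, the list comprehension would usually be replaced by a single matrix product or a vector index. The ranking logic stays the same.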

Using Transformers

from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("Octen/Octen-Embedding-8B-INT8", padding_side="left")
model = AutoModel.from_pretrained("Octen/Octen-Embedding-8B-INT8")
model.eval()

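def encode(texts):
    inputs = tokenizer(texts, padding=True, truncation=True,
                       max_length=8192, return_tensors="pt")
    # Move the inputs to the same device as the model
    inputs = inputs.to(model.device)

    with torch.no_grad():
        outputs = model(**inputs)
        # Use the last token embedding; with left padding, position -1
        # is the final real token of every sequence
        embeddings = outputs.last_hidden_state[:, -1, :]
        # L2-normalize so the dot product equals cosine similarity
        embeddings = F.normalize(embeddings, p=2, dim=1)

    return embeddings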

# Example usage
texts = ["Hello world", "你好世界"]
embeddings = encode(texts)
similarity = torch.matmul(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
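The tokenizer above is created with padding_side="left", and that choice is what makes the [:, -1, :] indexing safe. The sketch below shows why, using hand-written attention masks (1 for a real token, 0 for padding); last_token_index is a hypothetical helper for illustration, not a Transformers function.

```python
def last_token_index(attention_mask):
    """Index of the last real (non-padding) token in one sequence."""
    for i in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[i] == 1:
            return i
    raise ValueError("sequence contains only padding")


# The same 3-token sentence padded to length 5 in two different ways
left_padded = [0, 0, 1, 1, 1]
right_padded = [1, 1, 1, 0, 0]

print(last_token_index(left_padded))   # 4, the final position
print(last_token_index(right_padded))  # 2, not the final position
```

With left padding the last real token always sits at the final position, so one slice works for the whole batch. With right padding, the per-sequence index would have to be derived from the attention mask instead.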

Recommended Use Cases

  • Semantic search and information retrieval
  • Document similarity and clustering
  • Question answering
  • Cross-lingual retrieval
  • Text classification with embeddings
  • Deployment on GPU-constrained environments

Limitations

  • Performance may vary across different domains and languages
  • Inputs longer than the maximum sequence length (40,960 tokens) must be truncated
  • Optimized for retrieval tasks, not for text generation
  • INT8 quantization may introduce minor accuracy degradation compared to BF16 version
  • Inference speed may not improve despite reduced memory usage
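One common alternative to truncating an over-long document is to split its token IDs into overlapping windows and embed each window separately. The sketch below is a generic approach under that assumption, not a procedure described by the model authors; chunk_token_ids and its parameters are hypothetical.

```python
def chunk_token_ids(token_ids, max_length, overlap=0):
    """Split a token ID list into windows of at most max_length tokens.

    Consecutive windows share `overlap` tokens so that text near a
    boundary appears in both neighbours.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")

    step = max_length - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        # Stop once a window has reached the end of the document
        if start + max_length >= len(token_ids):
            break
    return chunks


# Toy example: 10 "tokens", windows of 4 with 1 token of overlap
ids = list(range(10))
chunks = chunk_token_ids(ids, max_length=4, overlap=1)
print(chunks)
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The resulting chunk embeddings can be indexed individually or pooled into one document vector. Which option works better depends on the retrieval task.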

License

This model is licensed under the Apache License 2.0.

This model is derived from Qwen/Qwen3-Embedding-8B, which is also licensed under Apache License 2.0.

Paper

For more details, please refer to our blog post: Octen Series: Optimizing Embedding Models to #1 on RTEB Leaderboard

Citation

If you find our work helpful, please consider citing:

@misc{octen2025rteb,
  title={Octen Series: Optimizing Embedding Models to #1 on RTEB Leaderboard},
  author={Octen Team},
  year={2025},
  url={https://octen-team.github.io/octen_blog/posts/octen-rteb-first-place/}
}