---
language:
- en
- zh
- multilingual
license: apache-2.0
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- embedding
- text-embedding
- retrieval
- quantization
- int8
pipeline_tag: sentence-similarity
base_model: Qwen/Qwen3-Embedding-8B
---

# Octen-Embedding-8B-INT8

Octen-Embedding-8B-INT8 is a text embedding model for semantic search and retrieval. It is fine-tuned from [Qwen/Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B) and produces embeddings for English, Chinese, and other languages.

**Quantization**: This is an INT8-quantized version produced with bitsandbytes. INT8 quantization roughly halves the memory footprint relative to BF16, which makes the model easier to deploy in resource-constrained environments. Lower memory use does not guarantee faster inference: on some hardware this version may run slightly slower than the BF16 version.

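To give a feel for why 8-bit storage saves memory and can cost a little accuracy, here is a simplified absmax quantization round trip in plain Python. It is an illustration only and not the bitsandbytes algorithm, which (as far as we understand it) scales per vector and treats outlier features separately. The function names and the weight values are made up for the example.

```python
def quantize_absmax(values):
    """Map floats to integer codes in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Made-up weights, not taken from the model.
weights = [0.024, -0.513, 0.337, 1.27, -0.081]

codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)

# Each value is off by at most half a quantization step (scale / 2).
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(f"scale={scale:.4f}, max round-trip error={max_error:.4f}")
```

Each code fits in one byte instead of the two bytes a BF16 value needs, and the rounding step is where the small accuracy loss comes from.
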
## Model Details

- **Base Model**: [Qwen/Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B)
- **Model Size**: 8B parameters (INT8 quantized)
- **Max Sequence Length**: 40,960 tokens
- **Embedding Dimension**: 4096
- **Languages**: English, Chinese, and multilingual support
- **Training Method**: LoRA fine-tuning
- **Quantization**: INT8 (bitsandbytes)
- **Memory Footprint**: ~8GB (vs ~16GB for BF16 version)

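The memory figures above follow from simple arithmetic on the parameter count. The sketch below reproduces them under two assumptions: the model has a nominal 8 billion parameters, and only weight storage is counted. Actual usage will be somewhat higher once activations and any layers kept in higher precision are included. The helper name is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 8e9  # nominal "8B"; the exact parameter count is an assumption

bf16_gb = weight_memory_gb(NUM_PARAMS, 2)  # BF16 stores 2 bytes per parameter
int8_gb = weight_memory_gb(NUM_PARAMS, 1)  # INT8 stores 1 byte per parameter

print(f"BF16: ~{bf16_gb:.0f} GB, INT8: ~{int8_gb:.0f} GB")
```
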
## Usage

### Using Sentence Transformers

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("Octen/Octen-Embedding-8B-INT8")

# Encode sentences
sentences = [
    "This is an example sentence",
    "Each sentence is converted to a vector"
]

embeddings = model.encode(sentences)
print(embeddings.shape)
# Output: (2, 4096)

# Compute cosine similarity between the two sentence embeddings
similarity = cos_sim(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
```

### Using Transformers

```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

# Left padding keeps the last real token of every sequence at position -1
tokenizer = AutoTokenizer.from_pretrained("Octen/Octen-Embedding-8B-INT8", padding_side="left")
model = AutoModel.from_pretrained("Octen/Octen-Embedding-8B-INT8")
model.eval()

def encode(texts):
    inputs = tokenizer(texts, padding=True, truncation=True,
                       max_length=8192, return_tensors="pt")
    # Move the inputs to the device the model was loaded on
    inputs = inputs.to(model.device)

    with torch.no_grad():
        outputs = model(**inputs)
        # Use last token embedding
        embeddings = outputs.last_hidden_state[:, -1, :]
        # Normalize embeddings
        embeddings = F.normalize(embeddings, p=2, dim=1)

    return embeddings

# Example usage
texts = ["Hello world", "你好世界"]
embeddings = encode(texts)
# Embeddings are L2-normalized, so the dot product equals cosine similarity
similarity = torch.matmul(embeddings[0], embeddings[1])
print(f"Similarity: {similarity.item():.4f}")
```

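The Transformers example relies on left padding so that position `-1` always holds a real token. The plain-Python sketch below shows the index logic for both padding sides, using attention masks written as lists of 0/1 flags. The function name `last_token_index` is hypothetical and is not part of any library.

```python
def last_token_index(attention_mask, padding_side):
    """Return the position of the last real (non-padding) token.

    attention_mask is a list of 0/1 flags for one sequence, 1 marking real tokens.
    """
    if padding_side == "left":
        # Padding sits at the front, so the final position is a real token.
        return len(attention_mask) - 1
    # Right padding: real tokens come first, so count them.
    return sum(attention_mask) - 1


left_padded = [0, 0, 1, 1, 1]   # two pad tokens, then three real tokens
right_padded = [1, 1, 1, 0, 0]  # three real tokens, then two pad tokens

left_index = last_token_index(left_padded, "left")
right_index = last_token_index(right_padded, "right")
print(left_index, right_index)
```

With right padding, taking `[:, -1, :]` would pool a padding position for every sequence shorter than the longest one in the batch, so a per-sequence index like the one above would be needed instead.
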
## Recommended Use Cases

- Semantic search and information retrieval
- Document similarity and clustering
- Question answering
- Cross-lingual retrieval
- Text classification with embeddings
- Deployment on GPU-constrained environments

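For the semantic search use case, retrieval typically comes down to ranking document embeddings by cosine similarity to a query embedding. The sketch below shows that ranking step in plain Python with toy 3-dimensional vectors standing in for real 4096-dimensional embeddings. The helper names are hypothetical, and a production system would more likely use a vector index.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return (doc_index, score) pairs for the k most similar documents."""
    scored = [(cosine_similarity(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [(i, score) for score, i in scored[:k]]


# Toy vectors, not real model output.
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [0.8, 0.6, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
for doc_index, score in results:
    print(f"doc {doc_index}: {score:.2f}")
```
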
## Limitations

- Performance may vary across domains and languages
- Inputs longer than the maximum sequence length (40,960 tokens) must be truncated
- Optimized for retrieval tasks, not for text generation
- INT8 quantization may introduce minor accuracy degradation compared to the BF16 version
- Inference speed may not improve despite the reduced memory usage

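One common alternative to truncating a long document is to split its token IDs into overlapping windows and embed each window separately. The sketch below shows only the splitting step, in plain Python. The function name and the window and overlap sizes are assumptions for illustration, and how the chunk embeddings are combined or indexed afterwards is a design choice this model card does not prescribe.

```python
def chunk_token_ids(token_ids, max_length, overlap=0):
    """Split a list of token IDs into windows of at most max_length tokens.

    Consecutive windows share `overlap` tokens.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be in [0, max_length)")

    step = max_length - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        # Stop once a window reaches the end of the sequence.
        if start + max_length >= len(token_ids):
            break
    return chunks


# Ten dummy token IDs, windows of 4 tokens with 1 token of overlap.
token_ids = list(range(10))
chunks = chunk_token_ids(token_ids, max_length=4, overlap=1)
print(chunks)
```
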
## License

This model is licensed under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).

This model is derived from [Qwen/Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B), which is also licensed under Apache License 2.0.

## Paper

For more details, please refer to our blog post: [Octen Series: Optimizing Embedding Models to #1 on RTEB Leaderboard](https://octen-team.github.io/octen_blog/posts/octen-rteb-first-place/)

## Citation

If you find our work helpful, please consider citing:

```bibtex
@misc{octen2025rteb,
  title={Octen Series: Optimizing Embedding Models to \#1 on RTEB Leaderboard},
  author={Octen Team},
  year={2025},
  url={https://octen-team.github.io/octen_blog/posts/octen-rteb-first-place/}
}
```