# Multilingual E5 Large Instruct - 8-bit Quantized
This is an 8-bit quantized version of the [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) model.
## Model Details
- Original model: [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
- Quantization: 8-bit (using bitsandbytes)
- Model architecture: XLM-RoBERTa Large with instruction tuning
- Original parameters: 560M
- Embedding dimensions: 1024
- Context length: 512 tokens
- Languages supported: 94+ languages
## Usage
This model can be used with the `transformers` library for generating embeddings:
```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

# Load the model in 8-bit
model_name = "gopersonal/multilingual-e5-large-instruct-8bit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# Mean-pool the last hidden states, ignoring padding positions
def average_pool(last_hidden_states, attention_mask):
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# E5-instruct expects queries to be prefixed with the task instruction
def get_detailed_instruct(task_description, query):
    return f'Instruct: {task_description}\nQuery: {query}'

# Prepare your texts
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'how much protein should a female eat'),
    get_detailed_instruct(task, 'best restaurants in new york')
]

# Tokenize and generate embeddings
batch_dict = tokenizer(queries, max_length=512, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# Normalize embeddings so cosine similarity reduces to a dot product
embeddings = F.normalize(embeddings, p=2, dim=1)
```
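The pooling step can be sanity-checked without downloading the model. Below is a minimal NumPy re-implementation of the same masked mean on a toy batch (the name `average_pool_np` is just for this illustration):

```python
import numpy as np

def average_pool_np(last_hidden, attention_mask):
    # Same masked mean as the torch version above: zero out padded
    # positions, then divide by the number of real tokens per sequence.
    mask = attention_mask[..., None].astype(last_hidden.dtype)
    summed = (last_hidden * mask).sum(axis=1)
    counts = attention_mask.sum(axis=1, keepdims=True)
    return summed / counts

# Toy batch: 1 sequence, 3 positions, the last one is padding.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [99.0, 99.0]]])
mask = np.array([[1, 1, 0]])
pooled = average_pool_np(hidden, mask)
# The padded position is ignored: mean of [1, 3] and [2, 4] -> [[2.0, 3.0]]
```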
## Infinity Embedding Server Usage
The model can also be served over HTTP with [Infinity](https://github.com/michaelfeil/infinity):
```bash
docker run --gpus all -v $PWD/models:/app/.cache -p 7997:7997 \
michaelf34/infinity:latest \
v2 --model-id gopersonal/multilingual-e5-large-instruct-8bit \
--dtype int8 --batch-size 8 --engine torch --port 7997 --device auto
```
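Once the container is up, Infinity exposes an OpenAI-compatible `/embeddings` endpoint on the port mapped above. A stdlib-only client sketch (the helper names `build_payload` and `embed` are illustrative, not part of Infinity):

```python
import json
import urllib.request

INFINITY_URL = "http://localhost:7997/embeddings"  # port from the docker command above

def build_payload(texts, model="gopersonal/multilingual-e5-large-instruct-8bit"):
    # OpenAI-style request body: model name plus a list of input texts
    return {"model": model, "input": list(texts)}

def embed(texts):
    # POST the texts and collect one embedding vector per input
    req = urllib.request.Request(
        INFINITY_URL,
        data=json.dumps(build_payload(texts)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return [item["embedding"] for item in data["data"]]
```

Remember to apply the same `Instruct: ...\nQuery: ...` prefix to queries before sending them, as in the Python example above.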
## Benefits of 8-bit Quantization
- Approximately 50% reduction in weight memory compared to FP16
- Fits on GPUs with limited VRAM where the FP16 model would not
- Minimal impact on embedding quality and similarity calculations
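A back-of-envelope check of the memory claim, using the 560M parameter count from the model details above (actual usage is somewhat higher, since activations and a few non-quantized layers stay in higher precision):

```python
params = 560_000_000  # parameter count of multilingual-e5-large-instruct

# Bytes per parameter at each precision -> approximate weight footprint
footprint = {
    name: params * bytes_per_param / 1e9
    for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]
}
for name, gb in footprint.items():
    print(f"{name}: ~{gb:.2f} GB of weights")
# int8 weights take half the space of fp16 (~0.56 GB vs ~1.12 GB)
```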
## License
This model inherits the license of the original model: MIT