# Multilingual E5 Large Instruct - 8-bit Quantized

This is an 8-bit quantized version of the [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) model.

## Model Details

- Original model: [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
- Quantization: 8-bit (using bitsandbytes)
- Model architecture: XLM-RoBERTa Large with instruction tuning
- Original parameters: 560M
- Embedding dimensions: 1024
- Context length: 512 tokens
- Languages supported: 94+ languages

## Usage

This model can be used with the `transformers` library for generating embeddings:

```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

# Load the model in 8-bit
model_name = "gopersonal/multilingual-e5-large-instruct-8bit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, load_in_8bit=True, device_map="auto")

# Mean-pool the token embeddings, ignoring padding positions
def average_pool(last_hidden_states, attention_mask):
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

def get_detailed_instruct(task_description, query):
    return f'Instruct: {task_description}\nQuery: {query}'

# Prepare your texts
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'how much protein should a female eat'),
    get_detailed_instruct(task, 'best restaurants in new york')
]

# Tokenize and generate embeddings
batch_dict = tokenizer(queries, max_length=512, padding=True, truncation=True, return_tensors='pt').to(model.device)
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# Normalize embeddings to unit length
embeddings = F.normalize(embeddings, p=2, dim=1)
```
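Because the embeddings are L2-normalized, cosine similarity between a query and a passage reduces to a dot product, so a whole batch can be scored with a single matrix multiply. A minimal sketch of the scoring step, using random unit vectors as stand-ins for real model outputs:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-ins for the normalized `embeddings` produced above: 2 queries, 3 passages
query_emb = F.normalize(torch.randn(2, 1024), p=2, dim=1)
passage_emb = F.normalize(torch.randn(3, 1024), p=2, dim=1)

# For unit-length vectors, cosine similarity is just a dot product
scores = query_emb @ passage_emb.T   # shape (2, 3), values in [-1, 1]
best = scores.argmax(dim=1)          # best-matching passage index per query
```

In a real retrieval setup, `query_emb` would hold the instructed query embeddings and `passage_emb` the embeddings of plain (non-instructed) passages.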

## Infinity Embedding Server Usage

```bash
docker run --gpus all -v $PWD/models:/app/.cache -p 7997:7997 \
  michaelf34/infinity:latest \
  v2 --model-id gopersonal/multilingual-e5-large-instruct-8bit \
  --dtype int8 --batch-size 8 --engine torch --port 7997 --device auto
```
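Once the container is up, the server can be queried over HTTP. A hypothetical client sketch, assuming Infinity's OpenAI-compatible `/embeddings` endpoint on port 7997 (endpoint path and response shape are assumptions; check your Infinity version's docs):

```python
import json
import urllib.request

def embed(texts, url="http://localhost:7997/embeddings"):
    """Request embeddings for a list of strings from a running Infinity server."""
    payload = json.dumps({
        "model": "gopersonal/multilingual-e5-large-instruct-8bit",
        "input": texts,
    }).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Assumes an OpenAI-style response: {"data": [{"embedding": [...]}, ...]}
    return [item["embedding"] for item in body["data"]]

# With the server running:
# vectors = embed(["Instruct: ...\nQuery: how much protein should a female eat"])
```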

## Benefits of 8-bit Quantization

- Roughly 50% lower memory usage than FP16 (and about 75% lower than FP32)
- Fits on GPUs with limited VRAM where the half-precision model would not, leaving headroom for larger batches
- Typically minimal impact on embedding quality and similarity rankings

## License

This model inherits the license of the original model: MIT