# Multilingual E5 Large Instruct - 8-bit Quantized

This is an 8-bit quantized version of the [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) model.

## Model Details

- Original model: [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
- Quantization: 8-bit (using bitsandbytes)
- Model architecture: XLM-RoBERTa Large with instruction tuning
- Original parameters: 560M
- Embedding dimensions: 1024
- Context length: 512 tokens
- Languages supported: 94+ languages
## Usage

This model can be used with the `transformers` library for generating embeddings:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

# Load the model in 8-bit
model_name = "gopersonal/multilingual-e5-large-instruct-8bit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# Mean-pool the last hidden states, ignoring padding positions
def average_pool(last_hidden_states, attention_mask):
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# Prepend the task instruction expected by E5-instruct models
def get_detailed_instruct(task_description, query):
    return f'Instruct: {task_description}\nQuery: {query}'

# Prepare your texts
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'how much protein should a female eat'),
    get_detailed_instruct(task, 'best restaurants in new york')
]

# Tokenize and generate embeddings
batch_dict = tokenizer(queries, max_length=512, padding=True, truncation=True, return_tensors='pt')
batch_dict = batch_dict.to(model.device)
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# Normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
```
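Once the embeddings are L2-normalized, cosine similarity reduces to a plain dot product, so pairwise scores can be computed with a single matrix multiply. A minimal sketch with toy vectors standing in for real model output:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for model embeddings (real ones are 1024-dimensional)
embeddings = F.normalize(torch.tensor([[1.0, 2.0, 3.0],
                                       [2.0, 4.0, 6.0],
                                       [-1.0, 0.0, 1.0]]), p=2, dim=1)

# For unit-norm rows, X @ X.T gives the pairwise cosine similarities
scores = embeddings @ embeddings.T
print(scores)
```

The first two rows are parallel, so their similarity is 1.0; the diagonal is always 1.0 for normalized vectors.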
## Infinity Embedding Server Usage

```bash
docker run --gpus all -v $PWD/models:/app/.cache -p 7997:7997 \
  michaelf34/infinity:latest \
  v2 --model-id gopersonal/multilingual-e5-large-instruct-8bit \
  --dtype int8 --batch-size 8 --engine torch --port 7997 --device auto
```
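Once the container is up, embeddings can be requested over HTTP. A sketch, assuming the server is running locally on the port mapped above and exposes Infinity's OpenAI-compatible `/embeddings` endpoint:

```shell
# Request embeddings from the locally running Infinity server (assumed at localhost:7997)
curl -X POST http://localhost:7997/embeddings \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gopersonal/multilingual-e5-large-instruct-8bit",
        "input": ["Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: how much protein should a female eat"]
      }'
```

Note that the instruction prefix must be applied client-side; the server embeds the input strings as given.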
## Benefits of 8-bit Quantization

- Approximately 50% reduction in weight memory compared to FP16
- Lower VRAM requirements, making the model usable on smaller GPUs
- Throughput can improve on memory-bound workloads, though per-batch latency may be slightly higher due to dequantization overhead
- Minimal impact on embedding quality and similarity rankings in most retrieval workloads
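The memory figures above follow from simple arithmetic over the parameter count (weights only; activations and framework overhead are excluded):

```python
# Back-of-envelope weight-memory estimate for a 560M-parameter model
params = 560_000_000
fp16_gb = params * 2 / 1024**3   # 2 bytes per weight in FP16
int8_gb = params * 1 / 1024**3   # 1 byte per weight in INT8
print(f"FP16: {fp16_gb:.2f} GiB, INT8: {int8_gb:.2f} GiB")
# FP16: 1.04 GiB, INT8: 0.52 GiB
```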
## License

This model inherits the license of the original model: MIT