# Multilingual E5 Large Instruct - 8-bit Quantized

This is an 8-bit quantized version of the [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) model.

## Model Details

- Original model: [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
- Quantization: 8-bit (using bitsandbytes)
- Model architecture: XLM-RoBERTa Large with instruction tuning
- Original parameters: 560M
- Embedding dimensions: 1024
- Context length: 512 tokens
- Languages supported: 94+ languages

## Usage

This model can be used with the `transformers` library to generate embeddings:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Load the tokenizer and the 8-bit model (8-bit loading relies on bitsandbytes)
model_name = "gopersonal/multilingual-e5-large-instruct-8bit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, load_in_8bit=True, device_map="auto")

def average_pool(last_hidden_states, attention_mask):
    """Mean-pool token embeddings, ignoring padding positions."""
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

def get_detailed_instruct(task_description, query):
    """Build the instruction-prefixed query string the model expects."""
    return f'Instruct: {task_description}\nQuery: {query}'

# Prepare your texts
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'how much protein should a female eat'),
    get_detailed_instruct(task, 'best restaurants in new york')
]

# Tokenize and move the inputs to the same device as the model
batch_dict = tokenizer(queries, max_length=512, padding=True, truncation=True, return_tensors='pt')
batch_dict = batch_dict.to(model.device)

# Generate embeddings without tracking gradients
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# Normalize embeddings to unit length
embeddings = F.normalize(embeddings, p=2, dim=1)
```

## Infinity Embedding Server Usage

```bash
docker run --gpus all -v $PWD/models:/app/.cache -p 7997:7997 \
  michaelf34/infinity:latest \
  v2 --model-id gopersonal/multilingual-e5-large-instruct-8bit \
  --dtype int8 --batch-size 8 --engine torch --port 7997 --device auto
```

## Benefits of 8-bit Quantization

- Approximately 50% reduction in memory usage compared to FP16
- Faster inference, especially on GPUs with limited VRAM
- Minimal impact on embedding quality and similarity calculations

## License

This model inherits the license of the original model: MIT
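
## Rough Memory Estimate

The "approximately 50%" memory figure in the benefits list can be sanity-checked with simple arithmetic on the 560M parameter count from Model Details. The sketch below is a back-of-the-envelope estimate only, not a measurement: it assumes every weight is stored at the stated width and ignores activations, framework overhead, and any layers that stay in higher precision after quantization. The helper name `weight_memory_gb` is made up for this illustration.

```python
# Back-of-the-envelope weight-memory estimate (illustrative, not a measurement).
PARAMS = 560_000_000  # parameter count quoted in Model Details

# Assumed storage width per parameter for each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Return approximate weight storage in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


estimates = {dtype: weight_memory_gb(PARAMS, dtype) for dtype in BYTES_PER_PARAM}

# Fraction of weight memory saved by int8 relative to fp16.
saving_vs_fp16 = 1 - estimates["int8"] / estimates["fp16"]

for dtype, gb in estimates.items():
    print(f"{dtype}: {gb:.2f} GB")
print(f"int8 saving vs fp16: {saving_vs_fp16:.0%}")
```

Under these assumptions the weights take roughly 1.12 GB in FP16 and 0.56 GB in int8. Real-world savings are likely to be somewhat smaller, so treat the result as an upper bound and measure on your own hardware if the exact footprint matters.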