# Multilingual E5 Large Instruct - 8-bit Quantized

This is an 8-bit quantized version of the [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) model.

## Model Details

- Original model: [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
- Quantization: 8-bit (using bitsandbytes)
- Model architecture: XLM-RoBERTa Large with instruction tuning
- Original parameters: 560M
- Embedding dimensions: 1024
- Context length: 512 tokens
- Languages supported: 94+ languages
## Usage

This model can be used with the `transformers` library for generating embeddings:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

# Load the model in 8-bit
model_name = "gopersonal/multilingual-e5-large-instruct-8bit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# Mean-pool the last hidden states, ignoring padding positions
def average_pool(last_hidden_states, attention_mask):
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# Prepend the task instruction expected by E5-instruct models
def get_detailed_instruct(task_description, query):
    return f'Instruct: {task_description}\nQuery: {query}'

# Prepare your texts
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'how much protein should a female eat'),
    get_detailed_instruct(task, 'best restaurants in new york')
]

# Tokenize and generate embeddings
batch_dict = tokenizer(queries, max_length=512, padding=True, truncation=True, return_tensors='pt')
batch_dict = batch_dict.to(model.device)
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# Normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
```
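Once the embeddings are L2-normalized, cosine similarity reduces to a plain dot product, so pairwise scores can be computed with a single matrix multiply. A minimal sketch with toy vectors standing in for real model output:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for model embeddings (real ones are 1024-dimensional)
embeddings = F.normalize(torch.tensor([[1.0, 2.0, 3.0],
                                       [2.0, 4.0, 6.0],
                                       [-1.0, 0.0, 1.0]]), p=2, dim=1)

# For unit-norm rows, X @ X.T gives the pairwise cosine similarities
scores = embeddings @ embeddings.T
print(scores)
```

The first two rows are parallel, so their similarity is 1.0; the diagonal is always 1.0 for normalized vectors.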
## Infinity Embedding Server Usage

```bash
docker run --gpus all -v $PWD/models:/app/.cache -p 7997:7997 \
  michaelf34/infinity:latest \
  v2 --model-id gopersonal/multilingual-e5-large-instruct-8bit \
  --dtype int8 --batch-size 8 --engine torch --port 7997 --device auto
```
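Once the container is up, embeddings can be requested over HTTP. A sketch, assuming the server is running locally on the port mapped above and exposes Infinity's OpenAI-compatible `/embeddings` endpoint:

```shell
# Request embeddings from the locally running Infinity server (assumed at localhost:7997)
curl -X POST http://localhost:7997/embeddings \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gopersonal/multilingual-e5-large-instruct-8bit",
        "input": ["Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: how much protein should a female eat"]
      }'
```

Note that the instruction prefix must be applied client-side; the server embeds the input strings as given.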
## Benefits of 8-bit Quantization

- Approximately 50% reduction in weight memory compared to FP16
- Lower VRAM requirements, making the model usable on smaller GPUs
- Throughput can improve on memory-bound workloads, though per-batch latency may be slightly higher due to dequantization overhead
- Minimal impact on embedding quality and similarity rankings in most retrieval workloads
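The memory figures above follow from simple arithmetic over the parameter count (weights only; activations and framework overhead are excluded):

```python
# Back-of-envelope weight-memory estimate for a 560M-parameter model
params = 560_000_000
fp16_gb = params * 2 / 1024**3   # 2 bytes per weight in FP16
int8_gb = params * 1 / 1024**3   # 1 byte per weight in INT8
print(f"FP16: {fp16_gb:.2f} GiB, INT8: {int8_gb:.2f} GiB")
# FP16: 1.04 GiB, INT8: 0.52 GiB
```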
## License

This model inherits the license of the original model: MIT