--- |
|
|
library_name: transformers |
|
|
tags: [] |
|
|
--- |
|
|
|
|
|
# Introduction |
|
|
We introduce **Elb**edding, *TBD* |
|
|
|
|
|
For more technical details, refer to our paper: *TBD* |
|
|
|
|
|
# Model Details |
|
|
- Base decoder-only LLM: *TBD*
- Pooling type: last EOS token (illustrated in the sketch below)
- Maximum context length: 512 tokens
- Embedding dimension: 4096
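
To illustrate last-EOS-token pooling: an EOS token is appended to each text, and the embedding is read from the hidden state at that final non-padding position. Here is a minimal sketch on dummy tensors; the full implementation, including left-padding handling and normalization, appears in the Transformers example below:

```python
import torch

# Dummy batch: 2 right-padded sequences (lengths 3 and 2), hidden size 4.
last_hidden_state = torch.randn(2, 3, 4)
attention_mask = torch.tensor([[1, 1, 1],
                               [1, 1, 0]])

# Index of the last real token (the appended EOS) in each sequence.
sequence_lengths = attention_mask.sum(dim=1) - 1  # tensor([2, 1])
embeddings = last_hidden_state[torch.arange(2), sequence_lengths]
print(embeddings.shape)  # torch.Size([2, 4])
```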
|
|
|
|
|
# How to use with 🤗 Transformers? |
|
|
|
|
|
```python |
|
|
from typing import List |
|
|
from transformers import AutoTokenizer, AutoModel |
|
|
import torch |
|
|
|
|
|
|
|
|
def get_detailed_instruct(queries: List[str]) -> List[str]: |
|
|
return [f"Instruct: Retrieve semantically similar text.\nQuery: {query}" for query in queries] |
|
|
|
|
|
def tokenize(sentences: List[str], tokenizer: AutoTokenizer):
    # Append an EOS token, since embeddings are pooled from the last EOS position.
    texts = [x + tokenizer.eos_token for x in sentences]
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=512).to("cuda")
    # If truncation cut off the appended EOS, force the final position back to EOS.
    inputs.input_ids[:, -1] = tokenizer.eos_token_id
    inputs.pop("token_type_ids", None)
    return inputs
|
|
|
|
|
|
|
|
def pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor, do_normalize: bool = True) -> torch.Tensor:
    # With left padding, the last position holds the final real token for every sequence.
    left_padding = attention_mask[:, -1].sum() == attention_mask.shape[0]
    if left_padding:
        embeddings = last_hidden_state[:, -1]
    else:
        # With right padding, locate the last non-padding token in each sequence.
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_state.shape[0]
        embeddings = last_hidden_state[torch.arange(batch_size, device=last_hidden_state.device).long(), sequence_lengths.long()]
    if do_normalize:
        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    return embeddings
|
|
|
|
|
|
|
|
model = AutoModel.from_pretrained(pretrained_model_name_or_path="lamarr-llm-development/elbedding", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="lamarr-llm-development/elbedding", trust_remote_code=True)

model = model.to("cuda")
model.eval()

sentences = ["Hi how are you doing?"]
# sentences = get_detailed_instruct(sentences)  # if the sentences are queries
sentences_inputs = tokenize(sentences=sentences, tokenizer=tokenizer)
with torch.no_grad():
    sentences_outputs = model(**sentences_inputs)
embeddings = pool(
    last_hidden_state=sentences_outputs.last_hidden_state,
    attention_mask=sentences_inputs.attention_mask,
)
print(embeddings)
|
|
``` |
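
Since `pool` returns L2-normalized vectors, the relevance score between an instructed query and a plain document is just a dot product. Below is a minimal retrieval sketch reusing the functions from the example above; the query and document texts are purely illustrative:

```python
queries = get_detailed_instruct(["What is the capital of France?"])
documents = ["Paris is the capital and largest city of France.", "Berlin lies on the river Spree."]

query_inputs = tokenize(sentences=queries, tokenizer=tokenizer)
doc_inputs = tokenize(sentences=documents, tokenizer=tokenizer)

with torch.no_grad():
    query_embeddings = pool(model(**query_inputs).last_hidden_state, query_inputs.attention_mask)
    doc_embeddings = pool(model(**doc_inputs).last_hidden_state, doc_inputs.attention_mask)

# Cosine similarity reduces to a dot product on L2-normalized embeddings.
scores = query_embeddings @ doc_embeddings.T
print(scores)
```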
|
|
|
|
|
# How to use with Sentence Transformers? |
|
|
|
|
|
```python |
|
|
from sentence_transformers import SentenceTransformer |
|
|
from typing import List |
|
|
|
|
|
def get_detailed_instruct(queries: List[str]) -> List[str]: |
|
|
return [f"Instruct: Retrieve semantically similar text.\nQuery: {query}" for query in queries] |
|
|
|
|
|
model = SentenceTransformer("lamarr-llm-development/elbedding", trust_remote_code=True) |
|
|
|
|
|
sentences = ["Hi how are you doing?"]
# sentences = get_detailed_instruct(sentences)  # if the sentences are queries
embeddings = model.encode(sentences=sentences, normalize_embeddings=True)
print(embeddings)
```
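
The same query/document split applies here: wrap only the queries with the instruction and score with a dot product, since the embeddings are normalized. A minimal sketch with illustrative example texts:

```python
queries = get_detailed_instruct(["What is the capital of France?"])
documents = ["Paris is the capital and largest city of France.", "Berlin lies on the river Spree."]

query_embeddings = model.encode(sentences=queries, normalize_embeddings=True)
doc_embeddings = model.encode(sentences=documents, normalize_embeddings=True)

# Cosine similarity reduces to a dot product on L2-normalized embeddings.
scores = query_embeddings @ doc_embeddings.T
print(scores)
```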
|
|
|
|
|
|
|
|
## Supported Languages |
|
|
*TBD* |
|
|
|
|
|
|
|
|
## MTEB Benchmark Evaluation |
|
|
*TBD* |
|
|
|
|
|
## FAQ |
|
|
|
|
|
**Do I need to add instructions to the query?** |
|
|
|
|
|
Yes. The model is trained with the instruction prepended to queries, so omitting it will degrade performance. For example, the query `Hi how are you doing?` should be embedded as `Instruct: Retrieve semantically similar text.\nQuery: Hi how are you doing?`. Documents, on the other hand, are embedded as-is, without any instruction.
|
|
|
|
|
## Citation |
|
|
*TBD* |
|
|
|
|
|
## Limitations |
|
|
*TBD* |
|
|
|