# F2LLM-v2-80M
F2LLM-v2 is a family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B parameters. Trained on a curated collection of 60 million publicly available, high-quality examples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages.
## Usage

### With Sentence Transformers

To encode text with the [Sentence Transformers](https://www.sbert.net/) library:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("codefuse-ai/F2LLM-v2-80M", device="cuda:0", model_kwargs={"torch_dtype": "bfloat16"})

# Some sample query and documents
query = "What is F2LLM used for?"
documents = [
    'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
    'F2LLM is a model for computing text embeddings that can be used for various NLP tasks such as information retrieval, semantic search, and text classification.',
    'F2LLM 是 CodeFuse 开源的系列嵌入模型。',
    'F2LLM — это модель вычисления встраивания текста, которую можно использовать для различных задач НЛП, таких как поиск информации, семантический поиск и классификация текста.'
]

# Encode the query and documents separately. The encode_query method applies
# the model's query prompt before encoding.
query_embedding = model.encode_query(query)
document_embeddings = model.encode_document(documents)
print(query_embedding.shape, document_embeddings.shape)
# (320,) (4, 320)

# Compute cosine similarity between the query and documents
similarity = model.similarity(query_embedding, document_embeddings)
print(similarity)
# tensor([[0.4943, 0.6311, 0.5591, 0.6962]])
```
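The similarity scores can then be used to rank candidate documents for retrieval: the highest-scoring document is the best match for the query. A minimal sketch in plain Python, reusing the example scores from the snippet above (the `"doc A"` labels are hypothetical placeholders for the actual document texts):

```python
# Example similarity scores for one query against four documents,
# taken from the sample output above.
scores = [0.4943, 0.6311, 0.5591, 0.6962]
documents = ["doc A", "doc B", "doc C", "doc D"]

# Sort documents by score, highest first.
ranked = sorted(zip(scores, documents), reverse=True)
for score, doc in ranked:
    print(f"{score:.4f}  {doc}")
# 0.6962  doc D
# 0.6311  doc B
# 0.5591  doc C
# 0.4943  doc A
```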

### With Transformers

Or directly with the [Transformers](https://huggingface.co/docs/transformers/index) library:

```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

model_path = "codefuse-ai/F2LLM-v2-80M"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map={'': 0})

query = "What is F2LLM used for?"
query_prompt = "Instruct: Given a question, retrieve passages that can help answer the question.\nQuery: "
documents = [
    'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
    'F2LLM is a model for computing text embeddings that can be used for various NLP tasks such as information retrieval, semantic search, and text classification.',
    'F2LLM 是 CodeFuse 开源的系列嵌入模型。',
    'F2LLM — это модель вычисления встраивания текста, которую можно использовать для различных задач НЛП, таких как поиск информации, семантический поиск и классификация текста.'
]

def encode(sentences):
    batch_size = len(sentences)
    # The tokenizer automatically appends the EOS token.
    tokenized_inputs = tokenizer(sentences, padding=True, return_tensors='pt').to(model.device)
    last_hidden_state = model(**tokenized_inputs).last_hidden_state
    # Use the hidden state at each sequence's final (EOS) position as its embedding.
    eos_positions = tokenized_inputs.attention_mask.sum(dim=1) - 1
    embeddings = last_hidden_state[torch.arange(batch_size, device=model.device), eos_positions]
    embeddings = F.normalize(embeddings, p=2, dim=1)
    return embeddings

# Encode the query (with the query prompt prepended) and the documents
query_embedding = encode([query_prompt + query])
document_embeddings = encode(documents)
print(query_embedding.shape, document_embeddings.shape)
# torch.Size([1, 320]) torch.Size([4, 320])

# The embeddings are L2-normalized, so the dot product equals cosine similarity
similarity = query_embedding @ document_embeddings.T
print(similarity)
# tensor([[0.6914, 0.7812, 0.7148, 0.8359]], device='cuda:0',
#        dtype=torch.bfloat16, grad_fn=<MmBackward0>)
```
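The EOS-position gather inside `encode` is the only subtle step: with right-padding, each sequence's last real token sits at a different index, so the attention mask is used to locate it. A toy illustration with hand-made tensors (no model download needed; the shapes and values are invented for demonstration):

```python
import torch

# Toy batch: 2 sequences, padded to length 4, hidden size 3.
last_hidden_state = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
attention_mask = torch.tensor([[1, 1, 1, 0],   # real length 3 -> EOS at index 2
                               [1, 1, 1, 1]])  # real length 4 -> EOS at index 3

# Index of the last non-padding token in each row.
eos_positions = attention_mask.sum(dim=1) - 1  # tensor([2, 3])

# Gather the hidden state at that position for every sequence in the batch.
embeddings = last_hidden_state[torch.arange(2), eos_positions]
print(embeddings)
# tensor([[ 6.,  7.,  8.],
#         [21., 22., 23.]])
```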