Geralt-Targaryen committed on
Commit
5e78b50
·
verified ·
1 Parent(s): 2057988

Update README.md

  1. README.md +171 -3
README.md CHANGED
@@ -1,3 +1,171 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ - zh
6
+ - ru
7
+ - es
8
+ - fr
9
+ - de
10
+ - ar
11
+ - nl
12
+ - vi
13
+ - hi
14
+ - ko
15
+ - ja
16
+ - it
17
+ - id
18
+ - pt
19
+ - pl
20
+ - tr
21
+ - da
22
+ - th
23
+ - sv
24
+ - fa
25
+ - uk
26
+ - cs
27
+ - 'no'
28
+ - el
29
+ - ca
30
+ - ro
31
+ - fi
32
+ - bg
33
+ - tl
34
+ - gl
35
+ - my
36
+ - hy
37
+ - km
38
+ - ne
39
+ - hu
40
+ - eu
41
+ - he
42
+ - lo
43
+ - sw
44
+ - az
45
+ - lv
46
+ - si
47
+ - sk
48
+ - tg
49
+ - et
50
+ - lt
51
+ - ms
52
+ - hr
53
+ - is
54
+ - sl
55
+ - sr
56
+ - ur
57
+ - bn
58
+ - af
59
+ - ta
60
+ - ka
61
+ - te
62
+ - ml
63
+ - mn
64
+ - nn
65
+ - kk
66
+ - cy
67
+ - mr
68
+ - sq
69
+ - nb
70
+ - mk
71
+ - jv
72
+ - kn
73
+ - eo
74
+ - la
75
+ - gu
76
+ - uz
77
+ - am
78
+ - oc
79
+ - be
80
+ - mg
81
+ - vo
82
+ - pa
83
+ - lb
84
+ - ht
85
+ - br
86
+ - ga
87
+ - xh
88
+ - tt
89
+ - bs
90
+ - yo
91
+ base_model:
92
+ - codefuse-ai/F2LLM-v2-4B-Preview
93
+ pipeline_tag: feature-extraction
94
+ library_name: transformers
95
+ tags:
96
+ - sentence-transformers
97
+ datasets:
98
+ - codefuse-ai/F2LLM-v2
99
+ ---
100
+
101
+ # F2LLM-v2-4B
102
+
103
+ F2LLM-v2 is a family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B parameters. Trained on a curated composite of 60 million publicly available, high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages.
104
+
105
+ ## Usage
106
+
107
+ ### With Sentence Transformers
108
+
109
+ To encode text with the [Sentence Transformers](https://www.sbert.net/) library:
110
+
111
+ ```python
112
+ from sentence_transformers import SentenceTransformer
113
+ model = SentenceTransformer("codefuse-ai/F2LLM-v2-4B", device="cuda:0", model_kwargs={"torch_dtype": "bfloat16"})
114
+ # Some sample query and documents
115
+ query = "What is F2LLM used for?"
116
+ documents = [
117
+ 'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
118
+ 'F2LLM is a model for computing text embeddings that can be used for various NLP tasks such as information retrieval, semantic search, and text classification.',
119
+ 'F2LLM 是 CodeFuse 开源的系列嵌入模型。',
120
+ 'F2LLM — это модель вычисления встраивания текста, которую можно использовать для различных задач НЛП, таких как поиск информации, семантический поиск и классификация текста.'
121
+ ]
122
+ # Encode the query and documents separately. The encode_query method uses the query prompt
123
+ query_embedding = model.encode_query(query)
124
+ document_embeddings = model.encode_document(documents)
125
+ print(query_embedding.shape, document_embeddings.shape)
126
+ # (2560,) (4, 2560)
127
+ # Compute cosine similarity between the query and documents
128
+ similarity = model.similarity(query_embedding, document_embeddings)
129
+ print(similarity)
130
+ # tensor([[0.6348, 0.8547, 0.7168, 0.8356]])
131
+ ```
132
+
133
+ ### With Transformers
134
+
135
+ Alternatively, encode text directly with the [Transformers](https://huggingface.co/docs/transformers/index) library:
136
+
137
+ ```python
138
+ from transformers import AutoModel, AutoTokenizer
139
+ import torch
140
+ import torch.nn.functional as F
141
+ model_path = "codefuse-ai/F2LLM-v2-4B"
142
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
143
+ model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map={'': 0})
144
+ query = "What is F2LLM used for?"
145
+ query_prompt = "Instruct: Given a question, retrieve passages that can help answer the question.\nQuery: "
146
+ documents = [
147
+ 'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
148
+ 'F2LLM is a model for computing text embeddings that can be used for various NLP tasks such as information retrieval, semantic search, and text classification.',
149
+ 'F2LLM 是 CodeFuse 开源的系列嵌入模型。',
150
+ 'F2LLM — это модель вычисления встраивания текста, которую можно использовать для различных задач НЛП, таких как поиск информации, семантический поиск и классификация текста.'
151
+ ]
152
+ def encode(sentences):
153
+     batch_size = len(sentences)
154
+     # the tokenizer will automatically add eos token
155
+     tokenized_inputs = tokenizer(sentences, padding=True, return_tensors='pt').to(model.device)
156
+     last_hidden_state = model(**tokenized_inputs).last_hidden_state
157
+     eos_positions = tokenized_inputs.attention_mask.sum(dim=1) - 1
158
+     embeddings = last_hidden_state[torch.arange(batch_size, device=model.device), eos_positions]
159
+     embeddings = F.normalize(embeddings, p=2, dim=1)
160
+     return embeddings
161
+ # Encode the query and documents
162
+ query_embedding = encode([query_prompt + query])
163
+ document_embeddings = encode(documents)
164
+ print(query_embedding.shape, document_embeddings.shape)
165
+ # torch.Size([1, 2560]) torch.Size([4, 2560])
166
+ # Compute cosine similarity between the query and documents
167
+ similarity = query_embedding @ document_embeddings.T
168
+ print(similarity)
169
+ # tensor([[0.6328, 0.8555, 0.7148, 0.8398]], device='cuda:0',
170
+ # dtype=torch.bfloat16, grad_fn=<MmBackward0>)
171
+ ```
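
The `encode` function in the Transformers example takes the hidden state at the last non-padding position of each sequence and L2-normalizes it. The sketch below replays that pooling step on a toy batch in plain Python, so the index arithmetic can be checked without loading the model. The helper names (`last_token_positions`, `pool_last_token`, `l2_normalize`) and the toy values are hypothetical and not part of the README. The sketch also assumes padding is on the right, which is the only case where `attention_mask.sum(dim=1) - 1` points at the last real token.

```python
import math

def last_token_positions(attention_mask):
    """Index of the last non-padding token per row (assumes right padding)."""
    return [sum(row) - 1 for row in attention_mask]

def pool_last_token(hidden_states, attention_mask):
    """Pick the hidden vector at the last non-padding position of each sequence."""
    positions = last_token_positions(attention_mask)
    return [sequence[pos] for sequence, pos in zip(hidden_states, positions)]

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# Toy batch: 2 sequences, 5 positions, hidden size 2.
# The first sequence has 3 real tokens followed by 2 padding positions.
attention_mask = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1],
]
hidden_states = [
    [[0.0, 1.0], [2.0, 0.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0], [0.0, 5.0]],
]

positions = last_token_positions(attention_mask)
embeddings = [l2_normalize(v) for v in pool_last_token(hidden_states, attention_mask)]

print(positions)   # [2, 4]
print(embeddings)  # [[0.6, 0.8], [0.0, 1.0]]
```

If a tokenizer pads on the left instead, the last real token sits at the final position of every row, so the position computation would need to change accordingly.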