---
license: apache-2.0
language:
- en
- es
- fr
- de
- ru
- nl
- vi
- zh
- hi
- id
- it
- ja
- pt
- pl
- ar
- ko
- uk
- th
- ca
- cs
- gl
- tl
- eu
- hy
- ne
- fa
- my
- lo
- km
- az
- tg
- sv
- si
- da
- tr
- sw
- fi
- ro
- 'no'
- hu
- he
- el
- sk
- bg
base_model:
- Qwen/Qwen3-8B
pipeline_tag: feature-extraction
library_name: transformers
tags:
- sentence-transformers
---

# F2LLM-v2-8B-Preview

**F2LLM-v2-8B-Preview** is a multilingual embedding model trained from Qwen3-8B on a corpus of **27 million samples** spanning **over 100 languages**. It is a preview version trained without instructions, intended to serve as a foundation for downstream embedding tasks and further fine-tuning.

## Usage

### With Sentence Transformers

To encode text with the [Sentence Transformers](https://www.sbert.net/) library:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "codefuse-ai/F2LLM-v2-8B-Preview",
    device="cuda:0",
    model_kwargs={"torch_dtype": "bfloat16"},
)

# A sample query and candidate documents (in English, Chinese, and Russian)
query = "What is F2LLM used for?"
documents = [
    'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
    'F2LLM is a model for computing text embeddings that can be used for various NLP tasks such as information retrieval, semantic search, and text classification.',
    'F2LLM 是 CodeFuse 开源的系列嵌入模型。',
    'F2LLM — это модель вычисления встраивания текста, которую можно использовать для различных задач НЛП, таких как поиск информации, семантический поиск и классификация текста.'
]

# Encode the query and documents separately; encode_query applies the query prompt
query_embedding = model.encode_query(query)
document_embeddings = model.encode_document(documents)
print(query_embedding.shape, document_embeddings.shape)
# (4096,) (4, 4096)

# Compute cosine similarity between the query and documents
similarity = model.similarity(query_embedding, document_embeddings)
print(similarity)
# tensor([[0.6329, 0.8003, 0.6361, 0.8267]])
```
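In a retrieval setting, these similarity scores are typically turned into a ranking over the candidate documents. A minimal self-contained sketch in plain PyTorch, using random stand-in tensors in place of real model outputs (real embeddings would come from `encode_query` / `encode_document` as above):

```python
import torch
import torch.nn.functional as F

# Stand-in embeddings with the model's 4096-dim output shape;
# in practice these come from encode_query / encode_document.
torch.manual_seed(0)
query_embedding = F.normalize(torch.randn(1, 4096), p=2, dim=1)
document_embeddings = F.normalize(torch.randn(4, 4096), p=2, dim=1)

# On L2-normalized vectors, cosine similarity reduces to a dot product
scores = (query_embedding @ document_embeddings.T).squeeze(0)

# Rank documents from most to least similar, and take the best match
ranking = torch.argsort(scores, descending=True)
top_doc = ranking[0].item()
print(ranking.tolist(), top_doc)
```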

### With Transformers

Or directly with the [Transformers](https://huggingface.co/docs/transformers/index) library:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_path = "codefuse-ai/F2LLM-v2-8B-Preview"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map={'': 0})

query = "What is F2LLM used for?"

documents = [
    'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
    'F2LLM is a model for computing text embeddings that can be used for various NLP tasks such as information retrieval, semantic search, and text classification.',
    'F2LLM 是 CodeFuse 开源的系列嵌入模型。',
    'F2LLM — это модель вычисления встраивания текста, которую можно использовать для различных задач НЛП, таких как поиск информации, семантический поиск и классификация текста.'
]

def encode(sentences):
    batch_size = len(sentences)
    # The tokenizer automatically appends the EOS token
    tokenized_inputs = tokenizer(sentences, padding=True, return_tensors='pt').to(model.device)
    last_hidden_state = model(**tokenized_inputs).last_hidden_state
    # Position of the last non-padded (EOS) token in each sequence
    eos_positions = tokenized_inputs.attention_mask.sum(dim=1) - 1
    # Pool by taking the hidden state at each sequence's EOS position
    embeddings = last_hidden_state[torch.arange(batch_size, device=model.device), eos_positions]
    embeddings = F.normalize(embeddings, p=2, dim=1)
    return embeddings

# Encode the query and documents
query_embedding = encode([query])
document_embeddings = encode(documents)
print(query_embedding.shape, document_embeddings.shape)
# torch.Size([1, 4096]) torch.Size([4, 4096])

# Compute cosine similarity between the query and documents
similarity = query_embedding @ document_embeddings.T
print(similarity)
# tensor([[0.6328, 0.8008, 0.6328, 0.8242]], device='cuda:0',
#        dtype=torch.bfloat16, grad_fn=<MmBackward0>)
```
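The pooling step above selects the hidden state at each sequence's last non-padded token via advanced indexing. That indexing can be checked in isolation on a synthetic batch; this sketch assumes right padding, which the `attention_mask.sum(dim=1) - 1` computation relies on:

```python
import torch

# Synthetic "last hidden state": batch of 3 sequences, max length 5, hidden size 4
batch_size, seq_len, hidden = 3, 5, 4
last_hidden_state = torch.arange(batch_size * seq_len * hidden, dtype=torch.float32)
last_hidden_state = last_hidden_state.reshape(batch_size, seq_len, hidden)

# Right-padded attention mask for sequences of real length 5, 3, and 4
attention_mask = torch.tensor([
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
])

# Position of the last real (EOS) token in each sequence
eos_positions = attention_mask.sum(dim=1) - 1
print(eos_positions.tolist())  # [4, 2, 3]

# Advanced indexing picks one hidden vector per sequence
embeddings = last_hidden_state[torch.arange(batch_size), eos_positions]
print(embeddings.shape)  # torch.Size([3, 4])
```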

## Future Releases

We are committed to the open-source community and will soon release:

- **The fine-tuned version:** optimized for downstream tasks, with state-of-the-art performance on MTEB.
- **The training data:** the data used to train F2LLM-v2, released to help advance the field of multilingual embeddings.

Stay tuned for more updates!