	---
	license: apache-2.0
	language:
	- en
	- es
	- fr
	- de
	- ru
	- nl
	- vi
	- zh
	- hi
	- id
	- it
	- ja
	- pt
	- pl
	- ar
	- ko
	- uk
	- th
	- ca
	- cs
	- gl
	- tl
	- eu
	- hy
	- ne
	- fa
	- my
	- lo
	- km
	- az
	- tg
	- sv
	- si
	- da
	- tr
	- sw
	- fi
	- ro
	- 'no'
	- hu
	- he
	- el
	- sk
	- bg
	base_model:
	- Qwen/Qwen3-14B
	pipeline_tag: feature-extraction
	library_name: transformers
	tags:
	- sentence-transformers
	---

	# F2LLM-v2-14B-Preview

	F2LLM-v2-14B-Preview is a multilingual embedding model trained from Qwen3-14B on a corpus of 27 million samples, spanning over 100 natural and programming languages. It is a "preview" version trained without instructions and intended to serve as a foundation for downstream embedding tasks and further fine-tuning.

	## Usage

	### With Sentence Transformers

	To encode text with the [Sentence Transformers](https://www.sbert.net/) library:

	```python
	from sentence_transformers import SentenceTransformer

	model = SentenceTransformer("codefuse-ai/F2LLM-v2-14B-Preview", device="cuda:0", model_kwargs={"torch_dtype": "bfloat16"})

	# A sample query and a few candidate documents
	query = "What is F2LLM used for?"
	documents = [
	    'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
	    'F2LLM is a model for computing text embeddings that can be used for various NLP tasks such as information retrieval, semantic search, and text classification.',
	    'F2LLM 是 CodeFuse 开源的系列嵌入模型。',
	    'F2LLM — это модель вычисления встраивания текста, которую можно использовать для различных задач НЛП, таких как поиск информации, семантический поиск и классификация текста.'
	]

	# Encode the query and documents
	query_embedding = model.encode(query)
	document_embeddings = model.encode(documents)
	print(query_embedding.shape, document_embeddings.shape)
	# (5120,) (4, 5120)

	# Compute cosine similarity between the query and documents
	similarity = model.similarity(query_embedding, document_embeddings)
	print(similarity)
	# tensor([[0.5889, 0.7934, 0.6786, 0.7778]])
	```
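
A common next step is to rank the documents by their similarity to the query. The sketch below is a minimal illustration: `rank_by_similarity` is a hypothetical helper (not part of Sentence Transformers), the document labels are placeholders, and the scores are copied from the printed output of the example above rather than recomputed.

```python
def rank_by_similarity(scores, documents, top_k=None):
    """Return (document, score) pairs sorted from most to least similar.

    Hypothetical helper for illustration; `scores` is a flat list of
    similarity values, one per document.
    """
    if len(scores) != len(documents):
        raise ValueError("scores and documents must have the same length")
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Scores copied from the printed similarity tensor in the example above.
example_scores = [0.5889, 0.7934, 0.6786, 0.7778]
# Placeholder labels standing in for the four example documents.
example_documents = [
    "doc 0 (paper abstract)",
    "doc 1 (English description)",
    "doc 2 (Chinese description)",
    "doc 3 (Russian description)",
]

for document, score in rank_by_similarity(example_scores, example_documents, top_k=2):
    print(f"{score:.4f}  {document}")
# 0.7934  doc 1 (English description)
# 0.7778  doc 3 (Russian description)
```

With real model output, `similarity` has shape `(1, 4)`, so `similarity[0].tolist()` yields the flat list of scores this helper expects.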

	### With Transformers

	To use the model directly with the [Transformers](https://huggingface.co/docs/transformers/index) library:

	```python
	from transformers import AutoModel, AutoTokenizer
	import torch
	import torch.nn.functional as F


	model_path = "codefuse-ai/F2LLM-v2-14B-Preview"
	tokenizer = AutoTokenizer.from_pretrained(model_path)
	model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map={'': 0})

	query = "What is F2LLM used for?"

	documents = [
	    'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
	    'F2LLM is a model for computing text embeddings that can be used for various NLP tasks such as information retrieval, semantic search, and text classification.',
	    'F2LLM 是 CodeFuse 开源的系列嵌入模型。',
	    'F2LLM — это модель вычисления встраивания текста, которую можно использовать для различных задач НЛП, таких как поиск информации, семантический поиск и классификация текста.'
	]

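	def encode(sentences):
	    batch_size = len(sentences)
	    # The tokenizer automatically appends an EOS token to each sentence
	    tokenized_inputs = tokenizer(sentences, padding=True, return_tensors='pt').to(model.device)
	    last_hidden_state = model(**tokenized_inputs).last_hidden_state
	    # Index of the last non-padding token (the EOS token) in each sequence;
	    # this indexing assumes padding is added on the right
	    eos_positions = tokenized_inputs.attention_mask.sum(dim=1) - 1
	    # Use the hidden state at the EOS position as the sentence embedding
	    embeddings = last_hidden_state[torch.arange(batch_size, device=model.device), eos_positions]
	    # L2-normalize so that a dot product equals cosine similarity
	    embeddings = F.normalize(embeddings, p=2, dim=1)
	    return embeddings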

	# Encode the query and documents
	query_embedding = encode([query])
	document_embeddings = encode(documents)
	print(query_embedding.shape, document_embeddings.shape)
	# torch.Size([1, 5120]) torch.Size([4, 5120])

	# Compute cosine similarity between the query and documents
	similarity = query_embedding @ document_embeddings.T
	print(similarity)
	# tensor([[0.5898, 0.7930, 0.6797, 0.7773]], device='cuda:0',
	# dtype=torch.bfloat16, grad_fn=<MmBackward0>)
	```
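
The `encode` function above takes each sentence embedding from the hidden state of the last non-padding token, located through the attention mask. The toy sketch below uses made-up tensors in place of real model outputs to show how that indexing behaves; the shapes and values are assumptions chosen only for illustration.

```python
import torch
import torch.nn.functional as F

# Toy stand-in for model outputs: 2 sequences, 4 positions, hidden size 3.
# Values are made up for illustration only.
last_hidden_state = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)

# Right-padded attention mask: the first sequence has 2 real tokens, the second has 4.
attention_mask = torch.tensor([[1, 1, 0, 0],
                               [1, 1, 1, 1]])

# Index of the last real token in each sequence (the EOS token when the
# tokenizer appends one and pads on the right).
eos_positions = attention_mask.sum(dim=1) - 1

# Select one hidden vector per sequence, then L2-normalize it.
batch_indices = torch.arange(last_hidden_state.size(0))
pooled = last_hidden_state[batch_indices, eos_positions]
embeddings = F.normalize(pooled, p=2, dim=1)

print(eos_positions.tolist())  # [1, 3]
print(pooled.tolist())         # [[3.0, 4.0, 5.0], [21.0, 22.0, 23.0]]
```

If a tokenizer were configured to pad on the left instead, the sum-based index would no longer point at the final real token, and the last position of every sequence would have to be used instead.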

	## Intermediate Checkpoints

	To facilitate future research, we release intermediate checkpoints in the `intermediate_checkpoints` branch.
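
A branch of a Hub repository can be selected with the `revision` argument of `from_pretrained`. The sketch below assumes the branch root holds a checkpoint in the standard Transformers layout, which this README does not specify; `load_from_branch` is a hypothetical helper, and its `subfolder` argument would only be needed if the checkpoints turn out to be stored in subdirectories of the branch.

```python
MODEL_PATH = "codefuse-ai/F2LLM-v2-14B-Preview"


def load_from_branch(revision="intermediate_checkpoints", subfolder=""):
    """Load the tokenizer and model from a given branch of the model repository.

    Hypothetical helper. The branch layout is an assumption to verify:
    pass `subfolder` if checkpoints live in subdirectories of the branch.
    """
    # Imported lazily so that defining the helper does not load the libraries.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        MODEL_PATH, revision=revision, subfolder=subfolder
    )
    model = AutoModel.from_pretrained(
        MODEL_PATH,
        revision=revision,
        subfolder=subfolder,
        torch_dtype=torch.bfloat16,
        device_map={"": 0},
    )
    return tokenizer, model


# Usage (downloads the checkpoint and needs a GPU with enough memory):
# tokenizer, model = load_from_branch()
```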

	## Future Releases

	We are committed to the open-source community and will soon release:

	- The fine-tuned version: Optimized for downstream tasks, with state-of-the-art performance on MTEB.
	- The training data: We will release the data used to train F2LLM-v2 to help advance the field of multilingual embeddings.

	Stay tuned for more updates!