---
license: apache-2.0
language:
- en
- zh
- ru
- es
- fr
- de
- ar
- nl
- vi
- hi
- ko
- ja
- it
- id
- pt
- pl
- tr
- da
- th
- sv
- fa
- uk
- cs
- 'no'
- el
- ca
- ro
- fi
- bg
- tl
- gl
- my
- hy
- km
- ne
- hu
- eu
- he
- lo
- sw
- az
- lv
- si
- sk
- tg
- et
- lt
- ms
- hr
- is
- sl
- sr
- ur
- bn
- af
- ta
- ka
- te
- ml
- mn
- nn
- kk
- cy
- mr
- sq
- nb
- mk
- jv
- kn
- eo
- la
- gu
- uz
- am
- oc
- be
- mg
- vo
- pa
- lb
- ht
- br
- ga
- xh
- tt
- bs
- yo
base_model:
- codefuse-ai/F2LLM-v2-0.6B-Preview
pipeline_tag: feature-extraction
library_name: transformers
tags:
- sentence-transformers
datasets:
- codefuse-ai/F2LLM-v2
---

# F2LLM-v2-80M
F2LLM-v2 is a family of general-purpose, multilingual embedding models available in 8 sizes, ranging from 80M to 14B parameters. Trained on a curated collection of 60 million publicly available, high-quality examples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages.
## Usage

### With Sentence Transformers
To encode text with the Sentence Transformers library:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("codefuse-ai/F2LLM-v2-80M", device="cuda:0", model_kwargs={"torch_dtype": "bfloat16"})

# A sample query and documents
query = "What is F2LLM used for?"
documents = [
    'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
    'F2LLM is a model for computing text embeddings that can be used for various NLP tasks such as information retrieval, semantic search, and text classification.',
    'F2LLM 是 CodeFuse 开源的系列嵌入模型。',
    'F2LLM — это модель вычисления встраивания текста, которую можно использовать для различных задач НЛП, таких как поиск информации, семантический поиск и классификация текста.'
]

# Encode the query and documents separately; encode_query prepends the query prompt.
query_embedding = model.encode_query(query)
document_embeddings = model.encode_document(documents)
print(query_embedding.shape, document_embeddings.shape)
# (320,) (4, 320)

# Compute cosine similarity between the query and documents
similarity = model.similarity(query_embedding, document_embeddings)
print(similarity)
# tensor([[0.6968, 0.7818, 0.7165, 0.8374]])
```
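Since the embeddings are L2-normalized, the cosine similarity that `model.similarity` computes by default reduces to a plain dot product. A minimal pure-Python sketch of that computation, using hypothetical toy vectors rather than real embeddings:

```python
import math

def l2_normalize(v):
    # Scale the vector to unit length.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    # Dot product of two unit vectors equals the cosine of the angle between them.
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

# Toy 4-dimensional "embeddings" (illustrative only, not model outputs).
query_vec = [0.1, 0.3, -0.2, 0.9]
doc_vecs = [[0.1, 0.2, -0.1, 0.8], [-0.5, 0.1, 0.7, -0.3]]
scores = [cosine_similarity(query_vec, d) for d in doc_vecs]

# For retrieval, rank documents by score and take the best match.
best = max(range(len(scores)), key=scores.__getitem__)
```

Because normalization happens at encoding time, ranking a query against a large document collection only needs one matrix-vector product.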
### With Transformers

Or directly with the Transformers library:
```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

model_path = "codefuse-ai/F2LLM-v2-80M"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map={'': 0})

query = "What is F2LLM used for?"
query_prompt = "Instruct: Given a question, retrieve passages that can help answer the question.\nQuery: "
documents = [
    'We present F2LLM, a family of fully open embedding LLMs that achieve a strong balance between model size, training data, and embedding performance.',
    'F2LLM is a model for computing text embeddings that can be used for various NLP tasks such as information retrieval, semantic search, and text classification.',
    'F2LLM 是 CodeFuse 开源的系列嵌入模型。',
    'F2LLM — это модель вычисления встраивания текста, которую можно использовать для различных задач НЛП, таких как поиск информации, семантический поиск и классификация текста.'
]

def encode(sentences):
    batch_size = len(sentences)
    # The tokenizer automatically appends an EOS token to each sentence.
    tokenized_inputs = tokenizer(sentences, padding=True, return_tensors='pt').to(model.device)
    last_hidden_state = model(**tokenized_inputs).last_hidden_state
    # Locate the EOS token of each sequence (assumes right padding).
    eos_positions = tokenized_inputs.attention_mask.sum(dim=1) - 1
    embeddings = last_hidden_state[torch.arange(batch_size, device=model.device), eos_positions]
    embeddings = F.normalize(embeddings, p=2, dim=1)
    return embeddings

# Encode the query and documents
query_embedding = encode([query_prompt + query])
document_embeddings = encode(documents)
print(query_embedding.shape, document_embeddings.shape)
# torch.Size([1, 320]) torch.Size([4, 320])

# Compute cosine similarity between the query and documents
similarity = query_embedding @ document_embeddings.T
print(similarity)
# tensor([[0.6914, 0.7812, 0.7148, 0.8359]], device='cuda:0',
#        dtype=torch.bfloat16, grad_fn=<MmBackward0>)
```
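The pooling step in `encode` selects the hidden state at each sequence's final real token, the EOS token appended by the tokenizer, by counting the 1s in the attention mask. A minimal pure-Python sketch of that indexing logic, using toy masks and hidden states rather than real model outputs:

```python
def eos_positions(attention_mask):
    # For each sequence, the EOS token sits at the last position whose
    # attention-mask entry is 1, i.e. (number of real tokens) - 1.
    return [sum(row) - 1 for row in attention_mask]

def pool_last_token(hidden_states, attention_mask):
    # Pick the hidden vector at the EOS position of each sequence.
    return [states[pos]
            for states, pos in zip(hidden_states, eos_positions(attention_mask))]

# Two sequences, right-padded to length 4; hidden size 2 (toy values).
mask = [[1, 1, 1, 0],
        [1, 1, 1, 1]]
hidden = [[[0.0, 0.0], [0.1, 0.1], [0.9, 0.9], [0.0, 0.0]],
          [[0.2, 0.2], [0.3, 0.3], [0.4, 0.4], [0.8, 0.8]]]
pooled = pool_last_token(hidden, mask)
# pooled[0] comes from position 2, pooled[1] from position 3
```

Note that `sum(mask) - 1` only points at the EOS token when sequences are right-padded; with left padding, this indexing would land on a padding position instead.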