| | --- |
| | language: |
| | - af |
| | - am |
| | - ar |
| | - as |
| | - az |
| | - be |
| | - bg |
| | - bn |
| | - bo |
| | - bs |
| | - ca |
| | - ceb |
| | - co |
| | - cs |
| | - cy |
| | - da |
| | - de |
| | - el |
| | - en |
| | - eo |
| | - es |
| | - et |
| | - eu |
| | - fa |
| | - fi |
| | - fr |
| | - fy |
| | - ga |
| | - gd |
| | - gl |
| | - gu |
| | - ha |
| | - haw |
| | - he |
| | - hi |
| | - hmn |
| | - hr |
| | - ht |
| | - hu |
| | - hy |
| | - id |
| | - ig |
| | - is |
| | - it |
| | - ja |
| | - jv |
| | - ka |
| | - kk |
| | - km |
| | - kn |
| | - ko |
| | - ku |
| | - ky |
| | - la |
| | - lb |
| | - lo |
| | - lt |
| | - lv |
| | - mg |
| | - mi |
| | - mk |
| | - ml |
| | - mn |
| | - mr |
| | - ms |
| | - mt |
| | - my |
| | - ne |
| | - nl |
| | - no |
| | - ny |
| | - or |
| | - pa |
| | - pl |
| | - pt |
| | - ro |
| | - ru |
| | - rw |
| | - si |
| | - sk |
| | - sl |
| | - sm |
| | - sn |
| | - so |
| | - sq |
| | - sr |
| | - st |
| | - su |
| | - sv |
| | - sw |
| | - ta |
| | - te |
| | - tg |
| | - th |
| | - tk |
| | - tl |
| | - tr |
| | - tt |
| | - ug |
| | - uk |
| | - ur |
| | - uz |
| | - vi |
| | - wo |
| | - xh |
| | - yi |
| | - yo |
| | - zh |
| | - zu |
| | tags: |
| | - ctranslate2 |
| | - int8 |
| | - float16 |
| | - bert |
| | - sentence_embedding |
| | - multilingual |
| | - google |
| | - sentence-similarity |
| | license: apache-2.0 |
| | datasets: |
| | - CommonCrawl |
| | - Wikipedia |
| | --- |
| | # Fast-Inference with CTranslate2 |
| | Speed up inference while reducing memory by 2x-4x using int8 inference in C++ on CPU or GPU. |
| |
|
| | Quantized version of [setu4993/LaBSE](https://huggingface.co/setu4993/LaBSE). |
| | ```bash |
| | pip install "hf-hub-ctranslate2>=2.12.0" "ctranslate2>=3.17.1" |
| | ``` |
| |
|
| | ```python |
| | # from transformers import AutoTokenizer |
| | model_name = "michaelfeil/ct2fast-LaBSE" |
| | model_name_orig="setu4993/LaBSE" |
| | |
| | from hf_hub_ctranslate2 import EncoderCT2fromHfHub |
| | model = EncoderCT2fromHfHub( |
| | # load in int8 on CUDA |
| | model_name_or_path=model_name, |
| | device="cuda", |
| | compute_type="int8_float16" |
| | ) |
| | outputs = model.generate( |
| | text=["I like soccer", "I like tennis", "The eiffel tower is in Paris"], |
| | max_length=64, |
| | ) # perform downstream tasks on outputs |
| | outputs["pooler_output"] |
| | outputs["last_hidden_state"] |
| | outputs["attention_mask"] |
| | |
| | # alternative, use SentenceTransformer Mix-In |
| | # for end-to-end Sentence embeddings generation |
| | # (not pulling from this CT2fast-HF repo) |
| | |
| | from hf_hub_ctranslate2 import CT2SentenceTransformer |
| | model = CT2SentenceTransformer( |
| | model_name_orig, compute_type="int8_float16", device="cuda" |
| | ) |
| | embeddings = model.encode( |
| | ["I like soccer", "I like tennis", "The eiffel tower is in Paris"], |
| | batch_size=32, |
| | convert_to_numpy=True, |
| | normalize_embeddings=True, |
| | ) |
| | print(embeddings.shape, embeddings) |
| | scores = (embeddings @ embeddings.T) * 100 |
| | |
| | # Hint: you can also host this code behind a REST API |
| | # using github.com/michaelfeil/infinity |
| | |
| | |
| | ``` |
| |
|
| | Checkpoint compatible with [ctranslate2>=3.17.1](https://github.com/OpenNMT/CTranslate2) |
| | and [hf-hub-ctranslate2>=2.12.0](https://github.com/michaelfeil/hf-hub-ctranslate2) |
| | - `compute_type=int8_float16` for `device="cuda"` |
| | - `compute_type=int8` for `device="cpu"` |
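
The device-to-compute-type pairing listed above can be captured in a small helper. This is only a sketch: `pick_compute_type` is a hypothetical name that is not part of `hf-hub-ctranslate2` or `ctranslate2`, and it assumes the device is given as the plain string `"cuda"` or `"cpu"`, as in the example code on this card.

```python
def pick_compute_type(device: str) -> str:
    """Return the compute type this card lists for the given device.

    Hypothetical helper for illustration; it only encodes the two
    pairings stated in this README.
    """
    if device == "cuda":
        return "int8_float16"
    if device == "cpu":
        return "int8"
    # Anything else is outside what this card documents.
    raise ValueError(f"unsupported device string: {device!r}")


for device in ("cuda", "cpu"):
    print(device, "->", pick_compute_type(device))
```

The returned string could then be passed as the `compute_type` argument when constructing `EncoderCT2fromHfHub` or `CT2SentenceTransformer`, alongside the matching `device` value.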
| |
|
| | Converted on 2023-10-13 using |
| | ``` |
| | LLama-2 -> removed <pad> token. |
| | ``` |
| |
|
| | # Licence and other remarks: |
| | This is just a quantized version. Licence conditions are intended to be identical to those of the original Hugging Face repo. |
| |
|
| | # Original description |
| | |
| | |
| | # LaBSE |
| |
|
| | ## Model description |
| |
|
| | Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model trained for sentence embedding for 109 languages. The pre-training process combines masked language modeling with translation language modeling. The model is useful for getting multilingual sentence embeddings and for bi-text retrieval. |
| |
|
| | - Model: [HuggingFace's model hub](https://huggingface.co/setu4993/LaBSE). |
| | - Paper: [arXiv](https://arxiv.org/abs/2007.01852). |
| | - Original model: [TensorFlow Hub](https://tfhub.dev/google/LaBSE/2). |
| | - Blog post: [Google AI Blog](https://ai.googleblog.com/2020/08/language-agnostic-bert-sentence.html). |
| | - Conversion from TensorFlow to PyTorch: [GitHub](https://github.com/setu4993/convert-labse-tf-pt). |
| |
|
| | This model is migrated from the v2 model on TF Hub, which uses dict-based input. The embeddings produced by both versions of the model are [equivalent](https://github.com/setu4993/convert-labse-tf-pt/blob/ec3a019159a54ed6493181a64486c2808c01f216/tests/test_conversion.py#L31). |
| |
|
| | ## Usage |
| |
|
| | Using the model: |
| |
|
| | ```python |
| | import torch |
| | from transformers import BertModel, BertTokenizerFast |
| | |
| | |
| | tokenizer = BertTokenizerFast.from_pretrained("setu4993/LaBSE") |
| | model = BertModel.from_pretrained("setu4993/LaBSE") |
| | model = model.eval() |
| | |
| | english_sentences = [ |
| | "dog", |
| | "Puppies are nice.", |
| | "I enjoy taking long walks along the beach with my dog.", |
| | ] |
| | english_inputs = tokenizer(english_sentences, return_tensors="pt", padding=True) |
| | |
| | with torch.no_grad(): |
| | english_outputs = model(**english_inputs) |
| | ``` |
| |
|
| | To get the sentence embeddings, use the pooler output: |
| |
|
| | ```python |
| | english_embeddings = english_outputs.pooler_output |
| | ``` |
| |
|
| | Output for other languages: |
| |
|
| | ```python |
| | italian_sentences = [ |
| | "cane", |
| | "I cuccioli sono carini.", |
| | "Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane.", |
| | ] |
| | japanese_sentences = ["犬", "子犬はいいです", "私は犬と一緒にビーチを散歩するのが好きです"] |
| | italian_inputs = tokenizer(italian_sentences, return_tensors="pt", padding=True) |
| | japanese_inputs = tokenizer(japanese_sentences, return_tensors="pt", padding=True) |
| | |
| | with torch.no_grad(): |
| | italian_outputs = model(**italian_inputs) |
| | japanese_outputs = model(**japanese_inputs) |
| | |
| | italian_embeddings = italian_outputs.pooler_output |
| | japanese_embeddings = japanese_outputs.pooler_output |
| | ``` |
| |
|
| | For similarity between sentences, an L2-norm is recommended before calculating the similarity: |
| |
|
| | ```python |
| | import torch.nn.functional as F |
| | |
| | |
| | def similarity(embeddings_1, embeddings_2): |
| | normalized_embeddings_1 = F.normalize(embeddings_1, p=2) |
| | normalized_embeddings_2 = F.normalize(embeddings_2, p=2) |
| | return torch.matmul( |
| | normalized_embeddings_1, normalized_embeddings_2.transpose(0, 1) |
| | ) |
| | |
| | |
| | print(similarity(english_embeddings, italian_embeddings)) |
| | print(similarity(english_embeddings, japanese_embeddings)) |
| | print(similarity(italian_embeddings, japanese_embeddings)) |
| | ``` |
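
The model description mentions bi-text retrieval. A minimal sketch of that idea, built on the same normalize-then-dot-product similarity, is shown below using plain Python lists so it runs without PyTorch. The helper names (`l2_normalize`, `similarity_matrix`, `best_matches`) and the toy 3-dimensional vectors are made up for illustration; real LaBSE embeddings are much higher-dimensional and would come from `pooler_output`.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def similarity_matrix(rows_a, rows_b):
    """Cosine similarity of every row in rows_a against every row in rows_b."""
    a = [l2_normalize(v) for v in rows_a]
    b = [l2_normalize(v) for v in rows_b]
    return [[sum(x * y for x, y in zip(u, v)) for v in b] for u in a]


def best_matches(rows_a, rows_b):
    """For each row in rows_a, return the index of the most similar row in rows_b."""
    scores = similarity_matrix(rows_a, rows_b)
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy stand-ins for sentence embeddings in a source language.
source = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
# Roughly the same directions, shuffled and rescaled, standing in for translations.
target = [[0.0, 0.0, 0.5], [10.0, 1.0, 0.0], [0.0, 4.0, 0.1]]

print(best_matches(source, target))  # prints [1, 2, 0]
```

Because both sides are normalized first, the differing vector lengths in `target` do not affect which candidate wins; only the direction of each embedding matters.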
| |
|
| | ## Details |
| |
|
| | Details about data, training, evaluation and performance metrics are available in the [original paper](https://arxiv.org/abs/2007.01852). |
| |
|
| | ### BibTeX entry and citation info |
| |
|
| | ```bibtex |
| | @misc{feng2020languageagnostic, |
| | title={Language-agnostic BERT Sentence Embedding}, |
| | author={Fangxiaoyu Feng and Yinfei Yang and Daniel Cer and Naveen Arivazhagan and Wei Wang}, |
| | year={2020}, |
| | eprint={2007.01852}, |
| | archivePrefix={arXiv}, |
| | primaryClass={cs.CL} |
| | } |
| | ``` |
| |
|