facebook/blaser-2.0-ref

cointegrated committed on Aug 21, 2023

Commit

e13aca7

1 Parent(s): c5d6c38

Add a README

README.md +60 -0


@@ -1,3 +1,63 @@
 ---
 license: cc-by-nc-4.0
 ---
+# BLASER 2.0
+[[Paper]]()
+BLASER 2.0 is the new version of BLASER ([Chen et al., 2023](https://aclanthology.org/2023.acl-long.504/)),
+a model for automatic evaluation of machine translation quality.
+BLASER 2.0 is based on [SONAR](https://huggingface.co/facebook/SONAR) sentence embeddings
+and works with both speech and text modalities.
+This model predicts a similarity score for a translation, given the source sentence
+and a reference translation. Its sibling model, [BLASER 2.0-QE](https://huggingface.co/facebook/blaser-2.0-qe),
+does not use references.
+Supervised BLASER models are trained to predict cross-lingual semantic similarity scores,
+XSTS ([Licht et al., 2022](https://aclanthology.org/2022.amta-research.24/)),
+on a scale where 1 corresponds to completely unrelated sentences and
+5 corresponds to fully semantically equivalent sentences.
+The models' predictions, though, are unbounded and can occasionally fall outside this range.
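Because the raw predictions are unbounded, downstream code that expects values on the 1 to 5 XSTS scale may want to clip them. The following is a minimal sketch of one way to do that; the helper `clip_to_xsts` and the sample scores are hypothetical illustrations, and clipping is a post-processing choice assumed here, not something the model or the SONAR library does.

```python
def clip_to_xsts(score: float, low: float = 1.0, high: float = 5.0) -> float:
    """Clamp a raw score into the nominal XSTS range [low, high].

    Hypothetical helper: not part of SONAR or BLASER.
    """
    return max(low, min(high, score))


# Illustrative raw scores (made up for this sketch, not real model outputs).
raw_scores = [4.688, 5.21, 0.73]
clipped = [clip_to_xsts(s) for s in raw_scores]
print(clipped)  # [4.688, 5.0, 1.0]
```

Clipping discards information about how far outside the range a prediction was, so it is probably best applied only for reporting, not before computing correlations with human judgments.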
+## Installation
+See the SONAR GitHub [repo](https://github.com/facebookresearch/SONAR) for installation instructions.
+## Usage
+BLASER 2.0 models accept 1024-dimensional SONAR sentence embeddings as inputs,
+and produce a single score as an output.
+The code below illustrates their usage with text embeddings:
+```Python
+from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
+from sonar.models.blaser.loader import load_blaser_model
+
+# Load the reference-based BLASER 2.0 model and switch it to evaluation mode.
+blaser = load_blaser_model("blaser_2_0_ref").eval()
+# The SONAR text encoder maps each sentence to a 1024-dimensional embedding.
+text_embedder = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder", tokenizer="text_sonar_basic_encoder")
+
+# Embed the source sentence, the reference translation, and the machine translation.
+src_embs = text_embedder.predict(["Le chat s'assit sur le tapis."], source_lang="fra_Latn")
+ref_embs = text_embedder.predict(["The cat sat on the mat."], source_lang="eng_Latn")
+mt_embs = text_embedder.predict(["The cat sat down on the carpet."], source_lang="eng_Latn")
+
+# The model returns one score per (src, ref, mt) triple.
+print(blaser(src=src_embs, ref=ref_embs, mt=mt_embs).item())  # 4.688
+```
+With BLASER 2.0 models, SONAR text and speech embeddings can be used interchangeably.
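As a rough sketch of what using the two modalities interchangeably could look like, the function below scores a spoken English translation against a text source and a text reference. It assumes the SONAR package exposes `SpeechToEmbeddingModelPipeline` with the `sonar_speech_encoder_eng` encoder, as described in the SONAR repository; the function name `score_speech_translation` and the audio path in the usage comment are hypothetical. The SONAR imports sit inside the function so that the definition itself does not require the models to be installed or downloaded.

```python
def score_speech_translation(src_text, src_lang, ref_text, ref_lang, mt_audio_path):
    """Score a spoken English translation against a text source and text reference.

    Hypothetical helper for illustration; it requires the SONAR package
    and downloads the models on first use.
    """
    # Imported lazily so that defining this function has no side effects.
    from sonar.inference_pipelines.speech import SpeechToEmbeddingModelPipeline
    from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
    from sonar.models.blaser.loader import load_blaser_model

    blaser = load_blaser_model("blaser_2_0_ref").eval()
    text_embedder = TextToEmbeddingModelPipeline(
        encoder="text_sonar_basic_encoder", tokenizer="text_sonar_basic_encoder"
    )
    # Assumption: the English speech encoder name follows the SONAR repository.
    speech_embedder = SpeechToEmbeddingModelPipeline(encoder="sonar_speech_encoder_eng")

    src_embs = text_embedder.predict([src_text], source_lang=src_lang)
    ref_embs = text_embedder.predict([ref_text], source_lang=ref_lang)
    # The speech embedding is passed to BLASER in place of a text embedding.
    mt_embs = speech_embedder.predict([mt_audio_path])
    return blaser(src=src_embs, ref=ref_embs, mt=mt_embs).item()


# Example call (the audio path is a placeholder, not a file shipped with the model):
# score_speech_translation(
#     "Le chat s'assit sur le tapis.", "fra_Latn",
#     "The cat sat on the mat.", "eng_Latn",
#     "path/to/translation.wav",
# )
```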
+## Model details
+- **Developed by:** Seamless Communication et al.
+- **License:** CC-BY-NC 4.0
+- **Citation:** If you use BLASER 2.0 in your work, please cite:
+```bibtex
+@article{seamlessm4t2023,
+  title={SeamlessM4T---Massively Multilingual \& Multimodal Machine Translation},
+  author={{Seamless Communication} and Lo\"{i}c Barrault and Yu-An Chung and Mariano Cora Meglioli and David Dale and Ning Dong and Paul-Ambroise Duquenne and Hady Elsahar and Hongyu Gong and Kevin Heffernan and John Hoffman and Christopher Klaiber and Pengwei Li and Daniel Licht and Jean Maillard and Alice Rakotoarison and Kaushik Ram Sadagopan and Guillaume Wenzek and Ethan Ye and Bapi Akula and Peng-Jen Chen and Naji El Hachem and Brian Ellis and Gabriel Mejia Gonzalez and Justin Haaheim and Prangthip Hansanti and Russ Howes and Bernie Huang and Min-Jae Hwang and Hirofumi Inaguma and Somya Jain and Elahe Kalbassi and Amanda Kallet and Ilia Kulikov and Janice Lam and Daniel Li and Xutai Ma and Ruslan Mavlyutov and Benjamin Peloquin and Mohamed Ramadan and Abinesh Ramakrishnan and Anna Sun and Kevin Tran and Tuan Tran and Igor Tufanov and Vish Vogeti and Carleigh Wood and Yilin Yang and Bokai Yu and Pierre Andrews and Can Balioglu and Marta R. Costa-juss\`{a} and Onur \c{C}elebi and Maha Elbayad and Cynthia Gao and Francisco Guzm\'an and Justine Kao and Ann Lee and Alexandre Mourachko and Juan Pino and Sravya Popuri and Christophe Ropers and Safiyyah Saleem and Holger Schwenk and Paden Tomasello and Changhan Wang and Jeff Wang and Skyler Wang},
+  journal={ArXiv},
+  year={2023}
+}
+```