Commit e13aca7 by cointegrated · 1 Parent(s): c5d6c38

Add a README

  1. README.md +60 -0
README.md CHANGED
@@ -1,3 +1,63 @@
  ---
  license: cc-by-nc-4.0
  ---
+
+ # BLASER 2.0
+ [[Paper]]()
+
+ BLASER 2.0 is the new version of BLASER ([Chen et al., 2023](https://aclanthology.org/2023.acl-long.504/)),
+ a model for automatic evaluation of machine translation quality.
+
+ BLASER 2.0 is based on [SONAR](https://huggingface.co/facebook/SONAR) sentence embeddings
+ and works with both speech and text modalities.
+
+ This model predicts a similarity score for the translation based on the source sentence
+ and the reference translation. Its sibling model, [BLASER 2.0-QE](https://huggingface.co/facebook/blaser-2.0-qe),
+ does not use references.
+
+ Supervised BLASER models are trained to predict cross-lingual semantic similarity scores,
+ XSTS ([Licht et al., 2022](https://aclanthology.org/2022.amta-research.24/)),
+ on a scale where 1 corresponds to completely unrelated sentences and
+ 5 corresponds to fully semantically equivalent sentences.
+ The models' predictions, though, are unbounded and can occasionally fall outside this range.
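
Because the raw predictions are unbounded, downstream code that expects values on the 1 to 5 XSTS scale may want to clip them. The sketch below is optional post-processing and not part of BLASER 2.0 or SONAR; the helper name `clamp_to_xsts` and the constants are hypothetical, chosen only for illustration.

```python
XSTS_MIN, XSTS_MAX = 1.0, 5.0  # endpoints of the XSTS scale described above


def clamp_to_xsts(score: float) -> float:
    """Clip a raw prediction to the nominal XSTS range [1, 5].

    Hypothetical helper: the model itself returns unclipped scores.
    """
    return max(XSTS_MIN, min(XSTS_MAX, float(score)))


print(clamp_to_xsts(5.3))    # 5.0
print(clamp_to_xsts(4.688))  # 4.688
print(clamp_to_xsts(0.2))    # 1.0
```

Clipping discards the information that a prediction was out of range, so keeping the raw score alongside the clipped one may be preferable when comparing systems.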
+
+ ## Installation
+
+ See the SONAR GitHub [repo](https://github.com/facebookresearch/SONAR) for installation instructions.
+
+ ## Usage
+
+ BLASER 2.0 models accept 1024-dimensional SONAR sentence embeddings as inputs
+ and produce a single score as output.
+ The code below illustrates their usage with text embeddings:
+
+ ```python
+ from sonar.inference_pipelines.text import TextToEmbeddingModelPipeline
+ from sonar.models.blaser.loader import load_blaser_model
+
+ # Load the reference-based BLASER 2.0 model and the SONAR text encoder.
+ blaser = load_blaser_model("blaser_2_0_ref").eval()
+ text_embedder = TextToEmbeddingModelPipeline(encoder="text_sonar_basic_encoder", tokenizer="text_sonar_basic_encoder")
+
+ # Embed the source sentence, the reference translation, and the machine translation.
+ src_embs = text_embedder.predict(["Le chat s'assit sur le tapis."], source_lang="fra_Latn")
+ ref_embs = text_embedder.predict(["The cat sat on the mat."], source_lang="eng_Latn")
+ mt_embs = text_embedder.predict(["The cat sat down on the carpet."], source_lang="eng_Latn")
+ # Score the machine translation against the source and the reference.
+ print(blaser(src=src_embs, ref=ref_embs, mt=mt_embs).item())  # 4.688
+ ```
+
+ With BLASER 2.0 models, SONAR text and speech embeddings can be used interchangeably.
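
Since the models expect 1024-dimensional embeddings regardless of modality, a quick shape check before scoring can catch mismatched inputs early. The following is a minimal sketch in plain Python; `check_blaser_inputs` and `batch_size_of` are hypothetical helpers, not part of SONAR, and the requirement that the `src`, `ref`, and `mt` batches have equal sizes is an assumption based on the one-to-one inputs in the usage example.

```python
from typing import Sequence

SONAR_DIM = 1024  # embedding size stated in the Usage section


def batch_size_of(embs: Sequence[Sequence[float]], name: str) -> int:
    """Return the batch size after checking that every row has SONAR_DIM values."""
    for i, row in enumerate(embs):
        if len(row) != SONAR_DIM:
            raise ValueError(
                f"{name}[{i}] has {len(row)} dimensions, expected {SONAR_DIM}"
            )
    return len(embs)


def check_blaser_inputs(src, ref, mt) -> int:
    """Check that the three batches are aligned; return the shared batch size.

    Assumption: one source, one reference, and one translation per item.
    """
    sizes = {
        name: batch_size_of(embs, name)
        for name, embs in (("src", src), ("ref", ref), ("mt", mt))
    }
    if len(set(sizes.values())) != 1:
        raise ValueError(f"batch sizes differ: {sizes}")
    return sizes["src"]


# Toy stand-ins for real embeddings: two items per batch.
dummy = [[0.0] * SONAR_DIM for _ in range(2)]
print(check_blaser_inputs(dummy, dummy, dummy))  # 2
```

The helpers only rely on `len()` and iteration over rows, so they are written against nested lists here; whether they fit a given tensor type should be verified before use.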
+
+ ## Model details
+
+ - **Developed by:** Seamless Communication et al.
+ - **License:** CC-BY-NC 4.0
+ - **Citation:** If you use BLASER 2.0 in your work, please cite:
+
+ ```bibtex
+ @article{seamlessm4t2023,
+ title={SeamlessM4T---Massively Multilingual \& Multimodal Machine Translation},
+ author={{Seamless Communication} and Lo\"{i}c Barrault and Yu-An Chung and Mariano Cora Meglioli and David Dale and Ning Dong and Paul-Ambroise Duquenne and Hady Elsahar and Hongyu Gong and Kevin Heffernan and John Hoffman and Christopher Klaiber and Pengwei Li and Daniel Licht and Jean Maillard and Alice Rakotoarison and Kaushik Ram Sadagopan and Guillaume Wenzek and Ethan Ye and Bapi Akula and Peng-Jen Chen and Naji El Hachem and Brian Ellis and Gabriel Mejia Gonzalez and Justin Haaheim and Prangthip Hansanti and Russ Howes and Bernie Huang and Min-Jae Hwang and Hirofumi Inaguma and Somya Jain and Elahe Kalbassi and Amanda Kallet and Ilia Kulikov and Janice Lam and Daniel Li and Xutai Ma and Ruslan Mavlyutov and Benjamin Peloquin and Mohamed Ramadan and Abinesh Ramakrishnan and Anna Sun and Kevin Tran and Tuan Tran and Igor Tufanov and Vish Vogeti and Carleigh Wood and Yilin Yang and Bokai Yu and Pierre Andrews and Can Balioglu and Marta R. Costa-juss\`{a} and Onur \c{C}elebi and Maha Elbayad and Cynthia Gao and Francisco Guzm\'an and Justine Kao and Ann Lee and Alexandre Mourachko and Juan Pino and Sravya Popuri and Christophe Ropers and Safiyyah Saleem and Holger Schwenk and Paden Tomasello and Changhan Wang and Jeff Wang and Skyler Wang},
+ journal={ArXiv},
+ year={2023}
+ }
+ ```