---
language:
- multilingual
- en
- ru
- de
- fr
- es
- zh
- ja
- ko
- ar
license: cc-by-nc-4.0
library_name: transformers
tags:
- sonar
- sentence-embeddings
- multilingual
- translation
- text-generation
- text2text-generation
base_model: facebook/nllb-200-distilled-1.3B
pipeline_tag: translation
---

# SONAR 200 Text Decoder (HuggingFace Port)

This is a port of [Meta's SONAR](https://github.com/facebookresearch/SONAR) text decoder from fairseq2 to the HuggingFace Transformers format.

## Model Description

The SONAR decoder converts 1024-dimensional sentence embeddings back into text. It supports 202 languages (the same set as NLLB-200).

- **Original model:** [facebook/SONAR](https://huggingface.co/facebook/SONAR)
- **Encoder port:** [cointegrated/SONAR_200_text_encoder](https://huggingface.co/cointegrated/SONAR_200_text_encoder)
- **Code & documentation:** [GitHub: SonarTransformers](https://github.com/raxtemur/SonarTransformers)

## Usage

### With the sonar_transformers library (recommended; see [GitHub: SonarTransformers](https://github.com/raxtemur/SonarTransformers))

```bash
pip install torch transformers sentencepiece
```

```python
from sonar_transformers import SonarPipeline

pipeline = SonarPipeline()

# Translation
result = pipeline.translate(
    ["Hello, how are you?"],
    source_lang="eng_Latn",
    target_lang="rus_Cyrl"
)
print(result)  # ['Здравствуйте, как дела?']

# Encode text to embeddings
embeddings = pipeline.encode(["Hello world!"], source_lang="eng_Latn")
print(embeddings.shape)  # torch.Size([1, 1024])

# Decode embeddings back to text
texts = pipeline.decode(embeddings, target_lang="eng_Latn")
print(texts)  # ['Hello world!']
```

### Direct usage with transformers

```python
import torch
from transformers import M2M100ForConditionalGeneration, NllbTokenizer
from transformers.modeling_outputs import BaseModelOutput

# Load model and tokenizer
model = M2M100ForConditionalGeneration.from_pretrained("raxtemur/SONAR_200_text_decoder")
tokenizer = NllbTokenizer.from_pretrained("raxtemur/SONAR_200_text_decoder")

# Your embeddings from the SONAR encoder (1024-dim vectors)
embeddings = torch.randn(1, 1024)  # Replace with actual embeddings

# Wrap each sentence embedding as an encoder "sequence" of length 1
encoder_outputs = BaseModelOutput(last_hidden_state=embeddings.unsqueeze(1))

# Generate text
target_lang = "eng_Latn"
forced_bos_token_id = tokenizer.convert_tokens_to_ids(target_lang)
generated_ids = model.generate(
    encoder_outputs=encoder_outputs,
    forced_bos_token_id=forced_bos_token_id,
    max_length=128,
    num_beams=5
)
text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(text)
```

## Compatibility

Tested against the original fairseq2 SONAR:

| Test | Result |
|------|--------|
| Encoder cosine similarity | **1.000000** |
| Decoder output match | **Identical** |
| Round-trip (encode→decode) | **Works** |
| Translation | **Works** |

Example outputs:

- "Hello world!" → "Hello world!" ✓
- "This is a test sentence." → "This is a test sentence." ✓
- eng→rus: "Hello, how are you?" → "Здравствуйте, как дела?" ✓
- eng→deu: "Machine learning is powerful." → "Maschinelles Lernen ist mächtig." ✓
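The round-trip check behind the table above can be sketched in plain Python. The `cosine_similarity` helper below is generic and self-contained; the commented-out lines show how one *might* apply it with the `SonarPipeline` API from the usage section (model downloads required) — they are an illustrative assumption, not verified output.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (plain Python)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical vectors score 1.0 — the target for a faithful port
v = [0.1, -0.2, 0.3]
print(round(cosine_similarity(v, v), 6))  # 1.0

# With the real models loaded, the same check would look like:
#   emb = pipeline.encode(["Hello world!"], source_lang="eng_Latn")
#   text = pipeline.decode(emb, target_lang="eng_Latn")
#   emb2 = pipeline.encode(text, source_lang="eng_Latn")
#   cosine_similarity(emb[0].tolist(), emb2[0].tolist())  # expected ~1.0
```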
## Conversion Details

This model was converted from the original fairseq2 checkpoint using the following key mappings:

| fairseq2 | HuggingFace |
|----------|-------------|
| `decoder.decoder.layers.N.encoder_decoder_attn.*` | `model.decoder.layers.N.encoder_attn.*` |
| `decoder.decoder.layers.N.ffn.inner_proj.*` | `model.decoder.layers.N.fc1.*` |
| `decoder.decoder.layers.N.ffn.output_proj.*` | `model.decoder.layers.N.fc2.*` |
| `decoder.decoder.layers.N.ffn_layer_norm.*` | `model.decoder.layers.N.final_layer_norm.*` |
| `decoder.decoder_frontend.embed.weight` | `model.decoder.embed_tokens.weight` |
| `decoder.final_proj.weight` | `lm_head.weight` |

Special tokens were reordered:

- fairseq2: `[pad=0, unk=1, bos=2, eos=3]`
- HuggingFace: `[bos=0, pad=1, eos=2, unk=3]`

## Language Codes (FLORES-200)

Common codes:

- `eng_Latn` - English
- `rus_Cyrl` - Russian
- `deu_Latn` - German
- `fra_Latn` - French
- `spa_Latn` - Spanish
- `zho_Hans` - Chinese (Simplified)
- `jpn_Jpan` - Japanese
- `kor_Hang` - Korean
- `arb_Arab` - Arabic

Full list: all 202 languages from FLORES-200.

## Citation

```bibtex
@article{Duquenne:2023:sonar_arxiv,
  author  = {Duquenne, Paul-Ambroise and Schwenk, Holger and Sagot, Beno{\^i}t},
  title   = {SONAR: Sentence-Level Multimodal and Language-Agnostic Representations},
  journal = {arXiv preprint arXiv:2308.11466},
  year    = {2023},
}
```

## License

**CC-BY-NC-4.0** (inherited from the original SONAR)

The model weights are derived from [Meta's SONAR](https://github.com/facebookresearch/SONAR) and are licensed under CC-BY-NC-4.0. Commercial use is not permitted.

## Acknowledgments

- [Meta AI](https://github.com/facebookresearch/SONAR) - original SONAR
- [cointegrated](https://huggingface.co/cointegrated/SONAR_200_text_encoder) - encoder conversion inspiration
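To make the special-token reordering from the Conversion Details section concrete, here is a small self-contained sketch of the row permutation a converter would apply to the first four rows of the embedding matrix. The function name and toy data are illustrative, not the actual conversion script.

```python
# Index layouts from the Conversion Details section
FAIRSEQ2 = {"pad": 0, "unk": 1, "bos": 2, "eos": 3}
HUGGINGFACE = {"bos": 0, "pad": 1, "eos": 2, "unk": 3}

def special_token_permutation(src, dst):
    """Return perm such that new_rows[i] = old_rows[perm[i]]
    for the special-token rows of the embedding matrix."""
    perm = [0] * len(dst)
    for name, new_index in dst.items():
        perm[new_index] = src[name]
    return perm

perm = special_token_permutation(FAIRSEQ2, HUGGINGFACE)
print(perm)  # [2, 0, 3, 1]: HF slot 0 (bos) takes fairseq2 row 2, and so on

# Applying the permutation to toy "embedding rows":
old_rows = ["pad_vec", "unk_vec", "bos_vec", "eos_vec"]
new_rows = [old_rows[i] for i in perm]
print(new_rows)  # ['bos_vec', 'pad_vec', 'eos_vec', 'unk_vec']
```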