---
language:
- multilingual
- en
- ru
- de
- fr
- es
- zh
- ja
- ko
- ar
license: cc-by-nc-4.0
library_name: transformers
tags:
- sonar
- sentence-embeddings
- multilingual
- translation
- text-generation
- text2text-generation
base_model: facebook/nllb-200-distilled-1.3B
pipeline_tag: translation
---
# SONAR 200 Text Decoder (HuggingFace Port)
This is a port of [Meta's SONAR](https://github.com/facebookresearch/SONAR) text decoder from fairseq2 to HuggingFace Transformers format.
## Model Description
The SONAR decoder converts 1024-dimensional sentence embeddings back into text. It supports the same 202 languages as NLLB-200.
- **Original model:** [facebook/SONAR](https://huggingface.co/facebook/SONAR)
- **Encoder port:** [cointegrated/SONAR_200_text_encoder](https://huggingface.co/cointegrated/SONAR_200_text_encoder)
- **Code & Documentation:** [GitHub: SonarTransformers](https://github.com/raxtemur/SonarTransformers)
## Usage
### With the sonar_transformers library (recommended; see [GitHub: SonarTransformers](https://github.com/raxtemur/SonarTransformers))
```bash
pip install torch transformers sentencepiece
```
```python
from sonar_transformers import SonarPipeline
pipeline = SonarPipeline()
# Translation
result = pipeline.translate(
    ["Hello, how are you?"],
    source_lang="eng_Latn",
    target_lang="rus_Cyrl",
)
print(result) # ['Здравствуйте, как дела?']
# Encode text to embeddings
embeddings = pipeline.encode(["Hello world!"], source_lang="eng_Latn")
print(embeddings.shape) # torch.Size([1, 1024])
# Decode embeddings back to text
texts = pipeline.decode(embeddings, target_lang="eng_Latn")
print(texts) # ['Hello world!']
```
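SONAR representations are language-agnostic (see the citation below), so parallel sentences in different languages land close together in embedding space. A quick sanity check using the `encode` call above (the cosine-similarity comparison is an illustration, not part of the library API):
```python
import torch.nn.functional as F
# Parallel sentences in two languages should map to nearby vectors
en = pipeline.encode(["The weather is nice today."], source_lang="eng_Latn")
ru = pipeline.encode(["Сегодня хорошая погода."], source_lang="rus_Cyrl")
print(F.cosine_similarity(en, ru).item())  # expected to be close to 1.0
```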
### Direct usage with transformers
```python
import torch
from transformers import M2M100ForConditionalGeneration, NllbTokenizer
from transformers.modeling_outputs import BaseModelOutput
# Load model and tokenizer
model = M2M100ForConditionalGeneration.from_pretrained("raxtemur/SONAR_200_text_decoder")
tokenizer = NllbTokenizer.from_pretrained("raxtemur/SONAR_200_text_decoder")
# Your embeddings from SONAR encoder (1024-dim vectors)
embeddings = torch.randn(1, 1024) # Replace with actual embeddings
# Prepare encoder outputs: unsqueeze(1) turns [batch, 1024] into [batch, 1, 1024],
# i.e. a length-1 "encoder sequence" that the decoder cross-attends to
encoder_outputs = BaseModelOutput(last_hidden_state=embeddings.unsqueeze(1))
# Generate text
target_lang = "eng_Latn"
forced_bos_token_id = tokenizer.convert_tokens_to_ids(target_lang)
generated_ids = model.generate(
    encoder_outputs=encoder_outputs,
    forced_bos_token_id=forced_bos_token_id,
    max_length=128,
    num_beams=5,
)
text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(text)
```
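The `torch.randn` placeholder above can be replaced with real embeddings from the encoder port. A minimal sketch, assuming the mean-pooling recipe described on the [encoder's model card](https://huggingface.co/cointegrated/SONAR_200_text_encoder):
```python
import torch
from transformers import NllbTokenizer
from transformers.models.m2m_100.modeling_m2m_100 import M2M100Encoder

enc_name = "cointegrated/SONAR_200_text_encoder"
encoder = M2M100Encoder.from_pretrained(enc_name)
enc_tokenizer = NllbTokenizer.from_pretrained(enc_name)

def encode_mean_pool(texts, lang="eng_Latn"):
    enc_tokenizer.src_lang = lang
    batch = enc_tokenizer(texts, return_tensors="pt", padding=True)
    with torch.inference_mode():
        hidden = encoder(**batch).last_hidden_state  # [batch, seq, 1024]
        mask = batch.attention_mask.unsqueeze(-1)    # [batch, seq, 1]
        return (hidden * mask).sum(1) / mask.sum(1)  # [batch, 1024]

embeddings = encode_mean_pool(["Hello world!"])  # use instead of torch.randn
```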
## Compatibility
Tested against the original fairseq2 SONAR implementation:
| Test | Result |
|------|--------|
| Encoder cosine similarity | **1.000000** |
| Decoder output match | **Identical** |
| Round-trip (encode→decode) | **Works** |
| Translation | **Works** |
Example outputs:
- "Hello world!" → "Hello world!" ✓
- "This is a test sentence." → "This is a test sentence." ✓
- eng→rus: "Hello, how are you?" → "Здравствуйте, как дела?" ✓
- eng→deu: "Machine learning is powerful." → "Maschinelles Lernen ist mächtig." ✓
## Conversion Details
This model was converted from the original fairseq2 checkpoint using the following key mappings:
| fairseq2 | HuggingFace |
|----------|-------------|
| `decoder.decoder.layers.N.encoder_decoder_attn.*` | `model.decoder.layers.N.encoder_attn.*` |
| `decoder.decoder.layers.N.ffn.inner_proj.*` | `model.decoder.layers.N.fc1.*` |
| `decoder.decoder.layers.N.ffn.output_proj.*` | `model.decoder.layers.N.fc2.*` |
| `decoder.decoder.layers.N.ffn_layer_norm.*` | `model.decoder.layers.N.final_layer_norm.*` |
| `decoder.decoder_frontend.embed.weight` | `model.decoder.embed_tokens.weight` |
| `decoder.final_proj.weight` | `lm_head.weight` |
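For illustration, the renaming can be expressed as a small regex pass over the fairseq2 state dict. This is a sketch derived from the table above; the actual conversion script lives in the [GitHub repo](https://github.com/raxtemur/SonarTransformers):
```python
import re

# Rename rules (fairseq2 pattern -> HuggingFace replacement), per the table above
RULES = [
    (r"^decoder\.decoder\.layers\.(\d+)\.encoder_decoder_attn\.", r"model.decoder.layers.\1.encoder_attn."),
    (r"^decoder\.decoder\.layers\.(\d+)\.ffn\.inner_proj\.", r"model.decoder.layers.\1.fc1."),
    (r"^decoder\.decoder\.layers\.(\d+)\.ffn\.output_proj\.", r"model.decoder.layers.\1.fc2."),
    (r"^decoder\.decoder\.layers\.(\d+)\.ffn_layer_norm\.", r"model.decoder.layers.\1.final_layer_norm."),
    (r"^decoder\.decoder_frontend\.embed\.weight$", r"model.decoder.embed_tokens.weight"),
    (r"^decoder\.final_proj\.weight$", r"lm_head.weight"),
]

def rename_key(key: str) -> str:
    """Map a fairseq2 checkpoint key to its HuggingFace name."""
    for pattern, repl in RULES:
        new_key, n = re.subn(pattern, repl, key)
        if n:
            return new_key
    return key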
Special tokens were reordered:
- fairseq2: `[pad=0, unk=1, bos=2, eos=3]`
- HuggingFace: `[bos=0, pad=1, eos=2, unk=3]`
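Concretely, this amounts to permuting the first four rows of `embed_tokens.weight` and `lm_head.weight`. A minimal sketch, with the `NEW_TO_OLD` mapping derived from the two orderings above:
```python
import torch

# new_id -> old_id: bos<-2, pad<-0, eos<-3, unk<-1
NEW_TO_OLD = [2, 0, 3, 1]

def reorder_special_rows(weight: torch.Tensor) -> torch.Tensor:
    """Permute the first four vocabulary rows into HuggingFace order."""
    out = weight.clone()
    out[:4] = weight[NEW_TO_OLD]
    return out
```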
## Language Codes (FLORES-200)
Common codes:
- `eng_Latn` - English
- `rus_Cyrl` - Russian
- `deu_Latn` - German
- `fra_Latn` - French
- `spa_Latn` - Spanish
- `zho_Hans` - Chinese (Simplified)
- `jpn_Jpan` - Japanese
- `kor_Hang` - Korean
- `arb_Arab` - Arabic
Full list: 202 languages from FLORES-200.
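The full set of codes can also be read off the tokenizer itself (assuming this port keeps the standard NLLB special-token layout):
```python
from transformers import NllbTokenizer

tokenizer = NllbTokenizer.from_pretrained("raxtemur/SONAR_200_text_decoder")
# FLORES-200 language codes are registered as additional special tokens
lang_codes = [t for t in tokenizer.additional_special_tokens if "_" in t]
print(len(lang_codes))   # expected: 202
print(lang_codes[:5])
```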
## Citation
```bibtex
@article{Duquenne:2023:sonar_arxiv,
  author  = {Duquenne, Paul-Ambroise and Schwenk, Holger and Sagot, Benoît},
  title   = {SONAR: Sentence-Level Multimodal and Language-Agnostic Representations},
  journal = {arXiv preprint arXiv:2308.11466},
  year    = {2023},
}
```
## License
**CC-BY-NC-4.0** (inherited from original SONAR)
The model weights are derived from [Meta's SONAR](https://github.com/facebookresearch/SONAR) and are licensed under CC-BY-NC-4.0. Commercial use is not permitted.
## Acknowledgments
- [Meta AI](https://github.com/facebookresearch/SONAR) - Original SONAR
- [cointegrated](https://huggingface.co/cointegrated/SONAR_200_text_encoder) - Encoder conversion inspiration