---
language:
- multilingual
- en
- ru
- de
- fr
- es
- zh
- ja
- ko
- ar
license: cc-by-nc-4.0
library_name: transformers
tags:
- sonar
- sentence-embeddings
- multilingual
- translation
- text-generation
- text2text-generation
base_model: facebook/nllb-200-distilled-1.3B
pipeline_tag: translation
---

# SONAR 200 Text Decoder (HuggingFace Port)

This is a port of [Meta's SONAR](https://github.com/facebookresearch/SONAR) text decoder from fairseq2 to HuggingFace Transformers format.

## Model Description

The SONAR decoder converts 1024-dimensional SONAR sentence embeddings back into text. It supports the same 202 languages as NLLB-200.

- **Original model:** [facebook/SONAR](https://huggingface.co/facebook/SONAR)
- **Encoder port:** [cointegrated/SONAR_200_text_encoder](https://huggingface.co/cointegrated/SONAR_200_text_encoder)
- **Code & Documentation:** [GitHub: SonarTransformers](https://github.com/raxtemur/SonarTransformers)

## Usage

### With the sonar_transformers library (recommended; see [GitHub: SonarTransformers](https://github.com/raxtemur/SonarTransformers))

```bash
# Install the dependencies; sonar_transformers itself is provided by the GitHub repo linked above.
pip install torch transformers sentencepiece
```

```python
from sonar_transformers import SonarPipeline

pipeline = SonarPipeline()

# Translation
result = pipeline.translate(
    ["Hello, how are you?"],
    source_lang="eng_Latn",
    target_lang="rus_Cyrl"
)
print(result)  # ['Здравствуйте, как дела?']

# Encode text to embeddings
embeddings = pipeline.encode(["Hello world!"], source_lang="eng_Latn")
print(embeddings.shape)  # torch.Size([1, 1024])

# Decode embeddings back to text
texts = pipeline.decode(embeddings, target_lang="eng_Latn")
print(texts)  # ['Hello world!']
```

### Direct usage with transformers

```python
import torch
from transformers import M2M100ForConditionalGeneration, NllbTokenizer
from transformers.modeling_outputs import BaseModelOutput

# Load model and tokenizer
model = M2M100ForConditionalGeneration.from_pretrained("raxtemur/SONAR_200_text_decoder")
tokenizer = NllbTokenizer.from_pretrained("raxtemur/SONAR_200_text_decoder")

# Your embeddings from the SONAR encoder (1024-dim vectors)
embeddings = torch.randn(1, 1024)  # Replace with actual embeddings

# Wrap each 1024-dim sentence vector as a one-token "encoder sequence"
# of shape (batch, seq_len=1, 1024), which is what the decoder cross-attends to
encoder_outputs = BaseModelOutput(last_hidden_state=embeddings.unsqueeze(1))

# Generate text
target_lang = "eng_Latn"
forced_bos_token_id = tokenizer.convert_tokens_to_ids(target_lang)

generated_ids = model.generate(
    encoder_outputs=encoder_outputs,
    forced_bos_token_id=forced_bos_token_id,
    max_length=128,
    num_beams=5
)

text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(text)
```
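
To feed real embeddings instead of the `torch.randn` placeholder, you can pair this decoder with the companion encoder port. The sketch below rests on assumptions: it presumes the encoder port loads as a plain `M2M100Encoder` and that SONAR embeddings are the attention-mask-weighted mean of its last hidden states, as described on the `cointegrated/SONAR_200_text_encoder` model card; verify against that card before relying on it.

```python
import torch
from transformers import NllbTokenizer
from transformers.models.m2m_100.modeling_m2m_100 import M2M100Encoder

# Assumption: the encoder port is a plain M2M100 encoder whose mean-pooled
# last hidden states are the 1024-dim SONAR sentence embeddings.
enc_name = "cointegrated/SONAR_200_text_encoder"
encoder = M2M100Encoder.from_pretrained(enc_name)
enc_tokenizer = NllbTokenizer.from_pretrained(enc_name)

enc_tokenizer.src_lang = "eng_Latn"
batch = enc_tokenizer(["Hello world!"], return_tensors="pt", padding=True)
with torch.inference_mode():
    hidden = encoder(**batch).last_hidden_state        # (batch, seq_len, 1024)
    mask = batch.attention_mask.unsqueeze(-1)          # (batch, seq_len, 1)
    embeddings = (hidden * mask).sum(1) / mask.sum(1)  # mean over non-pad tokens

print(embeddings.shape)  # torch.Size([1, 1024])
```

These `embeddings` drop straight into the generation snippet above in place of the random tensor.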

## Compatibility

Tested against the original fairseq2 SONAR implementation:

| Test | Result |
|------|--------|
| Encoder cosine similarity | **1.000000** |
| Decoder output match | **Identical** |
| Round-trip (encode→decode) | **Works** |
| Translation | **Works** |

Example outputs:
- "Hello world!" → "Hello world!" ✓
- "This is a test sentence." → "This is a test sentence." ✓
- eng→rus: "Hello, how are you?" → "Здравствуйте, как дела?" ✓
- eng→deu: "Machine learning is powerful." → "Maschinelles Lernen ist mächtig." ✓
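
This round-trip behaviour is easy to re-check yourself with the `SonarPipeline` API from the usage section above:

```python
# Round-trip: encode a sentence, decode it back, and compare.
emb = pipeline.encode(["This is a test sentence."], source_lang="eng_Latn")
back = pipeline.decode(emb, target_lang="eng_Latn")
assert back == ["This is a test sentence."], back
```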

## Conversion Details

This model was converted from the original fairseq2 checkpoint using the following key mappings:

| fairseq2 | HuggingFace |
|----------|-------------|
| `decoder.decoder.layers.N.encoder_decoder_attn.*` | `model.decoder.layers.N.encoder_attn.*` |
| `decoder.decoder.layers.N.ffn.inner_proj.*` | `model.decoder.layers.N.fc1.*` |
| `decoder.decoder.layers.N.ffn.output_proj.*` | `model.decoder.layers.N.fc2.*` |
| `decoder.decoder.layers.N.ffn_layer_norm.*` | `model.decoder.layers.N.final_layer_norm.*` |
| `decoder.decoder_frontend.embed.weight` | `model.decoder.embed_tokens.weight` |
| `decoder.final_proj.weight` | `lm_head.weight` |

Special token IDs were also reordered to match the HuggingFace/NLLB convention:
- fairseq2: `[pad=0, unk=1, bos=2, eos=3]`
- HuggingFace: `[bos=0, pad=1, eos=2, unk=3]`
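
For illustration, here is a minimal sketch of what the table and the token reorder imply. The regex patterns follow the mapping table above, but the helper functions are hypothetical, not the project's actual conversion script:

```python
import re
import torch

# Hypothetical renaming rules derived from the mapping table above.
RENAMES = [
    (r"^decoder\.decoder\.layers\.(\d+)\.encoder_decoder_attn\.", r"model.decoder.layers.\1.encoder_attn."),
    (r"^decoder\.decoder\.layers\.(\d+)\.ffn\.inner_proj\.",      r"model.decoder.layers.\1.fc1."),
    (r"^decoder\.decoder\.layers\.(\d+)\.ffn\.output_proj\.",     r"model.decoder.layers.\1.fc2."),
    (r"^decoder\.decoder\.layers\.(\d+)\.ffn_layer_norm\.",       r"model.decoder.layers.\1.final_layer_norm."),
    (r"^decoder\.decoder_frontend\.embed\.weight$",               "model.decoder.embed_tokens.weight"),
    (r"^decoder\.final_proj\.weight$",                            "lm_head.weight"),
]

def convert_state_dict(fs2_state: dict) -> dict:
    hf_state = {}
    for key, tensor in fs2_state.items():
        for pattern, repl in RENAMES:
            key = re.sub(pattern, repl, key)
        hf_state[key] = tensor
    return hf_state

# Special-token reorder: HF rows [bos, pad, eos, unk] come from
# fairseq2 rows [2, 0, 3, 1] (fairseq2 order: pad, unk, bos, eos).
def reorder_special_tokens(emb: torch.Tensor) -> torch.Tensor:
    emb = emb.clone()
    emb[:4] = emb[[2, 0, 3, 1]]
    return emb
```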

## Language Codes (FLORES-200)

Common codes:
- `eng_Latn` - English
- `rus_Cyrl` - Russian
- `deu_Latn` - German
- `fra_Latn` - French
- `spa_Latn` - Spanish
- `zho_Hans` - Chinese (Simplified)
- `jpn_Jpan` - Japanese
- `kor_Hang` - Korean
- `arb_Arab` - Arabic

Full list: 202 languages from FLORES-200.
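
Any of these codes can be passed as `source_lang` / `target_lang`, or turned into a `forced_bos_token_id` for the direct-transformers route. Continuing the direct-usage snippet above (rebuilding `encoder_outputs` per call, since `generate` may expand it in place for beam search):

```python
from transformers.modeling_outputs import BaseModelOutput

# Decode the same SONAR embedding into several languages.
for lang in ["eng_Latn", "deu_Latn", "jpn_Jpan"]:
    enc_out = BaseModelOutput(last_hidden_state=embeddings.unsqueeze(1))
    ids = model.generate(
        encoder_outputs=enc_out,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(lang),
        max_length=128,
        num_beams=5,
    )
    print(lang, tokenizer.batch_decode(ids, skip_special_tokens=True)[0])
```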

## Citation

```bibtex
@article{Duquenne:2023:sonar_arxiv,
  author = {Duquenne, Paul-Ambroise and Schwenk, Holger and Sagot, Beno{\^i}t},
  title = {SONAR: Sentence-Level Multimodal and Language-Agnostic Representations},
  journal = {arXiv preprint arXiv:2308.11466},
  year = {2023},
}
```

## License

**CC-BY-NC-4.0** (inherited from original SONAR)

The model weights are derived from [Meta's SONAR](https://github.com/facebookresearch/SONAR) and are licensed under CC-BY-NC-4.0. Commercial use is not permitted.

## Acknowledgments

- [Meta AI](https://github.com/facebookresearch/SONAR) - Original SONAR
- [cointegrated](https://huggingface.co/cointegrated/SONAR_200_text_encoder) - Encoder conversion inspiration