fzengin18 committed
Commit e1fe4bd · verified · 1 Parent(s): a60203b

Add dataset metadata to model card

Files changed (1)
  1. README.md +11 -0
README.md CHANGED

```diff
@@ -11,6 +11,9 @@ tags:
 - turkish
 - english
 - bilingual
+datasets:
+- wikimedia/wikipedia
+- Helsinki-NLP/opus-100
 ---
 
 # Multrenizer
@@ -255,6 +258,14 @@ The released artifact is trained with the default file-based interleave in `trai
 
 Corpus collection is Turkish-forward, and code-switching examples are generated from OPUS parallel pairs during data preparation.
 
+Exact source configs used during corpus preparation:
+
+- `wikimedia/wikipedia` with `20231101.tr`
+- `wikimedia/wikipedia` with `20231101.en`
+- `Helsinki-NLP/opus-100` with `en-tr`
+
+The synthetic code-switching stream is generated locally from OPUS-100 parallel pairs, so it does not appear as a separate Hugging Face dataset entry.
+
 ### Vocabulary Budget
 
 Multrenizer is designed around a `26,000` target vocabulary, with a fixed budget reserved for always-preserved tokens:
```
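The actual data-preparation script is not shown on this page, so the following is only a rough sketch: it captures the listed `(repo, config)` pairs as constants and illustrates, with a hypothetical `make_code_switched` helper, how a code-switched line could be spliced locally from an OPUS-100 parallel pair.

```python
import random

# Source configs listed in the card: (dataset repo, config name).
SOURCES = [
    ("wikimedia/wikipedia", "20231101.tr"),
    ("wikimedia/wikipedia", "20231101.en"),
    ("Helsinki-NLP/opus-100", "en-tr"),
]

def make_code_switched(en: str, tr: str, rng: random.Random) -> str:
    """Hypothetical splice: take a word-level prefix from one language
    and the suffix from the other, at a random cut point."""
    en_words, tr_words = en.split(), tr.split()
    # Cut point leaves at least one word on each side of the splice.
    cut = rng.randint(1, min(len(en_words), len(tr_words)) - 1)
    if rng.random() < 0.5:
        return " ".join(en_words[:cut] + tr_words[cut:])
    return " ".join(tr_words[:cut] + en_words[cut:])

rng = random.Random(0)
print(make_code_switched("the weather is nice today", "hava bugün çok güzel", rng))
```

Each `(repo, config)` tuple corresponds to a `datasets.load_dataset(repo, config)` call; the splice function is purely illustrative of the locally generated code-switching stream described above, not the card author's actual method.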