fzengin18 committed
Commit e1fe4bd · verified · 1 Parent(s): a60203b

Add dataset metadata to model card

Files changed (1)
  1. README.md +11 -0
README.md CHANGED

```diff
@@ -11,6 +11,9 @@ tags:
 - turkish
 - english
 - bilingual
+datasets:
+- wikimedia/wikipedia
+- Helsinki-NLP/opus-100
 ---
 
 # Multrenizer
@@ -255,6 +258,14 @@ The released artifact is trained with the default file-based interleave in `trai
 
 Corpus collection is Turkish-forward, and code-switching examples are generated from OPUS parallel pairs during data preparation.
 
+Exact source configs used during corpus preparation:
+
+- `wikimedia/wikipedia` with `20231101.tr`
+- `wikimedia/wikipedia` with `20231101.en`
+- `Helsinki-NLP/opus-100` with `en-tr`
+
+The synthetic code-switching stream is generated locally from OPUS-100 parallel pairs, so it does not appear as a separate Hugging Face dataset entry.
+
 ### Vocabulary Budget
 
 Multrenizer is designed around a `26,000` target vocabulary, with a fixed budget reserved for always-preserved tokens:
```
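The actual data-preparation script is not shown on this page, so the following is only a rough sketch: it captures the listed `(repo, config)` pairs as constants and illustrates, with a hypothetical `make_code_switched` helper, how a code-switched line could be spliced locally from an OPUS-100 parallel pair.

```python
import random

# Source configs listed in the card: (dataset repo, config name).
SOURCES = [
    ("wikimedia/wikipedia", "20231101.tr"),
    ("wikimedia/wikipedia", "20231101.en"),
    ("Helsinki-NLP/opus-100", "en-tr"),
]

def make_code_switched(en: str, tr: str, rng: random.Random) -> str:
    """Hypothetical splice: take a word-level prefix from one language
    and the suffix from the other, at a random cut point."""
    en_words, tr_words = en.split(), tr.split()
    # Cut point leaves at least one word on each side of the splice.
    cut = rng.randint(1, min(len(en_words), len(tr_words)) - 1)
    if rng.random() < 0.5:
        return " ".join(en_words[:cut] + tr_words[cut:])
    return " ".join(tr_words[:cut] + en_words[cut:])

rng = random.Random(0)
print(make_code_switched("the weather is nice today", "hava bugün çok güzel", rng))
```

Each `(repo, config)` tuple corresponds to a `datasets.load_dataset(repo, config)` call; the splice function is purely illustrative of the locally generated code-switching stream described above, not the card author's actual method.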