xixianliao committed on
Commit
7848237
·
verified ·
1 Parent(s): 7d043a3

Update README.md

  1. README.md +16 -16
README.md CHANGED
@@ -331,8 +331,8 @@ Click the expand button below to see the full list of language pairs and the dat
331
 
332
  This model has been fine-tuned on ~51k instructions, primarily targeting machine translation performance for the languages pairs featured in the WMT 2025 shared task.
333
  The corpus used for the instruction tuning round was built to focus on paragraph-level translation, context-aware machine translation, and sentence-level translation.
334
- To construct paragraph-level data, we source from [FLORES-200-dev](https://huggingface.co/datasets/openlanguagedata/flores_plus), [NTREX](https://github.com/MicrosoftTranslator/NTREX), and [News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary).
335
- The paragraph-level data were constructed by concatenating adjacent sentences (randomly grouping 2, 3, or 4) from the same article or document in FLORES-dev, NTREX, and News Commentary.
336
  Serbian Cyrillic data from FLORES-dev was transliterated into Serbian Latin.
337
  In addition, we included data from [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) that we considered relevant to our tasks.
338
 
@@ -374,20 +374,20 @@ Click the expand button below to see the full list of tasks included in the fine
374
  | en → cs | 58 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) |
375
  | cs → de | 58 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) |
376
  | en → ru | 50 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) |
377
- | en → ar | 30 | Paragraph-level | [FLORES-200-dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
378
- | en → bho | 30 | Paragraph-level | [FLORES-200-dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
379
- | en → ja | 30 | Paragraph-level | [FLORES-200-dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
380
- | en → uk | 30 | Paragraph-level | [FLORES-200-dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
381
- | cs → uk | 30 | Paragraph-level | [FLORES-200-dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
382
- | ja → zh | 30 | Paragraph-level | [FLORES-200-dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
383
- | en → zh | 30 | Paragraph-level | [FLORES-200-dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
384
- | en → ko | 30 | Paragraph-level | [FLORES-200-dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
385
- | en → et | 30 | Paragraph-level | [FLORES-200-dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
386
- | en → is | 30 | Paragraph-level | [FLORES-200-dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
387
- | en → sh | 30 | Paragraph-level | [FLORES-200-dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
388
- | en → cs | 30 | Paragraph-level | [FLORES-200-dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
389
- | cs → de | 30 | Paragraph-level | [FLORES-200-dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
390
- | en → ru | 21 | Paragraph-level | [FLORES-200-dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
391
  | **Total** | **50,841** | | |
392
 
393
 
 
331
 
332
  This model has been fine-tuned on ~51k instructions, primarily targeting machine translation performance for the languages pairs featured in the WMT 2025 shared task.
333
  The corpus used for the instruction tuning round was built to focus on paragraph-level translation, context-aware machine translation, and sentence-level translation.
334
+ To construct paragraph-level data, we source from [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus), [NTREX](https://github.com/MicrosoftTranslator/NTREX), and [News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary).
335
+ The paragraph-level data were constructed by concatenating adjacent sentences (randomly grouping 2, 3, or 4) from the same article or document in Flores+200 dev, NTREX, and News Commentary.
336
  Serbian Cyrillic data from FLORES-dev was transliterated into Serbian Latin.
337
  In addition, we included data from [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) that we considered relevant to our tasks.
338
 
 
374
  | en → cs | 58 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) |
375
  | cs → de | 58 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) |
376
  | en → ru | 50 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) |
377
+ | en → ar | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
378
+ | en → bho | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
379
+ | en → ja | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
380
+ | en → uk | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
381
+ | cs → uk | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
382
+ | ja → zh | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
383
+ | en → zh | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
384
+ | en → ko | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
385
+ | en → et | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
386
+ | en → is | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
387
+ | en → sh | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
388
+ | en → cs | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
389
+ | cs → de | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
390
+ | en → ru | 21 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
391
  | **Total** | **50,841** | | |
392
 
393
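
The README text in this diff describes building paragraph-level data by concatenating adjacent sentences (randomly grouping 2, 3, or 4) from the same article or document. A minimal sketch of that idea follows. It is not the authors' actual script: the function name `group_adjacent`, the space separator used to join sentences, and the choice to drop a single leftover sentence at the end of a document are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def group_adjacent(
    pairs: Sequence[Pair],
    rng: random.Random,
    sizes: Sequence[int] = (2, 3, 4),
) -> List[Pair]:
    """Concatenate adjacent sentence pairs from ONE document into paragraphs.

    Hypothetical helper: walks the document in order, draws a group size
    from `sizes` at each step, and joins that many adjacent sentences.
    """
    paragraphs: List[Pair] = []
    i = 0
    while i < len(pairs):
        size = rng.choice(sizes)
        chunk = pairs[i : i + size]
        i += size
        # Assumption: a lone trailing sentence is not a paragraph, so skip it.
        if len(chunk) < 2:
            continue
        # Assumption: sentences are joined with a single space; scripts
        # without inter-sentence spaces (e.g. ja, zh) may need "" instead.
        src = " ".join(s for s, _ in chunk)
        tgt = " ".join(t for _, t in chunk)
        paragraphs.append((src, tgt))
    return paragraphs


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the grouping is reproducible
    doc = [(f"Source sentence {i}.", f"Target sentence {i}.") for i in range(7)]
    for src, tgt in group_adjacent(doc, rng):
        print(src, "=>", tgt)
```

Calling the function once per article keeps groups from crossing document boundaries, which matches the "same article or document" constraint stated in the README.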