Update README.md
README.md
CHANGED
|
@@ -331,8 +331,8 @@ Click the expand button below to see the full list of language pairs and the dat
|
|
| 331 |
|
| 332 |
This model has been fine-tuned on ~51k instructions, primarily targeting machine translation performance for the language pairs featured in the WMT 2025 shared task.
|
| 333 |
The corpus used for the instruction tuning round was built to focus on paragraph-level translation, context-aware machine translation, and sentence-level translation.
|
| 334 |
-
To construct paragraph-level data, we source from [
|
| 335 |
-
The paragraph-level data were constructed by concatenating adjacent sentences (randomly grouping 2, 3, or 4) from the same article or document in
|
| 336 |
Serbian Cyrillic data from FLORES-dev was transliterated into Serbian Latin.
|
| 337 |
In addition, we included data from [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) that we considered relevant to our tasks.
|
| 338 |
|
|
@@ -374,20 +374,20 @@ Click the expand button below to see the full list of tasks included in the fine
|
|
| 374 |
| en → cs | 58 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) |
|
| 375 |
| cs → de | 58 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) |
|
| 376 |
| en → ru | 50 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) |
|
| 377 |
-
| en → ar | 30 | Paragraph-level | [
|
| 378 |
-
| en → bho | 30 | Paragraph-level | [
|
| 379 |
-
| en → ja | 30 | Paragraph-level | [
|
| 380 |
-
| en → uk | 30 | Paragraph-level | [
|
| 381 |
-
| cs → uk | 30 | Paragraph-level | [
|
| 382 |
-
| ja → zh | 30 | Paragraph-level | [
|
| 383 |
-
| en → zh | 30 | Paragraph-level | [
|
| 384 |
-
| en
|
| 385 |
-
| en → et | 30 | Paragraph-level | [
|
| 386 |
-
| en → is | 30 | Paragraph-level | [
|
| 387 |
-
| en → sh | 30 | Paragraph-level | [
|
| 388 |
-
| en → cs | 30 | Paragraph-level | [
|
| 389 |
-
| cs → de | 30 | Paragraph-level | [
|
| 390 |
-
| en → ru | 21 | Paragraph-level | [
|
| 391 |
| **Total** | **50,841** | | |
|
| 392 |
|
| 393 |
|
|
|
|
| 331 |
|
| 332 |
This model has been fine-tuned on ~51k instructions, primarily targeting machine translation performance for the language pairs featured in the WMT 2025 shared task.
|
| 333 |
The corpus used for the instruction tuning round was built to focus on paragraph-level translation, context-aware machine translation, and sentence-level translation.
|
| 334 |
+
To construct paragraph-level data, we source from [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus), [NTREX](https://github.com/MicrosoftTranslator/NTREX), and [News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary).
|
| 335 |
+
The paragraph-level data were constructed by concatenating adjacent sentences (randomly grouping 2, 3, or 4) from the same article or document in Flores+200 dev, NTREX, and News Commentary.
|
| 336 |
Serbian Cyrillic data from FLORES-dev was transliterated into Serbian Latin.
|
| 337 |
In addition, we included data from [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) that we considered relevant to our tasks.
|
| 338 |
|
|
|
|
| 374 |
| en → cs | 58 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) |
|
| 375 |
| cs → de | 58 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) |
|
| 376 |
| en → ru | 50 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) |
|
| 377 |
+
| en → ar | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
|
| 378 |
+
| en → bho | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
|
| 379 |
+
| en → ja | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
|
| 380 |
+
| en → uk | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
|
| 381 |
+
| cs → uk | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
|
| 382 |
+
| ja → zh | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
|
| 383 |
+
| en → zh | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
|
| 384 |
+
| en → ko | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
|
| 385 |
+
| en → et | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
|
| 386 |
+
| en → is | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
|
| 387 |
+
| en → sh | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
|
| 388 |
+
| en → cs | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
|
| 389 |
+
| cs → de | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
|
| 390 |
+
| en → ru | 21 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) |
|
| 391 |
| **Total** | **50,841** | | |
|
| 392 |
|
| 393 |
|
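
The README text in this change describes building paragraph-level data by concatenating adjacent sentences (randomly grouping 2, 3, or 4) from the same article or document. The following is a minimal sketch of that idea, not the authors' actual pipeline. The function name `group_adjacent`, the space separator used to join sentences, and the choice to drop a single leftover sentence at the end of a document are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

SentencePair = Tuple[str, str]  # (source sentence, target sentence)


def group_adjacent(
    pairs: Sequence[SentencePair],
    rng: random.Random,
    sizes: Tuple[int, ...] = (2, 3, 4),
) -> List[SentencePair]:
    """Concatenate adjacent sentence pairs from ONE document into paragraphs.

    `pairs` must hold aligned sentences in document order. Each paragraph
    covers 2, 3, or 4 consecutive sentences, with the size drawn at random.
    """
    paragraphs: List[SentencePair] = []
    i = 0
    while i < len(pairs):
        k = rng.choice(sizes)
        chunk = pairs[i:i + k]
        i += k
        # Assumption: a trailing chunk shorter than the smallest group size
        # (a single sentence) is dropped rather than kept as a "paragraph".
        if len(chunk) < min(sizes):
            break
        # Assumption: sentences are joined with a single space; languages
        # written without spaces (e.g. ja, zh) may need a different joiner.
        src = " ".join(s for s, _ in chunk)
        tgt = " ".join(t for _, t in chunk)
        paragraphs.append((src, tgt))
    return paragraphs


if __name__ == "__main__":
    doc = [(f"Source sentence {i}.", f"Target sentence {i}.") for i in range(7)]
    for src, tgt in group_adjacent(doc, random.Random(0)):
        print(src, "|||", tgt)
```

Grouping is done per document, so paragraphs never span article boundaries. Passing a seeded `random.Random` keeps the grouping reproducible.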