islomov committed on Mar 28

Commit

bfe896d

verified ·

1 Parent(s): cb6d989

Initial private upload

Files changed (11)

.gitattributes +0 -34
README.md +230 -0
added_tokens.json +127 -0
config.json +32 -0
data_format.example.jsonl +5 -0
finetune.py +236 -0
generation_config.json +7 -0
model.safetensors +3 -0
special_tokens_map.json +150 -0
test_model.py +169 -0
tokenizer_config.json +1163 -0

.gitattributes CHANGED Viewed

@@ -1,35 +1 @@
-*.7z filter=lfs diff=lfs merge=lfs -text
-*.arrow filter=lfs diff=lfs merge=lfs -text
-*.bin filter=lfs diff=lfs merge=lfs -text
-*.bz2 filter=lfs diff=lfs merge=lfs -text
-*.ckpt filter=lfs diff=lfs merge=lfs -text
-*.ftz filter=lfs diff=lfs merge=lfs -text
-*.gz filter=lfs diff=lfs merge=lfs -text
-*.h5 filter=lfs diff=lfs merge=lfs -text
-*.joblib filter=lfs diff=lfs merge=lfs -text
-*.lfs.* filter=lfs diff=lfs merge=lfs -text
-*.mlmodel filter=lfs diff=lfs merge=lfs -text
-*.model filter=lfs diff=lfs merge=lfs -text
-*.msgpack filter=lfs diff=lfs merge=lfs -text
-*.npy filter=lfs diff=lfs merge=lfs -text
-*.npz filter=lfs diff=lfs merge=lfs -text
-*.onnx filter=lfs diff=lfs merge=lfs -text
-*.ot filter=lfs diff=lfs merge=lfs -text
-*.parquet filter=lfs diff=lfs merge=lfs -text
-*.pb filter=lfs diff=lfs merge=lfs -text
-*.pickle filter=lfs diff=lfs merge=lfs -text
-*.pkl filter=lfs diff=lfs merge=lfs -text
-*.pt filter=lfs diff=lfs merge=lfs -text
-*.pth filter=lfs diff=lfs merge=lfs -text
-*.rar filter=lfs diff=lfs merge=lfs -text
 *.safetensors filter=lfs diff=lfs merge=lfs -text
-saved_model/**/* filter=lfs diff=lfs merge=lfs -text
-*.tar.* filter=lfs diff=lfs merge=lfs -text
-*.tar filter=lfs diff=lfs merge=lfs -text
-*.tflite filter=lfs diff=lfs merge=lfs -text
-*.tgz filter=lfs diff=lfs merge=lfs -text
-*.wasm filter=lfs diff=lfs merge=lfs -text
-*.xz filter=lfs diff=lfs merge=lfs -text
-*.zip filter=lfs diff=lfs merge=lfs -text
-*.zst filter=lfs diff=lfs merge=lfs -text
-*tfevents* filter=lfs diff=lfs merge=lfs -text

README.md ADDED Viewed

	@@ -0,0 +1,230 @@

+---
+language:
+- uz
+- ru
+library_name: transformers
+pipeline_tag: text2text-generation
+tags:
+- uzbek
+- russian
+- text-normalization
+- error-correction
+- byt5
+- base-model
+- finetuning
+---
+# rubai-corrector-base
+Base ByT5 correction checkpoint for building task-specific Rubai correctors.
+This is the foundation model of the `rubai-corrector` line. It is meant to be fine-tuned for a specific task, such as:
+- transcript display cleanup
+- punctuation and comma recovery
+- OCR and ASR typo repair
+- apostrophe normalization
+- mixed Uzbek/Russian cleanup
+- domain-specific formatting rules
+If you want a ready-to-use ASR display model, use [rubai-corrector-transcript-uz](https://huggingface.co/rubai/rubai-corrector-transcript-uz). This package is the base for further adaptation.
+## Authors
+- **[Sardor Islomov](https://www.linkedin.com/in/islomov-sardor/)** — lead author
+- [Davron Ibrokhimov](https://www.linkedin.com/in/davron-ibrokhimov-8b62b8287/)
+## Model Family
+| Model | Use Case |
+|---|---|
+| **rubai-corrector-base** (this model) | Fine-tuning base for new correction tasks |
+| [rubai-corrector-transcript-uz](https://huggingface.co/rubai/rubai-corrector-transcript-uz) | Ready-to-use transcript display normalization |
+Both models share the same ByT5 architecture. The transcript model is fine-tuned from this base for ASR display text.
+## Quick Smoke Test
+The model uses the `correct: ` instruction prefix.
+```python
+from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+model_path = "rubai/rubai-corrector-base"
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
+text = "men ozim kordim"
+inputs = tokenizer([f"correct: {text}"], return_tensors="pt", padding=True)
+output_ids = model.generate(**inputs, max_new_tokens=128)
+prediction = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
+print(prediction)
+```
+Expected output:
+```text
+Men o'zim ko'rdim
+```
+For a local runnable example suite, see [test_model.py](./test_model.py).
+## Real Base Examples
+These are real outputs from this packaged checkpoint.
+### Abbreviations
+```text
+Input:  telefon rqami qaysi
+Output: Telefon raqami qaysi
+```
+### Apostrophes
+```text
+Input:  men ozim kordim
+Output: Men o'zim ko'rdim
+Input:  togri yoldan boring
+Output: To'g'ri yo'ldan boring
+```
+### OCR And ASR Noise
+```text
+Input:  rnen universitetda oqiyrnan
+Output: Men universitetda o'qiyman
+Input:  bu juda rnuhirn masala
+Output: Bu juda muhim masala
+```
+### Numbers And Dates
+```text
+Input:  narxi yigirma besh ming so'm
+Output: Narxi 25 000 so'm
+Input:  uchrashuv o'n beshinchi yanvar kuni
+Output: Uchrashuv 15-yanvar kuni
+```
+### Mixed Uzbek And Russian
+```text
+Input:  men segodnya bozorga bordim
+Output: Men сегодня bozorga bordim
+Input:  privet kak делa
+Output: Привет как дела
+```
+## Fine-Tuning
+This package includes a standalone fine-tuning script:
+- [finetune.py](./finetune.py)
+It keeps the same core training behavior as the original project line:
+- input prefix: `correct: `
+- ByT5 / `T5ForConditionalGeneration`
+- Adafactor optimizer
+- linear warmup scheduler
+- seq2seq supervised fine-tuning on `input -> output` pairs
+Example:
+```bash
+python finetune.py \
+  --model-path rubai/rubai-corrector-base \
+  --train-file ./data/train.jsonl \
+  --eval-file ./data/valid.jsonl \
+  --output-dir ./outputs/my-domain-corrector \
+  --learning-rate 5e-5 \
+  --num-train-epochs 2 \
+  --per-device-train-batch-size 16 \
+  --gradient-accumulation-steps 4 \
+  --max-source-length 512 \
+  --max-target-length 512 \
+  --bf16
+```
+## Input Data Format
+Training data is JSONL. Each line must contain:
+- `input`: noisy or source text
+- `output`: target corrected text
+Example:
+```jsonl
+{"input":"men ozim kordim","output":"Men o'zim ko'rdim"}
+{"input":"narxi yigirma besh ming so'm","output":"Narxi 25 000 so'm"}
+{"input":"rnen universitetda oqiyrnan","output":"Men universitetda o'qiyman"}
+{"input":"men segodnya bozorga bordim","output":"Men сегодня bozorga bordim"}
+```
+A tiny sample file is included here:
+- [data_format.example.jsonl](./data_format.example.jsonl)
+You can point `finetune.py` either to a JSONL file directly or to a directory containing `data.jsonl`.
+## How This Base Was Trained
+This model starts from [google/byt5-small](https://huggingface.co/google/byt5-small) and was built with a 3-stage curriculum on Uzbek text correction data.
+### Stage 1 — Foundation
+The foundation stage used ~1,000,000 synthetic correction pairs generated from Uzbek text with transformations such as:
+- apostrophe removal
+- comma removal
+- lowercasing
+- OCR-like character substitutions
+- `h`/`x` swaps
+- abbreviation-like corruption
+### Stage 2 — Curated Mix
+Stage 2 added ~408,000 curated rows covering:
+- general error correction
+- text denormalization (numbers, dates, formatting)
+- Russian Latin-to-Cyrillic recovery
+- focused apostrophe and `h`/`x` restoration
+- anti-Cyrillic guardrails (prevent unwanted script switching)
+### Stage 3 — Polish
+Stage 3 used ~32,000 rows for fine-grained behavior tuning:
+- comma and punctuation restoration
+- exact-copy preservation (teach the model not to over-correct)
+- format restoration (numbers, dates, addresses)
+- mixed-script guardrails (prevent script bleeding between Uzbek and Russian)
+- period hallucination prevention
+## Training Details
+- **Architecture:** `T5ForConditionalGeneration` with ByT5 tokenizer
+- **Precision:** BF16 mixed precision
+- **Optimizer:** Adafactor
+- **Scheduler:** linear warmup + linear decay
+- **Max sequence length:** 512
+- **Gradient checkpointing:** enabled
+- **Curriculum learning:** enabled (length-sorted batches)
+## Notes
+- This base model is for continuation training and task-specific adaptation.
+- It can be used directly for inference, but that is not its main role in the model family.
+- For Rubai STT postprocessing out of the box, use [rubai-corrector-transcript-uz](https://huggingface.co/rubai/rubai-corrector-transcript-uz).
+## Acknowledgements
+Special thanks to [Davron Ibrokhimov](https://www.linkedin.com/in/davron-ibrokhimov-8b62b8287/) for sponsoring this work and making it possible to keep these models open.
+Thank you to the community that supports Uzbek language technology. In particular:
+- [MetaSell](https://metasell.ai/) for support and resources
+- [Kotib](https://kotib.ai/) for their support and collaboration on Uzbek STT
+- [Global Move](https://globalmove.uz/) for backing open Uzbek NLP work
+Thanks to Arofat, Gulimshaxnoz, and many others who contributed in ways big and small. The list is too long to fit here, but every contribution matters and is appreciated.
+Support my work and the open-source movement: https://tirikchilik.uz/islomovs
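
Note on the Stage 1 description in the README above: the listed transformations (apostrophe removal, comma removal, lowercasing, OCR-like substitutions, `h`/`x` swaps) can be illustrated with a small generator that turns clean text into `input`/`output` pairs in the JSONL format the README describes. This is only a sketch under assumptions. The function names, the `m` to `rn` substitution (inferred from the `rnen` example), and the per-transformation probability `p` are hypothetical and are not taken from the authors' data pipeline, which is not part of this commit.

```python
import json
import random

# Hypothetical corruption helpers; the real Stage 1 generator is not in this commit.

def remove_apostrophes(text: str) -> str:
    """Drop ASCII and typographic apostrophes (o'zim -> ozim)."""
    for mark in ("'", "\u2019", "\u02bb"):
        text = text.replace(mark, "")
    return text


def remove_commas(text: str) -> str:
    """Delete every comma so the model has to restore them."""
    return text.replace(",", "")


def lowercase(text: str) -> str:
    """Remove casing information."""
    return text.lower()


def ocr_noise(text: str) -> str:
    """Mimic one OCR confusion: 'm' read as 'rn' (men -> rnen)."""
    return text.replace("m", "rn")


def swap_h_x(text: str) -> str:
    """Swap h and x in both directions (hamma -> xamma)."""
    return text.translate(str.maketrans("hxHX", "xhXH"))


CORRUPTIONS = [remove_apostrophes, remove_commas, lowercase, ocr_noise, swap_h_x]


def make_pair(clean: str, rng: random.Random, p: float = 0.3) -> dict:
    """Apply each corruption with probability p and return one training row."""
    noisy = clean
    for corrupt in CORRUPTIONS:
        if rng.random() < p:
            noisy = corrupt(noisy)
    return {"input": noisy, "output": clean}


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the generated file is reproducible
    for sentence in ["Men o'zim ko'rdim", "Hamma narsa tayyor"]:
        print(json.dumps(make_pair(sentence, rng), ensure_ascii=False))
```

Each printed line has the same shape as the rows in `data_format.example.jsonl`, so the output could be written to a file and passed to `finetune.py --train-file`. Rows where no corruption fires come out as identical `input` and `output`, which loosely resembles the "exact-copy preservation" data mentioned for Stage 3.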

added_tokens.json ADDED Viewed

	@@ -0,0 +1,127 @@

+{
+  "<extra_id_0>": 259,
+  "<extra_id_100>": 359,
+  "<extra_id_101>": 360,
+  "<extra_id_102>": 361,
+  "<extra_id_103>": 362,
+  "<extra_id_104>": 363,
+  "<extra_id_105>": 364,
+  "<extra_id_106>": 365,
+  "<extra_id_107>": 366,
+  "<extra_id_108>": 367,
+  "<extra_id_109>": 368,
+  "<extra_id_10>": 269,
+  "<extra_id_110>": 369,
+  "<extra_id_111>": 370,
+  "<extra_id_112>": 371,
+  "<extra_id_113>": 372,
+  "<extra_id_114>": 373,
+  "<extra_id_115>": 374,
+  "<extra_id_116>": 375,
+  "<extra_id_117>": 376,
+  "<extra_id_118>": 377,
+  "<extra_id_119>": 378,
+  "<extra_id_11>": 270,
+  "<extra_id_120>": 379,
+  "<extra_id_121>": 380,
+  "<extra_id_122>": 381,
+  "<extra_id_123>": 382,
+  "<extra_id_124>": 383,
+  "<extra_id_12>": 271,
+  "<extra_id_13>": 272,
+  "<extra_id_14>": 273,
+  "<extra_id_15>": 274,
+  "<extra_id_16>": 275,
+  "<extra_id_17>": 276,
+  "<extra_id_18>": 277,
+  "<extra_id_19>": 278,
+  "<extra_id_1>": 260,
+  "<extra_id_20>": 279,
+  "<extra_id_21>": 280,
+  "<extra_id_22>": 281,
+  "<extra_id_23>": 282,
+  "<extra_id_24>": 283,
+  "<extra_id_25>": 284,
+  "<extra_id_26>": 285,
+  "<extra_id_27>": 286,
+  "<extra_id_28>": 287,
+  "<extra_id_29>": 288,
+  "<extra_id_2>": 261,
+  "<extra_id_30>": 289,
+  "<extra_id_31>": 290,
+  "<extra_id_32>": 291,
+  "<extra_id_33>": 292,
+  "<extra_id_34>": 293,
+  "<extra_id_35>": 294,
+  "<extra_id_36>": 295,
+  "<extra_id_37>": 296,
+  "<extra_id_38>": 297,
+  "<extra_id_39>": 298,
+  "<extra_id_3>": 262,
+  "<extra_id_40>": 299,
+  "<extra_id_41>": 300,
+  "<extra_id_42>": 301,
+  "<extra_id_43>": 302,
+  "<extra_id_44>": 303,
+  "<extra_id_45>": 304,
+  "<extra_id_46>": 305,
+  "<extra_id_47>": 306,
+  "<extra_id_48>": 307,
+  "<extra_id_49>": 308,
+  "<extra_id_4>": 263,
+  "<extra_id_50>": 309,
+  "<extra_id_51>": 310,
+  "<extra_id_52>": 311,
+  "<extra_id_53>": 312,
+  "<extra_id_54>": 313,
+  "<extra_id_55>": 314,
+  "<extra_id_56>": 315,
+  "<extra_id_57>": 316,
+  "<extra_id_58>": 317,
+  "<extra_id_59>": 318,
+  "<extra_id_5>": 264,
+  "<extra_id_60>": 319,
+  "<extra_id_61>": 320,
+  "<extra_id_62>": 321,
+  "<extra_id_63>": 322,
+  "<extra_id_64>": 323,
+  "<extra_id_65>": 324,
+  "<extra_id_66>": 325,
+  "<extra_id_67>": 326,
+  "<extra_id_68>": 327,
+  "<extra_id_69>": 328,
+  "<extra_id_6>": 265,
+  "<extra_id_70>": 329,
+  "<extra_id_71>": 330,
+  "<extra_id_72>": 331,
+  "<extra_id_73>": 332,
+  "<extra_id_74>": 333,
+  "<extra_id_75>": 334,
+  "<extra_id_76>": 335,
+  "<extra_id_77>": 336,
+  "<extra_id_78>": 337,
+  "<extra_id_79>": 338,
+  "<extra_id_7>": 266,
+  "<extra_id_80>": 339,
+  "<extra_id_81>": 340,
+  "<extra_id_82>": 341,
+  "<extra_id_83>": 342,
+  "<extra_id_84>": 343,
+  "<extra_id_85>": 344,
+  "<extra_id_86>": 345,
+  "<extra_id_87>": 346,
+  "<extra_id_88>": 347,
+  "<extra_id_89>": 348,
+  "<extra_id_8>": 267,
+  "<extra_id_90>": 349,
+  "<extra_id_91>": 350,
+  "<extra_id_92>": 351,
+  "<extra_id_93>": 352,
+  "<extra_id_94>": 353,
+  "<extra_id_95>": 354,
+  "<extra_id_96>": 355,
+  "<extra_id_97>": 356,
+  "<extra_id_98>": 357,
+  "<extra_id_99>": 358,
+  "<extra_id_9>": 268
+}
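
The IDs in added_tokens.json follow from the ByT5 vocabulary layout: three special tokens (`<pad>` = 0, `</s>` = 1, `<unk>` = 2, as listed in tokenizer_config.json), then the 256 byte values shifted up by 3, then the 125 sentinel tokens starting at 259. Together these give the `vocab_size` of 384 in config.json. The sketch below spells that mapping out in plain Python. The helper names (`encode_bytes`, `decode_bytes`, `sentinel_id`) are hypothetical and the code does not call the real tokenizer.

```python
NUM_SPECIAL = 3      # <pad>=0, </s>=1, <unk>=2
NUM_BYTES = 256      # one ID per possible byte value
NUM_SENTINELS = 125  # <extra_id_0> ... <extra_id_124>
EOS_ID = 1


def encode_bytes(text: str) -> list:
    """Map UTF-8 bytes to IDs (byte value + 3) and append the end-of-sequence ID."""
    return [b + NUM_SPECIAL for b in text.encode("utf-8")] + [EOS_ID]


def decode_bytes(ids: list) -> str:
    """Invert encode_bytes, skipping special and sentinel IDs."""
    raw = bytes(i - NUM_SPECIAL for i in ids if NUM_SPECIAL <= i < NUM_SPECIAL + NUM_BYTES)
    return raw.decode("utf-8", errors="ignore")


def sentinel_id(n: int) -> int:
    """ID of <extra_id_n>; sentinels sit directly after the byte range."""
    if not 0 <= n < NUM_SENTINELS:
        raise ValueError(f"sentinel index out of range: {n}")
    return NUM_SPECIAL + NUM_BYTES + n


if __name__ == "__main__":
    print(sentinel_id(0), sentinel_id(124))
    print(encode_bytes("ab"))
    print(len(encode_bytes("сегодня")))
```

One practical consequence: Cyrillic letters take two bytes each in UTF-8, so mixed Uzbek/Russian inputs use up the 512-token source limit faster than Latin-only text of the same character count.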

config.json ADDED Viewed

	@@ -0,0 +1,32 @@

+{
+  "architectures": [
+    "T5ForConditionalGeneration"
+  ],
+  "classifier_dropout": 0.0,
+  "d_ff": 3584,
+  "d_kv": 64,
+  "d_model": 1472,
+  "decoder_start_token_id": 0,
+  "dense_act_fn": "gelu_new",
+  "dropout_rate": 0.1,
+  "dtype": "float32",
+  "eos_token_id": 1,
+  "feed_forward_proj": "gated-gelu",
+  "gradient_checkpointing": false,
+  "initializer_factor": 1.0,
+  "is_encoder_decoder": true,
+  "is_gated_act": true,
+  "layer_norm_epsilon": 1e-06,
+  "model_type": "t5",
+  "num_decoder_layers": 4,
+  "num_heads": 6,
+  "num_layers": 12,
+  "pad_token_id": 0,
+  "relative_attention_max_distance": 128,
+  "relative_attention_num_buckets": 32,
+  "tie_word_embeddings": false,
+  "tokenizer_class": "ByT5Tokenizer",
+  "transformers_version": "4.57.3",
+  "use_cache": true,
+  "vocab_size": 384
+}
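
As a sanity check on this config, the parameter count can be estimated from the dimensions alone and compared with the `model.safetensors` size recorded in this commit (1198571496 bytes, `float32`). The sketch below assumes the usual T5 block layout as implemented in Transformers: bias-free linear layers, a gated feed-forward block with three weight matrices, one scale vector per layer norm, and a relative attention bias table only in the first layer of the encoder and of the decoder. The function name `estimate_t5_params` is hypothetical, and the result is an estimate rather than a measurement of the checkpoint.

```python
# Values copied from config.json in this commit.
CONFIG = {
    "vocab_size": 384,
    "d_model": 1472,
    "d_kv": 64,
    "d_ff": 3584,
    "num_heads": 6,
    "num_layers": 12,
    "num_decoder_layers": 4,
    "relative_attention_num_buckets": 32,
}


def estimate_t5_params(cfg: dict) -> int:
    """Estimate the parameter count of a gated-GELU T5 model with untied embeddings."""
    d_model = cfg["d_model"]
    inner = cfg["num_heads"] * cfg["d_kv"]          # width of the q/k/v projections
    attention = 4 * d_model * inner                 # q, k, v and output projections
    feed_forward = 3 * d_model * cfg["d_ff"]        # wi_0, wi_1 and wo
    encoder_layer = attention + feed_forward + 2 * d_model
    decoder_layer = 2 * attention + feed_forward + 3 * d_model  # self + cross attention
    embeddings = 2 * cfg["vocab_size"] * d_model    # input embedding + separate lm_head
    rel_bias = 2 * cfg["relative_attention_num_buckets"] * cfg["num_heads"]
    final_norms = 2 * d_model                       # one per stack
    return (
        cfg["num_layers"] * encoder_layer
        + cfg["num_decoder_layers"] * decoder_layer
        + embeddings
        + rel_bias
        + final_norms
    )


if __name__ == "__main__":
    params = estimate_t5_params(CONFIG)
    print(f"parameters: {params:,}")
    print(f"float32 bytes: {params * 4:,}")
```

Under these assumptions the estimate is 299,637,760 parameters, or 1,198,551,040 bytes in `float32`. That is about 20 KB below the recorded file size, a gap that would be consistent with the safetensors header, so the config and the weight file appear to agree.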

data_format.example.jsonl ADDED Viewed

	@@ -0,0 +1,5 @@

+{"input":"men ozim kordim","output":"Men o'zim ko'rdim"}
+{"input":"telefon rqami qaysi","output":"Telefon raqami qaysi"}
+{"input":"narxi yigirma besh ming so'm","output":"Narxi 25 000 so'm"}
+{"input":"rnen universitetda oqiyrnan","output":"Men universitetda o'qiyman"}
+{"input":"men segodnya bozorga bordim","output":"Men сегодня bozorga bordim"}

finetune.py ADDED Viewed

	@@ -0,0 +1,236 @@

+#!/usr/bin/env python3
+"""Fine-tune rubai-corrector-base on JSONL correction pairs."""
+from __future__ import annotations
+import argparse
+import json
+import random
+from pathlib import Path
+from typing import Any
+import torch
+from torch.utils.data import Dataset
+from transformers import (
+    AutoModelForSeq2SeqLM,
+    AutoTokenizer,
+    DataCollatorForSeq2Seq,
+    Seq2SeqTrainer,
+    Seq2SeqTrainingArguments,
+    get_linear_schedule_with_warmup,
+    set_seed,
+)
+from transformers.optimization import Adafactor
+INPUT_PREFIX = "correct: "
+def resolve_jsonl_path(path: Path) -> Path:
+    if path.is_dir():
+        candidate = path / "data.jsonl"
+        if candidate.exists():
+            return candidate
+        raise FileNotFoundError(f"Directory {path} does not contain data.jsonl")
+    return path
+def load_records(path: Path) -> list[dict[str, Any]]:
+    data_path = resolve_jsonl_path(path)
+    records: list[dict[str, Any]] = []
+    with data_path.open("r", encoding="utf-8") as handle:
+        for line_num, line in enumerate(handle, start=1):
+            line = line.strip()
+            if not line:
+                continue
+            record = json.loads(line)
+            if not isinstance(record.get("input"), str) or not isinstance(record.get("output"), str):
+                raise ValueError(
+                    f"{data_path}:{line_num} must contain string fields 'input' and 'output'"
+                )
+            records.append(record)
+    if not records:
+        raise ValueError(f"No records loaded from {data_path}")
+    return records
+def split_records(
+    records: list[dict[str, Any]],
+    validation_split: float,
+    seed: int,
+) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
+    if validation_split <= 0:
+        return records, []
+    if not 0 < validation_split < 1:
+        raise ValueError("--validation-split must be between 0 and 1")
+    items = records[:]
+    random.Random(seed).shuffle(items)
+    eval_size = max(1, int(len(items) * validation_split))
+    return items[eval_size:], items[:eval_size]
+class CorrectionDataset(Dataset):
+    def __init__(
+        self,
+        records: list[dict[str, Any]],
+        tokenizer,
+        max_source_length: int,
+        max_target_length: int,
+    ):
+        self.records = records
+        self.tokenizer = tokenizer
+        self.max_source_length = max_source_length
+        self.max_target_length = max_target_length
+    def __len__(self) -> int:
+        return len(self.records)
+    def __getitem__(self, index: int) -> dict[str, Any]:
+        record = self.records[index]
+        model_inputs = self.tokenizer(
+            INPUT_PREFIX + record["input"],
+            truncation=True,
+            max_length=self.max_source_length,
+        )
+        labels = self.tokenizer(
+            record["output"],
+            truncation=True,
+            max_length=self.max_target_length,
+        )
+        model_inputs["labels"] = labels["input_ids"]
+        return model_inputs
+class AdafactorSeq2SeqTrainer(Seq2SeqTrainer):
+    def create_optimizer(self):
+        if self.optimizer is None:
+            self.optimizer = Adafactor(
+                self.model.parameters(),
+                lr=self.args.learning_rate,
+                scale_parameter=False,
+                relative_step=False,
+                warmup_init=False,
+                weight_decay=self.args.weight_decay,
+            )
+        return self.optimizer
+    def create_scheduler(self, num_training_steps: int, optimizer=None):
+        if self.lr_scheduler is None:
+            actual_optimizer = optimizer if optimizer is not None else self.optimizer
+            self.lr_scheduler = get_linear_schedule_with_warmup(
+                actual_optimizer,
+                num_warmup_steps=self.args.get_warmup_steps(num_training_steps),
+                num_training_steps=num_training_steps,
+            )
+        return self.lr_scheduler
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--model-path", type=Path, default=Path(__file__).resolve().parent)
+    parser.add_argument("--train-file", type=Path, required=True)
+    parser.add_argument("--eval-file", type=Path, default=None)
+    parser.add_argument("--validation-split", type=float, default=0.0)
+    parser.add_argument("--output-dir", type=Path, required=True)
+    parser.add_argument("--max-source-length", type=int, default=512)
+    parser.add_argument("--max-target-length", type=int, default=512)
+    parser.add_argument("--learning-rate", type=float, default=5e-5)
+    parser.add_argument("--weight-decay", type=float, default=0.01)
+    parser.add_argument("--warmup-ratio", type=float, default=0.1)
+    parser.add_argument("--num-train-epochs", type=float, default=2.0)
+    parser.add_argument("--per-device-train-batch-size", type=int, default=16)
+    parser.add_argument("--per-device-eval-batch-size", type=int, default=16)
+    parser.add_argument("--gradient-accumulation-steps", type=int, default=4)
+    parser.add_argument("--save-steps", type=int, default=500)
+    parser.add_argument("--eval-steps", type=int, default=500)
+    parser.add_argument("--logging-steps", type=int, default=50)
+    parser.add_argument("--seed", type=int, default=42)
+    parser.add_argument("--bf16", action="store_true")
+    parser.add_argument("--fp16", action="store_true")
+    parser.add_argument("--gradient-checkpointing", action="store_true", default=True)
+    parser.add_argument("--no-gradient-checkpointing", action="store_true")
+    parser.add_argument("--resume-from-checkpoint", type=str, default=None)
+    return parser.parse_args()
+def main() -> int:
+    args = parse_args()
+    set_seed(args.seed)
+    gradient_checkpointing = args.gradient_checkpointing and not args.no_gradient_checkpointing
+    tokenizer = AutoTokenizer.from_pretrained(args.model_path)
+    model = AutoModelForSeq2SeqLM.from_pretrained(args.model_path)
+    if gradient_checkpointing:
+        model.gradient_checkpointing_enable()
+    train_records = load_records(args.train_file)
+    if args.eval_file is not None:
+        eval_records = load_records(args.eval_file)
+    else:
+        train_records, eval_records = split_records(train_records, args.validation_split, args.seed)
+    train_dataset = CorrectionDataset(
+        train_records,
+        tokenizer,
+        max_source_length=args.max_source_length,
+        max_target_length=args.max_target_length,
+    )
+    eval_dataset = None
+    if eval_records:
+        eval_dataset = CorrectionDataset(
+            eval_records,
+            tokenizer,
+            max_source_length=args.max_source_length,
+            max_target_length=args.max_target_length,
+        )
+    data_collator = DataCollatorForSeq2Seq(
+        tokenizer=tokenizer,
+        model=model,
+        label_pad_token_id=-100,
+        pad_to_multiple_of=8 if torch.cuda.is_available() else None,
+    )
+    training_args = Seq2SeqTrainingArguments(
+        output_dir=str(args.output_dir),
+        learning_rate=args.learning_rate,
+        weight_decay=args.weight_decay,
+        warmup_ratio=args.warmup_ratio,
+        num_train_epochs=args.num_train_epochs,
+        per_device_train_batch_size=args.per_device_train_batch_size,
+        per_device_eval_batch_size=args.per_device_eval_batch_size,
+        gradient_accumulation_steps=args.gradient_accumulation_steps,
+        logging_steps=args.logging_steps,
+        save_steps=args.save_steps,
+        eval_steps=args.eval_steps,
+        eval_strategy="steps" if eval_dataset is not None else "no",
+        save_strategy="steps",
+        save_total_limit=2,
+        predict_with_generate=False,
+        report_to=[],
+        bf16=args.bf16,
+        fp16=args.fp16 and not args.bf16,
+        gradient_checkpointing=gradient_checkpointing,
+        dataloader_num_workers=2,
+        remove_unused_columns=False,
+        seed=args.seed,
+    )
+    trainer = AdafactorSeq2SeqTrainer(
+        model=model,
+        args=training_args,
+        train_dataset=train_dataset,
+        eval_dataset=eval_dataset,
+        tokenizer=tokenizer,
+        data_collator=data_collator,
+    )
+    trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)
+    trainer.save_model(args.output_dir)
+    tokenizer.save_pretrained(args.output_dir)
+    return 0
+if __name__ == "__main__":
+    raise SystemExit(main())
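
finetune.py pairs Adafactor with `get_linear_schedule_with_warmup`, and derives the warmup length from `--warmup-ratio`. The learning-rate multiplier that this kind of schedule applies can be written out in a few lines. The sketch below restates the formula in plain Python for illustration only: the function names are hypothetical, the ceiling-based warmup calculation is my understanding of what `TrainingArguments.get_warmup_steps` does, and the step count in the usage section is a made-up number, not a value from an actual run.

```python
import math


def warmup_steps_from_ratio(total_steps: int, warmup_ratio: float) -> int:
    """Number of warmup steps for a given ratio, rounded up."""
    return math.ceil(total_steps * warmup_ratio)


def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Ramp linearly from 0 to 1 during warmup.
        return step / max(1, warmup_steps)
    # Then decay linearly from 1 to 0 at the final step.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


if __name__ == "__main__":
    base_lr = 5e-5       # default of --learning-rate in finetune.py
    total_steps = 1000   # illustrative value only
    warmup = warmup_steps_from_ratio(total_steps, 0.1)
    for step in (0, warmup // 2, warmup, total_steps // 2, total_steps):
        lr = base_lr * linear_warmup_decay(step, warmup, total_steps)
        print(f"step {step:>4}: lr = {lr:.2e}")
```

Because the script constructs Adafactor with `relative_step=False` and `scale_parameter=False`, the optimizer uses the externally scheduled learning rate rather than its own internal one, which is why the scheduler shape matters here.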

generation_config.json ADDED Viewed

	@@ -0,0 +1,7 @@

+{
+  "_from_model_config": true,
+  "decoder_start_token_id": 0,
+  "eos_token_id": 1,
+  "pad_token_id": 0,
+  "transformers_version": "4.57.3"
+}

model.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:60935f88716bf89db9795b5d2e7685c1198e28dcf40d058adf4659393f41dbdb
+size 1198571496

special_tokens_map.json ADDED Viewed

	@@ -0,0 +1,150 @@

+{
+  "additional_special_tokens": [
+    "<extra_id_0>",
+    "<extra_id_1>",
+    "<extra_id_2>",
+    "<extra_id_3>",
+    "<extra_id_4>",
+    "<extra_id_5>",
+    "<extra_id_6>",
+    "<extra_id_7>",
+    "<extra_id_8>",
+    "<extra_id_9>",
+    "<extra_id_10>",
+    "<extra_id_11>",
+    "<extra_id_12>",
+    "<extra_id_13>",
+    "<extra_id_14>",
+    "<extra_id_15>",
+    "<extra_id_16>",
+    "<extra_id_17>",
+    "<extra_id_18>",
+    "<extra_id_19>",
+    "<extra_id_20>",
+    "<extra_id_21>",
+    "<extra_id_22>",
+    "<extra_id_23>",
+    "<extra_id_24>",
+    "<extra_id_25>",
+    "<extra_id_26>",
+    "<extra_id_27>",
+    "<extra_id_28>",
+    "<extra_id_29>",
+    "<extra_id_30>",
+    "<extra_id_31>",
+    "<extra_id_32>",
+    "<extra_id_33>",
+    "<extra_id_34>",
+    "<extra_id_35>",
+    "<extra_id_36>",
+    "<extra_id_37>",
+    "<extra_id_38>",
+    "<extra_id_39>",
+    "<extra_id_40>",
+    "<extra_id_41>",
+    "<extra_id_42>",
+    "<extra_id_43>",
+    "<extra_id_44>",
+    "<extra_id_45>",
+    "<extra_id_46>",
+    "<extra_id_47>",
+    "<extra_id_48>",
+    "<extra_id_49>",
+    "<extra_id_50>",
+    "<extra_id_51>",
+    "<extra_id_52>",
+    "<extra_id_53>",
+    "<extra_id_54>",
+    "<extra_id_55>",
+    "<extra_id_56>",
+    "<extra_id_57>",
+    "<extra_id_58>",
+    "<extra_id_59>",
+    "<extra_id_60>",
+    "<extra_id_61>",
+    "<extra_id_62>",
+    "<extra_id_63>",
+    "<extra_id_64>",
+    "<extra_id_65>",
+    "<extra_id_66>",
+    "<extra_id_67>",
+    "<extra_id_68>",
+    "<extra_id_69>",
+    "<extra_id_70>",
+    "<extra_id_71>",
+    "<extra_id_72>",
+    "<extra_id_73>",
+    "<extra_id_74>",
+    "<extra_id_75>",
+    "<extra_id_76>",
+    "<extra_id_77>",
+    "<extra_id_78>",
+    "<extra_id_79>",
+    "<extra_id_80>",
+    "<extra_id_81>",
+    "<extra_id_82>",
+    "<extra_id_83>",
+    "<extra_id_84>",
+    "<extra_id_85>",
+    "<extra_id_86>",
+    "<extra_id_87>",
+    "<extra_id_88>",
+    "<extra_id_89>",
+    "<extra_id_90>",
+    "<extra_id_91>",
+    "<extra_id_92>",
+    "<extra_id_93>",
+    "<extra_id_94>",
+    "<extra_id_95>",
+    "<extra_id_96>",
+    "<extra_id_97>",
+    "<extra_id_98>",
+    "<extra_id_99>",
+    "<extra_id_100>",
+    "<extra_id_101>",
+    "<extra_id_102>",
+    "<extra_id_103>",
+    "<extra_id_104>",
+    "<extra_id_105>",
+    "<extra_id_106>",
+    "<extra_id_107>",
+    "<extra_id_108>",
+    "<extra_id_109>",
+    "<extra_id_110>",
+    "<extra_id_111>",
+    "<extra_id_112>",
+    "<extra_id_113>",
+    "<extra_id_114>",
+    "<extra_id_115>",
+    "<extra_id_116>",
+    "<extra_id_117>",
+    "<extra_id_118>",
+    "<extra_id_119>",
+    "<extra_id_120>",
+    "<extra_id_121>",
+    "<extra_id_122>",
+    "<extra_id_123>",
+    "<extra_id_124>"
+  ],
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}

test_model.py ADDED Viewed

	@@ -0,0 +1,169 @@

+#!/usr/bin/env python3
+"""Run example inference for rubai-corrector-base."""
+from __future__ import annotations
+import argparse
+import json
+from pathlib import Path
+import torch
+from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+EXAMPLES = [
+    {
+        "category": "abbreviation",
+        "input": "telefon rqami qaysi",
+        "expected": "Telefon raqami qaysi",
+    },
+    {
+        "category": "apostrophe",
+        "input": "men ozim kordim",
+        "expected": "Men o'zim ko'rdim",
+    },
+    {
+        "category": "apostrophe",
+        "input": "togri yoldan boring",
+        "expected": "To'g'ri yo'ldan boring",
+    },
+    {
+        "category": "ocr",
+        "input": "rnen universitetda oqiyrnan",
+        "expected": "Men universitetda o'qiyman",
+    },
+    {
+        "category": "ocr",
+        "input": "bu juda rnuhirn masala",
+        "expected": "Bu juda muhim masala",
+    },
+    {
+        "category": "numbers",
+        "input": "narxi yigirma besh ming so'm",
+        "expected": "Narxi 25 000 so'm",
+    },
+    {
+        "category": "numbers",
+        "input": "uchrashuv o'n beshinchi yanvar kuni",
+        "expected": "Uchrashuv 15-yanvar kuni",
+    },
+    {
+        "category": "mixed_uz_ru",
+        "input": "men segodnya bozorga bordim",
+        "expected": "Men сегодня bozorga bordim",
+    },
+    {
+        "category": "mixed_script",
+        "input": "privet kak делa",
+        "expected": "Привет как дела",
+    },
+    {
+        "category": "uzbek_cleanup",
+        "input": "xamma narsa tayyor",
+        "expected": "Hamma narsa tayyor",
+    },
+]
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument(
+        "--model-path",
+        type=Path,
+        default=Path(__file__).resolve().parent,
+        help="Path to the packaged model folder.",
+    )
+    parser.add_argument(
+        "--device",
+        default="cuda:0" if torch.cuda.is_available() else "cpu",
+        help="Inference device, for example cuda:0 or cpu.",
+    )
+    parser.add_argument(
+        "--text",
+        type=str,
+        default=None,
+        help="Run a single custom input instead of the built-in example suite.",
+    )
+    parser.add_argument(
+        "--max-new-tokens",
+        type=int,
+        default=256,
+        help="Maximum generation length.",
+    )
+    parser.add_argument(
+        "--json",
+        action="store_true",
+        help="Print results as JSON.",
+    )
+    return parser.parse_args()
+def load_model(model_path: Path, device: str):
+    tokenizer = AutoTokenizer.from_pretrained(model_path)
+    model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
+    model.to(device)
+    model.eval()
+    return tokenizer, model
+def predict(texts: list[str], tokenizer, model, device: str, max_new_tokens: int) -> list[str]:
+    prompts = [f"correct: {text}" for text in texts]
+    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
+    inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
+    with torch.inference_mode():
+        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
+    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
+def main() -> int:
+    args = parse_args()
+    tokenizer, model = load_model(args.model_path, args.device)
+    if args.text is not None:
+        prediction = predict([args.text], tokenizer, model, args.device, args.max_new_tokens)[0]
+        if args.json:
+            print(json.dumps({"input": args.text, "prediction": prediction}, ensure_ascii=False, indent=2))
+        else:
+            print(f"Input:      {args.text}")
+            print(f"Prediction: {prediction}")
+        return 0
+    predictions = predict(
+        [example["input"] for example in EXAMPLES],
+        tokenizer,
+        model,
+        args.device,
+        args.max_new_tokens,
+    )
+    results = []
+    for example, prediction in zip(EXAMPLES, predictions):
+        results.append(
+            {
+                "category": example["category"],
+                "input": example["input"],
+                "expected": example["expected"],
+                "prediction": prediction,
+                "exact_match": prediction == example["expected"],
+            }
+        )
+    if args.json:
+        print(json.dumps(results, ensure_ascii=False, indent=2))
+        return 0
+    print(f"Model: {args.model_path}")
+    print(f"Device: {args.device}")
+    print()
+    for row in results:
+        print(f"[{row['category']}]")
+        print(f"Input:      {row['input']}")
+        print(f"Expected:   {row['expected']}")
+        print(f"Prediction: {row['prediction']}")
+        print(f"Exact:      {row['exact_match']}")
+        print()
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
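
The `predict` function in the script above tokenizes all inputs as one padded batch. For longer input lists it may help to split the work into smaller batches. The following is a minimal sketch of that idea, not part of the repository: `chunked`, `predict_in_chunks`, `predict_fn`, and `batch_size` are hypothetical names, and `predict_fn` is assumed to be any callable that maps a list of strings to a list of strings of the same length.

```python
from __future__ import annotations

from typing import Callable, Iterator


def chunked(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def predict_in_chunks(
    texts: list[str],
    predict_fn: Callable[[list[str]], list[str]],
    batch_size: int = 8,  # hypothetical default, tune for the available memory
) -> list[str]:
    """Call `predict_fn` on one chunk at a time and concatenate the outputs in order."""
    outputs: list[str] = []
    for batch in chunked(texts, batch_size):
        outputs.extend(predict_fn(batch))
    return outputs
```

With the script's own objects, `predict_fn` could be something like `lambda batch: predict(batch, tokenizer, model, device, max_new_tokens)`.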

tokenizer_config.json ADDED

	@@ -0,0 +1,1163 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "259": {
+      "content": "<extra_id_0>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "260": {
+      "content": "<extra_id_1>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "261": {
+      "content": "<extra_id_2>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "262": {
+      "content": "<extra_id_3>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "263": {
+      "content": "<extra_id_4>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "264": {
+      "content": "<extra_id_5>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "265": {
+      "content": "<extra_id_6>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "266": {
+      "content": "<extra_id_7>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "267": {
+      "content": "<extra_id_8>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "268": {
+      "content": "<extra_id_9>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "269": {
+      "content": "<extra_id_10>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "270": {
+      "content": "<extra_id_11>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "271": {
+      "content": "<extra_id_12>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "272": {
+      "content": "<extra_id_13>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "273": {
+      "content": "<extra_id_14>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "274": {
+      "content": "<extra_id_15>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "275": {
+      "content": "<extra_id_16>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "276": {
+      "content": "<extra_id_17>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "277": {
+      "content": "<extra_id_18>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "278": {
+      "content": "<extra_id_19>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "279": {
+      "content": "<extra_id_20>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "280": {
+      "content": "<extra_id_21>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "281": {
+      "content": "<extra_id_22>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "282": {
+      "content": "<extra_id_23>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "283": {
+      "content": "<extra_id_24>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "284": {
+      "content": "<extra_id_25>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "285": {
+      "content": "<extra_id_26>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "286": {
+      "content": "<extra_id_27>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "287": {
+      "content": "<extra_id_28>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "288": {
+      "content": "<extra_id_29>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "289": {
+      "content": "<extra_id_30>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "290": {
+      "content": "<extra_id_31>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "291": {
+      "content": "<extra_id_32>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "292": {
+      "content": "<extra_id_33>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "293": {
+      "content": "<extra_id_34>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "294": {
+      "content": "<extra_id_35>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "295": {
+      "content": "<extra_id_36>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "296": {
+      "content": "<extra_id_37>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "297": {
+      "content": "<extra_id_38>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "298": {
+      "content": "<extra_id_39>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "299": {
+      "content": "<extra_id_40>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "300": {
+      "content": "<extra_id_41>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "301": {
+      "content": "<extra_id_42>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "302": {
+      "content": "<extra_id_43>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "303": {
+      "content": "<extra_id_44>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "304": {
+      "content": "<extra_id_45>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "305": {
+      "content": "<extra_id_46>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "306": {
+      "content": "<extra_id_47>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "307": {
+      "content": "<extra_id_48>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "308": {
+      "content": "<extra_id_49>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "309": {
+      "content": "<extra_id_50>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "310": {
+      "content": "<extra_id_51>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "311": {
+      "content": "<extra_id_52>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "312": {
+      "content": "<extra_id_53>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "313": {
+      "content": "<extra_id_54>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "314": {
+      "content": "<extra_id_55>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "315": {
+      "content": "<extra_id_56>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "316": {
+      "content": "<extra_id_57>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "317": {
+      "content": "<extra_id_58>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "318": {
+      "content": "<extra_id_59>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "319": {
+      "content": "<extra_id_60>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "320": {
+      "content": "<extra_id_61>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "321": {
+      "content": "<extra_id_62>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "322": {
+      "content": "<extra_id_63>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "323": {
+      "content": "<extra_id_64>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "324": {
+      "content": "<extra_id_65>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "325": {
+      "content": "<extra_id_66>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "326": {
+      "content": "<extra_id_67>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "327": {
+      "content": "<extra_id_68>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "328": {
+      "content": "<extra_id_69>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "329": {
+      "content": "<extra_id_70>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "330": {
+      "content": "<extra_id_71>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "331": {
+      "content": "<extra_id_72>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "332": {
+      "content": "<extra_id_73>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "333": {
+      "content": "<extra_id_74>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "334": {
+      "content": "<extra_id_75>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "335": {
+      "content": "<extra_id_76>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "336": {
+      "content": "<extra_id_77>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "337": {
+      "content": "<extra_id_78>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "338": {
+      "content": "<extra_id_79>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "339": {
+      "content": "<extra_id_80>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "340": {
+      "content": "<extra_id_81>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "341": {
+      "content": "<extra_id_82>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "342": {
+      "content": "<extra_id_83>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "343": {
+      "content": "<extra_id_84>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "344": {
+      "content": "<extra_id_85>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "345": {
+      "content": "<extra_id_86>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "346": {
+      "content": "<extra_id_87>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "347": {
+      "content": "<extra_id_88>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "348": {
+      "content": "<extra_id_89>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "349": {
+      "content": "<extra_id_90>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "350": {
+      "content": "<extra_id_91>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "351": {
+      "content": "<extra_id_92>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "352": {
+      "content": "<extra_id_93>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "353": {
+      "content": "<extra_id_94>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "354": {
+      "content": "<extra_id_95>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "355": {
+      "content": "<extra_id_96>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "356": {
+      "content": "<extra_id_97>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "357": {
+      "content": "<extra_id_98>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "358": {
+      "content": "<extra_id_99>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "359": {
+      "content": "<extra_id_100>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "360": {
+      "content": "<extra_id_101>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "361": {
+      "content": "<extra_id_102>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "362": {
+      "content": "<extra_id_103>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "363": {
+      "content": "<extra_id_104>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "364": {
+      "content": "<extra_id_105>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "365": {
+      "content": "<extra_id_106>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "366": {
+      "content": "<extra_id_107>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "367": {
+      "content": "<extra_id_108>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "368": {
+      "content": "<extra_id_109>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "369": {
+      "content": "<extra_id_110>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "370": {
+      "content": "<extra_id_111>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "371": {
+      "content": "<extra_id_112>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "372": {
+      "content": "<extra_id_113>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "373": {
+      "content": "<extra_id_114>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "374": {
+      "content": "<extra_id_115>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "375": {
+      "content": "<extra_id_116>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "376": {
+      "content": "<extra_id_117>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "377": {
+      "content": "<extra_id_118>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "378": {
+      "content": "<extra_id_119>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "379": {
+      "content": "<extra_id_120>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "380": {
+      "content": "<extra_id_121>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "381": {
+      "content": "<extra_id_122>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "382": {
+      "content": "<extra_id_123>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "383": {
+      "content": "<extra_id_124>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<extra_id_0>",
+    "<extra_id_1>",
+    "<extra_id_2>",
+    "<extra_id_3>",
+    "<extra_id_4>",
+    "<extra_id_5>",
+    "<extra_id_6>",
+    "<extra_id_7>",
+    "<extra_id_8>",
+    "<extra_id_9>",
+    "<extra_id_10>",
+    "<extra_id_11>",
+    "<extra_id_12>",
+    "<extra_id_13>",
+    "<extra_id_14>",
+    "<extra_id_15>",
+    "<extra_id_16>",
+    "<extra_id_17>",
+    "<extra_id_18>",
+    "<extra_id_19>",
+    "<extra_id_20>",
+    "<extra_id_21>",
+    "<extra_id_22>",
+    "<extra_id_23>",
+    "<extra_id_24>",
+    "<extra_id_25>",
+    "<extra_id_26>",
+    "<extra_id_27>",
+    "<extra_id_28>",
+    "<extra_id_29>",
+    "<extra_id_30>",
+    "<extra_id_31>",
+    "<extra_id_32>",
+    "<extra_id_33>",
+    "<extra_id_34>",
+    "<extra_id_35>",
+    "<extra_id_36>",
+    "<extra_id_37>",
+    "<extra_id_38>",
+    "<extra_id_39>",
+    "<extra_id_40>",
+    "<extra_id_41>",
+    "<extra_id_42>",
+    "<extra_id_43>",
+    "<extra_id_44>",
+    "<extra_id_45>",
+    "<extra_id_46>",
+    "<extra_id_47>",
+    "<extra_id_48>",
+    "<extra_id_49>",
+    "<extra_id_50>",
+    "<extra_id_51>",
+    "<extra_id_52>",
+    "<extra_id_53>",
+    "<extra_id_54>",
+    "<extra_id_55>",
+    "<extra_id_56>",
+    "<extra_id_57>",
+    "<extra_id_58>",
+    "<extra_id_59>",
+    "<extra_id_60>",
+    "<extra_id_61>",
+    "<extra_id_62>",
+    "<extra_id_63>",
+    "<extra_id_64>",
+    "<extra_id_65>",
+    "<extra_id_66>",
+    "<extra_id_67>",
+    "<extra_id_68>",
+    "<extra_id_69>",
+    "<extra_id_70>",
+    "<extra_id_71>",
+    "<extra_id_72>",
+    "<extra_id_73>",
+    "<extra_id_74>",
+    "<extra_id_75>",
+    "<extra_id_76>",
+    "<extra_id_77>",
+    "<extra_id_78>",
+    "<extra_id_79>",
+    "<extra_id_80>",
+    "<extra_id_81>",
+    "<extra_id_82>",
+    "<extra_id_83>",
+    "<extra_id_84>",
+    "<extra_id_85>",
+    "<extra_id_86>",
+    "<extra_id_87>",
+    "<extra_id_88>",
+    "<extra_id_89>",
+    "<extra_id_90>",
+    "<extra_id_91>",
+    "<extra_id_92>",
+    "<extra_id_93>",
+    "<extra_id_94>",
+    "<extra_id_95>",
+    "<extra_id_96>",
+    "<extra_id_97>",
+    "<extra_id_98>",
+    "<extra_id_99>",
+    "<extra_id_100>",
+    "<extra_id_101>",
+    "<extra_id_102>",
+    "<extra_id_103>",
+    "<extra_id_104>",
+    "<extra_id_105>",
+    "<extra_id_106>",
+    "<extra_id_107>",
+    "<extra_id_108>",
+    "<extra_id_109>",
+    "<extra_id_110>",
+    "<extra_id_111>",
+    "<extra_id_112>",
+    "<extra_id_113>",
+    "<extra_id_114>",
+    "<extra_id_115>",
+    "<extra_id_116>",
+    "<extra_id_117>",
+    "<extra_id_118>",
+    "<extra_id_119>",
+    "<extra_id_120>",
+    "<extra_id_121>",
+    "<extra_id_122>",
+    "<extra_id_123>",
+    "<extra_id_124>"
+  ],
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "</s>",
+  "extra_ids": 0,
+  "extra_special_tokens": {},
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": "<pad>",
+  "tokenizer_class": "ByT5Tokenizer",
+  "unk_token": "<unk>"
+}
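
The config above places `<pad>`, `</s>`, and `<unk>` at ids 0 to 2 and `<extra_id_0>` through `<extra_id_124>` at ids 259 to 383. The unlisted range 3 to 258 is consistent with the usual ByT5 layout, in which each UTF-8 byte value is shifted by the number of leading special tokens. That byte offset is an assumption drawn from the ByT5 tokenizer design, not something this file states. A standard-library sketch of the layout, with hypothetical helper names:

```python
from __future__ import annotations

# Ids 0-2 as listed in tokenizer_config.json.
SPECIAL_TOKENS = {0: "<pad>", 1: "</s>", 2: "<unk>"}
EOS_ID = 1

# Assumption: byte value b maps to id b + 3 (the ByT5 convention).
BYTE_OFFSET = len(SPECIAL_TOKENS)
FIRST_EXTRA_ID = BYTE_OFFSET + 256  # 259, where the config puts <extra_id_0>
NUM_EXTRA_IDS = 125                 # <extra_id_0> .. <extra_id_124>


def encode(text: str, add_eos: bool = True) -> list[int]:
    """Map each UTF-8 byte to its id and optionally append the </s> id."""
    ids = [b + BYTE_OFFSET for b in text.encode("utf-8")]
    return ids + [EOS_ID] if add_eos else ids


def decode(ids: list[int]) -> str:
    """Turn byte ids back into text, skipping special and extra ids."""
    data = bytes(i - BYTE_OFFSET for i in ids if BYTE_OFFSET <= i < FIRST_EXTRA_ID)
    return data.decode("utf-8", errors="ignore")


def extra_id_token(token_id: int) -> str:
    """Return the sentinel token name for an id in the 259-383 range."""
    index = token_id - FIRST_EXTRA_ID
    if not 0 <= index < NUM_EXTRA_IDS:
        raise ValueError(f"{token_id} is not an extra id")
    return f"<extra_id_{index}>"
```

Because the vocabulary is byte-level, any input string can be encoded without falling back to `<unk>`, at the cost of longer sequences than a subword tokenizer would produce.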