ace-1 committed
Commit 09246b1 · verified · 1 parent: 63ab8c9

Publish mgpt2 sft checkpoint (step 1262, val_loss 1.240358)
README.md ADDED
@@ -0,0 +1,159 @@
---
language:
- en
- hi
- kn
license: mit
tags:
- causal-lm
- multilingual
- indic
- hindi
- kannada
- instruction-tuned
- text-generation-inference
pipeline_tag: text-generation
base_model: ace-1/mgpt2-pretrain
---

# mgpt2-sft — Multilingual GPT-2 (Instruction-Tuned)

`mgpt2` fine-tuned on **30,000 multilingual instruction–response pairs** across 5 language variants:
English, Hindi (Devanagari), Hindi (Latin transliteration), Kannada (Kannada script), and Kannada
(Latin transliteration). Training data comes from ai4bharat/indic-align (Anudesh, Dolly-T, OpenAssistant-T).

Built on top of the pretrained `mgpt2` base — same 124M-parameter architecture, same custom multilingual
tokenizer. Trained with masked cross-entropy (loss computed over response tokens only).
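A minimal illustration of the masked objective (not the actual training code; the token ids, shapes, and the prompt length of 3 are made up):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(1, 6, 50304)                    # (B, T, vocab) from the model
targets = torch.tensor([[11, 42, 7, 99, 3, 50256]])  # hypothetical token ids

masked = targets.clone()
masked[:, :3] = -100                                 # mask the 3 prompt tokens
loss = F.cross_entropy(
    logits.view(-1, logits.size(-1)),
    masked.view(-1),
    ignore_index=-100,                               # prompt positions contribute no loss
)
```

With `ignore_index=-100`, the loss averages only over the response positions, so gradients never reward memorising the prompt.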
## Quick start

```python
import sys, torch
import torch.nn.functional as F
from huggingface_hub import snapshot_download

local = snapshot_download("ace-1/mgpt2-sft")
sys.path.insert(0, local)
from model import GPT
from tokenizer.regex_tokenizer import RegexTokenizer

ckpt = torch.load(f"{local}/pytorch_model.pt", weights_only=False, map_location="cpu")
model = GPT(ckpt["config"])
model.load_state_dict(ckpt["model"])
model.eval()

enc = RegexTokenizer()
enc.load(f"{local}/tokenizer/artifacts/mgpt2.model")

# Prompts are plain text; no chat template is needed
prompts = [
    "What is the capital of Karnataka?",  # English
    "कर्नाटक की राजधानी क्या है?",             # Hindi (Devanagari)
    "ಕರ್ನಾಟಕದ ರಾಜಧಾನಿ ಯಾವುದು?",       # Kannada script
]

for prompt in prompts:
    ids = enc.encode(prompt)
    x = torch.tensor(ids, dtype=torch.long).unsqueeze(0)
    with torch.no_grad():
        for _ in range(120):
            logits, _ = model(x[:, -1024:])                    # crop to context length
            probs = F.softmax(logits[:, -1, :] / 0.7, dim=-1)  # temperature 0.7
            next_id = torch.multinomial(probs, num_samples=1)
            if next_id.item() == 50256:                        # <|endoftext|>
                break
            x = torch.cat([x, next_id], dim=1)
    print(f"Prompt : {prompt}")
    print(f"Response: {enc.decode(x[0, len(ids):].tolist())}")
    print()
```

## Intended use

**Good for:**
- Multilingual Q&A and instruction following (en/hi/kn, native + romanised scripts)
- A starting point for downstream fine-tuning on Indic NLP tasks
- Research on multilingual instruction tuning at small scale

**Not for:** safety-critical applications. Native-script variants (Devanagari, Kannada) are more reliable than
transliterated Latin variants, which are prone to mid-generation script drift (a known limitation —
see training notes).

## Model details

| Property | Value |
|---|---|
| Architecture | GPT-2 (12 layers / 12 heads / 768d) |
| Parameters | ~124M |
| Vocabulary | 50,257 (mgpt2 BPE), padded to 50,304 |
| Context length | 1,024 tokens |
| Training stage | SFT (instruction-tuned) |
| Git commit | `d07224070033` |

## Training configuration

| Parameter | Value |
|---|---|
| `seed` | `1337` |
| `batch_size` | `64` |
| `micro_batch_size` | `8` |
| `epochs` | `3` |
| `warmup_steps` | `50` |
| `max_lr` | `0.0003` |
| `min_lr_ratio` | `0.1` |
| `weight_decay` | `0.1` |
| `eval_interval` | `50` |

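`batch_size` 64 with `micro_batch_size` 8 implies 8 gradient-accumulation steps per optimizer step on a single device, and the warmup/min-LR values suggest linear warmup followed by cosine decay. A hedged sketch of such a schedule (the exact schedule used in training is an assumption; the constants come from the table above):

```python
import math

def get_lr(step, max_steps=1262, warmup_steps=50, max_lr=3e-4, min_lr_ratio=0.1):
    """Linear warmup to max_lr, then cosine decay to max_lr * min_lr_ratio."""
    min_lr = max_lr * min_lr_ratio
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays 1 -> 0
    return min_lr + coeff * (max_lr - min_lr)
```

At step 49 the rate peaks at 3e-4, then decays smoothly toward 3e-5 by step 1262.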
## Evaluation

| Metric | Value | Notes |
|---|---|---|
| Val loss (masked CE) | 1.2404 | Response tokens only, held-out SFT set |
| Val PPL (SFT set) | 3.46 | Not comparable to pretrain LM PPL |
| Training steps | 1,262 | 3 epochs over 30K examples |

> SFT val PPL is measured on the SFT held-out set (a narrower domain) and is **not comparable**
> to the pretrain LM eval PPL (12.4), which measures general language-modelling ability.

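Perplexity here is just the exponential of the mean masked cross-entropy:

```python
import math

val_loss = 1.2404         # masked CE from the table above
ppl = math.exp(val_loss)  # perplexity = exp(mean cross-entropy)
print(round(ppl, 2))      # 3.46
```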
## Training data

| Language | Count | Source |
|---|---|---|
| English (`eng_Latn`) | 16,500 | [ai4bharat/indic-align](https://huggingface.co/datasets/ai4bharat/indic-align) Anudesh |
| Hindi Devanagari (`hin_Deva`) | 5,400 | indic-align Dolly-T + OpenAssistant-T |
| Kannada script (`kan_Knda`) | 3,900 | indic-align Dolly-T + OpenAssistant-T |
| Hindi Latin translit (`hin_Latn`) | 2,100 | indic-align Dolly-T + OpenAssistant-T |
| Kannada Latin translit (`kan_Latn`) | 2,100 | indic-align Dolly-T + OpenAssistant-T |

30,000 examples total, with a 90/10 train/val split. Masked CE — loss computed over response tokens only.

## Tokenizer

Custom multilingual regex + BPE tokenizer (`mgpt2`), trained on the same corpus mixture.
Same vocabulary size as tiktoken-gpt2 (50,257 tokens), but with Indic-aware merge priorities:

| Bucket | tiktoken-gpt2 | **mgpt2** | Δ |
|---|---:|---:|---:|
| Overall | 480 tok/kB | **223 tok/kB** | −54% |
| Devanagari | 592 tok/kB | **215 tok/kB** | −64% |
| Kannada | 981 tok/kB | **213 tok/kB** | −78% |
| Latin | 257 tok/kB | **230 tok/kB** | −10% |

Tokenizer published separately: [ace-1/mgpt2-tokenizer](https://huggingface.co/ace-1/mgpt2-tokenizer)

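tok/kB is tokens emitted per kilobyte of UTF-8 input (lower is better). A sketch of how such a figure can be computed; the exact evaluation corpus and rounding used for the table are assumptions:

```python
def tokens_per_kb(num_tokens: int, text: str) -> float:
    """Tokens emitted per kilobyte (1024 bytes) of UTF-8 encoded text."""
    kb = len(text.encode("utf-8")) / 1024
    return num_tokens / kb

# 1 KiB of ASCII encoded into 230 tokens -> 230 tok/kB (the Latin bucket's rate)
print(tokens_per_kb(230, "a" * 1024))  # 230.0
```

Multi-byte Indic scripts are where byte-level BPE without Indic merges explodes: each Devanagari or Kannada character is 3 UTF-8 bytes, so a tokenizer without dedicated merges emits several tokens per character.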
## Known limitations

- **Transliterated Latin script drift.** `hin_Latn` and `kan_Latn` outputs may switch scripts mid-generation. Cause: ASCII tokens are shared with English, so there is no Unicode anchor. Mitigated but not eliminated at this data scale.
- **124M parameters.** Factual accuracy and multi-step reasoning are limited.
- **No safety alignment.** The SFT model was trained on benign instruction data only; it may attempt to answer harmful prompts. Use the DPO variant for light safety alignment.
- **Research checkpoint** — not evaluated for production use.

## Citation

```bibtex
@misc{mgpt2,
  title = {mgpt2: Multilingual GPT-2 with custom Indic tokenizer},
  year  = {2026},
  note  = {Pretrain → SFT → DPO pipeline for English/Hindi/Kannada},
  url   = {https://huggingface.co/ace-1/mgpt2-sft}
}
```
config.json ADDED
@@ -0,0 +1,12 @@
{
  "architectures": [
    "GPT"
  ],
  "model_type": "mgpt2",
  "block_size": 1024,
  "vocab_size": 50304,
  "n_layer": 12,
  "n_head": 12,
  "n_embd": 768,
  "tokenizer_kind": "mgpt2_regex_bpe"
}
model.py ADDED
@@ -0,0 +1,184 @@
from dataclasses import dataclass
import torch
import torch.nn as nn
import torch.nn.functional as F
import inspect


@dataclass
class GPTConfig:
    block_size: int = 1024   # sequence length
    vocab_size: int = 50257  # number of tokens: 50,000 BPE merges + 256 byte tokens + 1 <|endoftext|> token
    n_layer: int = 12        # number of layers
    n_head: int = 12         # number of attention heads
    n_embd: int = 768        # embedding dimension


class CausalSelfAttention(nn.Module):
    def __init__(self, config) -> None:
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.c_attn = nn.Linear(config.n_embd, config.n_embd * 3)
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.c_proj.NANOGPT_SCALE_INIT = 1
        self.n_head = config.n_head
        self.n_embd = config.n_embd

    def forward(self, x):
        B, T, C = x.size()
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.n_embd, dim=2)

        q = q.reshape(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.reshape(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.reshape(B, T, self.n_head, C // self.n_head).transpose(1, 2)

        # Naive attention, kept for reference:
        # att = q @ k.transpose(-2, -1) * (1.0 / math.sqrt(k.size(-1)))
        # att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
        # att = F.softmax(att, dim=-1)
        # y = att @ v
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)

        y = y.transpose(1, 2).contiguous().view(B, T, C)
        y = self.c_proj(y)
        return y


class MLP(nn.Module):
    def __init__(self, config: GPTConfig):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU(approximate="tanh")
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)
        self.c_proj.NANOGPT_SCALE_INIT = 1

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x


class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # (B, T, C)
        x = x + self.mlp(self.ln_2(x))   # (B, T, C)
        return x


class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config

        self.transformer = nn.ModuleDict(dict(
            wte=nn.Embedding(config.vocab_size, config.n_embd),               # token embedding table
            wpe=nn.Embedding(config.block_size, config.n_embd),               # position embedding table
            h=nn.ModuleList([Block(config) for _ in range(config.n_layer)]),  # transformer layers
            ln_f=nn.LayerNorm(config.n_embd),                                 # final layer norm
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)  # language modeling head

        # weight sharing scheme
        self.transformer.wte.weight = self.lm_head.weight

        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            std = 0.02
            if hasattr(module, 'NANOGPT_SCALE_INIT'):
                std *= (2 * self.config.n_layer) ** -0.5
            torch.nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.size()  # (B, T) = batch size, sequence length
        assert T <= self.config.block_size, f"Cannot forward sequence of length {T}, block size is only {self.config.block_size}"

        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
        tok_emb = self.transformer.wte(idx)  # (B, T, n_embd)
        pos_emb = self.transformer.wpe(pos)  # (T, n_embd)
        x = tok_emb + pos_emb                # (B, T, n_embd)

        for block in self.transformer.h:
            x = block(x)

        x = self.transformer.ln_f(x)  # (B, T, n_embd)
        logits = self.lm_head(x)      # (B, T, vocab_size)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

        return logits, loss

    @classmethod
    def from_pretrained(cls, model_type):
        assert model_type in {'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}
        from transformers import GPT2LMHeadModel
        print(f"loading weights from pretrained gpt {model_type}..")

        config_args = {
            "gpt2":        dict(n_layer=12, n_head=12, n_embd=768),
            "gpt2-medium": dict(n_layer=24, n_head=16, n_embd=1024),
            "gpt2-large":  dict(n_layer=36, n_head=20, n_embd=1280),
            "gpt2-xl":     dict(n_layer=48, n_head=25, n_embd=1600),
        }[model_type]
        config_args['vocab_size'] = 50257
        config_args['block_size'] = 1024

        config = GPTConfig(**config_args)
        model = GPT(config)
        sd = model.state_dict()
        sd_keys = [k for k in sd.keys() if not k.endswith('.attn.bias')]

        model_hf = GPT2LMHeadModel.from_pretrained(model_type)
        sd_hf = model_hf.state_dict()

        sd_keys_hf = [k for k in sd_hf.keys() if not k.endswith('.attn.bias')]
        sd_keys_hf = [k for k in sd_keys_hf if not k.endswith('.attn.masked_bias')]
        # these weights are stored transposed in the HF checkpoint (Conv1D vs Linear)
        transposed_keys = ['attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight']
        assert len(sd_keys_hf) == len(sd_keys), f"Mismatch: {len(sd_keys_hf)} != {len(sd_keys)}"
        for k in sd_keys_hf:
            if any(k.endswith(suffix) for suffix in transposed_keys):
                assert sd_hf[k].shape[::-1] == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k].T)
            else:
                assert sd_hf[k].shape == sd[k].shape
                with torch.no_grad():
                    sd[k].copy_(sd_hf[k])
        return model

    def configure_optimizers(self, weight_decay, learning_rate, device_type):
        # start with all parameters that require gradients
        param_dict = {pn: p for pn, p in self.named_parameters() if p.requires_grad}
        # create optim groups. Any parameters that are 2D are going to be weight decayed,
        # i.e. all weight tensors in matmuls + embeddings. All biases and layernorms are not.
        decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
        non_decay_params = [p for n, p in param_dict.items() if p.dim() < 2]
        optim_groups = [
            {'params': decay_params, 'weight_decay': weight_decay},
            {'params': non_decay_params, 'weight_decay': 0.0},
        ]
        # create AdamW optimizer and use the fused version if it is available
        fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
        use_fused = fused_available and device_type == 'cuda'
        optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=(0.9, 0.95), eps=1e-8, fused=use_fused)
        return optimizer
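The weight-sharing line in `GPT.__init__` ties the token embedding and the LM head to a single tensor. A self-contained sketch of the effect, with sizes taken from the SFT config:

```python
import torch.nn as nn

# Minimal illustration of the weight-sharing scheme in GPT.__init__:
# the token embedding and the LM head point at the same Parameter, so the
# model stores one (vocab_size, n_embd) matrix instead of two.
vocab_size, n_embd = 50304, 768
wte = nn.Embedding(vocab_size, n_embd)
lm_head = nn.Linear(n_embd, vocab_size, bias=False)
wte.weight = lm_head.weight  # tie: both modules now share storage

assert wte.weight.data_ptr() == lm_head.weight.data_ptr()
print(vocab_size * n_embd)   # 38633472 parameters shared rather than duplicated
```

Besides halving those parameters, tying keeps the input and output token representations consistent, which empirically helps small models.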
pytorch_model.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3876db428ed2c0ab1c6152b7e1221be21d8a01a9f52f752c6ffc17c988121cbb
size 497958335
tokenization_mgpt2.py ADDED
@@ -0,0 +1,3 @@
from tokenizer.hf_tokenizer import MGPT2Tokenizer

__all__ = ['MGPT2Tokenizer']
tokenizer/__init__.py ADDED
@@ -0,0 +1,15 @@
from .base import Tokenizer
from .basic import BasicTokenizer
from .regex_tokenizer import RegexTokenizer
from .gpt4 import GPT4Tokenizer
from .patterns import GPT4_SPLIT_PATTERN, INDIC_SPLIT_PATTERN

__all__ = [
    "Tokenizer",
    "BasicTokenizer",
    "RegexTokenizer",
    "GPT4Tokenizer",
    "GPT4_SPLIT_PATTERN",
    "INDIC_SPLIT_PATTERN",
]
tokenizer/artifacts/mgpt2.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f2911100f93f224a36cfd6a40de8739a12f3fe7b0b885cd0edc961c6e5e6c4b1
size 463596
tokenizer/base.py ADDED
@@ -0,0 +1,158 @@
"""
A minimal implementation of Byte-Pair Encoding (BPE) tokenization.

BPE is a subword tokenization algorithm that iteratively merges the most frequent pairs of bytes or characters
to build a vocabulary of subword tokens. This implementation is inspired by Andrej Karpathy's minbpe
(https://github.com/karpathy/minbpe).
"""
import unicodedata


def get_stats(ids, freq):
    # count adjacent-pair frequencies into `freq` in place
    for pair in zip(ids[:-1], ids[1:]):
        freq[pair] = freq.get(pair, 0) + 1


def merge(ids, pair, idx):
    # replace every non-overlapping occurrence of `pair` in `ids` with token `idx`
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids


def visualise_tokens(token_values: list[bytes]) -> None:
    background = [f"\u001b[48;5;{i}m" for i in [167, 179, 185, 77, 80, 68, 134]]
    # If token boundaries do not occur at unicode character boundaries, it's unclear how best to
    # visualise the token. Here, we'll just use the unicode replacement character to represent some
    # fraction of a character.
    unicode_token_values = [x.decode("utf-8", errors="replace") for x in token_values]

    running_length = 0
    last_color = None
    for token in unicode_token_values:
        color = background[running_length % len(background)]
        if color == last_color:
            color = background[(running_length + 1) % len(background)]
            assert color != last_color
        last_color = color
        running_length += len(token)
        print(color + token, end="")
    print("\u001b[0m")


# two helper functions for printing tokens safely...
def replace_control_characters(s: str) -> str:
    # we don't want to print control characters
    # which distort the output (e.g. \n or much worse)
    # https://stackoverflow.com/questions/4324790/removing-control-characters-from-a-string-in-python/19016117#19016117
    # http://www.unicode.org/reports/tr44/#GC_Values_Table
    chars = []
    for ch in s:
        if unicodedata.category(ch)[0] != "C":
            chars.append(ch)  # this character is ok
        else:
            chars.append(f"\\u{ord(ch):04x}")  # escape
    return "".join(chars)


def render_token(t: bytes) -> str:
    # pretty print a token, escaping control characters
    s = t.decode('utf-8', errors='replace')
    s = replace_control_characters(s)
    return s

# --------------------------------------------------------------------------------------------------
class Tokenizer:
    def __init__(self):
        self.merges = {}          # (int, int) -> int
        self.pattern = ""         # str
        self.special_tokens = {}  # str -> int, e.g. {'<|endoftext|>': 100257}
        self.inverse_special_tokens = {}  # int -> str
        self.vocab = self._build_vocab()  # int -> bytes

    def _build_vocab(self):
        vocab = {idx: bytes([idx]) for idx in range(256)}
        for (p0, p1), idx in self.merges.items():
            vocab[idx] = vocab[p0] + vocab[p1]
        return vocab

    def train(self, text, vocab_size, verbose=False):
        raise NotImplementedError

    def decode(self, ids) -> str:
        raise NotImplementedError

    def encode(self, text, verbose=False) -> list[int]:
        raise NotImplementedError

    def save(self, file_prefix):
        """
        Saves two files: file_prefix.vocab and file_prefix.model
        This is inspired by (but not equivalent to!) sentencepiece's model saving:
        - the model file is the critical one, intended for load()
        - the vocab file is just a pretty-printed version for human inspection only
        """
        # write the model: to be used in load() later
        model_file = file_prefix + ".model"
        with open(model_file, 'w') as f:
            # write the version, pattern and merges, that's all that's needed
            f.write("minbpe v1\n")
            f.write(f"{self.pattern}\n")
            # write the special tokens, first the number of them, then each one
            f.write(f"{len(self.special_tokens)}\n")
            for special, idx in self.special_tokens.items():
                f.write(f"{special} {idx}\n")
            # the merges dict
            for idx1, idx2 in self.merges:
                f.write(f"{idx1} {idx2}\n")
        # write the vocab: for the human to look at
        vocab_file = file_prefix + ".vocab"
        inverted_merges = {idx: pair for pair, idx in self.merges.items()}
        with open(vocab_file, "w", encoding="utf-8") as f:
            for idx, token in self.vocab.items():
                # note: many tokens may be partial utf-8 sequences
                # and cannot be decoded into valid strings. Here we're using
                # errors='replace' to replace them with the replacement char �.
                # this also means that we couldn't possibly use .vocab in load()
                # because decoding in this way is a lossy operation!
                s = render_token(token)
                # find the children of this token, if any
                if idx in inverted_merges:
                    # if this token has children, render it nicely as a merge
                    idx0, idx1 = inverted_merges[idx]
                    s0 = render_token(self.vocab[idx0])
                    s1 = render_token(self.vocab[idx1])
                    f.write(f"[{s0}][{s1}] -> [{s}] {idx}\n")
                else:
                    # otherwise this is a leaf token, just print it
                    # (this should just be the first 256 tokens, the bytes)
                    f.write(f"[{s}] {idx}\n")

    def load(self, model_file):
        """Inverse of save(), but only for the model file"""
        assert model_file.endswith(".model")
        # read the model file
        merges = {}
        special_tokens = {}
        idx = 256
        with open(model_file, 'r', encoding="utf-8") as f:
            # read the version
            version = f.readline().strip()
            assert version == "minbpe v1"
            # read the pattern
            self.pattern = f.readline().strip()
            # read the special tokens
            num_special = int(f.readline().strip())
            for _ in range(num_special):
                special, special_idx = f.readline().strip().split()
                special_tokens[special] = int(special_idx)
            # read the merges
            for line in f:
                idx1, idx2 = map(int, line.split())
                merges[(idx1, idx2)] = idx
                idx += 1
        self.merges = merges
        self.special_tokens = special_tokens
        self.inverse_special_tokens = {v: k for k, v in special_tokens.items()}
        self.vocab = self._build_vocab()
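`get_stats` and `merge` above are the two primitives that drive BPE training. A tiny worked example (the helpers are restated so the snippet runs standalone):

```python
# Restated from tokenizer/base.py so this snippet is self-contained.
def get_stats(ids, freq):
    # count adjacent-pair frequencies into `freq` in place
    for pair in zip(ids[:-1], ids[1:]):
        freq[pair] = freq.get(pair, 0) + 1

def merge(ids, pair, idx):
    # replace each non-overlapping occurrence of `pair` with new token `idx`
    newids, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

ids = [1, 2, 1, 2, 3]
freq = {}
get_stats(ids, freq)
print(freq[(1, 2)])            # 2: (1, 2) is the most frequent pair
print(merge(ids, (1, 2), 256)) # [256, 256, 3]
```

One training iteration is exactly this: count pairs, pick the most frequent, mint a new token id for it, and rewrite the sequence.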
tokenizer/hf_tokenizer.py ADDED
@@ -0,0 +1,91 @@
from __future__ import annotations

import os
from typing import Any, Optional

from transformers import PreTrainedTokenizer

from tokenizer.regex_tokenizer import RegexTokenizer


class MGPT2Tokenizer(PreTrainedTokenizer):
    """
    Hugging Face-compatible (slow) tokenizer wrapper around `RegexTokenizer`.

    This is intended for publishing alongside the model using `trust_remote_code=True`.
    """

    model_input_names = ["input_ids", "attention_mask"]
    # Let `PreTrainedTokenizer.from_pretrained()` know which file it should pass to `__init__`.
    vocab_files_names = {"model_file": "tokenizer.model"}

    def __init__(self, model_file: str, **kwargs: Any):
        if not model_file.endswith(".model"):
            raise ValueError(f"model_file must end with .model, got: {model_file}")

        self._tok = RegexTokenizer()
        self._tok.load(model_file)

        # Bind common special tokens if present in the trained tokenizer.
        special = self._tok.special_tokens
        kwargs.setdefault("eos_token", "<|endoftext|>" if "<|endoftext|>" in special else None)
        kwargs.setdefault("unk_token", None)
        kwargs.setdefault("pad_token", None)
        kwargs.setdefault("bos_token", None)

        super().__init__(**kwargs)

        self.model_file = model_file

    @property
    def vocab_size(self) -> int:
        # vocab is sparse only if merges are incomplete; generally size is max_id + 1
        return max(self._tok.vocab.keys()) + 1

    def get_vocab(self) -> dict[str, int]:
        # Provide a stable token-string mapping for HF internals.
        inv_special = self._tok.inverse_special_tokens
        vocab: dict[str, int] = {}
        for i in range(self.vocab_size):
            if i in inv_special:
                vocab[inv_special[i]] = i
            else:
                vocab[f"<|bytebpe_{i}|>"] = i
        return vocab

    def _tokenize(self, text: str, **kwargs: Any) -> list[str]:
        ids = self._tok.encode(text, allowed_special="all")
        inv_special = self._tok.inverse_special_tokens
        out: list[str] = []
        for i in ids:
            out.append(inv_special.get(i, f"<|bytebpe_{i}|>"))
        return out

    def _convert_token_to_id(self, token: str) -> int:
        if token in self._tok.special_tokens:
            return self._tok.special_tokens[token]
        if token.startswith("<|bytebpe_") and token.endswith("|>"):
            inner = token[len("<|bytebpe_") : -len("|>")]
            return int(inner)
        raise KeyError(f"Unknown token string: {token!r}")

    def _convert_id_to_token(self, index: int) -> str:
        return self._tok.inverse_special_tokens.get(index, f"<|bytebpe_{index}|>")

    def convert_tokens_to_string(self, tokens: list[str]) -> str:
        ids = [self._convert_token_to_id(t) for t in tokens]
        return self._tok.decode(ids)

    def build_inputs_with_special_tokens(self, token_ids_0: list[int], token_ids_1: Optional[list[int]] = None) -> list[int]:
        if token_ids_1 is not None:
            raise ValueError("This tokenizer does not support pair inputs.")
        return token_ids_0

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> tuple[str]:
        os.makedirs(save_directory, exist_ok=True)
        prefix = filename_prefix or "tokenizer"
        out_prefix = os.path.join(save_directory, prefix)
        # Save in the native `.model`/`.vocab` format (human + machine readable for this repo).
        self._tok.save(out_prefix)
        return (out_prefix + ".model",)
tokenizer/patterns.py ADDED
@@ -0,0 +1,13 @@
"""
Regex patterns used by tokenizers in this package.

Keep patterns centralized so experiments, training scripts, and notebooks
stay in sync.
"""

# Default GPT-4-ish split pattern (as used in `RegexTokenizer` and `GPT4Tokenizer`)
GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

# Indic-focused experimental pattern (Hindi Devanagari + Kannada ranges and punctuation)
INDIC_SPLIT_PATTERN = r"""(?i) 's|'t|'re|'ve|'m|'ll|'d| ?\b[\p{L}\u0900-\u0963|\u0966-\u097F]+\b| ?\b[\p{L}\u0C80-\u0C9E|\u0CA0-\u0CFF]+\b| ?[\p{N}]+| ?[.,!?;:'\"-]| ?[\u0964-\u0965]| ?[\u0C9E-\u0C9F]| ?[^\s\p{L}\p{N}\u0900-\u097F\u0C80-\u0CFF]+| \s+(?!\S)| \s+"""
tokenizer/regex_tokenizer.py ADDED
@@ -0,0 +1,246 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ try:
2
+ from .base import get_stats, merge, visualise_tokens
3
+ from .basic import BasicTokenizer
4
+ from .patterns import GPT4_SPLIT_PATTERN
5
+ except ImportError: # allow running as a script from inside `tokenizer/`
6
+ from base import get_stats, merge, visualise_tokens
7
+ from basic import BasicTokenizer
8
+ from patterns import GPT4_SPLIT_PATTERN
9
+ from collections import Counter, defaultdict
10
+ import heapq
11
+ import regex as re
12
+ from tqdm import tqdm
13
+ import time
14
+
15
+ class RegexTokenizer(BasicTokenizer):
16
+ def __init__(self, regex: str = GPT4_SPLIT_PATTERN):
17
+ super().__init__()
18
+ self.pattern = regex
19
+ self.regex = re.compile(self.pattern)
20
+
21
+ def register_special_tokens(self, special_tokens: dict[str, int]):
22
+ self.special_tokens = special_tokens
23
+ self.inverse_special_tokens = {v: k for k, v in special_tokens.items()}
24
+
25
+ @staticmethod
26
+ def _merge_word(word: tuple[int, ...], pair: tuple[int, int], new_id: int) -> tuple[int, ...]:
27
+ """Merge all non-overlapping occurrences of `pair` in `word`."""
28
+ out: list[int] = []
29
+ i = 0
30
+ while i < len(word):
31
+ if i < len(word) - 1 and word[i] == pair[0] and word[i + 1] == pair[1]:
32
+ out.append(new_id)
33
+ i += 2
34
+ else:
35
+ out.append(word[i])
36
+ i += 1
37
+ return tuple(out)
38
+
39
+ @staticmethod
40
+ def _pair_occurrences(word: tuple[int, ...]) -> dict[tuple[int, int], int]:
41
+ """Return unweighted pair -> count for a single word/chunk."""
42
+ if len(word) < 2:
43
+ return {}
44
+ counts: dict[tuple[int, int], int] = {}
45
+ a = word[0]
46
+ for b in word[1:]:
47
+ p = (a, b)
48
+ counts[p] = counts.get(p, 0) + 1
49
+ a = b
50
+ return counts
51
+
52
+ def train(
53
+ self,
54
+ text: str,
55
+ vocab_size: int = 50_257,
56
+ verbose: bool = False,
57
+ *,
58
+ min_chunk_freq: int = 1,
59
+ max_chunks: int | None = None,
60
+ ):
61
+ assert vocab_size >= 256, "Vocab size must be at least 256"
62
+ num_merges = vocab_size - 256
63
+
64
+ # Count chunk frequencies without storing a giant list of chunks.
65
+ # Each unique chunk becomes a "word" in classic BPE training.
66
+ chunk_counts: Counter[bytes] = Counter()
67
+ for m in self.regex.finditer(text):
68
+ s = m.group(0)
69
+ if s:
70
+ chunk_counts[s.encode("utf-8")] += 1
71
+
72
+ # Heuristic speed knobs: ignore rare chunks and/or cap unique chunk types.
73
+ # This massively reduces training state on web-scale corpora and keeps code simple.
74
+ if min_chunk_freq > 1:
75
+ chunk_counts = Counter({b: f for b, f in chunk_counts.items() if f >= min_chunk_freq})
76
+ if max_chunks is not None and len(chunk_counts) > max_chunks:
77
+ chunk_counts = Counter(dict(chunk_counts.most_common(max_chunks)))
78
+
79
+ # words: tuple(symbol_ids) -> frequency
80
+ words: dict[tuple[int, ...], int] = {}
81
+ for b, freq in chunk_counts.items():
82
+ words[tuple(b)] = freq
83
+
84
+ # Global pair stats and a reverse index pair -> set(words containing it)
85
+ pair_counts: dict[tuple[int, int], int] = defaultdict(int)
86
+ pair_to_words: dict[tuple[int, int], set[tuple[int, ...]]] = defaultdict(set)
87
+ for w, freq in words.items():
88
+ local = self._pair_occurrences(w)
89
+ for p, occ in local.items():
90
+ pair_counts[p] += freq * occ
91
+ pair_to_words[p].add(w)
92
+
93
+ # Max-heap for fast "most frequent pair" selection (lazy updates).
94
+ heap: list[tuple[int, tuple[int, int]]] = [(-c, p) for p, c in pair_counts.items()]
95
+ heapq.heapify(heap)
96
+
97
+ merges = {}
98
+ vocab = {idx: bytes([idx]) for idx in range(256)}
99
+
100
+ def bump_pair(p: tuple[int, int], delta: int) -> None:
101
+ if delta == 0:
102
+ return
103
+ new = pair_counts.get(p, 0) + delta
104
+ if new <= 0:
105
+ pair_counts.pop(p, None)
106
+ pair_to_words.pop(p, None)
107
+ return
108
+ pair_counts[p] = new
109
+ heapq.heappush(heap, (-new, p))
110
+
+         for i in tqdm(range(num_merges), desc="Training tokenizer"):
+             start_time = time.time()
+
+             # Pop stale heap entries until the top matches current counts.
+             while heap:
+                 negc, p = heap[0]
+                 c = pair_counts.get(p, 0)
+                 if c > 0 and -negc == c:
+                     break
+                 heapq.heappop(heap)
+             if not heap:
+                 break
+
+             pair = heap[0][1]
+             count = pair_counts.get(pair, 0)
+             if count <= 0:
+                 break
+
+             idx = 256 + i
+             merges[pair] = idx
+             vocab[idx] = vocab[pair[0]] + vocab[pair[1]]
+
+             affected = list(pair_to_words.get(pair, ()))
+             if not affected:
+                 pair_counts.pop(pair, None)
+                 pair_to_words.pop(pair, None)
+                 continue
+
+             # Apply merge to all words that contain the best pair.
+             for w in affected:
+                 freq = words.get(w)
+                 if not freq:
+                     continue
+
+                 new_w = self._merge_word(w, pair, idx)
+                 if new_w == w:
+                     continue
+
+                 # Remove old word contributions
+                 old_local = self._pair_occurrences(w)
+                 for p, occ in old_local.items():
+                     bump_pair(p, -freq * occ)
+                     s = pair_to_words.get(p)
+                     if s is not None:
+                         s.discard(w)
+                         if not s:
+                             pair_to_words.pop(p, None)
+
+                 # Update words dict (merge words that collapse to the same new tuple)
+                 del words[w]
+                 words[new_w] = words.get(new_w, 0) + freq
+
+                 # Add new word contributions
+                 new_local = self._pair_occurrences(new_w)
+                 for p, occ in new_local.items():
+                     bump_pair(p, freq * occ)
+                     pair_to_words[p].add(new_w)
+
+             # This pair should be fully merged away.
+             pair_counts.pop(pair, None)
+             pair_to_words.pop(pair, None)
+
+             if verbose and i % 10 == 0:
+                 time_taken = time.time() - start_time
+                 tqdm.write(
+                     f"merge {i+1}/{num_merges}: {pair} -> {idx} ({vocab[idx]}) "
+                     f"had {count} occurrences (took {time_taken:.2f}s)"
+                 )
+
+         self.merges = merges
+         self.vocab = vocab
+
+     def decode(self, ids) -> str:
+         part_bytes = []
+         for idx in ids:
+             if idx in self.vocab:
+                 part_bytes.append(self.vocab[idx])  # idx can be >= 256 after merging
+             elif idx in getattr(self, "inverse_special_tokens", {}):
+                 part_bytes.append(self.inverse_special_tokens[idx].encode("utf-8"))
+             else:
+                 raise ValueError(f"id={idx} not in vocab or special_tokens")
+         text_bytes = b"".join(part_bytes)
+         return text_bytes.decode(encoding="utf-8", errors="replace")
+
+     def _encode_chunk(self, chunk_bytes: bytes, verbose=False) -> list[int]:
+         tokens = list(chunk_bytes)
+         while len(tokens) >= 2:
+             if verbose:
+                 visualise_tokens([self.vocab[token] for token in tokens])  # token can be >= 256 after merging
+             stats = {}
+             get_stats(tokens, stats)
+             # Greedily pick the pair that was merged earliest during training.
+             pair = min(stats, key=lambda p: self.merges.get(p, float("inf")))
+             if pair not in self.merges:
+                 break
+             idx = self.merges[pair]
+             tokens = merge(tokens, pair, idx)
+         return tokens
+
+     def encode_ordinary(self, text, verbose=False) -> list[int]:
+         chunk_texts = re.findall(self.regex, text)
+         ids_list = []
+         for i, chunk in enumerate(chunk_texts):
+             if verbose:
+                 print()
+                 print(f"encoding chunk {i+1}/{len(chunk_texts)}: {chunk}")
+             chunk_bytes = chunk.encode("utf-8")  # raw bytes
+             ids = self._encode_chunk(chunk_bytes, verbose)
+             ids_list.extend(ids)
+         return ids_list
+
+     def encode(self, text, verbose=False, allowed_special="none") -> list[int]:
+         special = {}
+         if allowed_special == "all":
+             special = self.special_tokens
+         elif allowed_special == "none":
+             special = {}
+         elif allowed_special == "none_raise":
+             special = {}
+             assert all(token not in text for token in self.special_tokens), "Text contains special tokens that are not allowed"
+         elif isinstance(allowed_special, set):
+             special = {k: v for k, v in self.special_tokens.items() if k in allowed_special}
+         else:
+             raise ValueError(f"allowed_special={allowed_special} not understood.")
+         if not special:
+             # No special tokens requested: encode the whole text directly.
+             return self.encode_ordinary(text, verbose)
+         # Split text around special tokens; capturing group keeps the delimiters.
+         special_pattern = "(" + "|".join(re.escape(token) for token in special) + ")"
+         parts = re.split(special_pattern, text)
+         ids = []
+         for part in parts:
+             if part in special:
+                 ids.append(special[part])
+             else:
+                 ids.extend(self.encode_ordinary(part, verbose))
+         return ids
+
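The lazy max-heap pattern used in `train` above (push fresh entries on every count change, discard stale entries at pop time) can be sketched in isolation. This is a hypothetical standalone example, not code from this repo; the names `bump` and `best_pair` are illustrative:

```python
import heapq

# Current pair counts; the heap may additionally hold stale (outdated) entries.
counts = {("a", "b"): 5, ("b", "c"): 3}
heap = [(-c, p) for p, c in counts.items()]
heapq.heapify(heap)

def bump(pair, delta):
    """Update a pair's count and push a fresh heap entry; old entries go stale."""
    new = counts.get(pair, 0) + delta
    if new <= 0:
        counts.pop(pair, None)
        return
    counts[pair] = new
    heapq.heappush(heap, (-new, pair))

def best_pair():
    """Discard stale heap entries until the top agrees with the live counts."""
    while heap:
        negc, pair = heap[0]
        if counts.get(pair, 0) == -negc:
            return pair
        heapq.heappop(heap)
    return None

bump(("b", "c"), 4)   # count is now 7; the stale (-3, ("b", "c")) entry stays behind
assert best_pair() == ("b", "c")
bump(("b", "c"), -7)  # count drops to 0; the pair is removed entirely
assert best_pair() == ("a", "b")
```

The trade-off is the same as in `bump_pair`: updates are O(log n) pushes with no heap deletions, at the cost of stale entries that are skipped lazily when selecting the next merge.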