---
library_name: transformers
license: mit
datasets:
- HuggingFaceFW/fineweb
language:
- en
base_model:
- sign/utf8-lm-tiny
---

# UTF16-LM-tiny

This model is a fine-tuned version of [sign/utf8-lm-tiny](https://huggingface.co/sign/utf8-lm-tiny) on the HuggingFaceFW/fineweb dataset.
It was trained with [this training script](https://github.com/sign/utf8-tokenizer/blob/main/experiments/language-modelling/run_clm.py) from [utf8-tokenizer](https://github.com/sign/utf8-tokenizer/tree/main).

Unlike the base model, which is trained directly on UTF-8 bytes, this model is trained on UTF-16 code units.
Each Unicode character is represented as **one or two UTF-16 code units** (a surrogate pair for non-BMP characters).
Each code unit is decomposed into two bytes, which are encoded independently and then concatenated.

| Character | UTF-8              | UTF-16 code units | UTF-16 Decomposed (bytes) |
| --------- | ------------------ | ----------------- | ------------------------- |
| A         | `\x41`             | `U+0041`          | `[0, 65]`                 |
| é         | `\xC3\xA9`         | `U+00E9`          | `[0, 233]`                |
| €         | `\xE2\x82\xAC`     | `U+20AC`          | `[32, 172]`               |
| 😀        | `\xF0\x9F\x98\x80` | `U+D83D U+DE00`   | `[216, 61], [222, 0]`     |

This replaces UTF-8's **1–4 byte** variable-length encoding with a **1–2 code-unit** representation, so every character takes either 2 or 4 bytes.
While still variable-width, UTF-16 substantially reduces sequence-length variance and upper-bounds expansion for complex scripts and emoji-heavy text compared to UTF-8.

By contrast, the [utf32-lm-tiny](https://huggingface.co/sign/utf32-lm-tiny) model uses one fixed-width 32-bit code unit (four bytes) per character, yielding the simplest mapping from Unicode scalars to token sequences at the cost of longer overall sequences.
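
The byte decomposition in the table above can be reproduced with Python's standard codecs. This is a minimal sketch, not the tokenizer's implementation: it assumes big-endian byte order (high byte first, as the table suggests) and ignores any special tokens or ID offsets the real tokenizer may add. The helper name `utf16_byte_pairs` is hypothetical.

```python
def utf16_byte_pairs(text: str) -> list[list[int]]:
    """Split text into UTF-16 code units, each returned as a [high, low] byte pair.

    Assumption: big-endian order without a BOM, matching the table above.
    """
    data = text.encode("utf-16-be")
    return [[data[i], data[i + 1]] for i in range(0, len(data), 2)]


for ch in ["A", "é", "€", "😀"]:
    utf8_len = len(ch.encode("utf-8"))   # 1 to 4 bytes per character
    pairs = utf16_byte_pairs(ch)         # 1 or 2 code units per character
    utf16_len = 2 * len(pairs)           # 2 or 4 bytes per character
    utf32_len = 4 * len(ch)              # always 4 bytes per character
    print(ch, pairs, utf8_len, utf16_len, utf32_len)
```

Non-BMP characters such as the emoji produce two pairs (a surrogate pair), while every BMP character produces exactly one.
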

## Usage

```python
from transformers import AutoModelForCausalLM, LogitsProcessorList
import torch

from utf8_tokenizer.logits_processor import UTF8ValidationLogitsProcessor
from utf8_tokenizer.char_causal_lm import CharacterCausalLMWrapper
from utf8_tokenizer import UTF8Tokenizer

model_id = "sign/utf16-lm-tiny"

tokenizer = UTF8Tokenizer()
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "My name is"
inputs = tokenizer([prompt], return_tensors="pt", padding=True, add_special_tokens=True)

# We need to remove the EOS token
inputs["input_ids"] = inputs["input_ids"][:, :-1]
inputs["attention_mask"] = inputs["attention_mask"][:, :-1]

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=256,
    )

print(tokenizer.decode(out[0], skip_special_tokens=False))
```

## Training procedure

```shell
python run_clm.py \
  --use_bit_embeddings False \
  --encoding utf16 \
  --output_dir ./output-tiny-lm-fineweb-groups \
  --dataset_name HuggingFaceFW/fineweb \
  --streaming True \
  --dataloader_num_workers 1 \
  --dataloader_prefetch_factor 4 \
  --dataloader_pin_memory True \
  --dataloader_persistent_workers True \
  --do_train True \
  --save_strategy steps \
  --max_steps 100000 \
  --save_steps 1000 \
  --save_total_limit 1 \
  --logging_steps 100 \
  --logging_strategy steps \
  --model_name_or_path sbintuitions/tiny-lm \
  --per_device_train_batch_size 256 \
  --block_size 256 \
  --optim adamw_torch_fused \
  --learning_rate 3e-4 \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.01 \
  --weight_decay 0.1 \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --max_grad_norm 1.0 \
  --gradient_checkpointing True \
  --bf16 True \
  --seed 42 \
  --report_to wandb \
  --include_num_input_tokens_seen True
```

### Framework versions

- Transformers 4.57.3
- Pytorch 2.9.1+cu130
- Datasets 4.4.1
- Tokenizers 0.22.1