---
library_name: transformers
license: mit
base_model: sbintuitions/tiny-lm
tags:
- generated_from_trainer
datasets:
- HuggingFaceFW/fineweb
model-index:
- name: output-tiny-lm-fineweb
  results: []
language:
- en
---

# UTF8-LM-tiny

This model is a fine-tuned version of [sbintuitions/tiny-lm](https://huggingface.co/sbintuitions/tiny-lm) on the HuggingFaceFW/fineweb dataset, trained with [this training script](https://github.com/sign/utf8-tokenizer/blob/main/experiments/language-modelling/run_clm.py) from [utf8-tokenizer](https://github.com/sign/utf8-tokenizer/tree/main).

The repository includes the joined model for ease of use, as well as [bit_projection_weights.pt](https://huggingface.co/sign/utf8-lm-tiny/blob/main/bit_projection_weights.pt) for further analysis (a loading sketch is included at the end of this card).

## Usage

```python
import torch
from transformers import AutoModelForCausalLM

from utf8_tokenizer import UTF8Tokenizer

model_id = "sign/utf8-lm-tiny"

tokenizer = UTF8Tokenizer()
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "My name is"
inputs = tokenizer([prompt], return_tensors="pt", padding=True, add_special_tokens=True)
inputs["input_ids"] = inputs["input_ids"].to(torch.long)

# Drop the trailing EOS token appended by the tokenizer, so generation continues the prompt
inputs["input_ids"] = inputs["input_ids"][:, :-1]
inputs["attention_mask"] = inputs["attention_mask"][:, :-1]

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=64,
    )

print(tokenizer.decode(out[0], skip_special_tokens=False))
```

## Training procedure

```shell
python run_clm.py \
    --use_bit_embeddings True \
    --output_dir ./output-tiny-lm-fineweb \
    --dataset_name HuggingFaceFW/fineweb \
    --streaming True \
    --dataloader_num_workers 1 \
    --dataloader_prefetch_factor 4 \
    --dataloader_pin_memory True \
    --dataloader_persistent_workers True \
    --do_train True \
    --save_strategy steps \
    --max_steps 20000 \
    --save_steps 1000 \
    --save_total_limit 2 \
    --logging_steps 100 \
    --logging_strategy steps \
    --model_name_or_path sbintuitions/tiny-lm \
    --per_device_train_batch_size 128 \
    --block_size 256 \
    --optim adamw_torch_fused \
    --learning_rate 3e-4 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.01 \
    --weight_decay 0.1 \
    --adam_beta1 0.9 \
    --adam_beta2 0.95 \
    --max_grad_norm 1.0 \
    --gradient_checkpointing True \
    --bf16 True \
    --seed 42 \
    --report_to wandb \
    --include_num_input_tokens_seen True
```

### Framework versions

- Transformers 4.57.3
- PyTorch 2.9.1+cu130
- Datasets 4.4.1
- Tokenizers 0.22.1
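
## Byte-level inputs (illustration)

As background for the usage snippet above, the model operates on UTF-8 bytes rather than a learned subword vocabulary. The plain-Python sketch below illustrates that mapping; it is an assumption-level illustration, not the `utf8_tokenizer` API, and the real `UTF8Tokenizer` additionally handles padding and special tokens such as the EOS token removed in the usage example.

```python
# Illustration only (not the utf8_tokenizer API): a byte-level tokenizer maps
# each UTF-8 byte of the text to an integer ID in the range 0..255.
text = "My name is"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)  # [77, 121, 32, 110, 97, 109, 101, 32, 105, 115]
```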
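
## Inspecting the bit projection weights

The card mentions [bit_projection_weights.pt](https://huggingface.co/sign/utf8-lm-tiny/blob/main/bit_projection_weights.pt) for further analysis. Below is a minimal sketch of how one might download and inspect the file; it assumes the file is a torch-serialized tensor or state dict (its exact contents are not documented in this card), so the inspection is exploratory rather than a documented API.

```python
import torch
from huggingface_hub import hf_hub_download

# Fetch bit_projection_weights.pt from the model repository.
path = hf_hub_download(repo_id="sign/utf8-lm-tiny", filename="bit_projection_weights.pt")

# Assumption: the file was written with torch.save (a tensor or a state dict).
weights = torch.load(path, map_location="cpu")

if isinstance(weights, dict):
    # State dict: print each entry's name and shape.
    for name, tensor in weights.items():
        print(name, tuple(tensor.shape))
else:
    print(type(weights), getattr(weights, "shape", None))
```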