---
library_name: transformers
license: mit
base_model: sbintuitions/tiny-lm
tags:
- generated_from_trainer
datasets:
- HuggingFaceFW/fineweb
model-index:
- name: output-tiny-lm-fineweb
  results: []
language:
- en
---

# UTF8-LM-tiny

This model is a fine-tuned version of [sbintuitions/tiny-lm](https://huggingface.co/sbintuitions/tiny-lm) on the HuggingFaceFW/fineweb dataset.

It was trained with the [`run_clm.py`](https://github.com/sign/utf8-tokenizer/blob/main/experiments/language-modelling/run_clm.py) training script from [utf8-tokenizer](https://github.com/sign/utf8-tokenizer/tree/main).

The repository includes the joined model for ease of use, as well as [bit_projection_weights.pt](https://huggingface.co/sign/utf8-lm-tiny/blob/main/bit_projection_weights.pt) for further analysis.

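To illustrate what a "joined" model could mean here, the sketch below assumes that the bit projection is a linear map from the 8 bits of each byte to the hidden size, and that joining adds the projected bits to the ordinary embedding table row by row. This is an assumption about the approach, not code taken from the utf8-tokenizer repository; `byte_to_bits`, `join_embeddings`, and the toy weights are hypothetical names and values.

```python
# Illustrative sketch only; the real layer lives in the utf8-tokenizer repository.
# Assumption: embedding(byte) = table[byte] + bits(byte) @ projection.

def byte_to_bits(value: int) -> list[int]:
    """Return the 8 bits of a byte, most significant bit first."""
    return [(value >> shift) & 1 for shift in range(7, -1, -1)]


def join_embeddings(table, projection):
    """Fold the bit projection into the embedding table.

    table:      256 rows of `hidden` floats (one row per byte value)
    projection: 8 rows of `hidden` floats (one row per bit position)
    """
    joined = []
    for byte_value, row in enumerate(table):
        bits = byte_to_bits(byte_value)
        joined.append([
            base + sum(bit * projection[i][d] for i, bit in enumerate(bits))
            for d, base in enumerate(row)
        ])
    return joined


hidden = 2  # toy hidden size
table = [[0.0] * hidden for _ in range(256)]      # toy embedding table
projection = [[float(i), 1.0] for i in range(8)]  # toy bit projection

joined = join_embeddings(table, projection)
print(byte_to_bits(ord("A")))  # [0, 1, 0, 0, 0, 0, 0, 1]
print(joined[ord("A")])        # [8.0, 2.0]
```

Under this assumption the joined table is an ordinary 256-row embedding matrix, so inference needs no extra layer, while the separate projection weights remain available for inspecting how individual bits contribute.
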
## Usage

```python
import torch
from transformers import AutoModelForCausalLM

from utf8_tokenizer import UTF8Tokenizer

model_id = "sign/utf8-lm-tiny"

tokenizer = UTF8Tokenizer()
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "My name is"

inputs = tokenizer(
    [prompt],
    return_tensors="pt",
    padding=True,
    add_special_tokens=True,
)
# Cast the ids to torch.long for the embedding lookup
inputs["input_ids"] = inputs["input_ids"].to(torch.long)
# Drop the trailing EOS token so that generation continues the prompt
inputs["input_ids"] = inputs["input_ids"][:, :-1]
inputs["attention_mask"] = inputs["attention_mask"][:, :-1]

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=64,
    )

print(tokenizer.decode(out[0], skip_special_tokens=False))
```

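The EOS removal in the usage snippet is easier to follow with the byte sequence written out. The sketch below does not use `UTF8Tokenizer`; it assumes a byte-level scheme in which each token id is a UTF-8 byte value and the special tokens are the ASCII control characters STX (`0x02`) for BOS and ETX (`0x03`) for EOS. Those ids and the helper name `encode_with_specials` are assumptions made for illustration, so check the tokenizer itself for the actual values.

```python
# Assumed special-token ids (ASCII control characters), for illustration only.
BOS_ID = 0x02  # STX, start of text
EOS_ID = 0x03  # ETX, end of text


def encode_with_specials(text: str) -> list[int]:
    """Encode text as UTF-8 byte values wrapped in BOS ... EOS."""
    return [BOS_ID, *text.encode("utf-8"), EOS_ID]


ids = encode_with_specials("My name is")
prompt_ids = ids[:-1]  # drop the trailing EOS before generation

print(ids)
print(prompt_ids)
print(bytes(prompt_ids[1:]).decode("utf-8"))  # skip BOS when decoding
```

If the EOS stayed in place, the prompt would end on an end-of-text marker and the model would be asked to continue after the text had already ended. Slicing off the last position leaves a prompt that ends on the final byte of `is`.
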
## Training procedure

```shell
python run_clm.py \
  --use_bit_embeddings True \
  --output_dir ./output-tiny-lm-fineweb \
  --dataset_name HuggingFaceFW/fineweb \
  --streaming True \
  --dataloader_num_workers 1 \
  --dataloader_prefetch_factor 4 \
  --dataloader_pin_memory True \
  --dataloader_persistent_workers True \
  --do_train True \
  --save_strategy steps \
  --max_steps 20000 \
  --save_steps 1000 \
  --save_total_limit 2 \
  --logging_steps 100 \
  --logging_strategy steps \
  --model_name_or_path sbintuitions/tiny-lm \
  --per_device_train_batch_size 128 \
  --block_size 256 \
  --optim adamw_torch_fused \
  --learning_rate 3e-4 \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.01 \
  --weight_decay 0.1 \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --max_grad_norm 1.0 \
  --gradient_checkpointing True \
  --bf16 True \
  --seed 42 \
  --report_to wandb \
  --include_num_input_tokens_seen True
```

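As a rough check on the scale of this run, the flags in the training command can be turned into a token budget and a learning-rate curve. The sketch assumes a single device and no gradient accumulation (neither is specified in the command), and the `lr_at` helper is a hypothetical re-implementation of linear warmup followed by cosine decay to zero, not a call into Transformers. With a byte-level tokenizer, one token corresponds to roughly one UTF-8 byte.

```python
import math

# Values copied from the training command.
max_steps = 20_000
batch_size = 128     # --per_device_train_batch_size
block_size = 256     # --block_size
peak_lr = 3e-4       # --learning_rate
warmup_ratio = 0.01  # --warmup_ratio

# Assumption: one device, gradient accumulation of 1.
tokens_per_step = batch_size * block_size
total_tokens = tokens_per_step * max_steps
warmup_steps = math.ceil(max_steps * warmup_ratio)


def lr_at(step: int) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero at max_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print(tokens_per_step)  # 32768
print(total_tokens)     # 655360000
print(warmup_steps)     # 200
for step in (0, 100, 200, 10_100, 20_000):
    print(step, lr_at(step))
```

Under these assumptions the run sees about 655 million tokens, warms up over the first 200 steps, and reaches half of the peak learning rate midway through the decay phase. Multiply the token budget by the device count and accumulation steps if the actual run used more than one of either.
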
### Framework versions

- Transformers 4.57.3
- PyTorch 2.9.1+cu130
- Datasets 4.4.1
- Tokenizers 0.22.1