---
library_name: transformers
license: mit
datasets:
- HuggingFaceFW/fineweb
language:
- en
base_model:
- sign/utf8-lm-tiny
---

# UTF32-LM-tiny

This model is a fine-tuned version of [sign/utf8-lm-tiny](https://huggingface.co/sign/utf8-lm-tiny) on the HuggingFaceFW/fineweb dataset.

It was trained with [this training script](https://github.com/sign/utf8-tokenizer/blob/main/experiments/language-modelling/run_clm.py) from the [utf8-tokenizer](https://github.com/sign/utf8-tokenizer/tree/main) repository.

Unlike the base model, which is trained directly on UTF-8 bytes, this model is trained on characters (UTF-32 blocks): each character is decomposed into a fixed four bytes, which are encoded independently and then concatenated.

| Character | UTF-8              | UTF-32       | UTF-32 Decomposed (bytes) |
| --------- | ------------------ | ------------ | ------------------------- |
| A         | `\x41`             | `U+00000041` | `[0, 0, 0, 65]`           |
| é         | `\xC3\xA9`         | `U+000000E9` | `[0, 0, 0, 233]`          |
| €         | `\xE2\x82\xAC`     | `U+000020AC` | `[0, 0, 32, 172]`         |
| 😀        | `\xF0\x9F\x98\x80` | `U+0001F600` | `[0, 1, 246, 0]`          |

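To make the decomposition concrete, here is a minimal Python sketch of the table above; `utf32_bytes` is a hypothetical helper for illustration, not part of the utf8-tokenizer API:

```python
# Minimal sketch of the UTF-32 decomposition shown in the table.
# `utf32_bytes` is a hypothetical helper, not part of the utf8-tokenizer API.
def utf32_bytes(char: str) -> list[int]:
    # The character's code point, serialized as four big-endian bytes
    return list(ord(char).to_bytes(4, "big"))

for char in "Aé€😀":
    print(char, utf32_bytes(char))
# A [0, 0, 0, 65]
# é [0, 0, 0, 233]
# € [0, 0, 32, 172]
# 😀 [0, 1, 246, 0]
```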

This effectively switches from variable-width canonical UTF-8 byte sequences to fixed-size groups, shortening sequences by up to 4x for complex scripts and making training and inference correspondingly more efficient.
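
As a rough illustration of where the saving comes from (the example strings here are illustrative, not from the model card): a Devanagari character costs three UTF-8 bytes, so a byte-level model spends three sequence positions on it, while the UTF-32 grouping spends one; a four-byte emoji gives the full 4x.

```python
# Rough illustration of the sequence-length saving for a complex script.
text = "नमस्ते"  # 6 characters, each 3 bytes in UTF-8

print(len(text.encode("utf-8")))   # 18 -> positions for a UTF-8 byte model
print(len(text))                   # 6  -> positions for UTF-32 groups (1 per character)
print(len("😀".encode("utf-8")))   # 4  -> the full 4x case: 4 UTF-8 bytes vs. 1 group
```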

## Usage

```python
import torch
from transformers import AutoModelForCausalLM
from utf8_tokenizer import UTF8Tokenizer

model_id = "sign/utf32-lm-tiny"

tokenizer = UTF8Tokenizer()
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "My name is"

inputs = tokenizer([prompt], return_tensors="pt",
                   padding=True,
                   add_special_tokens=True)
# Drop the trailing EOS token so the model continues the prompt
inputs["input_ids"] = inputs["input_ids"][:, :-1]
inputs["attention_mask"] = inputs["attention_mask"][:, :-1]

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=256,
    )

print(tokenizer.decode(out[0], skip_special_tokens=False))
```
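
The utf8-tokenizer package also provides a `UTF8ValidationLogitsProcessor`, which by its name constrains sampling to valid byte sequences. A minimal sketch of plugging it into `generate`, assuming a no-argument constructor (the actual signature may differ):

```python
# Optional: constrain generation with the library's validation logits processor.
# Assumption: UTF8ValidationLogitsProcessor takes no constructor arguments;
# check utf8-tokenizer for the actual signature.
from transformers import LogitsProcessorList
from utf8_tokenizer.logits_processor import UTF8ValidationLogitsProcessor

out = model.generate(
    **inputs,
    max_new_tokens=256,
    logits_processor=LogitsProcessorList([UTF8ValidationLogitsProcessor()]),
)
```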

## Training procedure

```shell
python run_clm.py \
  --use_bit_embeddings True \
  --encoding utf32 \
  --output_dir ./output-tiny-lm-fineweb-groups \
  --dataset_name HuggingFaceFW/fineweb \
  --streaming True \
  --dataloader_num_workers 1 \
  --dataloader_prefetch_factor 4 \
  --dataloader_pin_memory True \
  --dataloader_persistent_workers True \
  --do_train True \
  --save_strategy steps \
  --max_steps 100000 \
  --save_steps 1000 \
  --save_total_limit 1 \
  --logging_steps 100 \
  --logging_strategy steps \
  --model_name_or_path sbintuitions/tiny-lm \
  --per_device_train_batch_size 256 \
  --block_size 256 \
  --optim adamw_torch_fused \
  --learning_rate 3e-4 \
  --lr_scheduler_type cosine \
  --warmup_ratio 0.01 \
  --weight_decay 0.1 \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --max_grad_norm 1.0 \
  --gradient_checkpointing True \
  --bf16 True \
  --seed 42 \
  --report_to wandb \
  --include_num_input_tokens_seen True
```

### Framework versions

- Transformers 4.57.3
- Pytorch 2.9.1+cu130
- Datasets 4.4.1
- Tokenizers 0.22.1