Update README.md
README.md CHANGED
@@ -9,43 +9,57 @@ datasets:
 model-index:
 - name: output-tiny-lm-fineweb
   results: []
+language:
+- en
 ---
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment. -->
 
-#
+# UTF8-LM-tiny
 
 This model is a fine-tuned version of [sbintuitions/tiny-lm](https://huggingface.co/sbintuitions/tiny-lm) on the HuggingFaceFW/fineweb dataset.
 
-## Model description
-
-More information needed
-
-## Intended uses & limitations
-
-More information needed
-
-## Training and evaluation data
-
-More information needed
+Trained with the [run_clm.py](https://github.com/sign/utf8-tokenizer/blob/main/experiments/language-modelling/run_clm.py) training script from [utf8-tokenizer](https://github.com/sign/utf8-tokenizer/tree/main).
+
+The repository includes the joined model for ease of use, and [bit_projection_weights.pt](https://huggingface.co/sign/utf8-lm-tiny/blob/main/bit_projection_weights.pt) for further analysis.
 
 ## Training procedure
 
-### Training hyperparameters
-
-The following hyperparameters were used during training:
-
-
-
-
-
-
-
-
-
-
-
+```shell
+python run_clm.py \
+  --use_bit_embeddings True \
+  --output_dir ./output-tiny-lm-fineweb \
+  --dataset_name HuggingFaceFW/fineweb \
+  --streaming True \
+  --dataloader_num_workers 1 \
+  --dataloader_prefetch_factor 4 \
+  --dataloader_pin_memory True \
+  --dataloader_persistent_workers True \
+  --do_train True \
+  --save_strategy steps \
+  --max_steps 20000 \
+  --save_steps 1000 \
+  --save_total_limit 2 \
+  --logging_steps 100 \
+  --logging_strategy steps \
+  --model_name_or_path sbintuitions/tiny-lm \
+  --per_device_train_batch_size 128 \
+  --block_size 256 \
+  --optim adamw_torch_fused \
+  --learning_rate 3e-4 \
+  --lr_scheduler_type cosine \
+  --warmup_ratio 0.01 \
+  --weight_decay 0.1 \
+  --adam_beta1 0.9 \
+  --adam_beta2 0.95 \
+  --max_grad_norm 1.0 \
+  --gradient_checkpointing True \
+  --bf16 True \
+  --seed 42 \
+  --report_to wandb \
+  --include_num_input_tokens_seen True
+```
 
 
 ### Framework versions
@@ -53,4 +67,4 @@ The following hyperparameters were used during training:
 - Transformers 4.57.3
 - Pytorch 2.9.1+cu130
 - Datasets 4.4.1
-- Tokenizers 0.22.1
+- Tokenizers 0.22.1
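
A minimal usage sketch of the artifacts this update describes, assuming the joined model loads through the standard `transformers` causal-LM API and that `bit_projection_weights.pt` is a torch-serialized object; the card confirms neither, so treat the calls below as illustrative:

```python
# Minimal usage sketch -- assumptions: the joined model in sign/utf8-lm-tiny
# loads with the stock transformers causal-LM classes, and
# bit_projection_weights.pt is a torch-serialized object. Neither is
# confirmed by the model card itself.
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "sign/utf8-lm-tiny"  # repo id taken from the bit_projection_weights.pt URL

# Load the joined model and its tokenizer from the Hub.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
model.eval()

# Generate a short continuation to sanity-check the checkpoint.
inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

# Download the bit projection weights for further analysis, and inspect
# their structure before assuming anything about it.
weights_path = hf_hub_download(repo_id=repo_id, filename="bit_projection_weights.pt")
bit_projection = torch.load(weights_path, map_location="cpu")
print(type(bit_projection))
```

If the checkpoint depends on custom classes from utf8-tokenizer, installing that package first or passing `trust_remote_code=True` to `from_pretrained` may be required.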