---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen2.5-7B
datasets:
- allenai/tulu-3-sft-mixture
---

[QuantFactory](https://hf.co/QuantFactory)

# QuantFactory/Teleut-7b-GGUF

This is a quantized version of [allura-org/Teleut-7b](https://huggingface.co/allura-org/Teleut-7b) created using llama.cpp.

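To try the quants directly, a minimal sketch using the llama-cpp-python bindings follows; the quant filename here is an assumption, so check the repo's file list for the exact name you want.

```python
# Minimal sketch, assuming llama-cpp-python and a Q4_K_M quant
# (the exact filename is an assumption -- check the repo's file list).
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="QuantFactory/Teleut-7b-GGUF",
    filename="*Q4_K_M.gguf",  # glob pattern for the assumed quant level
    n_ctx=8192,               # matches the training sequence length below
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response["choices"][0]["message"]["content"])
```
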
# Original Model Card

# Teleut 7b

A replication attempt of Tülu 3 on the Qwen 2.5 base models.

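For the unquantized weights, here is a minimal inference sketch with transformers; it assumes the repo's tokenizer ships the ChatML chat template the model was trained with (see the axolotl config below).

```python
# Minimal sketch for the unquantized model with transformers.
# Assumes the repo tokenizer carries the ChatML template used in training.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allura-org/Teleut-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Hello!"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
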
## Evals (so far)

|                        | Teleut 7B (measured) | Tülu 3 SFT 8B (reported) | Qwen 2.5 7B Instruct (reported) | Ministral 8B (reported) | Mistral 7B v0.3 (reported) |
|------------------------|----------------------|--------------------------|---------------------------------|-------------------------|----------------------------|
| BBH (3 shot, CoT)      | *64.4%*              | **67.9%**                | 21.7%                           | 56.2%                   | 47.0%<sup>NLL</sup>        |
| GSM8K (8 shot, CoT)    | 78.5%                | 76.2%                    | **83.8%**                       | *80.0%*                 | xx.x%                      |
| IFEval (prompt loose)  | 66.3%                | *72.8%*                  | **74.7%**                       | 56.4%                   | 53.0%                      |
| MMLU (0 shot, CoT)     | *73.2%*              | 65.9%                    | **76.6%**                       | 68.5%                   | 30.7%<sup>5-shot</sup>     |
| MMLU Pro (0 shot, CoT) | *48.3%*              | 44.3%                    | **56.3%**<sup>Unknown</sup>     | 32.9%<sup>5-shot</sup>  | 30.7%<sup>5-shot</sup>     |
| PopQA (15 shot)        | 18.9%                | **29.3%**                | 18.1%                           | *20.2%*                 | xx.x%                      |
| TruthfulQA             | 47.2%                | 46.8%                    | **63.1%**                       | *55.5%*                 | xx.x%                      |

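The card does not say which harness produced the measured column; as one hedged way to spot-check a row, here is a sketch using EleutherAI's lm-evaluation-harness (the harness choice, task name, and settings are assumptions, not the authors' stated method).

```python
# Hedged sketch: spot-check the GSM8K row with lm-evaluation-harness.
# The harness choice and task name are assumptions; the card does not
# state how the "measured" numbers were produced.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=allura-org/Teleut-7b,dtype=bfloat16",
    tasks=["gsm8k_cot"],  # CoT GSM8K task name in recent harness versions
    batch_size=8,
)
print(results["results"]["gsm8k_cot"])
```
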
## Credits
Big thanks to Retis Labs for providing the 8xH100 polycule used to train and test this model!
Another big thanks to AllenAI for publishing the Tülu 3 data and model series (as well as the paper and details on training), and to Alibaba for training the original Qwen 2.5 base model series!

```
@article{lambert2024tulu3,
  title = {Tülu 3: Pushing Frontiers in Open Language Model Post-Training},
  author = {
    Nathan Lambert and
    Jacob Morrison and
    Valentina Pyatkin and
    Shengyi Huang and
    Hamish Ivison and
    Faeze Brahman and
    Lester James V. Miranda and
    Alisa Liu and
    Nouha Dziri and
    Shane Lyu and
    Yuling Gu and
    Saumya Malik and
    Victoria Graf and
    Jena D. Hwang and
    Jiangjiang Yang and
    Ronan Le Bras and
    Oyvind Tafjord and
    Chris Wilhelm and
    Luca Soldaini and
    Noah A. Smith and
    Yizhong Wang and
    Pradeep Dasigi and
    Hannaneh Hajishirzi
  },
  year = {2024},
  email = {tulu@allenai.org}
}
```

## Training procedure

[<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)

### Training hyperparameters

The following hyperparameters were used during training (the batch totals are derived in the sketch after the list):
- learning_rate: 3.5e-06
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 2
- total_train_batch_size: 128
- total_eval_batch_size: 64
- optimizer: paged_ademamix_8bit (no additional optimizer arguments)
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 370
- num_epochs: 1

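The batch totals follow directly from the per-device settings, as this quick check shows:

```python
# Arithmetic behind the batch totals reported above.
micro_batch_size = 8             # per-device batch size
num_devices = 8
gradient_accumulation_steps = 2  # train only; eval does no accumulation

total_train_batch_size = micro_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = micro_batch_size * num_devices

assert total_train_batch_size == 128
assert total_eval_batch_size == 64
```
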
### Framework versions

- Transformers 4.46.3
- Pytorch 2.5.1+cu124
- Datasets 3.1.0
- Tokenizers 0.20.3

### Configuration

<details><summary>See axolotl config</summary>

axolotl version: `0.5.2`
```yaml
base_model: Qwen/Qwen2.5-7B

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: true

strict: false

chat_template: chatml
datasets:
  - path: allenai/tulu-3-sft-mixture
    type: chat_template
    split: train
    field_messages: messages

dataset_prepared_path: last_run_prepared
#val_set_size: 0.02
output_dir: ./ckpts

sequence_len: 8192
#sample_packing: true
pad_to_sequence_len: true

wandb_project: qwen-2.5-7b-sft
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 2
micro_batch_size: 8
num_epochs: 1
optimizer: paged_ademamix_8bit
lr_scheduler: cosine
learning_rate: 3.5e-6

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true

deepspeed: deepspeed_configs/zero3_bf16.json

warmup_steps: 370
#evals_per_epoch: 4
eval_table_size:
saves_per_epoch: 2
debug:
weight_decay: 0.0
```

</details><br>