---
license: apache-2.0
datasets:
- mlfoundations/dclm-baseline-1.0-parquet
- bigcode/starcoderdata
- open-web-math/open-web-math
- allenai/dolma
language:
- en
library_name: transformers
---
PhoneLM-1.5B is a 1.5 billion parameter decoder-only language model pre-trained on 1.1 trillion tokens.

## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = 'mllmTeam/PhoneLM-1.5B'
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='cuda', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

inp = tokenizer("Machine Learning is ", return_tensors="pt")
inp = {k: v.to('cuda') for k, v in inp.items()}
out = model.generate(**inp,
                     max_length=256,
                     do_sample=True,
                     temperature=0.7,
                     top_p=0.7
                     )
text = tokenizer.decode(out[0], skip_special_tokens=True)
print(text)
```
## Model Details

* **Developed by**: mllmTeam
* **Model type**: `PhoneLM 1.5B` models are auto-regressive language models based on the transformer decoder architecture.
* **Language(s)**: English
* **Paper**: [PhoneLM Technical Report](https://arxiv.org/abs/2411.05046)
* **Library**: [PhoneLM](https://github.com/UbiquitousLearning/PhoneLM)

### Model Architecture

The model is a decoder-only transformer architecture with the following modifications:

| Hidden Size | Layers | Heads | Sequence Length |
|-------------|--------|-------|-----------------|
| 2560 | 19 | 16 | 2048 |

* **Position Embeddings**: Rotary Position Embeddings ([Su et al., 2021](https://arxiv.org/abs/2104.09864)) applied to the first 25% of head embedding dimensions for improved throughput, following [Black et al. (2022)](https://arxiv.org/pdf/2204.06745.pdf). PhoneLM quantizes the sine and cosine values in the rotary embeddings to 8-bit integers.
* **Normalization**: LayerNorm ([Ba et al., 2016](https://arxiv.org/abs/1607.06450)) with learned bias terms, as opposed to RMSNorm ([Zhang & Sennrich, 2019](https://arxiv.org/abs/1910.07467)).
* **Biases**: We remove all bias terms from the feed-forward networks and multi-head self-attention layers, except for the biases of the query, key, and value projections ([Bai et al., 2023](https://arxiv.org/abs/2309.16609)).
* **ReLU Activation Function**: ReLU ([Glorot et al., 2011](https://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf)) activations are used in the feed-forward networks.
* **Tokenizer**: We use the SmolLM ([Allal et al., 2024](https://huggingface.co/blog/smollm)) tokenizer with a vocabulary size of 49,152.
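
The partial-rotation scheme described above can be sketched in NumPy. This is an illustrative sketch of one common pairing convention (rotating interleaved even/odd dimension pairs), not PhoneLM's actual implementation, and it omits the 8-bit quantization of the sine/cosine tables:

```python
import numpy as np

def partial_rope(x, rotary_frac=0.25, base=10000.0):
    """Rotary position embeddings applied only to the first `rotary_frac`
    of each head's dimensions; the remaining dimensions pass through
    unchanged. x has shape (seq_len, n_heads, head_dim)."""
    seq_len, n_heads, head_dim = x.shape
    rot_dim = int(head_dim * rotary_frac)              # e.g. 40 of 160 dims
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]

    # One frequency per rotated dimension pair.
    inv_freq = 1.0 / (base ** (np.arange(0, rot_dim, 2) / rot_dim))
    angles = np.outer(np.arange(seq_len), inv_freq)    # (seq_len, rot_dim // 2)
    cos = np.cos(angles)[:, None, :]                   # broadcast over heads
    sin = np.sin(angles)[:, None, :]

    # Rotate each (even, odd) dimension pair by the position-dependent angle.
    x1, x2 = x_rot[..., 0::2], x_rot[..., 1::2]
    rotated = np.empty_like(x_rot)
    rotated[..., 0::2] = x1 * cos - x2 * sin
    rotated[..., 1::2] = x1 * sin + x2 * cos
    return np.concatenate([rotated, x_pass], axis=-1)

# Dimensions implied by the table above: 2560 hidden / 16 heads = 160 per head.
q = np.random.randn(8, 16, 160)
q_rope = partial_rope(q)
print(q_rope.shape)  # (8, 16, 160)
```

Because only a quarter of each head's dimensions participate in the rotation, the per-token cost of applying (and caching) the sin/cos tables shrinks accordingly, which is the throughput motivation cited above.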
## Training Dataset

The training dataset used for PhoneLM comprises a filtered mixture of open-source large-scale datasets available on the [HuggingFace Hub](https://huggingface.co/datasets): DCLM-baseline ([Li et al., 2024](https://arxiv.org/abs/2406.11794)), StarCoder ([Li et al., 2023](https://arxiv.org/abs/2305.06161)), OpenWebMath ([Paster et al., 2023](https://arxiv.org/abs/2310.06786)), and Dolma ([Soldaini et al., 2024](https://aclanthology.org/2024.acl-long.840/)).

## Evaluation Results
| | Model | HellaSwag | WinoGrande | PIQA | SciQ | BoolQ | ARC Easy | ARC Challenge | Average | |
| |-----------|-----------|------------|------|------|-------|----------|---------------|---------| |
| | **PhoneLM-1.5B** | **66.9** | **63.0** | **77.3** | **88.8** | **65.5** | **69.7** | **39.9** | **67.31** | |
| | Pythia-1.4B | 52.0 | 57.2 | 71.1 | 79.2 | 63.2 | 53.9 | 28.3 | 57.84 | |
| | OPT-1.3B | 53.7 | 59.0 | 71.0 | 78.1 | 57.2 | 51.3 | 28.0 | 56.90 | |
| | BLOOM-1.1B | 43.0 | 54.9 | 67.2 | 74.6 | 59.1 | 45.4 | 25.6 | 52.83 | |
| | TinyLlama-1.1B | 59.1 | 58.9 | 73.0 | 82.3 | 58.6 | 55.7 | 31.0 | 59.80 | |
| | MobileLLaMA-1.4B | 56.1 | 59.4 | 73.0 | 81.9 | 56.7 | 55.8 | 30.3 | 59.03 | |
| | MobiLlama-1B | 62.2 | 59.3 | 74.8 | 82.8 | 60.3 | 56.4 | 31.7 | 61.07 | |
| | OpenELM-1.1B | 64.8 | 61.7 | 75.6 | 83.6 | 63.6 | 55.4 | 32.3 | 62.43 | |
| | DCLM-1.4B | 53.6 | 66.3 | 77.0 | 94.0 | 71.4 | 74.8 | 41.2 | 68.33 | |
| | SmolLM-1.7B | 49.6 | 60.9 | 75.8 | 93.2 | 66.0 | 76.4 | 43.5 | 66.49 | |
| | Qwen 1.5-1.8B | 60.9 | 60.5 | 74.2 | 89.4 | 66.5 | 59.1 | 34.7 | 63.61 | |
| | Galactica-1.3B | 41.0 | 54.4 | 63.8 | 87.7 | 62.0 | 58.6 | 30.5 | 56.86 | |
| | StableLM 2-1.6B | 68.8 | 64.1 | 75.1 | 76.9 | 80.0 | 60.3 | 39.2 | 66.34 | |
| | Cerebras-GPT-1.3B | 38.4 | 51.9 | 66.8 | 73.0 | 59.3 | 45.8 | 25.3 | 51.50 | |
| | MiniCPM-1B | 67.5 | 63.7 | 75.1 | 91.0 | 70.5 | 62.9 | 38.1 | 66.97 | |
| | MiniCPM-2B | 67.2 | 63.9 | 76.1 | 92.5 | 74.6 | 69.0 | 42.7 | 69.43 | |
| | Gemma-2B | 71.4 | 65.2 | 78.4 | 91.4 | 69.9 | 72.3 | 42.0 | 70.09 | |
| | Gemma 2-2B | 55.0 | 68.7 | 78.7 | 96.0 | 73.6 | 80.3 | 46.9 | 71.31 | |

## License
* This repository is released under the [Apache-2.0](https://huggingface.co/mllmTeam/PhoneLM-1.5B/blob/main/LICENSE) License.

## Citation
```
@misc{yi2024phonelmanefficientcapablesmall,
      title={PhoneLM: An Efficient and Capable Small Language Model Family through Principled Pre-training},
      author={Rongjie Yi and Xiang Li and Weikai Xie and Zhenyan Lu and Chenghua Wang and Ao Zhou and Shangguang Wang and Xiwen Zhang and Mengwei Xu},
      year={2024},
      eprint={2411.05046},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.05046},
}
```