---
license: mit
datasets:
- p208p2002/wudao
language:
- zh
---
# Chinese TinyLlama

A demo project that pretrains a TinyLlama on Chinese corpora with minimal modification to the Hugging Face transformers code. It serves as a use case demonstrating how to use the Hugging Face version of [TinyLlama](https://github.com/whyNLP/tinyllama) to pretrain a model on a large corpus.

See the [GitHub repo](https://github.com/whyNLP/tinyllama-zh) for more details.

## Usage

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code is required because the model ships the ChatGLM3 tokenizer
tokenizer = AutoTokenizer.from_pretrained("whynlp/tinyllama-zh", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("whynlp/tinyllama-zh")
```
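
The snippet above only loads the weights. Continuing from it, here is a minimal generation sketch; the prompt and decoding settings are illustrative choices, not recommendations from the original project.

```python
# Quick generation check (continues from the loading snippet above)
import torch

prompt = "北京是中国的"  # illustrative Chinese prompt
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```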

## Model Details

### Model Description

This model is trained on [WuDaoCorpora Text](https://www.scidb.cn/en/detail?dataSetId=c6a3fe684227415a9db8e21bac4a15ab). The dataset contains about 45B tokens and the model is trained for 2 epochs. The training takes about 6 days on 8 A100 GPUs.
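
As a rough back-of-envelope estimate (not a measured or reported figure), these numbers imply the following training throughput:

```python
# Rough throughput implied by ~45B tokens x 2 epochs in ~6 days on 8 A100s
# (ignores evaluation, checkpointing, and other overhead)
tokens = 45e9 * 2
seconds = 6 * 24 * 3600
gpus = 8
print(f"~{tokens / seconds / gpus:,.0f} tokens/s per GPU")  # about 22,000 tokens/s per A100
```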

The model uses the `THUDM/chatglm3-6b` tokenizer from Hugging Face.
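
Since the tokenizer is borrowed from ChatGLM3 rather than the original Llama tokenizer, it can be inspected directly; the sketch below simply prints the vocabulary size and a sample segmentation (exact outputs are not reproduced here).

```python
# Inspect the ChatGLM3 tokenizer used by this model
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("whynlp/tinyllama-zh", trust_remote_code=True)
print(len(tokenizer))                       # vocabulary size
print(tokenizer.tokenize("今天天气不错。"))  # sample Chinese segmentation
```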

- **Model type:** Llama
- **Language(s) (NLP):** Chinese
- **License:** MIT
- **Finetuned from model:** TinyLlama-2.5T checkpoint

## Uses

The model does not perform particularly well: its CMMLU score is only slightly above 25, which is close to random chance on a four-choice benchmark. For better performance, one may use a better corpus (e.g. [WanJuan](https://opendatalab.org.cn/OpenDataLab/WanJuan1_dot_0)). Again, this project only serves as a demonstration of how to pretrain a TinyLlama on a large corpus.
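
For reference, a CMMLU-style evaluation can be run with EleutherAI's lm-evaluation-harness. The sketch below is an assumed invocation (API details and the task name may vary across harness versions), not the exact setup used to obtain the score above.

```python
# Hypothetical CMMLU evaluation with lm-evaluation-harness (pip install lm-eval);
# API details may differ between harness versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=whynlp/tinyllama-zh,trust_remote_code=True",
    tasks=["cmmlu"],
)
print(results["results"])
```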