# gpt-2

This is a C++ example running GPT-2 inference using the [ggml](https://github.com/ggerganov/ggml) library.

The program runs on the CPU - no video card is required.

The [Cerebras-GPT](https://huggingface.co/cerebras) models are also supported.

The example supports the following GPT-2 models:

| Model | Description | Disk Size |
| --- | --- | --- |
| 117M | Small model | 240 MB |
| 345M | Medium model | 680 MB |
| 774M | Large model | 1.5 GB |
| 1558M | XL model | 3.0 GB |
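
As a rough sanity check on the disk sizes above, the Python sketch below multiplies each nominal parameter count by 2 bytes. It assumes that most weights are stored as 16-bit floats and that the model names are approximate parameter counts, so it only reproduces the table to within a few percent. `approx_f16_size_mb` is a hypothetical helper, not part of the example.

```python
def approx_f16_size_mb(n_params_millions):
    """Rough file size in MB (10**6 bytes), assuming 2 bytes per parameter."""
    bytes_per_param = 2  # assumption: weights are stored as 16-bit floats
    return n_params_millions * bytes_per_param


# Nominal sizes taken from the model names in the table above.
for n in (117, 345, 774, 1558):
    print(f"{n}M -> ~{approx_f16_size_mb(n)} MB")
```
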

Sample performance on MacBook M1 Pro:

| Model | Size | Time / Token |
| --- | --- | --- |
| GPT-2 | 117M | 5 ms |
| GPT-2 | 345M | 12 ms |
| GPT-2 | 774M | 23 ms |
| GPT-2 | 1558M | 42 ms |

*TODO: add tables for Cerebras-GPT models*

Sample output:

```
$ ./bin/gpt-2 -h
usage: ./bin/gpt-2 [options]

options:
  -h, --help            show this help message and exit
  -s SEED, --seed SEED  RNG seed (default: -1)
  -t N, --threads N     number of threads to use during computation (default: 8)
  -p PROMPT, --prompt PROMPT
                        prompt to start generation with (default: random)
  -n N, --n_predict N   number of tokens to predict (default: 200)
  --top_k N             top-k sampling (default: 40)
  --top_p N             top-p sampling (default: 0.9)
  --temp N              temperature (default: 1.0)
  -b N, --batch_size N  batch size for prompt processing (default: 8)
  -m FNAME, --model FNAME
                        model path (default: models/gpt-2-117M/ggml-model.bin)

$ ./bin/gpt-2
gpt2_model_load: loading model from 'models/gpt-2-117M/ggml-model.bin'
gpt2_model_load: n_vocab = 50257
gpt2_model_load: n_ctx = 1024
gpt2_model_load: n_embd = 768
gpt2_model_load: n_head = 12
gpt2_model_load: n_layer = 12
gpt2_model_load: f16 = 1
gpt2_model_load: ggml ctx size = 311.12 MB
gpt2_model_load: memory size = 72.00 MB, n_mem = 12288
gpt2_model_load: model size = 239.08 MB
main: number of tokens in prompt = 1

So this is going to be the end of the line for us.

If the Dolphins continue to do their business, it's possible that the team could make a bid to bring in new defensive coordinator Scott Linehan.

Linehan's job is a little daunting, but he's a great coach and an excellent coach. I don't believe we're going to make the playoffs.

We're going to have to work hard to keep our heads down and get ready to go.<|endoftext|>

main: mem per token = 2048612 bytes
main: load time = 106.32 ms
main: sample time = 7.10 ms
main: predict time = 506.40 ms / 5.06 ms per token
main: total time = 629.84 ms
```
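
The `--top_k`, `--top_p`, and `--temp` options in the help text control how the next token is drawn from the model's output scores. The Python sketch below shows one common way these three settings combine: temperature scaling, then top-k truncation, then a top-p (nucleus) cutoff. It is an illustration under that assumption, not the example's C++ code, and `filter_candidates` and `sample_token` are hypothetical names.

```python
import math
import random


def filter_candidates(logits, top_k=40, top_p=0.9, temp=1.0):
    """Return (token_id, probability) pairs that survive top-k and top-p filtering.

    Assumes temp > 0 and a non-empty list of logits.
    """
    # Temperature: values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = [(i, logit / temp) for i, logit in enumerate(logits)]

    # Top-k: keep only the k highest-scoring tokens.
    scaled.sort(key=lambda pair: pair[1], reverse=True)
    scaled = scaled[:top_k]

    # Softmax over the survivors (subtract the max for numerical stability).
    best = scaled[0][1]
    weights = [math.exp(score - best) for _, score in scaled]
    total = sum(weights)
    probs = [(i, w / total) for (i, _), w in zip(scaled, weights)]

    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, p in probs:
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return [(token_id, p / norm) for token_id, p in kept]


def sample_token(logits, rng, **options):
    """Draw one token id from the filtered distribution."""
    candidates = filter_candidates(logits, **options)
    ids = [token_id for token_id, _ in candidates]
    probs = [p for _, p in candidates]
    return rng.choices(ids, weights=probs, k=1)[0]


# Toy usage with a 4-token vocabulary and a fixed seed.
rng = random.Random(0)
print(sample_token([2.0, 1.0, 0.0, -1.0], rng, top_k=2, top_p=0.9, temp=1.0))
```

With `top_k=2`, only the two highest-scoring tokens can ever be returned. Lowering `top_p` narrows the choice further.
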

## Downloading and converting the original models (GPT-2)

You can download the original model files using the [download-model.sh](download-model.sh) Bash script. The models are
in TensorFlow format, so to use them with ggml you need to convert them to the appropriate format. This is done
via the [convert-ckpt-to-ggml.py](convert-ckpt-to-ggml.py) Python script.

Here is the entire process for the GPT-2 117M model (download from the official site + conversion):

```
cd ggml/build
../examples/gpt-2/download-model.sh 117M

Downloading model 117M ...
models/gpt-2-117M/checkpoint 100%[=============================>] 77 --.-KB/s in 0s
models/gpt-2-117M/encoder.json 100%[=============================>] 1018K 1.20MB/s in 0.8s
models/gpt-2-117M/hparams.json 100%[=============================>] 90 --.-KB/s in 0s
models/gpt-2-117M/model.ckpt.data-00000-of-00001 100%[=============================>] 474.70M 1.21MB/s in 8m 39s
models/gpt-2-117M/model.ckpt.index 100%[=============================>] 5.09K --.-KB/s in 0s
models/gpt-2-117M/model.ckpt.meta 100%[=============================>] 460.11K 806KB/s in 0.6s
models/gpt-2-117M/vocab.bpe 100%[=============================>] 445.62K 799KB/s in 0.6s
Done! Model '117M' saved in 'models/gpt-2-117M/'

Run the convert-ckpt-to-ggml.py script to convert the model to ggml format.

  python /Users/john/ggml/examples/gpt-2/convert-ckpt-to-ggml.py models/gpt-2-117M/ 1
```

This conversion requires Python and TensorFlow to be installed on your computer. If you want to avoid
this, you can download the already-converted ggml models as described below.

## Downloading and converting the original models (Cerebras-GPT)

Clone the respective repository from here: https://huggingface.co/cerebras

Use the [convert-cerebras-to-ggml.py](convert-cerebras-to-ggml.py) script to convert the model to `ggml` format:

```
cd ggml/build
git clone https://huggingface.co/cerebras/Cerebras-GPT-111M models/Cerebras-GPT-111M
python ../examples/gpt-2/convert-cerebras-to-ggml.py models/Cerebras-GPT-111M/
```

## Downloading the ggml model directly (GPT-2)

For convenience, I will be hosting the converted ggml model files to make it easier to run the examples. This
way, you can download a single binary file and start using it right away. Neither Python nor TensorFlow is required.

Here is how to get the 117M ggml model:

```
cd ggml/build
../examples/gpt-2/download-ggml-model.sh 117M

Downloading ggml model 117M ...
models/gpt-2-117M/ggml-model.bin 100%[===============================>] 239.58M 8.52MB/s in 28s
Done! Model '117M' saved in 'models/gpt-2-117M/ggml-model.bin'
You can now use it like this:

  $ ./bin/gpt-2 -m models/gpt-2-117M/ggml-model.bin -p "This is an example"
```

At some point, I might decide to stop hosting these models. If that happens, revert to the manual process above.

## Quantizing the models

You can also try to quantize the `ggml` models via 4-bit integer quantization.
Keep in mind that for smaller models, this will render them completely useless.
You generally want to quantize larger models.

```
# quantize GPT-2 F16 to Q4_0 (faster but less precise)
./bin/gpt-2-quantize models/gpt-2-1558M/ggml-model-f16.bin models/gpt-2-1558M/ggml-model-q4_0.bin 2
./bin/gpt-2 -m models/gpt-2-1558M/ggml-model-q4_0.bin -p "This is an example"

# quantize Cerebras F16 to Q4_1 (slower but more precise)
./bin/gpt-2-quantize models/Cerebras-GPT-6.7B/ggml-model-f16.bin models/Cerebras-GPT-6.7B/ggml-model-q4_1.bin 3
./bin/gpt-2 -m models/Cerebras-GPT-6.7B/ggml-model-q4_1.bin -p "This is an example"
```
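
To give a feel for what 4-bit integer quantization does to the weights, here is a simplified Python sketch of block-wise symmetric quantization: each block of values shares one scale, and every value is rounded to a 4-bit integer. This is not ggml's actual Q4_0 or Q4_1 storage layout or rounding rule. The function names and the block of example values are assumptions made for illustration only.

```python
def quantize_block_4bit(values):
    """Quantize one block of floats to integers in [-8, 7] plus a shared scale."""
    amax = max(abs(v) for v in values)
    # One scale per block; fall back to 1.0 for an all-zero block.
    scale = amax / 7.0 if amax > 0 else 1.0
    quants = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block_4bit(scale, quants):
    """Recover approximate floats from the 4-bit integers and the block scale."""
    return [scale * q for q in quants]


# Hypothetical block of weights, chosen only to show the round trip.
block = [1.0, -0.5, 0.25, 0.0]
scale, quants = quantize_block_4bit(block)
restored = dequantize_block_4bit(scale, quants)

# The round-trip error per value is at most half a quantization step.
print(max(abs(a - b) for a, b in zip(block, restored)) <= scale / 2 + 1e-12)
```

Each value can be off by up to half a step of the block's scale, which is the precision being traded for a smaller file.
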

## Batched generation example

You can try the batched generation from a given prompt using the `gpt-2-batched` binary.

Sample output:

```
$ gpt-2-batched -np 5 -m models/gpt-2-117M/ggml-model.bin -p "Hello my name is" -n 50

main: seed = 1697037431
gpt2_model_load: loading model from 'models/gpt-2-117M/ggml-model.bin'
gpt2_model_load: n_vocab = 50257
gpt2_model_load: n_ctx = 1024
gpt2_model_load: n_embd = 768
gpt2_model_load: n_head = 12
gpt2_model_load: n_layer = 12
gpt2_model_load: ftype = 1
gpt2_model_load: qntvr = 0
gpt2_model_load: ggml tensor size = 320 bytes
gpt2_model_load: backend buffer size = 312.72 MB
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1660, compute capability 7.5
gpt2_model_load: using CPU backend
gpt2_model_load: memory size = 72.00 MB, n_mem = 12288
gpt2_model_load: model size = 239.08 MB
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: compute buffer size: 3.26 MB


main: generating 5 sequences ...
main: prompt: 'Hello my name is'
main: number of tokens in prompt = 4, first 8 tokens: 15496 616 1438 318


sequence 0:

Hello my name is John. You can call me any way you want, if you want, but for my very first date, I will be on the phone with you. We're both in our early 20s, but I feel like it's all

sequence 1:

Hello my name is Robert, and I want to say that we're proud to have your company here on the world's largest platform for sharing your stories with us. This is a huge opportunity for our community. We have hundreds of people on this team and

sequence 2:

Hello my name is Jack. I'm the one who created you.

Jack is a boy with a big smile and a big heart. He is a handsome guy. He loves the outdoors and loves the people he meets. He wants to be a

sequence 3:

Hello my name is John. I am a Canadian citizen with a large number of family in Quebec and I am interested in studying. My aim is to take up a post in the Journal of the International Academy of Sciences of Canada which I am currently finishing.

sequence 4:

Hello my name is Dan. I am an entrepreneur. I am a great father. I am a great husband. I am a great husband. I am a great dad. And I am a great husband.

I love my life. I love



main: load time = 880.80 ms
main: sample time = 91.43 ms
main: predict time = 2518.29 ms
main: total time = 3544.32 ms
```
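
The `memory size = 72.00 MB, n_mem = 12288` line in the log can be reproduced from the hyperparameters printed just above it. The Python sketch below assumes that the memory holds one key and one value vector per layer and context position, each stored as 32-bit floats. That assumption matches the logged numbers, but it is inferred from them rather than taken from the source code. `kv_memory_bytes` is a hypothetical helper.

```python
def kv_memory_bytes(n_layer, n_ctx, n_embd, bytes_per_value=4):
    """Return (n_mem, total bytes) for a key/value memory.

    Assumption: one key and one value vector of n_embd elements per layer
    and context position, stored with bytes_per_value bytes per element.
    """
    n_mem = n_layer * n_ctx
    total = 2 * n_mem * n_embd * bytes_per_value  # factor 2: keys and values
    return n_mem, total


# Hyperparameters from the log: n_layer = 12, n_ctx = 1024, n_embd = 768.
n_mem, total = kv_memory_bytes(12, 1024, 768)
print(n_mem, total / (1024 * 1024))  # 12288 72.0
```
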