Convert Hugging Face tokenizer.json to tokenizer.model - inference failed
I am trying to convert this tokenizer.json into tokenizer.model in order to run Karpathy's llama2.c: https://github.com/karpathy/llama2.c/
I tried the following steps:
- Extract vocabulary from tokenizer.json
- Train the sentencepiece tokenizer using spm_train with the extracted vocabulary (vocab_size = 32000). This generates tokenizer.model
- Use tokenizer.py to convert the tokenizer.model to tokenizer.bin.
Even though the above steps produced a tokenizer.model, inference was not successful. I expected the model to generate a tiny story (because it was trained on the TinyStories dataset), but the output I got was random gibberish.
I assume this has something to do with the tokenizer.model that was generated.
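One likely cause, sketched here with made-up data: spm_train learns its own vocabulary and assigns its own IDs, so the resulting IDs need not line up with the ones the model was trained on. The two toy vocabularies below are hypothetical and only illustrate the effect of decoding the same IDs with a differently ordered vocabulary.

```python
# Toy illustration (hypothetical vocabularies): token IDs produced by the model
# are only meaningful under the vocabulary it was trained with.
original_vocab = ["<unk>", "▁Once", "▁upon", "▁a", "▁time"]
# A retrained tokenizer may hold the same pieces under different IDs.
retrained_vocab = ["<unk>", "▁a", "▁time", "▁Once", "▁upon"]


def decode(ids, vocab):
    """Map IDs to pieces and turn the SentencePiece marker into spaces."""
    return "".join(vocab[i] for i in ids).replace("▁", " ").strip()


model_output_ids = [1, 2, 3, 4]  # what the model "means" under original_vocab

good = decode(model_output_ids, original_vocab)
bad = decode(model_output_ids, retrained_vocab)
print(good)
print(bad)
```

With a real 32000-entry vocabulary the mismatch would be far more scrambled than in this toy case, which would be consistent with the gibberish described above.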
My question is: can a Hugging Face tokenizer be converted to tokenizer.model and used with llama2.c? If yes, how can this be done?
If anyone could assist with this, it would be really helpful.
Not sure how the conversion works. I have a similar thread in the reverse direction that was never solved:
https://github.com/karpathy/llama2.c/issues/411
However, since I imported this model from llama2.c, I think you should be able to just use the default tokenizer from the llama2.c repo and I assume it would work fine.
Yes, it works fine with the tokenizer.bin from the git repo https://github.com/nickypro/llama2.c.git. But I need to work with different Hugging Face Llama models and tokenizers, ranging from 15M to 3B. I also wrote a simple Python script to convert tokenizer.json to tokenizer.bin in the format the llama2.c C code expects. I got it working, but another issue came up:
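For anyone attempting the same conversion, here is a minimal sketch of such a writer. It assumes the tokenizer.bin layout that llama2.c's run.c appears to read (a uint32 max token length, then for each token a float32 score, a uint32 byte length, and the raw UTF-8 bytes), and it assumes token IDs are contiguous from 0. The function names are hypothetical, and since a BPE tokenizer.json carries no per-token scores, the `-float(i)` scores in the usage part are only a placeholder assumption.

```python
import io
import struct


def pieces_from_vocab(vocab):
    """Turn a token->id dict into a list ordered by ID (IDs assumed 0..n-1)."""
    pieces = [None] * len(vocab)
    for piece, idx in vocab.items():
        pieces[idx] = piece
    return pieces


def write_tokenizer_bin(pieces, scores, out):
    """Write ID-ordered pieces and scores in the assumed llama2.c layout."""
    encoded = [p.encode("utf-8") for p in pieces]
    max_len = max(len(b) for b in encoded)
    out.write(struct.pack("<I", max_len))  # header: longest token in bytes
    for b, s in zip(encoded, scores):
        out.write(struct.pack("<fI", s, len(b)))  # score, then byte length
        out.write(b)  # raw token bytes, no terminator


# Usage with a tiny made-up vocabulary, written to memory instead of a file.
vocab = {"<unk>": 0, "▁a": 1, "▁story": 2}
pieces = pieces_from_vocab(vocab)
scores = [-float(i) for i in range(len(pieces))]  # placeholder scores
buf = io.BytesIO()
write_tokenizer_bin(pieces, scores, buf)
data = buf.getvalue()
```

The explicit `<` in the format strings pins the output to little-endian, which matches the native layout on common x86 and ARM machines. Any normalization of the pieces should happen before they are passed to the writer.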
I read that in Llama, words start with ▁ (U+2581, which looks like an underscore), which represents a space before the word. This is part of training: Llama learned to treat ▁Hello as " Hello", but the tokenizer encodes that as a special token. So "▁" must be preserved in the vocab file and tokens; otherwise the model will not tokenize inputs correctly or decode outputs to the right words. But when converting token IDs back to strings (i.e. generation output), "▁" should be converted to a real space " ". I haven't seen any logic in run.c that replaces "▁" with " ". Is this excluded due to some special case?
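On the run.c question: as far as I can tell, the export step in llama2.c's tokenizer.py already replaces the marker with a real space when it writes tokenizer.bin, so the C code would never see "▁" at all. That is worth verifying against your copy of the repo. A custom converter could do the same at export time. The sketch below uses hypothetical helper names and a made-up vocabulary.

```python
SPM_SPACE = "\u2581"  # "▁" (LOWER ONE EIGHTH BLOCK), not the ASCII underscore "_"


def piece_for_export(piece):
    """Replace the SentencePiece whitespace marker with a plain space."""
    return piece.replace(SPM_SPACE, " ")


def decode(ids, exported_pieces):
    """Decoding is plain concatenation once the marker is gone."""
    return "".join(exported_pieces[i] for i in ids)


# Made-up vocabulary for illustration only.
pieces = ["<unk>", "▁Hello", "▁world", "snake_case"]
exported = [piece_for_export(p) for p in pieces]
text = decode([1, 2], exported)
print(repr(text))
```

Because only U+2581 is replaced, pieces that contain a genuine ASCII underscore are left untouched.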
Yeah, the tokenizer for Llama 2 specifically seems to be quite inconsistent in the way it handles spaces. My friend made a blog post about his issues with it here: https://davidquarel.github.io/2024/10/01/tokenizer-bad.html
Maybe you are getting similar issues?
My understanding is that the same tokenizer was used for the 15M model as for the original Llama 2 model family, but I haven't confirmed this.
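One way to check this would be to compare the token-to-ID maps of the two tokenizers directly. A minimal sketch, assuming both are available as tokenizer.json files whose vocabulary sits under `model.vocab` as a token-to-ID dict (the layout used for BPE tokenizers); the function names and the paths in the comment are hypothetical.

```python
import json


def load_vocab(path):
    """Read the token->id map from a tokenizer.json (BPE layout assumed)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["model"]["vocab"]


def vocab_diff(vocab_a, vocab_b):
    """Return {piece: (id_in_a, id_in_b)} for every piece that disagrees."""
    diff = {}
    for piece in vocab_a.keys() | vocab_b.keys():
        a, b = vocab_a.get(piece), vocab_b.get(piece)
        if a != b:
            diff[piece] = (a, b)
    return diff


# Usage with in-memory dicts; with real files this would be something like
# vocab_diff(load_vocab("a/tokenizer.json"), load_vocab("b/tokenizer.json")).
example = vocab_diff({"a": 0, "b": 1}, {"a": 0, "b": 2, "c": 3})
print(example)
```

An empty result would mean the two vocabularies assign identical IDs. It would not by itself confirm that normalization and whitespace handling match as well.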