BEE-spoke-data/smol_llama-101M-GQA-python-GGUF
Quantized GGUF model files for smol_llama-101M-GQA-python from BEE-spoke-data
| Name | Quant method | Size |
|---|---|---|
| smol_llama-101m-gqa-python.fp16.gguf | fp16 | 203.28 MB |
| smol_llama-101m-gqa-python.q2_k.gguf | q2_k | 50.93 MB |
| smol_llama-101m-gqa-python.q3_k_m.gguf | q3_k_m | 57.06 MB |
| smol_llama-101m-gqa-python.q4_k_m.gguf | q4_k_m | 65.41 MB |
| smol_llama-101m-gqa-python.q5_k_m.gguf | q5_k_m | 74.34 MB |
| smol_llama-101m-gqa-python.q6_k.gguf | q6_k | 83.83 MB |
| smol_llama-101m-gqa-python.q8_0.gguf | q8_0 | 108.35 MB |
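As a rough guide to choosing among the files above, the sketch below picks the largest file that fits a given size budget and can optionally load it with llama-cpp-python. The helper names `pick_quant_file` and `load_llm`, and the "largest file that fits" heuristic, are assumptions made for illustration rather than anything this repository prescribes. The sizes are copied from the table, and the repo id is assumed to be `afrideva/smol_llama-101M-GQA-python-GGUF`.

```python
# Sizes in MB, copied from the table above.
QUANT_SIZES_MB = {
    "q2_k": 50.93,
    "q3_k_m": 57.06,
    "q4_k_m": 65.41,
    "q5_k_m": 74.34,
    "q6_k": 83.83,
    "q8_0": 108.35,
    "fp16": 203.28,
}

# Assumed repo id for the GGUF files listed in the table.
REPO_ID = "afrideva/smol_llama-101M-GQA-python-GGUF"


def pick_quant_file(budget_mb: float) -> str:
    """Return the filename of the largest listed file that fits in budget_mb.

    Heuristic only: it assumes a larger file means less quantization loss.
    """
    fitting = {q: size for q, size in QUANT_SIZES_MB.items() if size <= budget_mb}
    if not fitting:
        raise ValueError(f"no listed file fits within {budget_mb} MB")
    best = max(fitting, key=fitting.get)
    return f"smol_llama-101m-gqa-python.{best}.gguf"


def load_llm(budget_mb: float):
    """Download and load the chosen file (needs llama-cpp-python and huggingface-hub)."""
    from llama_cpp import Llama  # imported lazily so the picker works without it

    return Llama.from_pretrained(repo_id=REPO_ID, filename=pick_quant_file(budget_mb))
```

For example, `pick_quant_file(70)` selects the `q4_k_m` file, since the `q5_k_m` file (74.34 MB) is over budget.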
Original Model Card:
smol_llama-101M-GQA: python
400MB of buzz: pure Python programming nectar!
This model is the general pre-trained checkpoint BEE-spoke-data/smol_llama-101M-GQA, trained for one additional epoch on a deduplicated version of PyPI. Play with the model in this demo space.
- Its architecture is the same as the base, with some new Python-related tokens added to vocab prior to training.
- It can generate basic Python code and README-style markdown, but will struggle with harder planning/reasoning tasks.
- This is an experiment to test the abilities of smol-sized models in code generation, covering both their capabilities and their limitations.
Use with care & understand that there may be some bugs still to be worked out.
Usage
Be sure to note:
- The model uses the "slow" llama2 tokenizer. Set use_fast=False when loading the tokenizer.
- Use transformers library version 4.33.3 due to a known issue in version 4.34.1 (at time of writing).
Which llama2 tokenizer the API widget uses is an age-old mystery, and may cause minor whitespace issues (widget only).
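The version note above can be turned into a small runtime guard. This is a minimal sketch: the function names and the three status labels are hypothetical, and the only facts it encodes are the two version numbers mentioned in the note (4.33.3 recommended, 4.34.1 with a known issue).

```python
from importlib import metadata
from typing import Optional

RECOMMENDED = "4.33.3"  # version recommended in the note above
KNOWN_ISSUE = "4.34.1"  # version the note reports a known issue with


def version_status(installed: str) -> str:
    """Classify a transformers version string against the note above."""
    if installed == RECOMMENDED:
        return "ok"
    if installed == KNOWN_ISSUE:
        return "known-issue"
    return "untested"  # the model card makes no claim about other versions


def installed_transformers_version() -> Optional[str]:
    """Return the installed transformers version, or None if it is not installed."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None
```

Calling `version_status(installed_transformers_version() or "")` before loading the model gives a quick hint as to whether the environment matches the one the card was written against.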
To install the necessary packages and load the model:
# Install necessary packages
# pip install transformers==4.33.3 accelerate sentencepiece

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(
    "BEE-spoke-data/smol_llama-101M-GQA-python",
    use_fast=False,
)
model = AutoModelForCausalLM.from_pretrained(
    "BEE-spoke-data/smol_llama-101M-GQA-python",
    device_map="auto",
)

# The model can now be used as any other decoder
longer code-gen example
Below is a quick script that can be used as a reference/starting point for writing your own, better one :)
Unleash the Power of Code Generation! Click to Reveal the Magic!
Are you ready to witness the incredible possibilities of code generation? Brace yourself for an exceptional journey into the world of artificial intelligence and programming. Observe a script that will change the way you create and finalize code.
This script provides entry to a planet where machines can write code with remarkable precision and imagination.
"""
simple script for testing model(s) designed to generate/complete code

See details/args with the below.

    python textgen_inference_code.py --help
"""
import logging
import random
import time
from pathlib import Path

import fire
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

logging.basicConfig(format="%(levelname)s - %(message)s", level=logging.INFO)


class Timer:
    """
    Basic timer utility.
    """

    def __enter__(self):
        self.start_time = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.end_time = time.perf_counter()
        self.elapsed_time = self.end_time - self.start_time
        logging.info(f"Elapsed time: {self.elapsed_time:.4f} seconds")


def load_model(model_name, use_fast=False):
    """util for loading model and tokenizer"""
    logging.info(f"Loading model: {model_name}")
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=use_fast)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype="auto", device_map="auto"
    )
    model = torch.compile(model)
    return tokenizer, model


def run_inference(prompt, model, tokenizer, max_new_tokens: int = 256):
    """
    Generate a completion for the prompt with beam search and log the result.

    Args:
        prompt (str): text to complete
        model: causal LM returned by load_model
        tokenizer: tokenizer returned by load_model
        max_new_tokens (int, optional): max new tokens to generate

    Returns:
        str: the decoded text (prompt plus completion)
    """
    logging.info(f"Running inference with max_new_tokens={max_new_tokens} ...")
    with Timer() as timer:
        # tokenize and move the inputs to the same device as the model
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            min_new_tokens=8,
            renormalize_logits=True,
            no_repeat_ngram_size=8,
            repetition_penalty=1.04,
            num_beams=4,
            early_stopping=True,
        )
    text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    logging.info(f"Output text:\n\n{text}")
    return text


def main(
    model_name="BEE-spoke-data/smol_llama-101M-GQA-python",
    prompt: str = None,
    use_fast=False,
    n_tokens: int = 256,
):
    """Load the model and run code generation on a single prompt.

    Args:
        model_name (str, optional): model name or path to load
        prompt (None, optional): specify the prompt directly (default: random choice from list)
        use_fast (bool, optional): whether to load the "fast" tokenizer
        n_tokens (int, optional): max new tokens to generate
    """
    logging.info(f"Inference with:\t{model_name}, max_new_tokens:{n_tokens}")

    if prompt is None:
        prompt_list = [
            '''
def print_primes(n: int):
    """
    Print all primes between 1 and n
    """''',
            "def quantum_analysis(",
            "def sanitize_filenames(target_dir:str, recursive:False, extension",
        ]
        prompt = random.SystemRandom().choice(prompt_list)

    logging.info(f"Using prompt:\t{prompt}")

    tokenizer, model = load_model(model_name, use_fast=use_fast)

    run_inference(prompt, model, tokenizer, n_tokens)


if __name__ == "__main__":
    fire.Fire(main)
Wowoweewa!! It can create some file cleaning utilities.
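The `generate` call in the script above sets `no_repeat_ngram_size=8`, which stops the model from producing any 8-token sequence twice. The pure-Python sketch below illustrates the idea on plain lists of token ids. It is a simplified illustration with a hypothetical function name, not the transformers implementation.

```python
from typing import List, Set


def banned_next_tokens(token_ids: List[int], n: int) -> Set[int]:
    """Return the tokens that would complete an n-gram already present in token_ids."""
    if n <= 0 or len(token_ids) < n:
        return set()  # no complete n-gram has been generated yet
    # The last n-1 tokens form the prefix of the n-gram being built.
    prefix = tuple(token_ids[len(token_ids) - (n - 1):]) if n > 1 else ()
    banned = set()
    # Scan every complete n-gram seen so far; if it starts with the same
    # prefix, its final token must not be generated next.
    for i in range(len(token_ids) - n + 1):
        if tuple(token_ids[i:i + n - 1]) == prefix:
            banned.add(token_ids[i + n - 1])
    return banned
```

With `token_ids = [1, 2, 3, 1, 2]` and `n = 3`, the sequence currently ends in `(1, 2)`, and `(1, 2, 3)` has already occurred, so token `3` is banned at the next step. For code generation a fairly large `n` such as 8 leaves room for legitimately repeated short patterns (indentation, common keywords) while still cutting off long loops.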