Instructions to use afrideva/smol_llama-101M-GQA-python-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use afrideva/smol_llama-101M-GQA-python-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="afrideva/smol_llama-101M-GQA-python-GGUF",
	filename="smol_llama-101m-gqa-python.fp16.gguf",
)

output = llm(
	"Once upon a time,",
	max_tokens=512,
	echo=True
)
print(output)
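The call above prints the whole response object rather than just the generated text. llama-cpp-python returns a completion as a dict in the OpenAI completion layout, so the text sits under `choices[0]["text"]`. Below is a minimal sketch of pulling it out. The helper name `completion_text` and the `sample_output` dict are hypothetical and hand-written for illustration; a real response carries more fields.

```python
from typing import Any, Dict


def completion_text(output: Dict[str, Any]) -> str:
    """Return the generated text of the first choice in a completion dict."""
    choices = output.get("choices") or []
    if not choices:
        raise ValueError("completion contains no choices")
    return choices[0]["text"]


# Hypothetical, hand-written stand-in for what `llm(...)` returns;
# a real response also includes fields such as id, model and usage.
sample_output = {
    "choices": [
        {"text": "Once upon a time, there was a bee.", "finish_reason": "stop"}
    ]
}

print(completion_text(sample_output))
```

With `echo=True`, as in the snippet above, the returned text begins with the prompt itself.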


llama.cpp

How to use afrideva/smol_llama-101M-GQA-python-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf afrideva/smol_llama-101M-GQA-python-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf afrideva/smol_llama-101M-GQA-python-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf afrideva/smol_llama-101M-GQA-python-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf afrideva/smol_llama-101M-GQA-python-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf afrideva/smol_llama-101M-GQA-python-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf afrideva/smol_llama-101M-GQA-python-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf afrideva/smol_llama-101M-GQA-python-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf afrideva/smol_llama-101M-GQA-python-GGUF:Q4_K_M

Use Docker

docker model run hf.co/afrideva/smol_llama-101M-GQA-python-GGUF:Q4_K_M


vLLM

How to use afrideva/smol_llama-101M-GQA-python-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "afrideva/smol_llama-101M-GQA-python-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "afrideva/smol_llama-101M-GQA-python-GGUF",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
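The same request can be assembled from Python using only the standard library. This is a sketch that mirrors the curl call above: the helper name `build_completion_request` is hypothetical, and the URL and payload values are copied from the curl example. The block only builds the request; sending it assumes the vLLM server is already running locally.

```python
import json
import urllib.request


def build_completion_request(
    prompt: str,
    base_url: str = "http://localhost:8000",
    model: str = "afrideva/smol_llama-101M-GQA-python-GGUF",
    max_tokens: int = 512,
    temperature: float = 0.5,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_completion_request("Once upon a time,")
print(request.get_method(), request.full_url)

# With the server running, the request could be sent like this:
#   with urllib.request.urlopen(request) as response:
#       print(json.load(response)["choices"][0]["text"])
```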

Use Docker

docker model run hf.co/afrideva/smol_llama-101M-GQA-python-GGUF:Q4_K_M

Ollama
How to use afrideva/smol_llama-101M-GQA-python-GGUF with Ollama:
```
ollama run hf.co/afrideva/smol_llama-101M-GQA-python-GGUF:Q4_K_M
```

Unsloth Studio

How to use afrideva/smol_llama-101M-GQA-python-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for afrideva/smol_llama-101M-GQA-python-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for afrideva/smol_llama-101M-GQA-python-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for afrideva/smol_llama-101M-GQA-python-GGUF to start chatting

Docker Model Runner
How to use afrideva/smol_llama-101M-GQA-python-GGUF with Docker Model Runner:
```
docker model run hf.co/afrideva/smol_llama-101M-GQA-python-GGUF:Q4_K_M
```

Lemonade

How to use afrideva/smol_llama-101M-GQA-python-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull afrideva/smol_llama-101M-GQA-python-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.smol_llama-101M-GQA-python-GGUF-Q4_K_M

List all available models

lemonade list

BEE-spoke-data/smol_llama-101M-GQA-python-GGUF

Quantized GGUF model files for smol_llama-101M-GQA-python from BEE-spoke-data

Name	Quant method	Size
smol_llama-101m-gqa-python.fp16.gguf	fp16	203.28 MB
smol_llama-101m-gqa-python.q2_k.gguf	q2_k	50.93 MB
smol_llama-101m-gqa-python.q3_k_m.gguf	q3_k_m	57.06 MB
smol_llama-101m-gqa-python.q4_k_m.gguf	q4_k_m	65.41 MB
smol_llama-101m-gqa-python.q5_k_m.gguf	q5_k_m	74.34 MB
smol_llama-101m-gqa-python.q6_k.gguf	q6_k	83.83 MB
smol_llama-101m-gqa-python.q8_0.gguf	q8_0	108.35 MB
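To compare the files in the table, it can help to convert each size into a rough bits-per-weight figure. The sketch below does that arithmetic under two loudly stated assumptions: the parameter count is taken as 101 million (read off the model name, not measured), and 1 MB is treated as 10^6 bytes. GGUF files also hold metadata and some tensors at higher precision, so the results are only approximate. The helper names are hypothetical.

```python
from typing import Optional

# File sizes in MB, copied from the table above.
QUANT_SIZES_MB = {
    "fp16": 203.28,
    "q2_k": 50.93,
    "q3_k_m": 57.06,
    "q4_k_m": 65.41,
    "q5_k_m": 74.34,
    "q6_k": 83.83,
    "q8_0": 108.35,
}

# Assumption: "101M" in the model name is the parameter count.
ASSUMED_PARAMS = 101_000_000


def bits_per_weight(size_mb: float, n_params: int = ASSUMED_PARAMS) -> float:
    """Approximate average bits stored per parameter (1 MB taken as 10**6 bytes)."""
    return size_mb * 1_000_000 * 8 / n_params


def largest_quant_within(budget_mb: float) -> Optional[str]:
    """Return the largest file that fits in budget_mb, or None if none fits."""
    fitting = {name: size for name, size in QUANT_SIZES_MB.items() if size <= budget_mb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


for name, size in QUANT_SIZES_MB.items():
    print(f"{name:8s} {size:8.2f} MB  ~{bits_per_weight(size):.1f} bits/weight")

print("Largest file under 70 MB:", largest_quant_within(70))
```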

Original Model Card:

smol_llama-101M-GQA: python

400MB of buzz: pure Python programming nectar! 🍯

This model is the general pre-trained checkpoint BEE-spoke-data/smol_llama-101M-GQA, further trained on a deduplicated version of PyPI for +1 epoch. Play with the model in this demo space.

Its architecture is the same as the base model's, with some new Python-related tokens added to the vocabulary prior to training.
It can generate basic Python code and README-style markdown, but will struggle with harder planning and reasoning tasks.
This is an experiment to test the abilities of smol-sized models in code generation, covering both their capabilities and their limitations.

Use with care & understand that there may be some bugs 🐛 still to be worked out.

Usage

📌 Be sure to note:

The model uses the "slow" llama2 tokenizer. Set use_fast=False when loading the tokenizer.
Use transformers library version 4.33.3; version 4.34.1 has a known issue (at time of writing).

Which llama2 tokenizer the API widget uses is an age-old mystery, and may cause minor whitespace issues (widget only).
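The version note above can be turned into a quick check before loading the model. This is a minimal sketch: the helper names and the three status labels are hypothetical, and the only facts it encodes are the two version numbers stated above. It does not say anything about versions other than those two.

```python
import re
from importlib import metadata

RECOMMENDED = "4.33.3"  # version recommended above
KNOWN_BAD = "4.34.1"    # version with the known issue mentioned above


def parse_version(version: str) -> tuple:
    """Parse the leading 'X.Y.Z' of a version string into a tuple of ints."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_transformers_version(installed: str) -> str:
    """Classify a version as 'ok', 'known-bad', or 'untested' (hypothetical labels)."""
    parsed = parse_version(installed)
    if parsed == parse_version(RECOMMENDED):
        return "ok"
    if parsed == parse_version(KNOWN_BAD):
        return "known-bad"
    return "untested"


try:
    installed = metadata.version("transformers")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("transformers is not installed")
else:
    print(f"transformers {installed}: {check_transformers_version(installed)}")
```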

To install the necessary packages and load the model:

# Install necessary packages
# pip install transformers==4.33.3 accelerate sentencepiece

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(
    "BEE-spoke-data/smol_llama-101M-GQA-python",
    use_fast=False,
)
model = AutoModelForCausalLM.from_pretrained(
    "BEE-spoke-data/smol_llama-101M-GQA-python",
    device_map="auto",
)

# The model can now be used as any other decoder

Longer code-gen example

Below is a quick script that can be used as a reference/starting point for writing your own, better one :)

🔥 Unleash the Power of Code Generation! Click to Reveal the Magic! 🔮

Are you ready to witness the incredible possibilities of code generation? 🚀 Brace yourself for an exceptional journey into the world of artificial intelligence and programming. Observe a script that will change the way you create and finalize code.

This script provides entry to a planet where machines can write code with remarkable precision and imagination.

"""
simple script for testing model(s) designed to generate/complete code

See details/args with the below. 
    python textgen_inference_code.py --help
"""
import logging
import random
import time

import fire
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

logging.basicConfig(format="%(levelname)s - %(message)s", level=logging.INFO)


class Timer:
    """
    Basic timer utility.
    """

    def __enter__(self):
        self.start_time = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.end_time = time.perf_counter()
        self.elapsed_time = self.end_time - self.start_time
        logging.info(f"Elapsed time: {self.elapsed_time:.4f} seconds")


def load_model(model_name, use_fast=False):
    """Load the tokenizer and model for `model_name`; returns (tokenizer, model)."""
    logging.info(f"Loading model: {model_name}")
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=use_fast)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype="auto", device_map="auto"
    )
    # torch.compile requires PyTorch 2.0 or newer
    model = torch.compile(model)
    return tokenizer, model


def run_inference(prompt, model, tokenizer, max_new_tokens: int = 256):
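    """
    Generate a completion for `prompt` with beam search and log the result.

    Args:
        prompt (str): text to complete
        model: causal LM, as returned by load_model
        tokenizer: tokenizer matching the model
        max_new_tokens (int, optional): maximum number of new tokens to generate

    Returns:
        str: decoded text of the first returned sequence (prompt plus completion)
    """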
    logging.info(f"Running inference with max_new_tokens={max_new_tokens} ...")
    with Timer():
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            min_new_tokens=8,
            renormalize_logits=True,
            no_repeat_ngram_size=8,
            repetition_penalty=1.04,
            num_beams=4,
            early_stopping=True,
        )
    text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
    logging.info(f"Output text:\n\n{text}")
    return text


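def main(
    model_name="BEE-spoke-data/smol_llama-101M-GQA-python",
    prompt: str = None,
    use_fast=False,
    n_tokens: int = 256,
):
    """Load a model and run a single code-completion prompt through it.

    Args:
        model_name (str, optional): Hugging Face model ID or local path to load
        prompt (str, optional): specify the prompt directly (default: random choice from list)
        use_fast (bool, optional): whether to load the "fast" tokenizer (default: False)
        n_tokens (int, optional): max new tokens to generate
    """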
    logging.info(f"Inference with:\t{model_name}, max_new_tokens:{n_tokens}")

    if prompt is None:
        prompt_list = [
            '''
            def print_primes(n: int):
               """
               Print all primes between 1 and n
               """''',
            "def quantum_analysis(",
            "def sanitize_filenames(target_dir:str, recursive:False, extension",
        ]
        prompt = random.SystemRandom().choice(prompt_list)

    logging.info(f"Using prompt:\t{prompt}")

    tokenizer, model = load_model(model_name, use_fast=use_fast)

    run_inference(prompt, model, tokenizer, n_tokens)


if __name__ == "__main__":
    fire.Fire(main)

Wowoweewa!! It can create some file cleaning utilities.
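One detail worth knowing when adapting the script: the first default prompt is a triple-quoted string indented to line up with the surrounding list, so its leading newline and indentation are passed to the model as part of the prompt. Whether that matters for output quality is not something the card states. If a clean prompt is preferred, a minimal sketch using `textwrap.dedent` is shown below. The helper name `normalize_prompt` is hypothetical and not part of the original script.

```python
import textwrap


def normalize_prompt(prompt: str) -> str:
    """Strip shared indentation and surrounding blank lines from a prompt."""
    return textwrap.dedent(prompt).strip("\n")


# Same text and indentation as the first default prompt in the script.
raw_prompt = '''
            def print_primes(n: int):
               """
               Print all primes between 1 and n
               """'''

print(normalize_prompt(raw_prompt))
```

The relative indentation of the docstring is preserved; only the margin shared by every line is removed.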

GGUF

Model size: 0.1B params
Architecture: llama
Quantization levels: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit
Base model: BEE-spoke-data/smol_llama-101M-GQA-python