Instructions for using SanudaDev/SinLlama-GGUF with libraries and local apps.

Libraries

How to use SanudaDev/SinLlama-GGUF with llama-cpp-python:

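# !pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

# Downloads the GGUF file from the Hub on first use, then loads it
llm = Llama.from_pretrained(
    repo_id="SanudaDev/SinLlama-GGUF",
    filename="sinllama-q4_k_m.gguf",
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?",
        }
    ]
)

# The reply text sits inside the first choice of the OpenAI-style response
print(response["choices"][0]["message"]["content"])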

Local Apps

llama.cpp

How to use SanudaDev/SinLlama-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf SanudaDev/SinLlama-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf SanudaDev/SinLlama-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf SanudaDev/SinLlama-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf SanudaDev/SinLlama-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf SanudaDev/SinLlama-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf SanudaDev/SinLlama-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf SanudaDev/SinLlama-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf SanudaDev/SinLlama-GGUF:Q4_K_M

Use Docker

docker model run hf.co/SanudaDev/SinLlama-GGUF:Q4_K_M

vLLM

How to use SanudaDev/SinLlama-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "SanudaDev/SinLlama-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "SanudaDev/SinLlama-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
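
The same request can also be sent from Python. The sketch below uses only the standard library and assumes the server started above is listening on http://localhost:8000 and answers in the OpenAI chat-completions shape. The helper names `build_chat_request` and `ask` are illustrative and not part of vLLM.

```python
import json
import urllib.request

# Assumed endpoint and model name, copied from the curl example above
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "SanudaDev/SinLlama-GGUF"


def build_chat_request(prompt, model=MODEL, url=BASE_URL):
    """Build (but do not send) a POST request carrying one user message."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the request and return the reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("What is the capital of France?")` should return the assistant's reply as a string once the server is up. Because the request follows the OpenAI-compatible format, the same helper may also work against `llama-server` if `BASE_URL` is pointed at its host and port.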

Ollama
How to use SanudaDev/SinLlama-GGUF with Ollama:
```
ollama run hf.co/SanudaDev/SinLlama-GGUF:Q4_K_M
```

Unsloth Studio

How to use SanudaDev/SinLlama-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for SanudaDev/SinLlama-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for SanudaDev/SinLlama-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for SanudaDev/SinLlama-GGUF to start chatting

Docker Model Runner
How to use SanudaDev/SinLlama-GGUF with Docker Model Runner:
```
docker model run hf.co/SanudaDev/SinLlama-GGUF:Q4_K_M
```

Lemonade

How to use SanudaDev/SinLlama-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull SanudaDev/SinLlama-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.SinLlama-GGUF-Q4_K_M

List all available models

lemonade list

SinLlama GGUF — Sinhala Language Model for Low-End Hardware

Why This Model Exists

The original polyglots/SinLlama_v01 is a powerful Sinhala language model built on Meta-Llama-3-8B. However, it comes with a critical barrier:

The Problem with the Original Model

| Requirement | Original SinLlama | This GGUF Version |
| --- | --- | --- |
| Model Size | ~16 GB (FP16) | ~4.65 GB |
| RAM Required | 16–32 GB+ | 6–8 GB |
| GPU Required | Yes (CUDA-capable, 16 GB+ VRAM) | No (runs on CPU) |
| Software Stack | PyTorch, Transformers, CUDA | Ollama / llama.cpp |
| Setup Complexity | High (Python environment, CUDA drivers, etc.) | Minimal |
| Suitable for Low-End PCs | ❌ No | ✅ Yes |

The original Hugging Face model demands expensive GPU hardware and a complex Python/CUDA environment. Most Sri Lankan developers and students do not have access to machines with 16 GB or more of GPU VRAM, so they cannot even load the model. This creates an accessibility gap: the people who need Sinhala AI the most are locked out of using it.

How This GGUF Model Solves It

This is a Q4_K_M quantized GGUF version of SinLlama that:

Runs on ordinary laptops and desktops — no GPU required
Reduced from ~16 GB to ~4.65 GB — fits in modest RAM
Works with Ollama — one-command setup, no Python knowledge needed
Works with llama.cpp — lightweight, cross-platform inference
Preserves Sinhala language quality — Q4_K_M is the sweet spot between size and accuracy
Extended Sinhala tokenizer — optimized 139K vocabulary for better Sinhala text handling

Model Details

| Property | Value |
| --- | --- |
| Base Model | Meta-Llama-3-8B |
| Fine-tuned Model | polyglots/SinLlama_v01 |
| Quantization | Q4_K_M (4-bit, mixed precision) |
| Format | GGUF (llama.cpp compatible) |
| File Size | 4.65 GB |
| Context Length | 2048 tokens |
| Vocabulary Size | ~139,000 tokens (extended Sinhala) |
| Languages | Sinhala (සිංහල), English |

Quick Start

Option 1: Using Ollama (Recommended — Easiest)

Install Ollama → https://ollama.com/download
Download sinllama-q4_k_m.gguf from this repository, then create a file named Modelfile in the same directory:

FROM sinllama-q4_k_m.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 2048

SYSTEM You are SinLlama, a helpful AI assistant that can communicate in Sinhala (සිංහල). You are based on Meta-Llama-3-8B and have been specially trained on Sinhala language data.

Create and run:

ollama create sinllama -f Modelfile
ollama run sinllama

Start chatting in Sinhala:

>>> හෙලෝ, ඔබට කෙසේද?
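
Once `ollama run sinllama` works, the same model can be called from a script through Ollama's local REST API. The sketch below assumes the default endpoint http://localhost:11434/api/generate and the model name `sinllama` created above. The helper names are illustrative, and the parsing relies on the streamed reply being newline-delimited JSON chunks with a `response` field and a final `done` flag.

```python
import json
import urllib.request

# Assumed default Ollama endpoint; adjust if Ollama listens elsewhere
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt, model="sinllama"):
    """JSON body for a streaming generate call."""
    return {"model": model, "prompt": prompt, "stream": True}


def join_stream(lines):
    """Concatenate the 'response' pieces from newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk marks the end of the reply
    return "".join(parts)


def generate(prompt):
    """Send the prompt and return the full reply (needs Ollama running)."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return join_stream(raw.decode("utf-8") for raw in resp)
```

Reading the reply chunk by chunk keeps memory use flat and lets a chat interface show text as it arrives, which helps on slower CPU-only machines.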

Option 2: Using llama.cpp

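# Clone and build llama.cpp (current versions build with CMake)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-cli

# Run the model (the binary formerly named ./main is now llama-cli)
./build/bin/llama-cli -m sinllama-q4_k_m.gguf \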
  -p "ශ්‍රී ලංකාව පිළිබඳ කියන්න" \
  -n 256 \
  --temp 0.7

Option 3: Using Python (llama-cpp-python)

from llama_cpp import Llama

llm = Llama(
    model_path="sinllama-q4_k_m.gguf",
    n_ctx=2048,
    n_threads=4,  # Adjust to your CPU cores
)

output = llm(
    "ශ්‍රී ලංකාව පිළිබඳ කියන්න",
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
)

print(output["choices"][0]["text"])
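
In the call above, the prompt and the generated text share a single 2048-token window, so reserving `max_tokens=256` for the reply leaves 1792 tokens for the prompt. The sketch below shows that arithmetic and one simple way to trim an over-long prompt. The helper names are hypothetical, and the integer list stands in for real token ids, which with llama-cpp-python would come from `llm.tokenize(text.encode("utf-8"))`.

```python
def max_prompt_tokens(n_ctx=2048, max_tokens=256):
    """Tokens left for the prompt once room for the reply is reserved."""
    if max_tokens >= n_ctx:
        raise ValueError("max_tokens must be smaller than n_ctx")
    return n_ctx - max_tokens


def truncate_left(token_ids, budget):
    """Keep only the most recent `budget` tokens, dropping the oldest ones."""
    return token_ids[-budget:] if len(token_ids) > budget else token_ids


budget = max_prompt_tokens()      # 2048 - 256 = 1792
history = list(range(2000))       # stand-in for real token ids
kept = truncate_left(history, budget)
print(budget, len(kept))          # 1792 1792
```

Dropping the oldest tokens suits chat history, where recent turns matter most. For a single long document, summarizing or chunking may serve better.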

Intended Use

Sinhala language text generation — articles, creative writing, summaries
Conversational AI — chatbots and virtual assistants in Sinhala
Education — Sinhala language learning tools
Research — low-resource NLP research for Sinhala
Customer service — automated Sinhala-language support systems

Limitations

This is a quantized model; there may be minor quality loss compared to the full-precision original
Context window is limited to 2048 tokens
The model may occasionally generate incorrect or nonsensical Sinhala text
Not suitable for critical applications without human oversight
Inherits limitations and biases from the base Llama-3-8B model

Hardware Requirements

Minimum (CPU-only)

RAM: 6 GB available
Storage: 5 GB free disk space
CPU: Any modern x86_64 processor (Intel/AMD)
OS: Windows, macOS, or Linux
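
Part of this checklist can be verified before downloading. The sketch below checks free disk space against the 5 GB figure above and suggests a starting value for `n_threads`. It uses only the standard library, which has no portable way to read available RAM, so that check is left out. The function names are illustrative.

```python
import os
import shutil

MIN_DISK_GB = 5  # storage requirement listed above


def free_disk_gb(path="."):
    """Free space on the filesystem holding `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_room_for_model(path=".", required_gb=MIN_DISK_GB):
    """True if the target directory has enough free space for the GGUF file."""
    return free_disk_gb(path) >= required_gb


def suggested_threads():
    """A starting value for n_threads: the logical CPU count, at least 1."""
    return max(1, os.cpu_count() or 1)
```

The logical CPU count is only an upper bound. On machines with hyper-threading, a value closer to the number of physical cores may run faster, so it is worth trying a few settings.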

Quantization Details

The model was quantized using llama.cpp with the Q4_K_M method:

Q4_K_M is a 4-bit k-quant format; the "M" (medium) variant keeps a subset of tensors at higher precision to limit quality loss
Offers a good balance between model size, inference speed, and output quality
Recommended by the llama.cpp community as the default quantization for most use cases

| Quantization | Size | Quality | Speed |
| --- | --- | --- | --- |
| FP16 (Original) | ~16 GB | ★★★★★ | Slow (needs GPU) |
| Q8_0 | ~8.5 GB | ★★★★☆ | Moderate |
| Q4_K_M (This) | ~4.65 GB | ★★★★☆ | Fast |
| Q4_0 | ~4.3 GB | ★★★☆☆ | Fastest |
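
The sizes in this table follow from simple arithmetic: file size is roughly parameter count times bits per weight. The sketch below assumes about 8 billion parameters, taken from the "8B" in the base model name. The extended vocabulary adds somewhat more, so treat the results as approximations.

```python
def size_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), ignoring file metadata."""
    return n_params * bits_per_weight / 8 / 1e9


def effective_bits(file_size_gb, n_params):
    """Average number of bits stored per weight for a file of the given size."""
    return file_size_gb * 1e9 * 8 / n_params


N_PARAMS = 8e9  # assumption: roughly 8 billion parameters

print(round(size_gb(N_PARAMS, 16), 1))           # 16.0, matching the FP16 row
print(round(effective_bits(4.65, N_PARAMS), 2))  # 4.65 bits per weight
```

The Q4_K_M file averages more than 4 bits per weight because the format stores per-block scales alongside the weights and keeps some tensors at higher precision.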

Copyright & License

Model License

This model is distributed under the Meta Llama 3 Community License. By downloading or using this model, you agree to the terms of the Meta Llama 3 Community License Agreement.

Quantization & Distribution

The GGUF quantization and this distribution were prepared by the repository maintainer
The original SinLlama fine-tuning was done by polyglots
All rights to the base architecture belong to Meta Platforms, Inc.

Usage Terms

✅ Free for research and personal use
✅ Free for commercial use (subject to Meta Llama 3 license terms)
✅ Redistribution allowed with attribution
❌ Do not use for generating harmful, misleading, or illegal content
❌ Do not misrepresent the model's outputs as human-written content without disclosure

Attribution

If you use this model in your work, please cite:

@misc{sinllama-gguf,
  title={SinLlama GGUF - Quantized Sinhala Language Model},
  author={SanudaDev},
  year={2025},
  url={https://huggingface.co/SanudaDev/SinLlama-GGUF},
  note={Q4_K_M GGUF quantization of polyglots/SinLlama_v01}
}

Acknowledgments

Meta AI — for the Llama 3 base model
polyglots — for the original SinLlama Sinhala fine-tuning
llama.cpp — for the GGUF quantization toolchain
Ollama — for making local LLM deployment simple

Making Sinhala AI accessible to everyone — not just those with expensive hardware.
