# LateOn-Code-edge (GGUF f16 + Projection)

GGUF conversion of lightonai/LateOn-Code-edge for use with litembeddings.
## Model Details
| Property | Value |
|---|---|
| Base model | lightonai/LateOn-Code-edge |
| Architecture | ModernBERT (17M params) |
| Output dimensions | 48 (after projection) |
| Context length | 8,192 tokens |
| Quantization | f16 |
| GGUF size | 34 MB |
| Projection | 256 → 48 (composed from two PyLate Dense layers: 256→512→48) |
| Use case | Fast, CPU-friendly code search with late interaction (ColBERT-style) |
## Variants
| Variant | Size | Quality |
|---|---|---|
| f32 | 66 MB | Original precision (lossless) |
| f16 (this repo) | 34 MB | Lossless — 100% top-1 agreement, 240/300 weighted |
| Q8_0 | 19 MB | 79% weighted score, 96-100% top-1 agreement, 3.5× smaller |
## Files

| File | Size | Description |
|---|---|---|
| `lightonai-lateon-code-edge-f16.gguf` | 34 MB | ModernBERT encoder in GGUF f16 format |
| `lightonai-lateon-code-edge-f16.projection` | 49 KB | Composed projection matrix (48×256, float32) |
## Usage with litembeddings

```sql
.load ./build/litembeddings

-- Load model with projection
SELECT lembed_model('lightonai-lateon-code-edge-f16.gguf',
  '{"colbert_projection": "lightonai-lateon-code-edge-f16.projection"}');

-- Generate token embeddings for code
SELECT lembed_tokens('async fn get_connection(pool: &Pool) -> Result<Connection>');

-- Code search with MaxSim
SELECT
  id, code,
  lembed_maxsim(lembed_tokens('database connection pool'), token_emb) AS score
FROM code_embeddings
ORDER BY score DESC
LIMIT 10;
```
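For reference, MaxSim is the standard ColBERT late-interaction score: each query token is matched against its most similar document token, and those maxima are summed. A minimal NumPy sketch (the `maxsim` function below is illustrative, not litembeddings' implementation; it assumes L2-normalized token embeddings so that dot products are cosine similarities):

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """ColBERT-style late interaction: for each query token, take the
    best-matching document token, then sum those per-token maxima."""
    # With L2-normalized rows, the dot product is cosine similarity.
    sims = query_emb @ doc_emb.T          # (n_query_tokens, n_doc_tokens)
    return float(sims.max(axis=1).sum())  # sum of per-query-token maxima

# Toy example: 3 query tokens and 5 doc tokens in the model's 48-dim space
query = np.eye(3, 48, dtype=np.float32)   # unit vectors e0, e1, e2
doc = np.eye(5, 48, dtype=np.float32)     # unit vectors e0..e4
print(maxsim(query, doc))  # each query token finds an exact match -> 3.0
```

Because each query token picks its own best match independently, the score rewards documents that cover all query terms rather than just one.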
## Quantization Quality Benchmark
Tested across 3 codebases (jq/C, Rails/Ruby, FastAPI/Python) with 150 questions total (15 easy + 20 medium + 15 hard per codebase). Weighted scoring: easy×1, medium×2, hard×3 = 100 points per codebase, 300 total.
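The scoring arithmetic works out as follows (a small check using only the question counts and weights stated above):

```python
# Weighted scoring from the benchmark: easy x1, medium x2, hard x3.
counts = {"easy": 15, "medium": 20, "hard": 15}   # questions per codebase
weights = {"easy": 1, "medium": 2, "hard": 3}

per_codebase_max = sum(counts[k] * weights[k] for k in counts)
print(per_codebase_max)       # 15*1 + 20*2 + 15*3 = 100
print(per_codebase_max * 3)   # three codebases -> 300 total
```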
### Aggregate Weighted Scores
| Variant | Weighted Score | Percentage |
|---|---|---|
| f32 | 240 / 300 | 80.0% |
| f16 | 240 / 300 | 80.0% |
| Q8_0 | 237 / 300 | 79.0% |
### Per-Corpus Scores
| Corpus | f32 | f16 | Q8_0 |
|---|---|---|---|
| jq (C) | 66/100 | 66/100 | 63/100 |
| Rails (Ruby) | 79/100 | 79/100 | 79/100 |
| FastAPI (Python) | 95/100 | 95/100 | 95/100 |
### Quantization Quality (Top-1 Agreement vs f32)
| Corpus | f16 | Q8_0 |
|---|---|---|
| jq | 100.0% | 96.0% |
| Rails | 100.0% | 100.0% |
| FastAPI | 100.0% | 98.0% |
## Key Findings
- f16 is lossless — identical weighted score (240/300) and 100% top-1 agreement across all codebases
- Q8_0 loses only 1% — 237/300 vs 240/300, drops only on hard queries in jq corpus
- Q8_0 is fastest — 2.5s avg query vs 3.4s f32 vs 13.4s f16 (CPU without FP16 hardware)
- Easy/medium questions show zero quality difference between all variants
## Conversion

Converted using litembeddings' ColBERT converter with PyLate projection support:

```bash
python scripts/convert_colbert_to_gguf.py lightonai/LateOn-Code-edge ./models \
  --name lightonai-lateon-code-edge-f16 --quantize f16
```
The converter automatically detects the PyLate two-layer projection structure (1_Dense + 2_Dense) and composes them into a single projection matrix via W_composed = W2 @ W1.
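Folding two layers into one matrix is valid only because both Dense layers are purely linear (no activation or bias between them); under that assumption, a minimal NumPy sketch with stand-in weights of the shapes named above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in weights with the card's shapes:
# 1_Dense maps 256 -> 512, 2_Dense maps 512 -> 48.
W1 = rng.standard_normal((512, 256)).astype(np.float32)
W2 = rng.standard_normal((48, 512)).astype(np.float32)

# Two bias-free linear layers compose into a single matrix multiply.
W_composed = W2 @ W1   # shape (48, 256), what the .projection file stores

# Applying the composed matrix equals applying the layers in sequence.
x = rng.standard_normal(256).astype(np.float32)
assert np.allclose(W_composed @ x, W2 @ (W1 @ x), rtol=1e-3, atol=1e-2)
print(W_composed.shape)  # (48, 256)
```

Storing the composed 48×256 matrix instead of both layers is what keeps the projection file at 49 KB.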