# DanskGPT Tiny Chat ONNX
ONNX export of mhenrichsen/danskgpt-tiny-chat for use with ONNX Runtime, transformers.js, and @browser-ai/transformers-js.
## Model Details
| Property | Value |
|---|---|
| Architecture | LlamaForCausalLM |
| Parameters | 1B |
| Hidden size | 2048 |
| Layers | 22 |
| Attention heads | 32 (4 KV) |
| Vocabulary | 32,002 tokens |
| Language | Danish |
| Task | Text Generation (Chat) |
| Chat template | ChatML (`<|im_start|>` / `<|im_end|>`) |
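The ChatML template wraps each turn in `<|im_start|>` / `<|im_end|>` markers. A single user turn, with the assistant generation prompt appended, renders as:

```
<|im_start|>user
Hvad er Danmark?<|im_end|>
<|im_start|>assistant
```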
## Available Variants
| Variant | Files | Size | dtype | Description |
|---|---|---|---|---|
| FP32 | model.onnx + .onnx_data | 4401 MB | `"fp32"` | Original precision |
| FP16 | model_fp16.onnx + .onnx.data | 2201 MB | `"fp16"` | Half precision (GPU) |
| INT8 | model_int8.onnx | 1102 MB | `"int8"` | Dynamic INT8 quantization |
| INT4 | model_int4.onnx + .onnx.data | 910 MB | — | INT4 weight-only quantization |
### Which variant should I use?
- CPU deployment: INT8. 3.5x faster decode, 4x smaller, self-contained single file.
- GPU deployment: FP16. Near-lossless quality with native FP16 compute.
- Max compression: INT4. ~4.2x faster decode, 4.8x smaller.
- Browser / transformers.js: INT8 recommended. Self-contained (~1.1 GB download, cached after first load).
## Benchmarks (CPU, 20 prompt tokens, 50 decode steps)
| Variant | Prompt latency | Decode throughput | Size vs FP32 |
|---|---|---|---|
| FP32 | 137.0 ms | 8.5 tok/s | 1.0x |
| FP16 | 403.4 ms | 3.3 tok/s | 0.50x |
| INT8 | 44.9 ms | 29.5 tok/s | 0.25x |
| INT4 | 125.5 ms | 35.8 tok/s | 0.21x |
FP16 is slower than FP32 on CPU due to lack of native FP16 compute. Use FP16 for GPU only.
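The decode-speedup factors quoted in the variant guide follow directly from the table above; a quick check:

```python
# Decode throughput (tok/s) from the CPU benchmark table above.
fp32, int8, int4 = 8.5, 29.5, 35.8

print(f"INT8 speedup: {int8 / fp32:.1f}x")  # 3.5x, matching the variant guide
print(f"INT4 speedup: {int4 / fp32:.1f}x")  # 4.2x
```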
## Accuracy vs FP32
| Variant | Cosine similarity | Top-1 match | Decode divergence |
|---|---|---|---|
| FP16 | 1.000000 | Yes | 0% |
| INT8 | 0.945775 | Yes | 80% |
| INT4 | 0.982727 | Yes | 97% |
FP16 is near-lossless. INT8/INT4 show greedy decode divergence due to autoregressive error amplification, but with temperature sampling the quality difference is usually acceptable.
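The exact evaluation harness behind these numbers is not published with this card, but the two logit-level metrics can be sketched in a few lines (illustrative only, using toy logits rather than real model outputs):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two flattened logit tensors."""
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top1_match(a: np.ndarray, b: np.ndarray) -> bool:
    """Do both variants pick the same greedy (argmax) token?"""
    return int(np.argmax(a)) == int(np.argmax(b))

# Toy logits: a small quantization-style perturbation keeps the argmax
# token while pulling cosine similarity slightly below 1.0.
ref = np.array([2.0, -1.0, 0.5])
quant = ref + np.array([0.05, -0.02, 0.01])
print(cosine_similarity(ref, quant))  # close to 1.0
print(top1_match(ref, quant))         # True
```

Note that a high cosine similarity with matching top-1 tokens can still produce divergent greedy decodes: once one step picks a different token, all subsequent steps condition on a different prefix.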
## File Structure

```
onnx/
  model.onnx            # FP32
  model.onnx_data
  model_fp16.onnx       # FP16
  model_fp16.onnx.data
  model_int8.onnx       # INT8 (self-contained)
  model_int4.onnx       # INT4
  model_int4.onnx.data
config.json
generation_config.json
tokenizer.json
tokenizer.model
tokenizer_config.json
special_tokens_map.json
added_tokens.json
chat_template.jinja
```
## Usage
### With @browser-ai/transformers-js (Vercel AI SDK)

Use @browser-ai/transformers-js to run this model in the browser or Node.js with the Vercel AI SDK:

```bash
npm i @browser-ai/transformers-js @huggingface/transformers ai
```

```js
import { streamText } from "ai";
import { transformersJS } from "@browser-ai/transformers-js";

const result = streamText({
  model: transformersJS("varsan-g/danskgpt-tiny-chat-onnx", {
    dtype: "int8", // recommended for browser/CPU (~1.1 GB)
  }),
  messages: [
    { role: "user", content: "Hvad er Danmark?" },
  ],
  temperature: 0.7,
  maxTokens: 200,
});

for await (const chunk of result.textStream) {
  process.stdout.write(chunk);
}
```
### With download progress (browser)

```js
const model = transformersJS("varsan-g/danskgpt-tiny-chat-onnx", {
  dtype: "int8",
  initProgressCallback: (progress) => {
    console.log(`Loading: ${Math.round(progress * 100)}%`);
  },
});

// Pre-load before generation
await model.createSessionWithProgress((progress) => {
  console.log(`Download: ${Math.round(progress * 100)}%`);
});
```
### Web Worker mode (keeps the UI responsive)

```ts
// worker.ts
import { TransformersJSWorkerHandler } from "@browser-ai/transformers-js";

const handler = new TransformersJSWorkerHandler();
self.onmessage = (msg: MessageEvent) => handler.onmessage(msg);
```

```ts
// main.ts
const model = transformersJS("varsan-g/danskgpt-tiny-chat-onnx", {
  dtype: "int8",
  worker: new Worker(new URL("./worker.ts", import.meta.url), { type: "module" }),
});
```
### With @huggingface/transformers (direct)

Use @huggingface/transformers directly for lower-level control:

```bash
npm i @huggingface/transformers
```

```js
import { pipeline } from "@huggingface/transformers";

const generator = await pipeline(
  "text-generation",
  "varsan-g/danskgpt-tiny-chat-onnx",
  { dtype: "int8" }
);

const messages = [
  { role: "user", content: "Fortæl mig om dansk historie." },
];

const result = await generator(messages, {
  max_new_tokens: 200,
  temperature: 0.7,
  do_sample: true,
});
console.log(result[0].generated_text.at(-1).content);
```
### Streaming with TextStreamer

```js
import { AutoTokenizer, AutoModelForCausalLM, TextStreamer } from "@huggingface/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("varsan-g/danskgpt-tiny-chat-onnx");
const model = await AutoModelForCausalLM.from_pretrained("varsan-g/danskgpt-tiny-chat-onnx", {
  dtype: "int8",
});

const messages = [{ role: "user", content: "Hvad er Danmark?" }];
const inputs = tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true,
  return_dict: true,
});

const streamer = new TextStreamer(tokenizer, {
  skip_prompt: true,
  skip_special_tokens: true,
  callback_function: (text) => process.stdout.write(text),
});

await model.generate({ ...inputs, max_new_tokens: 200, do_sample: true, temperature: 0.7, streamer });
```
**Note:** This is a 1B-parameter model; the INT8 variant is ~1.1 GB, so the first load will take some time to download (it is cached in the browser afterward). For a faster browser experience, consider smaller models from the onnx-community.
### With Optimum (Python)

```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model = ORTModelForCausalLM.from_pretrained("varsan-g/danskgpt-tiny-chat-onnx")
tokenizer = AutoTokenizer.from_pretrained("varsan-g/danskgpt-tiny-chat-onnx", use_fast=False)

prompt = "<|im_start|>user\nHvad er Danmark?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```
**Tokenizer note (Python):** This model uses a SentencePiece `.model` file. You must pass `use_fast=False` to `AutoTokenizer.from_pretrained()` in Python; otherwise it may fail trying to parse the tokenizer as tiktoken. The JavaScript libraries use `tokenizer.json` and work without this workaround.
## Changes from Original
This repository contains a format conversion of mhenrichsen/danskgpt-tiny-chat by mhenrichsen. The model weights have been converted from PyTorch (safetensors) to ONNX format, and quantized variants (FP16, INT8, INT4) have been produced. No fine-tuning or architectural modifications were made.
## License
This model is distributed under the Apache License 2.0, the same license as the original model. See the LICENSE file in this repository for the full license text.
Original model by mhenrichsen: mhenrichsen/danskgpt-tiny-chat