DanskGPT Tiny Chat ONNX

ONNX export of mhenrichsen/danskgpt-tiny-chat for use with ONNX Runtime, transformers.js, and @browser-ai/transformers-js.

Model Details

Property         Value
Architecture     LlamaForCausalLM
Parameters       1B
Hidden size      2048
Layers           22
Attention heads  32 (4 KV)
Vocabulary       32,002 tokens
Language         Danish
Task             Text Generation (Chat)
Chat template    ChatML (<|im_start|>user / <|im_end|>)
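
The ChatML template wraps each turn in <|im_start|> / <|im_end|> markers and ends with an assistant header that cues the model to reply. A minimal sketch of the resulting prompt string (the helper is illustrative, not part of any library; the JavaScript examples below get the same result via apply_chat_template):

```typescript
// Illustrative ChatML prompt builder; role/marker strings match this model's template.
type Message = { role: "system" | "user" | "assistant"; content: string };

function buildChatMLPrompt(messages: Message[]): string {
  const turns = messages
    .map((m) => `<|im_start|>${m.role}\n${m.content}<|im_end|>`)
    .join("\n");
  // Trailing assistant header cues the model to generate its reply.
  return `${turns}\n<|im_start|>assistant\n`;
}

console.log(buildChatMLPrompt([{ role: "user", content: "Hvad er Danmark?" }]));
// <|im_start|>user
// Hvad er Danmark?<|im_end|>
// <|im_start|>assistant
```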

Available Variants

Variant  Files                         Size     dtype   Description
FP32     model.onnx + .onnx_data       4401 MB  "fp32"  Original precision
FP16     model_fp16.onnx + .onnx.data  2201 MB  "fp16"  Half precision (GPU)
INT8     model_int8.onnx               1102 MB  "int8"  Dynamic INT8 quantization
INT4     model_int4.onnx + .onnx.data   910 MB          INT4 weight-only quantization

Which variant should I use?

  • CPU deployment: INT8. 3.5x faster decode, 4x smaller, self-contained single file.
  • GPU deployment: FP16. Near-lossless quality with native FP16 compute.
  • Max compression: INT4. ~4.2x faster decode, 4.8x smaller.
  • Browser / transformers.js: INT8 recommended. Self-contained (~1.1 GB download, cached after first load).
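
The recommendations above can be encoded as a small lookup. The target names here are invented for illustration; the returned strings match the dtype values from the variants table:

```typescript
// Hypothetical helper mapping a deployment target to the recommended dtype.
type Target = "cpu" | "gpu" | "browser" | "max-compression";

function recommendedDtype(target: Target): "fp16" | "int8" | "int4" {
  switch (target) {
    case "gpu":
      return "fp16"; // near-lossless, native FP16 compute
    case "max-compression":
      return "int4"; // smallest download, fastest decode
    default:
      return "int8"; // cpu and browser: self-contained single file
  }
}

console.log(recommendedDtype("browser")); // "int8"
```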

Benchmarks (CPU, 20 prompt tokens, 50 decode steps)

Variant  Prompt latency  Decode throughput  Size vs FP32
FP32     137.0 ms         8.5 tok/s         1.0x
FP16     403.4 ms         3.3 tok/s         0.50x
INT8      44.9 ms        29.5 tok/s         0.25x
INT4     125.5 ms        35.8 tok/s         0.21x

FP16 is slower than FP32 on CPU because most CPUs lack native FP16 compute and must convert precision on the fly. Use the FP16 variant on GPU only.
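
The speedup and compression figures quoted in the variant guide can be rederived from the tables above (numbers copied directly from the benchmark and variants tables):

```typescript
// Decode throughput (tok/s) and on-disk size (MB) from the tables above.
const fp32 = { decode: 8.5, sizeMB: 4401 };
const int8 = { decode: 29.5, sizeMB: 1102 };
const int4 = { decode: 35.8, sizeMB: 910 };

console.log((int8.decode / fp32.decode).toFixed(1)); // 3.5 -> the "3.5x faster decode" claim
console.log((fp32.sizeMB / int8.sizeMB).toFixed(1)); // 4.0 -> "4x smaller"
console.log((int4.decode / fp32.decode).toFixed(1)); // 4.2 -> "~4.2x faster decode"
console.log((fp32.sizeMB / int4.sizeMB).toFixed(1)); // 4.8 -> "4.8x smaller"
```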

Accuracy vs FP32

Variant  Cosine similarity  Top-1 match  Decode divergence
FP16     1.000000           Yes           0%
INT8     0.945775           Yes          80%
INT4     0.982727           Yes          97%

FP16 is near-lossless. INT8 and INT4 diverge from FP32 under greedy decoding because small per-step logit errors amplify autoregressively; with temperature sampling the quality difference is usually acceptable.
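
As a reminder of the metric in the table, a minimal cosine-similarity sketch (which tensors were compared between variants is not specified here):

```typescript
// Cosine similarity of two equal-length vectors: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1 (identical outputs)
console.log(cosineSimilarity([1, 0], [0, 1])); // 0 (orthogonal outputs)
```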

File Structure

onnx/
  model.onnx              # FP32
  model.onnx_data
  model_fp16.onnx         # FP16
  model_fp16.onnx.data
  model_int8.onnx         # INT8 (self-contained)
  model_int4.onnx         # INT4
  model_int4.onnx.data
config.json
generation_config.json
tokenizer.json
tokenizer.model
tokenizer_config.json
special_tokens_map.json
added_tokens.json
chat_template.jinja
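
Which weight files a given dtype pulls in can be read off the tree above. A hypothetical lookup (paths copied from the listing; the helper itself is illustrative):

```typescript
// Weight files needed per dtype, per the File Structure listing above.
const VARIANT_FILES: Record<string, string[]> = {
  fp32: ["onnx/model.onnx", "onnx/model.onnx_data"],
  fp16: ["onnx/model_fp16.onnx", "onnx/model_fp16.onnx.data"],
  int8: ["onnx/model_int8.onnx"], // self-contained, no external data file
  int4: ["onnx/model_int4.onnx", "onnx/model_int4.onnx.data"],
};

console.log(VARIANT_FILES["int8"].length); // 1 (single file)
console.log(VARIANT_FILES["fp32"].length); // 2 (graph + external data)
```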

Usage

With @browser-ai/transformers-js (Vercel AI SDK)

Use @browser-ai/transformers-js to run this model in the browser or Node.js with the Vercel AI SDK:

npm i @browser-ai/transformers-js @huggingface/transformers ai

import { streamText } from "ai";
import { transformersJS } from "@browser-ai/transformers-js";

const result = streamText({
  model: transformersJS("varsan-g/danskgpt-tiny-chat-onnx", {
    dtype: "int8",   // recommended for browser/CPU (~1.1 GB)
  }),
  messages: [
    { role: "user", content: "Hvad er Danmark?" },
  ],
  temperature: 0.7,
  maxTokens: 200,
});

for await (const chunk of result.textStream) {
  process.stdout.write(chunk);
}

With download progress (browser)

const model = transformersJS("varsan-g/danskgpt-tiny-chat-onnx", {
  dtype: "int8",
  initProgressCallback: (progress) => {
    console.log(`Loading: ${Math.round(progress * 100)}%`);
  },
});

// Pre-load before generation
await model.createSessionWithProgress((progress) => {
  console.log(`Download: ${Math.round(progress * 100)}%`);
});

Web Worker mode (keeps UI responsive)

// worker.ts
import { TransformersJSWorkerHandler } from "@browser-ai/transformers-js";
const handler = new TransformersJSWorkerHandler();
self.onmessage = (msg: MessageEvent) => handler.onmessage(msg);

// main.ts
const model = transformersJS("varsan-g/danskgpt-tiny-chat-onnx", {
  dtype: "int8",
  worker: new Worker(new URL("./worker.ts", import.meta.url), { type: "module" }),
});

With @huggingface/transformers (direct)

Use @huggingface/transformers directly for lower-level control:

npm i @huggingface/transformers

import { pipeline } from "@huggingface/transformers";

const generator = await pipeline(
  "text-generation",
  "varsan-g/danskgpt-tiny-chat-onnx",
  { dtype: "int8" }
);

const messages = [
  { role: "user", content: "Fortæl mig om dansk historie." },
];

const result = await generator(messages, {
  max_new_tokens: 200,
  temperature: 0.7,
  do_sample: true,
});

console.log(result[0].generated_text.at(-1).content);

Streaming with TextStreamer

import { AutoTokenizer, AutoModelForCausalLM, TextStreamer } from "@huggingface/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("varsan-g/danskgpt-tiny-chat-onnx");
const model = await AutoModelForCausalLM.from_pretrained("varsan-g/danskgpt-tiny-chat-onnx", {
  dtype: "int8",
});

const messages = [{ role: "user", content: "Hvad er Danmark?" }];
const inputs = tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true,
  return_dict: true,
});

const streamer = new TextStreamer(tokenizer, {
  skip_prompt: true,
  skip_special_tokens: true,
  callback_function: (text) => process.stdout.write(text),
});

await model.generate({ ...inputs, max_new_tokens: 200, do_sample: true, temperature: 0.7, streamer });

Note: This is a 1B-parameter model. The INT8 variant is ~1.1 GB, so the first load takes a while to download (it is cached in the browser afterward). For faster browser experiences, consider smaller models from the onnx-community.

With Optimum (Python)

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model = ORTModelForCausalLM.from_pretrained("varsan-g/danskgpt-tiny-chat-onnx")
tokenizer = AutoTokenizer.from_pretrained("varsan-g/danskgpt-tiny-chat-onnx", use_fast=False)

prompt = "<|im_start|>user\nHvad er Danmark?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))

Tokenizer note (Python): This model uses a SentencePiece .model file. You must pass use_fast=False to AutoTokenizer.from_pretrained() in Python, otherwise it may fail trying to parse the tokenizer as tiktoken. The JavaScript libraries use tokenizer.json and work without this workaround.

Changes from Original

This repository contains a format conversion of mhenrichsen/danskgpt-tiny-chat by mhenrichsen. The model weights have been converted from PyTorch (safetensors) to ONNX format, and quantized variants (FP16, INT8, INT4) have been produced. No fine-tuning or architectural modifications were made.

License

This model is distributed under the Apache License 2.0, the same license as the original model. See the LICENSE file in this repository for the full license text.

Original model by mhenrichsen: mhenrichsen/danskgpt-tiny-chat
