DanskGPT Tiny Chat ONNX

ONNX export of mhenrichsen/danskgpt-tiny-chat for use with ONNX Runtime, transformers.js, and @browser-ai/transformers-js.

Model Details

Property         Value
Architecture     LlamaForCausalLM
Parameters       1B
Hidden size      2048
Layers           22
Attention heads  32 (4 KV)
Vocabulary       32,002 tokens
Language         Danish
Task             Text Generation (Chat)
Chat template    ChatML (<|im_start|>user / <|im_end|>)
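
The ChatML template wraps each turn in <|im_start|> / <|im_end|> markers and ends with an assistant header that cues the model to reply. A minimal sketch of the resulting prompt string (the helper is illustrative, not part of any library; the JavaScript examples below get the same result via apply_chat_template):

```typescript
// Illustrative ChatML prompt builder; role/marker strings match this model's template.
type Message = { role: "system" | "user" | "assistant"; content: string };

function buildChatMLPrompt(messages: Message[]): string {
  const turns = messages
    .map((m) => `<|im_start|>${m.role}\n${m.content}<|im_end|>`)
    .join("\n");
  // Trailing assistant header cues the model to generate its reply.
  return `${turns}\n<|im_start|>assistant\n`;
}

console.log(buildChatMLPrompt([{ role: "user", content: "Hvad er Danmark?" }]));
// <|im_start|>user
// Hvad er Danmark?<|im_end|>
// <|im_start|>assistant
```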

Available Variants

Variant  Files                         Size     dtype   Description
FP32     model.onnx + .onnx_data       4401 MB  "fp32"  Original precision
FP16     model_fp16.onnx + .onnx.data  2201 MB  "fp16"  Half precision (GPU)
INT8     model_int8.onnx               1102 MB  "int8"  Dynamic INT8 quantization
INT4     model_int4.onnx + .onnx.data   910 MB          INT4 weight-only quantization

Which variant should I use?

  • CPU deployment: INT8. 3.5x faster decode, 4x smaller, self-contained single file.
  • GPU deployment: FP16. Near-lossless quality with native FP16 compute.
  • Max compression: INT4. ~4.2x faster decode, 4.8x smaller.
  • Browser / transformers.js: INT8 recommended. Self-contained (~1.1 GB download, cached after first load).
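
The recommendations above can be encoded as a small lookup. The target names here are invented for illustration; the returned strings match the dtype values from the variants table:

```typescript
// Hypothetical helper mapping a deployment target to the recommended dtype.
type Target = "cpu" | "gpu" | "browser" | "max-compression";

function recommendedDtype(target: Target): "fp16" | "int8" | "int4" {
  switch (target) {
    case "gpu":
      return "fp16"; // near-lossless, native FP16 compute
    case "max-compression":
      return "int4"; // smallest download, fastest decode
    default:
      return "int8"; // cpu and browser: self-contained single file
  }
}

console.log(recommendedDtype("browser")); // "int8"
```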

Benchmarks (CPU, 20 prompt tokens, 50 decode steps)

Variant  Prompt latency  Decode throughput  Size vs FP32
FP32     137.0 ms         8.5 tok/s         1.0x
FP16     403.4 ms         3.3 tok/s         0.50x
INT8      44.9 ms        29.5 tok/s         0.25x
INT4     125.5 ms        35.8 tok/s         0.21x

FP16 is slower than FP32 on CPU because most CPUs lack native FP16 compute and must convert precision on the fly. Use the FP16 variant on GPU only.
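
The speedup and compression figures quoted in the variant guide can be rederived from the tables above (numbers copied directly from the benchmark and variants tables):

```typescript
// Decode throughput (tok/s) and on-disk size (MB) from the tables above.
const fp32 = { decode: 8.5, sizeMB: 4401 };
const int8 = { decode: 29.5, sizeMB: 1102 };
const int4 = { decode: 35.8, sizeMB: 910 };

console.log((int8.decode / fp32.decode).toFixed(1)); // 3.5 -> the "3.5x faster decode" claim
console.log((fp32.sizeMB / int8.sizeMB).toFixed(1)); // 4.0 -> "4x smaller"
console.log((int4.decode / fp32.decode).toFixed(1)); // 4.2 -> "~4.2x faster decode"
console.log((fp32.sizeMB / int4.sizeMB).toFixed(1)); // 4.8 -> "4.8x smaller"
```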

Accuracy vs FP32

Variant  Cosine similarity  Top-1 match  Decode divergence
FP16     1.000000           Yes           0%
INT8     0.945775           Yes          80%
INT4     0.982727           Yes          97%

FP16 is near-lossless. INT8 and INT4 diverge from FP32 under greedy decoding because small per-step logit errors amplify autoregressively; with temperature sampling the quality difference is usually acceptable.
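
As a reminder of the metric in the table, a minimal cosine-similarity sketch (which tensors were compared between variants is not specified here):

```typescript
// Cosine similarity of two equal-length vectors: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1 (identical outputs)
console.log(cosineSimilarity([1, 0], [0, 1])); // 0 (orthogonal outputs)
```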

File Structure

onnx/
  model.onnx              # FP32
  model.onnx_data
  model_fp16.onnx         # FP16
  model_fp16.onnx.data
  model_int8.onnx         # INT8 (self-contained)
  model_int4.onnx         # INT4
  model_int4.onnx.data
config.json
generation_config.json
tokenizer.json
tokenizer.model
tokenizer_config.json
special_tokens_map.json
added_tokens.json
chat_template.jinja
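
Which weight files a given dtype pulls in can be read off the tree above. A hypothetical lookup (paths copied from the listing; the helper itself is illustrative):

```typescript
// Weight files needed per dtype, per the File Structure listing above.
const VARIANT_FILES: Record<string, string[]> = {
  fp32: ["onnx/model.onnx", "onnx/model.onnx_data"],
  fp16: ["onnx/model_fp16.onnx", "onnx/model_fp16.onnx.data"],
  int8: ["onnx/model_int8.onnx"], // self-contained, no external data file
  int4: ["onnx/model_int4.onnx", "onnx/model_int4.onnx.data"],
};

console.log(VARIANT_FILES["int8"].length); // 1 (single file)
console.log(VARIANT_FILES["fp32"].length); // 2 (graph + external data)
```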

Usage

With @browser-ai/transformers-js (Vercel AI SDK)

Use @browser-ai/transformers-js to run this model in the browser or Node.js with the Vercel AI SDK:

npm i @browser-ai/transformers-js @huggingface/transformers ai

import { streamText } from "ai";
import { transformersJS } from "@browser-ai/transformers-js";

const result = streamText({
  model: transformersJS("varsan-g/danskgpt-tiny-chat-onnx", {
    dtype: "int8",   // recommended for browser/CPU (~1.1 GB)
  }),
  messages: [
    { role: "user", content: "Hvad er Danmark?" },
  ],
  temperature: 0.7,
  maxTokens: 200,
});

for await (const chunk of result.textStream) {
  process.stdout.write(chunk);
}

With download progress (browser)

const model = transformersJS("varsan-g/danskgpt-tiny-chat-onnx", {
  dtype: "int8",
  initProgressCallback: (progress) => {
    console.log(`Loading: ${Math.round(progress * 100)}%`);
  },
});

// Pre-load before generation
await model.createSessionWithProgress((progress) => {
  console.log(`Download: ${Math.round(progress * 100)}%`);
});

Web Worker mode (keeps UI responsive)

// worker.ts
import { TransformersJSWorkerHandler } from "@browser-ai/transformers-js";
const handler = new TransformersJSWorkerHandler();
self.onmessage = (msg: MessageEvent) => handler.onmessage(msg);

// main.ts
const model = transformersJS("varsan-g/danskgpt-tiny-chat-onnx", {
  dtype: "int8",
  worker: new Worker(new URL("./worker.ts", import.meta.url), { type: "module" }),
});

With @huggingface/transformers (direct)

Use @huggingface/transformers directly for lower-level control:

npm i @huggingface/transformers

import { pipeline } from "@huggingface/transformers";

const generator = await pipeline(
  "text-generation",
  "varsan-g/danskgpt-tiny-chat-onnx",
  { dtype: "int8" }
);

const messages = [
  { role: "user", content: "Fortæl mig om dansk historie." },
];

const result = await generator(messages, {
  max_new_tokens: 200,
  temperature: 0.7,
  do_sample: true,
});

console.log(result[0].generated_text.at(-1).content);

Streaming with TextStreamer

import { AutoTokenizer, AutoModelForCausalLM, TextStreamer } from "@huggingface/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("varsan-g/danskgpt-tiny-chat-onnx");
const model = await AutoModelForCausalLM.from_pretrained("varsan-g/danskgpt-tiny-chat-onnx", {
  dtype: "int8",
});

const messages = [{ role: "user", content: "Hvad er Danmark?" }];
const inputs = tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true,
  return_dict: true,
});

const streamer = new TextStreamer(tokenizer, {
  skip_prompt: true,
  skip_special_tokens: true,
  callback_function: (text) => process.stdout.write(text),
});

await model.generate({ ...inputs, max_new_tokens: 200, do_sample: true, temperature: 0.7, streamer });

Note: This is a 1B-parameter model. The INT8 variant is ~1.1 GB, so the first load takes a while to download (it is cached in the browser afterward). For faster browser experiences, consider smaller models from the onnx-community.

With Optimum (Python)

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model = ORTModelForCausalLM.from_pretrained("varsan-g/danskgpt-tiny-chat-onnx")
tokenizer = AutoTokenizer.from_pretrained("varsan-g/danskgpt-tiny-chat-onnx", use_fast=False)

prompt = "<|im_start|>user\nHvad er Danmark?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))

Tokenizer note (Python): This model uses a SentencePiece .model file. You must pass use_fast=False to AutoTokenizer.from_pretrained() in Python, otherwise it may fail trying to parse the tokenizer as tiktoken. The JavaScript libraries use tokenizer.json and work without this workaround.

Changes from Original

This repository contains a format conversion of mhenrichsen/danskgpt-tiny-chat by mhenrichsen. The model weights have been converted from PyTorch (safetensors) to ONNX format, and quantized variants (FP16, INT8, INT4) have been produced. No fine-tuning or architectural modifications were made.

License

This model is distributed under the Apache License 2.0, the same license as the original model. See the LICENSE file in this repository for the full license text.

Original model by mhenrichsen: mhenrichsen/danskgpt-tiny-chat
