Cicero LLM

A 100M-parameter Latin language model, trained from scratch: no pretrained backbone, no English/Greek base. It generates Classical Latin in the browser or anywhere ONNX runs.

Live demo (browser inference): https://cicerollm.com

Model

  • Decoder-only transformer, ~111M params (12 layers × 12 heads × 768 dim, 2048 block size, learned absolute positions, tied embeddings)
  • 32K SentencePiece-BPE tokenizer trained on the same Latin corpus
  • Trained from random init on a ~466M-token Latin corpus (30,000 steps, dropout 0.15), then further pretrained for 3,000 steps on a targeted classical-grammar curriculum (synthetic Cicero-register prose, generated and quality-filtered by a stronger model) mixed 30/70 with clean classical replay. The curriculum stage pushes generation toward classical register and reduces the base model's medieval/neo-Latin contamination and repetition.
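
As a rough sanity check on the ~111M figure, the parameter count can be reproduced from the architecture listed above. This is a minimal sketch under assumptions the card does not state: a GPT-2-style block (4x MLP expansion, biases on all linear layers, two LayerNorms per block plus a final one) and a vocabulary of exactly 32,000 tokens. The function name is hypothetical.

```python
def gpt2_style_param_count(n_layer=12, d_model=768, vocab=32_000,
                           block=2048, mlp_ratio=4):
    """Count parameters of a GPT-2-style decoder with tied embeddings.

    Assumed layout (not confirmed by the model card): biased linear
    layers, 4x MLP expansion, LayerNorm with weight and bias.
    """
    tok_emb = vocab * d_model            # shared with the output head (tied)
    pos_emb = block * d_model            # learned absolute positions
    # Q, K, V and output projections, each d_model x d_model plus a bias
    attn = 4 * d_model * d_model + 4 * d_model
    d_ff = mlp_ratio * d_model
    # up- and down-projection weights plus their biases
    mlp = 2 * d_model * d_ff + d_ff + d_model
    norms = 2 * 2 * d_model              # two LayerNorms per block
    final_norm = 2 * d_model
    return tok_emb + pos_emb + n_layer * (attn + mlp + norms) + final_norm


total = gpt2_style_param_count()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")
```

Under these assumptions the count comes out at about 111.2M, in line with the figure above. A 32,768-entry vocabulary would add roughly 0.6M.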

Evaluation

Cloze accuracy (4-option multiple choice; held-out "blind" pack is the honest cross-model number):

pack                                   accuracy
held-out blind (144 items)             0.72
literary diagnostic                    0.82
grammar-probe / weakness (60 items)    0.82
in-distribution textbook               0.77

bits-per-char (held-out): 1.56
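
For readers interpreting these numbers, the sketch below shows two standard calculations: converting a summed negative log-likelihood (in nats) to bits per character, and the binomial standard error of an accuracy measured on a small pack. The card does not describe its evaluation code, so the function names and the assumption that loss is summed in nats are illustrative only.

```python
import math


def bits_per_char(total_nll_nats: float, n_chars: int) -> float:
    """Convert a summed negative log-likelihood in nats to bits per character."""
    return total_nll_nats / (math.log(2) * n_chars)


def accuracy_stderr(accuracy: float, n_items: int) -> float:
    """Binomial standard error of an accuracy estimated from n_items questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)


# Held-out blind pack: 0.72 on 144 four-option items (chance level is 0.25).
se = accuracy_stderr(0.72, 144)
print(f"standard error ~ {se:.3f}, rough 95% interval ~ +/- {1.96 * se:.3f}")
```

With 144 items the standard error is roughly 0.04, so differences of a few points between models on the blind pack are within noise.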

Files

  • model.int8.onnx: int8-quantized ONNX (~136 MB; used by the browser demo)
  • model.onnx: fp32 ONNX (~543 MB)
  • checkpoint_step_033000.pt: raw PyTorch weights + optimizer state (~1.3 GB)
  • tokenizer.json, tokenizer.model, tokenizer_config.json: SentencePiece 32K
  • config.json: architecture metadata
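
The file sizes can be checked against the parameter count with simple arithmetic. The sketch below is only a consistency check, not a description of how the files were produced. It assumes a GPT-2-style count of about 111.2M parameters with a 32,000-token vocabulary, that the ONNX export stores the tied embedding matrix twice, and that the checkpoint holds fp32 weights plus two Adam-style moment buffers. None of this is stated in the card.

```python
# All constants below are assumptions for illustration, not values from the card.
PARAMS_TIED = 111_204_864      # assumed GPT-2-style count, embedding counted once
EMBEDDING = 32_000 * 768       # assumed size of the token embedding matrix
MB = 1_000_000                 # decimal megabytes


def size_mb(n_params: int, bytes_per_param: int) -> float:
    """Approximate on-disk size in MB for n_params stored at a fixed width."""
    return n_params * bytes_per_param / MB


# If the export duplicates the tied embedding as a separate output head:
fp32_onnx = size_mb(PARAMS_TIED + EMBEDDING, 4)
int8_onnx = size_mb(PARAMS_TIED + EMBEDDING, 1)
# fp32 weights plus two same-sized optimizer buffers (Adam-style, assumed):
checkpoint = 3 * size_mb(PARAMS_TIED, 4)

print(f"fp32 ONNX  ~ {fp32_onnx:.0f} MB")
print(f"int8 ONNX  ~ {int8_onnx:.0f} MB")
print(f"checkpoint ~ {checkpoint / 1000:.1f} GB")
```

These estimates land close to the listed sizes. In practice, int8 exports keep some tensors in fp32 and add quantization scales, so an exact match should not be expected.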

Usage (ONNX Runtime)

import numpy as np
import onnxruntime as ort
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")  # 32K SentencePiece tokenizer
sess = ort.InferenceSession("model.int8.onnx")  # int8 model used by the browser demo
ids = sp.encode("Gallia est omnis divisa", out_type=int)  # prompt -> token ids
# forward returns next-token logits at the last position; sample autoregressively
input_ids = np.array([ids], dtype=np.int64)  # add a batch dimension
logits = sess.run(None, {"input_ids": input_ids})[0]
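
The snippet above stops at a single forward pass. A minimal sketch of the autoregressive loop with temperature and top-k sampling might look like the following. The helper names, the default temperature of 0.8 and k of 40, and the `next_logits` callable are illustrative assumptions rather than settings from this repository. The reshape accepts either a `(batch, vocab)` or a `(batch, seq, vocab)` output, since the exact output shape is not documented here.

```python
import numpy as np


def sample_top_k(logits, temperature=0.8, k=40, rng=None):
    """Sample one token id from 1-D logits using temperature and top-k."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
    k = min(k, scaled.shape[-1])
    top = np.argpartition(scaled, -k)[-k:]   # indices of the k largest logits
    z = scaled[top] - scaled[top].max()      # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()      # softmax over the kept logits only
    return int(rng.choice(top, p=probs))


def generate(next_logits, ids, max_new_tokens=50, block_size=2048, **sampling):
    """Extend ids token by token; next_logits maps a list of ids to logits."""
    ids = list(ids)
    for _ in range(max_new_tokens):
        window = ids[-block_size:]           # stay within the context length
        out = np.asarray(next_logits(window))
        last = out.reshape(-1, out.shape[-1])[-1]  # logits at the last position
        ids.append(sample_top_k(last, **sampling))
    return ids
```

With the session and tokenizer from the usage snippet, `next_logits` could be `lambda w: sess.run(None, {"input_ids": np.array([w], dtype=np.int64)})[0]`, and the result decoded with `sp.decode(generate(next_logits, ids))`. Lower temperature or smaller k makes the output more conservative.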

Limitations

Research artifact. Autoregressive completion with temperature + top-k sampling; no instruction tuning, no chat behavior. Give it Latin and it continues in Latin. Best results in classical (Caesarian / Ciceronian) register.

License

CC-BY-SA-4.0. The underlying ancient texts are public domain by age; the share-alike condition derives from corpus components (e.g. Perseus digital editions). Attribution + share-alike apply to redistribution.
