---
license: cc-by-sa-4.0
language:
- la
library_name: onnx
pipeline_tag: text-generation
tags:
- latin
- gpt
- from-scratch
- onnx
- classical-latin
---

# Cicero LLM

A ~111M-parameter Latin language model, **trained from scratch**: no pretrained backbone, no English/Greek base. It generates Classical Latin in the browser or anywhere ONNX runs.

Live demo (browser inference): https://cicerollm.com

## Model

- Decoder-only transformer, ~111M params (12 layers × 12 heads × 768 dim, 2048-token block size, learned absolute positions, tied embeddings)
- 32K SentencePiece-BPE tokenizer trained on the same Latin corpus
- Trained from random init on a ~466M-token Latin corpus (30,000 steps, dropout 0.15), then given 3,000 steps of **continued pretraining on a targeted classical-grammar curriculum** (synthetic Cicero-register prose, generated and quality-filtered by a stronger model) mixed 30/70 with clean classical replay. The curriculum stage pushes generation toward classical register and reduces the base model's medieval/neo-Latin contamination and repetition.

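As a sanity check on the ~111M figure, the parameter count can be estimated from the dimensions listed above. This is a minimal sketch, not the training code: it assumes a GPT-2-style block (4× MLP expansion, bias terms on linear layers and LayerNorms) and a vocabulary of exactly 32,000 entries, none of which the card states explicitly. The function name `gpt_param_count` is hypothetical.

```python
def gpt_param_count(n_layer, d_model, vocab_size, block_size,
                    mlp_ratio=4, bias=True, tied=True):
    """Count parameters of a GPT-2-style decoder (assumed layout, not confirmed)."""
    b = 1 if bias else 0
    tok_emb = vocab_size * d_model                    # token embedding table
    pos_emb = block_size * d_model                    # learned absolute positions
    attn = 4 * d_model * d_model + b * 4 * d_model    # Q, K, V and output projections
    d_ff = mlp_ratio * d_model                        # assumed 4x MLP expansion
    mlp = 2 * d_model * d_ff + b * (d_ff + d_model)   # up- and down-projection
    norms = 2 * (1 + b) * d_model                     # two LayerNorms per block
    final_norm = (1 + b) * d_model
    lm_head = 0 if tied else vocab_size * d_model     # tied head reuses tok_emb
    return tok_emb + pos_emb + n_layer * (attn + mlp + norms) + final_norm + lm_head


total = gpt_param_count(n_layer=12, d_model=768, vocab_size=32_000, block_size=2048)
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the estimate comes out at about 111.2M, in line with the stated size. The number of heads does not enter the count, since the heads partition the same 768-dimensional projections.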
## Evaluation

Cloze accuracy (4-option multiple choice, so chance is 0.25). The held-out "blind" pack is the number to use for cross-model comparison.

| pack | accuracy |
|---|---|
| held-out blind (144 items) | 0.72 |
| literary diagnostic | 0.82 |
| grammar-probe / weakness (60 items) | 0.82 |
| in-distribution textbook | 0.77 |

Bits-per-char on held-out text: 1.56.

## Files

- `model.int8.onnx`: int8-quantized ONNX (~136 MB; used by the browser demo)
- `model.onnx`: fp32 ONNX (~543 MB)
- `checkpoint_step_033000.pt`: raw PyTorch weights + optimizer state (~1.3 GB)
- `tokenizer.json`, `tokenizer.model`, `tokenizer_config.json`: SentencePiece 32K
- `config.json`: architecture metadata

## Usage (ONNX Runtime)

```python
import numpy as np
import onnxruntime as ort
import sentencepiece as spm

# Load the tokenizer and the int8 model shipped in this repo
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
sess = ort.InferenceSession("model.int8.onnx")

# Encode a Latin prompt to token ids
ids = sp.encode("Gallia est omnis divisa", out_type=int)

# forward returns next-token logits at the last position; sample autoregressively
logits = sess.run(None, {"input_ids": np.array([ids], dtype=np.int64)})[0]
```

## Limitations

This is a research artifact. It performs autoregressive completion with temperature + top-k sampling; there is no instruction tuning and no chat behavior. Give it Latin and it continues in Latin. Results are best in classical (Caesarian / Ciceronian) register.

## License

CC-BY-SA-4.0. The underlying ancient texts are public domain by age; the share-alike condition derives from corpus components (e.g. Perseus digital editions). Attribution + share-alike apply to redistribution.
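
## Sampling sketch

The temperature + top-k sampling mentioned under Limitations can be sketched with NumPy alone. This is an illustration under stated assumptions, not the demo's implementation: the helper names `sample_top_k` and `generate`, the defaults `temperature=0.8` and `k=40`, and the `next_logits` callable (anything that maps a list of token ids to a 1-D logits vector) are all hypothetical.

```python
import numpy as np


def sample_top_k(logits, temperature=0.8, k=40, rng=None):
    """Draw one token id from a 1-D logits vector with temperature + top-k."""
    if rng is None:
        rng = np.random.default_rng()
    # Temperature rescales logits: <1 sharpens, >1 flattens the distribution
    scaled = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
    k = min(k, scaled.size)
    top = np.argpartition(scaled, -k)[-k:]   # indices of the k largest logits
    z = scaled[top] - scaled[top].max()      # shift for a numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(top, p=probs))


def generate(next_logits, ids, max_new_tokens=50, block_size=2048, **sampling):
    """Append sampled tokens one at a time, feeding back at most block_size ids."""
    ids = list(ids)
    for _ in range(max_new_tokens):
        logits = next_logits(ids[-block_size:])  # hypothetical model wrapper
        ids.append(sample_top_k(logits, **sampling))
    return ids
```

With the ONNX session from the Usage section, `next_logits` could wrap `sess.run` and flatten its first output to 1-D before returning it; the exact output shape is an assumption to check against the exported model. The generated ids can then be turned back into text with `sp.decode`.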