---
license: cc-by-sa-4.0
language:
- la
library_name: onnx
pipeline_tag: text-generation
tags:
- latin
- gpt
- from-scratch
- onnx
- classical-latin
---

# Cicero LLM

A ~111M-parameter Latin language model, **trained from scratch**: no pretrained
backbone, no English/Greek base. It generates Classical Latin in the browser
or anywhere ONNX runs.

Live demo (browser inference): https://cicerollm.com

## Model

- Decoder-only transformer, ~111M params (12 layers × 12 heads × 768 dim,
  2048-token block size, learned absolute positions, tied embeddings)
- 32K SentencePiece-BPE tokenizer trained on the same Latin corpus
- Trained from random init on a ~466M-token Latin corpus (30,000 steps,
  dropout 0.15), then **continued-pretrained on a targeted classical-grammar
  curriculum** (synthetic Cicero-register prose, generated and quality-filtered
  by a stronger model) mixed 30/70 with clean classical replay for 3,000 steps.
  The curriculum step pushes generation toward classical register and reduces
  the medieval/neo-Latin contamination and repetition of the base model.
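
The ~111M figure can be sanity-checked from the dimensions listed above. The sketch below assumes a GPT-2-style block (4x MLP expansion, biases on every linear layer, two LayerNorms per block plus a final one) and a vocabulary of exactly 32,000 entries. None of these details are confirmed by this card, and `gpt_param_count` is a hypothetical helper, not part of the released files.

```python
def gpt_param_count(n_layer=12, d_model=768, block_size=2048, vocab_size=32_000):
    """Rough parameter count for a GPT-2-style decoder with tied embeddings."""
    tok_emb = vocab_size * d_model                # token embeddings, shared with the output head
    pos_emb = block_size * d_model                # learned absolute positions
    attn = 4 * d_model * d_model + 4 * d_model    # QKV + output projection, with biases
    mlp = 8 * d_model * d_model + 5 * d_model     # d -> 4d -> d, with biases (assumed 4x width)
    norms = 2 * 2 * d_model                       # two LayerNorms (scale + shift) per block
    final_norm = 2 * d_model
    return tok_emb + pos_emb + n_layer * (attn + mlp + norms) + final_norm


total = gpt_param_count()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the count lands at roughly 111M; a slightly different vocabulary size or bias-free layers would shift it by well under 1M.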

## Evaluation

Cloze accuracy on 4-option multiple-choice items (chance is 0.25). The held-out
"blind" pack is the honest number for cross-model comparison:

| pack | accuracy |
|---|---|
| held-out blind (144 items) | 0.72 |
| literary diagnostic | 0.82 |
| grammar-probe / weakness (60 items) | 0.82 |
| in-distribution textbook | 0.77 |

Held-out bits per character: 1.56 (lower is better).
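
Two small helpers may help when reading these numbers. The first gives the binomial standard error of an accuracy measured on a small pack; the second shows the usual conversion from summed negative log-likelihood (in nats) to bits per character. Both are illustrative sketches: the card does not describe how its scores were computed, and the function names are hypothetical.

```python
import math


def cloze_standard_error(accuracy, n_items):
    """Binomial standard error of an accuracy estimated from n_items independent items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)


def bits_per_char(total_nll_nats, n_chars):
    """Convert a summed negative log-likelihood in nats to bits per character."""
    return total_nll_nats / (math.log(2) * n_chars)


# Sampling uncertainty for the two packs whose sizes are reported.
for name, acc, n in [("held-out blind", 0.72, 144), ("grammar-probe", 0.82, 60)]:
    print(f"{name}: {acc:.2f} +/- {cloze_standard_error(acc, n):.3f}")
```

With only 144 and 60 items, the standard errors are about 0.04 and 0.05, so differences of a few points between packs or models are within noise.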

## Files

- `model.int8.onnx`: int8-quantized ONNX (~136 MB; used by the browser demo)
- `model.onnx`: fp32 ONNX (~543 MB)
- `checkpoint_step_033000.pt`: raw PyTorch weights + optimizer state (~1.3 GB)
- `tokenizer.json`, `tokenizer.model`, `tokenizer_config.json`: SentencePiece 32K
- `config.json`: architecture metadata

## Usage (ONNX Runtime)

```python
import numpy as np
import onnxruntime as ort
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
sess = ort.InferenceSession("model.int8.onnx")
ids = sp.encode("Gallia est omnis divisa", out_type=int)

# The forward pass returns next-token logits at the last position.
# To generate, sample a token from them, append it to `ids`, and repeat.
logits = sess.run(None, {"input_ids": np.array([ids], dtype=np.int64)})[0]
```
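
The snippet above stops at a single forward pass. A minimal generation loop, assuming the temperature + top-k sampling mentioned under Limitations, might look like the sketch below. `sample_top_k`, `generate`, and the defaults `k=40` and `temperature=0.8` are illustrative choices, not the demo's actual settings. The loop also assumes the graph accepts variable-length input, and it has no end-of-text handling.

```python
import numpy as np

BLOCK_SIZE = 2048  # context length stated in the Model section


def sample_top_k(logits, k=40, temperature=0.8, rng=None):
    """Sample one token id from a 1-D logits vector using temperature and top-k."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
    k = min(k, scaled.shape[-1])
    top = np.argpartition(scaled, -k)[-k:]            # indices of the k largest logits
    probs = np.exp(scaled[top] - scaled[top].max())   # numerically stable softmax over the top k
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))


def generate(sess, sp, prompt, max_new_tokens=60, **sampling):
    """Autoregressive loop; `sess` is an ONNX Runtime session, `sp` a SentencePiece processor."""
    ids = sp.encode(prompt, out_type=int)
    for _ in range(max_new_tokens):
        window = np.array([ids[-BLOCK_SIZE:]], dtype=np.int64)  # keep within the context window
        logits = sess.run(None, {"input_ids": window})[0]
        # Handles either a (1, vocab) or a (1, seq, vocab) output: take the last row.
        last = logits.reshape(-1, logits.shape[-1])[-1]
        ids.append(sample_top_k(last, **sampling))
    return sp.decode(ids)
```

With `sess` and `sp` created as in the snippet above, `generate(sess, sp, "Gallia est omnis divisa")` returns the prompt followed by the sampled continuation as one string. Lower `temperature` or smaller `k` makes the output more conservative.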

## Limitations

Research artifact. Autoregressive completion with temperature + top-k sampling;
no instruction tuning, no chat behavior. Give it Latin and it continues in
Latin. Best results in classical (Caesarian / Ciceronian) register.

## License

CC-BY-SA-4.0. The underlying ancient texts are public domain by age; the
share-alike condition derives from corpus components (e.g. Perseus digital
editions). Attribution + share-alike apply to redistribution.