---
license: cc-by-sa-4.0
language:
- la
library_name: onnx
pipeline_tag: text-generation
tags:
- latin
- gpt
- from-scratch
- onnx
- classical-latin
---

# Cicero LLM

A ~111M-parameter Latin language model, **trained from scratch**: no pretrained
backbone and no English or Greek base model. It generates Classical Latin in the
browser or anywhere ONNX runs.

Live demo (browser inference): https://cicerollm.com

## Model

- Decoder-only transformer, ~111M params (12 layers × 12 heads × 768 dim,
  2048 block size, learned absolute positions, tied embeddings)
- 32K SentencePiece-BPE tokenizer trained on the same Latin corpus
- Trained from random init on a ~466M-token Latin corpus (30,000 steps,
  dropout 0.15), then **continued-pretrained on a targeted classical-grammar
  curriculum** (synthetic Cicero-register prose, generated and quality-filtered
  by a stronger model) mixed 30/70 with clean classical replay for 3,000 steps.
  The curriculum step pushes generation toward classical register and cuts the
  medieval/neo-Latin contamination and repetition of the base model.
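
As a rough check on the ~111M figure, the parameter count can be estimated from the shape listed above. This is a sketch under assumptions that this card does not state: a GPT-2-style block (4x MLP expansion, biases on every linear layer, two LayerNorms per block plus a final one) and a vocabulary of exactly 32,000 entries. The function name and its defaults are illustrative only.

```python
def estimate_params(n_layer=12, d_model=768, block_size=2048,
                    vocab_size=32_000, mlp_ratio=4):
    """Estimate parameters of a GPT-2-style decoder with tied embeddings.

    Assumed (not confirmed by the model card): 4x MLP, biases, LayerNorm.
    The head count does not change the total, so it is not a parameter here.
    """
    tok_emb = vocab_size * d_model   # shared with the output head (tied)
    pos_emb = block_size * d_model   # learned absolute positions
    # Attention: fused q/k/v projection plus output projection, with biases.
    attn = 4 * d_model * d_model + 4 * d_model
    # MLP: up-projection and down-projection, with biases.
    mlp = 2 * mlp_ratio * d_model * d_model + (mlp_ratio + 1) * d_model
    norms = 2 * 2 * d_model          # two LayerNorms (scale + bias) per block
    final_norm = 2 * d_model
    return tok_emb + pos_emb + n_layer * (attn + mlp + norms) + final_norm


print(f"{estimate_params() / 1e6:.1f}M parameters")  # 111.2M parameters
```

Under these assumptions roughly 85M parameters sit in the transformer blocks and about 25M in the token embedding, which is why tying the embedding to the output head matters at this scale.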

## Evaluation

Cloze accuracy (4-option multiple choice, so chance is 0.25). The held-out
"blind" pack is the fairest number for cross-model comparison:

| pack | accuracy |
|---|---|
| held-out blind (144 items) | 0.72 |
| literary diagnostic | 0.82 |
| grammar-probe / weakness (60 items) | 0.82 |
| in-distribution textbook | 0.77 |

Bits per character on held-out text: 1.56 (lower is better).
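
For readers who want to compute comparable numbers for another model, the two metrics can be sketched as follows. The evaluation harness behind the figures above is not described here, so this is an assumption about a common setup: each cloze option is scored by its log-likelihood and the highest-scoring option is taken as the answer, and bits-per-char divides the total negative log-likelihood (converted from nats to bits) by the character count. `cloze_accuracy`, `bits_per_char`, and `score_fn` are hypothetical names.

```python
import math


def cloze_accuracy(items, score_fn):
    """Fraction of items where the best-scoring option is the correct one.

    items: iterable of (options, correct_index) pairs.
    score_fn: hypothetical callable mapping an option string to a
              log-likelihood under the model (higher is better).
    """
    items = list(items)
    correct = 0
    for options, answer in items:
        scores = [score_fn(option) for option in options]
        correct += int(scores.index(max(scores)) == answer)
    return correct / len(items)


def bits_per_char(token_nlls_nats, text):
    """Convert per-token negative log-likelihoods (nats) to bits per character."""
    total_bits = sum(token_nlls_nats) / math.log(2)
    return total_bits / len(text)
```

Normalizing by characters rather than tokens keeps the number comparable across models with different tokenizers.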

## Files

- `model.int8.onnx` — int8-quantized ONNX (~136 MB; used by the browser demo)
- `model.onnx` — fp32 ONNX (~543 MB)
- `checkpoint_step_033000.pt` — raw PyTorch weights + optimizer state (~1.3 GB)
- `tokenizer.json`, `tokenizer.model`, `tokenizer_config.json` — SentencePiece 32K
- `config.json` — architecture metadata

## Usage (ONNX Runtime)

```python
import numpy as np
import onnxruntime as ort
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
sess = ort.InferenceSession("model.int8.onnx")

# Encode the prompt to a list of token ids.
ids = sp.encode("Gallia est omnis divisa", out_type=int)
# The forward pass returns next-token logits at the last position;
# sample from them and append the token to generate autoregressively.
logits = sess.run(None, {"input_ids": np.array([ids], dtype=np.int64)})[0]
```
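
The autoregressive step mentioned in the comment above can be sketched with NumPy alone. This is a minimal illustration, not the sampler the demo uses: the defaults (`temperature=0.8`, `top_k=40`), the function names, and the `logits_fn` wrapper are assumptions. The output is reshaped so the loop works whether the model returns logits of shape `[1, vocab]` or `[1, seq, vocab]`.

```python
import numpy as np


def sample_next(logits, temperature=0.8, top_k=40, rng=None):
    """Sample one token id from a 1-D logits vector using temperature + top-k."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
    k = min(top_k, scaled.shape[-1])
    # Keep only the k highest logits; all other tokens get zero probability.
    top_idx = np.argpartition(scaled, -k)[-k:]
    top_logits = scaled[top_idx]
    probs = np.exp(top_logits - top_logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(top_idx, p=probs))


def generate(logits_fn, ids, max_new_tokens=50, block_size=2048, **sampling):
    """Extend `ids` one token at a time.

    logits_fn: callable mapping a list of token ids to the model's logits.
    """
    ids = list(ids)
    for _ in range(max_new_tokens):
        context = ids[-block_size:]  # stay within the 2048-token block size
        out = np.asarray(logits_fn(context))
        last = out.reshape(-1, out.shape[-1])[-1]  # logits for the last position
        ids.append(sample_next(last, **sampling))
    return ids


# Hypothetical wiring to the session and tokenizer from the snippet above:
# logits_fn = lambda ctx: sess.run(
#     None, {"input_ids": np.array([ctx], dtype=np.int64)})[0]
# print(sp.decode(generate(logits_fn, ids)))
```

Lower temperatures and smaller `top_k` values make the output more conservative; `top_k=1` reduces to greedy decoding. The loop re-runs the full context each step, which is simple but does not reuse any key/value cache.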

## Limitations

This is a research artifact. It does autoregressive completion with temperature
and top-k sampling; it has no instruction tuning and no chat behavior. Give it
Latin and it continues in Latin. Results are best in the classical (Caesarian /
Ciceronian) register.

## License

CC-BY-SA-4.0. The underlying ancient texts are public domain by age; the
share-alike condition derives from corpus components (e.g. Perseus digital
editions). Attribution + share-alike apply to redistribution.