	---
	license: cc-by-sa-4.0
	language:
	- la
	library_name: onnx
	pipeline_tag: text-generation
	tags:
	- latin
	- gpt
	- from-scratch
	- onnx
	- classical-latin
	---

	# Cicero LLM

	A ~111M-parameter Latin language model trained from scratch: no pretrained
	backbone, no English/Greek base. It generates Classical Latin in the browser
	or anywhere ONNX runs.

	Live demo (browser inference): https://cicerollm.com

	## Model

	- Decoder-only transformer, ~111M params (12 layers × 12 heads × 768 dim,
	2048 block size, learned absolute positions, tied embeddings)
	- 32K SentencePiece-BPE tokenizer trained on the same Latin corpus
	- Trained from random init on a ~466M-token Latin corpus (30,000 steps,
	dropout 0.15), then **continued-pretrained on a targeted classical-grammar
	curriculum** (synthetic Cicero-register prose, generated and quality-filtered
	by a stronger model) mixed 30/70 with clean classical replay for 3,000 steps.
	The curriculum stage pushes generation toward the classical register and
	reduces the medieval/neo-Latin contamination and repetition seen in the base
	model.
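As a rough sanity check on the ~111M figure, the stated shape can be plugged into a standard parameter count. This is a minimal sketch that assumes a GPT-2-style block (fused QKV projection, 4x MLP, biases everywhere, two LayerNorms per block plus a final one) and a vocabulary of exactly 32,000; none of these details are confirmed by the model card, and `gpt_param_count` is a hypothetical helper name.

```python
def gpt_param_count(n_layer=12, d_model=768, vocab=32_000, block=2048, mlp_ratio=4):
    """Count parameters of a GPT-2-style decoder with tied input/output embeddings.

    Assumed layout (not confirmed by the model card): biases on every linear
    layer, a 4x MLP, and LayerNorm with weight + bias.
    """
    d_ff = mlp_ratio * d_model
    # Attention: fused QKV projection plus output projection, each with bias.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: up-projection and down-projection, each with bias.
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms per block, each with weight and bias.
    norms = 2 * 2 * d_model
    per_layer = attn + mlp + norms
    # Token embeddings plus learned absolute positions; the LM head is tied,
    # so it adds no parameters of its own.
    embeddings = vocab * d_model + block * d_model
    final_norm = 2 * d_model
    return n_layer * per_layer + embeddings + final_norm


total = gpt_param_count()
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the count comes out at roughly 111M, in line with the figure above; a different bias or vocabulary-size convention would shift it by well under 1%.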

	## Evaluation

	Cloze accuracy (4-option multiple choice, so chance is 0.25). The held-out
	"blind" pack is the fairest number for cross-model comparison:

	| pack / metric | value |
	|---|---|
	| held-out blind (144 items) | 0.72 |
	| literary diagnostic | 0.82 |
	| grammar-probe / weakness (60 items) | 0.82 |
	| in-distribution textbook | 0.77 |
	| bits-per-char (held-out, lower is better) | 1.56 |
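The model card does not describe how these numbers were computed. The sketch below shows one common way to get both kinds of metric: pick the cloze option with the highest model log-likelihood, and convert a summed negative log-likelihood (in nats) to bits per character. The item format, the `score` callback, and both function names are assumptions made for illustration, not the author's evaluation harness.

```python
import math


def cloze_accuracy(items, score):
    """Fraction of items where the top-scoring option is the gold answer.

    items: iterable of (context, options, answer_index) tuples (assumed format).
    score: callable (context, option) -> log-likelihood under the model
           (hypothetical; any higher-is-better score works).
    """
    items = list(items)
    correct = 0
    for context, options, answer_index in items:
        scores = [score(context, option) for option in options]
        predicted = scores.index(max(scores))
        correct += int(predicted == answer_index)
    return correct / len(items)


def bits_per_char(total_nll_nats, n_chars):
    """Convert a summed negative log-likelihood in nats to bits per character."""
    return total_nll_nats / (math.log(2) * n_chars)
```

With four options per item, a model that guesses uniformly would score about 0.25 on the accuracy rows.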

	## Files

	- `model.int8.onnx` — int8-quantized ONNX (~136 MB; used by the browser demo)
	- `model.onnx` — fp32 ONNX (~543 MB)
	- `checkpoint_step_033000.pt` — raw PyTorch weights + optimizer state (~1.3 GB)
	- `tokenizer.json`, `tokenizer.model`, `tokenizer_config.json` — SentencePiece 32K
	- `config.json` — architecture metadata
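The listed sizes are roughly what bytes-per-parameter arithmetic predicts. The sketch below is a back-of-envelope check, not a description of how the files were produced. It assumes about 111.2M unique parameters, a 32,000 × 768 token embedding that is stored a second time (untied) in the ONNX exports, about one byte per weight for int8, and an Adam-style optimizer holding two fp32 moment tensors per weight. All of these are guesses that happen to fit the numbers.

```python
# Assumed counts (not stated in the model card beyond "~111M" and "32K").
N_UNIQUE = 111_204_864      # unique parameters with tied embeddings
EMBED = 32_000 * 768        # token-embedding matrix, possibly duplicated on export

MB, GB = 1e6, 1e9           # decimal units

fp32_tied_mb = N_UNIQUE * 4 / MB                # weights only, embeddings shared
fp32_untied_mb = (N_UNIQUE + EMBED) * 4 / MB    # embedding stored twice
int8_untied_mb = (N_UNIQUE + EMBED) * 1 / MB    # ~1 byte per weight, ignoring scales
checkpoint_gb = N_UNIQUE * 4 * 3 / GB           # weights + two optimizer moments

print(f"fp32, tied:      {fp32_tied_mb:.0f} MB")
print(f"fp32, untied:    {fp32_untied_mb:.0f} MB")
print(f"int8, untied:    {int8_untied_mb:.0f} MB")
print(f"checkpoint est.: {checkpoint_gb:.2f} GB")
```

If the exports do duplicate the embedding matrix, that would account for `model.onnx` being larger than 4 bytes × 111M; quantization scales and graph metadata add a little on top in practice.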

	## Usage (ONNX Runtime)

	```python
	import numpy as np
	import onnxruntime as ort
	import sentencepiece as spm

	sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
	sess = ort.InferenceSession("model.int8.onnx")
	# Encode the prompt to token ids; the model expects a batch dimension.
	ids = sp.encode("Gallia est omnis divisa", out_type=int)
	# forward returns next-token logits at the last position; sample autoregressively
	logits = sess.run(None, {"input_ids": np.array([ids], dtype=np.int64)})[0]
	```
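The snippet above stops at a single forward pass. A minimal sketch of the autoregressive loop with temperature + top-k sampling (the scheme named under Limitations) might look like the following. The helper names, the defaults `k=40` and `temperature=0.8`, and the handling of the logits shape are illustrative assumptions, not the settings used by the demo.

```python
import numpy as np


def sample_top_k(logits, k=40, temperature=0.8, rng=None):
    """Sample one token id from the k highest-scoring entries of a 1-D logits vector."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    k = min(k, scaled.shape[-1])
    top = np.argpartition(scaled, -k)[-k:]   # indices of the k largest logits
    z = scaled[top] - scaled[top].max()      # subtract the max for a stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(top, p=probs))


def generate(next_logits, ids, max_new_tokens=50, block_size=2048, **sampling):
    """Extend `ids` one token at a time.

    next_logits: callable taking a list of token ids and returning logits whose
    last axis is the vocabulary; both (1, vocab) and (1, seq, vocab) are accepted.
    """
    ids = list(ids)
    for _ in range(max_new_tokens):
        out = np.asarray(next_logits(ids[-block_size:]))  # stay within the context window
        last = out.reshape(-1, out.shape[-1])[-1]         # logits for the final position
        ids.append(sample_top_k(last, **sampling))
    return ids
```

With `sess`, `sp`, and `ids` from the snippet above, usage would be along the lines of `next_logits = lambda t: sess.run(None, {"input_ids": np.array([t], dtype=np.int64)})[0]` followed by `print(sp.decode(generate(next_logits, ids)))`. This re-runs the full prefix on every step, which is simple but slow; whether the exported graph supports a key/value cache is not stated.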

	## Limitations

	Research artifact. Autoregressive completion with temperature + top-k sampling;
	no instruction tuning, no chat behavior. Give it Latin and it continues in
	Latin. Best results in classical (Caesarian / Ciceronian) register.

	## License

	CC-BY-SA-4.0. The underlying ancient texts are public domain by age; the
	share-alike condition derives from corpus components (e.g. Perseus digital
	editions). Attribution + share-alike apply to redistribution.