Voynich KSimplex Translator

A geometric deep learning system for analyzing and interpreting the Voynich Manuscript using KSimplex similarity assessment trained on Latin Wikipedia.

Model Description

This system combines:

  • KSimplex Similarity Assessor: A novel geometric architecture using simplex-based routing for similarity computation
  • Dual Embedding Fusion: Combines SBERT semantic embeddings (384-dim) with character-level TF-IDF (30k features)
  • Cross-Corpus Transfer: Trained on Latin Wikipedia, applied to Voynich manuscript analysis
  • Morphological Translator: Rule-based translation using discovered prefix-stem-suffix patterns

Architecture

Input Text
    β”‚
    β”œβ”€β”€β–Ί SBERT (all-MiniLM-L6-v2) ──► 384-dim
    β”‚                                    β”‚
    β”œβ”€β”€β–Ί Char TF-IDF (3-5 grams) ──► 30k-dim
    β”‚                                    β”‚
    β–Ό                                    β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         KSimplex Similarity Assessor        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  SBERT Projection ──► 256-dim               β”‚
β”‚  TF-IDF Projection ──► 256-dim              β”‚
β”‚            β”‚                                β”‚
β”‚            β–Ό                                β”‚
β”‚       Fusion Layer ──► 256-dim              β”‚
β”‚            β”‚                                β”‚
β”‚            β–Ό                                β”‚
β”‚   SimplexSimilarityLayer Γ— 3 (k=4)          β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”           β”‚
β”‚   β”‚  Route Projection (β†’4 edges)β”‚           β”‚
β”‚   β”‚  Edge Transforms (4Γ—Linear) β”‚           β”‚
β”‚   β”‚  Weighted Sum + LayerNorm   β”‚           β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜           β”‚
β”‚            β”‚                                β”‚
β”‚            β–Ό                                β”‚
β”‚      Similarity Head ──► 128-dim            β”‚
β”‚      (L2 normalized)                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    β”‚
    β–Ό
128-dim Similarity Embedding
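
Read top to bottom, the diagram corresponds to a small dual-input encoder. Below is a minimal PyTorch sketch of the architecture as described; class names, the auxiliary bucket head, and the n_buckets default are illustrative assumptions, not the repository's actual code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplexSimilarityLayer(nn.Module):
    """One simplex layer: route the input to k edge transforms,
    combine with the routing weights, and normalize."""
    def __init__(self, dim=256, k=4):
        super().__init__()
        self.router = nn.Linear(dim, k)                                    # route projection (-> k edges)
        self.edges = nn.ModuleList(nn.Linear(dim, dim) for _ in range(k))  # edge transforms (k x Linear)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        weights = F.softmax(self.router(x), dim=-1)                  # (B, k) routing weights
        edge_out = torch.stack([e(x) for e in self.edges], dim=1)    # (B, k, dim)
        return self.norm((weights.unsqueeze(-1) * edge_out).sum(1))  # weighted sum + LayerNorm

class KSimplexSimilarityAssessor(nn.Module):
    def __init__(self, sbert_dim=384, tfidf_dim=30_000, dim=256, out_dim=128,
                 k=4, n_layers=3, n_buckets=16):  # n_buckets is illustrative
        super().__init__()
        self.sbert_proj = nn.Linear(sbert_dim, dim)   # SBERT projection -> 256
        self.tfidf_proj = nn.Linear(tfidf_dim, dim)   # TF-IDF projection -> 256
        self.fusion = nn.Linear(2 * dim, dim)         # fusion layer -> 256
        self.layers = nn.ModuleList(SimplexSimilarityLayer(dim, k) for _ in range(n_layers))
        self.head = nn.Linear(dim, out_dim)           # similarity head -> 128
        self.bucket_head = nn.Linear(dim, n_buckets)  # auxiliary bucket classifier

    def forward(self, sbert_emb, tfidf_emb):
        h = self.fusion(torch.cat([self.sbert_proj(sbert_emb),
                                   self.tfidf_proj(tfidf_emb)], dim=-1))
        for layer in self.layers:
            h = layer(h)
        return F.normalize(self.head(h), dim=-1), self.bucket_head(h)  # 128-dim, L2 normalized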

Training

  • Corpus: Latin Wikipedia (2000 documents, ~1.09M tokens)
  • Windows: 200 tokens, stride 100
  • Method: Contrastive learning with bucket classification auxiliary task
  • Buckets: Louvain community detection on blended similarity graph (SBERT Ξ±=0.6 + TF-IDF)
  • Loss: Contrastive margin loss + Cross-entropy bucket classification
  • Performance: 100% bucket classification accuracy; mean positive-pair similarity 0.99 vs. 0.25 for negative pairs
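
Under those settings the training objective reduces to a few pieces. A hedged sketch follows; the margin value, the 0.4 TF-IDF weight implied by α=0.6, and all function names are assumptions, not the repository's code:

import torch
import torch.nn.functional as F

def make_windows(tokens, size=200, stride=100):
    """Overlapping training windows (200 tokens, stride 100)."""
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - size, 0) + 1, stride)]

def blended_similarity(S_sbert, S_tfidf, alpha=0.6):
    """Blended similarity graph fed to Louvain community detection for buckets."""
    return alpha * S_sbert + (1 - alpha) * S_tfidf

def training_loss(emb_a, emb_b, same_bucket, bucket_logits, bucket_targets, margin=0.5):
    """Contrastive margin loss on cosine similarity plus the auxiliary
    bucket classification cross-entropy."""
    sim = F.cosine_similarity(emb_a, emb_b)
    contrastive = torch.where(same_bucket,
                              1.0 - sim,             # pull same-bucket pairs together
                              F.relu(sim - margin))  # push cross-bucket pairs below the margin
    return contrastive.mean() + F.cross_entropy(bucket_logits, bucket_targets)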

Key Findings

Manuscript Structure

Section         Folios     Character                        Style Group
Herbal A        f1-f57     Dense prose, plant descriptions  A
Herbal B        f58-f66    Variant herbal style             A
Astronomical    f67-f73    Zodiac, celestial diagrams       A
Biological      f75-f84    Nymph figures, labels            B
Cosmological    f85-f86    Rosette foldouts                 C
Pharmaceutical  f87-f102   Recipe format (p...am)           C
Recipes         f103-f116  Cross-references, star labels    B

Discovered Patterns

Structural Markers (Greek-derived):

  • p = Recipe/paragraph start (Ο€)
  • m, g = Line-end markers (ΞΌ, Ξ³)
  • s, l, o = Label markers (Οƒ, Ξ», ΞΏ)
  • -am, -dam, -ram = Recipe terminators (measurement)

Morphological System:

(PREFIX) + STEM + (SUFFIX) + (n)

Prefixes: qok- (the-), ok- (this-), ot- (other-), da- (of-)
Suffixes: -dy (matter), -ey (type), -in (of), -ol (liquid), -ar (part)
Bound 'n': Attaches to -ai- stems (daiin, qokaiin, okaiin)
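
A rule-based segmenter over this affix inventory might look like the following. This is a sketch only: the greedy longest-match order and the bound-'n' test are my assumptions about how the translator applies these rules, not its actual implementation.

PREFIXES = sorted(["qok", "ok", "ot", "da"], key=len, reverse=True)
SUFFIXES = sorted(["dy", "ey", "in", "ol", "ar"], key=len, reverse=True)

def segment(word):
    """Greedy (PREFIX) + STEM + (SUFFIX) + (n) split over the affix
    inventory above; longest affixes are tried first."""
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    bound_n = word.endswith("n") and "ai" in word   # bound 'n' on -ai- stems
    if bound_n:
        rest = rest[:-1]
    suffix = next((s for s in SUFFIXES if rest.endswith(s) and len(rest) > len(s)), "")
    stem = rest[:len(rest) - len(suffix)] if suffix else rest
    return {"prefix": prefix, "stem": stem, "suffix": suffix, "bound_n": bound_n}

segment("chedy")    # {'prefix': '', 'stem': 'che', 'suffix': 'dy', 'bound_n': False}
segment("qokaiin")  # {'prefix': 'qok', 'stem': 'aii', 'suffix': '', 'bound_n': True}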

Section Similarity Matrix:

              Herbal_A  Astro  Bio   Cosmo  Pharma  Recipe
Herbal A         1.00   0.99  0.93   0.77   0.77    0.88
Astronomical     0.99   1.00  0.96   0.83   0.82    0.92
Biological       0.93   0.96  1.00   0.94   0.94    0.99
Cosmological     0.77   0.83  0.94   1.00   0.98    0.97
Pharmaceutical   0.77   0.82  0.94   0.98   1.00    0.95
Recipes          0.88   0.92  0.99   0.97   0.95    1.00

Three Style Groups:

  1. Group A: Herbal + Astronomical (0.988 similarity)
  2. Group B: Biological + Recipes (0.991 similarity)
  3. Group C: Cosmological + Pharmaceutical (0.975 similarity)
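
The matrix above can be reproduced from the 128-dim window embeddings by mean-pooling per section and comparing centroids; a sketch, assuming this is how the matrix was computed:

import numpy as np

def section_similarity(emb, section_ids):
    """Cosine similarity between mean section embeddings.
    emb: (N, 128) window embeddings; section_ids: section label per window."""
    sections = sorted(set(section_ids))
    ids = np.asarray(section_ids)
    centroids = np.stack([emb[ids == s].mean(axis=0) for s in sections])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    return sections, centroids @ centroids.T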

Usage

Installation

pip install torch sentence-transformers scikit-learn datasets

Quick Start

from voynich_translator import VoynichTranslator

translator = VoynichTranslator()

# Translate text
result = translator.translate("daiin chedy qokeey shedy chol daiin")
print(result['english'])  # "the herb bloom leaf stem the"
print(result['section'])  # "Herbal A"
print(result['confidence'])  # 1.0

# Translate with verbose analysis
result = translator.translate("p ol shy am", verbose=True)
# Returns word-by-word analysis and similar Latin passages

# Translate entire folio
folio = translator.translate_folio('f75r')
print(folio['full_english'])

# Find similar passages
similar = translator.find_similar_voynich("chedy qokeey")
latin = translator.find_similar_latin("chedy qokeey")

Full Standalone Setup

# Requires: Latin Wikipedia reload for TF-IDF vocabulary
# See standalone cell in repository for complete setup

from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer

# 1. Load Latin corpus (same as training)
ds = load_dataset("wikimedia/wikipedia", "20231101.la", split="train", streaming=True)
# ... build 200-token windows (stride 100), then fit the character TF-IDF
vec_lat = TfidfVectorizer(analyzer="char", ngram_range=(3, 5), max_features=30_000)
# vec_lat.fit(latin_windows)

# 2. Transform Voynich using the Latin vectorizer
X_voy_tfidf = vec_lat.transform(voynich_texts)

# 3. Encode through the KSimplex model (returns embedding and bucket logits)
emb, _ = model(sbert_emb, tfidf_emb)

Lexicon

Core vocabulary mappings based on frequency and morphological analysis:

Voynich   English             Category
daiin     the                 Determiner
aiin      this                Determiner
qokaiin   the-said            Determiner
chedy     herb                Plant
shedy     leaf                Plant
qokeedy   blossom             Plant
chol      stem                Plant
ol        oil                 Preparation
ar        root                Plant part
or        seed                Plant part
p         ¶ (recipe start)    Marker
am        ⚗ (measure end)     Marker

Files

voynich-ksimplex-translator/
β”œβ”€β”€ README.md                      # This file
β”œβ”€β”€ voynich_translator.py          # Complete standalone translator
β”œβ”€β”€ ksimplex_model.py              # Model architecture
β”œβ”€β”€ ksimplex_similarity_model.pt   # Trained weights
β”œβ”€β”€ similarity_embeddings.npz      # Pre-computed embeddings
β”‚   β”œβ”€β”€ voynich_emb                # (N_voy, 128) Voynich embeddings
β”‚   β”œβ”€β”€ voynich_labels             # Cluster assignments
β”‚   β”œβ”€β”€ latin_emb                  # (N_lat, 128) Latin embeddings
β”‚   └── latin_labels               # Latin bucket assignments
└── voynich_analysis_results.json  # Statistical analysis
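
The pre-computed embeddings load directly with NumPy; since they are L2-normalized, cosine similarity is a plain dot product:

import numpy as np

data = np.load("similarity_embeddings.npz")
voy = data["voynich_emb"]          # (N_voy, 128)
lat = data["latin_emb"]            # (N_lat, 128)
sims = voy @ lat.T                 # cosine similarity (embeddings are L2-normalized)
best_latin = sims.argmax(axis=1)   # nearest Latin window per Voynich window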

Limitations

⚠️ This is interpretive translation, not decipherment.

The Voynich cipher has not been broken. This system provides:

  • βœ… Structural analysis (recipe patterns, labels, cross-references)
  • βœ… Section classification with high accuracy
  • βœ… Morphological interpretation of word patterns
  • βœ… Similarity-based retrieval across Latin/Voynich corpora
  • ❌ True plaintext recovery
  • ❌ Verified word meanings

The lexicon is based on:

  1. Statistical frequency matching to Latin
  2. Positional grammar analysis
  3. Morphological pattern recognition
  4. Section-context inference

Citation

@software{voynich_ksimplex_2026,
  title={Voynich KSimplex Translator: Geometric Deep Learning for Manuscript Analysis},
  author={AbstractPhil},
  year={2026},
  url={https://huggingface.co/AbstractPhil/sbert-voynich-translation}
}

License

MIT License - See LICENSE file for details.


"The Voynich appears to be a practical document (recipes, medical prescriptions) using Greek-derived notation for structure, with verbose cipher encoding the content, and a cross-reference system linking sections."
